Edward

observations.wikitext103

wikitext103(
    path,
    raw=False
)

Load the Wikitext-103 dataset (Merity, Xiong, Bradbury, & Socher, 2016). The dataset consists of Wikipedia articles that meet the Good or Featured article criteria and has a vocabulary of 267,735 words. There are 103,227,021 training, 217,646 validation, and 245,569 test tokens.

Args:

path: str. Path to the directory that stores the data. If the files are not found there, they are downloaded and extracted into it. The extracted directory is wikitext-103/.
raw: bool, optional. Whether to load the raw data, in which no tokens are replaced with <unk> and newlines are not replaced with <eos>.

Returns:

Tuple of str x_train, x_valid, x_test.
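
As a rough usage illustration, the sketch below loads the three splits and counts their tokens. It assumes the observations package is installed and that tokens are separated by whitespace; the helper names count_tokens and summarize_wikitext103 and the default path ~/data are hypothetical and not part of the library. The counts it produces may differ from the figures quoted above, depending on how end-of-line markers are tallied.

```python
import os


def count_tokens(text):
    """Count whitespace-separated tokens in one split (assumed tokenization)."""
    return len(text.split())


def summarize_wikitext103(path="~/data"):
    """Load Wikitext-103 and return a token count per split.

    The default path is a hypothetical location. Calling this function
    downloads and extracts the archive if it is not already present.
    """
    # Imported lazily so the helper above can be used without the package.
    from observations import wikitext103

    x_train, x_valid, x_test = wikitext103(os.path.expanduser(path))
    return {
        "train": count_tokens(x_train),
        "valid": count_tokens(x_valid),
        "test": count_tokens(x_test),
    }
```

Calling summarize_wikitext103() returns a dict keyed by "train", "valid", and "test". Since each split is returned as a single str, the training split is large and is held entirely in memory.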

Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.