observations.text8

text8(path)

Load the text8 dataset (Mahoney, 2011). The dataset is preprocessed text containing 100 million characters drawn from a vocabulary of 27 characters (the lowercase letters a to z and the space).

Args:

  • path: str. Path to the directory that stores the file. If the file is not found there, it is downloaded and extracted into this directory. The filename is text8.

Returns:

Tuple of three str values, in this order: x_train, x_test, x_valid.
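
A minimal usage sketch follows, assuming the `observations` package is installed and the machine has network access on first use. The helper names `char_vocabulary`, `summarize_split`, and `load_text8_summary` are hypothetical and exist only for illustration; only `text8(path)` and its return order come from the documentation above.

```python
def char_vocabulary(text):
    """Return the sorted list of distinct characters in `text`."""
    return sorted(set(text))


def summarize_split(name, text):
    """Return a one-line summary of a split (hypothetical helper)."""
    vocab = char_vocabulary(text)
    return "{}: {} characters, vocabulary size {}".format(
        name, len(text), len(vocab))


def load_text8_summary(path):
    """Load text8 from `path` and summarize each returned string.

    Assumes the `observations` package is available; the import is done
    lazily so the helpers above can be used without it.
    """
    from observations import text8

    # Return order as documented: train, then test, then valid.
    x_train, x_test, x_valid = text8(path)
    return [
        summarize_split("train", x_train),
        summarize_split("test", x_test),
        summarize_split("valid", x_valid),
    ]
```

Calling `load_text8_summary` with a directory of your choice triggers the download only if the text8 file is not already present there. Note that the test split comes before the validation split in the returned tuple, so unpacking in the order train, valid, test would silently swap the two.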

Mahoney, M. (2011). Large text compression benchmark. Retrieved from http://mattmahoney.net/dc/text.html