observations.enwik8

enwik8(path)

Load the enwik8 dataset from the Hutter Prize (Hutter, 2012). The dataset is preprocessed, contains 100 million characters, and has a vocabulary of 205 characters.

Args:

  • path: str. Path to the directory that stores the file. If the file is not found there, it is downloaded and extracted into this directory. The filename is enwik8.

Returns:

Tuple of str: x_train, x_test, x_valid.
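
A minimal usage sketch, assuming the observations package is installed and importable as shown. The directory `~/data` and the helpers `build_vocab`, `encode`, and `load_and_encode` are hypothetical names introduced for illustration; they are not part of the library. The sketch loads the three splits and maps each distinct symbol to an integer id, which is a common first step before character-level modeling.

```python
def build_vocab(text):
    """Map each distinct symbol in `text` to an integer id (sorted order)."""
    symbols = sorted(set(text))
    return {symbol: index for index, symbol in enumerate(symbols)}


def encode(text, vocab):
    """Convert `text` into a list of integer ids using `vocab`."""
    return [vocab[symbol] for symbol in text]


def load_and_encode(path="~/data"):
    """Load enwik8 and encode every split (hypothetical helper).

    Calling this may download and extract the dataset into `path`.
    """
    # Imported here so the helpers above work without the package installed.
    from observations import enwik8

    x_train, x_test, x_valid = enwik8(path)
    # Build the vocabulary over all splits so every symbol has an id.
    vocab = build_vocab(x_train + x_test + x_valid)
    return (encode(x_train, vocab),
            encode(x_test, vocab),
            encode(x_valid, vocab),
            vocab)


if __name__ == "__main__":
    # Small stand-in string so the helpers can be tried without a download.
    sample = "abca"
    vocab = build_vocab(sample)
    print(vocab)
    print(encode("cab", vocab))
```

Building the vocabulary over all three splits avoids unknown symbols at evaluation time. On the full dataset, `len(vocab)` can be compared against the vocabulary size stated above.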

Hutter, M. (2012). The human knowledge compression contest. Retrieved from http://prize.hutter1.net