observations.ptb

ptb(path)

Load the Penn Treebank dataset (Marcus, Marcinkiewicz, & Santorini, 1993). The dataset is preprocessed and has a vocabulary of 10,000 words, including the end-of-sentence marker and a special symbol (<unk>) for rare words. There are 929,589 training words, 73,760 validation words, and 82,430 test words.

Args:

  • path: str. Path to the directory that stores the file. If the file is not already there, it is downloaded and extracted into this directory. Filename is simple-examples/.

Returns:

Tuple of str: x_train, x_test, x_valid.
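
Because each returned split is a plain string, a typical next step is to map tokens to integer ids. The following is a minimal sketch of that step, not part of the library: build_vocab and encode are hypothetical helpers, the "~/data" path in the comments is a placeholder, and the sketch assumes tokens are separated by whitespace. It runs on a short stand-in string so that it does not trigger a download.

```python
from collections import Counter

# Loading the real data would look like this (assumed path, triggers a download):
#   from observations import ptb
#   x_train, x_test, x_valid = ptb("~/data")


def build_vocab(text):
    """Hypothetical helper: map each whitespace-separated token to an integer id."""
    counts = Counter(text.split())
    # Most frequent tokens first; ties broken alphabetically for a stable order.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered)}


def encode(text, vocab):
    """Hypothetical helper: convert a string into a list of token ids."""
    return [vocab[w] for w in text.split()]


# Stand-in for x_train so the sketch runs without downloading anything.
sample = "the cat sat on the mat <unk> the end"
vocab = build_vocab(sample)
ids = encode(sample, vocab)
print(len(vocab), ids)
```

With the real data, the vocabulary would be built from x_train only and then reused to encode x_valid and x_test, so that all three splits share one id mapping.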

Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.