observations.ptb
小牛编辑
2023-12-01
ptb(path)
Load the Penn Treebank data set (Marcus, Marcinkiewicz, & Santorini, 1993). The dataset is preprocessed and has a vocabulary of 10,000 words, including the end-of-sentence marker and a special symbol (<unk>) for rare words. There are 929,589 training words, 73,760 validation words, and 82,430 test words.
Args:
path: str. Path to a directory that either already stores the file or into which the file will be downloaded and extracted. Filename is simple-examples/.
Returns:
Tuple of str x_train, x_test, x_valid.
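Example:
A minimal usage sketch, not part of the original documentation. It assumes the `observations` package is installed, that each returned string is whitespace-separated text, and that `~/data` is an acceptable download directory. The helpers `summarize` and `load_and_summarize` are hypothetical names introduced here for illustration only.

```python
from collections import Counter


def summarize(text):
    """Return (token_count, vocabulary_size) for a whitespace-tokenized string."""
    tokens = text.split()
    return len(tokens), len(Counter(tokens))


def load_and_summarize(path):
    """Load the three splits with ptb(path) and summarize each one.

    Assumption: the `observations` package is installed; the first call
    may download and extract the data into `path`.
    """
    from observations import ptb  # imported lazily so the helper above works offline

    x_train, x_test, x_valid = ptb(path)
    splits = [("train", x_train), ("test", x_test), ("valid", x_valid)]
    return {name: summarize(split) for name, split in splits}


# Made-up sample text in the same spirit as the corpus (rare words as <unk>).
sample = "the cat sat on the <unk>\nthe end\n"
token_count, vocab_size = summarize(sample)
print(token_count, vocab_size)

# To run against the real data (requires the package and possibly network access):
# stats = load_and_summarize("~/data")
```

Counts computed this way need not match the figures quoted above, since they depend on whether end-of-sentence markers are counted as words.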
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.