NLTK的高效术语文档矩阵

宗政楚

2023-03-14

问题内容：

我正在尝试使用NLTK和熊猫创建术语文档矩阵。我写了以下函数：

def fnDTM_Corpus(xCorpus):
    import pandas as pd
    '''to create a Term Document Matrix from a NLTK Corpus'''
    fd_list = []
    for x in range(0, len(xCorpus.fileids())):
        fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
    DTM = pd.DataFrame(fd_list, index = xCorpus.fileids())
    DTM.fillna(0,inplace = True)
    return DTM.T

运行它

import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'

newcorpus = PlaintextCorpusReader(corpus_root, '.*')

x = fnDTM_Corpus(newcorpus)

它适用于语料库中的一些小文件，但是当我尝试使用4,000个文件（每个约2 kb）的语料库运行它时，出现 MemoryError 。

我想念什么吗？

我正在使用32位python。（在Windows 7、64位OS，Core Quad CPU，8 GB
RAM上）。我真的需要对这种大小的语料库使用64位吗？

问题答案：

感谢Radim和Larsmans。我的目标是要拥有一个与您在R tm中获得的DTM类似的DTM。我决定使用scikit-
learn，部分受此博客文章的启发。这是我想出的代码。

我将其发布在这里，希望其他人会发现它有用。

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def fn_tdm_df(docs, xColNames = None, **kwargs):
    ''' create a term document matrix as pandas DataFrame
    with **kwargs you can pass arguments of CountVectorizer
    if xColNames is given the dataframe gets columns Names'''

    #initialize the  vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    #create dataFrame
    df = pd.DataFrame(x1.toarray().transpose(), index = vectorizer.get_feature_names())
    if xColNames is not None:
        df.columns = xColNames

    return df

在目录中的文本列表上使用它

DIR = 'C:/Data/'

def fn_CorpusFromDIR(xDIR):
    ''' functions to create corpus from a Directories
    Input: Directory
    Output: A dictionary with 
             Names of files ['ColNames']
             the text in corpus ['docs']'''
    import os
    Res = dict(docs = [open(os.path.join(xDIR,f)).read() for f in os.listdir(xDIR)],
               ColNames = map(lambda x: 'P_' + x[0:6], os.listdir(xDIR)))
    return Res

d1 = fn_tdm_df(docs = fn_CorpusFromDIR(DIR)['docs'],
          xColNames = fn_CorpusFromDIR(DIR)['ColNames'], 
          stop_words=None, charset_error = 'replace')

NLTK的高效术语文档矩阵

相关阅读

相关文章

相关问答

相关工具

相关文档