How to apply pos_tag_sents() to a pandas dataframe efficiently

王渊

2023-03-14

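Question:

In situations where you wish to POS tag a column of text stored in a pandas dataframe, with one sentence per row, the majority of implementations on SO use the apply method: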

dfData['POSTags'] = dfData['SourceText'].apply(
                 lambda row: pos_tag(word_tokenize(row)))

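The NLTK documentation recommends using pos_tag_sents() for efficient tagging of more than one sentence.

Does that apply to this example, and if so, is the code as simple as changing pos_tag to pos_tag_sents, or does NLTK mean text sources made up of paragraphs?

As mentioned in the comments, pos_tag_sents() aims to avoid loading the perceptron tagger on every call, but the issue is how to do this and still produce a column in a pandas dataframe?

Link to sample dataset (20k rows)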

Answer:

Input

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

TL;DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response   
1   2           New Credit  no response   
2   3  Collect Information     response   
3   4  Collect Information     response   
4   5  Collect Information     response

                                                Text  \
0  cozily married practical athletics Mr. Brown flat   
1     active married expensive soccer Mr. Chang flat   
2  healthy single expensive badminton Mrs. Green ...   
3  cozily married practical soccer Mr. Brown hier...   
4   cozily single practical badminton Mr. Brown flat

                                                 POS  
0  [(cozily, RB), (married, JJ), (practical, JJ),...  
1  [(active, JJ), (married, VBD), (expensive, JJ)...  
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...  
3  [(cozily, RB), (married, JJ), (practical, JJ),...  
4  [(cozily, RB), (single, JJ), (practical, JJ), ...

In long:

First, you can extract the Text column into a list of strings:

texts = df['Text'].tolist()

Then you can apply the word_tokenize function to each string:

map(word_tokenize, texts)
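One detail to keep in mind on Python 3: map returns a lazy iterator, not a list, so the tokenized result can only be consumed once. The short sketch below shows this, using str.split as a stand-in for word_tokenize (an assumption made only so the snippet runs without NLTK data):

```python
texts = ["cozily married practical", "active married expensive"]

# str.split stands in for word_tokenize so no NLTK data is needed here.
tokens_iter = map(str.split, texts)

first_pass = list(tokens_iter)   # consumes the iterator
second_pass = list(tokens_iter)  # the iterator is now exhausted

# A list comprehension gives a real list that can be reused.
tokens_list = [text.split() for text in texts]

print(first_pass)
print(second_pass)
```

Passing the iterator straight into a single call that loops over it once is fine; materialize it as a list only if you need the tokens again.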

Note that @Boud's suggestion is almost the same, using df.apply:

df['Text'].apply(word_tokenize)

Then you dump the tokenized text into a list of lists of strings:

df['Text'].apply(word_tokenize).tolist()

Then you can use pos_tag_sents:

pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
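To see why one batched call can be cheaper than tagging row by row, here is a toy sketch of the "load once, reuse for every sentence" idea the question describes. ToyTagger, toy_pos_tag and toy_pos_tag_sents are hypothetical stand-ins with a dummy tagging rule; this is not NLTK's actual implementation, only an illustration of the pattern:

```python
class ToyTagger:
    """Hypothetical stand-in for a tagger whose model is costly to load."""
    loads = 0  # counts how many times a "model" was loaded

    def __init__(self):
        ToyTagger.loads += 1  # pretend this is an expensive model load

    def tag(self, tokens):
        # Dummy rule: capitalized tokens are "NNP", everything else "NN".
        return [(tok, "NNP" if tok[:1].isupper() else "NN") for tok in tokens]


def toy_pos_tag(tokens):
    """Per-sentence style: builds a new tagger on every call."""
    return ToyTagger().tag(tokens)


def toy_pos_tag_sents(sentences):
    """Batch style: builds one tagger and reuses it for every sentence."""
    tagger = ToyTagger()
    return [tagger.tag(sent) for sent in sentences]


sentences = [["Mr.", "Brown", "flat"], ["soccer", "Mr.", "Chang"], ["badminton"]]

ToyTagger.loads = 0
per_row = [toy_pos_tag(sent) for sent in sentences]
per_row_loads = ToyTagger.loads

ToyTagger.loads = 0
batched = toy_pos_tag_sents(sentences)
batched_loads = ToyTagger.loads

print(per_row_loads, batched_loads)
```

Both styles produce the same tags; the difference is only in how many times the tagger is constructed, which is what matters once the column has thousands of rows.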

Then you add the column back into the DataFrame:

df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
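If you want the steps wrapped up for reuse, something along these lines might work. The name add_pos_column and its parameters are my own invention for this sketch, not part of NLTK or pandas. It assumes non-string cells (for example NaN from an empty CSV field) should become an empty token list, and it lets you pass stand-in functions so the demo at the bottom runs without NLTK installed:

```python
def add_pos_column(df, text_col="Text", out_col="POS", tokenize=None, tag_sents=None):
    """Tokenize every row of df[text_col], tag all rows in one batch,
    and store the result in df[out_col].

    Hypothetical helper: df may be a pandas DataFrame or any mapping of
    column name -> sequence of cell values.
    """
    if tokenize is None or tag_sents is None:
        # Imported lazily so the helper can be tried out with stand-ins.
        from nltk import word_tokenize, pos_tag_sents
        tokenize = tokenize or word_tokenize
        tag_sents = tag_sents or pos_tag_sents

    # Assumption: non-string cells become an empty token list, so the
    # output still has exactly one entry per row.
    tokenized = [tokenize(t) if isinstance(t, str) else [] for t in df[text_col]]

    df[out_col] = tag_sents(tokenized)  # one batched call for the whole column
    return df


# Demo with stand-ins: a dict of lists instead of a DataFrame, str.split
# instead of word_tokenize, and a dummy tagger that labels everything "X".
def fake_tag_sents(sentences):
    return [[(tok, "X") for tok in sent] for sent in sentences]


table = {"Text": ["cozily married", None, "soccer"]}
result = add_pos_column(table, tokenize=str.split, tag_sents=fake_tag_sents)
print(result["POS"])
```

With pandas and the NLTK tokenizer and tagger data installed, the intended call would be add_pos_column(df) on the DataFrame read from test.csv, leaving the tokenize and tag_sents arguments at their defaults.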
