Improving the extraction of human names with nltk

周楷

2023-03-14

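Question:

I am trying to extract human names from text.

Does anyone have a method that they recommend?

This is what I tried (code below): I use nltk to find everything tagged as a person, then build a list of all the NNP parts of that person. I skip persons that have only one NNP, which avoids grabbing a lone surname.

I am getting decent results, but I wonder whether there is a better way to solve this problem.

Code: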

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)
    person_list = []
    # NLTK 3 exposes the chunk type via label(); older releases used the .node attribute
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        person = [leaf[0] for leaf in subtree.leaves()]
        if len(person) > 1:  # avoid grabbing lone surnames
            name = ' '.join(person)
            if name not in person_list:  # keep each name only once
                person_list.append(name)

    return person_list

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print("LAST, FIRST")
for name in names:
    parsed = HumanName(name)  # split the full name into its parts
    print(parsed.last + ', ' + parsed.first)

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

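Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic is not a human name in the context of this article is the hard (maybe impossible) part.

Answer:

I have to agree with the suggestion that "make my code better" is not well suited to this site, but I can give you some directions to try.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you must download some core files. There is a script that can do all of that for you.

I wrote this script: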

import nltk
# Newer NLTK releases renamed this class to StanfordNERTagger
from nltk.tag.stanford import NERTagger

# Paths to the downloaded Stanford NER model and jar
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] == 'PERSON':
            print(tag)

and got output that is not too bad:

('Francois', 'PERSON') ('R.', 'PERSON') ('Velde', 'PERSON') ('Richard', 'PERSON') ('Branson', 'PERSON') ('Virgin', 'PERSON')
('Galactic', 'PERSON') ('Bitcoin', 'PERSON') ('Bitcoin', 'PERSON') ('Paul', 'PERSON') ('Krugman', 'PERSON')
('Larry', 'PERSON') ('Summers', 'PERSON') ('Bitcoin', 'PERSON') ('Nick', 'PERSON') ('Colas', 'PERSON')
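
The tagger output above is one (token, tag) pair per word, so a full name such as "Paul Krugman" arrives as two separate hits. One possible way to stitch them back together, sketched here under the assumption that adjacent PERSON tokens belong to the same name, is to group consecutive runs. The helper name `group_person_tokens` and the sample data are hypothetical and not part of the original answer; the sample is hand-written in the same (token, tag) shape that `st.tag()` returns, so the sketch runs without Stanford NER installed.

```python
from itertools import groupby

def group_person_tokens(tagged_tokens):
    """Join runs of adjacent PERSON-tagged tokens into full names."""
    names = []
    # groupby splits the sequence into consecutive runs sharing the same key
    for is_person, run in groupby(tagged_tokens, key=lambda pair: pair[1] == 'PERSON'):
        if is_person:
            names.append(' '.join(token for token, _ in run))
    return names

# Hand-written sample data (an assumption, not real tagger output)
sample = [('Economist', 'O'), ('Paul', 'PERSON'), ('Krugman', 'PERSON'),
          ('and', 'O'), ('Nick', 'PERSON'), ('Colas', 'PERSON'),
          ('spoke', 'O'), ('.', 'O')]
print(group_person_tokens(sample))  # ['Paul Krugman', 'Nick Colas']
```

Dropping single-token results afterwards, as the question's code does, would filter out the lone 'Bitcoin' hits in the output above, although 'Virgin Galactic' would still get through.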

Hope this helps.
