Using PunktSentenceTokenizer in NLTK

有骏祥

2023-03-14

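Question:

I am learning natural language processing with NLTK. I came across some code that uses PunktSentenceTokenizer, and I cannot work out what it is actually for. The code is: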

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text) #A

tokenized = custom_sent_tokenizer.tokenize(sample_text)   #B

def process_content():
    try:
        # POS-tag the first five sentences of the sample text
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

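So why do we use PunktSentenceTokenizer, and what happens in the lines marked A and B? There is a training text and a sample text, but why are two data sets needed to get the part-of-speech tags?

What I cannot understand are the lines marked A and B.

PS: I did try reading the NLTK book, but could not understand what PunktSentenceTokenizer is really for.

Answer: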

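PunktSentenceTokenizer is the class behind NLTK's default sentence tokenizer, i.e. sent_tokenize(). It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79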

Given a paragraph with multiple sentences, e.g.:

>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '

You can use sent_tokenize():

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(train_text[11])
['Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', 'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print(sent)
...     print('--------')
... 
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world. 
--------

sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the languages with pre-trained models available in NLTK are:

alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README

Given text in another language, do this:

>>> german_text = "Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "

>>> for sent in sent_tokenize(german_text, language='german'):
...     print(sent)
...     print('---------')
... 
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. 
---------

To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and "training data format for nltk punkt".
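To connect this back to the lines marked A and B in the question, here is a minimal sketch of training a tokenizer directly. It assumes only that NLTK is installed; the short train_text and sample_text strings are made up for illustration and stand in for the State of the Union files. A model trained on so little text learns almost nothing useful, so treat this as a demonstration of the calls rather than a realistic setup.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical, tiny training text (an assumption for this sketch);
# a real model needs far more text to learn abbreviations reliably.
train_text = (
    "The committee met on Monday. It approved the budget. "
    "The vote was close. Several members asked for more time. "
    "The chair declined the request. The session ended at noon."
)

# Hypothetical text to be split into sentences.
sample_text = "Congress returns next week. Debate will resume then."

# Line A: passing raw text to the constructor trains the Punkt model
# on that text. Training is unsupervised, so no labels are needed.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

# Line B: the trained tokenizer splits new text into a list of sentences.
sentences = custom_sent_tokenizer.tokenize(sample_text)

for sent in sentences:
    print(sent)
    print("--------")
```

The two texts play different roles: the first is only used to learn sentence-boundary statistics, and the second is the text that actually gets split. Part-of-speech tagging happens afterwards, on each resulting sentence, and does not depend on the training text.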
