Question:

WordNet lemmatization and POS tagging in Python

谢运良

2023-03-14

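I want to use the WordNet lemmatizer in Python. I have learned that the default POS tag is NOUN, and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.

My question is: what is the best way to perform this lemmatization accurately?

I did the POS tagging with nltk.pos_tag, but I am lost on how to convert the Treebank POS tags into WordNet-compatible POS tags. Please help.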

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
# tokens is a list of word strings, e.g. from nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

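The output tags I get are things like NN, JJ, VB and RB. How do I change these into WordNet-compatible tags?

Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data?

3 answers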

濮阳弘扬

2023-03-14

Steps to convert: document -> sentences -> tokens -> POS tags -> lemmas

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# example text
text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'

class Splitter(object):
    """
    split the document into sentences and tokenize each sentence
    """
    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self,text):
        """
        out : ['What', 'can', 'I', 'say', 'about', 'this', 'place', '.']
        """
        # split into single sentence
        sentences = self.splitter.tokenize(text)
        # tokenization in each sentences
        tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass
    def get_wordnet_pos(self,treebank_tag):
        """
        Return the WordNet-compatible POS (a, n, r, v) for a Treebank tag
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # As default pos in lemmatization is Noun
            return wordnet.NOUN

    def pos_tag(self,tokens):
        # POS-tag each tokenized sentence: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ...]
        pos_tokens = [nltk.pos_tag(sentence) for sentence in tokens]

        # Lemmatize each word using its POS tag, producing tuples of
        # (original word, lemmatized word, [POS tag]), e.g. ('What', 'What', ['WP'])
        pos_tokens = [[(word, lemmatizer.lemmatize(word, self.get_wordnet_pos(pos_tag)), [pos_tag]) for (word, pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens

lemmatizer = WordNetLemmatizer()
splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

#step 1 split document into sentence followed by tokenization
tokens = splitter.split(text)

#step 2 lemmatization using pos tagger 
lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
print(lemma_pos_token)

韩华美

2023-03-14

As in the source code of nltk.corpus.reader.wordnet (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):

#{ Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
#}
POS_LIST = [NOUN, VERB, ADJ, ADV]
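
Since these constants are just single letters, the Treebank-to-WordNet conversion can be sketched as a small lookup on the first letter of the Treebank tag. This is only an illustration: the names TREEBANK_PREFIX_TO_WORDNET and treebank_to_wordnet are made up here and are not part of NLTK, and the constants are copied by hand instead of being imported so the snippet runs without NLTK installed.

```python
# WordNet POS constants, copied from the NLTK source quoted above.
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'

# Hypothetical lookup table (not part of NLTK):
# first letter of a Treebank tag -> WordNet POS constant.
TREEBANK_PREFIX_TO_WORDNET = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}


def treebank_to_wordnet(treebank_tag, default=None):
    """Return the WordNet POS for a Treebank tag, or `default` if none applies."""
    # Slicing with [:1] also copes with an empty tag string.
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)


# All verb tags (VB, VBD, VBG, ...) collapse to 'v', all noun tags to 'n', and so on.
for tag in ['NN', 'NNS', 'JJ', 'JJR', 'VB', 'VBD', 'VBG', 'RB', 'MD', 'PRP']:
    print(tag, '->', treebank_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as MD or PRP, come back as the default (None unless you pass something else), so the caller can decide how to handle them.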

空俊语

2023-03-14

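First, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER (a private attribute that newer NLTK releases may no longer expose):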

>>> import nltk
>>> nltk.tag._POS_TAGGER
'taggers/maxent_treebank_pos_tagger/english.pickle'

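Because it was trained on the Treebank corpus, it also uses the Treebank tag set.

The following function maps Treebank tags to WordNet part-of-speech names: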

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then use the return value with the lemmatizer:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('going', wordnet.VERB)
'go'

Check the return value before passing it to the lemmatizer, because an empty string will raise a KeyError.
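
One way to do that check is sketched below. It is an assumption-laden illustration and not the only option: lemmatize_tagged is a hypothetical helper name, the mapping function is repeated with literal POS letters so the snippet runs on its own, and the lemmatizer is passed in as a callable taking (word, pos). When the tag has no WordNet counterpart, the helper calls the lemmatizer without a POS, so it falls back to the lemmatizer's default.

```python
def get_wordnet_pos(treebank_tag):
    """Same mapping as in the answer, written with the literal WordNet POS letters."""
    if treebank_tag.startswith('J'):
        return 'a'  # wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return 'v'  # wordnet.VERB
    elif treebank_tag.startswith('N'):
        return 'n'  # wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return 'r'  # wordnet.ADV
    else:
        return ''


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, treebank_tag) pairs.

    `lemmatize` is any callable accepting (word, pos) or just (word),
    for example the bound method WordNetLemmatizer().lemmatize.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = get_wordnet_pos(tag)
        if pos:
            lemmas.append(lemmatize(word, pos))
        else:
            # No WordNet equivalent: never pass the empty string,
            # let the lemmatizer use its default POS instead.
            lemmas.append(lemmatize(word))
    return lemmas
```

With NLTK and its data installed, this would be called along the lines of lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer().lemmatize). If you would rather leave untagged words untouched, append the word itself in the else branch.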
