Python NLTK: lemmatizing the word 'further' with WordNet

董和泽

2023-03-14

Question:

I am using Python, NLTK, and the WordNetLemmatizer for lemmatization. Here is an arbitrary example that outputs what I expect:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective

Output: 'bad'

lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb

Output: 'worse'

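Well, everything here is fine. The behaviour is the same for other adjectives such as 'better' (an irregular form) or 'older' (note that the same test with 'elder' will never output 'old', but I guess WordNet is not an exhaustive list of all existing English words).

My question comes up when I try the word 'further':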

lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective

Output: 'further'

lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb

Output: 'far'

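This is exactly the opposite of the behaviour for the word 'worse'!

Can anybody explain why? Is it a bug in the WordNet synset data, or does it come from my misunderstanding of English grammar?

Please forgive me if this question has already been answered. I searched Google and SO, but with the keyword "further" I found nothing relevant, only noise, because the word is so common…

Thanks in advance, Romain G.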

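Answer:

WordNetLemmatizer uses the ._morphy function to look up a word's possible lemmas; as shown in http://www.nltk.org/_modules/nltk/stem/wordnet.html, it then returns the shortest of those lemmas, or the word itself if there are none: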

def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
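To make the last line concrete, here is a minimal sketch of the shortest-candidate choice. The helper name pick_lemma and the candidate lists are hypothetical stand-ins for what ._morphy might return; this is not NLTK code.

```python
def pick_lemma(word, candidates):
    """Return the shortest candidate, or the word itself if there are none."""
    return min(candidates, key=len) if candidates else word


# Hypothetical candidate lists, standing in for ._morphy results.
print(pick_lemma("worse", ["worse", "bad"]))  # 'bad' is shorter than 'worse'
print(pick_lemma("worse", ["worse"]))         # a single candidate is returned as-is
print(pick_lemma("xyzzy", []))                # no candidates: the word is unchanged
```

So whenever the candidate list contains both the inflected form and a shorter base form, the base form wins.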

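The ._morphy function applies rules iteratively to obtain a lemma. Each rule shortens the word by replacing its suffix according to MORPHOLOGICAL_SUBSTITUTIONS, and the function then checks whether the resulting form exists in the database for the given part of speech: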

def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
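A small self-contained sketch may help show what a single pass of the suffix rules does. The rule list and the vocabulary below are illustrative assumptions (a toy stand-in for MORPHOLOGICAL_SUBSTITUTIONS and the WordNet database), and toy_morphy is a hypothetical name; unlike the real function it applies the rules only once and ignores the exception list.

```python
# ASSUMPTION: illustrative comparative/superlative suffix rules.
SUBSTITUTIONS = [("er", ""), ("est", ""), ("er", "e"), ("est", "e")]

# ASSUMPTION: a toy stand-in for the adjective entries in the database.
VOCABULARY = {"old", "nice", "bad", "worse"}


def apply_rules(forms):
    """Replace every matching suffix, producing all candidate forms."""
    return [form[:-len(old)] + new
            for form in forms
            for old, new in SUBSTITUTIONS
            if form.endswith(old)]


def toy_morphy(form):
    """Keep the original form and its one-pass candidates that are known words."""
    result = []
    for candidate in [form] + apply_rules([form]):
        if candidate in VOCABULARY and candidate not in result:
            result.append(candidate)
    return result


print(apply_rules(["older"]))  # candidates produced by stripping 'er'
print(toy_morphy("older"))     # only the candidates found in the vocabulary
print(toy_morphy("worse"))     # no suffix rule applies to 'worse'
```

No suffix rule in this sketch can turn 'worse' into 'bad', which is why irregular forms need a separate lookup.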

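However, if the word is in the exception list, ._morphy skips the rules and returns the fixed forms stored in exceptions (after filtering them against the database); see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html: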

def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
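As a rough illustration of the parsing step, the sketch below builds the same kind of dictionary from a few in-memory lines instead of a file. The function name parse_exceptions and the sample list are hypothetical; the lines follow the "inflected-form base-form" layout that the loader splits on whitespace.

```python
def parse_exceptions(lines):
    """Map the first term of each line to the list of remaining terms."""
    table = {}
    for line in lines:
        terms = line.split()
        if terms:  # skip blank lines (a guard added for this sketch)
            table[terms[0]] = terms[1:]
    return table


# Sample lines in the .exc layout, held in memory for the example.
SAMPLE_LINES = ["best well", "better well", "further far"]

adv_exceptions = parse_exceptions(SAMPLE_LINES)
print(adv_exceptions["further"])    # the base forms listed for 'further'
print("worse" in adv_exceptions)    # 'worse' is not among the sample lines
```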

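Going back to your example, worse -> bad and further -> far cannot be obtained from the rules, so they must come from the exception lists. Because these are hand-maintained exception lists, inconsistencies between them are to be expected.

The exception lists are stored in ~/nltk_data/corpora/wordnet/adj.exc and ~/nltk_data/corpora/wordnet/adv.exc.

From adv.exc: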

best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard

From adj.exc:

...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
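The asymmetry in the question can be reproduced with a toy model of the lookup. Everything below is a sketch under stated assumptions: the tables contain only a few entries taken from the excerpts above, they assume that adj.exc has no entry for 'further' and adv.exc has no entry for 'worse' (the excerpts do not prove this, but it is consistent with the outputs in the question), the vocabulary is hypothetical, the suffix rules are left out, and toy_lemmatize is not an NLTK function.

```python
ADJ, ADV = "a", "r"  # local part-of-speech tags for this sketch

# ASSUMPTION: tiny exception tables, one per part of speech.
EXCEPTIONS = {
    ADJ: {"worse": ["bad"], "worst": ["bad"]},
    ADV: {"better": ["well"], "farther": ["far"], "further": ["far"]},
}

# ASSUMPTION: toy vocabulary standing in for the WordNet database.
VOCABULARY = {
    ADJ: {"bad", "worse", "far", "further"},
    ADV: {"worse", "far", "further", "well", "better"},
}


def toy_lemmatize(word, pos):
    """Exception lookup followed by the shortest-candidate choice."""
    candidates = [word] + EXCEPTIONS[pos].get(word, [])
    known = [c for c in candidates if c in VOCABULARY[pos]]
    return min(known, key=len) if known else word


for word in ("worse", "further"):
    for pos in (ADJ, ADV):
        print(word, pos, "->", toy_lemmatize(word, pos))
```

In this model each word is mapped to a shorter base form only for the part of speech whose exception table lists it, which matches the behaviour reported in the question.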
