
Exploring Literature with the Stanza NLP Package

The Stanford NLP Group has long been an active player in natural language processing, particularly through their well-known CoreNLP Java toolkit. Until recently though, Stanford NLP has been a less well-known player in the Python community, which is a shame since many NLP practitioners work primarily in Python. But there’s good news! Stanford NLP’s Stanza Python library is coming into its own with the recent release of version 1.1.1!

The new Stanza version supports 66 different human languages (which is a big step forward, since NLP has long been very English-centric) and can carry out core NLP tasks like lemmatization and named entity recognition. Stanza is also customizable, which means that users can build their own pipelines and train their own models.

So, for all you Pythonistas out there, let’s take a look at Stanza and what it can do. We’ll start with a brief overview of core Stanza functionality and then we’ll use it to explore the characters in the classic novel, Moby Dick.

Pipeline

The Stanza Pipeline can be configured with a variety of options to select the language model, processors, etc. The language model must be downloaded before it can be used in a pipeline.

# load libraries
import stanza
import pandas as pd


# Download English language model and initialize the NLP pipeline.
stanza.download('en')
nlp = stanza.Pipeline('en')
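
As mentioned above, the Pipeline constructor accepts configuration options; the processors argument selects which annotators to run. A minimal sketch (the processor names below are the standard English ones that produce the annotations used later in this article; see the Stanza documentation for the full list of options):

# Optionally, build a pipeline with an explicit processor list rather
# than the default. These processors cover tokenization, POS tagging,
# lemmatization, dependency parsing, NER, and sentiment.
nlp_custom = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse,ner,sentiment')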

For our exploratory project, let’s use the first paragraph from Moby Dick.

# Use the default pipeline to create a Document object
moby_dick_para1 = "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."


moby_p1 = nlp(moby_dick_para1) # return a Document object

Data Objects

Documents

Stanza Document objects include text, tokens, words, dependencies and entities attributes. The tokens, words, dependencies and entities attributes are lists, and individual items can be accessed by index.

def print_doc_info(doc):
    print(f"Num sentences:\t{len(doc.sentences)}")
    print(f"Num tokens:\t{doc.num_tokens}")
    print(f"Num words:\t{doc.num_words}")
    print(f"Num entities:\t{len(doc.entities)}")


print_doc_info(moby_p1)


# Num sentences:    8
# Num tokens:       222
# Num words:        222
# Num entities:     3

Sentences

Each Sentence object contains doc, text, dependencies, tokens, words, and entities attributes. Individual items can be accessed by indexing the appropriate list.

def print_sentence_info(sentence):
    print(f"Text: {sentence.text}")
    print(f"Num tokens:\t{len(sentence.tokens)}")
    print(f"Num words:\t{len(sentence.words)}")
    print(f"Num entities:\t{len(sentence.entities)}")


print_sentence_info(moby_p1.sentences[0])


# Text:             Call me Ishmael.
# Num tokens:       4
# Num words:        4
# Num entities:     1

Tokens

Each Token object includes text, words, start_char, and end_char attributes, among others. In cases where a token is a multi-word token, the words attribute will contain each of the underlying words.

def print_token_info(token):
    print(f"Text:\t{token.text}")
    print(f"Start:\t{token.start_char}")
    print(f"End:\t{token.end_char}")


print_token_info(moby_p1.sentences[0].tokens[2])


# Text:     Ishmael
# Start:    8
# End:      15
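
English models rarely produce multi-word tokens, so tokens and words usually align one-to-one, but the relationship is easy to inspect by iterating a token's words attribute. A minimal sketch, reusing the token from above:

# For a multi-word token (common in languages like French, where "du"
# expands to "de" + "le"), this loop would print each underlying word.
# Here it prints the single word "Ishmael".
for word in moby_p1.sentences[0].tokens[2].words:
    print(word.text)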

Words

Each Word object includes various word-level annotations, as defined by the various processors (part-of-speech tagging, lemmatization, etc.), including text, lemma, upos, xpos, feats, and others.

def print_word_info(word):
    print(f"Text:\t{word.text}")
    print(f"Lemma: \t{word.lemma}")
    print(f"UPOS: \t{word.upos}")
    print(f"XPOS: \t{word.xpos}")


print_word_info(moby_p1.sentences[3].words[4])


# Text:     growing
# Lemma:    grow
# UPOS:     VERB
# XPOS:     VBG
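
The default English pipeline also runs a dependency parser, so each word additionally carries feats, head, and deprel annotations. A quick sketch, reusing the same word as above (exact values depend on the model version):

# Morphological features and dependency relation for the same word.
word = moby_p1.sentences[3].words[4]
print(f"Feats:\t{word.feats}")
print(f"Deprel:\t{word.deprel}")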


def word_info_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame object with one row for each word in
      doc, and columns for text, lemma, upos, and xpos.
    """
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            row = {
                "text": word.text,
                "lemma": word.lemma,
                "upos": word.upos,
                "xpos": word.xpos,
            }
            rows.append(row)
    return pd.DataFrame(rows)


word_info_df(moby_p1)


# 	text	lemma	upos	xpos
# 0	Call    call	VERB	VB
# 1	me      I   	PRON	PRP
# 2	Ishmael Ishmael	PROPN	NNP
# 3	.       .       PUNCT   .
# 4	Some	some	DET     DT

Entities

Stanza includes a built-in named entity recognition (NER) module, with options for extension and customization. The default pipeline includes the built-in NERProcessor, which recognizes named entities for all token spans. Each Entity object includes attributes for text, tokens, type, start_char, end_char, and others.

def print_entity_info(entity):
    print(f"Text:\t{entity.text}")
    print(f"Type:\t{entity.type}")
    print(f"Start:\t{entity.start_char}")
    print(f"End:\t{entity.end_char}")


print_entity_info(moby_p1.entities[0])


# Text:     Ishmael
# Type:     PERSON
# Start:    8
# End:      15

Sentiment Analysis

Stanza includes a built-in sentiment analysis processor, which can be customized as needed. Each Sentence object in a Document includes a sentiment score, where 0 represents negative, 1 represents neutral, and 2 represents positive. To make this a bit more human-readable, we'll convert the scores to a string descriptor.

def sentiment_descriptor(sentence):
    """
    - Parameters: sentence (a Stanza Sentence object)
    - Returns: A string descriptor for the sentiment value of sentence.
    """
    sentiment_value = sentence.sentiment
    if (sentiment_value == 0):
        return "negative"
    elif (sentiment_value == 1):
        return "neutral"
    else:
        return "positive"


print(sentiment_descriptor(moby_p1.sentences[0]))


# neutral


def sentence_sentiment_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame with one row for each sentence in doc,
      and columns for the sentence text and sentiment descriptor.
    """
    rows = []
    for sentence in doc.sentences:
        row = {
            "text": sentence.text,
            "sentiment": sentiment_descriptor(sentence)
        }
        rows.append(row)
    return pd.DataFrame(rows)


sentence_sentiment_df(moby_p1)


#   text                                                sentiment
# 0 Call me Ishmael.                                    neutral
# 1 Some years ago—never mind how long precisely—h...   neutral
# 2 It is a way I have of driving off the spleen a...   neutral
# 3 Whenever I find myself growing grim about the ...   negative
# 4 This is my substitute for pistol and ball.          neutral
# 5 With a philosophical flourish Cato throws hims...   neutral
# 6 There is nothing surprising in this.                neutral
# 7 If they but knew it, almost all men in their d...   neutral

Character Analysis

Now that we know a little bit about how to use Stanza, let’s use it to see if we can learn anything about the characters in Moby Dick. First, we’ll have to load up the full text. As many of you will remember, Moby Dick is a long novel, so putting it through the Stanza pipeline can take a while. If you happen to have access to GPUs though, Stanza is GPU-aware and the process will go much faster.

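Stanza's Pipeline constructor takes a use_gpu argument; by default it uses a GPU whenever one is available. A minimal sketch, assuming a working CUDA setup:

# Explicitly request the GPU (or pass use_gpu=False to force CPU).
nlp = stanza.Pipeline('en', use_gpu=True)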

# load the full text and put it through the pipeline
def load_text_doc(file_path):
    with open(file_path, encoding="utf-8") as f:
        txt = f.read()
    return txt


moby_path = "moby_dick.txt"
moby_dick_text = load_text_doc(moby_path)
moby_dick = nlp(moby_dick_text)


print_doc_info(moby_dick)


# Num sentences:    9966
# Num tokens:       253928
# Num words:        253928
# Num entities:     7955

Moby Dick Characters

Let's use Stanza's entity recognition function to identify all the characters in Moby Dick. We'll do this by selecting only those entities that have the type PERSON. Since each entity points back to its containing sentence, we'll go ahead and save the sentiment of that sentence for future use.

# select person entities
def select_person_entities(doc):
    return [ent for ent in doc.entities if ent.type == "PERSON"]


def person_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame with one row for each entity in doc
      that has a "PERSON" type, and columns for text, type, start_char,
      end_char, and the sentiment of the sentence in which the entity
      appears.
    """
    rows = []
    persons = select_person_entities(doc)
    for person in persons:
        row = {
            "text": person.text,
            "type": person.type,
            "start_char": person.start_char,
            "end_char": person.end_char,
            "sentence_sentiment": sentiment_descriptor(person._sent)
        }
        rows.append(row)
    return pd.DataFrame(rows)


characters = person_df(moby_dick)
display(characters.head())


#   text            type    start_char  end_char    sentence_sentiment
# 0 Ishmael         PERSON  29          36          neutral
# 1 Cato            PERSON  890         894         neutral
# 2 Tiger-lilies    PERSON  4226        4238        neutral
# 3 Jove            PERSON  4988        4992        neutral
# 4 Narcissus       PERSON  5080        5089        negative

Now that we have all of the characters from Moby Dick, we can start to analyze the data to see what we can learn about them. First, how many characters are there?

def num_unique_items(df, col):
    return len(df[col].unique())


num_unique_items(characters, "text")


# 699

Wow! 699 characters (or at least unique PERSON entities) is a lot. Most of those are mentioned just a single time, so perhaps we should take a look at just the major characters.

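As a quick check on that claim, we can count how many of the PERSON entities are mentioned exactly once (a sketch using the characters DataFrame from above):

# How many PERSON entities appear only a single time?
mention_counts = characters["text"].value_counts()
print((mention_counts == 1).sum())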

Character Counts

With our character dataframe in hand, we can now check which characters appear in the text most often. This will give us some idea about which characters are the most important.

# Which characters appear most frequently?
def frequency_count(df, col, limit=10):
    return df[col].value_counts().head(limit)


frequency_count(characters, "text")


# Ahab          474
# Stubb         224
# Queequeg      184
# Starbuck      140
# Jonah         81
# Moby Dick     75
# Bildad        70
# Peleg         69
# Pip           68
# Pequod        60

Unsurprisingly for anyone who has read Moby Dick, Captain Ahab is the most-mentioned character in the book. Other members of his crew like Stubb, Queequeg, and Starbuck make appearances in the most-frequent list as well. And of course, Moby Dick himself is in the top 10.

Character Sentiment

Since each Entity also includes a pointer to its parent sentence, we can now use the sentence sentiment rating that we saved earlier to make a judgement about the overall character sentiment. We’ll do this by converting our sentiment descriptors to a value of -1 for “negative”, 0 for “neutral”, and 1 for “positive”. After that, we can group the various appearances of each character and sum the sentiment value for each sentence the character appears in. A negative sum indicates a negative overall character sentiment, and a positive sum the opposite. And the farther from 0 the sum is, the stronger the sentiment.

# What is the sentiment surrounding each character?
def sentiment_descriptor_to_val(descriptor):
    """
    - Parameters: descriptor ("negative", "neutral", or "positive")
    - Returns: -1 for "negative", 0 for "neutral", 1 for "positive"
    """
    if descriptor == "negative":
        return -1
    elif descriptor == "neutral":
        return 0
    else:
        return 1


def character_sentiment(df):
    """
    - Parameters: df (Pandas DataFrame)
    - df must contain "text" and "sentence_sentiment" columns.
    - Returns: A Pandas DataFrame with one row per unique character and
      the summed sentiment values of the sentences in which that
      character appears, sorted from most negative to most positive.
    """
    sentiment = df.copy()
    sentiment["sentence_sentiment"] = [
        sentiment_descriptor_to_val(s) for s
        in sentiment["sentence_sentiment"]
    ]
    sentiment = sentiment[["text", "sentence_sentiment"]]
    sentiment = sentiment.groupby("text").sum().reset_index()
    
    return sentiment.sort_values("sentence_sentiment")


sentiment_df = character_sentiment(characters)


print("Characters in the most negative settings.")
display(sentiment_df.head(5))


# Characters in the most negative settings.
#       text        sentence_sentiment
# 6     Ahab        -42
# 508   Queequeg    -24
# 317   Jonah       -18
# 588   Stubb       -11
# 468   Pequod      -10


print("Characters in the most positive settings")
display(sentiment_df.tail(5))


# Characters in the most positive settings
#       text        sentence_sentiment
# 1     Abraham     2
# 424   Monsieur    3
# 218   Gabriel     3
# 401   Mary        4
# 93    Bunger      4

Phew. It would seem that Moby Dick is pretty grim! Almost no characters appear in majority positive sentences — and for those who do, the positivity is quite weak. As for Captain Ahab, his overall sentence sentiment sum is -42! Of course, we haven’t checked to see whether the sentiment is about Ahab, but merely the sentiment of sentences in which Ahab appears. Perhaps this is an indicator that Ahab lives a tortured and unhappy life — it would seem that he isn’t in a lot of happy sentences.

Next Steps

And that’s it for our quick look at Stanza! If you think Stanza could be a good fit for your needs, I highly encourage you to check it out — the documentation is excellent and has a good overview on usage. Perhaps you too can use it to explore your favorite novel. (And if you do, be sure to let us know the results!)

For more NLP Articles and News…

This article originally appeared on Lemmalytica — a blog about language, artificial intelligence, and coding. If you’re interested in more NLP-related resources and articles, check it out!

Translated from: https://medium.com/@severinperez/exploring-literature-with-the-stanza-nlp-package-927d5b6556bf
