当前位置：首页 > 软件库 > 神经网络/人工智能 > 机器学习/深度学习 >

coreference-resolution

授权协议 Readme

开发语言 Python

所属分类神经网络/人工智能、机器学习/深度学习

软件类型开源软件

地区不详

投递者令狐昌胤

操作系统跨平台

开源组织无

适用人群未知

软件概览

Coreference Resolution

PyTorch 0.4.1 | Python 3.6.5

This repository consists of an efficient, annotated PyTorch reimplementation of the EMNLP paper "End-to-end Neural Coreference Resolution" by Lee et al., 2017. Main code can be found in this file.

Data

The source code assumes access to the English train, test, and development data of OntoNotes Release 5.0. This data should be located in a folder called 'data' inside the main directory. The data consists of 2,802 training documents, 343 development documents, and 348 testing documents. The average length of all documents is 454 words with a maximum length of 4,009 words. The number of mentions and coreferences in each document varies drastically, but is generally correlated with document length.

Since the data require a license from the Linguistic Data Consortium to use, they are thus not supplied here. Information on how to download and preprocess them can be found here and here, respectively.

Beyond the data, the source files also assume access to both Turian embeddings and GloVe embeddings.

Problem Definition

Coreference is defined as occurring when one or more expressions in a document refer back to the an entity that came before it/them. Coreference resolution, then, is the task of finding all expressions that are coreferent with any of the entities found in a given text. While this problem definition seems simple enough, oftentimes the nomenclature found in papers regarding coreference resolution is quite confusing. Visualizing them makes things a bit easier to understand:

Words are colored according to whether they are entities or not. Different colored groups of words are members of the same coreference cluster. Entities that are the only member of their cluster are known as 'singleton' entities.

Why Corefence Resolution is Hard

Entities can be very long and coreferent entities can occur extremely far away from one another. A greedy system would compute every possible span (sequence) of tokens and then compare it to every possible span that came before it. This makes the complexity of the problem O(T⁴), where T is the document length. For a 100 word document this would be 100 million possible options and for the longest document in our dataset, this equates to almost one quadrillion possible combinations.

If this does not make it concrete, imagine that we had the sentence

* Arya Stark walks her direwolf, Nymeria. *

Here we have three entities: Arya Stark, her, and Nymeria. As a native speaker of English it should be trivial to tell that her refers to Arya Stark. But to a machine with no knowledge, how should it know that Arya and Stark should be a single entity rather than two separate ones, that Nymeria does not refer back to her even though they are arguably related, or even that that Arya Stark walks her direwolf, Nymeria is not just one big entity in and of itself?

For another example, consider the sentence

* Napoleon and all of his marvelously dressed, incredibly well-trained, loyal troops marched all the way across the Europe to enter into Russia in an, ultimately unsuccessful, effort to conquer it for their country. *

The word their is referent to Napoleon and all of his marvelously dressed, incredibly well trained, loyal troops; entities can span many, many tokens. Coreferent entities can also occur far away from one another.

Model Architecture

As a forewarning, this paper presents a beast of a model. The authors present the following series of images to provide clarity as to what the model is doing.

1. Token Representation

Tokens are represented using 300-dimension static GloVe embeddings, 50-dimensional static Turian embeddings, and 8-dimensional character embeddings from a CNN with 50-dimensional filter sizes 3, 4, and 5. Dropout with p=0.50 is applied to these embeddings. The token representations are passed into a 2-layer bidirectional LSTM with hidden state sizes of 200. Dropout with p=0.20 is applied to the output of the LSTM.

2. Span Representation

Using the regularized output, span representations are computed by extracting the LSTM hidden states between the index of the first word and the last word. These are used to compute a weighted sum of the hidden states. Then, we concatenate the first and last index with the weighted attention sum and a 20-dimensional feature representation for the total width (length) of the span under consideration. This is done for all spans up to length 10 in the document.

3. Pruning

The span representations are passed into a 3-layer, 150-dimensional feedforward network with ReLU activations and p=0.20 dropout applied between each layer. The output of this feedfoward network is 1-dimensional and represents the 'mention score' of each span in the document. Spans are then pruned in decreasing order of mention score unless, when considering a span i, there exists a previously accepted span j such that START(i) < START(j) <= END(i) < END(j) or START(j) < START(i) <= END(j) < END(j). Only LAMBDA * T spans are kept at the end, where LAMBDA is set to 0.40 and T is the document length.

4. Pairwise Representation

For these spans, pairwise representations are computed for a given span i and its antecedent j by concatenating the span representation for span i, the span representation for span j, the dot product between these representations, and 20-dimensional feature embeddings for genre, distance between the spans, and whether or not the two spans have the same speaker.

5. Final Score and Loss

These representations are passed into a feedforward network similar to that of scoring the spans. Clusters are then formed for these coreferences by identifying chains of coreference links (e.g. span j and span k both refer to span i). The learning objective is to maximize the log-likelihood of all correct antecedents that were not pruned.

Results

Originally from the paper,

Recent Work

The authors have since published another paper, which achieves an F1 score of 73.0.

使用案例

coreference resolution

介绍共指解析，按照百度的定义如下： 1 众所周知，人们为了避免重复，习惯用代词、称谓和缩略语来指代前面提到的实体全称。例如，在文章开始处会写“哈尔滨工业大学”，后面可能会说“哈工大”、“工大”等，还会提到“这所大学”、“她”等。这种现象称为共指现象。简而言之，其目的在于自动识别表示同一实体的名词短语或代词等。举个例子：哈尔滨工业大学，一般学生或者大众喜欢简称为哈工大，工大等，她是一所美丽的
【Chinese Coreference resolution】StanfordNLP代码实现

最近在做一个角色识别的项目，项目中需要用到共指消解的方法，大体了解的有四种：基于Spenbert（https://github.com/troublemaker-r/Chinese_Coreference_Resolution/blob/2b88450eeb3da248fb0f6365c38a32b9fffcb962/README.md）基于问答系统的Span prediction（https