coreference-resolution

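License: Readme
Development language: Python
Category: Neural Networks/Artificial Intelligence, Machine Learning/Deep Learning
Software type: Open source
Region: Unknown
Submitted by: 令狐昌胤
Operating system: Cross-platform
Open source organization:
Intended audience: Unknown
Software overview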

Coreference Resolution

PyTorch 0.4.1 | Python 3.6.5

This repository consists of an efficient, annotated PyTorch reimplementation of the EMNLP paper "End-to-end Neural Coreference Resolution" by Lee et al., 2017. Main code can be found in this file.

Data

The source code assumes access to the English train, test, and development data of OntoNotes Release 5.0. This data should be located in a folder called 'data' inside the main directory. The data consists of 2,802 training documents, 343 development documents, and 348 testing documents. The average length of all documents is 454 words with a maximum length of 4,009 words. The number of mentions and coreferences in each document varies drastically, but is generally correlated with document length.

Because the data require a license from the Linguistic Data Consortium, they are not supplied here. Information on how to download and preprocess them can be found here and here, respectively.

Beyond the data, the source files also assume access to both Turian embeddings and GloVe embeddings.

Problem Definition

Coreference occurs when one or more expressions in a document refer back to an entity that came before them. Coreference resolution, then, is the task of finding all expressions that are coreferent with any of the entities found in a given text. While this problem definition seems simple enough, the nomenclature found in coreference resolution papers is often confusing. Visualizing the terms makes them a bit easier to understand:

Words are colored according to whether they are entities or not. Words that share a color are members of the same coreference cluster, and each differently colored group is its own cluster. Entities that are the only member of their cluster are known as 'singleton' entities.

Why Coreference Resolution is Hard

Entities can be very long, and coreferent entities can occur extremely far away from one another. A naive, exhaustive system would compute every possible span (contiguous sequence) of tokens and then compare it to every possible span that came before it. A document of T tokens has O(T^2) spans, so comparing pairs of spans makes the complexity of the problem O(T^4), where T is the document length. For a 100-word document this bound is 100 million possible options, and for the longest document in our dataset (4,009 words) it is roughly 2.6 x 10^14, about a quarter of a quadrillion possible combinations.
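The arithmetic behind these numbers can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative and not taken from the repository, and the exact pair count is shown next to the looser T^4 bound only for comparison.

```python
def num_spans(num_tokens):
    """Number of contiguous spans in a document of num_tokens tokens."""
    return num_tokens * (num_tokens + 1) // 2


def num_span_pairs(num_tokens):
    """Exact number of unordered pairs of distinct spans."""
    n = num_spans(num_tokens)
    return n * (n - 1) // 2


def upper_bound_pairs(num_tokens):
    """The loose T^4 bound on span-pair comparisons quoted in the text."""
    return num_tokens ** 4


if __name__ == "__main__":
    for length in (100, 4009):
        print(length, num_spans(length), num_span_pairs(length),
              upper_bound_pairs(length))
```

The exact pair count is smaller than T^4 by a constant factor, but it grows at the same rate, which is why pruning candidate spans matters so much for this model.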

If this does not make it concrete, imagine that we had the sentence

*Arya Stark walks her direwolf, Nymeria.*

Here we have three entities: Arya Stark, her, and Nymeria. For a native speaker of English, it is trivial to tell that her refers to Arya Stark. But how should a machine with no prior knowledge know that Arya and Stark form a single entity rather than two separate ones, that Nymeria does not refer back to her even though they are arguably related, or even that Arya Stark walks her direwolf, Nymeria is not just one big entity in and of itself?

For another example, consider the sentence

*Napoleon and all of his marvelously dressed, incredibly well-trained, loyal troops marched all the way across Europe to enter into Russia in an ultimately unsuccessful effort to conquer it for their country.*

The word their refers to Napoleon and all of his marvelously dressed, incredibly well-trained, loyal troops; entities can span many, many tokens. Coreferent entities can also occur far away from one another.

Model Architecture

As a forewarning, this paper presents a beast of a model. The authors present the following series of images to provide clarity as to what the model is doing.

1. Token Representation

Tokens are represented using 300-dimensional static GloVe embeddings, 50-dimensional static Turian embeddings, and 8-dimensional character embeddings passed through a CNN with 50 filters for each of the window sizes 3, 4, and 5. Dropout with p=0.50 is applied to these embeddings. The token representations are passed into a 2-layer bidirectional LSTM with a hidden state size of 200. Dropout with p=0.20 is applied to the output of the LSTM.
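To keep track of tensor sizes, the dimensions described above can be tallied as follows. This is a rough sketch under two assumptions that the text does not spell out: each CNN window size contributes 50 max-pooled features that are concatenated, and the bidirectional LSTM concatenates its forward and backward hidden states. The constant and function names are illustrative.

```python
GLOVE_DIM = 300            # static GloVe embedding size (from the text)
TURIAN_DIM = 50            # static Turian embedding size (from the text)
CHAR_FILTERS = 50          # filters per CNN window size (from the text)
CHAR_WINDOWS = (3, 4, 5)   # CNN window sizes (from the text)
LSTM_HIDDEN = 200          # hidden size per LSTM direction (from the text)


def token_dim():
    """Size of one token vector, assuming the three parts are concatenated."""
    char_dim = CHAR_FILTERS * len(CHAR_WINDOWS)  # assumption: one pooled value per filter
    return GLOVE_DIM + TURIAN_DIM + char_dim


def lstm_output_dim():
    """Size of one LSTM output, assuming both directions are concatenated."""
    return 2 * LSTM_HIDDEN


if __name__ == "__main__":
    print("token vector:", token_dim())
    print("LSTM output:", lstm_output_dim())
```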

2. Span Representation

Using the regularized output, span representations are computed by extracting the LSTM hidden states between the index of the first word and the last word of the span. Attention weights over these hidden states are used to compute their weighted sum. Then, we concatenate the hidden states at the first and last indices with the attention-weighted sum and a 20-dimensional feature embedding for the total width (length) of the span under consideration. This is done for all spans up to length 10 in the document.
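The span construction can be illustrated in plain Python, with lists standing in for tensors. This is a minimal sketch, not the repository's code: `attn_scores` is assumed to hold one unnormalized attention score per token (in the real model these would come from a learned scorer), and `width_embeddings` is a hypothetical lookup table with one 20-dimensional vector per span width.

```python
import math

MAX_WIDTH = 10  # longest span considered (from the text)


def enumerate_spans(num_tokens, max_width=MAX_WIDTH):
    """Yield (start, end) index pairs, end inclusive, for spans up to max_width."""
    for start in range(num_tokens):
        for end in range(start, min(start + max_width, num_tokens)):
            yield start, end


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_representation(states, attn_scores, start, end, width_embeddings):
    """Concatenate [h_start, h_end, attention-weighted sum, width feature]."""
    window = states[start:end + 1]
    weights = softmax(attn_scores[start:end + 1])
    dim = len(states[0])
    weighted = [sum(w * h[d] for w, h in zip(weights, window)) for d in range(dim)]
    width = end - start + 1
    # width_embeddings[width - 1] is the (hypothetical) embedding for this width
    return states[start] + states[end] + weighted + width_embeddings[width - 1]
```

With 400-dimensional hidden states, each span vector built this way would have 3 * 400 + 20 = 1,220 entries.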

3. Pruning

The span representations are passed into a 3-layer, 150-dimensional feedforward network with ReLU activations and p=0.20 dropout applied between each layer. The output of this feedforward network is 1-dimensional and represents the 'mention score' of each span in the document. Spans are then accepted in decreasing order of mention score unless, when considering a span i, there exists a previously accepted span j such that START(i) < START(j) <= END(i) < END(j) or START(j) < START(i) <= END(j) < END(i); that is, a span is rejected if it partially overlaps (crosses) a span that has already been kept. Only LAMBDA * T spans are kept at the end, where LAMBDA is set to 0.40 and T is the document length.
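The pruning rule can be sketched as a greedy pass over the spans sorted by mention score. This is an illustrative sketch, not the repository's implementation: spans are (start, end) tuples with inclusive ends, the function names are made up for this example, and rounding LAMBDA * T down to an integer is an assumption.

```python
def crosses(span_i, span_j):
    """True if the spans partially overlap, i.e. neither contains the other."""
    (start_i, end_i), (start_j, end_j) = span_i, span_j
    return (start_i < start_j <= end_i < end_j) or (start_j < start_i <= end_j < end_i)


def prune_spans(spans, scores, num_tokens, lam=0.40):
    """Keep up to lam * num_tokens top-scoring spans that do not cross each other."""
    budget = int(lam * num_tokens)  # assumption: round down
    order = sorted(range(len(spans)), key=lambda k: scores[k], reverse=True)
    kept = []
    for k in order:
        if len(kept) >= budget:
            break
        if not any(crosses(spans[k], other) for other in kept):
            kept.append(spans[k])
    return sorted(kept)


if __name__ == "__main__":
    spans = [(0, 2), (1, 3), (1, 2), (4, 4)]
    scores = [0.9, 0.8, 0.5, 0.1]
    # (1, 3) crosses the higher-scoring (0, 2) and is dropped;
    # (1, 2) is nested inside (0, 2), so it is allowed.
    print(prune_spans(spans, scores, num_tokens=10))
```

Note that nested spans are allowed to coexist; only partially overlapping spans are rejected.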

4. Pairwise Representation

For these spans, pairwise representations are computed for a given span i and its candidate antecedent j by concatenating the span representation for span i, the span representation for span j, the element-wise product of these two representations, and 20-dimensional feature embeddings for genre, distance between the spans, and whether or not the two spans have the same speaker.
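The concatenation for one (span, antecedent) pair can be written out directly. This is a minimal sketch with plain lists in place of tensors. The similarity term is taken here to be the element-wise product of the two span vectors, as in Lee et al. (2017); that choice, the function name, and the argument names are assumptions made for illustration.

```python
def pair_representation(g_i, g_j, genre_emb, distance_emb, speaker_emb):
    """Concatenate [g_i, g_j, g_i * g_j, genre, distance, speaker] for one pair.

    g_i and g_j are span vectors of equal length; the three feature vectors
    are assumed to be 20-dimensional embeddings looked up elsewhere.
    """
    if len(g_i) != len(g_j):
        raise ValueError("span representations must have the same length")
    product = [a * b for a, b in zip(g_i, g_j)]  # element-wise similarity term
    return g_i + g_j + product + genre_emb + distance_emb + speaker_emb


if __name__ == "__main__":
    zeros = [0.0] * 20
    pair = pair_representation([1.0, 2.0], [3.0, 4.0], zeros, zeros, zeros)
    print(len(pair))
```

The resulting vector has three times the span dimension plus 60 feature dimensions.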

5. Final Score and Loss

These representations are passed into a feedforward network similar to the one used to score the spans. Clusters are then formed by identifying chains of coreference links (e.g. span j and span k both refer to span i). The learning objective is to maximize the log-likelihood of all correct antecedents that were not pruned.
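Both the objective and the cluster-building step can be sketched in plain Python. This is a rough illustration under stated assumptions, not the repository's code: `scores` holds the final score of every candidate antecedent for one span, `gold` lists the indices of the correct ones (assumed non-empty; how spans with no surviving correct antecedent are handled is not covered here), and `links` maps each span to its predicted antecedent, which is assumed to precede it in the document.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def antecedent_loss(scores, gold):
    """Negative log of the total softmax probability of the correct antecedents."""
    return log_sum_exp(scores) - log_sum_exp([scores[k] for k in gold])


def build_clusters(links):
    """Group spans by following antecedent links.

    links maps span -> predicted antecedent span, or None for no antecedent.
    """
    cluster_of = {}
    clusters = []
    for span in sorted(links):
        antecedent = links[span]
        if antecedent is None:
            continue
        if antecedent not in cluster_of:
            cluster_of[antecedent] = len(clusters)
            clusters.append([antecedent])
        cluster_of[span] = cluster_of[antecedent]
        clusters[cluster_of[span]].append(span)
    return clusters
```

Minimizing `antecedent_loss` summed over all spans corresponds to maximizing the log-likelihood described above. In `build_clusters`, two spans that point to the same antecedent end up in one cluster, matching the example of spans j and k both referring to span i.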

Results

Originally from the paper,

Recent Work

The authors have since published another paper, which achieves an F1 score of 73.0.
