Stanford Named Entity Recognizer (NER)

奚晟
2023-12-01

The following content is taken from: https://nlp.stanford.edu/software/CRF-NER.html

About

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances, including models trained on just the CoNLL 2003 English training data.

Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can use this code to build sequence models for NER or any other task. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) or Sutton and McCallum (2010) for more comprehensible introductions.)

The original CRF code is by Jenny Finkel. The feature extractors are by Dan Klein, Christopher Manning, and Jenny Finkel. Much of the documentation and usability is due to Anna Rafferty. More recent code development has been done by various Stanford NLP Group members.

Stanford NER is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation (look at the shell scripts and batch files included in the download), running as a server (look at NERServer in the sources jar file), and a Java API (look at the simple examples in the NERDemo.java file included in the download, and then at the javadocs). Stanford NER code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gifts.

Citation

The CRF sequence models provided here do not precisely correspond to any published paper, but the correct paper to cite for the model and software is:

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf

The software provided here is similar to the baseline local+Viterbi model in that paper, but adds new distributional similarity based features (in the -distSim classifiers). Distributional similarity features improve performance, but the models require somewhat more memory. Our big English NER models were trained on a mixture of CoNLL, MUC-6, MUC-7 and ACE named entity corpora, and as a result the models are fairly robust across domains.

Getting started

You can try out Stanford NER CRF classifiers or Stanford NER as part of Stanford CoreNLP on the web, to understand what Stanford NER is and whether it will be useful to you.

To use the software on your computer, download the zip file. You then unzip the file by either double-clicking on the zip file, using a program for unpacking zip files, or by using the unzip command. This should create a stanford-ner folder. There is no installation procedure; you should be able to run Stanford NER from that folder. Normally, Stanford NER is run from the command line (i.e., shell or terminal). Current releases of Stanford NER require Java 1.8 or later. Either make sure you have or get Java 8, or consider running an earlier version of the software (versions through 3.4.1 support Java 6 and 7).

NER GUI

Provided java is on your PATH, you should be able to run an NER GUI demonstration by just clicking. It might work to double-click on the stanford-ner.jar archive, but this may well fail because the operating system does not give Java enough memory for our NER system, so it is safer to instead double-click on the ner-gui.bat icon (Windows) or ner-gui.sh (Linux/Unix/MacOSX). Then, using the top option from the Classifier menu, load a CRF classifier from the classifiers directory of the distribution. You can then either load a text file or web page from the File menu, or decide to use the default text in the window. Finally, you can tag the named entities in the text by pressing the Run NER button.

Single CRF NER Classifier from command-line

From a command line, you need to have java on your PATH and the stanford-ner.jar file in your CLASSPATH. (The way of doing this depends on your OS/shell.) The supplied ner.bat and ner.sh should work to allow you to tag a single file, when running from inside the Stanford NER folder. For example, for Windows:

ner file

This corresponds to the full command:

java -mx600m -cp "*;lib\*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt

Or on Unix/Linux you should be able to parse the test file in the distribution directory with the command:

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt
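The Windows and Unix/Linux commands above differ mainly in the classpath separator (";" versus ":"). As a minimal sketch, assuming you want to launch the tagger from a script, the following Python helper assembles the same argument list with the separator of the current platform. The function name build_ner_command and its parameters are hypothetical and not part of Stanford NER; only the class name, flags, and model path are copied from the commands above.

```python
import os

# Model path copied from the commands above.
DEFAULT_MODEL = "classifiers/english.all.3class.distsim.crf.ser.gz"


def build_ner_command(text_file, model=DEFAULT_MODEL, pathsep=os.pathsep):
    """Return the java argument list for tagging one text file.

    Hypothetical helper: it only mirrors the command lines shown above.
    """
    # Jars in the current folder and in lib/, joined by the platform's
    # classpath separator (";" on Windows, ":" on Unix/Linux).
    classpath = pathsep.join(["*", "lib/*"])
    return [
        "java", "-mx600m", "-cp", classpath,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", model,
        "-textFile", text_file,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run to execute it
    # from inside the stanford-ner folder.
    print(" ".join(build_ner_command("sample.txt")))
```

Passing the list to subprocess.run, rather than a single shell string, keeps the quoted classpath wildcard from being expanded by the shell.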

Here's an output option that will print out entities and their class into the first two columns of a tab-separated output file:

java -mx600m -cp "*;lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile sample.txt > sample.tsv
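Once sample.tsv has been written, the entities can be read back from its first two columns. The sketch below is one way to do that in Python, assuming only what is stated above: the entity is in column one and its class in column two. The function name read_entities is hypothetical, the sample text is invented for illustration, and any further columns are simply ignored.

```python
def read_entities(tsv_text):
    """Collect (entity, label) pairs from the first two tab-separated columns."""
    entities = []
    for line in tsv_text.splitlines():
        columns = line.split("\t")
        if len(columns) < 2:
            continue
        entity, label = columns[0].strip(), columns[1].strip()
        # Rows without an entity or without a class carry no entity to report.
        if entity and label:
            entities.append((entity, label))
    return entities


if __name__ == "__main__":
    # Invented sample rows, not real tagger output.
    sample = "Jane Doe\tPERSON\tvisited\nStanford University\tORGANIZATION\tlast week.\n"
    for entity, label in read_entities(sample):
        print(label, entity)
```

To process a real run, read the file with open("sample.tsv", encoding="utf-8").read() and pass the text to read_entities.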

Full Stanford NER functionality

This standalone distribution also allows access to the full NER capabilities of the Stanford CoreNLP pipeline. These capabilities can be accessed via the NERClassifierCombiner class. NERClassifierCombiner allows for multiple CRFs to be used together, and has options for recognizing numeric sequence patterns and time patterns with the rule-based NER of SUTime.

To use NERClassifierCombiner at the command-line, the jars in the lib directory and stanford-ner.jar must be in the CLASSPATH. Here is an example command:

java -mx1g -cp "*:lib/*" edu.stanford.nlp.ie.NERClassifierCombiner -textFile sample.txt -ner.model classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz

The one difference you should see from above is that Sunday is now recognized as a DATE.

Programmatic use via API

You can call Stanford NER from your own code. The file NERDemo.java included in the distribution illustrates several ways of calling the system programmatically. We suggest that you start from there, and then look at the javadoc, etc. as needed.

Programmatic use via a service

Stanford NER can also be set up to run as a server listening on a socket.

Questions

You can look at a PowerPoint Introduction to NER and the Stanford NER package [ppt] [pdf]. There is also a list of Frequently Asked Questions (FAQ), with answers! This includes some information on training models. Further documentation is provided in the included README.txt and in the javadocs.

Have a support question? Ask us on Stack Overflow using the tag stanford-nlp.

Feedback and bug reports / fixes can be sent to our mailing lists.

Mailing Lists

We have 3 mailing lists for the Stanford Named Entity Recognizer, all of which are shared with other JavaNLP tools (with the exclusion of the parser). Each address is at @lists.stanford.edu:

  1. java-nlp-user This is the best list to post to in order to send feature requests, make announcements, or for discussion among JavaNLP users. (Please ask support questions on Stack Overflow using the stanford-nlp tag.)

    You have to subscribe to be able to use this list. Join the list via this webpage or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.

  2. java-nlp-announce This list will be used only to announce new versions of Stanford JavaNLP tools. So it will be very low volume (expect 1-3 messages a year). Join the list via this webpage or by emailing java-nlp-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)
  3. java-nlp-support This list goes only to the software maintainers. It's a good address for licensing questions, etc. For general use and support questions, you're better off joining and using java-nlp-user. You cannot join java-nlp-support, but you can mail questions to java-nlp-support@lists.stanford.edu.

 

Download

Download Stanford Named Entity Recognizer version 3.9.2

The download is a 151M zipped file (mainly consisting of classifier data objects). If you unpack that file, you should have everything needed for English NER (or use as a general CRF). It includes batch files for running under Windows or Unix/Linux/MacOSX, a simple GUI, and the ability to run as a server. Stanford NER requires Java v1.8+. If you want to use Stanford NER for other languages, you'll also need to download model files for those languages; see further below.

Extensions: Packages by others using Stanford NER

For some (computer) languages, there are more up-to-date interfaces to Stanford NER available by using it inside Stanford CoreNLP, and you are better off getting those from the CoreNLP page and using them....

Models

Included with Stanford NER are a 4 class model trained on the CoNLL 2003 eng.train, a 7 class model trained on the MUC 6 and MUC 7 training data sets, and a 3 class model trained on both data sets and some additional data (including ACE 2002 and limited amounts of in-house data) on the intersection of those class sets. (The training data for the 3 class model does not include any material from the CoNLL eng.testa or eng.testb data sets, nor any of the MUC 6 or 7 test or devtest datasets, nor Alan Ritter's Twitter NER data, so all of these remain valid tests of its performance.)

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
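The 3 class set is described above as the intersection of the 4 class and 7 class sets. A small Python check, using only the class names from the list above, makes that relationship concrete; the variable names are illustrative.

```python
# Class inventories copied from the list above.
FOUR_CLASS = {"Location", "Person", "Organization", "Misc"}
SEVEN_CLASS = {"Location", "Person", "Organization",
               "Money", "Percent", "Date", "Time"}

# The 3 class model covers the labels that both inventories share.
THREE_CLASS = FOUR_CLASS & SEVEN_CLASS

# Labels that only one of the two larger models can produce.
ONLY_FOUR = FOUR_CLASS - SEVEN_CLASS
ONLY_SEVEN = SEVEN_CLASS - FOUR_CLASS

if __name__ == "__main__":
    print(sorted(THREE_CLASS))
    print(sorted(ONLY_FOUR), sorted(ONLY_SEVEN))
```

This is why text tagged with Misc or with Date cannot be produced by the 3 class model alone: those labels exist in only one of the two source inventories.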

 

These models each use distributional similarity features, which provide considerable performance gain at the cost of increasing their size and runtime. We also have models that are the same except without the distributional similarity features. You can find them in our English models jar. You can either unpack the jar file or add it to the classpath; if you add the jar file to the classpath, you can then load the models from the path edu/stanford/nlp/models/.... You can run jar -tf <jar-file> to get the list of files in the jar file.

Also available are caseless versions of these models, better for use on texts that are mainly lower or upper case, rather than following the conventions of standard English.

CoreNLP models jars download page

 

Important note: There was a problem with the v3.6.0 English Caseless NER model. See this page.

 

German

A German NER model is available, based on work by Manaal Faruqui and Sebastian Padó. You can find it in the CoreNLP German models jar. For citation and other information relating to the German classifiers, please see Sebastian Padó's German NER page (but the models there are now many years old; you should use the better models that we have!). It is a 4 class IOB1 classifier (see, e.g., Memory-Based Shallow Parsing by Erik F. Tjong Kim Sang). The tags given to words are: I-LOC, I-PER, I-ORG, I-MISC, B-LOC, B-PER, B-ORG, B-MISC, O. It is trained over the CoNLL 2003 data with distributional similarity classes built from the Huge German Corpus.
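In the IOB1 scheme, tokens inside an entity are tagged I-TYPE, and B-TYPE is used only where an entity begins directly after another entity, so that the two can be told apart. As a rough sketch of how such a tag sequence can be turned into entity spans, the following Python function groups tags into (type, start, end) triples with an exclusive end index. The function name iob1_to_spans is hypothetical and is not part of Stanford NER.

```python
def iob1_to_spans(tags):
    """Group IOB1 tags into (type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            prefix, kind = "O", None
        else:
            prefix, kind = tag.split("-", 1)
        # An open entity ends at O, at any B- tag, or when the type changes.
        if current is not None and (kind != current or prefix == "B"):
            spans.append((current, start, i))
            start, current = None, None
        # Any non-O tag seen while no entity is open starts a new one.
        if kind is not None and current is None:
            start, current = i, kind
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


if __name__ == "__main__":
    tags = ["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-ORG"]
    for kind, begin, end in iob1_to_spans(tags):
        print(kind, begin, end)
```

Note that the two adjacent LOC entities in the demo are separated only by the B-LOC tag; with I-LOC in its place they would merge into a single span.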

CoreNLP models jars download page

 

Here are a couple of commands using these models, two sample files, and a couple of notes. Running on TSV files: the models were saved with options for testing on German CoNLL NER files. While the models use just the surface word form, the input reader expects the word in the first column and the class in the fifth column (1-indexed columns). You can either make the input like that or else change the expectations with, say, the option -map "word=0,answer=1" (0-indexed columns). These models were also trained on data with straight ASCII quotes and BIO entity tags. Also, be careful of the text encoding: The default is Unicode; use -encoding iso-8859-15 if the text is in 8-bit encoding.

TSV mini test file: german-ner.tsv
Text mini test file: german-ner.txt
java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz -testFile german-ner.tsv
java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz -tokenizerOptions latexQuotes=false -textFile german-ner.txt
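The column convention described above (word in column 1, class in column 5, both 1-indexed) can also be reduced to a two-column layout that matches -map "word=0,answer=1". The Python sketch below shows one way to do that conversion. It assumes the file is tab-separated and that blank lines should be passed through unchanged; the function name to_two_columns and its parameters are hypothetical.

```python
def to_two_columns(lines, word_col=0, answer_col=4):
    """Keep only the word and class columns of tab-separated lines.

    Column indices are 0-indexed here, so the defaults correspond to
    the first and fifth columns described above.
    """
    converted = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: blank lines are separators and are kept as they are.
            converted.append("")
            continue
        columns = line.split("\t")
        converted.append(columns[word_col] + "\t" + columns[answer_col])
    return converted


if __name__ == "__main__":
    # Invented rows for illustration; the middle columns are placeholders.
    rows = ["Berlin\t_\t_\t_\tI-LOC", "ist\t_\t_\t_\tO", ""]
    print("\n".join(to_two_columns(rows)))
```

The output has the word in column 0 and the class in column 1, which is the layout that the -map option quoted above describes.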

 

Spanish

From version 3.4.1 forward, we have a Spanish model available for NER. It is included in the Spanish corenlp models jar.

CoreNLP models jars download page

 

Chinese

We also provide Chinese models built from the OntoNotes Chinese named entity data. There are two models, one using distributional similarity clusters and one without. These are designed to be run on word-segmented Chinese. So, if you want to use these on normal Chinese text, you will first need to run Stanford Word Segmenter or some other Chinese word segmenter, and then run NER on the output of that!

CoreNLP models jars download page

 

Online Demo

We have an online demo of several of our NER models. Special thanks to Dat Hoang, who provided the initial version. Note that the online demo demonstrates single CRF models; in order to see the effect of the time annotator or the combined models, see CoreNLP.

 

Release History

 

Version  Date        Description
3.9.2    2018-10-16  Updated for compatibility
3.9.1    2018-02-27  KBP ner models for Chinese and Spanish
3.8.0    2017-06-09  Updated for compatibility
3.7.0    2016-10-31  Improvements to Chinese and German NER
3.6.0    2015-12-09  Updated for compatibility
3.5.2    2015-04-20  synch standalone and CoreNLP functionality
3.5.1    2015-01-29  Substantial accuracy improvements
3.5.0    2014-10-26  Upgrade to Java 8
3.4.1    2014-08-27  Added Spanish models
3.4      2014-06-16  Fix serialization of new models
3.3.1    2014-01-04  Bugfix release
3.3.0    2013-11-12  Updated for compatibility
3.2.0    2013-06-20  Improved line by line handling
1.2.8    2013-04-04  -nthreads option
1.2.7    2012-11-11  Add Chinese model, include Wikipedia data in 3-class English model
1.2.6    2012-07-09  Minor bug fixes
1.2.5    2012-05-22  Fix encoding issue
1.2.4    2012-04-07  Caseless versions of models supported
1.2.3    2012-01-06  Minor bug fixes
1.2.2    2011-09-14  Improved thread safety
1.2.1    2011-06-19  Models reduced in size but on average improved in accuracy (improved distsim clusters)
1.2      2011-05-16  Normal download includes 3, 4, and 7 class models. Updated for compatibility with other software releases.
1.1.1    2009-01-16  Minor bug and usability fixes, and changed API (in particular the methods to classify and output tagged text)
1.1      2008-05-07  Additional feature flags, various code updates
1.0      2006-09-18  Initial release