An Exploration of Open-Source Chinese Word Segmentation Tools (6): Stanford CoreNLP

漆雕恺

2023-12-01

CoreNLP is an open-source Java NLP toolkit from Stanford University. It provides, among other things, a part-of-speech (POS) tagger, a named entity recognizer (NER), and sentiment analysis.


1. Introduction

CoreNLP's Chinese word segmenter is based on a CRF (conditional random field) model:

\[ P_w(y|x) = \frac{\exp \left( \sum_i w_i f_i(x,y) \right)}{Z_w(x)} \]

Here \(Z_w(x)\) is the normalization factor, \(w\) is the vector of model parameters, and \(f_i(x,y)\) are the feature functions.
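To make the formula concrete, here is a minimal sketch that evaluates \(P_w(y|x)\) by brute force on a three-character input. Everything in it is an assumption for illustration: the class ToyCrf, the two feature functions, and their weights are made up and have nothing to do with CoreNLP's actual features, and Latin letters stand in for Chinese characters. Real implementations compute \(Z_w(x)\) with dynamic programming rather than by enumerating every label sequence.

```java
/** Toy illustration of P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x). Not CoreNLP code. */
final class ToyCrf {
    /** sum_i w_i * f_i(x, y) for two hypothetical binary features. */
    static double score(String x, int[] y) {
        double s = 0.0;
        for (int t = 0; t < y.length; t++) {
            // f_1 (weight 1.5): character repeats the previous one and is labeled 1 ("join")
            if (t > 0 && x.charAt(t) == x.charAt(t - 1) && y[t] == 1) {
                s += 1.5;
            }
            // f_2 (weight 2.0): the first character is labeled 0 ("starts a word")
            if (t == 0 && y[t] == 0) {
                s += 2.0;
            }
        }
        return s;
    }

    /** Expands a bit mask into a label sequence of length n. */
    static int[] fromMask(int mask, int n) {
        int[] y = new int[n];
        for (int t = 0; t < n; t++) {
            y[t] = (mask >> t) & 1;
        }
        return y;
    }

    /** Z_w(x): sum of exp(score) over all 2^n binary label sequences. */
    static double partition(String x) {
        int n = x.length();
        double z = 0.0;
        for (int mask = 0; mask < (1 << n); mask++) {
            z += Math.exp(score(x, fromMask(mask, n)));
        }
        return z;
    }

    static double probability(String x, int[] y) {
        return Math.exp(score(x, y)) / partition(x);
    }

    public static void main(String[] args) {
        // Both toy features fire for this labeling, so it gets the highest probability.
        System.out.println(probability("aab", new int[] {0, 1, 0}));
    }
}
```

Because the probabilities are normalized by \(Z_w(x)\), summing probability() over all eight label sequences of a three-character input gives 1.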

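2. Breakdown

The source-code analysis below is based on version 3.7.0. For a segmentation example, see the SegDemo class.

Model

There are two main model files. The first is the dictionary file dict-chris6.ser.gz: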

    // dict-chris6.ser.gz deserializes to a dictionary stored as an array of 7 Sets
    // Word counts per Set: 0+7323+125336+142252+82139+26907+39243 = 423200
    ChineseDictionary::loadDictionary(String serializePath) {
        Set<String>[] dict = new HashSet[MAX_LEXICON_LENGTH + 1];
        for (int i = 0; i <= MAX_LEXICON_LENGTH; i++) {
            dict[i] = Generics.newHashSet();
        }
        dict = IOUtils.readObjectFromURLOrClasspathOrFileSystem(serializePath);
        return dict;
    }

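The array index is the word length: dictionary 0 is empty, dictionary 1 holds one-character words, and dictionary 6 holds six-character entries. The entries in dictionary 6 are partial words rather than complete ones, for example “《双峰》(电”, “８０年国家领”, and “１８２４年英”.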
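The length-indexed layout described above can be sketched as follows. This is an illustration only: LengthIndexedDictionary and its methods are hypothetical names, not part of CoreNLP, and rejecting entries longer than six characters is an assumption made to keep the sketch small.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of a dictionary bucketed by word length; names are hypothetical. */
final class LengthIndexedDictionary {
    static final int MAX_LEXICON_LENGTH = 6;

    // buckets[n] holds the entries whose length is n; buckets[0] stays empty
    private final Set<String>[] buckets;

    @SuppressWarnings("unchecked")
    LengthIndexedDictionary() {
        buckets = new HashSet[MAX_LEXICON_LENGTH + 1];
        for (int i = 0; i <= MAX_LEXICON_LENGTH; i++) {
            buckets[i] = new HashSet<>();
        }
    }

    /** Adds a word to the bucket for its length; returns false if it is out of range or already present. */
    boolean add(String word) {
        int len = word.length(); // counts UTF-16 code units
        if (len < 1 || len > MAX_LEXICON_LENGTH) {
            return false;
        }
        return buckets[len].add(word);
    }

    /** A lookup only has to probe the single bucket that matches the candidate's length. */
    boolean contains(String word) {
        int len = word.length();
        return len >= 1 && len <= MAX_LEXICON_LENGTH && buckets[len].contains(word);
    }

    int size(int length) {
        return buckets[length].size();
    }

    public static void main(String[] args) {
        LengthIndexedDictionary dict = new LengthIndexedDictionary();
        dict.add("ab");
        dict.add("abc");
        System.out.println(dict.contains("ab") + " " + dict.size(0));
    }
}
```

Bucketing by length means a candidate substring of length n is checked against one set only, which suits a feature extractor that probes fixed-size windows around each character.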

The second is the trained CRF model file /ctb.gz:

    CRFClassifier::loadClassifier(ObjectInputStream ois, Properties props) {
        Object o = ois.readObject();
        if (o instanceof List) {
            labelIndices = (List<Index<CRFLabel>>) o;       // label indices
        }
        classIndex = (Index<String>) ois.readObject();      // sequence-labeling labels
        featureIndex = (Index<String>) ois.readObject();    // features
        flags = (SeqClassifierFlags) ois.readObject();      // model configuration
        Object featureFactory = ois.readObject();           // feature template, used to generate features
        if (...) {
            ...
        } else if (featureFactory instanceof FeatureFactory) {
            featureFactories = Generics.newArrayList();
            featureFactories.add((FeatureFactory) featureFactory);
        }
        windowSize = ois.readInt();                         // the window size is 2
        weights = (double[][]) ois.readObject();            // weight of each (feature, label) pair
        Set<String> lcWords = (Set<String>) ois.readObject();  // this Set is empty
        if (...) {
            ...
        } else {
            knownLCWords = new MaxSizeConcurrentHashSet<>(lcWords);
        }
        reinit();
    }

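Unlike other segmenters, which use the four labels B, M, E, and S, CoreNLP's Chinese segmenter uses only two. "1" means the current character joins the previous character in the same word. "0" means the current character starts a new word, in other words the previous character ends the preceding word.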

    class CRFClassifier {
        classIndex: class edu.stanford.nlp.util.HashIndex
            ["1","0"]
    }

    // Annotation class that holds the Chinese segmentation label
    public static class AnswerAnnotation implements CoreAnnotation<String> {}
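The two-label scheme is easy to turn back into words, and it maps directly onto B/M/E/S. The sketch below shows both conversions. BinaryLabelDecoder is a hypothetical helper written for this post, not a CoreNLP class, and the sample sentence (four characters, pinyin "wo ai Beijing") is written with Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: converts 0/1 segmentation labels into words or B/M/E/S tags. */
final class BinaryLabelDecoder {
    /** labels[i] == 1: character i joins the previous character; 0: character i starts a new word. */
    static List<String> decode(String text, int[] labels) {
        if (text.length() != labels.length) {
            throw new IllegalArgumentException("expected one label per character");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // A 0 label closes the word built so far (if any) and opens a new one.
            if (labels[i] == 0 && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    /** The same segmentation expressed as B/M/E/S tags, for comparison with other segmenters. */
    static String toBmes(int[] labels) {
        StringBuilder tags = new StringBuilder();
        for (int i = 0; i < labels.length; i++) {
            boolean starts = i == 0 || labels[i] == 0;
            boolean ends = i == labels.length - 1 || labels[i + 1] == 0;
            tags.append(starts && ends ? 'S' : starts ? 'B' : ends ? 'E' : 'M');
        }
        return tags.toString();
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 0, 1};
        // Three words: two single characters followed by one two-character word.
        System.out.println(decode("\u6211\u7231\u5317\u4eac", labels));
        System.out.println(toBmes(labels)); // prints SSBE
    }
}
```

The binary scheme carries the same information as B/M/E/S for segmentation: a tag is determined by the label at the current position and the label at the next one.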

Features

CoreNLP's features look like this (sample):

    class CRFClassifier {
        // features
        featureIndex: class edu.stanford.nlp.util.HashIndex
            size = 3408491
            0=的膀cc2|C
            1=身也pc|C
            44=LSSLp2spscsc2s|C
            45=科背p2p|C
            46=迪。cc2|C
            ...
            =球-行pc2|CnC
            =音非cc2|CpC

        // weights
        weights: double[3408491][2]
            [[2.2114868426005005E-5, -2.2114868091546352E-5]...]
    }
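The dump shows a feature index (feature string to row number) and a weight matrix with one column per label. A minimal sketch of how such a pair can be used to score the two labels at one position is given below. FeatureScorer is a hypothetical class with made-up feature strings and weights. It only illustrates the lookup-and-sum idea and ignores that CpC/CnC features belong to a pair of adjacent labels rather than a single one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: per-label scores as sums of weights of the features that fired. */
final class FeatureScorer {
    private final Map<String, Integer> featureIndex = new HashMap<>();
    private final double[][] weights; // weights[featureId][labelId]

    FeatureScorer(String[] features, double[][] weights) {
        for (int i = 0; i < features.length; i++) {
            featureIndex.put(features[i], i);
        }
        this.weights = weights;
    }

    /** Features that are missing from the index (unseen in training) contribute nothing. */
    double[] score(List<String> activeFeatures) {
        double[] scores = new double[weights[0].length];
        for (String feature : activeFeatures) {
            Integer id = featureIndex.get(feature);
            if (id == null) {
                continue;
            }
            for (int label = 0; label < scores.length; label++) {
                scores[label] += weights[id][label];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Column 0 corresponds to label "1" and column 1 to label "0", following classIndex ["1","0"].
        FeatureScorer scorer = new FeatureScorer(
                new String[] {"ab::pc|C", "b::c|C"},
                new double[][] {{0.5, -0.5}, {-0.2, 0.2}});
        double[] scores = scorer.score(Arrays.asList("ab::pc|C", "b::c|C", "unseen|C"));
        System.out.println(Arrays.toString(scores));
    }
}
```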

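Feature names carry one of only three suffixes, C, CpC, and CnC, which mark three broad classes of features. All of them are generated by the feature template: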

    // list of feature templates
    featureFactories: ArrayList
        0 = Gale2007ChineseSegmenterFeatureFactory

    // the concrete feature template
    Gale2007ChineseSegmenterFeatureFactory::getCliqueFeatures() {
        if (clique == cliqueC) {
            addAllInterningAndSuffixing(features, featuresC(cInfo, loc), "C");
        } else if (clique == cliqueCpC) {
            addAllInterningAndSuffixing(features, featuresCpC(cInfo, loc), "CpC");
            addAllInterningAndSuffixing(features, featuresCnC(cInfo, loc - 1), "CnC");
        }
    }

The feature template uses only two cliques, cliqueC and cliqueCpC: cliqueC is implemented by featuresC(), and cliqueCpC by featuresCpC() together with featuresCnC().

    Gale2007ChineseSegmenterFeatureFactory::featuresC() {
        if (flags.useWord1) {
            // unigram features
            features.add(charc + "::c");             // c[0]
            features.add(charc2 + "::c2");           // c[1]
            features.add(charp + "::p");             // c[-1]
            features.add(charp2 + "::p2");           // c[-2]
            // bigram features
            features.add(charc + charc2 + "::cn");   // c[0]c[1]
            features.add(charc + charc3 + "::cn2");  // c[0]c[2]
            features.add(charp + charc + "::pc");    // c[-1]c[0]
            features.add(charp + charc2 + "::pn");   // c[-1]c[1]
            features.add(charp2 + charp + "::p2p");  // c[-2]c[-1]
            features.add(charp2 + charc + "::p2c");  // c[-2]c[0]
            features.add(charc2 + charc + "::n2c");  // c[1]c[0]
        }

        // Dictionary features: the LBeginAnnotation, LMiddleAnnotation and LEndAnnotation
        // labels of the three characters c[-1], c[0], c[1].
        // The resulting features end in one of nine suffixes:
        // "-lb", "-lm", "-le", "-plb", "-plm", "-ple", "-c2lb", "-c2lm", "-c2le"
        // flags.dictionary is null; flags.serializedDictionary is ".../models/segmenter/chinese/dict-chris6.ser.gz"
        if (flags.dictionary != null || flags.serializedDictionary != null) {
            dictionaryFeaturesC(CoreAnnotations.LBeginAnnotation.class,
                                CoreAnnotations.LMiddleAnnotation.class,
                                CoreAnnotations.LEndAnnotation.class,
                                "", features, p, c, c2);
        }

        // features c[-2]c[-1] and c[-2]
        if (flags.useFeaturesC4gram || flags.useFeaturesC5gram || flags.useFeaturesC6gram) {
            features.add(charp2 + charp + "p2p");
            features.add(charp2 + "p2");
        }

        // UnicodeType trigram feature
        if (flags.useUnicodeType || flags.useUnicodeType4gram || flags.useUnicodeType5gram) {
            features.add(uTypep + "-" + uTypec + "-" + uTypec2 + "-uType3");
        }

        // UnicodeType 4-gram feature
        if (flags.useUnicodeType4gram || flags.useUnicodeType5gram) {
            features.add(uTypep2 + "-" + uTypep + "-" + uTypec + "-" + uTypec2 + "-uType4");
        }

        // UnicodeBlock feature
        if (flags.useUnicodeBlock) {
            features.add(p.getString(CoreAnnotations.UBlockAnnotation.class) + "-"
                    + c.getString(CoreAnnotations.UBlockAnnotation.class) + "-"
                    + c2.getString(CoreAnnotations.UBlockAnnotation.class) + "-uBlock");
        }

        // Shape features
        if (flags.useShapeStrings) {
            if (flags.useShapeStrings1) {
                features.add(p.getString(CoreAnnotations.ShapeAnnotation.class) + "ps");
                features.add(c.getString(CoreAnnotations.ShapeAnnotation.class) + "cs");
                features.add(c2.getString(CoreAnnotations.ShapeAnnotation.class) + "c2s");
            }
            if (flags.useShapeStrings3) {
                features.add(p.getString(CoreAnnotations.ShapeAnnotation.class)
                        + c.getString(CoreAnnotations.ShapeAnnotation.class)
                        + c2.getString(CoreAnnotations.ShapeAnnotation.class) + "pscsc2s");
            }
            if (flags.useShapeStrings4) {
                features.add(p2.getString(CoreAnnotations.ShapeAnnotation.class)
                        + p.getString(CoreAnnotations.ShapeAnnotation.class)
                        + c.getString(CoreAnnotations.ShapeAnnotation.class)
                        + c2.getString(CoreAnnotations.ShapeAnnotation.class) + "p2spscsc2s");
            }
            if (flags.useShapeStrings5) {
                features.add(p2.getString(CoreAnnotations.ShapeAnnotation.class)
                        + p.getString(CoreAnnotations.ShapeAnnotation.class)
                        + c.getString(CoreAnnotations.ShapeAnnotation.class)
                        + c2.getString(CoreAnnotations.ShapeAnnotation.class)
                        + c3.getString(CoreAnnotations.ShapeAnnotation.class) + "p2spscsc2sc3s");
            }
        }
    }

    Gale2007ChineseSegmenterFeatureFactory::featuresCpC() {}

    Gale2007ChineseSegmenterFeatureFactory::featuresCnC() {}
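To see what the character n-gram part of such a template produces, here is a small self-contained sketch. CharWindowFeatures is a hypothetical class that reuses the "::c", "::pc" style names from the excerpt above and appends the clique suffix as "|C". It covers only a few of the unigram and bigram features. The padding character used for positions outside the sentence is an assumption and may differ from what CoreNLP does, and Latin letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of character-window features for the position loc. */
final class CharWindowFeatures {
    private static final char PAD = '_'; // assumed padding for positions outside the sentence

    private static String at(String sentence, int i) {
        boolean inside = i >= 0 && i < sentence.length();
        return String.valueOf(inside ? sentence.charAt(i) : PAD);
    }

    static List<String> featuresC(String sentence, int loc) {
        String p2 = at(sentence, loc - 2); // c[-2]
        String p = at(sentence, loc - 1);  // c[-1]
        String c = at(sentence, loc);      // c[0]
        String c2 = at(sentence, loc + 1); // c[1]

        List<String> features = new ArrayList<>();
        // unigram features
        features.add(c + "::c");
        features.add(c2 + "::c2");
        features.add(p + "::p");
        features.add(p2 + "::p2");
        // bigram features
        features.add(c + c2 + "::cn");
        features.add(p + c + "::pc");
        features.add(p2 + p + "::p2p");

        // Append the clique suffix so that features of different cliques never collide.
        List<String> suffixed = new ArrayList<>(features.size());
        for (String feature : features) {
            suffixed.add(feature + "|C");
        }
        return suffixed;
    }

    public static void main(String[] args) {
        for (String feature : featuresC("abcd", 1)) {
            System.out.println(feature);
        }
    }
}
```

Each distinct string produced this way becomes one row of the feature index, which is why a modest number of templates expands into millions of concrete features.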

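The three classes of features end in "|C" (32 in total), "|CpC" (37 in total), and "|CnC" (9 in total), for 78 feature types overall. In my view, CoreNLP's feature set is overly complex, and most of these features contribute little.

The rest of CoreNLP's pipeline is the same as in other segmenters: sum the weights of the active features for each label, run Viterbi decoding to find the most probable path, and convert the label sequence into the segmentation. CoreNLP's segmenter is very slow and its accuracy is mediocre. Its results on the PKU and MSR test sets are as follows: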

    Test set   Segmenter   Precision   Recall   F1
    PKU        thulac4j    0.948       0.936    0.942
    PKU        CoreNLP     0.901       0.894    0.897
    MSR        thulac4j    0.866       0.896    0.881
    MSR        CoreNLP     0.822       0.859    0.840
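For readers who have not seen the decoding step mentioned before the results, here is a minimal Viterbi sketch for a first-order chain. It is a simplification under stated assumptions: ViterbiSketch is a hypothetical class, the per-position label scores are passed in directly, and the transition scores are a single fixed matrix, whereas in CoreNLP the pairwise (CpC/CnC) scores depend on the features at each position.

```java
import java.util.Arrays;

/** Hypothetical sketch of Viterbi decoding over per-position and transition scores. */
final class ViterbiSketch {
    /**
     * node[t][y]    score of label y at position t
     * edge[prev][y] score of moving from label prev to label y
     * Returns the label sequence with the highest total score.
     */
    static int[] decode(double[][] node, double[][] edge) {
        int n = node.length;
        if (n == 0) {
            return new int[0];
        }
        int k = node[0].length;
        double[][] best = new double[n][k]; // best[t][y]: best score of any path ending in y at t
        int[][] back = new int[n][k];       // back[t][y]: previous label on that best path
        best[0] = node[0].clone();
        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                int argMax = 0;
                double max = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < k; prev++) {
                    double candidate = best[t - 1][prev] + edge[prev][y];
                    if (candidate > max) {
                        max = candidate;
                        argMax = prev;
                    }
                }
                best[t][y] = max + node[t][y];
                back[t][y] = argMax;
            }
        }
        // Pick the best final label, then follow the back-pointers to the start.
        int last = 0;
        for (int y = 1; y < k; y++) {
            if (best[n - 1][y] > best[n - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] node = {{1.0, 0.0}, {0.6, 0.5}, {0.0, 1.0}};
        double[][] noTransitions = {{0.0, 0.0}, {0.0, 0.0}};
        double[][] penalizeZeroZero = {{-1.0, 0.0}, {0.0, 0.0}};
        System.out.println(Arrays.toString(decode(node, noTransitions)));    // prints [0, 0, 1]
        System.out.println(Arrays.toString(decode(node, penalizeZeroZero))); // prints [0, 1, 1]
    }
}
```

The second call shows why decoding is done over the whole sequence: penalizing the 0 to 0 transition flips the label in the middle even though its own score slightly favors 0.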

