An In-Depth Look at LingPipe

吕树

2023-12-01

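LingPipe is a natural language processing toolkit developed by Alias-i. It provides API interfaces for a range of NLP tasks, including text classification, named entity recognition, sentiment classification, Chinese word segmentation, part-of-speech tagging, spelling correction, and clustering. I recently needed this tool at work, so I studied it in some depth; below are my notes.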

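I. Named Entity Recognition

1. Principles

LingPipe supports supervised statistical models as well as simpler, more direct methods such as dictionary matching and rule-based regular-expression matching.

Two points about the statistical model deserve attention. First, LingPipe requires the training corpus to annotate every entity together with its type. Second, the training and test corpora should be the same kind of data: if the model is trained on microblog (Weibo) text and tested on news text, results are likely to be poor.

1) Rule-based named entity recognition

This works mainly by regular-expression matching (for example, matching email addresses). LingPipe provides both a training interface and a testing interface.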
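As a rough illustration of regular-expression chunking, here is a minimal sketch that uses only java.util.regex rather than LingPipe's own chunker classes. The class name RegexEmailChunker, the EMAIL type label, and the deliberately simplified email pattern are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEmailChunker {
    // Simplified email pattern (an assumption for illustration,
    // not a complete RFC 5322 matcher).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns one "start-end:EMAIL:text" string per regex match. */
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            // start is inclusive, end is exclusive, as in String.substring
            chunks.add(m.start() + "-" + m.end() + ":EMAIL:" + m.group());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.com or bob@example.org.";
        for (String c : chunk(text)) {
            System.out.println(c);
        }
    }
}
```

Each match is reported with its character offsets and a type, which is the same kind of information a chunker returns for a named entity.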

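Input:

2) Named entity recognition by exact dictionary matching: every string that appears in the dictionary is recognized as a named entity. LingPipe has a dedicated switch that controls whether this feature is active.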
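The idea of exact dictionary matching can be sketched in plain Java as below. This is not LingPipe's dictionary chunker: the class ExactDictionaryMatcher and its methods are hypothetical, and the sketch matches raw substrings instead of tokenized phrases, which is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExactDictionaryMatcher {
    // phrase -> entity type, kept in insertion order
    private final Map<String, String> dictionary = new LinkedHashMap<>();

    void addEntry(String phrase, String type) {
        if (!phrase.isEmpty()) {          // an empty phrase would match everywhere
            dictionary.put(phrase, type);
        }
    }

    /** Returns one "start-end:TYPE:phrase" string per occurrence of any entry. */
    List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        for (Map.Entry<String, String> e : dictionary.entrySet()) {
            String phrase = e.getKey();
            int start = text.indexOf(phrase);
            while (start >= 0) {
                int end = start + phrase.length();
                chunks.add(start + "-" + end + ":" + e.getValue() + ":" + phrase);
                start = text.indexOf(phrase, start + 1);  // look for the next occurrence
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        ExactDictionaryMatcher matcher = new ExactDictionaryMatcher();
        matcher.addEntry("Paris", "LOCATION");
        matcher.addEntry("Paris Hilton", "PERSON");
        for (String c : matcher.chunk("Paris Hilton visited Paris")) {
            System.out.println(c);
        }
    }
}
```

Note that overlapping entries ("Paris" inside "Paris Hilton") are all reported; deciding which overlapping match to keep is a separate policy choice.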

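3) Named entity recognition by approximate dictionary matching: at run time it does not just look for exact matches, but for all matches within a fixed weighted edit distance threshold. LingPipe has a dedicated switch that controls whether this feature is active.

4) Training: pass in the location where the model will be stored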

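import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

// Tokenizer shared by training and decoding
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;

// HMM estimator backed by character language models
HmmCharLmEstimator hmmEstimator
    = new HmmCharLmEstimator(MAX_N_GRAM, NUM_CHARS, LM_INTERPOLATION);

CharLmHmmChunker chunkerEstimator
    = new CharLmHmmChunker(factory, hmmEstimator);

System.out.println("Setting up Data Parser");

// The parser feeds each training chunking to the estimator
GeneTagParser parser = new GeneTagParser();

parser.setHandler(chunkerEstimator);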

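2. Running a Statistical Named Entity Recognizer

1) First-Best Named Entity Chunking: outputs a single result, the one the program considers best

2) N-Best Named Entity Chunking: outputs multiple results, sorted by joint probability from highest to lowest

3) Confidence Named Entity Chunking: also outputs multiple results, each with a confidence value (how confident the model is that the result is correct)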
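The relationship between the three output modes can be illustrated with a toy example. This is not LingPipe's decoder: the class NBestSketch, the candidate analyses, and their probabilities are all made up, and the confidence computed here assumes the candidate list is exhaustive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class NBestSketch {
    /** A hypothetical candidate analysis with its joint probability P(analysis, input). */
    static final class Scored {
        final String analysis;
        final double jointProb;

        Scored(String analysis, double jointProb) {
            this.analysis = analysis;
            this.jointProb = jointProb;
        }
    }

    /** N-best: the n candidates with the highest joint probability, best first. */
    static List<Scored> nBest(List<Scored> candidates, int n) {
        List<Scored> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.jointProb).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** First-best: the top entry of the n-best list. */
    static Scored firstBest(List<Scored> candidates) {
        return nBest(candidates, 1).get(0);
    }

    /** Confidence: the joint probability normalized over all candidates. */
    static double confidence(Scored s, List<Scored> candidates) {
        double total = 0.0;
        for (Scored c : candidates) {
            total += c.jointProb;
        }
        return s.jointProb / total;
    }

    /** Made-up candidates for one input sentence. */
    static List<Scored> demoCandidates() {
        return Arrays.asList(
            new Scored("[John]/PERSON Smith lives in [Boston]/LOCATION", 0.03),
            new Scored("[John Smith]/PERSON lives in [Boston]/LOCATION", 0.06),
            new Scored("[John Smith]/PERSON lives in Boston", 0.01));
    }

    public static void main(String[] args) {
        List<Scored> candidates = demoCandidates();
        System.out.println("first-best: " + firstBest(candidates).analysis);
        for (Scored s : nBest(candidates, 3)) {
            System.out.println(s.jointProb + "  confidence="
                + confidence(s, candidates) + "  " + s.analysis);
        }
    }
}
```

First-best is just the head of the n-best list, while confidence turns a joint score into a conditional one so that results from different inputs become comparable.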

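II. Clustering

1. Principles

(1) Compute the edit distance between texts

(2) Represent the edit distances between texts as a binary tree (a dendrogram)

(3) Decide which clusters the texts fall into by cutting the binary tree

Method 1: set a threshold on cluster height and cut the tree so that every resulting subtree has height less than or equal to the threshold; the number of subtrees is then the number of clusters.

Method 2: starting from the tallest subtree, split repeatedly until the number of subtrees reaches a given threshold.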
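Both cutting strategies can be sketched with a small bottom-up (agglomerative) clusterer. This is a minimal illustration, not LingPipe's implementation: the class SingleLinkSketch is hypothetical, single-link distance is an assumed choice of linkage, and the demo uses numbers with absolute difference as the distance, where the setting above would plug in edit distance between texts. Stopping the merges early gives the same partition as building the full tree and cutting it from the top.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class SingleLinkSketch {

    /** Single-link distance: the smallest pairwise distance between two clusters. */
    static <E> double linkDistance(List<E> a, List<E> b, ToDoubleBiFunction<E, E> dist) {
        double best = Double.POSITIVE_INFINITY;
        for (E x : a) {
            for (E y : b) {
                best = Math.min(best, dist.applyAsDouble(x, y));
            }
        }
        return best;
    }

    /**
     * Repeatedly merges the two closest clusters. Stops when the closest pair
     * is farther apart than maxDistance (method 1) or when only targetClusters
     * clusters remain (method 2).
     */
    static <E> List<List<E>> cluster(List<E> items, ToDoubleBiFunction<E, E> dist,
                                     double maxDistance, int targetClusters) {
        List<List<E>> clusters = new ArrayList<>();
        for (E item : items) {
            List<E> single = new ArrayList<>();
            single.add(item);
            clusters.add(single);
        }
        while (clusters.size() > Math.max(1, targetClusters)) {
            int bestI = -1;
            int bestJ = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkDistance(clusters.get(i), clusters.get(j), dist);
                    if (d < best) {
                        best = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            if (bestI < 0 || best > maxDistance) {
                break;  // nothing close enough to merge
            }
            List<E> merged = clusters.remove(bestJ);  // bestJ > bestI, so bestI stays valid
            clusters.get(bestI).addAll(merged);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<Double> items = Arrays.asList(1.0, 2.0, 3.0, 10.0, 11.0, 30.0);
        ToDoubleBiFunction<Double, Double> dist = (a, b) -> Math.abs(a - b);
        // Method 1: cut by a distance (height) threshold
        System.out.println(cluster(items, dist, 2.0, 1));
        // Method 2: cut until a fixed number of clusters is left
        System.out.println(cluster(items, dist, Double.POSITIVE_INFINITY, 2));
    }
}
```

The nested loops make this cubic or worse in the number of items, which is acceptable for a sketch but not for large collections.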

2. Two important interfaces provided by LingPipe

1) Simple Disjoint Clustering

A simple clustering interface, where E is the type of the elements (here, texts) to be clustered:

interface Clusterer<E> {

    Set<Set<E>> cluster(Set<? extends E> elements);

}

2) Hierarchical Clustering and Dendrograms

The HierarchicalClusterer interface extends the Clusterer interface and adds a hierarchicalCluster method, which returns a Dendrogram. The Dendrogram is the binary-tree representation of the clustered texts.

interface HierarchicalClusterer<E> extends Clusterer<E> {

    Dendrogram<E> hierarchicalCluster(Set<? extends E> elements);

}

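III. Part-of-Speech Tagging

Principle: tagging is done with a model trained on a large annotated corpus.

Three output modes are available at test time:

First-Best Results: outputs a single result, the one with the highest probability

N-Best Results: outputs the N best results

Confidence-Based Results: outputs results together with their confidence values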

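IV. Sentiment Classification

Principle: language models, with a default n-gram order of 8. Subjective sentences are extracted first, and positive and negative sentences are then separated out of the subjective ones.

Two sets of sentiment lexicons are used: a subjective/objective lexicon and a positive/negative lexicon.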

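V. Sentence Detection

Detects all complete sentences in a document.

LingPipe provides a sentence model interface, SentenceModel, for detecting sentences: the input is text and the output is a list of sentences.

The SentenceModel interface specifies a means of doing sentence segmentation from arrays of tokens and whitespaces.

The interface provides two methods: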

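1. int[] boundaryIndices(String[] tokens,

                      String[] whitespaces)

Returns the indices of the tokens that end sentences. tokens is the array of tokens and whitespaces is the array of whitespace strings around them; both are obtained by calling TOKENIZER_FACTORY.tokenizer() when the input text is prepared.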

2. void boundaryIndices(String[] tokens,

                     String[] whitespaces,

                     int start,

                     int end,

                     Collection<Integer> indices)
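To make the contract of boundaryIndices concrete, here is a toy stand-in written in plain Java. It does not implement LingPipe's SentenceModel interface; the class ToySentenceModel and its single rule (a ".", "!" or "?" token ends a sentence if it is the last token or is followed by whitespace) are assumptions. It also assumes the whitespaces array is one element longer than tokens, with whitespaces[i + 1] holding the whitespace that follows tokens[i].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ToySentenceModel {

    /** Returns the indices of the tokens that end a sentence. */
    static int[] boundaryIndices(String[] tokens, String[] whitespaces) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            boolean stop = t.equals(".") || t.equals("!") || t.equals("?");
            boolean last = (i == tokens.length - 1);
            // whitespaces[i + 1] is the whitespace after tokens[i] (assumed layout)
            boolean followedBySpace = !last && !whitespaces[i + 1].isEmpty();
            if (stop && (last || followedBySpace)) {
                indices.add(i);
            }
        }
        int[] result = new int[indices.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = indices.get(k);
        }
        return result;
    }

    public static void main(String[] args) {
        // "Hello world. How are you?"
        String[] tokens = {"Hello", "world", ".", "How", "are", "you", "?"};
        String[] whitespaces = {"", " ", "", " ", " ", " ", "", ""};
        System.out.println(Arrays.toString(boundaryIndices(tokens, whitespaces)));
    }
}
```

The returned indices point into the token array, so a caller rebuilds each sentence by concatenating the tokens and whitespaces between consecutive boundaries.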

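VI. Spelling Correction

Corrects spelling errors, automatically inserts spaces into text that lacks them, removes automatically inserted hyphens, and replaces the ligatures produced by PDF conversion.

Principle: the noisy channel model, also called the source-channel model. It is a general-purpose model used in speech recognition, spelling correction, machine translation, Chinese word segmentation, part-of-speech tagging, pinyin-to-character conversion, and many other applications. Its form is simple, as shown in the figure below:

Algorithm flow (see http://blog.csdn.net/fkyyly/article/details/42933281):

1. A dictionary lookup readily identifies a "non-word spelling error", that is, a word that is not in the dictionary;

2. Candidate corrections are then obtained by computing the minimum edit distance and keeping the most similar words. Among these we want the candidate w with the highest probability as the final suggestion; under the noisy channel model this requires computing P(w) and P(x|w), where x is the observed misspelling.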
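The reason only P(w) and P(x|w) are needed follows from Bayes' rule. In the derivation below, x is the observed misspelling and C is the set of candidate corrections (the symbol C is introduced here for illustration):

```latex
\hat{w}
  = \arg\max_{w \in C} P(w \mid x)
  = \arg\max_{w \in C} \frac{P(x \mid w)\, P(w)}{P(x)}
  = \arg\max_{w \in C} P(x \mid w)\, P(w)
```

P(x) is the same for every candidate, so it can be dropped. P(w) is the source (language) model, and P(x|w) is the channel model, which scores how likely the intended word w is to be mistyped as x.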

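VII. String Comparison

1. Uses: computing the similarity of two strings, for tasks such as deduplication, record linkage, terminology extraction, spell checking, and k-nearest-neighbor classification

2. Principle: the smaller the distance between two strings, the more similar they are; equivalently, the higher their proximity, the more similar they are. LingPipe therefore provides two important interfaces: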

// Computes the distance between e1 and e2
public interface Distance<E> {

    public double distance(E e1, E e2);

}

// Computes the proximity of e1 and e2
public interface Proximity<E> {

    public double proximity(E e1, E e2);

}

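3. There are many kinds of string distance: Simple Edit Distance, Weighted Edit Distance, Jaccard Distance, Jaro-Winkler Distance, and TF/IDF Distance

3.1 Simple Edit Distance

Simple edit distance is also known as Levenshtein distance; the variant that additionally allows transposition is the Damerau-Levenshtein distance.

It is the minimum number of single-character edits (insertions, deletions, and substitutions, but not transpositions) needed to turn string A into string B.

For example, the Levenshtein distance between "Bobe" and "Bode" is 1.

The proximity of two strings is proximity(x,y) = -distance(x,y).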
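A minimal sketch of simple edit distance, using the standard dynamic-programming recurrence with insertion, deletion, and substitution each costing 1 and no transposition. The class name SimpleEditDistance is hypothetical; this is not LingPipe's own implementation.

```java
class SimpleEditDistance {

    /** Minimum number of insertions, deletions, and substitutions turning a into b. */
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                       // build b's prefix from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                       // delete all of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int substitute = prev[j - 1] + cost;
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev;                  // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Proximity defined as the negated distance. */
    static int proximity(CharSequence a, CharSequence b) {
        return -distance(a, b);
    }

    public static void main(String[] args) {
        System.out.println(distance("Bobe", "Bode"));
        System.out.println(proximity("Bobe", "Bode"));
    }
}
```

Because transposition is not an allowed edit here, swapping two adjacent characters ("ab" to "ba") costs 2 rather than 1.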

3.2 Weighted Edit Distance

3.3 Jaccard Distance

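E.g.:

S1: 我喜欢吃苹果 ("I like eating apples")

S2: 我喜欢吃香蕉 ("I like eating bananas")

Treating each character as a token, S1 and S2 share 4 of their 8 distinct characters, so the Jaccard proximity is 4/8 = 0.5 and Jaccard Distance(S1, S2) = 1 - 0.5 = 0.5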
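For reference, Jaccard proximity is the size of the intersection of the two token sets divided by the size of their union, and Jaccard distance is one minus that value. The sketch below applies this to the two sentences above, assuming each character is one token; the class JaccardSketch is hypothetical, and a real setup would normally obtain tokens from a tokenizer.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {

    /** Splits a string into its set of distinct characters (assumed tokenization). */
    static Set<Character> charSet(String s) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            set.add(s.charAt(i));
        }
        return set;
    }

    /** |A intersect B| / |A union B|; two empty strings count as identical. */
    static double proximity(String a, String b) {
        Set<Character> sa = charSet(a);
        Set<Character> sb = charSet(b);
        Set<Character> union = new HashSet<>(sa);
        union.addAll(sb);
        if (union.isEmpty()) {
            return 1.0;
        }
        Set<Character> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        return intersection.size() / (double) union.size();
    }

    static double distance(String a, String b) {
        return 1.0 - proximity(a, b);
    }

    public static void main(String[] args) {
        // 4 shared characters out of 8 distinct ones
        System.out.println(distance("我喜欢吃苹果", "我喜欢吃香蕉"));
    }
}
```

Unlike edit distance, this measure ignores token order and repetition entirely: only the sets of tokens matter.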

VIII. Significant Phrases (key phrase extraction)

Collocations: collocations are phrases which are seen together more than you would expect given an estimate of how frequent each token is and how often they are seen together.

For example, 'Los Angeles' has a higher score than 'Tie Breaker' because we see 'Los' 67 times, 'Angeles' 67 times and 'Los Angeles' 67 times. So 'Los' and 'Angeles' always occur within the larger phrase, which is a high correlation. On the other hand 'Tie' occurs 15 times, 'Breaker' 8 times and 'Tie Breaker' 8 times, so 'Tie' occurs within the larger phrase only about half the time, which is less of a correlation.
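The counts in this example can be turned into a quick check: for each token, what fraction of its occurrences fall inside the phrase? This ratio is only a simplified illustration of the intuition and is not claimed to be the statistic LingPipe computes; the class PhraseCohesion is hypothetical.

```java
class PhraseCohesion {

    /** Fraction of a token's occurrences that fall inside the phrase. */
    static double fractionInPhrase(int phraseCount, int tokenCount) {
        return (double) phraseCount / tokenCount;
    }

    public static void main(String[] args) {
        // 'Los' 67, 'Angeles' 67, 'Los Angeles' 67
        System.out.println("Los:     " + fractionInPhrase(67, 67));
        System.out.println("Angeles: " + fractionInPhrase(67, 67));
        // 'Tie' 15, 'Breaker' 8, 'Tie Breaker' 8
        System.out.println("Tie:     " + fractionInPhrase(8, 15));
        System.out.println("Breaker: " + fractionInPhrase(8, 8));
    }
}
```

Both tokens of 'Los Angeles' appear only inside the phrase, while 'Tie' appears inside 'Tie Breaker' in 8 of its 15 occurrences, which matches the weaker correlation described above.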

New phrase discovery: finding phrases that occur significantly more often in the foreground corpus than they would be expected to from the background corpus

IX. Database Text Mining

This part retrieves data from a database as needed. It consists of three steps and involves no algorithms, only database operations.

- Loading MEDLINE data into the database, using the LingMed MEDLINE parser and the JDBC API to access a RDBMS.

- Using the LingPipe API to annotate text data in the database, and to store the annotations back into the database.

- SQL database queries over the annotated data.

X. Hyphenation & Syllabification Tutorial

Two approaches: rule-based and noisy-channel-model-based

XII. Language Identification

This is a preprocessing step. Language identification is the problem of classifying a sample of characters based on its language.

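XIII. Word Sense Disambiguation

Determines the meaning of an ambiguous word in its context.

In the supervised setting, word sense disambiguation becomes a classification problem, and it relies on additional resources such as ontologies and dictionaries.

In the unsupervised setting, word sense disambiguation becomes a clustering problem, which is where hierarchical clustering comes into play.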

XIV. Singular Value Decomposition

...... still under investigation, to be updated
