So we can configure the dependencies directly in Gradle. For each language, the matching model package is selected via a classifier. The `models` artifact is the base for the other language models and handles English by default, so it must always be included. Since we want to process Chinese, we also need `models-chinese`.
However, the `models` and `models-chinese` jars are large and slow to download (readers confident in their bandwidth can ignore this "however"), so I downloaded them separately with a download manager and pulled them in as local files instead.
// Apply the java plugin to add support for Java
apply plugin: 'java'

// In this section you declare where to find the dependencies of your project
repositories {
    // Use 'jcenter' for resolving your dependencies.
    // You can declare any Maven/Ivy/file repository here.
    maven {
        url "http://maven.aliyun.com/nexus/content/groups/public"
    }
    jcenter()
}

// In this section you declare the dependencies for your production and test code
dependencies {
    // https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp
    compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0'
    compile files('lib/stanford-corenlp-3.8.0-models.jar')
    compile files('lib/stanford-chinese-corenlp-2017-06-09-models.jar')
    //compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0', classifier: 'models'
    //compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0', classifier: 'models-chinese'
    testCompile 'junit:junit:4.12'
}
The next step is to instantiate StanfordCoreNLP with a configuration and then process (annotate) the text. The models-chinese package ships a configuration file for Chinese, StanfordCoreNLP-chinese.properties. Its name could be passed directly when constructing StanfordCoreNLP; here it is loaded into a Properties object first, so the configuration is easy to modify.
For the full source, see the official demo StanfordCoreNlpDemo.java; the code below only makes a few changes for Chinese processing. Chinese processing needs a fair amount of memory, so set the JVM options: -Xms512M -Xmx4096M
// Load the Chinese configuration shipped in the models-chinese jar
Properties props = new Properties();
props.load(StanfordCoreNlpDemo.class.getClassLoader().getResourceAsStream("StanfordCoreNLP-chinese.properties"));
// To override the annotator list, uncomment and edit:
//props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Default (English) pipeline, for comparison:
//StanfordCoreNLP pipeline = new StanfordCoreNLP();

// Initialize an Annotation with some text to be annotated. The text is the argument to the constructor.
Annotation annotation;
if (args.length > 0) {
    annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
} else {
    annotation = new Annotation("循环经济是人类社会发展的必然选择,包装废弃物资源化是循环经济的要求。"
            + "包装废弃物资源化是一项系统工程,应从企业、区域和社会三个层面上进行,"
            + "因此,产生了三种包装废弃物资源化模式,即基于清洁生产、生态工业园区和基于社会层面的包装废弃物资源化模式。");
}

// run all the selected Annotators on this text
pipeline.annotate(annotation);
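If you run the demo through Gradle, the JVM memory options mentioned above can be set in the build script rather than on the command line. A minimal sketch, assuming the `application` plugin is used and the main class is the demo class (adjust `mainClassName` to your setup):

```groovy
// build.gradle — sketch; assumes the 'application' plugin runs the demo
apply plugin: 'application'

mainClassName = 'StanfordCoreNlpDemo'  // hypothetical; use your actual main class

// Chinese models need a large heap; mirrors -Xms512M -Xmx4096M
applicationDefaultJvmArgs = ['-Xms512M', '-Xmx4096M']
```

With this in place, `gradle run` starts the pipeline with the enlarged heap.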
The individual components (annotators) of StanfordCoreNLP are the annotators we configured in the properties file. They have dependencies on one another: https://stanfordnlp.github.io/CoreNLP/dependencies.html.
tokenize (Tokenization)
ssplit (Sentence Splitting)
pos (Part-of-Speech Tagging)
lemma (Lemmatization)
ner (Named Entity Recognition)
parse (Constituency Parsing)
depparse (Dependency Parsing)
dcoref (Coreference Resolution)
natlog (Natural Logic Polarity)
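Because of these dependencies, an annotator only works if everything it requires appears before it in the list. For example, a trimmed-down configuration that stops at coreference might look like the fragment below (illustrative only; the real StanfordCoreNLP-chinese.properties in the models jar also sets Chinese-specific model paths):

```properties
# Order matters: pos needs tokenize+ssplit, lemma needs pos,
# ner needs lemma, and dcoref needs parse and ner.
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
```

Dropping `parse` from this list, for instance, would cause `dcoref` to fail at pipeline construction time because its dependency is missing.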