Basic Chinese text processing tasks with jieba, HanLP, and Stanza

鲁英卫

2023-12-01

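I needed to get the basic NLP tasks running for a course, so this post records the installation and usage steps, the official documentation I consulted, and the pitfalls I ran into.

For English text, see my other post: Basic English text processing tasks with NLTK + StanfordCoreNLP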

I. jieba

Official jieba documentation: https://github.com/fxsjy/jieba

Tasks covered:

Word segmentation
Custom dictionary
Stop words
Keyword extraction
Part-of-speech tagging

import jieba
import collections  # container types such as Counter
import numpy  # numerical computing
import pandas as pd  # tabular data

1. Chinese and English word segmentation

## Load the corpus
with open("./ch.txt", "r", encoding="utf-8") as f:
    kong = f.read()
chtxt = '我是刘伶俐。今天真是个好天气，一起去南京大学信息管理学院上课吧！'
entxt = 'Hello it\'s Marshall Lee. Welcome to this wonderful land. Hope you can find what you want.'

## Segment in accurate mode (the default)
res = jieba.cut(chtxt)
res2 = jieba.cut(entxt)
print("Chinese segmentation (accurate mode, no stop words or custom dictionary):\n" + str(list(res)))
print("English segmentation (no stop words or custom dictionary):\n" + str(list(res2)))

Chinese segmentation (accurate mode, no stop words or custom dictionary):
['我', '是', '刘伶俐', '。', '今天', '真是', '个', '好', '天气', '，', '一起', '去', '南京大学', '信息管理学院', '上课', '吧', '！']
English segmentation (no stop words or custom dictionary):
['Hello', ' ', 'it', "'", 's', ' ', 'Marshall', ' ', 'Lee', '.', ' ', 'Welcome', ' ', 'to', ' ', 'this', ' ', 'wonderful', ' ', 'land', '.', ' ', 'Hope', ' ', 'you', ' ', 'can', ' ', 'find', ' ', 'what', ' ', 'you', ' ', 'want', '.']

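This step is meant to show the baseline without stop words or a custom dictionary, where jieba splits 刘伶俐 into 刘/伶俐 and 信息管理学院 into 信息管理/学院. Note that the output captured above already shows both as single tokens, which suggests it was recorded in a session where the custom dictionary from the next step had already been loaded.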

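jieba also supports parallel segmentation to speed up processing.
Call jieba.enable_parallel([number of processes]) to turn on parallel mode. According to the jieba README, parallel mode is built on multiprocessing and is not supported on Windows.
jieba's paddle mode (based on a deep learning model) can improve performance, but in my view the gain is not worth the cost; see https://blog.csdn.net/learn_forlife/article/details/109485780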

2. Loading a custom dictionary

# 2. Custom dictionary
## Read the custom word list, which defines "刘伶俐" and "信息管理学院"
with open("./tools/userdict.txt", "r", encoding="utf-8") as f:
    userwords = f.read().split("\n")
for wd in userwords:
    if wd.strip():  # skip blank lines
        jieba.add_word(wd.strip())

res = jieba.cut(chtxt)
print("【Chinese segmentation (+ custom dictionary)】\n" + str(list(res)))

【Chinese segmentation (+ custom dictionary)】
['我', '是', '刘伶俐', '。', '今天', '真是', '个', '好', '天气', '，', '一起', '去', '南京大学', '信息管理学院', '上课', '吧', '！']

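3. Stop words

jieba.cut has no built-in stop-word filtering (jieba.analyse.set_stop_words only applies to keyword extraction), so the filtering has to be written by hand.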

## 3.1. Read the custom stop-word list
with open("./tools/hit_stopwords.txt", "r", encoding="utf-8") as f:
    stopwords = f.read().split("\n")

## 3.2. Segment, then drop the stop words
res = jieba.cut(chtxt)
res2 = []
for wd in res:
    if wd not in stopwords:
        res2.append(wd)

print("【Segmentation (+ custom dictionary + stop words)】\n" + str(res2))

【Segmentation (+ custom dictionary + stop words)】
['刘伶俐', '今天', '真是', '好', '天气', '一起', '去', '南京大学', '信息管理学院', '上课']
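
Membership tests against a list are linear in its length, so for a long stop-word file it may be worth converting the list to a set first. A minimal sketch of that idea follows; the helper name `filter_stopwords` and the tiny stop-word list are made up for illustration (the real list would come from hit_stopwords.txt), and the tokens are copied from the segmentation output above.

```python
def filter_stopwords(tokens, stopwords):
    """Return the tokens that are not stop words, preserving order."""
    stopset = set(stopwords)  # constant-time membership tests
    return [tok for tok in tokens if tok not in stopset]


# Tokens copied from the segmentation output above
tokens = ['我', '是', '刘伶俐', '。', '今天', '真是', '个', '好', '天气', '，',
          '一起', '去', '南京大学', '信息管理学院', '上课', '吧', '！']
# Hypothetical stop-word list standing in for the contents of hit_stopwords.txt
demo_stopwords = ['我', '是', '个', '吧', '。', '，', '！']

filtered = filter_stopwords(tokens, demo_stopwords)
print(filtered)
```

The same function also removes the whitespace tokens that jieba emits for English text if `' '` is added to the stop-word list.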

4. Keyword extraction

jieba provides keyword extraction based on the TF-IDF algorithm (the default) and the TextRank algorithm.

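import jieba.analyse

## Extract keywords from the 20th National Congress text with TF-IDF
with open('./20th.txt', 'r', encoding='utf-8') as f:
    news = f.read()

### Show the 5 keywords with the largest weights
print('【Keywords of the 20th National Congress text】')
print(jieba.analyse.extract_tags(news, topK=5))

【Keywords of the 20th National Congress text】
['亚洲', '和平', '发展', '安全', '坚持']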

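The highest-weighted words are the recurring themes of the text. (In a novel, by comparison, character names tend to receive the largest TF-IDF weights.)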
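
To make the weighting concrete, here is a minimal sketch of the TF-IDF idea: a word's frequency within the document multiplied by an inverse-document-frequency value. The function name `tfidf_keywords`, the `idf` table, and the `default_idf` fallback are assumptions for illustration only. This is not jieba's implementation, which ships its own IDF table.

```python
from collections import Counter


def tfidf_keywords(tokens, idf, default_idf=1.0, top_k=5):
    """Rank tokens by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    weights = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Highest weight first; ties are broken by the word itself for a stable order
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Hypothetical token list and IDF values, chosen only to show the effect
tokens = ['和平', '发展', '和平', '的', '的', '的', '安全']
idf = {'和平': 3.0, '发展': 2.5, '安全': 2.0, '的': 0.1}

top2 = tfidf_keywords(tokens, idf, top_k=2)
print(top2)
```

In this toy input the most frequent token (的) ends up last because its IDF is low, which is why very common function words rarely surface as keywords.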

5. Part-of-speech tagging

For jieba's part-of-speech tag table, see https://www.jianshu.com/p/dbaa841fe580/

import jieba.posseg

for word, flag in jieba.posseg.cut(chtxt):
    print('(%s/%s)' % (word, flag),end=',')

(我/r),(是/v),(刘伶俐/x),(。/x),(今天/t),(真是/d),(个/m),(好/a),(天气/n),(，/x),(一起/m),(去/v),(南京大学/nt),(信息管理学院/x),(上课/v),(吧/y),(！/x),

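II. HanLP

Make sure your VPN/proxy is turned off the whole time you use it!

HanLP offers a very convenient interface for NLP: calling HanLP('text', tasks='task name') is enough to run a task. The available task names are listed in the official documentation on GitHub.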

Install HanLP: pip install hanlp_restful

HanLP official website

Official documentation on GitHub

Tasks covered:

Named entity recognition
Semantic dependency parsing

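1. Initializing HanLP

Because server capacity is limited, anonymous users are restricted to 2 calls per minute. If you need more, you can apply for a free public-interest API key (auth) in the HanLP 2.1 RESTful API open-source community.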

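# Create the client with the server address and API key:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')  # auth: your API key, or None for anonymous access; language: zh for Chinese, mul for multilingual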
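
With the anonymous quota of 2 calls per minute, a batch job has to pace its requests. The sketch below is a generic client-side throttle, not part of hanlp_restful; the class name `Throttle` and its parameters are made up for illustration, and the clock and sleep functions are injectable so that the logic can be checked without actually waiting.

```python
import time
from collections import deque


class Throttle:
    """Allow at most max_calls calls within any sliding window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()  # times of the most recent calls

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window
        while self.stamps and now - self.stamps[0] >= self.period:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window
            self.sleep(self.period - (now - self.stamps[0]))
            now = self.clock()
            self.stamps.popleft()
        self.stamps.append(now)


# Simulated clock so that the demo runs instantly
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(max_calls=2, period=60, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # in real use, the HanLP(...) request would follow here
print(slept)
```

In real use one would create `Throttle(2, 60)` with the default clock and call `throttle.wait()` before each request; the third call in the demo is the first one that has to wait.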

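2. Named entity recognition

HanLP provides NER models trained on the PKU, MSRA, and OntoNotes corpora. For each sentence it returns a list of 4-element entries of the form [entity, type label, start offset, end offset].

print(HanLP(chtxt, tasks='ner*'))  # 'ner*' runs the NER models for all of the corpora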

{
  "tok/fine": [
    ["我", "是", "刘伶俐", "。"],
    ["今天", "真是", "个", "好", "天气", "，", "一起", "去", "南京", "大学", "信息", "管理", "学院", "上课", "吧", "！"]
  ],
  "ner/msra": [
    [["刘伶俐", "PERSON", 2, 3]],
    [["今天", "DATE", 0, 1], ["南京大学信息管理学院", "ORGANIZATION", 8, 13]]
  ],
  "ner/pku": [
    [["刘伶俐", "nr", 2, 3]],
    [["南京大学信息管理学院", "nt", 8, 13]]
  ],
  "ner/ontonotes": [
    [["刘伶俐", "PERSON", 2, 3]],
    [["今天", "DATE", 0, 1], ["南京大学信息管理学院", "ORG", 8, 13]]
  ]
}

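For this text, the models based on MSRA and OntoNotes perform better: both also tag 今天 as a DATE, which the PKU model does not report.

3. Semantic dependency parsing

The return value is a Document.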
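
Judging from the output above, the start and end offsets index into the tokens of the same sentence, with the end offset exclusive. The small sketch below checks this against the ner/msra part of that output, pasted in as a plain dict; `entity_spans` is a hypothetical helper name, not a HanLP function.

```python
def entity_spans(tokens_by_sentence, ner_by_sentence):
    """Return (entity text rebuilt from tokens, label) for every entity."""
    spans = []
    for tokens, entities in zip(tokens_by_sentence, ner_by_sentence):
        for text, label, start, end in entities:
            rebuilt = ''.join(tokens[start:end])  # the end offset is exclusive
            if rebuilt != text:
                raise ValueError(f'offsets do not match entity text: {text}')
            spans.append((rebuilt, label))
    return spans


# Copied from the HanLP output above (tokens and the MSRA entities only)
result = {
    "tok/fine": [
        ["我", "是", "刘伶俐", "。"],
        ["今天", "真是", "个", "好", "天气", "，", "一起", "去", "南京", "大学",
         "信息", "管理", "学院", "上课", "吧", "！"],
    ],
    "ner/msra": [
        [["刘伶俐", "PERSON", 2, 3]],
        [["今天", "DATE", 0, 1], ["南京大学信息管理学院", "ORGANIZATION", 8, 13]],
    ],
}

spans = entity_spans(result["tok/fine"], result["ner/msra"])
print(spans)
```

Joining tokens 8 through 12 of the second sentence reproduces 南京大学信息管理学院, so the entity covers five fine-grained tokens even though the tokenizer split the name.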

doc = HanLP(chtxt, tasks='sdp')
# doc is a Document; to_conll() renders it in CoNLL format
print('【HanLP semantic dependency parsing result】\n %s' % doc.to_conll())
# TODO: how can this be visualized?

【HanLP semantic dependency parsing result】
 1	我	_	_	_	_	_	_	2:Exp	_
2	是	_	_	_	_	_	_	0:Root	_
3	刘伶俐	_	_	_	_	_	_	2:Clas	_
4	。	_	_	_	_	_	_	2:mPunc	_

1	今天	_	_	_	_	_	_	2:Exp	_
2	真是	_	_	_	_	_	_	0:Root	_
3	个	_	_	_	_	_	_	5:Qp	_
4	好	_	_	_	_	_	_	5:Desc	_
5	天气	_	_	_	_	_	_	2:Clas	_
6	，	_	_	_	_	_	_	2:mPunc	_
7	一起	_	_	_	_	_	_	8:Mann	_
8	去	_	_	_	_	_	_	2:eSucc	_
9	南京	_	_	_	_	_	_	10:Nmod	_
10	大学	_	_	_	_	_	_	13:Poss	_
11	信息	_	_	_	_	_	_	12:Sco	_
12	管理	_	_	_	_	_	_	13:Nmod	_
13	学院	_	_	_	_	_	_	8:Lfin	_
14	上课	_	_	_	_	_	_	8:ePurp	_
15	吧	_	_	_	_	_	_	14:mTone	_
16	！	_	_	_	_	_	_	14:mPunc	_
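
Until a proper visualization is in place, the CoNLL text can at least be turned into explicit edges. The sketch below assumes the tab-separated layout shown above, where the first column is the token id, the second is the word, and the ninth holds head:relation. `parse_sdp` is a hypothetical helper, and the handling of several heads per token (separated by `|`, as in CoNLL-U) is an assumption that the output above does not exercise.

```python
def parse_sdp(conll_text):
    """Return one list of (dependent, relation, head) triples per sentence."""
    sentences = []
    for block in conll_text.strip().split('\n\n'):
        rows = [line.strip().split('\t') for line in block.splitlines() if line.strip()]
        words = {int(row[0]): row[1] for row in rows}
        words[0] = 'ROOT'  # head id 0 denotes the artificial root
        edges = []
        for row in rows:
            # Column 9 holds "head:relation"; several heads are assumed
            # to be separated by "|"
            for dep in row[8].split('|'):
                head, rel = dep.split(':', 1)
                edges.append((row[1], rel, words[int(head)]))
        sentences.append(edges)
    return sentences


# First sentence of the output above
sample = (
    "1\t我\t_\t_\t_\t_\t_\t_\t2:Exp\t_\n"
    "2\t是\t_\t_\t_\t_\t_\t_\t0:Root\t_\n"
    "3\t刘伶俐\t_\t_\t_\t_\t_\t_\t2:Clas\t_\n"
    "4\t。\t_\t_\t_\t_\t_\t_\t2:mPunc\t_\n"
)
edges = parse_sdp(sample)[0]
for dependent, rel, head in edges:
    print(f'{dependent} --{rel}--> {head}')
```

The resulting triples could then be fed to any graph-drawing tool of choice.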

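III. Stanza (Official StanfordNLP for python)

There are many ways to call CoreNLP from Python. For the English text processing I used the server interface provided by nltk, but having to start a server port every time is a hassle, so here I tried calling it directly from local code.

~~Using stanfordcorenlp (see StanfordNLP#Python) requires downloading the stanford-corenlp-4.5.1-models-chinese.jar package (1.5G) from the official site~~

...I ran into too many pitfalls with the stanfordcorenlp package (by Lynten Guo), so I switched to the officially recommended Stanza library.

Stanza is the Python library developed by the Stanford NLP Group; see the official Stanza documentation and the Stanza installation and usage tutorial.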

import stanza
# stanza.download('zh') # download the Chinese models

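In mainland China, stanza.download() fails with an error when downloading the resource package, and using a VPN did not help. I ended up manually downloading resources.json and the zh-hans model from the official site (make sure the version matches the one in the error message), then unpacking them, as indicated by the error message, into the .../stanza_resources/zh-hans/ directory.

After that, the zh_nlp pipeline built below can be used for all kinds of NLP tasks.

Usage: specify the tasks (processors) you need when constructing stanza.Pipeline; given an input text, Stanza packs all the analysis results into a Document.

For the format of Document, see: https://stanfordnlp.github.io/stanza/data_objects.html#document

Or simply copy the examples from the official site: https://stanfordnlp.github.io/stanza/neural_pipeline.html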

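# Preload the models
zh_nlp = stanza.Pipeline(
    'zh-hans',
    verbose=False,  # suppress the output printed while the models load
    download_method=None,  # the resources were downloaded manually, so do not let Stanza go online to update them, or the same error appears again
    processors='tokenize,pos,lemma,depparse,ner'  # processors to load (if unspecified, all default processors are loaded)
)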

doc = zh_nlp('懒惰的熊猫在冬天的草丛中拍打泥巴，远处的李明正在沐浴着朝阳歌唱。')
print('【Tokenization】')
print(*[f'id: {token.id}\ttext: {token.text}' for token in doc.sentences[0].tokens], sep='\n')
print('【Named entity recognition】')
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')
print('【Dependency parsing】')
print(*[f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sent.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}' for sent in doc.sentences for word in sent.words], sep='\n')

【Tokenization】
id: (1,)	text: 懒惰
id: (2,)	text: 的
id: (3,)	text: 熊猫
id: (4,)	text: 在
id: (5,)	text: 冬天
id: (6,)	text: 的
id: (7,)	text: 草丛
id: (8,)	text: 中
id: (9,)	text: 拍打
id: (10,)	text: 泥巴
id: (11,)	text: ，
id: (12,)	text: 远处
id: (13,)	text: 的
id: (14,)	text: 李
id: (15,)	text: 明正
id: (16,)	text: 在
id: (17,)	text: 沐
id: (18,)	text: 浴
id: (19,)	text: 着
id: (20,)	text: 朝阳
id: (21,)	text: 歌唱
id: (22,)	text: 。
【Named entity recognition】
entity: 冬天	type: DATE
entity: 李	type: PERSON
【Dependency parsing】
id: 1	word: 懒惰	head id: 3	head: 熊猫	deprel: amod
id: 2	word: 的	head id: 1	head: 懒惰	deprel: mark:rel
id: 3	word: 熊猫	head id: 21	head: 歌唱	deprel: nsubj
id: 4	word: 在	head id: 7	head: 草丛	deprel: case
id: 5	word: 冬天	head id: 7	head: 草丛	deprel: nmod
id: 6	word: 的	head id: 5	head: 冬天	deprel: case
id: 7	word: 草丛	head id: 9	head: 拍打	deprel: obl
id: 8	word: 中	head id: 7	head: 草丛	deprel: acl
id: 9	word: 拍打	head id: 21	head: 歌唱	deprel: advcl
id: 10	word: 泥巴	head id: 9	head: 拍打	deprel: obj
id: 11	word: ，	head id: 21	head: 歌唱	deprel: punct
id: 12	word: 远处	head id: 14	head: 李	deprel: acl:relcl
id: 13	word: 的	head id: 12	head: 远处	deprel: mark:rel
id: 14	word: 李	head id: 21	head: 歌唱	deprel: nsubj
id: 15	word: 明正	head id: 14	head: 李	deprel: flat:name
id: 16	word: 在	head id: 21	head: 歌唱	deprel: advcl
id: 17	word: 沐	head id: 16	head: 在	deprel: obj
id: 18	word: 浴	head id: 21	head: 歌唱	deprel: advcl
id: 19	word: 着	head id: 18	head: 浴	deprel: aux
id: 20	word: 朝阳	head id: 16	head: 在	deprel: obj
id: 21	word: 歌唱	head id: 0	head: root	deprel: root
id: 22	word: 。	head id: 21	head: 歌唱	deprel: punct

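As the output shows, Stanza's Chinese word segmentation and NER are not ideal here: 李明 is split into 李 / 明正, 沐浴 is split into 沐 / 浴, and only 李 is recognized as a PERSON.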

In addition, none of the tools above offers a convenient visualization method; visualizing dependency trees is next on my TODO list.
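
As a stopgap for that TODO, a dependency tree can be printed as indented text using nothing but the head indices. The sketch below works on plain (id, word, head, deprel) tuples copied from the Stanza output above; `render_tree` is a hypothetical helper, not a Stanza API.

```python
from collections import defaultdict


def render_tree(rows):
    """Render (id, word, head, deprel) rows as an indented tree, one node per line."""
    children = defaultdict(list)
    for idx, word, head, deprel in rows:
        children[head].append((idx, word, deprel))

    lines = []

    def visit(node_id, depth):
        for idx, word, deprel in children[node_id]:
            lines.append('  ' * depth + f'{word} ({deprel})')
            visit(idx, depth + 1)

    visit(0, 0)  # head id 0 marks the root
    return '\n'.join(lines)


# Copied from the dependency parsing output above
rows = [
    (1, '懒惰', 3, 'amod'), (2, '的', 1, 'mark:rel'), (3, '熊猫', 21, 'nsubj'),
    (4, '在', 7, 'case'), (5, '冬天', 7, 'nmod'), (6, '的', 5, 'case'),
    (7, '草丛', 9, 'obl'), (8, '中', 7, 'acl'), (9, '拍打', 21, 'advcl'),
    (10, '泥巴', 9, 'obj'), (11, '，', 21, 'punct'), (12, '远处', 14, 'acl:relcl'),
    (13, '的', 12, 'mark:rel'), (14, '李', 21, 'nsubj'), (15, '明正', 14, 'flat:name'),
    (16, '在', 21, 'advcl'), (17, '沐', 16, 'obj'), (18, '浴', 21, 'advcl'),
    (19, '着', 18, 'aux'), (20, '朝阳', 16, 'obj'), (21, '歌唱', 0, 'root'),
    (22, '。', 21, 'punct'),
]

tree = render_tree(rows)
print(tree)
```

For a live Stanza result, the rows could be built with `[(w.id, w.text, w.head, w.deprel) for w in sent.words]`, using the same attributes as the printing code above.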