Notes on FoolNLTK, a Chinese NLP Toolkit

张丰

2023-12-01

1. Introduction

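Possibly not the fastest open-source Chinese word segmenter, but quite likely the most accurate one
Trained on a BiLSTM model
Covers word segmentation, part-of-speech tagging, and named entity recognition, all with fairly high accuracy
Supports user-defined dictionaries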

2. Installation

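Before installing foolnltk, TensorFlow must already be installed, and it must be a 1.x release (earlier than 2.0). If a 2.x version is installed, uninstall it first and then install a 1.x version. I am using Python 3.x; the details are as follows: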

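TensorFlow 2.0 and later no longer ship tensorflow.contrib.
Downgrading to a 1.x release resolves the problem.
Uninstall TensorFlow:
pip uninstall tensorflow
Install TensorFlow 1.14:
pip install tensorflow==1.14.0
There are other ways around this, but they are more cumbersome, so I went with the simplest and most direct one.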
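
To confirm which TensorFlow series is installed before trying foolnltk, a small check like the one below may help. It is only a sketch: the helper names `is_tf1_version` and `check_installed_tensorflow` are made up for this post, and the check looks at nothing but the major version number.

```python
def is_tf1_version(version: str) -> bool:
    """Return True if a TensorFlow version string belongs to the 1.x series."""
    major = version.strip().split(".")[0]
    return major.isdigit() and int(major) == 1


def check_installed_tensorflow() -> str:
    """Describe the installed TensorFlow, if any (hypothetical helper)."""
    try:
        # Imported lazily so is_tf1_version() can be used without TensorFlow.
        import tensorflow as tf
    except ImportError:
        return "tensorflow is not installed"
    if is_tf1_version(tf.__version__):
        return "tensorflow %s is a 1.x release" % tf.__version__
    return "tensorflow %s is not 1.x, so tensorflow.contrib is missing" % tf.__version__
```

Calling `print(check_installed_tensorflow())` after the pip commands above should report a 1.x release if the downgrade worked.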

3. Usage

Word segmentation

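import warnings
warnings.filterwarnings("ignore")  # set before importing fool so import-time warnings are not shown in red
import fool

fool.cut("小王在北京")  # returns a list of lists; note that jieba's cut returns a lazy generator instead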
>>>>
[['小王', '在', '北京']]
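
The user-defined dictionary mentioned in the introduction plugs into segmentation. The sketch below assumes the interface described in the FoolNLTK README as I remember it: a text file with one `word weight` pair per line, loaded with `fool.load_userdict(path)` and removed with `fool.delete_userdict()`. Check this against your installed version. The helper names, the words, and the weights are made up for illustration.

```python
import os
import tempfile
from typing import Dict


def write_user_dict(entries: Dict[str, int], path: str) -> None:
    """Write one 'word weight' pair per line (assumed dictionary format)."""
    with open(path, "w", encoding="utf-8") as handle:
        for word, weight in entries.items():
            handle.write("%s %d\n" % (word, weight))


def cut_with_user_dict(text: str, dict_path: str):
    """Segment text with the user dictionary loaded, then unload it again."""
    import fool  # imported here so write_user_dict() works without FoolNLTK

    fool.load_userdict(dict_path)
    try:
        return fool.cut(text)
    finally:
        fool.delete_userdict()


# Build a small dictionary file; the entries are illustrative only.
dict_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
write_user_dict({"北京天安门": 10, "分词工具": 10}, dict_path)
```

With FoolNLTK installed, `cut_with_user_dict("我在北京天安门", dict_path)` can be compared against a plain `fool.cut` call to see whether the dictionary changes the segmentation.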

Part-of-speech tagging

Passing a single sentence:
fool.pos_cut("小王在北京")  # list of list of tuple
>>>> each tuple: (word, POS tag)
[[('小王', 'nr'), ('在', 'p'), ('北京', 'ns')]]


Passing multiple sentences:
# wrap them in a list: ["sentence 1", "sentence 2", "sentence 3", ...]
fool.pos_cut(["小王在北京", "小李在吃炸鸡"])
# each sentence yields its own list
# list of list of tuple
>>>>
[[('小王', 'nr'), ('在', 'p'), ('北京', 'ns')],
 [('小李', 'nr'), ('在', 'p'), ('吃', 'v'), ('炸鸡', 'n')]]
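
Because the result is a plain list of lists of tuples, it can be filtered with ordinary Python. The sketch below picks out words whose tag starts with a given prefix. The helper `words_with_tag_prefix` is hypothetical, and the sample data is copied from the output above so the snippet runs without FoolNLTK.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]


def words_with_tag_prefix(tagged: List[TaggedSentence], prefix: str) -> List[List[str]]:
    """Keep, per sentence, the words whose POS tag starts with prefix."""
    return [[word for word, tag in sentence if tag.startswith(prefix)]
            for sentence in tagged]


# Output of fool.pos_cut(["小王在北京", "小李在吃炸鸡"]) as shown above.
sample = [
    [('小王', 'nr'), ('在', 'p'), ('北京', 'ns')],
    [('小李', 'nr'), ('在', 'p'), ('吃', 'v'), ('炸鸡', 'n')],
]

nouns = words_with_tag_prefix(sample, "n")
print(nouns)  # [['小王', '北京'], ['小李', '炸鸡']]
```

Matching on the prefix "n" picks up 'n', 'nr' and 'ns' together. Pass the full tag to be stricter.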

Named entity recognition

text = ["小王在北京","祖国天安门你好啊"]
words, ners = fool.analysis(text)
warnings.filterwarnings("ignore")
print("words: ",words)
print("ners: ",ners)

words: list of list of tuple (word, POS tag)
ners: list of list of tuple (start offset, end offset, entity type, entity text)
# always remember: each sentence yields its own list
>>>>
words:  [[('小王', 'nr'), ('在', 'p'), ('北京', 'ns')], [('祖国', 'n'), ('天安门', 'ns'), ('你', 'r'), ('好', 'a'), ('啊', 'y')]]

ners:  [[(0, 2, 'person', '小王'), (3, 5, 'location', '北京')], [(2, 5, 'location', '天安门')]]
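
The entity tuples are easy to post-process. The sketch below groups entity texts by type and checks that the offsets work as Python slice bounds on the input sentences. That holds for the sample output above, but whether it holds for every FoolNLTK version is an assumption. Both helper names are hypothetical, and the data is copied from the output above so the snippet runs without FoolNLTK.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str, str]  # (start, end, type, text)


def group_entities(ners: List[List[Entity]]) -> Dict[str, List[str]]:
    """Collect entity texts by entity type across all sentences."""
    grouped = defaultdict(list)
    for sentence_entities in ners:
        for _start, _end, label, name in sentence_entities:
            grouped[label].append(name)
    return dict(grouped)


def offsets_match(sentences: List[str], ners: List[List[Entity]]) -> bool:
    """Check that sentence[start:end] equals the entity text for every entity."""
    return all(sentence[start:end] == name
               for sentence, entities in zip(sentences, ners)
               for start, end, _label, name in entities)


text = ["小王在北京", "祖国天安门你好啊"]
ners = [[(0, 2, 'person', '小王'), (3, 5, 'location', '北京')],
        [(2, 5, 'location', '天安门')]]

print(group_entities(ners))       # {'person': ['小王'], 'location': ['北京', '天安门']}
print(offsets_match(text, ners))  # True
```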
