The Keras Tokenizer

袁俊弼

2023-12-01

The Tokenizer class

keras.preprocessing.text.Tokenizer(
                   num_words=None, 
                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                   lower=True, 
                   split=' ', 
                   char_level=False, 
                   oov_token=None, 
                   document_count=0)

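A text tokenization utility class. It vectorizes a text corpus by turning each text either into a sequence of integers (each integer is the index of a token in a dictionary) or into a vector in which the coefficient for each token can be binary, based on word count, based on tf-idf, and so on.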

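Arguments:

num_words: the maximum number of words to keep, based on word frequency. Only the num_words-1 most frequent words are kept.
filters: a string made up of the characters to filter out of the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
lower: boolean. Whether to convert the texts to lowercase.
split: string. The separator used to split words.
char_level: boolean. If True, every character is treated as a token.
oov_token: if given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.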
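To make the interaction between num_words and oov_token concrete, here is a minimal pure-Python sketch of the lookup step performed when texts are turned into sequences. It is an illustration, not the library code: the helper name to_sequence and the toy vocabularies are made up for this example, and it assumes that, when an OOV token is configured, that token itself occupies index 1 of word_index.

```python
def to_sequence(words, word_index, num_words=None, oov_token=None):
    """Hypothetical helper: map words to indices, honoring num_words and oov_token."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    seq = []
    for w in words:
        i = word_index.get(w)
        if i is not None and (num_words is None or i < num_words):
            seq.append(i)          # known word whose index is below num_words
        elif oov_index is not None:
            seq.append(oov_index)  # unknown or cut-off word becomes the OOV index
        # otherwise the word is silently dropped
    return seq


# Toy vocabulary (assumed data); indices are ordered by descending frequency.
plain_index = {"is": 1, "worth": 2, "doing": 3, "whatever": 4}
words = ["whatever", "is", "worth", "reading"]

no_limit = to_sequence(words, plain_index)               # "reading" is dropped
limited = to_sequence(words, plain_index, num_words=3)   # only indices 1 and 2 survive

# With an OOV token, the token takes index 1 (assumption stated above).
oov_vocab = {"<OOV>": 1, "is": 2, "worth": 3, "doing": 4, "whatever": 5}
with_oov = to_sequence(words, oov_vocab, num_words=4, oov_token="<OOV>")

print(no_limit, limited, with_oov)
```

Note that num_words=3 keeps only two words here, which is the "num_words-1" behavior described in the argument list.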

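Attributes:

document_count: integer. The number of documents processed.
word_index: dict mapping each word to its index.
index_word: dict mapping each index to its word.
word_counts: dict mapping each word to the total number of times it occurs.
word_docs: dict mapping each word to the number of documents it appears in.
index_docs: dict mapping each word index to the number of documents that word appears in.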

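Methods:

fit_on_texts(texts): updates the tokenizer's vocabulary from a list of documents.
texts_to_sequences(texts): converts a list of documents into a list of integer sequences, one per document. The result has len(texts) entries, and each entry is as long as the number of in-vocabulary words in that document.
texts_to_matrix(texts): converts a list of documents into a matrix of shape [len(texts), num_words]. When num_words is None, the number of columns is len(word_index) + 1.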

Example:

import keras  # assumes a Keras 2.x install, where keras.preprocessing.text is available

texts = ["Whatever is worth doing is worth doing well.",
        "Happiness is a way station between too much and too little.",
        "In love folly is always sweet.",
        "The hard part isn’t making the decision. It’s living with it."]

tokenizer = keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)

# Update the internal vocabulary from the list of input texts
tokenizer.fit_on_texts(texts)

# document_count: number of documents processed
print("document_count: ", tokenizer.document_count)
# word_index: mapping from word to index
print("word_index: \n", tokenizer.word_index)
# index_word: mapping from index to word
print("index_word: \n", tokenizer.index_word)
# word_counts: total number of occurrences of each word
print("word_counts: \n", tokenizer.word_counts)
# word_docs: number of documents each word appears in
print("word_docs: \n", tokenizer.word_docs)
# index_docs: number of documents each word appears in, keyed by word index
print("index_docs: \n", tokenizer.index_docs)

document_count: 4
word_index: # ordered by descending word frequency
{'is': 1, 'worth': 2, 'doing': 3, 'too': 4, 'the': 5, 'whatever': 6, 'well': 7, 'happiness': 8, 'a': 9, 'way': 10, 'station': 11, 'between': 12, 'much': 13, 'and': 14, 'little': 15, 'in': 16, 'love': 17, 'folly': 18, 'always': 19, 'sweet': 20, 'hard': 21, 'part': 22, 'isn’t': 23, 'making': 24, 'decision': 25, 'it’s': 26, 'living': 27, 'with': 28, 'it': 29}
index_word:
{1: 'is', 2: 'worth', 3: 'doing', 4: 'too', 5: 'the', 6: 'whatever', 7: 'well', 8: 'happiness', 9: 'a', 10: 'way', 11: 'station', 12: 'between', 13: 'much', 14: 'and', 15: 'little', 16: 'in', 17: 'love', 18: 'folly', 19: 'always', 20: 'sweet', 21: 'hard', 22: 'part', 23: 'isn’t', 24: 'making', 25: 'decision', 26: 'it’s', 27: 'living', 28: 'with', 29: 'it'}
word_counts:
OrderedDict([('whatever', 1), ('is', 4), ('worth', 2), ('doing', 2), ('well', 1), ('happiness', 1), ('a', 1), ('way', 1), ('station', 1), ('between', 1), ('too', 2), ('much', 1), ('and', 1), ('little', 1), ('in', 1), ('love', 1), ('folly', 1), ('always', 1), ('sweet', 1), ('the', 2), ('hard', 1), ('part', 1), ('isn’t', 1), ('making', 1), ('decision', 1), ('it’s', 1), ('living', 1), ('with', 1), ('it', 1)])
word_docs:
defaultdict(<class 'int'>, {'doing': 1, 'is': 3, 'well': 1, 'worth': 1, 'whatever': 1, 'little': 1, 'happiness': 1, 'station': 1, 'much': 1, 'and': 1, 'way': 1, 'between': 1, 'a': 1, 'too': 1, 'sweet': 1, 'in': 1, 'love': 1, 'folly': 1, 'always': 1, 'the': 1, 'isn’t': 1, 'it’s': 1, 'living': 1, 'decision': 1, 'with': 1, 'part': 1, 'making': 1, 'it': 1, 'hard': 1})
index_docs:
defaultdict(<class 'int'>, {3: 1, 1: 3, 7: 1, 2: 1, 6: 1, 15: 1, 8: 1, 11: 1, 13: 1, 14: 1, 10: 1, 12: 1, 9: 1, 4: 1, 20: 1, 16: 1, 17: 1, 18: 1, 19: 1, 5: 1, 23: 1, 26: 1, 27: 1, 25: 1, 28: 1, 22: 1, 24: 1, 29: 1, 21: 1})

# Sort the words by frequency, most frequent first
new_fre = sorted(tokenizer.word_counts.items(), key=lambda i: i[1], reverse=True)
print("new_fre:\n", new_fre)

new_fre:
[('is', 4), ('worth', 2), ('doing', 2), ('too', 2), ('the', 2), ('whatever', 1), ('well', 1), ('happiness', 1), ('a', 1), ('way', 1), ('station', 1), ('between', 1), ('much', 1), ('and', 1), ('little', 1), ('in', 1), ('love', 1), ('folly', 1), ('always', 1), ('sweet', 1), ('hard', 1), ('part', 1), ('isn’t', 1), ('making', 1), ('decision', 1), ('it’s', 1), ('living', 1), ('with', 1), ('it', 1)]

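Comparing this with the output of tokenizer.index_word shows that the word indices in the dictionary follow word frequency: the most frequent word gets index 1, and words with equal counts keep the order in which they were first seen.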
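The ordering rule can be reproduced without Keras. The following is a rough pure-Python sketch, not the library implementation: the function name build_word_index is hypothetical, and it assumes the default filters, lowercasing, and splitting on a single space. It counts words in first-seen order and relies on the stability of sorted() to break ties.

```python
from collections import OrderedDict

# Default filter characters, as listed in the Tokenizer signature.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def build_word_index(texts, filters=FILTERS):
    """Hypothetical helper: count words, then rank them by descending frequency."""
    table = str.maketrans({c: " " for c in filters})
    counts = OrderedDict()  # remembers the order in which words were first seen
    for text in texts:
        for w in text.lower().translate(table).split(" "):
            if w:  # skip empty strings left by consecutive separators
                counts[w] = counts.get(w, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sample_texts = ["Whatever is worth doing is worth doing well.",
                "Happiness is a way station between too much and too little."]
word_index = build_word_index(sample_texts)
print(word_index)
```

Only the first two example sentences are used here, so the indices of the less frequent words differ from the four-sentence output shown earlier.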

# texts_to_sequences(texts): converts the list of documents into a list of index sequences, one per document
print("texts_to_sequences: \n", tokenizer.texts_to_sequences(texts))
# texts_to_matrix(texts): converts the list of documents into a matrix of shape [len(texts), num_words]
print("texts_to_matrix: \n", tokenizer.texts_to_matrix(texts))

texts_to_sequences: # each word of a sentence replaced by its index in word_index
[[6, 1, 2, 3, 1, 2, 3, 7], [8, 1, 9, 10, 11, 12, 4, 13, 14, 4, 15], [16, 17, 18, 1, 19, 20], [5, 21, 22, 23, 24, 5, 25, 26, 27, 28, 29]]
texts_to_matrix:
[[0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.
  1. 1. 1. 1. 1. 1.]]
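The matrix above has 30 columns: one per word index from 1 to 29, plus the unused column 0. As a small sketch of how such rows can be derived from the index sequences, the hypothetical helper below (sequences_to_rows is not a Keras function) fills in either a 0/1 presence flag or an occurrence count. It covers only these two simple cases and uses plain lists instead of NumPy arrays.

```python
def sequences_to_rows(sequences, num_columns, mode="binary"):
    """Hypothetical helper: turn index sequences into fixed-width rows."""
    rows = []
    for seq in sequences:
        row = [0.0] * num_columns
        for idx in seq:
            if mode == "binary":
                row[idx] = 1.0   # word present at least once
            else:
                row[idx] += 1.0  # "count" mode: number of occurrences
        rows.append(row)
    return rows


# Index sequence of the first example sentence, taken from the output above.
sequences = [[6, 1, 2, 3, 1, 2, 3, 7]]
binary_rows = sequences_to_rows(sequences, num_columns=30)
count_rows = sequences_to_rows(sequences, num_columns=30, mode="count")

print(binary_rows[0][:8])  # [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
print(count_rows[0][:8])   # [0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0]
```

The binary row matches the first row of the texts_to_matrix output, while the count row shows that 'is', 'worth' and 'doing' each occur twice in that sentence.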

hashing_trick (the hashing trick)

n=10
keras.preprocessing.text.hashing_trick(texts[0], n, hash_function=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

[3, 1, 8, 7, 1, 8, 7, 6]

Converts a text into a sequence of indices in a fixed-size hashing space.

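Arguments:

text: input text (string).
n: dimension of the hashing space.
hash_function: defaults to Python's built-in hash function. It can also be 'md5' or any function that takes a string and returns an integer. Note that hash is not a stable hashing function, so its results are not consistent across runs, whereas md5 is stable.
filters: list (or concatenation) of characters to filter out, such as punctuation. The default is !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n, which covers basic punctuation, tabs, and newlines.
lower: boolean. Whether to convert the text to lowercase.
split: string. The separator used to split words.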

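Returns:

A list of integer word indices (uniqueness is not guaranteed).

0 is a reserved index and is never assigned to any word.

Because the hash function may produce collisions, several words can map to the same index. The probability of a collision depends on the dimension of the hashing space and on the number of distinct objects.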
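How quickly collisions become likely can be estimated with the classic birthday-problem calculation. The sketch below is a back-of-the-envelope model, not part of Keras: the function name collision_probability is hypothetical, it assumes the hash spreads words uniformly, and it uses n - 1 available slots, since index 0 is reserved and the implementation quoted later in this section maps words into [1, n - 1].

```python
def collision_probability(num_distinct_words, n):
    """Probability that at least two of the distinct words share an index,
    assuming uniform hashing into n - 1 slots (an idealized model)."""
    slots = n - 1
    if num_distinct_words > slots:
        return 1.0  # pigeonhole: more words than slots forces a collision
    p_all_unique = 1.0
    for i in range(num_distinct_words):
        p_all_unique *= (slots - i) / slots
    return 1.0 - p_all_unique


# The first example sentence has 5 distinct words; with n = 10 there are 9 slots.
print(round(collision_probability(5, 10), 2))    # about 0.74
print(round(collision_probability(5, 1000), 2))  # a much larger space makes collisions rare
```

Under this model a hashing space of n = 10 is already too small for a five-word vocabulary to stay collision-free reliably, which is why n is normally chosen much larger than the expected number of distinct words.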

texts[0]

'Whatever is worth doing is worth doing well.'

keras.preprocessing.text.hashing_trick(texts[0], n, hash_function=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

[3, 1, 8, 7, 1, 8, 7, 6]

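Only text and n are required arguments of hashing_trick. Passing in a text and looking at the output shows that it returns the list of integers corresponding to the words. Within one run of the program every word always gets the same hash value, but the values change once the program is restarted. So I dug out the source code to have a look: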

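from hashlib import md5  # imported at module level in the original source

# text_to_word_sequence is defined in the same module (keras.preprocessing.text)
def hashing_trick(text, n,
                  hash_function=None,
                  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                  lower=True,
                  split=' '):
    if hash_function is None:
        hash_function = hash
    elif hash_function == 'md5':
        def hash_function(w):
            return int(md5(w.encode()).hexdigest(), 16)

    # Split the text into a list of words
    seq = text_to_word_sequence(text,
                                filters=filters,
                                lower=lower,
                                split=split)
    # Hash each word, take the value modulo (n - 1) and add 1,
    # so every index falls in [1, n - 1] and 0 is never assigned
    return [(hash_function(w) % (n - 1) + 1) for w in seq]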

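By default hash_function is Python's built-in hash. String hashes are randomized per process (unless PYTHONHASHSEED is set to a fixed value), so the hash value of a word changes every time the program starts but stays the same for every call within a single run.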
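When reproducible indices are needed, the 'md5' option avoids this per-process randomization. As a standalone illustration of the same arithmetic, the sketch below uses only hashlib from the standard library. The helper name md5_index is made up for this example, and the word list is the tokenized first sentence.

```python
import hashlib


def md5_index(word, n):
    """Hypothetical helper: stable index in [1, n - 1] derived from an MD5 digest."""
    digest = hashlib.md5(word.encode()).hexdigest()
    return int(digest, 16) % (n - 1) + 1


words = ['whatever', 'is', 'worth', 'doing', 'is', 'worth', 'doing', 'well']
indices = [md5_index(w, 10) for w in words]
print(indices)  # identical on every run and on every machine
```

Because MD5 depends only on the bytes of the word, repeated words get the same index and the list does not change between program starts. Collisions between different words remain possible.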

one_hot (one-hot encoding)

keras.preprocessing.text.one_hot(texts[0], n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

[3, 1, 8, 7, 1, 8, 7, 6]

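one_hot encodes a text into a list of word indices for a vocabulary of size n. It is a wrapper around the hashing_trick function that uses hash as the hashing function, so the word-to-index mapping is not guaranteed to be unique. Despite its name, it returns integer indices, not one-hot vectors.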

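Arguments:

text: input text (string).
n: size of the vocabulary.
filters: list (or concatenation) of characters to filter out, such as punctuation. The default is !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n, which covers basic punctuation, tabs, and newlines.
lower: boolean. Whether to convert the text to lowercase.
split: string. The separator used to split words.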

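Returns:

A list of integers in [1, n], each integer representing a word (uniqueness is not guaranteed). Given the hash % (n - 1) + 1 step in the hashing_trick source shown above, the values actually fall in [1, n - 1].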

text_to_word_sequence

keras.preprocessing.text.text_to_word_sequence(texts[0], filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

['whatever', 'is', 'worth', 'doing', 'is', 'worth', 'doing', 'well']

Converts a text into a sequence of words (tokens).

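Arguments:

text: input text (string).
filters: list (or concatenation) of characters to filter out, such as punctuation. The default is !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n, which covers basic punctuation, tabs, and newlines.
lower: boolean. Whether to convert the text to lowercase.
split: string. The separator used to split words.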

Returns:

A sequence of words (tokens).
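The three steps behind this function (lowercase, replace every filter character with the separator, split and drop empty pieces) are easy to mimic in plain Python. The following is a minimal sketch under that assumption. The name simple_word_sequence is hypothetical and the code is a stand-in for experimentation, not the Keras implementation.

```python
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Hypothetical stand-in: lowercase, strip filter characters, split into words."""
    if lower:
        text = text.lower()
    # Replace each filtered character with the separator so it acts as a word break
    table = str.maketrans({c: split for c in filters})
    pieces = text.translate(table).split(split)
    return [p for p in pieces if p]  # drop empty strings from repeated separators


tokens = simple_word_sequence("Whatever is worth doing is worth doing well.")
print(tokens)
# ['whatever', 'is', 'worth', 'doing', 'is', 'worth', 'doing', 'well']
```

Passing lower=False keeps the original capitalization, and passing filters='' keeps punctuation attached to the neighboring word.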
