gensim

Document: some text.
Corpus: a collection of documents.
Vector: a mathematically convenient representation of a document. original vector representation
Model: an algorithm for transforming vectors from one representation to another, e.g. tf-idf.

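The difference between a Document and a Vector is that the former is text, while the latter is a mathematically convenient representation of that text. People sometimes use the terms interchangeably: for example, given an arbitrary document D, instead of saying "the vector that corresponds to document D", they say "the vector D" or "document D". This achieves brevity at the cost of ambiguity. As long as you remember that documents live in document space and vectors live in vector space, the ambiguity is acceptable.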

1. Document

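In Gensim, a Document is an object of the text sequence type (commonly known as str in Python 3). A document can be anything from a short 140-character tweet to a single paragraph (for example, a journal article abstract), a news article, or a book. For example: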

document = "Human machine interface for lab abc computer applications"

2. Corpus

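A Corpus is a collection of Document objects. Corpora serve two roles in Gensim:

(1) Input for training a model. During training, the model uses this training corpus to look for common themes and topics, initializing its internal model parameters. Gensim focuses on unsupervised models, so no human intervention such as costly annotation or hand-tagging of documents is required.

(2) Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for similarity queries, queried by semantic similarity, clustered, and so on.

Here is an example corpus. It consists of 9 documents, where each document is a string consisting of a single sentence.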

text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
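Note: the example above loads the entire corpus into memory. In practice, corpora may be very large, so loading them into memory may be impossible. Gensim handles such corpora by streaming them one document at a time. For details, see Corpus Streaming – One Document at a Time.

2.1 Preprocessing

After collecting the corpus, there are typically a number of preprocessing steps to perform. We keep it simple and only remove some commonly used English words (such as "the") and words that occur only once in the corpus. In the process, we tokenize the data. Tokenization breaks the documents up into words (in this case, using whitespace as the delimiter).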

# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
import pprint
from collections import defaultdict

frequency = defaultdict(int)     # missing keys default to 0, i.e. frequency[word] == 0
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)

Result:

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

There is a better way to perform preprocessing: the gensim.utils.simple_preprocess() function.

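2.2 Building the vocabulary: gensim.corpora.Dictionary

Before proceeding, we want to associate each word in the corpus with a unique integer ID. The mapping between words and IDs is called a dictionary, and the gensim.corpora.Dictionary class builds it. This dictionary defines the vocabulary of all words that our processing knows about.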

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary)

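Here we used the gensim.corpora.dictionary.Dictionary class to assign a unique integer ID to every word appearing in the corpus. It sweeps across all the texts, collecting word counts and related statistics. In the end, there are 12 distinct words in the processed corpus, which means each document will be represented by 12 numbers (i.e., by a 12-dimensional vector).

Output:

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

Because our corpus is small, the gensim.corpora.Dictionary contains only 12 distinct tokens. For larger corpora, dictionaries containing tens of thousands of tokens are common.

To see the mapping between words and their IDs:

2.3 dictionary.token2id: the ID assigned to each token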

toid = dictionary.token2id
print(toid)

Result: "token": id key-value pairs

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
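To make the token-to-ID mapping concrete, here is a simplified pure-Python sketch that reproduces the IDs shown above. The function name build_token2id is hypothetical, and the numbering rule (unseen tokens of each document taken in sorted order) is an assumption chosen to match the output; it is an illustration, not gensim's actual implementation.

```python
processed_corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]


def build_token2id(tokenized_docs):
    """Give every new token the next free integer ID, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        # Within one document, unseen tokens are numbered in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


token2id = build_token2id(processed_corpus)
# The reverse mapping lets us look up a token from its ID.
id2token = {token_id: token for token, token_id in token2id.items()}

print(token2id)
print(id2token[5])  # system
```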

3. Vector

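To infer the latent structure in our corpus, we need a way to represent documents that we can manipulate mathematically.

One approach is to represent each document as a vector of features:

dense vector: (1, 0.0), (2, 2.0), (3, 5.0)
sparse vector or bag-of-words vector: (2, 2.0), (3, 5.0) (zero-valued entries are omitted; this is the form gensim uses)

Another approach is the bag-of-words model.

Under this model, each document is represented by a vector containing the frequency count of every word in the dictionary.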

dictionary: ['coffee', 'milk', 'sugar', 'spoon']

A document: "coffee milk coffee"

Represented by the vector: [2, 1, 0, 0]

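A key property of the bag-of-words model is that it completely ignores the order of the tokens in the encoded document, which is where the name bag-of-words comes from.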
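The coffee example can be checked with a few lines of plain Python. This is only a sketch of the counting idea, using the standard library's Counter rather than any gensim call; the variable names are illustrative.

```python
from collections import Counter

vocabulary = ["coffee", "milk", "sugar", "spoon"]
document = "coffee milk coffee"

# Count every token, then read the counts off in vocabulary order.
counts = Counter(document.split())
dense_vector = [counts[word] for word in vocabulary]
print(dense_vector)  # [2, 1, 0, 0]

# Token order is ignored: a shuffled document gives the same vector.
shuffled_counts = Counter("milk coffee coffee".split())
shuffled_vector = [shuffled_counts[word] for word in vocabulary]
print(shuffled_vector == dense_vector)  # True
```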

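Our processed corpus has 12 unique words, so under the bag-of-words model each document is represented by a 12-dimensional vector. We can use the dictionary to turn tokenized documents into these 12-dimensional vectors.

3.1 dictionary.doc2bow: converting a tokenized document into a vector

To create the bag-of-words representation of a document, the function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word ID, and returns the result as a sparse vector. The sparse vector [(0, 1), (1, 1)] therefore reads: in the document "Human computer interaction", the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.

For example, suppose we want to vectorize the phrase "Human computer minors interaction minors" (note that this phrase is not in our original corpus). We can use the dictionary's doc2bow method to create its bag-of-words representation, which returns a sparse representation of the word counts: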

new_doc = "Human computer minors interaction minors"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

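Resulting vectorization, sorted by the IDs in dictionary.token2id:

[(0, 1), (1, 1), (11, 2)]

The first entry in each tuple is the ID of the token in the dictionary; the second entry is the count of that token.

Note that "interaction" did not occur in the original corpus and so was not included in the vectorization. Also note that this vector only contains entries for words that actually appear in the document. Because any given document contains only a few of the many words in the dictionary, words that do not appear in the vectorization are implicitly represented as zero to save space.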
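The counting and the implicit zeros can be illustrated without gensim. The sketch below is a simplified stand-in for doc2bow, not the library's code: the helper names doc_to_bow and to_dense are hypothetical, and the token2id mapping is copied from the output in section 2.3.

```python
from collections import Counter

token2id = {
    "computer": 0, "human": 1, "interface": 2, "response": 3,
    "survey": 4, "system": 5, "time": 6, "user": 7,
    "eps": 8, "trees": 9, "graph": 10, "minors": 11,
}


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def to_dense(sparse_vec, size):
    """Expand a sparse vector, writing the implicit zeros out explicitly."""
    dense = [0] * size
    for token_id, count in sparse_vec:
        dense[token_id] = count
    return dense


new_doc = "Human computer minors interaction minors"
new_vec = doc_to_bow(new_doc.lower().split(), token2id)
dense_vec = to_dense(new_vec, len(token2id))

print(new_vec)    # [(0, 1), (1, 1), (11, 2)]
print(dense_vec)  # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
```

The dense form stores 12 numbers for 3 pieces of information, which is why the sparse form pays off once the dictionary grows.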

4. Model

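Now that we have vectorized our corpus, we can begin to transform it using models. We use model as an abstract term for a transformation from one document representation to another. In gensim, documents are represented as vectors, so a Model can be thought of as a transformation between two vector spaces. The model learns the details of this transformation during training, when it reads the training corpus.

4.1 tf-idf model

A simple example of a Model is tf-idf, which transforms vectors from the bag-of-words representation into a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

Here is a simple example: we initialize the tf-idf model, train it on our corpus, and transform the string "system minors".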

import pprint
from gensim import corpora, models

# Training: feed in the bag-of-words representation of the training set
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]   # bag-of-words vectors
pprint.pprint(bow_corpus)
corpora.MmCorpus.serialize('/tmp/deerwester.mm', bow_corpus)  # store to disk, for later use

tfidf = models.TfidfModel(bow_corpus)    # bag-of-words vectors -> tf-idf vectors

# Test
words = "system minors".lower().split()   # ['system', 'minors']
bow_test = dictionary.doc2bow(words)      # [(5, 1), (11, 1)]
result = tfidf[bow_test]                  # [(5, 0.5898341626740045), (11, 0.8075244024440723)]

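The tfidf model again returns a list of tuples, where the first entry is the token ID and the second entry is the tf-idf weight.

Bag-of-words representation of the training set: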

[[(0, 1), (1, 1), (2, 1)],
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
[(2, 1), (5, 1), (7, 1), (8, 1)],
[(1, 1), (5, 2), (8, 1)],
[(3, 1), (6, 1), (7, 1)],
[(9, 1)],
[(9, 1), (10, 1)],
[(9, 1), (10, 1), (11, 1)],
[(4, 1), (10, 1), (11, 1)]]

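Note that the ID corresponding to "system" (which occurred 4 times in the original corpus) has been weighted lower than the ID corresponding to "minors" (which occurred only twice).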
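The two weights can be reproduced by hand. The sketch below assumes what I understand to be gensim's default weighting (raw term frequency times log2(N / document frequency), followed by L2 normalization); under that assumption the numbers match the result shown above. The document frequencies are read off the bag-of-words listing: ID 5 ("system") occurs in 3 of the 9 documents and ID 11 ("minors") in 2.

```python
import math

num_docs = 9
doc_freq = {5: 3, 11: 2}       # documents containing 'system' and 'minors'
bow_test = [(5, 1), (11, 1)]   # bag-of-words vector for "system minors"

# Step 1: term frequency times inverse document frequency (log base 2).
raw = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow_test]

# Step 2: scale the vector to unit Euclidean length.
norm = math.sqrt(sum(weight * weight for _, weight in raw))
weights = [(tid, weight / norm) for tid, weight in raw]

print(weights)  # approximately [(5, 0.58983), (11, 0.80752)]
```

"system" appears in more documents than "minors", so its inverse document frequency, and therefore its final weight, is smaller.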

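You can save trained models to disk and later load them back, either to continue training on new training documents or to transform new documents. gensim offers a number of different models/transformations. For more information, see Topics and Transformations.

Calling model[corpus] only creates a wrapper around the old corpus document stream; the actual conversions are done on the fly, during document iteration. We cannot convert the entire corpus at the time of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, which contradicts gensim's objective of memory independence. If you will be iterating over the transformed corpus_transformed multiple times and the transformation is costly, serialize the resulting corpus to disk first and continue using that.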

4.2 Corpus Streaming

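Note that the corpus above resides fully in memory, as a plain Python list. In this simple example it doesn't matter much, but to make things clear, let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do. Instead, assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time.

Gensim's flexibility: a corpus does not have to be a list, a NumPy array, a Pandas dataframe, or anything similar. Gensim accepts any object that, when iterated over, successively yields documents.

This flexibility allows us to create our own corpus classes that stream documents directly from disk, the network, a database, and so on. Models in Gensim do not require all vectors to reside in RAM at once; you can even create the documents on the fly!

The assumption that each document occupies one line in a single file is not important; you can mold the __iter__ function to fit your input format, whatever it is. Walking directories, parsing XML, accessing the network... Just parse your input to retrieve a clean list of tokens for each document, then convert the tokens to IDs via the dictionary and yield the resulting sparse vector inside __iter__.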
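A streaming corpus along these lines might look like the sketch below. It is written in plain Python so that it runs without gensim: the class name StreamingCorpus, the four-word token2id mapping, and the temporary file are all made up for the example. In real use the mapping would come from a gensim Dictionary and __iter__ would typically call dictionary.doc2bow.

```python
import os
import tempfile

# Hypothetical vocabulary; in practice this would come from a gensim Dictionary.
token2id = {"computer": 0, "human": 1, "interface": 2, "system": 3}


class StreamingCorpus:
    """Yield one sparse bag-of-words vector per line of a text file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated
        # repeatedly while holding only one line in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = {}
                for token in line.lower().split():
                    token_id = self.token2id.get(token)
                    if token_id is not None:
                        counts[token_id] = counts.get(token_id, 0) + 1
                yield sorted(counts.items())


# Write a tiny two-document file to stream from.
path = os.path.join(tempfile.mkdtemp(), "mycorpus.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("Human machine interface\n")
    handle.write("System and human system engineering\n")

corpus = StreamingCorpus(path, token2id)
for vector in corpus:
    print(vector)
```

Because __iter__ starts from the beginning of the file each time, the object supports the multiple passes that model training usually needs, which a one-shot generator would not.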

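4.3 After the model is built: gensim.similarities

Once you have created the model, you can do all sorts of interesting things with it. For example, to transform the whole corpus via TfIdf and index it, in preparation for similarity queries: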

# Similarity
from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

# Query the similarity of our query_document against every document in the corpus:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]  # ndarray
print(list(enumerate(sims)))
# [(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
# Document 3 has a similarity score of 0.718=72%, document 2 has a similarity score of 42% etc.

for doc, score in sorted(enumerate(sims), key=lambda x:x[1], reverse=True):
    print(doc, ' ', score)

Result:

3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
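To see where these scores come from, the sketch below recomputes them in plain Python as cosine similarities between tf-idf vectors. It assumes the same weighting as gensim's defaults as I understand them (term frequency times log2(N / df), L2-normalized); the helper names tfidf_vector and cosine are hypothetical. Results agree with the output above up to floating-point precision.

```python
import math

bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def tfidf_vector(bow, doc_freq, num_docs):
    """Weight by tf * log2(N / df), then scale to unit length."""
    raw = {tid: tf * math.log2(num_docs / doc_freq[tid])
           for tid, tf in bow if tid in doc_freq}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tid: w / norm for tid, w in raw.items()} if norm else {}


def cosine(vec_a, vec_b):
    """For unit-length sparse vectors, cosine similarity is the dot product."""
    return sum(w * vec_b.get(tid, 0.0) for tid, w in vec_a.items())


# Document frequency: in how many documents does each token ID occur?
num_docs = len(bow_corpus)
doc_freq = {}
for bow in bow_corpus:
    for tid, _ in bow:
        doc_freq[tid] = doc_freq.get(tid, 0) + 1

corpus_tfidf = [tfidf_vector(bow, doc_freq, num_docs) for bow in bow_corpus]

# "system engineering": only 'system' (ID 5) is in the dictionary.
query = tfidf_vector([(5, 1)], doc_freq, num_docs)
sims = [cosine(query, doc) for doc in corpus_tfidf]

for doc_id, score in sorted(enumerate(sims), key=lambda item: item[1], reverse=True):
    print(doc_id, round(score, 4))
```

Only documents 1, 2, and 3 contain "system", so every other document scores exactly zero.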

5. KeyedVectors

Reference: Commonly used Word2Vec, Phrases, Phraser, and KeyedVectors in gensim

5.1 KeyedVectors

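5.2 Data formats for saving and loading word vectors

model.save(): the saved file cannot be inspected in a text editor, but it keeps the complete training state (such as the weights, the binary tree, and the vocabulary frequencies), so the model can be loaded back and trained further.
model.wv.save(): saves only the KeyedVectors instance. Training state such as the vocabulary tree is discarded, so the loaded vectors cannot be trained further.
model.wv.save_word2vec_format(): writes the vectors in the word2vec format; pass binary=True for the binary variant (the default is plain text).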