Gensim learns latent topic-vector representations of documents from raw, unstructured text, in an unsupervised fashion.
It supports topic-modeling algorithms including LDA, TF-IDF, LSA, and word2vec.
!pip install gensim
!conda list | grep gensim
gensim 3.8.3 <pip>
!pip install PyHamcrest
!pip show PyHamcrest
import gensim
gensim.__version__
'3.8.3'
Raw strings -> sparse vectors
Raw text -> tokenization, stop-word removal, etc. -> list of document features
Bag-of-words model: each document feature is a word
from gensim import corpora
texts = [['a', 'b', 'c'],
         ['a', 'd', 'b']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words via doc2bow
print(corpus)
print()
print(corpus[0])
print(corpus[1])
[[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (3, 1)]]
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (1, 1), (3, 1)]
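To connect this toy example back to the pipeline above (raw strings -> tokenization and stop-word removal -> document features -> BoW), here is a minimal end-to-end sketch; the sample sentences and the use of simple_preprocess/STOPWORDS are illustrative choices, not part of the original cells:

from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

raw_docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
]
# Tokenize and lowercase, then drop stop words.
tokenized = [[tok for tok in simple_preprocess(doc) if tok not in STOPWORDS]
             for doc in raw_docs]
dct = corpora.Dictionary(tokenized)                  # token <-> id mapping
bow_docs = [dct.doc2bow(doc) for doc in tokenized]   # sparse (id, count) vectors
print(bow_docs)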
help(corpora.Dictionary)
Help on class Dictionary in module gensim.corpora.dictionary:
class Dictionary(gensim.utils.SaveLoad, collections.abc.Mapping)
| Dictionary(documents=None, prune_at=2000000)
|
| Dictionary encapsulates the mapping between normalized words and their integer ids.
|
| Notable instance attributes:
|
| Attributes
| ----------
| token2id : dict of (str, int)
| token -> tokenId.
| id2token : dict of (int, str)
| Reverse mapping for token2id, initialized in a lazy manner to save memory (not created until needed).
| cfs : dict of (int, int)
| Collection frequencies: token_id -> how many instances of this token are contained in the documents.
| dfs : dict of (int, int)
| Document frequencies: token_id -> how many documents contain this token.
| num_docs : int
| Number of documents processed.
| num_pos : int
| Total number of corpus positions (number of processed words).
| num_nnz : int
| Total number of non-zeroes in the BOW matrix (sum of the number of unique
| words per document over the entire corpus).
|
| Method resolution order:
| Dictionary
| gensim.utils.SaveLoad
| collections.abc.Mapping
| collections.abc.Collection
| collections.abc.Sized
| collections.abc.Iterable
| collections.abc.Container
| builtins.object
|
| Methods defined here:
|
| __getitem__(self, tokenid)
| Get the string token that corresponds to `tokenid`.
|
| Parameters
| ----------
| tokenid : int
| Id of token.
|
| Returns
| -------
| str
| Token corresponding to `tokenid`.
|
| Raises
| ------
| KeyError
| If this Dictionary doesn't contain such `tokenid`.
|
| __init__(self, documents=None, prune_at=2000000)
| Parameters
| ----------
| documents : iterable of iterable of str, optional
| Documents to be used to initialize the mapping and collect corpus statistics.
| prune_at : int, optional
| Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
| footprint, the correctness is not guaranteed.
| Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> texts = [['human', 'interface', 'computer']]
| >>> dct = Dictionary(texts) # initialize a Dictionary
| >>> dct.add_documents([["cat", "say", "meow"], ["dog"]]) # add more document (extend the vocabulary)
| >>> dct.doc2bow(["dog", "computer", "non_existent_word"])
| [(0, 1), (6, 1)]
|
| __iter__(self)
| Iterate over all tokens.
|
| __len__(self)
| Get number of stored tokens.
|
| Returns
| -------
| int
| Number of stored tokens.
|
| __str__(self)
| Return str(self).
|
| add_documents(self, documents, prune_at=2000000)
| Update dictionary from a collection of `documents`.
|
| Parameters
| ----------
| documents : iterable of iterable of str
| Input corpus. All tokens should be already **tokenized and normalized**.
| prune_at : int, optional
| Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
| footprint, the correctness is not guaranteed.
| Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = ["máma mele maso".split(), "ema má máma".split()]
| >>> dct = Dictionary(corpus)
| >>> len(dct)
| 5
| >>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
| >>> len(dct)
| 10
|
| compactify(self)
| Assign new word ids to all words, shrinking any gaps.
|
| doc2bow(self, document, allow_update=False, return_missing=False)
| Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
|
| Parameters
| ----------
| document : list of str
| Input document.
| allow_update : bool, optional
| Update self, by adding new tokens from `document` and updating internal corpus statistics.
| return_missing : bool, optional
| Return missing tokens (tokens present in `document` but not in self) with frequencies?
|
| Return
| ------
| list of (int, int)
| BoW representation of `document`.
| list of (int, int), dict of (str, int)
| If `return_missing` is True, return BoW representation of `document` + dictionary with missing
| tokens and their frequencies.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
| >>> dct.doc2bow(["this", "is", "máma"])
| [(2, 1)]
| >>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
| ([(2, 1)], {u'this': 1, u'is': 1})
|
| doc2idx(self, document, unknown_word_index=-1)
| Convert `document` (a list of words) into a list of indexes = list of `token_id`.
| Replace all unknown words i.e, words not in the dictionary with the index as set via `unknown_word_index`.
|
| Parameters
| ----------
| document : list of str
| Input document
| unknown_word_index : int, optional
| Index to use for words not in the dictionary.
|
| Returns
| -------
| list of int
| Token ids for tokens in `document`, in the same order.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["a", "a", "b"], ["a", "c"]]
| >>> dct = Dictionary(corpus)
| >>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
| [0, 0, 2, -1, 2]
|
| filter_extremes(self, no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
| Filter out tokens in the dictionary by their frequency.
|
| Parameters
| ----------
| no_below : int, optional
| Keep tokens which are contained in at least `no_below` documents.
| no_above : float, optional
| Keep tokens which are contained in no more than `no_above` documents
| (fraction of total corpus size, not an absolute number).
| keep_n : int, optional
| Keep only the first `keep_n` most frequent tokens.
| keep_tokens : iterable of str
| Iterable of tokens that **must** stay in dictionary after filtering.
|
| Notes
| -----
| This removes all tokens in the dictionary that are:
|
| #. Less frequent than `no_below` documents (absolute number, e.g. `5`) or
|
| #. More frequent than `no_above` documents (fraction of the total corpus size, e.g. `0.3`).
| #. After (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if `keep_n=None`).
|
| After the pruning, resulting gaps in word ids are shrunk.
| Due to this gap shrinking, **the same word may have a different word id before and after the call
| to this function!**
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>> dct = Dictionary(corpus)
| >>> len(dct)
| 5
| >>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
| >>> len(dct)
| 1
|
| filter_n_most_frequent(self, remove_n)
| Filter out the 'remove_n' most frequent tokens that appear in the documents.
|
| Parameters
| ----------
| remove_n : int
| Number of the most frequent tokens that will be removed.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>> dct = Dictionary(corpus)
| >>> len(dct)
| 5
| >>> dct.filter_n_most_frequent(2)
| >>> len(dct)
| 3
|
| filter_tokens(self, bad_ids=None, good_ids=None)
| Remove the selected `bad_ids` tokens from :class:`~gensim.corpora.dictionary.Dictionary`.
|
| Alternatively, keep selected `good_ids` in :class:`~gensim.corpora.dictionary.Dictionary` and remove the rest.
|
| Parameters
| ----------
| bad_ids : iterable of int, optional
| Collection of word ids to be removed.
| good_ids : collection of int, optional
| Keep selected collection of word ids and remove the rest.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>> dct = Dictionary(corpus)
| >>> 'ema' in dct.token2id
| True
| >>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
| >>> 'ema' in dct.token2id
| False
| >>> len(dct)
| 4
| >>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
| >>> len(dct)
| 1
|
| iteritems(self)
|
| iterkeys = __iter__(self)
|
| itervalues(self)
|
| keys(self)
| Get all stored ids.
|
| Returns
| -------
| list of int
| List of all token ids.
|
| merge_with(self, other)
| Merge another dictionary into this dictionary, mapping the same tokens to the same ids
| and new tokens to new ids.
|
| Notes
| -----
| The purpose is to merge two corpora created using two different dictionaries: `self` and `other`.
| `other` can be any id=>word mapping (a dict, a Dictionary object, ...).
|
| Return a transformation object which, when accessed as `result[doc_from_other_corpus]`, will convert documents
| from a corpus built using the `other` dictionary into a document using the new, merged dictionary.
|
| Parameters
| ----------
| other : {dict, :class:`~gensim.corpora.dictionary.Dictionary`}
| Other dictionary.
|
| Return
| ------
| :class:`gensim.models.VocabTransform`
| Transformation object.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
| >>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
| >>> dct_1.doc2bow(corpus_2[0])
| [(0, 1)]
| >>> transformer = dct_1.merge_with(dct_2)
| >>> dct_1.doc2bow(corpus_2[0])
| [(0, 1), (3, 2)]
|
| patch_with_special_tokens(self, special_token_dict)
| Patch token2id and id2token using a dictionary of special tokens.
|
|
| **Usecase:** when doing sequence modeling (e.g. named entity recognition), one may want to specify
| special tokens that behave differently than others.
| One example is the "unknown" token, and another is the padding token.
| It is usual to set the padding token to have index `0`, and patching the dictionary with `{'<PAD>': 0}`
| would be one way to specify this.
|
| Parameters
| ----------
| special_token_dict : dict of (str, int)
| dict containing the special tokens as keys and their wanted indices as values.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>> dct = Dictionary(corpus)
| >>>
| >>> special_tokens = {'pad': 0, 'space': 1}
| >>> print(dct.token2id)
| {'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
| >>>
| >>> dct.patch_with_special_tokens(special_tokens)
| >>> print(dct.token2id)
| {'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
|
| save_as_text(self, fname, sort_by_word=True)
| Save :class:`~gensim.corpora.dictionary.Dictionary` to a text file.
|
| Parameters
| ----------
| fname : str
| Path to output file.
| sort_by_word : bool, optional
| Sort words in lexicographical order before writing them out?
|
| Notes
| -----
| Format::
|
| num_docs
| id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
| id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
| ....
| id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
|
| This text format is great for corpus inspection and debugging. As plaintext, it's also easily portable
| to other tools and frameworks. For better performance and to store the entire object state,
| including collected corpus statistics, use :meth:`~gensim.corpora.dictionary.Dictionary.save` and
| :meth:`~gensim.corpora.dictionary.Dictionary.load` instead.
|
| See Also
| --------
| :meth:`~gensim.corpora.dictionary.Dictionary.load_from_text`
| Load :class:`~gensim.corpora.dictionary.Dictionary` from text file.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>> from gensim.test.utils import get_tmpfile
| >>>
| >>> tmp_fname = get_tmpfile("dictionary")
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>>
| >>> dct = Dictionary(corpus)
| >>> dct.save_as_text(tmp_fname)
| >>>
| >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
| >>> assert dct.token2id == loaded_dct.token2id
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| from_corpus(corpus, id2word=None)
| Create :class:`~gensim.corpora.dictionary.Dictionary` from an existing corpus.
|
| Parameters
| ----------
| corpus : iterable of iterable of (int, number)
| Corpus in BoW format.
| id2word : dict of (int, object)
| Mapping id -> word. If None, the mapping `id2word[word_id] = str(word_id)` will be used.
|
| Notes
| -----
| This can be useful if you only have a term-document BOW matrix (represented by `corpus`), but not the original
| text corpus. This method will scan the term-document count matrix for all word ids that appear in it,
| then construct :class:`~gensim.corpora.dictionary.Dictionary` which maps each `word_id -> id2word[word_id]`.
| `id2word` is an optional dictionary that maps the `word_id` to a token.
| In case `id2word` isn't specified the mapping `id2word[word_id] = str(word_id)` will be used.
|
| Returns
| -------
| :class:`~gensim.corpora.dictionary.Dictionary`
| Inferred dictionary from corpus.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
| >>> dct = Dictionary.from_corpus(corpus)
| >>> len(dct)
| 3
|
| from_documents(documents)
| Create :class:`~gensim.corpora.dictionary.Dictionary` from `documents`.
|
| Equivalent to `Dictionary(documents=documents)`.
|
| Parameters
| ----------
| documents : iterable of iterable of str
| Input corpus.
|
| Returns
| -------
| :class:`~gensim.corpora.dictionary.Dictionary`
| Dictionary initialized from `documents`.
|
| load_from_text(fname)
| Load a previously stored :class:`~gensim.corpora.dictionary.Dictionary` from a text file.
|
| Mirror function to :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
|
| Parameters
| ----------
| fname: str
| Path to a file produced by :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
|
| See Also
| --------
| :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`
| Save :class:`~gensim.corpora.dictionary.Dictionary` to text file.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>> from gensim.test.utils import get_tmpfile
| >>>
| >>> tmp_fname = get_tmpfile("dictionary")
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>>
| >>> dct = Dictionary(corpus)
| >>> dct.save_as_text(tmp_fname)
| >>>
| >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
| >>> assert dct.token2id == loaded_dct.token2id
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset()
|
| ----------------------------------------------------------------------
| Methods inherited from gensim.utils.SaveLoad:
|
| save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
| Save the object to a file.
|
| Parameters
| ----------
| fname_or_handle : str or file-like
| Path to output file or already opened file-like object. If the object is a file handle,
| no special array handling will be performed, all attributes will be saved to the same file.
| separately : list of str or None, optional
| If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
| them into separate files. This prevent memory errors for large objects, and also allows
| `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
| loading and sharing the large arrays in RAM between multiple processes.
|
| If list of str: store these attributes into separate files. The automated size check
| is not performed in this case.
| sep_limit : int, optional
| Don't store arrays smaller than this separately. In bytes.
| ignore : frozenset of str, optional
| Attributes that shouldn't be stored at all.
| pickle_protocol : int, optional
| Protocol number for pickle.
|
| See Also
| --------
| :meth:`~gensim.utils.SaveLoad.load`
| Load object from file.
|
| ----------------------------------------------------------------------
| Class methods inherited from gensim.utils.SaveLoad:
|
| load(fname, mmap=None) from abc.ABCMeta
| Load an object previously saved using :meth:`~gensim.utils.SaveLoad.save` from a file.
|
| Parameters
| ----------
| fname : str
| Path to file that contains needed object.
| mmap : str, optional
| Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays
| via mmap (shared memory) using `mmap='r'.
| If the file being loaded is compressed (either '.gz' or '.bz2'), then `mmap=None` **must be** set.
|
| See Also
| --------
| :meth:`~gensim.utils.SaveLoad.save`
| Save object to file.
|
| Returns
| -------
| object
| Object loaded from `fname`.
|
| Raises
| ------
| AttributeError
| When called on an object instance instead of class (this is a class method).
|
| ----------------------------------------------------------------------
| Data descriptors inherited from gensim.utils.SaveLoad:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from collections.abc.Mapping:
|
| __contains__(self, key)
|
| __eq__(self, other)
| Return self==value.
|
| get(self, key, default=None)
| D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.
|
| items(self)
| D.items() -> a set-like object providing a view on D's items
|
| values(self)
| D.values() -> an object providing a view on D's values
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from collections.abc.Mapping:
|
| __hash__ = None
|
| __reversed__ = None
|
| ----------------------------------------------------------------------
| Class methods inherited from collections.abc.Collection:
|
| __subclasshook__(C) from abc.ABCMeta
| Abstract classes can override this to customize issubclass().
|
| This is invoked early on by abc.ABCMeta.__subclasscheck__().
| It should return True, False or NotImplemented. If it returns
| NotImplemented, the normal algorithm is used. Otherwise, it
| overrides the normal algorithm (and the outcome is cached).
help(corpora.Dictionary.doc2bow)
Help on function doc2bow in module gensim.corpora.dictionary:
doc2bow(self, document, allow_update=False, return_missing=False)
Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
Parameters
----------
document : list of str
Input document.
allow_update : bool, optional
Update self, by adding new tokens from `document` and updating internal corpus statistics.
return_missing : bool, optional
Return missing tokens (tokens present in `document` but not in self) with frequencies?
Return
------
list of (int, int)
BoW representation of `document`.
list of (int, int), dict of (str, int)
If `return_missing` is True, return BoW representation of `document` + dictionary with missing
tokens and their frequencies.
Examples
--------
.. sourcecode:: pycon
>>> from gensim.corpora import Dictionary
>>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
>>> dct.doc2bow(["this", "is", "máma"])
[(2, 1)]
>>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
([(2, 1)], {u'this': 1, u'is': 1})
Mining the latent semantic structure hidden in the corpus -> text vectors
The TF-IDF model
from gensim import models
tfidf = models.TfidfModel(corpus)  # fit IDF weights on the BoW corpus from above
doc_bow = [(0, 1), (1, 1), (2, 1)]  # BoW vector of the first toy document
print(tfidf[doc_bow])  # transform it into TF-IDF space
[(2, 1.0)]
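Why only token 2 survives: ids 0 ('a') and 1 ('b') occur in both documents, so their IDF is log2(2/2) = 0 and they are pruned; id 2 ('c') occurs in one of the two documents, and after normalizing the document vector to unit length its weight becomes 1.0. A short sketch applying the fitted model to the whole corpus:

# Transform every BoW document lazily; only document-specific tokens survive.
for doc in tfidf[corpus]:
    print(doc)
# Expected: [(2, 1.0)] and [(3, 1.0)] for the two toy documents.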
help(models.TfidfModel)
Help on class TfidfModel in module gensim.models.tfidfmodel:
class TfidfModel(gensim.interfaces.TransformationABC)
| TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fffc30e60d0>, wglobal=<function df2idf at 0x7fffc0973a60>, normalize=True, smartirs=None, pivot=None, slope=0.25)
|
| Objects of this class realize the transformation between word-document co-occurrence matrix (int)
| into a locally/globally weighted TF-IDF matrix (positive floats).
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> import gensim.downloader as api
| >>> from gensim.models import TfidfModel
| >>> from gensim.corpora import Dictionary
| >>>
| >>> dataset = api.load("text8")
| >>> dct = Dictionary(dataset) # fit dictionary
| >>> corpus = [dct.doc2bow(line) for line in dataset] # convert corpus to BoW format
| >>>
| >>> model = TfidfModel(corpus) # fit model
| >>> vector = model[corpus[0]] # apply model to the first corpus document
|
| Method resolution order:
| TfidfModel
| gensim.interfaces.TransformationABC
| gensim.utils.SaveLoad
| builtins.object
|
| Methods defined here:
|
| __getitem__(self, bow, eps=1e-12)
| Get the tf-idf representation of an input vector and/or corpus.
|
| bow : {list of (int, int), iterable of iterable of (int, int)}
| Input document in the `sparse Gensim bag-of-words format
| <https://radimrehurek.com/gensim/intro.html#core-concepts>`_,
| or a streamed corpus of such documents.
| eps : float
| Threshold value, will remove all position that have tfidf-value less than `eps`.
|
| Returns
| -------
| vector : list of (int, float)
| TfIdf vector, if `bow` is a single document
| :class:`~gensim.interfaces.TransformedCorpus`
| TfIdf corpus, if `bow` is a corpus.
|
| __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fffc30e60d0>, wglobal=<function df2idf at 0x7fffc0973a60>, normalize=True, smartirs=None, pivot=None, slope=0.25)
| Compute TF-IDF by multiplying a local component (term frequency) with a global component
| (inverse document frequency), and normalizing the resulting documents to unit length.
| Formula for non-normalized weight of term :math:`i` in document :math:`j` in a corpus of :math:`D` documents
|
| .. math:: weight_{i,j} = frequency_{i,j} * log_2 \frac{D}{document\_freq_{i}}
|
| or, more generally
|
| .. math:: weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document\_freq_{i}, D)
|
| so you can plug in your own custom :math:`wlocal` and :math:`wglobal` functions.
|
| Parameters
| ----------
| corpus : iterable of iterable of (int, int), optional
| Input corpus
| id2word : {dict, :class:`~gensim.corpora.Dictionary`}, optional
| Mapping token - id, that was used for converting input data to bag of words format.
| dictionary : :class:`~gensim.corpora.Dictionary`
| If `dictionary` is specified, it must be a `corpora.Dictionary` object and it will be used.
| to directly construct the inverse document frequency mapping (then `corpus`, if specified, is ignored).
| wlocals : callable, optional
| Function for local weighting, default for `wlocal` is :func:`~gensim.utils.identity`
| (other options: :func:`numpy.sqrt`, `lambda tf: 0.5 + (0.5 * tf / tf.max())`, etc.).
| wglobal : callable, optional
| Function for global weighting, default is :func:`~gensim.models.tfidfmodel.df2idf`.
| normalize : {bool, callable}, optional
| Normalize document vectors to unit euclidean length? You can also inject your own function into `normalize`.
| smartirs : str, optional
| SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System,
| a mnemonic scheme for denoting tf-idf weighting variants in the vector space model.
| The mnemonic for representing a combination of weights takes the form XYZ,
| for example 'ntc', 'bpn' and so on, where the letters represents the term weighting of the document vector.
|
| Term frequency weighing:
| * `b` - binary,
| * `t` or `n` - raw,
| * `a` - augmented,
| * `l` - logarithm,
| * `d` - double logarithm,
| * `L` - log average.
|
| Document frequency weighting:
| * `x` or `n` - none,
| * `f` - idf,
| * `t` - zero-corrected idf,
| * `p` - probabilistic idf.
|
| Document normalization:
| * `x` or `n` - none,
| * `c` - cosine,
| * `u` - pivoted unique,
| * `b` - pivoted character length.
|
| Default is 'nfc'.
| For more information visit `SMART Information Retrieval System
| <https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System>`_.
| pivot : float or None, optional
| In information retrieval, TF-IDF is biased against long documents [1]_. Pivoted document length
| normalization solves this problem by changing the norm of a document to `slope * old_norm + (1.0 -
| slope) * pivot`.
|
| You can either set the `pivot` by hand, or you can let Gensim figure it out automatically with the following
| two steps:
|
| * Set either the `u` or `b` document normalization in the `smartirs` parameter.
| * Set either the `corpus` or `dictionary` parameter. The `pivot` will be automatically determined from
| the properties of the `corpus` or `dictionary`.
|
| If `pivot` is None and you don't follow steps 1 and 2, then pivoted document length normalization will be
| disabled. Default is None.
|
| See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
| slope : float, optional
| In information retrieval, TF-IDF is biased against long documents [1]_. Pivoted document length
| normalization solves this problem by changing the norm of a document to `slope * old_norm + (1.0 -
| slope) * pivot`.
|
| Setting the `slope` to 0.0 uses only the `pivot` as the norm, and setting the `slope` to 1.0 effectively
| disables pivoted document length normalization. Singhal [2]_ suggests setting the `slope` between 0.2 and
| 0.3 for best results. Default is 0.25.
|
| See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
|
| See Also
| --------
| ~gensim.sklearn_api.tfidf.TfIdfTransformer : Class that also uses the SMART scheme.
| resolve_weights : Function that also uses the SMART scheme.
|
| References
| ----------
| .. [1] Singhal, A., Buckley, C., & Mitra, M. (1996). `Pivoted Document Length
| Normalization <http://singhal.info/pivoted-dln.pdf>`_. *SIGIR Forum*, 51, 176–184.
| .. [2] Singhal, A. (2001). `Modern information retrieval: A brief overview <http://singhal.info/ieee2001.pdf>`_.
| *IEEE Data Eng. Bull.*, 24(4), 35–43.
|
| __str__(self)
| Return str(self).
|
| initialize(self, corpus)
| Compute inverse document weights, which will be used to modify term frequencies for documents.
|
| Parameters
| ----------
| corpus : iterable of iterable of (int, int)
| Input corpus.
|
| ----------------------------------------------------------------------
| Class methods defined here:
|
| load(*args, **kwargs) from builtins.type
| Load a previously saved TfidfModel class. Handles backwards compatibility from
| older TfidfModel versions which did not use pivoted document normalization.
|
| ----------------------------------------------------------------------
| Methods inherited from gensim.utils.SaveLoad:
|
| save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
| Save the object to a file.
|
| Parameters
| ----------
| fname_or_handle : str or file-like
| Path to output file or already opened file-like object. If the object is a file handle,
| no special array handling will be performed, all attributes will be saved to the same file.
| separately : list of str or None, optional
| If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
| them into separate files. This prevent memory errors for large objects, and also allows
| `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
| loading and sharing the large arrays in RAM between multiple processes.
|
| If list of str: store these attributes into separate files. The automated size check
| is not performed in this case.
| sep_limit : int, optional
| Don't store arrays smaller than this separately. In bytes.
| ignore : frozenset of str, optional
| Attributes that shouldn't be stored at all.
| pickle_protocol : int, optional
| Protocol number for pickle.
|
| See Also
| --------
| :meth:`~gensim.utils.SaveLoad.load`
| Load object from file.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from gensim.utils.SaveLoad:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
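The SMART mnemonics described in the help text above can be selected through the smartirs parameter. As a sketch (reusing the corpus and doc_bow from the cells above; the 'lfc' choice is illustrative), 'lfc' combines logarithmic term frequency, idf document weighting, and cosine normalization:

# 'l' = log tf, 'f' = idf, 'c' = cosine document normalization.
tfidf_lfc = models.TfidfModel(corpus, smartirs='lfc')
print(tfidf_lfc[doc_bow])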
from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
from gensim.models import LsiModel
model = LsiModel(common_corpus[:3], id2word=common_dictionary) # train model
vector = model[common_corpus[4]] # apply model to BoW document
model.add_documents(common_corpus[4:]) # update model with new documents
tmp_fname = get_tmpfile("lsi.model")
model.save(tmp_fname) # save model
loaded_model = LsiModel.load(tmp_fname) # load model
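A quick sketch of inspecting what was learned, using the model and vector from the cell above:

# Top words per latent topic, as formatted strings.
for topic_id, topic in model.print_topics(num_topics=2, num_words=5):
    print(topic_id, topic)
print(vector)  # the BoW document projected into the latent (LSI) space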
help(common_corpus)
Help on list object:
(common_corpus is a plain Python list of BoW documents, so this prints the standard built-in list documentation; omitted here.)
from pprint import pprint
pprint(common_corpus)
[[(0, 1), (1, 1), (2, 1)],
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
[(2, 1), (5, 1), (7, 1), (8, 1)],
[(1, 1), (5, 2), (8, 1)],
[(3, 1), (6, 1), (7, 1)],
[(9, 1)],
[(9, 1), (10, 1)],
[(9, 1), (10, 1), (11, 1)],
[(4, 1), (10, 1), (11, 1)]]
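Each (token_id, count) pair can be mapped back to a readable token through common_dictionary; a quick sketch:

# Decode the first BoW document back into tokens.
readable = [(common_dictionary[token_id], count)
            for token_id, count in common_corpus[0]]
print(readable)  # e.g. [('computer', 1), ('human', 1), ('interface', 1)]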
help(common_dictionary)
Help on Dictionary in module gensim.corpora.dictionary object:
(identical to the help(corpora.Dictionary) output shown earlier; omitted here.)
help(get_tmpfile)
Help on function get_tmpfile in module gensim.test.utils:
get_tmpfile(suffix)
Get full path to file `suffix` in temporary folder.
This function doesn't creates file (only generate unique name).
Also, it may return different paths in consecutive calling.
Parameters
----------
suffix : str
Suffix of file.
Returns
-------
str
Path to `suffix` file in temporary folder.
Examples
--------
Using this function we may get path to temporary file and use it, for example, to store temporary model.
.. sourcecode:: pycon
>>> from gensim.models import LsiModel
>>> from gensim.test.utils import get_tmpfile, common_dictionary, common_corpus
>>>
>>> tmp_f = get_tmpfile("toy_lsi_model")
>>>
>>> model = LsiModel(common_corpus, id2word=common_dictionary)
>>> model.save(tmp_f)
>>>
>>> loaded_model = LsiModel.load(tmp_f)
help(models.LsiModel)
Help on class LsiModel in module gensim.models.lsimodel:
class LsiModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)
| LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
|
| Model for `Latent Semantic Indexing
| <https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing>`_.
|
| The decomposition algorithm is described in `"Fast and Faster: A Comparison of Two Streamed
| Matrix Decomposition Algorithms" <https://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_.
|
| Notes
| -----
| * :attr:`gensim.models.lsimodel.LsiModel.projection.u` - left singular vectors,
| * :attr:`gensim.models.lsimodel.LsiModel.projection.s` - singular values,
| * ``model[training_corpus]`` - right singular vectors (can be reconstructed if needed).
|
| See Also
| --------
| `FAQ about LSI matrices
| <https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q4-how-do-you-output-the-u-s-vt-matrices-of-lsi>`_.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
| >>> from gensim.models import LsiModel
| >>>
| >>> model = LsiModel(common_corpus[:3], id2word=common_dictionary) # train model
| >>> vector = model[common_corpus[4]] # apply model to BoW document
| >>> model.add_documents(common_corpus[4:]) # update model with new documents
| >>> tmp_fname = get_tmpfile("lsi.model")
| >>> model.save(tmp_fname) # save model
| >>> loaded_model = LsiModel.load(tmp_fname) # load model
|
| Method resolution order:
| LsiModel
| gensim.interfaces.TransformationABC
| gensim.utils.SaveLoad
| gensim.models.basemodel.BaseTopicModel
| builtins.object
|
| Methods defined here:
|
| __getitem__(self, bow, scaled=False, chunksize=512)
| Get the latent representation for `bow`.
|
| Parameters
| ----------
| bow : {list of (int, int), iterable of list of (int, int)}
| Document or corpus in BoW representation.
| scaled : bool, optional
| If True - topics will be scaled by the inverse of singular values.
| chunksize : int, optional
| Number of documents to be used in each applying chunk.
|
| Returns
| -------
| list of (int, float)
| Latent representation of topics in BoW format for document **OR**
| :class:`gensim.matutils.Dense2Corpus`
| Latent representation of corpus in BoW format if `bow` is corpus.
|
| __init__(self, corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
| Construct an `LsiModel` object.
|
| Either `corpus` or `id2word` must be supplied in order to train the model.
|
| Parameters
| ----------
| corpus : {iterable of list of (int, float), scipy.sparse.csc}, optional
| Stream of document vectors or sparse matrix of shape (`num_documents`, `num_terms`).
| num_topics : int, optional
| Number of requested factors (latent dimensions)
| id2word : dict of {int: str}, optional
| ID to word mapping, optional.
| chunksize : int, optional
| Number of documents to be used in each training chunk.
| decay : float, optional
| Weight of existing observations relatively to new ones.
| distributed : bool, optional
| If True - distributed mode (parallel execution on several machines) will be used.
| onepass : bool, optional
| Whether the one-pass algorithm should be used for training.
| Pass `False` to force a multi-pass stochastic algorithm.
| power_iters: int, optional
| Number of power iteration steps to be used.
| Increasing the number of power iterations improves accuracy, but lowers performance
| extra_samples : int, optional
| Extra samples to be used besides the rank `k`. Can improve accuracy.
| dtype : type, optional
| Enforces a type for elements of the decomposed matrix.
|
| __str__(self)
| Get a human readable representation of model.
|
| Returns
| -------
| str
| A human readable string of the current objects parameters.
|
| add_documents(self, corpus, chunksize=None, decay=None)
| Update model with new `corpus`.
|
| Parameters
| ----------
| corpus : {iterable of list of (int, float), scipy.sparse.csc}
| Stream of document vectors or sparse matrix of shape (`num_terms`, num_documents).
| chunksize : int, optional
| Number of documents to be used in each training chunk, will use `self.chunksize` if not specified.
| decay : float, optional
| Weight of existing observations relatively to new ones, will use `self.decay` if not specified.
|
| Notes
| -----
| Training proceeds in chunks of `chunksize` documents at a time. The size of `chunksize` is a tradeoff
| between increased speed (bigger `chunksize`) vs. lower memory footprint (smaller `chunksize`).
| If the distributed mode is on, each chunk is sent to a different worker/computer.
|
| get_topics(self)
| Get the topic vectors.
|
| Notes
| -----
| The number of topics can actually be smaller than `self.num_topics`, if there were not enough factors
| in the matrix (real rank of input matrix smaller than `self.num_topics`).
|
| Returns
| -------
| np.ndarray
| The term topic matrix with shape (`num_topics`, `vocabulary_size`)
|
| print_debug(self, num_topics=5, num_words=10)
| Print (to log) the most salient words of the first `num_topics` topics.
|
| Unlike :meth:`~gensim.models.lsimodel.LsiModel.print_topics`, this looks for words that are significant for
| a particular topic *and* not for others. This *should* result in a
| more human-interpretable description of topics.
|
| Alias for :func:`~gensim.models.lsimodel.print_debug`.
|
| Parameters
| ----------
| num_topics : int, optional
| The number of topics to be selected (ordered by significance).
| num_words : int, optional
| The number of words to be included per topics (ordered by significance).
|
| save(self, fname, *args, **kwargs)
| Save the model to a file.
|
| Notes
| -----
| Large internal arrays may be stored into separate files, with `fname` as prefix.
|
| Warnings
| --------
| Do not save as a compressed file if you intend to load the file back with `mmap`.
|
| Parameters
| ----------
| fname : str
| Path to output file.
| *args
| Variable length argument list, see :meth:`gensim.utils.SaveLoad.save`.
| **kwargs
| Arbitrary keyword arguments, see :meth:`gensim.utils.SaveLoad.save`.
|
| See Also
| --------
| :meth:`~gensim.models.lsimodel.LsiModel.load`
|
| show_topic(self, topicno, topn=10)
| Get the words that define a topic along with their contribution.
|
| This is actually the left singular vector of the specified topic.
|
| The most important words in defining the topic (greatest absolute value) are included
| in the output, along with their contribution to the topic.
|
| Parameters
| ----------
| topicno : int
| The topics id number.
| topn : int
| Number of words to be included to the result.
|
| Returns
| -------
| list of (str, float)
| Topic representation in BoW format.
|
| show_topics(self, num_topics=-1, num_words=10, log=False, formatted=True)
| Get the most significant topics.
|
| Parameters
| ----------
| num_topics : int, optional
| The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
| num_words : int, optional
| The number of words to be included per topics (ordered by significance).
| log : bool, optional
| If True - log topics with logger.
| formatted : bool, optional
| If True - each topic represented as string, otherwise - in BoW format.
|
| Returns
| -------
| list of (int, str)
| If `formatted=True`, return sequence with (topic_id, string representation of topics) **OR**
| list of (int, list of (str, float))
| Otherwise, return sequence with (topic_id, [(word, value), ... ]).
|
| ----------------------------------------------------------------------
| Class methods defined here:
|
| load(fname, *args, **kwargs) from builtins.type
| Load a previously saved object using :meth:`~gensim.models.lsimodel.LsiModel.save` from file.
|
| Notes
| -----
| Large arrays can be memmap'ed back as read-only (shared memory) by setting the `mmap='r'` parameter.
|
| Parameters
| ----------
| fname : str
| Path to file that contains LsiModel.
| *args
| Variable length argument list, see :meth:`gensim.utils.SaveLoad.load`.
| **kwargs
| Arbitrary keyword arguments, see :meth:`gensim.utils.SaveLoad.load`.
|
| See Also
| --------
| :meth:`~gensim.models.lsimodel.LsiModel.save`
|
| Returns
| -------
| :class:`~gensim.models.lsimodel.LsiModel`
| Loaded instance.
|
| Raises
| ------
| IOError
| When methods are called on instance (should be called from class).
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __slotnames__ = []
|
| ----------------------------------------------------------------------
| Data descriptors inherited from gensim.utils.SaveLoad:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from gensim.models.basemodel.BaseTopicModel:
|
| print_topic(self, topicno, topn=10)
| Get a single topic as a formatted string.
|
| Parameters
| ----------
| topicno : int
| Topic id.
| topn : int
| Number of words from topic that will be used.
|
| Returns
| -------
| str
| String representation of topic, like '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + ... '.
|
| print_topics(self, num_topics=20, num_words=10)
| Get the most significant topics (alias for `show_topics()` method).
|
| Parameters
| ----------
| num_topics : int, optional
| The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
| num_words : int, optional
| The number of words to be included per topics (ordered by significance).
|
| Returns
| -------
| list of (int, list of (str, float))
| Sequence with (topic_id, [(word, value), ... ]).