Gensim learns latent topic-vector representations of documents from raw, unstructured text, in an unsupervised fashion.
It supports topic-modeling algorithms including LDA, TF-IDF, LSA, and word2vec.
!pip install gensim
!conda list | grep gensim
gensim 3.8.3 <pip>
!pip install PyHamcrest
!pip show PyHamcrest
import gensim
gensim.__version__
'3.8.3'
Raw strings -> sparse vectors
Raw text -> tokenization, stop-word removal, etc. -> list of document features
Bag-of-words model: each document feature is a word
from gensim import corpora
texts = [['a', 'b', 'c'],
         ['a', 'd', 'b']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words via doc2bow
print(corpus)
print()
print(corpus[0])
print(corpus[1])
[[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (3, 1)]]
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (1, 1), (3, 1)]
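To connect this toy example back to the pipeline above (raw strings -> tokenization and stop-word removal -> document features -> BoW), here is a minimal end-to-end sketch; the sample sentences and the use of simple_preprocess/STOPWORDS are illustrative choices, not part of the original cells:

from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

raw_docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
]
# Tokenize and lowercase, then drop stop words.
tokenized = [[tok for tok in simple_preprocess(doc) if tok not in STOPWORDS]
             for doc in raw_docs]
dct = corpora.Dictionary(tokenized)                  # token <-> id mapping
bow_docs = [dct.doc2bow(doc) for doc in tokenized]   # sparse (id, count) vectors
print(bow_docs)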
help(corpora.Dictionary)
Help on class Dictionary in module gensim.corpora.dictionary:
class Dictionary(gensim.utils.SaveLoad, collections.abc.Mapping)
| Dictionary(documents=None, prune_at=2000000)
|
| Dictionary encapsulates the mapping between normalized words and their integer ids.
|
| Notable instance attributes:
|
| Attributes
| ----------
| token2id : dict of (str, int)
| token -> tokenId.
| id2token : dict of (int, str)
| Reverse mapping for token2id, initialized in a lazy manner to save memory (not created until needed).
| cfs : dict of (int, int)
| Collection frequencies: token_id -> how many instances of this token are contained in the documents.
| dfs : dict of (int, int)
| Document frequencies: token_id -> how many documents contain this token.
| num_docs : int
| Number of documents processed.
| num_pos : int
| Total number of corpus positions (number of processed words).
| num_nnz : int
| Total number of non-zeroes in the BOW matrix (sum of the number of unique
| words per document over the entire corpus).
|
| Method resolution order:
| Dictionary
| gensim.utils.SaveLoad
| collections.abc.Mapping
| collections.abc.Collection
| collections.abc.Sized
| collections.abc.Iterable
| collections.abc.Container
| builtins.object
|
| Methods defined here:
|
| __getitem__(self, tokenid)
| Get the string token that corresponds to `tokenid`.
|
| Parameters
| ----------
| tokenid : int
| Id of token.
|
| Returns
| -------
| str
| Token corresponding to `tokenid`.
|
| Raises
| ------
| KeyError
| If this Dictionary doesn't contain such `tokenid`.
|
| __init__(self, documents=None, prune_at=2000000)
| Parameters
| ----------
| documents : iterable of iterable of str, optional
| Documents to be used to initialize the mapping and collect corpus statistics.
| prune_at : int, optional
| Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
| footprint, the correctness is not guaranteed.
| Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> texts = [['human', 'interface', 'computer']]
| >>> dct = Dictionary(texts) # initialize a Dictionary
| >>> dct.add_documents([["cat", "say", "meow"], ["dog"]]) # add more document (extend the vocabulary)
| >>> dct.doc2bow(["dog", "computer", "non_existent_word"])
| [(0, 1), (6, 1)]
|
| __iter__(self)
| Iterate over all tokens.
|
| __len__(self)
| Get number of stored tokens.
|
| Returns
| -------
| int
| Number of stored tokens.
|
| __str__(self)
| Return str(self).
|
| add_documents(self, documents, prune_at=2000000)
| Update dictionary from a collection of `documents`.
|
| Parameters
| ----------
| documents : iterable of iterable of str
| Input corpus. All tokens should be already **tokenized and normalized**.
| prune_at : int, optional
| Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
| footprint, the correctness is not guaranteed.
| Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = ["máma mele maso".split(), "ema má máma".split()]
| >>> dct = Dictionary(corpus)
| >>> len(dct)
| 5
| >>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
| >>> len(dct)
| 10
|
| compactify(self)
| Assign new word ids to all words, shrinking any gaps.
|
| doc2bow(self, document, allow_update=False, return_missing=False)
| Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
|
| Parameters
| ----------
| document : list of str
| Input document.
| allow_update : bool, optional
| Update self, by adding new tokens from `document` and updating internal corpus statistics.
| return_missing : bool, optional
| Return missing tokens (tokens present in `document` but not in self) with frequencies?
|
| Return
| ------
| list of (int, int)
| BoW representation of `document`.
| list of (int, int), dict of (str, int)
| If `return_missing` is True, return BoW representation of `document` + dictionary with missing
| tokens and their frequencies.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
| >>> dct.doc2bow(["this", "is", "máma"])
| [(2, 1)]
| >>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
| ([(2, 1)], {u'this': 1, u'is': 1})
|
| doc2idx(self, document, unknown_word_index=-1)
| Convert `document` (a list of words) into a list of indexes = list of `token_id`.
| Replace all unknown words i.e, words not in the dictionary with the index as set via `unknown_word_index`.
|
| Parameters
| ----------
| document : list of str
| Input document
| unknown_word_index : int, optional
| Index to use for words not in the dictionary.
|
| Returns
| -------
| list of int
| Token ids for tokens in `document`, in the same order.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["a", "a", "b"], ["a", "c"]]
| >>> dct = Dictionary(corpus)
| >>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
| [0, 0, 2, -1, 2]
|
| filter_extremes(self, no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
| Filter out tokens in the dictionary by their frequency.
|
| Parameters
| ----------
| no_below : int, optional
| Keep tokens which are contained in at least `no_below` documents.
| no_above : float, optional
| Keep tokens which are contained in no more than `no_above` documents
| (fraction of total corpus size, not an absolute number).
| keep_n : int, optional
| Keep only the first `keep_n` most frequent tokens.
| keep_tokens : iterable of str
| Iterable of tokens that **must** stay in dictionary after filtering.
|
| Notes
| -----
| This removes all tokens in the dictionary that are:
|
| #. Less frequent than `no_below` documents (absolute number, e.g. `5`) or
|
| #. More frequent than `no_above` documents (fraction of the total corpus size, e.g. `0.3`).
| #. After (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if `keep_n=None`).
|
| After the pruning, resulting gaps in word ids are shrunk.
| Due to this gap shrinking, **the same word may have a different word id before and after the call
| to this function!**
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>> dct = Dictionary(corpus)
| >>> len(dct)
| 5
| >>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
| >>> len(dct)
| 1
|
| filter_n_most_frequent(self, remove_n)
| Filter out the 'remove_n' most frequent tokens that appear in the documents.
|
| Parameters
| ----------
| remove_n : int
| Number of the most frequent tokens that will be removed.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>> dct = Dictionary(corpus)
| >>> len(dct)
| 5
| >>> dct.filter_n_most_frequent(2)
| >>> len(dct)
| 3
|
| filter_tokens(self, bad_ids=None, good_ids=None)
| Remove the selected `bad_ids` tokens from :class:`~gensim.corpora.dictionary.Dictionary`.
|
| Alternatively, keep selected `good_ids` in :class:`~gensim.corpora.dictionary.Dictionary` and remove the rest.
|
| Parameters
| ----------
| bad_ids : iterable of int, optional
| Collection of word ids to be removed.
| good_ids : collection of int, optional
| Keep selected collection of word ids and remove the rest.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>> dct = Dictionary(corpus)
| >>> 'ema' in dct.token2id
| True
| >>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
| >>> 'ema' in dct.token2id
| False
| >>> len(dct)
| 4
| >>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
| >>> len(dct)
| 1
|
| iteritems(self)
|
| iterkeys = __iter__(self)
|
| itervalues(self)
|
| keys(self)
| Get all stored ids.
|
| Returns
| -------
| list of int
| List of all token ids.
|
| merge_with(self, other)
| Merge another dictionary into this dictionary, mapping the same tokens to the same ids
| and new tokens to new ids.
|
| Notes
| -----
| The purpose is to merge two corpora created using two different dictionaries: `self` and `other`.
| `other` can be any id=>word mapping (a dict, a Dictionary object, ...).
|
| Return a transformation object which, when accessed as `result[doc_from_other_corpus]`, will convert documents
| from a corpus built using the `other` dictionary into a document using the new, merged dictionary.
|
| Parameters
| ----------
| other : {dict, :class:`~gensim.corpora.dictionary.Dictionary`}
| Other dictionary.
|
| Return
| ------
| :class:`gensim.models.VocabTransform`
| Transformation object.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
| >>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
| >>> dct_1.doc2bow(corpus_2[0])
| [(0, 1)]
| >>> transformer = dct_1.merge_with(dct_2)
| >>> dct_1.doc2bow(corpus_2[0])
| [(0, 1), (3, 2)]
|
| patch_with_special_tokens(self, special_token_dict)
| Patch token2id and id2token using a dictionary of special tokens.
|
|
| **Usecase:** when doing sequence modeling (e.g. named entity recognition), one may want to specify
| special tokens that behave differently than others.
| One example is the "unknown" token, and another is the padding token.
| It is usual to set the padding token to have index `0`, and patching the dictionary with `{'<PAD>': 0}`
| would be one way to specify this.
|
| Parameters
| ----------
| special_token_dict : dict of (str, int)
| dict containing the special tokens as keys and their wanted indices as values.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>> dct = Dictionary(corpus)
| >>>
| >>> special_tokens = {'pad': 0, 'space': 1}
| >>> print(dct.token2id)
| {'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
| >>>
| >>> dct.patch_with_special_tokens(special_tokens)
| >>> print(dct.token2id)
| {'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
|
| save_as_text(self, fname, sort_by_word=True)
| Save :class:`~gensim.corpora.dictionary.Dictionary` to a text file.
|
| Parameters
| ----------
| fname : str
| Path to output file.
| sort_by_word : bool, optional
| Sort words in lexicographical order before writing them out?
|
| Notes
| -----
| Format::
|
| num_docs
| id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
| id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
| ....
| id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
|
| This text format is great for corpus inspection and debugging. As plaintext, it's also easily portable
| to other tools and frameworks. For better performance and to store the entire object state,
| including collected corpus statistics, use :meth:`~gensim.corpora.dictionary.Dictionary.save` and
| :meth:`~gensim.corpora.dictionary.Dictionary.load` instead.
|
| See Also
| --------
| :meth:`~gensim.corpora.dictionary.Dictionary.load_from_text`
| Load :class:`~gensim.corpora.dictionary.Dictionary` from text file.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>> from gensim.test.utils import get_tmpfile
| >>>
| >>> tmp_fname = get_tmpfile("dictionary")
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>>
| >>> dct = Dictionary(corpus)
| >>> dct.save_as_text(tmp_fname)
| >>>
| >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
| >>> assert dct.token2id == loaded_dct.token2id
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| from_corpus(corpus, id2word=None)
| Create :class:`~gensim.corpora.dictionary.Dictionary` from an existing corpus.
|
| Parameters
| ----------
| corpus : iterable of iterable of (int, number)
| Corpus in BoW format.
| id2word : dict of (int, object)
| Mapping id -> word. If None, the mapping `id2word[word_id] = str(word_id)` will be used.
|
| Notes
| -----
| This can be useful if you only have a term-document BOW matrix (represented by `corpus`), but not the original
| text corpus. This method will scan the term-document count matrix for all word ids that appear in it,
| then construct :class:`~gensim.corpora.dictionary.Dictionary` which maps each `word_id -> id2word[word_id]`.
| `id2word` is an optional dictionary that maps the `word_id` to a token.
| In case `id2word` isn't specified the mapping `id2word[word_id] = str(word_id)` will be used.
|
| Returns
| -------
| :class:`~gensim.corpora.dictionary.Dictionary`
| Inferred dictionary from corpus.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>>
| >>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
| >>> dct = Dictionary.from_corpus(corpus)
| >>> len(dct)
| 3
|
| from_documents(documents)
| Create :class:`~gensim.corpora.dictionary.Dictionary` from `documents`.
|
| Equivalent to `Dictionary(documents=documents)`.
|
| Parameters
| ----------
| documents : iterable of iterable of str
| Input corpus.
|
| Returns
| -------
| :class:`~gensim.corpora.dictionary.Dictionary`
| Dictionary initialized from `documents`.
|
| load_from_text(fname)
| Load a previously stored :class:`~gensim.corpora.dictionary.Dictionary` from a text file.
|
| Mirror function to :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
|
| Parameters
| ----------
| fname: str
| Path to a file produced by :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
|
| See Also
| --------
| :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`
| Save :class:`~gensim.corpora.dictionary.Dictionary` to text file.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.corpora import Dictionary
| >>> from gensim.test.utils import get_tmpfile
| >>>
| >>> tmp_fname = get_tmpfile("dictionary")
| >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
| >>>
| >>> dct = Dictionary(corpus)
| >>> dct.save_as_text(tmp_fname)
| >>>
| >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
| >>> assert dct.token2id == loaded_dct.token2id
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset()
|
| ----------------------------------------------------------------------
| Methods inherited from gensim.utils.SaveLoad:
|
| save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
| Save the object to a file.
|
| Parameters
| ----------
| fname_or_handle : str or file-like
| Path to output file or already opened file-like object. If the object is a file handle,
| no special array handling will be performed, all attributes will be saved to the same file.
| separately : list of str or None, optional
| If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
| them into separate files. This prevent memory errors for large objects, and also allows
| `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
| loading and sharing the large arrays in RAM between multiple processes.
|
| If list of str: store these attributes into separate files. The automated size check
| is not performed in this case.
| sep_limit : int, optional
| Don't store arrays smaller than this separately. In bytes.
| ignore : frozenset of str, optional
| Attributes that shouldn't be stored at all.
| pickle_protocol : int, optional
| Protocol number for pickle.
|
| See Also
| --------
| :meth:`~gensim.utils.SaveLoad.load`
| Load object from file.
|
| ----------------------------------------------------------------------
| Class methods inherited from gensim.utils.SaveLoad:
|
| load(fname, mmap=None) from abc.ABCMeta
| Load an object previously saved using :meth:`~gensim.utils.SaveLoad.save` from a file.
|
| Parameters
| ----------
| fname : str
| Path to file that contains needed object.
| mmap : str, optional
| Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays
| via mmap (shared memory) using `mmap='r'.
| If the file being loaded is compressed (either '.gz' or '.bz2'), then `mmap=None` **must be** set.
|
| See Also
| --------
| :meth:`~gensim.utils.SaveLoad.save`
| Save object to file.
|
| Returns
| -------
| object
| Object loaded from `fname`.
|
| Raises
| ------
| AttributeError
| When called on an object instance instead of class (this is a class method).
|
| ----------------------------------------------------------------------
| Data descriptors inherited from gensim.utils.SaveLoad:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from collections.abc.Mapping:
|
| __contains__(self, key)
|
| __eq__(self, other)
| Return self==value.
|
| get(self, key, default=None)
| D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.
|
| items(self)
| D.items() -> a set-like object providing a view on D's items
|
| values(self)
| D.values() -> an object providing a view on D's values
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from collections.abc.Mapping:
|
| __hash__ = None
|
| __reversed__ = None
|
| ----------------------------------------------------------------------
| Class methods inherited from collections.abc.Collection:
|
| __subclasshook__(C) from abc.ABCMeta
| Abstract classes can override this to customize issubclass().
|
| This is invoked early on by abc.ABCMeta.__subclasscheck__().
| It should return True, False or NotImplemented. If it returns
| NotImplemented, the normal algorithm is used. Otherwise, it
| overrides the normal algorithm (and the outcome is cached).
help(corpora.Dictionary.doc2bow)
Help on function doc2bow in module gensim.corpora.dictionary:
doc2bow(self, document, allow_update=False, return_missing=False)
Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
Parameters
----------
document : list of str
Input document.
allow_update : bool, optional
Update self, by adding new tokens from `document` and updating internal corpus statistics.
return_missing : bool, optional
Return missing tokens (tokens present in `document` but not in self) with frequencies?
Return
------
list of (int, int)
BoW representation of `document`.
list of (int, int), dict of (str, int)
If `return_missing` is True, return BoW representation of `document` + dictionary with missing
tokens and their frequencies.
Examples
--------
.. sourcecode:: pycon
>>> from gensim.corpora import Dictionary
>>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
>>> dct.doc2bow(["this", "is", "máma"])
[(2, 1)]
>>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
([(2, 1)], {u'this': 1, u'is': 1})
Mining the latent semantic structure hidden in the corpus -> text vectors
The TF-IDF model
from gensim import models
tfidf = models.TfidfModel(corpus)  # fit IDF weights on the BoW corpus from above
doc_bow = [(0, 1), (1, 1), (2, 1)]  # BoW vector of the first toy document
print(tfidf[doc_bow])  # transform it into TF-IDF space
[(2, 1.0)]
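Why only token 2 survives: ids 0 ('a') and 1 ('b') occur in both documents, so their IDF is log2(2/2) = 0 and they are pruned; id 2 ('c') occurs in one of the two documents, and after normalizing the document vector to unit length its weight becomes 1.0. A short sketch applying the fitted model to the whole corpus:

# Transform every BoW document lazily; only document-specific tokens survive.
for doc in tfidf[corpus]:
    print(doc)
# Expected: [(2, 1.0)] and [(3, 1.0)] for the two toy documents.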
help(models.TfidfModel)
Help on class TfidfModel in module gensim.models.tfidfmodel:
class TfidfModel(gensim.interfaces.TransformationABC)
| TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fffc30e60d0>, wglobal=<function df2idf at 0x7fffc0973a60>, normalize=True, smartirs=None, pivot=None, slope=0.25)
|
| Objects of this class realize the transformation between word-document co-occurrence matrix (int)
| into a locally/globally weighted TF-IDF matrix (positive floats).
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> import gensim.downloader as api
| >>> from gensim.models import TfidfModel
| >>> from gensim.corpora import Dictionary
| >>>
| >>> dataset = api.load("text8")
| >>> dct = Dictionary(dataset) # fit dictionary
| >>> corpus = [dct.doc2bow(line) for line in dataset] # convert corpus to BoW format
| >>>
| >>> model = TfidfModel(corpus) # fit model
| >>> vector = model[corpus[0]] # apply model to the first corpus document
|
| Method resolution order:
| TfidfModel
| gensim.interfaces.TransformationABC
| gensim.utils.SaveLoad
| builtins.object
|
| Methods defined here:
|
| __getitem__(self, bow, eps=1e-12)
| Get the tf-idf representation of an input vector and/or corpus.
|
| bow : {list of (int, int), iterable of iterable of (int, int)}
| Input document in the `sparse Gensim bag-of-words format
| <https://radimrehurek.com/gensim/intro.html#core-concepts>`_,
| or a streamed corpus of such documents.
| eps : float
| Threshold value, will remove all position that have tfidf-value less than `eps`.
|
| Returns
| -------
| vector : list of (int, float)
| TfIdf vector, if `bow` is a single document
| :class:`~gensim.interfaces.TransformedCorpus`
| TfIdf corpus, if `bow` is a corpus.
|
| __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fffc30e60d0>, wglobal=<function df2idf at 0x7fffc0973a60>, normalize=True, smartirs=None, pivot=None, slope=0.25)
| Compute TF-IDF by multiplying a local component (term frequency) with a global component
| (inverse document frequency), and normalizing the resulting documents to unit length.
| Formula for non-normalized weight of term :math:`i` in document :math:`j` in a corpus of :math:`D` documents
|
| .. math:: weight_{i,j} = frequency_{i,j} * log_2 \frac{D}{document\_freq_{i}}
|
| or, more generally
|
| .. math:: weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document\_freq_{i}, D)
|
| so you can plug in your own custom :math:`wlocal` and :math:`wglobal` functions.
|
| Parameters
| ----------
| corpus : iterable of iterable of (int, int), optional
| Input corpus
| id2word : {dict, :class:`~gensim.corpora.Dictionary`}, optional
| Mapping token - id, that was used for converting input data to bag of words format.
| dictionary : :class:`~gensim.corpora.Dictionary`
| If `dictionary` is specified, it must be a `corpora.Dictionary` object and it will be used.
| to directly construct the inverse document frequency mapping (then `corpus`, if specified, is ignored).
| wlocals : callable, optional
| Function for local weighting, default for `wlocal` is :func:`~gensim.utils.identity`
| (other options: :func:`numpy.sqrt`, `lambda tf: 0.5 + (0.5 * tf / tf.max())`, etc.).
| wglobal : callable, optional
| Function for global weighting, default is :func:`~gensim.models.tfidfmodel.df2idf`.
| normalize : {bool, callable}, optional
| Normalize document vectors to unit euclidean length? You can also inject your own function into `normalize`.
| smartirs : str, optional
| SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System,
| a mnemonic scheme for denoting tf-idf weighting variants in the vector space model.
| The mnemonic for representing a combination of weights takes the form XYZ,
| for example 'ntc', 'bpn' and so on, where the letters represents the term weighting of the document vector.
|
| Term frequency weighing:
| * `b` - binary,
| * `t` or `n` - raw,
| * `a` - augmented,
| * `l` - logarithm,
| * `d` - double logarithm,
| * `L` - log average.
|
| Document frequency weighting:
| * `x` or `n` - none,
| * `f` - idf,
| * `t` - zero-corrected idf,
| * `p` - probabilistic idf.
|
| Document normalization:
| * `x` or `n` - none,
| * `c` - cosine,
| * `u` - pivoted unique,
| * `b` - pivoted character length.
|
| Default is 'nfc'.
| For more information visit `SMART Information Retrieval System
| <https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System>`_.
| pivot : float or None, optional
| In information retrieval, TF-IDF is biased against long documents [1]_. Pivoted document length
| normalization solves this problem by changing the norm of a document to `slope * old_norm + (1.0 -
| slope) * pivot`.
|
| You can either set the `pivot` by hand, or you can let Gensim figure it out automatically with the following
| two steps:
|
| * Set either the `u` or `b` document normalization in the `smartirs` parameter.
| * Set either the `corpus` or `dictionary` parameter. The `pivot` will be automatically determined from
| the properties of the `corpus` or `dictionary`.
|
| If `pivot` is None and you don't follow steps 1 and 2, then pivoted document length normalization will be
| disabled. Default is None.
|
| See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
| slope : float, optional
| In information retrieval, TF-IDF is biased against long documents [1]_. Pivoted document length
| normalization solves this problem by changing the norm of a document to `slope * old_norm + (1.0 -
| slope) * pivot`.
|
| Setting the `slope` to 0.0 uses only the `pivot` as the norm, and setting the `slope` to 1.0 effectively
| disables pivoted document length normalization. Singhal [2]_ suggests setting the `slope` between 0.2 and
| 0.3 for best results. Default is 0.25.
|
| See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
|
| See Also
| --------
| ~gensim.sklearn_api.tfidf.TfIdfTransformer : Class that also uses the SMART scheme.
| resolve_weights : Function that also uses the SMART scheme.
|
| References
| ----------
| .. [1] Singhal, A., Buckley, C., & Mitra, M. (1996). `Pivoted Document Length
| Normalization <http://singhal.info/pivoted-dln.pdf>`_. *SIGIR Forum*, 51, 176–184.
| .. [2] Singhal, A. (2001). `Modern information retrieval: A brief overview <http://singhal.info/ieee2001.pdf>`_.
| *IEEE Data Eng. Bull.*, 24(4), 35–43.
|
| __str__(self)
| Return str(self).
|
| initialize(self, corpus)
| Compute inverse document weights, which will be used to modify term frequencies for documents.
|
| Parameters
| ----------
| corpus : iterable of iterable of (int, int)
| Input corpus.
|
| ----------------------------------------------------------------------
| Class methods defined here:
|
| load(*args, **kwargs) from builtins.type
| Load a previously saved TfidfModel class. Handles backwards compatibility from
| older TfidfModel versions which did not use pivoted document normalization.
|
| ----------------------------------------------------------------------
| Methods inherited from gensim.utils.SaveLoad:
|
| save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
| Save the object to a file.
|
| Parameters
| ----------
| fname_or_handle : str or file-like
| Path to output file or already opened file-like object. If the object is a file handle,
| no special array handling will be performed, all attributes will be saved to the same file.
| separately : list of str or None, optional
| If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
| them into separate files. This prevent memory errors for large objects, and also allows
| `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
| loading and sharing the large arrays in RAM between multiple processes.
|
| If list of str: store these attributes into separate files. The automated size check
| is not performed in this case.
| sep_limit : int, optional
| Don't store arrays smaller than this separately. In bytes.
| ignore : frozenset of str, optional
| Attributes that shouldn't be stored at all.
| pickle_protocol : int, optional
| Protocol number for pickle.
|
| See Also
| --------
| :meth:`~gensim.utils.SaveLoad.load`
| Load object from file.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from gensim.utils.SaveLoad:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
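The SMART mnemonics described in the help text above can be selected through the smartirs parameter. As a sketch (reusing the corpus and doc_bow from the cells above; the 'lfc' choice is illustrative), 'lfc' combines logarithmic term frequency, idf document weighting, and cosine normalization:

# 'l' = log tf, 'f' = idf, 'c' = cosine document normalization.
tfidf_lfc = models.TfidfModel(corpus, smartirs='lfc')
print(tfidf_lfc[doc_bow])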
from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
from gensim.models import LsiModel
model = LsiModel(common_corpus[:3], id2word=common_dictionary) # train model
vector = model[common_corpus[4]] # apply model to BoW document
model.add_documents(common_corpus[4:]) # update model with new documents
tmp_fname = get_tmpfile("lsi.model")
model.save(tmp_fname) # save model
loaded_model = LsiModel.load(tmp_fname) # load model
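A quick sketch of inspecting what was learned, using the model and vector from the cell above:

# Top words per latent topic, as formatted strings.
for topic_id, topic in model.print_topics(num_topics=2, num_words=5):
    print(topic_id, topic)
print(vector)  # the BoW document projected into the latent (LSI) space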
help(common_corpus)
Help on list object:
(common_corpus is a plain Python list of BoW documents, so this prints the standard built-in list documentation; omitted here.)
from pprint import pprint
pprint(common_corpus)
[[(0, 1), (1, 1), (2, 1)],
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
[(2, 1), (5, 1), (7, 1), (8, 1)],
[(1, 1), (5, 2), (8, 1)],
[(3, 1), (6, 1), (7, 1)],
[(9, 1)],
[(9, 1), (10, 1)],
[(9, 1), (10, 1), (11, 1)],
[(4, 1), (10, 1), (11, 1)]]
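Each (token_id, count) pair can be mapped back to a readable token through common_dictionary; a quick sketch:

# Decode the first BoW document back into tokens.
readable = [(common_dictionary[token_id], count)
            for token_id, count in common_corpus[0]]
print(readable)  # e.g. [('computer', 1), ('human', 1), ('interface', 1)]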
help(common_dictionary)
Help on Dictionary in module gensim.corpora.dictionary object:
(identical to the help(corpora.Dictionary) output shown earlier; omitted here.)
help(get_tmpfile)
Help on function get_tmpfile in module gensim.test.utils:
get_tmpfile(suffix)
Get full path to file `suffix` in temporary folder.
This function doesn't creates file (only generate unique name).
Also, it may return different paths in consecutive calling.
Parameters
----------
suffix : str
Suffix of file.
Returns
-------
str
Path to `suffix` file in temporary folder.
Examples
--------
Using this function we may get path to temporary file and use it, for example, to store temporary model.
.. sourcecode:: pycon
>>> from gensim.models import LsiModel
>>> from gensim.test.utils import get_tmpfile, common_dictionary, common_corpus
>>>
>>> tmp_f = get_tmpfile("toy_lsi_model")
>>>
>>> model = LsiModel(common_corpus, id2word=common_dictionary)
>>> model.save(tmp_f)
>>>
>>> loaded_model = LsiModel.load(tmp_f)
help(models.LsiModel)
Help on class LsiModel in module gensim.models.lsimodel:
class LsiModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)
| LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
|
| Model for `Latent Semantic Indexing
| <https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing>`_.
|
| The decomposition algorithm is described in `"Fast and Faster: A Comparison of Two Streamed
| Matrix Decomposition Algorithms" <https://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_.
|
| Notes
| -----
| * :attr:`gensim.models.lsimodel.LsiModel.projection.u` - left singular vectors,
| * :attr:`gensim.models.lsimodel.LsiModel.projection.s` - singular values,
| * ``model[training_corpus]`` - right singular vectors (can be reconstructed if needed).
|
| See Also
| --------
| `FAQ about LSI matrices
| <https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q4-how-do-you-output-the-u-s-vt-matrices-of-lsi>`_.
|
| Examples
| --------
| .. sourcecode:: pycon
|
| >>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
| >>> from gensim.models import LsiModel
| >>>
| >>> model = LsiModel(common_corpus[:3], id2word=common_dictionary) # train model
| >>> vector = model[common_corpus[4]] # apply model to BoW document
| >>> model.add_documents(common_corpus[4:]) # update model with new documents
| >>> tmp_fname = get_tmpfile("lsi.model")
| >>> model.save(tmp_fname) # save model
| >>> loaded_model = LsiModel.load(tmp_fname) # load model
|
| Method resolution order:
| LsiModel
| gensim.interfaces.TransformationABC
| gensim.utils.SaveLoad
| gensim.models.basemodel.BaseTopicModel
| builtins.object
|
| Methods defined here:
|
| __getitem__(self, bow, scaled=False, chunksize=512)
| Get the latent representation for `bow`.
|
| Parameters
| ----------
| bow : {list of (int, int), iterable of list of (int, int)}
| Document or corpus in BoW representation.
| scaled : bool, optional
| If True - topics will be scaled by the inverse of singular values.
| chunksize : int, optional
| Number of documents to be used in each applying chunk.
|
| Returns
| -------
| list of (int, float)
| Latent representation of topics in BoW format for document **OR**
| :class:`gensim.matutils.Dense2Corpus`
| Latent representation of corpus in BoW format if `bow` is corpus.
|
| __init__(self, corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
| Construct an `LsiModel` object.
|
| Either `corpus` or `id2word` must be supplied in order to train the model.
|
| Parameters
| ----------
| corpus : {iterable of list of (int, float), scipy.sparse.csc}, optional
| Stream of document vectors or sparse matrix of shape (`num_documents`, `num_terms`).
| num_topics : int, optional
| Number of requested factors (latent dimensions)
| id2word : dict of {int: str}, optional
| ID to word mapping, optional.
| chunksize : int, optional
| Number of documents to be used in each training chunk.
| decay : float, optional
| Weight of existing observations relatively to new ones.
| distributed : bool, optional
| If True - distributed mode (parallel execution on several machines) will be used.
| onepass : bool, optional
| Whether the one-pass algorithm should be used for training.
| Pass `False` to force a multi-pass stochastic algorithm.
| power_iters: int, optional
| Number of power iteration steps to be used.
| Increasing the number of power iterations improves accuracy, but lowers performance
| extra_samples : int, optional
| Extra samples to be used besides the rank `k`. Can improve accuracy.
| dtype : type, optional
| Enforces a type for elements of the decomposed matrix.
|
| __str__(self)
| Get a human readable representation of model.
|
| Returns
| -------
| str
| A human readable string of the current objects parameters.
|
| add_documents(self, corpus, chunksize=None, decay=None)
| Update model with new `corpus`.
|
| Parameters
| ----------
| corpus : {iterable of list of (int, float), scipy.sparse.csc}
| Stream of document vectors or sparse matrix of shape (`num_terms`, num_documents).
| chunksize : int, optional
| Number of documents to be used in each training chunk, will use `self.chunksize` if not specified.
| decay : float, optional
| Weight of existing observations relatively to new ones, will use `self.decay` if not specified.
|
| Notes
| -----
| Training proceeds in chunks of `chunksize` documents at a time. The size of `chunksize` is a tradeoff
| between increased speed (bigger `chunksize`) vs. lower memory footprint (smaller `chunksize`).
| If the distributed mode is on, each chunk is sent to a different worker/computer.
|
| get_topics(self)
| Get the topic vectors.
|
| Notes
| -----
| The number of topics can actually be smaller than `self.num_topics`, if there were not enough factors
| in the matrix (real rank of input matrix smaller than `self.num_topics`).
|
| Returns
| -------
| np.ndarray
| The term topic matrix with shape (`num_topics`, `vocabulary_size`)
|
| print_debug(self, num_topics=5, num_words=10)
| Print (to log) the most salient words of the first `num_topics` topics.
|
| Unlike :meth:`~gensim.models.lsimodel.LsiModel.print_topics`, this looks for words that are significant for
| a particular topic *and* not for others. This *should* result in a
| more human-interpretable description of topics.
|
| Alias for :func:`~gensim.models.lsimodel.print_debug`.
|
| Parameters
| ----------
| num_topics : int, optional
| The number of topics to be selected (ordered by significance).
| num_words : int, optional
| The number of words to be included per topics (ordered by significance).
|
| save(self, fname, *args, **kwargs)
| Save the model to a file.
|
| Notes
| -----
| Large internal arrays may be stored into separate files, with `fname` as prefix.
|
| Warnings
| --------
| Do not save as a compressed file if you intend to load the file back with `mmap`.
|
| Parameters
| ----------
| fname : str
| Path to output file.
| *args
| Variable length argument list, see :meth:`gensim.utils.SaveLoad.save`.
| **kwargs
| Arbitrary keyword arguments, see :meth:`gensim.utils.SaveLoad.save`.
|
| See Also
| --------
| :meth:`~gensim.models.lsimodel.LsiModel.load`
|
| show_topic(self, topicno, topn=10)
| Get the words that define a topic along with their contribution.
|
| This is actually the left singular vector of the specified topic.
|
| The most important words in defining the topic (greatest absolute value) are included
| in the output, along with their contribution to the topic.
|
| Parameters
| ----------
| topicno : int
| The topics id number.
| topn : int
| Number of words to be included to the result.
|
| Returns
| -------
| list of (str, float)
| Topic representation in BoW format.
|
| show_topics(self, num_topics=-1, num_words=10, log=False, formatted=True)
| Get the most significant topics.
|
| Parameters
| ----------
| num_topics : int, optional
| The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
| num_words : int, optional
| The number of words to be included per topics (ordered by significance).
| log : bool, optional
| If True - log topics with logger.
| formatted : bool, optional
| If True - each topic represented as string, otherwise - in BoW format.
|
| Returns
| -------
| list of (int, str)
| If `formatted=True`, return sequence with (topic_id, string representation of topics) **OR**
| list of (int, list of (str, float))
| Otherwise, return sequence with (topic_id, [(word, value), ... ]).
|
| ----------------------------------------------------------------------
| Class methods defined here:
|
| load(fname, *args, **kwargs) from builtins.type
| Load a previously saved object using :meth:`~gensim.models.lsimodel.LsiModel.save` from file.
|
| Notes
| -----
| Large arrays can be memmap'ed back as read-only (shared memory) by setting the `mmap='r'` parameter.
|
| Parameters
| ----------
| fname : str
| Path to file that contains LsiModel.
| *args
| Variable length argument list, see :meth:`gensim.utils.SaveLoad.load`.
| **kwargs
| Arbitrary keyword arguments, see :meth:`gensim.utils.SaveLoad.load`.
|
| See Also
| --------
| :meth:`~gensim.models.lsimodel.LsiModel.save`
|
| Returns
| -------
| :class:`~gensim.models.lsimodel.LsiModel`
| Loaded instance.
|
| Raises
| ------
| IOError
| When methods are called on instance (should be called from class).
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __slotnames__ = []
|
| ----------------------------------------------------------------------
| Data descriptors inherited from gensim.utils.SaveLoad:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from gensim.models.basemodel.BaseTopicModel:
|
| print_topic(self, topicno, topn=10)
| Get a single topic as a formatted string.
|
| Parameters
| ----------
| topicno : int
| Topic id.
| topn : int
| Number of words from topic that will be used.
|
| Returns
| -------
| str
| String representation of topic, like '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + ... '.
|
| print_topics(self, num_topics=20, num_words=10)
| Get the most significant topics (alias for `show_topics()` method).
|
| Parameters
| ----------
| num_topics : int, optional
| The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
| num_words : int, optional
| The number of words to be included per topics (ordered by significance).
|
| Returns
| -------
| list of (int, list of (str, float))
| Sequence with (topic_id, [(word, value), ... ]).