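I was able to run gensim's LDA code and get the top 10 topics with their respective keywords.
Now I would like to go a step further and gauge how accurate the LDA algorithm is by seeing which documents it clusters into each topic. Is this possible with gensim LDA?
Basically, I would like to do something like this, but in Python and using gensim:
LDA with topicmodels, how can I see which topics different documents belong to?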
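Using the topic probabilities, you can try setting a threshold and using it as a clustering baseline, but I am sure there are better ways to cluster than this "hacky" method.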
from gensim import corpora, models, similarities
from itertools import chain

""" DEMO """
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

# Create Dictionary.
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3,
                               update_every=1, chunksize=10000, passes=1)

# Prints the topics.
for top in lda.print_topics():
    print(top)
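
# Assigns the topics to the documents in corpus.
# Materialize the result once so every pass below sees the same inference.
lda_corpus = list(lda[mm])

# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id, score in doc] for doc in lda_corpus]))
threshold = sum(scores) / len(scores)
print(threshold)

# A document joins a cluster when its probability for that topic exceeds the
# threshold. Topics are looked up by id because gensim may omit
# low-probability topics, which makes positional indexing unreliable.
cluster1 = [text for doc, text in zip(lda_corpus, documents) if dict(doc).get(0, 0.0) > threshold]
cluster2 = [text for doc, text in zip(lda_corpus, documents) if dict(doc).get(1, 0.0) > threshold]
cluster3 = [text for doc, text in zip(lda_corpus, documents) if dict(doc).get(2, 0.0) > threshold]

print(cluster1)
print(cluster2)
print(cluster3)
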
[out]:
0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user
0.333333333333
['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']
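
If each document should land in exactly one topic, a simple alternative to thresholding is to pick the topic with the highest probability per document. The sketch below assumes the per-document data has the shape gensim returns from lda[bow], a list of (topic_id, probability) pairs. The probabilities are made-up illustrative values, not output from the model above, and the helper names dominant_topic and group_by_dominant_topic are hypothetical.

```python
# Hypothetical per-document topic distributions, shaped like the result of
# lda[bow] in gensim: one list of (topic_id, probability) pairs per document.
# The numbers are invented for illustration only.
doc_topics = [
    [(0, 0.08), (1, 0.07), (2, 0.85)],
    [(0, 0.10), (1, 0.82), (2, 0.08)],
    [(0, 0.76), (2, 0.24)],  # a topic may be missing if its probability is negligible
]
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The intersection graph of paths in trees",
]


def dominant_topic(topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topics, key=lambda pair: pair[1])


def group_by_dominant_topic(doc_topics, documents):
    """Map each topic id to the documents whose most probable topic it is."""
    clusters = {}
    for topics, text in zip(doc_topics, documents):
        topic_id, _ = dominant_topic(topics)
        clusters.setdefault(topic_id, []).append(text)
    return clusters


clusters = group_by_dominant_topic(doc_topics, documents)
for topic_id in sorted(clusters):
    print(topic_id, clusters[topic_id])
```

With a trained model, doc_topics would be built from lda[bow] for each bag-of-words vector in the corpus. Unlike the threshold approach, this never places a document in two clusters or in none.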
To make the threshold computation a bit clearer:
# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = []
for doc in lda_corpus:
    for topic_id, score in doc:
        scores.append(score)
threshold = sum(scores) / len(scores)
The code above sums the topic scores (probabilities) across all documents and all topics, then normalizes the sum by the number of scores.
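
A short derivation shows why this average comes out at 1/#clusters. It assumes the model reports a probability for every one of the K topics for each of the D documents (the symbols D and K are introduced here for illustration). gensim may drop topics whose probability is very small, in which case the equality holds only approximately. Since each document's topic probabilities sum to 1:

```latex
\text{threshold}
  = \frac{1}{DK} \sum_{d=1}^{D} \sum_{k=1}^{K} p(k \mid d)
  = \frac{1}{DK} \sum_{d=1}^{D} 1
  = \frac{1}{K}
```

With K = 3 topics this gives 1/3, which matches the 0.333333333333 printed in the output.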