I am trying to implement Explicit Semantic Analysis (ESA) with Lucene.
When matching documents, how can the TF-IDF of the terms within the query itself be taken into account?
Is there a better way to do this?
Lucene already supports TF/IDF scoring by default, of course, so I'm not entirely sure what you are looking for.
It actually sounds as though you want to weight query terms by their TF/IDF within the query itself. So let's consider a couple of the elements at play here:
tf: Lucene sums the scores of the individual query terms, so if the same term occurs twice in the query (as in field:(a a b)), the doubled term carries more weight, similar to (though not exactly the same as) boosting it by 2. For the six-term query a b a c a d:

- the document "a b a" matches four query terms (a, b, a, a, but neither c nor d);
- the document "a b c" matches five query terms (a, b, a, c, a, but not d).

So, on this particular element of the score, the second document is scored more strongly.
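To make this concrete, here is a minimal, self-contained sketch that indexes the two example documents and prints the scoring explanation for each. It assumes a Lucene 4.x-era setup (the DefaultSimilarity in the output below suggests that vintage); the class name ExplainDemo is my own, and I use WhitespaceAnalyzer so the single-letter terms are not dropped as stop words:

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ExplainDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        // WhitespaceAnalyzer keeps single-letter terms like "a" intact.
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_47);
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_47, analyzer));

        // Doc 0 = "a b a", doc 1 = "a b c".
        for (String text : new String[] { "a b a", "a b c" }) {
            Document doc = new Document();
            doc.add(new TextField("text", text, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        // Six query terms, with "a" repeated three times.
        Query query = new QueryParser(Version.LUCENE_47, "text", analyzer)
                .parse("a b a c a d");

        // Print the scoring breakdown for each document.
        for (int docId = 0; docId < 2; docId++) {
            System.out.println(searcher.explain(query, docId));
        }
    }
}
```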
Running it produces the explain output below for the two documents (see IndexSearcher.explain):
```
0.26880693 = (MATCH) product of:
  0.40321037 = (MATCH) sum of:
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.07690979 = (MATCH) weight(text:b in 0) [DefaultSimilarity], result of:
      0.07690979 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
  0.6666667 = coord(4/6)

0.43768594 = (MATCH) product of:
  0.52522314 = (MATCH) sum of:
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:b in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.217584 = (MATCH) weight(text:c in 1) [DefaultSimilarity], result of:
      0.217584 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.435168 = queryWeight, product of:
          1.0 = idf(docFreq=1, maxDocs=2)
          0.435168 = queryNorm
        0.5 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = idf(docFreq=1, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
  0.8333333 = coord(5/6)
```
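As a sanity check on how the pieces combine: for doc 0, the four clause scores sum to 0.10876686 × 3 + 0.07690979 = 0.40321037, which multiplied by coord(4/6) = 0.6666667 gives the final 0.26880693; for doc 1, 0.07690979 × 4 + 0.217584 = 0.52522314, which multiplied by coord(5/6) = 0.8333333 gives 0.43768594.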
Note, however, the coord factor, and the idf of the term c found in the second document. Those elements of the score pretty well wipe out the boost you gained by repeating the same term. If you added enough duplicates of a to the query, the two documents would eventually swap places, but the match on c is reckoned to be the more meaningful result.
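If you want explicit control over query-side TF/IDF rather than relying on term repetition, one option is to compute a weight for each distinct query term yourself and apply it as a per-clause boost. The helper below is only a sketch under assumptions: QueryTfIdfBuilder and its smoothed IDF formula are hypothetical choices of mine, written against Lucene 4.x-era APIs where Query.setBoost still exists.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

/**
 * Sketch: builds a query in which each distinct term is boosted by a
 * TF-IDF-style weight computed from the query itself (the term's
 * frequency within the query times a smoothed IDF from the index).
 */
public class QueryTfIdfBuilder {
    public static BooleanQuery build(IndexReader reader, String field,
                                     String[] queryTerms) throws IOException {
        // Count each term's frequency within the query (the "query tf").
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String t : queryTerms) {
            tf.put(t, tf.containsKey(t) ? tf.get(t) + 1 : 1);
        }

        BooleanQuery query = new BooleanQuery();
        int numDocs = reader.numDocs();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Term term = new Term(field, e.getKey());
            int df = reader.docFreq(term);
            // Smoothed IDF so terms absent from the index (df = 0) stay finite.
            float idf = (float) Math.log(1.0 + (double) numDocs / (df + 1));
            TermQuery tq = new TermQuery(term);
            tq.setBoost(e.getValue() * idf); // query tf * idf as the boost
            query.add(tq, BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}
```

One caveat: collapsing repeated terms into a single boosted clause changes the coord factor relative to a query that simply repeats the term, so the resulting scores will not match the repeated-term query exactly.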