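I want to compute the similarity between two lists of words. For example, this list:
['email','user','this','email','address','customer']
should count as similar to this one:
['email','mail','address','netmail']
In particular, it should get a higher similarity percentage than another list such as ['address','ip','network'], even though address also appears in that list.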
Since you have not been able to show a crystal-clear expected output, here is my best shot:
list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
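For the two lists above, we compute the cosine similarity between each element of one list and every element of the other, i.e. email from list_B against every element of list_A: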
def word2vec(word):
    from collections import Counter
    from math import sqrt
    # count the characters in word
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c*c for c in cw.values()))
    # return a tuple: (character counts, character set, norm)
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine similarity: dot product divided by both norms
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
threshold = 0.80  # if needed

for key in list_A:
    for word in list_B:
        try:
            # print(key)
            # print(word)
            res = cosdis(word2vec(word), word2vec(key))
            # print(res)
            print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
            # if res > threshold:
            #     print("Found a word with cosine similarity > 80 : {} with original word: {}".format(word, key))
        except IndexError:
            pass
Output:
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365
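Note: I have also left the threshold part commented out in the code, in case you only need the word pairs whose similarity exceeds a certain threshold (here 80%).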
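As a minimal sketch of that threshold filter: the helper names `char_cosine` and `pairs_above` below are hypothetical (they are not part of the code above), and the character-count cosine similarity is re-implemented so the snippet runs on its own.

```python
from collections import Counter
from math import sqrt

def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    # dot product over the characters the two words share
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = sqrt(sum(n * n for n in ca.values())) * sqrt(sum(n * n for n in cb.values()))
    return dot / norm if norm else 0.0

def pairs_above(words_a, words_b, threshold=0.80):
    """Return the sorted, de-duplicated (a, b) pairs whose similarity exceeds threshold."""
    return sorted({(a, b) for a in words_a for b in words_b
                   if char_cosine(a, b) > threshold})

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']

matches = pairs_above(list_A, list_B)
for a, b in matches:
    print("{} ~ {}".format(a, b))
```

With the two example lists this keeps only the pairs that score above 80% in the output listed earlier: address with address, and email with email, mail and netmail.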
EDIT:
OP: But what I want is not a word-by-word comparison, but a list-by-list one.
Using Counter and math:
from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

print(counter_cosine_similarity(counterA, counterB) * 100)
Output:
53.03300858899106
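To check this against the original question, the same list-level measure can be applied to the third list, ['address','ip','network']. The sketch below is self-contained; the name `list_C` and the zero-magnitude guard are additions for illustration and are not part of the answer's code.

```python
from collections import Counter
import math

def counter_cosine_similarity(c1, c2):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    terms = set(c1) | set(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    mag1 = math.sqrt(sum(v * v for v in c1.values()))
    mag2 = math.sqrt(sum(v * v for v in c2.values()))
    if mag1 == 0 or mag2 == 0:
        return 0.0  # added guard: an empty list has no direction
    return dotprod / (mag1 * mag2)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']  # the "other" list from the question

score_B = counter_cosine_similarity(Counter(list_A), Counter(list_B))
score_C = counter_cosine_similarity(Counter(list_A), Counter(list_C))

print("A vs B: {:.2f}".format(score_B * 100))  # 53.03
print("A vs C: {:.2f}".format(score_C * 100))  # 20.41
```

list_B scores higher than list_C, as the question asks. Note that this measure only counts exact word matches, so mail and netmail contribute nothing toward the similarity with email.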