字符串度量(string metric, a.k.a a string similarity metric or string distance function)是度量两个文本字符串之间的距离的度量,用于近似字符串匹配或比较以及模糊字符串搜索。
字符串度量的一个要求是满足三角形不等式。
以下介绍一些常用的字符串度量:
Levenshtein distance:也称编辑距离(edit distance),两个单词之间的Levenshtein距离是将一个单词转换成另一个单词所需的最小操作数(包括单个字符的插入、删除或替换)。https://leetcode-cn.com/problems/edit-distance/
Damerau–Levenshtein distance:是Levenshtein distance的推广,还可以考虑两个相邻字符的互换。
Hamming distance:长度相等的两个字符串之间的汉明距离是对应符号不同的位置数;换句话说,它测量将一个字符串更改为另一个字符串所需的最少替换次数。
Longest common substring / Longest common subsequence:最长公共子串和最长公共子序列,尽管它们不是字符串度量,但还是可以用来度量字符串之间的相似度。
Jaro–Winkler distance:https://blog.csdn.net/asty9000/article/details/81348857
参考资料:https://en.wikipedia.org/wiki/String_metric
# 计算与指定单词编辑距离为1、2的所有字符串。
def edits1(word):
"All edits that are one edit away from `word`."
letters = 'abcdefghijklmnopqrstuvwxyz'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [L + R[1:] for L, R in splits if R]
transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
inserts = [L + c + R for L, R in splits for c in letters]
return set(deletes + transposes + replaces + inserts)
def edits2(word):
"All edits that are two edits away from `word`."
return set(e2 for e1 in edits1(word) for e2 in edits1(e1))