Word boundaries give unexpected results for words that start or end with special characters

松雅健

2023-03-14

Question:

Say I want to detect the presence of the phrase Sortes\index[persons]{Sortes} in the string test Sortes\index[persons]{Sortes} text.

Using Python's re module, I can do this:

>>> search = re.escape('Sortes\index[persons]{Sortes}')
>>> match = 'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

This works, but I want to prevent the search pattern Sortes from giving a positive result on the string test Sortes\index[persons]{Sortes} text.

>>> re.search(re.escape('Sortes'), match)
<_sre.SRE_Match object; span=(5, 11), match='Sortes'>

So I use the \b pattern, like this:

search = r'\b' + re.escape('Sortes\index[persons]{Sortes}') + r'\b'
match = 'test Sortes\index[persons]{Sortes} text'
re.search(search, match)

Now I get no match.

If the search pattern does not contain any of the characters []{}, it works. For example:

>>> re.search(r'\b' + re.escape('Sortes\index') + r'\b', 'test Sortes\index test')
<_sre.SRE_Match object; span=(5, 17), match='Sortes\\index'>

Also, if I remove the final r'\b', it works too:

>>> re.search(r'\b' + re.escape('Sortes\index[persons]{Sortes}'), 'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

Furthermore, the documentation says this about \b:

Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.

So I tried replacing the final \b with (\W|$):

>>> re.search(r'\b' + re.escape('Sortes\index[persons]{Sortes}') + '(\W|$)', 'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 35), match='Sortes\\index[persons]{Sortes} '>

Lo and behold, it works! What is going on here? What am I missing?

Answer:

Look at what a word boundary matches:

A word boundary can occur in one of three positions:

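Before the first character in the string, if the first character is a word character.

After the last character in the string, if the last character is a word character.

Between two characters in the string, where one is a word character and the other is not a word character.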
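
To make the three positions concrete, the following minimal sketch lists every index at which \b matches in a few short strings. The helper name boundary_positions is introduced here for illustration only and is not part of the re module.

```python
import re

def boundary_positions(text):
    """Return every index in `text` where the zero-width word-boundary assertion matches."""
    return [m.start() for m in re.finditer(r'\b', text)]

# First and last characters are word characters: boundaries at both ends.
print(boundary_positions('ab'))

# The string ends with '}', a non-word character: no boundary after it.
print(boundary_positions('a}'))

# '}' followed by a space: two non-word characters, so no boundary between them.
print(boundary_positions('a} b'))
```

In the last two strings there is no boundary directly after }, which is exactly the position where the pattern in the question needs one.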

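In your pattern, } is itself a non-word character, so }\b matches only if the } is followed by a word character (a letter, a digit, or _).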

When you use (\W|$), you explicitly require a non-word character or the end of the string.
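
The two statements above can be checked with a short sketch. It reuses the phrase from the question; the variable names are chosen here for illustration.

```python
import re

phrase = r'Sortes\index[persons]{Sortes}'
escaped = re.escape(phrase)

# '}' followed by a space: both are non-word characters, so \b cannot match.
followed_by_space = re.search(escaped + r'\b', 'test ' + phrase + ' text')

# '}' followed by a letter: a non-word/word transition, so \b matches here.
followed_by_letter = re.search(escaped + r'\b', 'test ' + phrase + 'text')

# (\W|$) succeeds before the space, but the space is consumed into the match.
consuming = re.search(escaped + r'(\W|$)', 'test ' + phrase + ' text')

print(followed_by_space)
print(followed_by_letter)
print(consuming)
```

Note that the (\W|$) variant extends the match by one character when a non-word character follows, which is why the span in the question ends at 35 rather than 34.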

In these cases, I always recommend unambiguous word boundaries based on negative lookarounds:

re.search(r'(?<!\w){}(?!\w)'.format(re.escape('Sortes\index[persons]{Sortes}')), 'test Sortes\index[persons]{Sortes} test')

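Here, the negative lookbehind (?<!\w) fails the match if there is a word character immediately to the left of the current position, and the negative lookahead (?!\w) fails the match if there is a word character immediately to the right of the current position.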
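
As a rough sketch, the lookaround pattern can be wrapped in a small helper. The name find_whole is hypothetical and chosen for this example; it simply applies the pattern shown above to an arbitrary phrase.

```python
import re

def find_whole(phrase, text):
    """Search for `phrase` in `text`, rejecting matches adjacent to a word character."""
    pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
    return re.search(pattern, text)

phrase = r'Sortes\index[persons]{Sortes}'

# Surrounded by spaces: matches, and the match does not include the spaces.
print(find_whole(phrase, 'test ' + phrase + ' text'))

# A letter directly after '}' or directly before 'S': no match.
print(find_whole(phrase, 'test ' + phrase + 'text'))
print(find_whole(phrase, 'x' + phrase + ' text'))
```

Because lookarounds are zero-width, the surrounding characters are checked but never become part of the match, unlike with (\W|$).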

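In fact, these lookaround patterns are easy to customize further: to fail the match only when there are letters around the pattern, use [^\W\d_] instead of \w, or, if you only want to allow matches surrounded by whitespace, use the whitespace boundaries (?<!\S) and (?!\S).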
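
The two variants can be sketched as pattern builders. The function names letter_bounded and whitespace_bounded and the sample phrase {x} are assumptions made for this example.

```python
import re

def letter_bounded(phrase):
    # [^\W\d_] is a word character that is neither a digit nor '_', i.e. a letter.
    # The match fails only when a letter touches the phrase.
    return r'(?<![^\W\d_])' + re.escape(phrase) + r'(?![^\W\d_])'

def whitespace_bounded(phrase):
    # The match is allowed only between whitespace characters or string edges.
    return r'(?<!\S)' + re.escape(phrase) + r'(?!\S)'

# Digits around the phrase are accepted by the letter variant, letters are not.
print(re.search(letter_bounded('{x}'), '1{x}2'))
print(re.search(letter_bounded('{x}'), 'a{x}'))

# Parentheses around the phrase are rejected by the whitespace variant.
print(re.search(whitespace_bounded('{x}'), 'see {x} here'))
print(re.search(whitespace_bounded('{x}'), 'see ({x}) here'))
```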
