I am having trouble understanding the regexp mechanism in Elasticsearch. I have documents that represent property units:
{
"Unit" :
{
"DailyAvailablity" :
"UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
}
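Starting from today, the DailyAvailability field encodes the property's availability day by day for the next two years. 'A' means available, 'U' unavailable, 'I' available for check-in, and 'O' available for check-out. How can I write a regexp filter that returns all units available on specific dates?
I tried to find a substring of 'A's with a particular length and offset in the DailyAvailability field. For example, to find units that are available for 7 days, starting 7 days from today: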
{
  "query": {
    "bool": {
      "filter": [
        {
          "regexp": { "Unit.DailyAvailability": { "value": ".{7}a{7}.*" } }
        }
      ]
    }
  }
}
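This query returns, for instance, a unit whose DailyAvailability starts with "UUUUUUUUUUUUUUUUUUUUUUIAA" but contains a suitable sequence somewhere inside the field. How can I anchor the regexp to the whole source string? The ES docs say that Lucene regexes should be anchored by default.
P.S. I also tried '^.{7}a{7}.*$'. It returns an empty set.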
It looks like you are using the text data type to store Unit.DailyAvailability (which is also the default for strings if you rely on dynamic mapping). You should consider using the keyword data type instead.
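Let me explain in more detail.
Why does the regexp match something in the middle of a text field?
With the text data type, the data is analyzed for full-text search. The analysis applies some transformations, such as lowercasing and splitting into tokens.
Let's try the Analyze API on your input: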
POST _analyze
{
"text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
The response is:
{
"tokens": [
{
"token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
"start_offset": 0,
"end_offset": 255,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 255,
"end_offset": 510,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 510,
"end_offset": 732,
"type": "<ALPHANUM>",
"position": 2
}
]
}
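As you can see, Elasticsearch split your input into three tokens and lowercased them. This looks unexpected, but it makes sense once you consider that the analyzer is designed to help search for words in human language, and there are no words that long.
That is why the regexp query ".{7}a{7}.*" now matches: the third token starts with a long run of a's, so a match is the expected behavior of the regexp query. The documentation says:
… Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.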
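To make the effect concrete, here is a minimal Python sketch of what happens to a long run of letters in an analyzed field. It is only an approximation: it assumes the standard tokenizer's default maximum token length of 255 (consistent with the offsets in the response above), it uses Python's `re.fullmatch` as a stand-in for Lucene's always-anchored matching, and the sample value and helper names are hypothetical.

```python
import re

MAX_TOKEN_LENGTH = 255  # assumed default max_token_length of the standard tokenizer


def analyze_like_standard(text, max_len=MAX_TOKEN_LENGTH):
    """Rough stand-in for the analysis of one long run of letters:
    lowercase it, then cut it into chunks of at most max_len characters."""
    lowered = text.lower()
    return [lowered[i:i + max_len] for i in range(0, len(lowered), max_len)]


def regexp_matches_any_term(pattern, terms):
    """A regexp query matches a document if the pattern matches a whole term."""
    return any(re.fullmatch(pattern, term) for term in terms)


# Hypothetical value: unavailable for 255 days, then available for 20 days.
value = "U" * 255 + "A" * 20 + "U" * 10
terms = analyze_like_standard(value)
pattern = ".{7}a{7}.*"

print(len(terms))                                     # number of terms produced
print(regexp_matches_any_term(pattern, terms))        # matches the second term
print(re.fullmatch(pattern, value.lower()) is None)   # the whole string does not match
```

The second term begins with a's, so the pattern matches it from its own start, even though days 7 to 13 of the full string are all unavailable.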
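How do I make the regexp query consider the whole string?
It is simple: do not use an analyzer. The keyword type stores the string you provide as is.
With a mapping like this: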
PUT my_regexes
{
  "mappings": {
    "doc": {
      "properties": {
        "Unit": {
          "properties": {
            "DailyAvailablity": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
you will be able to run a query like the following, which matches the document from the post:
POST my_regexes/doc/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
        }
      ]
    }
  }
}
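Note that the query becomes case sensitive, because the field is no longer analyzed.
This regexp will no longer return any results: ".{12}a{7}.*"
This one will: ".{12}A{7}.*"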
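As an illustration, the pattern for "available for N days starting K days from today" can be generated rather than written by hand. The sketch below is a hedged example, not part of the original answer: the helper names are hypothetical, the sample string is a shortened value modeled on the start of the document in the post, and Python's `re.fullmatch` stands in for Lucene's anchored matching on a keyword field.

```python
import re


def availability_pattern(offset_days, length_days):
    """Pattern for `length_days` consecutive 'A' days after skipping `offset_days`."""
    return f".{{{offset_days}}}A{{{length_days}}}.*"


def build_regexp_filter(field, offset_days, length_days):
    """Request body in the same shape as the query shown above."""
    pattern = availability_pattern(offset_days, length_days)
    return {"query": {"bool": {"filter": [{"regexp": {field: pattern}}]}}}


# Hypothetical shortened value: 12 leading days, then 17 available days.
sample = "UIAO" + "U" * 7 + "I" + "A" * 17 + "O" + "U" * 4

print(availability_pattern(12, 7))
print(re.fullmatch(availability_pattern(12, 7), sample) is not None)
print(re.fullmatch(".{12}a{7}.*", sample) is not None)  # lowercase no longer matches
print(build_regexp_filter("Unit.DailyAvailablity", 12, 7))
```

Because the keyword field keeps the value as a single term, the offset in the pattern now counts days from the start of the whole string.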
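On regexp anchoring, the documentation says:
Lucene's patterns are always anchored. The pattern provided must match the entire string.
The anchoring most likely looked wrong because the value was split into tokens in the analyzed text field.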
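The empty result for '^.{7}a{7}.*$' in the question is also consistent with this: Lucene's regexp engine does not support the anchor operators ^ and $. The sketch below emulates that in Python under the assumption that Lucene treats both characters as ordinary literals; the helper name is hypothetical, and the naive replacement would mishandle character classes such as [^a], so it is only meant for simple patterns like the ones in this post.

```python
import re


def lucene_like_fullmatch(pattern, term):
    """Emulate two assumed Lucene behaviours with Python's re:
    the pattern is implicitly anchored, and ^ / $ are plain characters."""
    literal = pattern.replace("^", r"\^").replace("$", r"\$")
    return re.fullmatch(literal, term) is not None


# Hypothetical keyword value: 7 unavailable days, 7 available days, 3 unavailable days.
term = "U" * 7 + "A" * 7 + "U" * 3

print(lucene_like_fullmatch(".{7}A{7}.*", term))             # implicit anchoring is enough
print(lucene_like_fullmatch("^.{7}A{7}.*$", term))           # explicit anchors find nothing
print(lucene_like_fullmatch("^.{7}A{7}.*$", "^" + term + "$"))
```

Under that assumption, the explicitly anchored pattern could only match a term that literally starts with ^ and ends with $, so the anchors should simply be left out.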
Hope that helps!