I'm using Elasticsearch to index my database. I'm trying to use the edgeNGram tokenizer to cut strings into smaller ones, with the requirement that "the resulting strings must be longer than 4 characters". I create the index with the following code:
PUT test
POST /test/_close
PUT /test/_settings
{
"analysis": {
"analyzer": {
"index_edge_ngram" : {
"type": "custom",
"filter": ["custom_word_delimiter"],
"tokenizer" : "left_tokenizer"
}
},
"filter" : {
"custom_word_delimiter" : {
"type": "word_delimiter",
"generate_word_parts": "true",
"generate_number_parts": "true",
"catenate_words": "false",
"catenate_numbers": "false",
"catenate_all": "false",
"split_on_case_change": "false",
"preserve_original": "false",
"split_on_numerics": "true",
"ignore_case": "true"
}
},
"tokenizer" : {
"left_tokenizer" : {
"max_gram" : 30,
"min_gram" : 5,
"type" : "edgeNGram"
}
}
}
}
POST /test/_open
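To double-check that the analysis settings were actually applied after reopening the index, something like this can be used (it just echoes the index settings back):

GET /test/_settings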
Now I run a test to look at the results:
GET /test/_analyze?analyzer=index_edge_ngram&text=please pay for multiple wins with only one payment
and I get these results:
{
"tokens": [
{
"token": "pleas",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 1
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 2
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 3
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "p",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 6
},
{
"token": "pa",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 7
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 8
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 9
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 10
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 11
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 12
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 13
},
{
"token": "f",
"start_offset": 11,
"end_offset": 12,
"type": "word",
"position": 14
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 15
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 16
},
{
"token": "fo",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 17
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 18
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 19
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 20
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 21
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 22
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 23
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 24
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 25
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 26
},
{
"token": "m",
"start_offset": 15,
"end_offset": 16,
"type": "word",
"position": 27
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 28
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 29
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 30
},
{
"token": "mu",
"start_offset": 15,
"end_offset": 17,
"type": "word",
"position": 31
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 32
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 33
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 34
},
{
"token": "mul",
"start_offset": 15,
"end_offset": 18,
"type": "word",
"position": 35
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 36
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 37
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 38
},
{
"token": "mult",
"start_offset": 15,
"end_offset": 19,
"type": "word",
"position": 39
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 40
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 41
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 42
},
{
"token": "multi",
"start_offset": 15,
"end_offset": 20,
"type": "word",
"position": 43
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 44
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 45
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 46
},
{
"token": "multip",
"start_offset": 15,
"end_offset": 21,
"type": "word",
"position": 47
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 48
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 49
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 50
},
{
"token": "multipl",
"start_offset": 15,
"end_offset": 22,
"type": "word",
"position": 51
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 52
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 53
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 54
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 55
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 56
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 57
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 58
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 59
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 60
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 61
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 62
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 63
},
{
"token": "w",
"start_offset": 24,
"end_offset": 25,
"type": "word",
"position": 64
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 65
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 66
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 67
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 68
},
{
"token": "wi",
"start_offset": 24,
"end_offset": 26,
"type": "word",
"position": 69
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 70
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 71
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 72
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 73
},
{
"token": "win",
"start_offset": 24,
"end_offset": 27,
"type": "word",
"position": 74
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 75
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 76
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 77
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 78
},
{
"token": "wins",
"start_offset": 24,
"end_offset": 28,
"type": "word",
"position": 79
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 80
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 81
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 82
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 83
},
{
"token": "wins",
"start_offset": 24,
"end_offset": 28,
"type": "word",
"position": 84
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 85
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 86
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 87
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 88
},
{
"token": "wins",
"start_offset": 24,
"end_offset": 28,
"type": "word",
"position": 89
},
{
"token": "w",
"start_offset": 29,
"end_offset": 30,
"type": "word",
"position": 90
}
]
}
When building a custom analyzer, it is worth checking step by step what each stage of the analysis chain produces.
In your case, if you inspect the output of the tokenizer stage alone, this is what it looks like. We simply pass the tokenizer (i.e. left_tokenizer) as a parameter:
curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&pretty' -d 'please pay for multiple wins with only one payment'
{
"tokens" : [ {
"token" : "pleas",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 1
}, {
"token" : "please",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 2
}, {
"token" : "please ",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 3
}, {
"token" : "please p",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 4
}, {
"token" : "please pa",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 5
}, {
"token" : "please pay",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 6
}, {
"token" : "please pay ",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 7
}, {
"token" : "please pay f",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 8
}, {
"token" : "please pay fo",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 9
}, {
"token" : "please pay for",
"start_offset" : 0,
"end_offset" : 14,
"type" : "word",
"position" : 10
}, {
"token" : "please pay for ",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 11
}, {
"token" : "please pay for m",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 12
}, {
"token" : "please pay for mu",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 13
}, {
"token" : "please pay for mul",
"start_offset" : 0,
"end_offset" : 18,
"type" : "word",
"position" : 14
}, {
"token" : "please pay for mult",
"start_offset" : 0,
"end_offset" : 19,
"type" : "word",
"position" : 15
}, {
"token" : "please pay for multi",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 16
}, {
"token" : "please pay for multip",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 17
}, {
"token" : "please pay for multipl",
"start_offset" : 0,
"end_offset" : 22,
"type" : "word",
"position" : 18
}, {
"token" : "please pay for multiple",
"start_offset" : 0,
"end_offset" : 23,
"type" : "word",
"position" : 19
"position" : 20
}, {
"token" : "please pay for multiple w",
"start_offset" : 0,
"end_offset" : 25,
"type" : "word",
"position" : 21
}, {
"token" : "please pay for multiple wi",
"start_offset" : 0,
"end_offset" : 26,
"type" : "word",
"position" : 22
}, {
"token" : "please pay for multiple win",
"start_offset" : 0,
"end_offset" : 27,
"type" : "word",
"position" : 23
}, {
"token" : "please pay for multiple wins",
"start_offset" : 0,
"end_offset" : 28,
"type" : "word",
"position" : 24
}, {
"token" : "please pay for multiple wins ",
"start_offset" : 0,
"end_offset" : 29,
"type" : "word",
"position" : 25
}, {
"token" : "please pay for multiple wins w",
"start_offset" : 0,
"end_offset" : 30,
"type" : "word",
"position" : 26
} ]
}
So your left_tokenizer treats the whole sentence as a single token input and tokenizes it from 5 up to 30 characters, which is why it stops at wins (this answers question 3).
As you can see above, some tokens are repeated, because the word_delimiter token filter processes each token coming out of the tokenizer in isolation, hence the "duplicates" (answering question 4) and the tokens shorter than 5 characters (answering question 1).
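You can see the effect of the word_delimiter stage in isolation by feeding it a single token (using the same ES 1.x-style _analyze parameters as the curl above; the keyword tokenizer passes the text through unchanged):

curl -XGET 'localhost:9201/test/_analyze?tokenizer=keyword&filters=custom_word_delimiter&pretty' -d 'please pay'

That one input token comes back as two tokens, please and pay, and the same thing happens to every ngram emitted by left_tokenizer, which is where all the repeated please/pay/for/... tokens in your first output come from.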
I don't think this is the way you want it to work, but it is not clear from your question how you do want it to work, i.e. what kind of searches you want to be able to run. What I've given here is simply an explanation of what you are seeing.
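That said, if what you are after is per-word prefixes of at least 5 characters, the usual pattern is the other way around: tokenize into words first, then apply the edge ngram as a token filter. A minimal sketch under that assumption (the analyzer and filter names here are just placeholders):

"analysis": {
  "analyzer": {
    "word_prefix_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "word_prefix_ngram"]
    }
  },
  "filter": {
    "word_prefix_ngram": {
      "type": "edgeNGram",
      "min_gram": 5,
      "max_gram": 30
    }
  }
}

With this ordering no token ever spans a space, and the ngram stage only produces tokens of 5 characters or more (keep in mind that words shorter than min_gram may be dropped entirely by the filter, so whether this fits depends on the searches you need).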