Question:

Elasticsearch issue with the edgeNGram tokenizer

洪照
2023-03-14

I'm using Elasticsearch to index a database. I'm trying to use the edgeNGram tokenizer to cut strings down into tokens that meet the requirement "new tokens must be longer than 4 characters". I create the index with the following code (the index is closed around the _settings update because analysis settings cannot be changed while it is open):

PUT test
POST /test/_close

PUT /test/_settings
{
    "analysis": {
        "analyzer": {
            "index_edge_ngram": {
                "type": "custom",
                "filter": ["custom_word_delimiter"],
                "tokenizer": "left_tokenizer"
            }
        },
        "filter": {
            "custom_word_delimiter": {
                "type": "word_delimiter",
                "generate_word_parts": "true",
                "generate_number_parts": "true",
                "catenate_words": "false",
                "catenate_numbers": "false",
                "catenate_all": "false",
                "split_on_case_change": "false",
                "preserve_original": "false",
                "split_on_numerics": "true",
                "ignore_case": "true"
            }
        },
        "tokenizer": {
            "left_tokenizer": {
                "max_gram": 30,
                "min_gram": 5,
                "type": "edgeNGram"
            }
        }
    }
}

POST /test/_open

Now I run a test to check the result:

GET /test/_analyze?analyzer=index_edge_ngram&text=please pay for multiple wins with only one payment

and get this result:

{
   "tokens": [
      {
         "token": "pleas",
         "start_offset": 0,
         "end_offset": 5,
         "type": "word",
         "position": 1
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 2
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 3
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 4
      },
      {
         "token": "p",
         "start_offset": 7,
         "end_offset": 8,
         "type": "word",
         "position": 5
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 6
      },
      {
         "token": "pa",
         "start_offset": 7,
         "end_offset": 9,
         "type": "word",
         "position": 7
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 8
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 9
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 10
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 11
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 12
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 13
      },
      {
         "token": "f",
         "start_offset": 11,
         "end_offset": 12,
         "type": "word",
         "position": 14
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 15
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 16
      },
      {
         "token": "fo",
         "start_offset": 11,
         "end_offset": 13,
         "type": "word",
         "position": 17
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 18
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 19
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 20
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 21
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 22
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 23
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 24
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 25
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 26
      },
      {
         "token": "m",
         "start_offset": 15,
         "end_offset": 16,
         "type": "word",
         "position": 27
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 28
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 29
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 30
      },
      {
         "token": "mu",
         "start_offset": 15,
         "end_offset": 17,
         "type": "word",
         "position": 31
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 32
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 33
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 34
      },
      {
         "token": "mul",
         "start_offset": 15,
         "end_offset": 18,
         "type": "word",
         "position": 35
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 36
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 37
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 38
      },
      {
         "token": "mult",
         "start_offset": 15,
         "end_offset": 19,
         "type": "word",
         "position": 39
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 40
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 41
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 42
      },
      {
         "token": "multi",
         "start_offset": 15,
         "end_offset": 20,
         "type": "word",
         "position": 43
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 44
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 45
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 46
      },
      {
         "token": "multip",
         "start_offset": 15,
         "end_offset": 21,
         "type": "word",
         "position": 47
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 48
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 49
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 50
      },
      {
         "token": "multipl",
         "start_offset": 15,
         "end_offset": 22,
         "type": "word",
         "position": 51
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 52
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 53
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 54
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 55
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 56
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 57
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 58
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 59
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 60
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 61
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 62
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 63
      },
      {
         "token": "w",
         "start_offset": 24,
         "end_offset": 25,
         "type": "word",
         "position": 64
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 65
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 66
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 67
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 68
      },
      {
         "token": "wi",
         "start_offset": 24,
         "end_offset": 26,
         "type": "word",
         "position": 69
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 70
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 71
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 72
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 73
      },
      {
         "token": "win",
         "start_offset": 24,
         "end_offset": 27,
         "type": "word",
         "position": 74
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 75
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 76
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 77
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 78
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 79
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 80
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 81
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 82
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 83
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 84
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 85
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 86
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 87
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 88
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 89
      },
      {
         "token": "w",
         "start_offset": 29,
         "end_offset": 30,
         "type": "word",
         "position": 90
      }
   ]
}

1 Answer

钦耀
2023-03-14

When building a custom analyzer, it is worth checking step by step what each stage of the analysis chain produces (each stage can be exercised in isolation, as sketched after this list):

  1. First, the character filters preprocess the raw input text
  2. Then the tokenizer slices and dices that text into tokens
  3. Finally, the token filters take the tokens from step 2 as input and transform them
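
For instance, the tokenizer and the token filters can each be exercised on their own via the _analyze API. A minimal sketch, assuming the 1.x-era query-parameter form of _analyze used throughout this answer (the filters parameter name is an assumption and varies across versions; recent versions take a JSON body instead):

 # tokenizer stage only
 curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&pretty' -d 'please pay'

 # tokenizer plus one token filter
 curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&filters=custom_word_delimiter&pretty' -d 'please pay'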

In your case, if you check the output of the tokenizer stage, it looks like the following. Here we only pass the tokenizer (i.e. left_tokenizer) as a parameter:

 curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&pretty' -d 'please pay for multiple wins with only one payment'
{
  "tokens" : [ {
    "token" : "pleas",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "please",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "please ",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "please p",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "please pa",
    "start_offset" : 0,
    "end_offset" : 9,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "please pay",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "please pay ",
    "start_offset" : 0,
    "end_offset" : 11,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "please pay f",
    "start_offset" : 0,
    "end_offset" : 12,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "please pay fo",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "please pay for",
    "start_offset" : 0,
    "end_offset" : 14,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "please pay for ",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "please pay for m",
    "start_offset" : 0,
    "end_offset" : 16,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "please pay for mu",
    "start_offset" : 0,
    "end_offset" : 17,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "please pay for mul",
    "start_offset" : 0,
    "end_offset" : 18,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "please pay for mult",
    "start_offset" : 0,
    "end_offset" : 19,
    "type" : "word",
    "position" : 15
  }, {
    "token" : "please pay for multi",
    "start_offset" : 0,
    "end_offset" : 20,
    "type" : "word",
    "position" : 16
  }, {
    "token" : "please pay for multip",
    "start_offset" : 0,
    "end_offset" : 21,
    "type" : "word",
    "position" : 17
  }, {
    "token" : "please pay for multipl",
    "start_offset" : 0,
    "end_offset" : 22,
    "type" : "word",
    "position" : 18
  }, {
    "token" : "please pay for multiple",
    "start_offset" : 0,
    "end_offset" : 23,
    "type" : "word",
    "position" : 19
    "position" : 20
  }, {
    "token" : "please pay for multiple w",
    "start_offset" : 0,
    "end_offset" : 25,
    "type" : "word",
    "position" : 21
  }, {
    "token" : "please pay for multiple wi",
    "start_offset" : 0,
    "end_offset" : 26,
    "type" : "word",
    "position" : 22
  }, {
    "token" : "please pay for multiple win",
    "start_offset" : 0,
    "end_offset" : 27,
    "type" : "word",
    "position" : 23
  }, {
    "token" : "please pay for multiple wins",
    "start_offset" : 0,
    "end_offset" : 28,
    "type" : "word",
    "position" : 24
  }, {
    "token" : "please pay for multiple wins ",
    "start_offset" : 0,
    "end_offset" : 29,
    "type" : "word",
    "position" : 25
  }, {
    "token" : "please pay for multiple wins w",
    "start_offset" : 0,
    "end_offset" : 30,
    "type" : "word",
    "position" : 26
  } ]
}

So your left_tokenizer treats the whole sentence as a single token input and produces edge n-grams of it from 5 up to 30 characters long, which is why it stops at wins (this answers question 3).

As shown above, some tokens are duplicated, because the word_delimiter token filter processes each token coming from the tokenizer in isolation; hence the "duplicates" (answering question 4) and the tokens shorter than 5 characters (answering question 1).
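
For instance, the tokenizer-stage token "please p" gets split by word_delimiter into "please" and "p", which is exactly where the stray one-letter tokens in your output come from. This can be reproduced in isolation with the keyword tokenizer, which passes the whole input through as a single token (same 1.x _analyze assumptions as the sketch above):

 curl -XGET 'localhost:9201/test/_analyze?tokenizer=keyword&filters=custom_word_delimiter&pretty' -d 'please p'
 # expected: two tokens, "please" and "p"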

I don't think this is the way you want it to work, but it's not clear from your question how you'd like it to behave, i.e. what kinds of searches you want to be able to run. What I've provided here is simply an explanation of what you're seeing.
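
That said, if the goal is prefix matching on individual words of at least 5 characters, a common pattern is to tokenize into words first and only then apply an edge n-gram token filter, so that each word is n-grammed separately. A minimal sketch, not taken from the question (the test2 index name and the edge_5_30 filter name are made up for illustration; note that words shorter than min_gram, such as pay and for, will then produce no grams at all):

 PUT /test2
 {
   "settings": {
     "analysis": {
       "filter": {
         "custom_word_delimiter": {
           "type": "word_delimiter"
         },
         "edge_5_30": {
           "type": "edgeNGram",
           "min_gram": 5,
           "max_gram": 30
         }
       },
       "analyzer": {
         "index_edge_ngram": {
           "type": "custom",
           "tokenizer": "whitespace",
           "filter": ["custom_word_delimiter", "edge_5_30"]
         }
       }
     }
   }
 }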
