Question:

Elasticsearch issue with the edgeNGram tokenizer

洪照
2023-03-14

I'm using Elasticsearch to index a database. I'm trying to use the edgeNGram tokenizer to cut strings down into tokens that meet the requirement "new tokens must be longer than 4 characters". I create the index with the following code (the index is closed around the _settings update because analysis settings cannot be changed while it is open):

PUT test
POST /test/_close

PUT /test/_settings
{
    "analysis": {
        "analyzer": {
            "index_edge_ngram": {
                "type": "custom",
                "filter": ["custom_word_delimiter"],
                "tokenizer": "left_tokenizer"
            }
        },
        "filter": {
            "custom_word_delimiter": {
                "type": "word_delimiter",
                "generate_word_parts": "true",
                "generate_number_parts": "true",
                "catenate_words": "false",
                "catenate_numbers": "false",
                "catenate_all": "false",
                "split_on_case_change": "false",
                "preserve_original": "false",
                "split_on_numerics": "true",
                "ignore_case": "true"
            }
        },
        "tokenizer": {
            "left_tokenizer": {
                "max_gram": 30,
                "min_gram": 5,
                "type": "edgeNGram"
            }
        }
    }
}

POST /test/_open

Now I run a test to check the result:

GET /test/_analyze?analyzer=index_edge_ngram&text=please pay for multiple wins with only one payment

and get this result:

{
   "tokens": [
      {
         "token": "pleas",
         "start_offset": 0,
         "end_offset": 5,
         "type": "word",
         "position": 1
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 2
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 3
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 4
      },
      {
         "token": "p",
         "start_offset": 7,
         "end_offset": 8,
         "type": "word",
         "position": 5
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 6
      },
      {
         "token": "pa",
         "start_offset": 7,
         "end_offset": 9,
         "type": "word",
         "position": 7
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 8
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 9
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 10
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 11
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 12
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 13
      },
      {
         "token": "f",
         "start_offset": 11,
         "end_offset": 12,
         "type": "word",
         "position": 14
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 15
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 16
      },
      {
         "token": "fo",
         "start_offset": 11,
         "end_offset": 13,
         "type": "word",
         "position": 17
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 18
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 19
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 20
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 21
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 22
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 23
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 24
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 25
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 26
      },
      {
         "token": "m",
         "start_offset": 15,
         "end_offset": 16,
         "type": "word",
         "position": 27
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 28
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 29
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 30
      },
      {
         "token": "mu",
         "start_offset": 15,
         "end_offset": 17,
         "type": "word",
         "position": 31
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 32
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 33
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 34
      },
      {
         "token": "mul",
         "start_offset": 15,
         "end_offset": 18,
         "type": "word",
         "position": 35
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 36
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 37
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 38
      },
      {
         "token": "mult",
         "start_offset": 15,
         "end_offset": 19,
         "type": "word",
         "position": 39
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 40
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 41
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 42
      },
      {
         "token": "multi",
         "start_offset": 15,
         "end_offset": 20,
         "type": "word",
         "position": 43
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 44
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 45
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 46
      },
      {
         "token": "multip",
         "start_offset": 15,
         "end_offset": 21,
         "type": "word",
         "position": 47
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 48
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 49
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 50
      },
      {
         "token": "multipl",
         "start_offset": 15,
         "end_offset": 22,
         "type": "word",
         "position": 51
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 52
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 53
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 54
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 55
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 56
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 57
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 58
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 59
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 60
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 61
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 62
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 63
      },
      {
         "token": "w",
         "start_offset": 24,
         "end_offset": 25,
         "type": "word",
         "position": 64
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 65
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 66
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 67
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 68
      },
      {
         "token": "wi",
         "start_offset": 24,
         "end_offset": 26,
         "type": "word",
         "position": 69
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 70
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 71
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 72
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 73
      },
      {
         "token": "win",
         "start_offset": 24,
         "end_offset": 27,
         "type": "word",
         "position": 74
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 75
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 76
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 77
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 78
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 79
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 80
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 81
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 82
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 83
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 84
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 85
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 86
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 87
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 88
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 89
      },
      {
         "token": "w",
         "start_offset": 29,
         "end_offset": 30,
         "type": "word",
         "position": 90
      }
   ]
}

1 Answer

钦耀
2023-03-14

When building a custom analyzer, it is worth checking step by step what each stage of the analysis chain produces (each stage can be exercised in isolation, as sketched after this list):

  1. First, the character filters preprocess the raw input text
  2. Then the tokenizer slices and dices that text into tokens
  3. Finally, the token filters take the tokens from step 2 as input and transform them
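
For instance, the tokenizer and the token filters can each be exercised on their own via the _analyze API. A minimal sketch, assuming the 1.x-era query-parameter form of _analyze used throughout this answer (the filters parameter name is an assumption and varies across versions; recent versions take a JSON body instead):

 # tokenizer stage only
 curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&pretty' -d 'please pay'

 # tokenizer plus one token filter
 curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&filters=custom_word_delimiter&pretty' -d 'please pay'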

In your case, if you check the output of the tokenizer stage, it looks like the following. Here we only pass the tokenizer (i.e. left_tokenizer) as a parameter:

 curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&pretty' -d 'please pay for multiple wins with only one payment'
{
  "tokens" : [ {
    "token" : "pleas",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "please",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "please ",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "please p",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "please pa",
    "start_offset" : 0,
    "end_offset" : 9,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "please pay",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "please pay ",
    "start_offset" : 0,
    "end_offset" : 11,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "please pay f",
    "start_offset" : 0,
    "end_offset" : 12,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "please pay fo",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "please pay for",
    "start_offset" : 0,
    "end_offset" : 14,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "please pay for ",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "please pay for m",
    "start_offset" : 0,
    "end_offset" : 16,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "please pay for mu",
    "start_offset" : 0,
    "end_offset" : 17,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "please pay for mul",
    "start_offset" : 0,
    "end_offset" : 18,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "please pay for mult",
    "start_offset" : 0,
    "end_offset" : 19,
    "type" : "word",
    "position" : 15
  }, {
    "token" : "please pay for multi",
    "start_offset" : 0,
    "end_offset" : 20,
    "type" : "word",
    "position" : 16
  }, {
    "token" : "please pay for multip",
    "start_offset" : 0,
    "end_offset" : 21,
    "type" : "word",
    "position" : 17
  }, {
    "token" : "please pay for multipl",
    "start_offset" : 0,
    "end_offset" : 22,
    "type" : "word",
    "position" : 18
  }, {
    "token" : "please pay for multiple",
    "start_offset" : 0,
    "end_offset" : 23,
    "type" : "word",
    "position" : 19
    "position" : 20
  }, {
    "token" : "please pay for multiple w",
    "start_offset" : 0,
    "end_offset" : 25,
    "type" : "word",
    "position" : 21
  }, {
    "token" : "please pay for multiple wi",
    "start_offset" : 0,
    "end_offset" : 26,
    "type" : "word",
    "position" : 22
  }, {
    "token" : "please pay for multiple win",
    "start_offset" : 0,
    "end_offset" : 27,
    "type" : "word",
    "position" : 23
  }, {
    "token" : "please pay for multiple wins",
    "start_offset" : 0,
    "end_offset" : 28,
    "type" : "word",
    "position" : 24
  }, {
    "token" : "please pay for multiple wins ",
    "start_offset" : 0,
    "end_offset" : 29,
    "type" : "word",
    "position" : 25
  }, {
    "token" : "please pay for multiple wins w",
    "start_offset" : 0,
    "end_offset" : 30,
    "type" : "word",
    "position" : 26
  } ]
}

So your left_tokenizer treats the whole sentence as a single token input and produces edge n-grams of it from 5 up to 30 characters long, which is why it stops at wins (this answers question 3).

As shown above, some tokens are duplicated, because the word_delimiter token filter processes each token coming from the tokenizer in isolation; hence the "duplicates" (answering question 4) and the tokens shorter than 5 characters (answering question 1).
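
For instance, the tokenizer-stage token "please p" gets split by word_delimiter into "please" and "p", which is exactly where the stray one-letter tokens in your output come from. This can be reproduced in isolation with the keyword tokenizer, which passes the whole input through as a single token (same 1.x _analyze assumptions as the sketch above):

 curl -XGET 'localhost:9201/test/_analyze?tokenizer=keyword&filters=custom_word_delimiter&pretty' -d 'please p'
 # expected: two tokens, "please" and "p"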

I don't think this is the way you want it to work, but it's not clear from your question how you'd like it to behave, i.e. what kinds of searches you want to be able to run. What I've provided here is simply an explanation of what you're seeing.
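
That said, if the goal is prefix matching on individual words of at least 5 characters, a common pattern is to tokenize into words first and only then apply an edge n-gram token filter, so that each word is n-grammed separately. A minimal sketch, not taken from the question (the test2 index name and the edge_5_30 filter name are made up for illustration; note that words shorter than min_gram, such as pay and for, will then produce no grams at all):

 PUT /test2
 {
   "settings": {
     "analysis": {
       "filter": {
         "custom_word_delimiter": {
           "type": "word_delimiter"
         },
         "edge_5_30": {
           "type": "edgeNGram",
           "min_gram": 5,
           "max_gram": 30
         }
       },
       "analyzer": {
         "index_edge_ngram": {
           "type": "custom",
           "tokenizer": "whitespace",
           "filter": ["custom_word_delimiter", "edge_5_30"]
         }
       }
     }
   }
 }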
