Edge NGram with phrase matching

何飞翰
2023-03-14
Question

I need to autocomplete phrases. For example, when I search for "dementia in alz", I would like to get documents containing "dementia in alzheimer's".

For that purpose I configured an Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body, but I can't get any results when trying to match a phrase.

What am I doing wrong?

My query:

{
  "query":{
    "multi_match":{
      "query":"dementia in alz",
      "type":"phrase",
      "analyzer":"edge_ngram_analyzer",
      "fields":["_all"]
    }
  }
}

My mapping:

...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
  ...

My documents:

{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfHfBE5CzEm8aJ3Xp", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:48.665Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1400541", 
    "Diagnosis": "F00.0 -  Dementia in Alzheimer's disease with early onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}, 
{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfICrE5CzEm8aJ4Dc", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:51.240Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1424351", 
    "Diagnosis": "F00.1 -  Dementia in Alzheimer's disease with late onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}

Analysis of the phrase "dementia in alzheimer":

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "de", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 4, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 0, 
      "position": 2
    }, 
    {
      "end_offset": 5, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 0, 
      "position": 3
    }, 
    {
      "end_offset": 6, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 0, 
      "position": 4
    }, 
    {
      "end_offset": 7, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 0, 
      "position": 5
    }, 
    {
      "end_offset": 8, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 0, 
      "position": 6
    }, 
    {
      "end_offset": 9, 
      "token": "dementia ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 7
    }, 
    {
      "end_offset": 10, 
      "token": "dementia i", 
      "type": "word", 
      "start_offset": 0, 
      "position": 8
    }, 
    {
      "end_offset": 11, 
      "token": "dementia in", 
      "type": "word", 
      "start_offset": 0, 
      "position": 9
    }, 
    {
      "end_offset": 12, 
      "token": "dementia in ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 10
    }, 
    {
      "end_offset": 13, 
      "token": "dementia in a", 
      "type": "word", 
      "start_offset": 0, 
      "position": 11
    }, 
    {
      "end_offset": 14, 
      "token": "dementia in al", 
      "type": "word", 
      "start_offset": 0, 
      "position": 12
    }, 
    {
      "end_offset": 15, 
      "token": "dementia in alz", 
      "type": "word", 
      "start_offset": 0, 
      "position": 13
    }, 
    {
      "end_offset": 16, 
      "token": "dementia in alzh", 
      "type": "word", 
      "start_offset": 0, 
      "position": 14
    }, 
    {
      "end_offset": 17, 
      "token": "dementia in alzhe", 
      "type": "word", 
      "start_offset": 0, 
      "position": 15
    }, 
    {
      "end_offset": 18, 
      "token": "dementia in alzhei", 
      "type": "word", 
      "start_offset": 0, 
      "position": 16
    }, 
    {
      "end_offset": 19, 
      "token": "dementia in alzheim", 
      "type": "word", 
      "start_offset": 0, 
      "position": 17
    }, 
    {
      "end_offset": 20, 
      "token": "dementia in alzheime", 
      "type": "word", 
      "start_offset": 0, 
      "position": 18
    }, 
    {
      "end_offset": 21, 
      "token": "dementia in alzheimer", 
      "type": "word", 
      "start_offset": 0, 
      "position": 19
    }
  ]
}
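
For reference, a token listing like the one above can be reproduced with the _analyze API. A minimal sketch, assuming the carenotes index shown in the documents above (on older clusters the analyzer and text may need to be passed as query-string parameters instead):

POST /carenotes/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "dementia in alzheimer"
}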

Answer:

Many thanks to rendel, who helped me find the right solution!

The solution proposed by Andrei Stefan is not optimal.

Why? First, the absence of a lowercase filter in the search analyzer makes searching inconvenient: the case has to match exactly. A custom analyzer with a lowercase filter is needed instead of "analyzer": "keyword".
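
For illustration, a minimal sketch of such an analyzer; the name lowercase_keyword_analyzer is just a placeholder, and the final settings below go further and use the standard tokenizer plus a possessive stemmer:

"lowercase_keyword_analyzer": {
  "type": "custom",
  "tokenizer": "keyword",
  "filter": ["lowercase"]
}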

Second, the analysis part is wrong! At index time the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed with edge_ngram_analyzer. With that analyzer, the analyzed string produces the following list of tokens:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "f0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "f00", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 6, 
      "token": "0 ", 
      "type": "word", 
      "start_offset": 4, 
      "position": 2
    }, 
    {
      "end_offset": 9, 
      "token": "  ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 3
    }, 
    {
      "end_offset": 10, 
      "token": "  d", 
      "type": "word", 
      "start_offset": 7, 
      "position": 4
    }, 
    {
      "end_offset": 11, 
      "token": "  de", 
      "type": "word", 
      "start_offset": 7, 
      "position": 5
    }, 
    {
      "end_offset": 12, 
      "token": "  dem", 
      "type": "word", 
      "start_offset": 7, 
      "position": 6
    }, 
    {
      "end_offset": 13, 
      "token": "  deme", 
      "type": "word", 
      "start_offset": 7, 
      "position": 7
    }, 
    {
      "end_offset": 14, 
      "token": "  demen", 
      "type": "word", 
      "start_offset": 7, 
      "position": 8
    }, 
    {
      "end_offset": 15, 
      "token": "  dement", 
      "type": "word", 
      "start_offset": 7, 
      "position": 9
    }, 
    {
      "end_offset": 16, 
      "token": "  dementi", 
      "type": "word", 
      "start_offset": 7, 
      "position": 10
    }, 
    {
      "end_offset": 17, 
      "token": "  dementia", 
      "type": "word", 
      "start_offset": 7, 
      "position": 11
    }, 
    {
      "end_offset": 18, 
      "token": "  dementia ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 12
    }, 
    {
      "end_offset": 19, 
      "token": "  dementia i", 
      "type": "word", 
      "start_offset": 7, 
      "position": 13
    }, 
    {
      "end_offset": 20, 
      "token": "  dementia in", 
      "type": "word", 
      "start_offset": 7, 
      "position": 14
    }, 
    {
      "end_offset": 21, 
      "token": "  dementia in ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 15
    }, 
    {
      "end_offset": 22, 
      "token": "  dementia in a", 
      "type": "word", 
      "start_offset": 7, 
      "position": 16
    }, 
    {
      "end_offset": 23, 
      "token": "  dementia in al", 
      "type": "word", 
      "start_offset": 7, 
      "position": 17
    }, 
    {
      "end_offset": 24, 
      "token": "  dementia in alz", 
      "type": "word", 
      "start_offset": 7, 
      "position": 18
    }, 
    {
      "end_offset": 25, 
      "token": "  dementia in alzh", 
      "type": "word", 
      "start_offset": 7, 
      "position": 19
    }, 
    {
      "end_offset": 26, 
      "token": "  dementia in alzhe", 
      "type": "word", 
      "start_offset": 7, 
      "position": 20
    }, 
    {
      "end_offset": 27, 
      "token": "  dementia in alzhei", 
      "type": "word", 
      "start_offset": 7, 
      "position": 21
    }, 
    {
      "end_offset": 28, 
      "token": "  dementia in alzheim", 
      "type": "word", 
      "start_offset": 7, 
      "position": 22
    }, 
    {
      "end_offset": 29, 
      "token": "  dementia in alzheime", 
      "type": "word", 
      "start_offset": 7, 
      "position": 23
    }, 
    {
      "end_offset": 30, 
      "token": "  dementia in alzheimer", 
      "type": "word", 
      "start_offset": 7, 
      "position": 24
    }, 
    {
      "end_offset": 33, 
      "token": "s ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 25
    }, 
    {
      "end_offset": 34, 
      "token": "s d", 
      "type": "word", 
      "start_offset": 31, 
      "position": 26
    }, 
    {
      "end_offset": 35, 
      "token": "s di", 
      "type": "word", 
      "start_offset": 31, 
      "position": 27
    }, 
    {
      "end_offset": 36, 
      "token": "s dis", 
      "type": "word", 
      "start_offset": 31, 
      "position": 28
    }, 
    {
      "end_offset": 37, 
      "token": "s dise", 
      "type": "word", 
      "start_offset": 31, 
      "position": 29
    }, 
    {
      "end_offset": 38, 
      "token": "s disea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 30
    }, 
    {
      "end_offset": 39, 
      "token": "s diseas", 
      "type": "word", 
      "start_offset": 31, 
      "position": 31
    }, 
    {
      "end_offset": 40, 
      "token": "s disease", 
      "type": "word", 
      "start_offset": 31, 
      "position": 32
    }, 
    {
      "end_offset": 41, 
      "token": "s disease ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 33
    }, 
    {
      "end_offset": 42, 
      "token": "s disease w", 
      "type": "word", 
      "start_offset": 31, 
      "position": 34
    }, 
    {
      "end_offset": 43, 
      "token": "s disease wi", 
      "type": "word", 
      "start_offset": 31, 
      "position": 35
    }, 
    {
      "end_offset": 44, 
      "token": "s disease wit", 
      "type": "word", 
      "start_offset": 31, 
      "position": 36
    }, 
    {
      "end_offset": 45, 
      "token": "s disease with", 
      "type": "word", 
      "start_offset": 31, 
      "position": 37
    }, 
    {
      "end_offset": 46, 
      "token": "s disease with ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 38
    }, 
    {
      "end_offset": 47, 
      "token": "s disease with e", 
      "type": "word", 
      "start_offset": 31, 
      "position": 39
    }, 
    {
      "end_offset": 48, 
      "token": "s disease with ea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 40
    }, 
    {
      "end_offset": 49, 
      "token": "s disease with ear", 
      "type": "word", 
      "start_offset": 31, 
      "position": 41
    }, 
    {
      "end_offset": 50, 
      "token": "s disease with earl", 
      "type": "word", 
      "start_offset": 31, 
      "position": 42
    }, 
    {
      "end_offset": 51, 
      "token": "s disease with early", 
      "type": "word", 
      "start_offset": 31, 
      "position": 43
    }, 
    {
      "end_offset": 52, 
      "token": "s disease with early ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 44
    }, 
    {
      "end_offset": 53, 
      "token": "s disease with early o", 
      "type": "word", 
      "start_offset": 31, 
      "position": 45
    }, 
    {
      "end_offset": 54, 
      "token": "s disease with early on", 
      "type": "word", 
      "start_offset": 31, 
      "position": 46
    }, 
    {
      "end_offset": 55, 
      "token": "s disease with early ons", 
      "type": "word", 
      "start_offset": 31, 
      "position": 47
    }, 
    {
      "end_offset": 56, 
      "token": "s disease with early onse", 
      "type": "word", 
      "start_offset": 31, 
      "position": 48
    }
  ]
}

As you can see, the whole string is tokenized with tokens from 2 to 25 characters long. The string is tokenized in a linear way, spaces included, and the position is incremented by one for every new token.

There are several problems with this:

  1. The edge_ngram_analyzer produces useless tokens that will never be searched for, e.g. "0 ", "  ", "  d", "s d", "s disease w", and so on.
  2. At the same time, it does not produce many of the useful tokens, e.g. "disease", "early onset", etc. Searching for any of those words returns 0 results.
  3. Note that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25" we have "lost" part of the text in all fields. That text can no longer be searched, because there is no token for it.
  4. A trim filter only obfuscates the problem of filtering out the extra spaces, which could instead be handled by the tokenizer.
  5. The edge_ngram_analyzer increments the position of every token, which is a problem for positional queries such as phrase queries. One should use an edge_ngram_filter instead, which preserves the position of the token when generating the ngrams.

The mapping and settings to use:

...
"mappings": {
    "Type": {
       "_all":{
          "analyzer": "edge_ngram_analyzer", 
          "search_analyzer": "keyword_analyzer"
        }, 
        "properties": {
          "Field": {
            "search_analyzer": "keyword_analyzer",
             "type": "string",
             "analyzer": "edge_ngram_analyzer"
          },
...
...
"settings": {
   "analysis": {
      "filter": {
         "english_poss_stemmer": {
            "type": "stemmer",
            "name": "possessive_english"
         },
         "edge_ngram": {
           "type": "edgeNGram",
           "min_gram": "2",
           "max_gram": "25",
           "token_chars": ["letter", "digit"]
         }
      },
      "analyzer": {
         "edge_ngram_analyzer": {
           "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
           "tokenizer": "standard"
         },
         "keyword_analyzer": {
           "filter": ["lowercase", "english_poss_stemmer"],
           "tokenizer": "standard"
         }
      }
   }
}
...
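
Putting the excerpts together, a complete index-creation request might look like the sketch below. The index name carenotes and the Diagnosis type and field are taken from the documents above; the string/_all/edgeNGram syntax assumes the Elasticsearch 2.x-era cluster used in this question (newer versions use text and no longer support _all):

PUT /carenotes
{
  "settings": {
    "analysis": {
      "filter": {
        "english_poss_stemmer": {
          "type": "stemmer",
          "name": "possessive_english"
        },
        "edge_ngram": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25",
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"]
        },
        "keyword_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "english_poss_stemmer"]
        }
      }
    }
  },
  "mappings": {
    "Diagnosis": {
      "_all": {
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "keyword_analyzer"
      },
      "properties": {
        "Diagnosis": {
          "type": "string",
          "analyzer": "edge_ngram_analyzer",
          "search_analyzer": "keyword_analyzer"
        }
      }
    }
  }
}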

Look at the analysis now:

{
  "tokens": [
    {
      "end_offset": 5, 
      "token": "f0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00.", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00.0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 17, 
      "token": "de", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 20, 
      "token": "in", 
      "type": "word", 
      "start_offset": 18, 
      "position": 3
    }, 
    {
      "end_offset": 32, 
      "token": "al", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alz", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzh", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzhe", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzhei", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheim", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheime", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheimer", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 40, 
      "token": "di", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "dis", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "dise", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "disea", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "diseas", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "disease", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 45, 
      "token": "wi", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 45, 
      "token": "wit", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 45, 
      "token": "with", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 51, 
      "token": "ea", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "ear", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "earl", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "early", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 57, 
      "token": "on", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "ons", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "onse", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "onset", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }
  ]
}

At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only for whole words. At search time the text is tokenized by the standard tokenizer, then the separate words are filtered by lowercase and possessive_english. The searched words are matched against the tokens that were created at index time.
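
As an illustration (same assumptions as the _analyze sketch earlier), analyzing the search text with keyword_analyzer shows that only whole, lowercased words are produced:

POST /carenotes/_analyze
{
  "analyzer": "keyword_analyzer",
  "text": "dem in alzh"
}

This yields the tokens "dem", "in" and "alzh" at consecutive positions, which line up against the indexed edge-ngrams of "dementia", "in" and "alzheimer", so a phrase query over them can match.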

Thus we make incremental search possible!

Now, because the ngrams are built on separate words, we can even execute queries like

{
  "query": {
    "multi_match": {
      "query": "dem in alzh",
      "type": "phrase",
      "fields": ["_all"]
    }
  }
}

and get correct results.

No text is “lost”, everything is searchable, and there is no need to deal with spaces using a trim filter anymore.


