Question:

Elasticsearch auto-suggest based on a prefix and a custom tokenizer

喻高寒
2023-03-14
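I have the following n-gram filter and analyzer configured for autocomplete: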
"nGram_filter": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
"nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        }

But I need to add another feature: a prefix filter. For example, when I search for test_table (10 characters) I get results, because the maximum n-gram length is 10. When I try test_table_for, however, I get zero results, because the record "test_table_for analyzers" has no such token.

How can I add a prefix-based filter on top of the existing n-gram analyzer? That is, I should still get matches for search strings of up to 10 characters (which works today), and I should also get suggestions when the search string matches a record from its beginning.

1 Answer

柯昱
2023-03-14

This is not possible with a single analyzer. You have to create another field that produces edge_ngram tokens for the prefix search. The index mapping below adds such a field and also keeps your current n-gram analyzer.

Index mapping

{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 30
                },
                "nGram_filter": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit",
                        "punctuation",
                        "symbol"
                    ]
                }
            },
            "analyzer": {
                "prefixanalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter"
                    ]
                },
                "ngramanalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "nGram_filter"
                    ]
                }
            }
        },
        "index.max_ngram_diff" : 30
    },
    "mappings": {
        "properties": {
            "title_prefix": {
                "type": "text",
                "analyzer": "prefixanalyzer",
                "search_analyzer": "standard"
            },
            "title" :{
                "type": "text",
                "analyzer": "ngramanalyzer",
                "search_analyzer": "standard"
            }
        }
    }
}
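To reproduce the result shown further down, index a document with the same value in both fields (the index name so_63981157 and document ID 1 match the search response below):

PUT /so_63981157/_doc/1
{
    "title_prefix": "test_table_for analyzers",
    "title": "test_table_for analyzers"
}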

You can now use the _analyze API (for example, POST /so_63981157/_analyze) to confirm the prefix tokens:

{
    "analyzer": "prefixanalyzer",
    "text" : "test_table_for analyzers"
}
{"tokens":[{"token":"t","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"te","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"tes","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_t","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_ta","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_tab","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_tabl","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_table","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_table_","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_table_f","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_table_fo","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"test_table_for","start_offset":0,"end_offset":14,"type":"<ALPHANUM>","position":0},{"token":"a","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1},{"token":"an","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1},{"token":"ana","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1},{"token":"anal","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1},{"token":"analy","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1},{"token":"analyz","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1},{"token":"analyze","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1},{"token":"analyzer","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1},{"token":"analyzers","start_offset":15,"end_offset":24,"type":"<ALPHANUM>","position":1}]}
{
    "query": {
        "multi_match": {
            "query": "test_table_for",
            "fields": [
                "title",
                "title_prefix"
            ]
        }
    }
}
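Running this search returns the indexed document: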
 "hits": [
            {
                "_index": "so_63981157",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.45920232,
                "_source": {
                    "title_prefix": "test_table_for analyzers",
                    "title": "test_table_for analyzers"
                }
            }
        ]
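Short queries of 3 to 10 characters match through the n-gram title field as before, while longer prefixes such as test_table_for now match through the edge_ngram title_prefix field, so both requirements from the question are covered. If you would rather not send the same value twice at index time, copy_to could populate the prefix field automatically; a minimal sketch, reusing the field names from the mapping above:

"title": {
    "type": "text",
    "analyzer": "ngramanalyzer",
    "search_analyzer": "standard",
    "copy_to": "title_prefix"
}

With this, indexing only title still generates the edge n-gram tokens in title_prefix.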