03.analyzer-token_filter

全鸿晖
2023-12-01

1. Token filters overview

  The name token filter directly expresses what it does: it performs add, remove, and change operations on the token stream, and it is the last step of an analyzer. For example, a lowercase token filter turns every token into lowercase; a stop token filter removes stop words based on a word list or rules; a synonym token filter adds synonym tokens to the token stream.

Each analyzer can have zero or more token filters.
A token filter does not change the position information that the tokenizer recorded for each term.

There are far more kinds of token filters than tokenizers. Only some common ones are covered here; more can be added later as needed.

  1. lowercase token filter
  2. stemmer token filter
  3. stop
  4. synonym
  5. remove_duplicates

These are the main ones covered. (See the list above; quite a few more were added later.)

1. lowercase token filter

As the name suggests, this turns uppercase into lowercase. Besides English, it also supports Greek, Irish, and Turkish.

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}

Returns
[ the, quick, fox, jumps ]

Some custom configuration examples:

PUT lowercase_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "whitespace_lowercase" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase"]
                }
            }
        }
    }
}

PUT custom_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}
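
To verify the custom analyzer, you can run _analyze against the index just created (a quick sketch; the expected output assumes whitespace tokenization followed by lowercasing):

GET lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "THE Quick FoX JUMPs"
}

Expected: [ the, quick, fox, jumps ]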

2. stemmer token filter

This filter performs stemming, and it ships with stemmers for many different languages, which is quite powerful.
There are many stemming rules; some stemmers work algorithmically, others are dictionary based.

PUT /my_index
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "my_stemmer"]
                }
            },
            "filter" : {
                "my_stemmer" : {
                    "type" : "stemmer",
                    "name" : "english"
                }
            }
        }
    }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["i am going to shoping"]
}

Returns
[ i, am, go, to, shope ]
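
The stemmer can also be specified inline in _analyze without creating an index; a sketch (the "language" parameter is the documented option, and the expected output is my estimate of what the english stemmer produces):

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", { "type": "stemmer", "language": "english" } ],
  "text": "the quick brown foxes jumped over the lazy dog"
}

Expected, roughly: [ the, quick, brown, fox, jump, over, the, lazi, dog ]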

3. stop

1. By default, the following words are removed:

a, an, and, are, as, at, be, but, by, for, if, in, into,
 is, it, no, not, of, on, or, such, that, the, their,
 then, there, these, they, this, to, was, will, with

2. Use it when creating an analyzer
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stop" ]
        }
      }
    }
  }
}

3. Configuration

stopwords: the stop words to use; you can reference a predefined set (such as _english_; the stop token filter documentation lists the predefined sets) or define a word list directly here.
stopwords_path: load the stop words from a file.
ignore_case: whether matching ignores case, true|false.
remove_trailing: true|false, whether the last token of the token stream is removed if it is a stop word; defaults to true. When using the completion suggester it should be set to false so that suggestions work better (see the sketch below).
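
For example, a sketch combining the predefined _english_ set with remove_trailing disabled (the index name my_suggest_index is made up):

PUT /my_suggest_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_english_stop" ]
        }
      },
      "filter": {
        "my_english_stop": {
          "type": "stop",
          "stopwords": "_english_",
          "remove_trailing": false
        }
      }
    }
  }
}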

A custom example:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is", "the" ]
        }
      }
    }
  }
}


4. synonym

This one is quite useful and also quite interesting. It generates synonyms for tokens, which helps at query time. It is not very flexible when you have a lot of synonyms, though; in that case you may want to load them from a configuration file.

1. Configuration

synonyms: specify the synonyms inline; see the examples below.
synonyms_path: load the synonyms from a file (see the sketch after this list).
expand: defaults to true.
lenient: defaults to false; if true, errors encountered while parsing the synonym configuration are ignored.
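
For synonyms_path, the path is resolved relative to the Elasticsearch config directory; a sketch (index name and file name are assumptions). The examples below go back to inline synonyms.

PUT /synonym_file_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "file_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [ "file_synonym" ]
        }
      },
      "filter": {
        "file_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt",
          "lenient": true
        }
      }
    }
  }
}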

PUT /test_index02
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : ["my_stop", "synonym"]
                    }
                },
                "filter" : {
                    "my_stop": {
                        "type" : "stop",
                        "stopwords": ["bar"]
                    },
                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true
                        ,
                        "synonyms" : ["foo, bar => baz"]
                    }
                }
            }
        }
    }
}

With the configuration above:
1. bar is removed by the stop filter, so only foo => baz is actually added.
2. If the synonym rule is changed to foo, baz => bar, no synonym mapping is added at all.
3. If lenient is set to false, index creation fails with an error:

{
  "error": {
    "root_cause": [
	...
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "parse_exception: Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: bar was completely eliminated by analyzer"
      }
    }
  },
  "status": 400
}

Testing the mapping-style rule:


PUT /test_index02
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : [ "synonym"]
                    }
                },
                "filter" : {

                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true
                        ,
                        "synonyms" : ["foo, bar => baz"]
                    }
                }
            }
        }
    }
}

GET test_index02/_analyze
{
  "analyzer": "synonym",
  "text": ["i am foo bar"]
}

[ i, am, baz, baz ]


As you can see, this works like a map: one token is simply mapped to another token.

2. What expand does

Synonyms can be configured in two ways. One is the explicit mapping shown above:
"synonyms" : ["foo, bar => baz"]
The other is an equivalence list:
"synonyms" : ["foo, bar , baz"]
For the second form, if expand is false the rule collapses to foo, bar, baz => foo, i.e. the first word becomes the mapping target.
If expand is true, every word is expanded to all the words in the group, i.e. foo, bar, baz => foo, bar, baz.


PUT /test_index04
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : [ "synonym"]
                    }
                },
                "filter" : {

                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "expand": false
                        ,
                        "synonyms" : ["foo, bar , baz"]
                    }
                }
            }
        }
    }
}

GET test_index04/_analyze
{
  "analyzer": "synonym",
  "text": ["i am baz foo bar"]
}

[ i, am, foo, foo, foo ]

As you can see, this is equivalent to "synonyms" : ["bar, baz => foo"].
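
To check that, you can spell the rule out explicitly; a sketch (test_index05 is a made-up index name), which should return the same [ i, am, foo, foo, foo ] for the same text:

PUT /test_index05
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": [ "synonym" ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "lenient": true,
            "synonyms": [ "bar, baz => foo" ]
          }
        }
      }
    }
  }
}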

3. Example walkthrough
PUT /test_index03
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : [ "synonym"]
                    }
                },
                "filter" : {

                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true
                        ,
                        "synonyms" : ["foo, bar , baz"]
                    }
                }
            }
        }
    }
}

GET test_index03/_analyze
{
  "analyzer": "synonym",
  "text": ["i am foo bar"]
}

[ i, am, foo, bar, baz, bar, foo, baz ]

This shows the mutual mapping: foo is expanded to bar and baz, and bar is expanded to foo and baz.

Note: the synonym filter cannot be used together with filters that stack tokens, i.e. filters that produce several different tokens at the same position.
Elasticsearch uses the token filters defined before the synonym filter to process the synonym list configured on the synonym filter. For example, with filter: ["stemmer", "synonym"], the stemmer is also applied to the synonym entries (whether they are configured inline or loaded from a file). Because the entries in a synonym list do not carry positional relationships between words the way a normal document does, some combinations can cause problems.
Filters defined before synonym have to choose which token to emit when parsing the synonyms: the asciifolding filter, for example, emits only the folded version, whereas multiplexer, word_delimiter_graph and the like throw an error.
If the filters before synonym only ever emit a single token, there is nothing to worry about; stemmer, for instance, does not cause an error (although some words in your synonym list may get stemmed into different words).

The corresponding original English documentation:

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to
parse the entries in a synonym file. 
So, for example, if a synonym filter is placed after a stemmer,
 then the stemmer will also be applied to the synonym entries. 
Because entries in the synonym map cannot have stacked positions, 
some token filters may cause issues here. 
Token filters that produce multiple versions of a token may choose 
which version of the token to emit when parsing synonyms, 
e.g. asciifolding will only produce the folded version of the token.
 Others, e.g. multiplexer, word_delimiter_graph or ngram will throw an error.

Experiment 1: asciifolding filter



PUT /asciifold_only
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "standard_asciifolding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_ascii_folding"]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            }
        }
    }
}


GET asciifold_only/_analyze
{
  "analyzer" : "standard_asciifolding",
  "text" : ["açaí à  foo "]
}

Returns
[ acai, açaí, a, à, foo ]

Here preserve_original is enabled, so the original tokens are kept alongside the folded ones.



PUT /asciifold_seperate_synonym
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "standard_asciifolding" : {
                    "tokenizer" : "standard",
                    "filter" : ["synonym"]
                }
            },
            "filter" : {
                "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "synonyms" : ["açaí , acai ,foo ,baz"]
                    }
            }
        }
    }
}


GET asciifold_seperate_synonym/_analyze
{
  "analyzer" : "standard_asciifolding",
  "text" : ["acai, açaí, a, à, foo "]
}

Returns
[ acai  açaí foo baz  açaí acai foo baz a à foo açaí acai baz ]

Now let's combine the two and see what happens.


PUT /asciifold_example_syn
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "standard_asciifolding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_ascii_folding","synonym"]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                },
                "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "synonyms" : ["açaí ,foo ,baz"]
                    }
            }
        }
    }
}
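
The original note omits the _analyze request here; judging by the output below, it was presumably the same call as in the first asciifolding experiment (text "açaí à  foo "), run against the combined index:

GET asciifold_example_syn/_analyze
{
  "analyzer" : "standard_asciifolding",
  "text" : ["açaí à  foo "]
}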

Returns
[ acai foo baz açaí a à foo acai baz]

You can see that quite a few tokens are missing compared to the previous output. To make the comparison easier, here are both outputs together:

[ acai  açaí foo baz  açaí acai foo baz a à foo açaí acai baz ]
[ acai       foo baz  açaí              a à foo      acai baz]


In other words, the filters before synonym must not be the kind that produce additional tokens at the same position.


PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : ["keyword_repeat", "synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "synonyms" : ["foo, bar => baz"]
                    }
                }
            }
        }
    }
}

This fails with an error:
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[ES02][10.76.3.145:12300][indices:admin/create]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [keyword_repeat] cannot be used to parse synonyms"
  },
  "status": 400
}


PUT myy_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "mul_analyzer":{
         "tokenizer":"standard",
         "filter":["my_mul_filter","synonym"]
        }
      },
      "filter": {
        "my_mul_filter":{
          "type":"multiplexer",
          "filters":["lowercase","uppercase","synonym"]
        },
        "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "synonyms" : ["foo, bar => baz"]
                    }
      }
    }
  }
}

This also fails with an error:
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[ES02][10.76.3.145:12300][indices:admin/create]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [my_mul_filter] cannot be used to parse synonyms"
  },
  "status": 400
}

5. remove_duplicates token filter

This removes duplicate tokens, but only tokens that sit at the same position are removed.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer"
  ],
  "text": "jumping dog"
}

The output here is [ jumping, jump, dog, dog ]: keyword_repeat emits each token twice, once flagged as a keyword, and the stemmer skips keyword tokens, so the keyword copy of jumping stays unchanged while the other copy becomes jump; dog is already a stem, so both copies come out identical.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer",
    "remove_duplicates"
  ],
  "text": "jumping dog"
}

The output now becomes [ jumping, jump, dog ]: the duplicate tokens at the same position have been removed.

6. unique token filter

Removes duplicate tokens from the token stream.

GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["unique"],
  "text" : "the quick fox jumps the lazy fox"
}

Returns
[ the, quick, fox, jumps, lazy ]

It also has one configuration option:
only_on_same_position: true|false. If set to true, duplicates are only removed when they occur at the same position, which makes it behave effectively like the remove_duplicates filter.
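
A sketch reproducing the earlier keyword_repeat + stemmer example with unique instead of remove_duplicates (the expected output is my assumption based on the option's description):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer",
    { "type": "unique", "only_on_same_position": true }
  ],
  "text": "jumping dog"
}

Expected: [ jumping, jump, dog ]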

7. keyword repeat token filter

The keyword_repeat token filter outputs each incoming token once as a keyword and once as a non-keyword.
It is usually combined with a stemmer-type filter, because the stemmer does not process tokens flagged as keywords.
For words that are already stems the stemmer changes nothing, so those words would be emitted twice, completely identical.
Therefore a unique filter with only_on_same_position set to true, or a remove_duplicates filter, is usually added to remove the unnecessary duplicates.
(See the official keyword_repeat documentation for the above.)

Also, make sure the keyword_repeat filter is placed before the stemmer, otherwise it will not have the intended effect.

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}


PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "porter_stem",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}

8. keyword_marker filter

This marks the specified tokens as keywords, so that stemmer-type filters no longer process them.
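
A quick sketch: mark jumping as a keyword so the stemmer leaves it alone (the expected output assumes the other words are stemmed as usual):

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    { "type": "keyword_marker", "keywords": [ "jumping" ] },
    "stemmer"
  ],
  "text": "fox running and jumping"
}

Expected: [ fox, run, and, jumping ]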

9. n-grams token filter

This breaks the tokens down further into character n-grams (by default grams of 1 and 2 characters).

GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "ngram" ],
  "text": "Quick fox"
}

Returns
[ Q, Qu, u, ui, i, ic, c, ck, k, f, fo, o, ox, x ]
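
min_gram and max_gram can also be customized inline; a sketch producing only 3-character grams (the expected output is my own working-out):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "ngram", "min_gram": 3, "max_gram": 3 }
  ],
  "text": "Quick fox"
}

Expected: [ Qui, uic, ick, fox ]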

10. edge_ngram token filter

Similar to the previous one, but the grams are always anchored at the start of each token.

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "the quick brown fox jumps"
}
Returns
[ t, th, q, qu, b, br, f, fo, j, ju ]

You can see that everything beyond max_gram characters is simply discarded.

11. multiplexer filter

This one is powerful. The filter itself does no processing, but it can contain several filter chains, giving you something like parallel stream processing: each chain processes the token stream independently, and the results are then merged.
After merging, duplicate tokens at the same position are removed.
Configuration options of the multiplexer filter:
filters: the filter chains to run.
preserve_original: whether the original token is added to the filtered token stream; defaults to true.


PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "mul_analyzer":{
         "tokenizer":"standard",
         "filter":["my_mul_filter"]
        }
      },
      "filter": {
        "my_mul_filter":{
          "type":"multiplexer",
          "filters":["lowercase","uppercase"]
        }
      }
    }
  }
}


GET my_index/_analyze
{
  "analyzer": "mul_analyzer",
  "text": [" i WANT to go "]
}

Returns
[ i, I, want, WANT, to, TO, go, GO ]

You can see the two chains run in parallel. Each chain can also contain multiple filters:

PUT /multiplexer_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "my_multiplexer" ]
                }
            },
            "filter" : {
                "my_multiplexer" : {
                    "type" : "multiplexer",
                    "filters" : [ "lowercase", "lowercase, porter_stem" ]
                }
            }
        }
    }
}

POST /multiplexer_example/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "Going HOME"
}


Returns
[ Going, going, go, home, Home ]

The synonym and synonym_graph filters use their preceding analysis chain to parse and analyse their synonym lists, 
and will throw an exception if that chain contains token filters that produce multiple tokens at the same position. 
If you want to apply synonyms to a token stream containing a multiplexer,
then you should append the synonym filter to each relevant multiplexer filter list,
rather than placing it after the multiplexer in the main token chain definition.
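
Following that advice, a sketch of putting the synonym filter inside the multiplexer's own filter chains instead of after it (index and filter names are made up):

PUT /multiplexer_synonym_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "mul_syn_analyzer": {
          "tokenizer": "standard",
          "filter": [ "my_mul_filter" ]
        }
      },
      "filter": {
        "my_mul_filter": {
          "type": "multiplexer",
          "filters": [ "lowercase", "lowercase, my_synonym" ]
        },
        "my_synonym": {
          "type": "synonym",
          "lenient": true,
          "synonyms": [ "foo, bar => baz" ]
        }
      }
    }
  }
}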
  


Related notes:
synonym filter
stemmer filter
the multiplexer filter can help with parsing synonyms: https://www.elastic.co/guide/en/elasticsearch/reference/7.3/_parsing_synonym_files.html
by default the unique token filter applies to the entire token stream, not just to tokens at the same position

12. shingle filter

I originally thought this one wasn't much use, but it kept showing up in later documents, so I came back and went through it again so it wouldn't get in the way of understanding the later material.
The shingle filter joins several consecutive tokens in the token stream into a new token, which works better for match_phrase queries. That said, the official docs actually advise against using the shingle filter directly to produce phrases; the better approach is to enable index_phrases on the field.
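
For reference, a sketch of what the index_phrases alternative looks like in a mapping (the index and field names are just examples):

PUT /my_phrase_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "index_phrases": true
      }
    }
  }
}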

A simple usage example:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 3,
      "output_unigrams": false
    }
  ],
  "text": "quick brown fox jumps"
}

Returns
[ quick brown, quick brown fox, brown fox, brown fox jumps, fox jumps ]

You can see that every returned token is a combination of 2 or 3 of the original words.

The shingle filter has the following configuration options:
max_shingle_size: the maximum number of original tokens joined to form a shingle token; defaults to 2.
min_shingle_size: the minimum number of original tokens joined to form a shingle token; defaults to 2.
output_unigrams: whether the original tokens are output as well; defaults to true.
output_unigrams_if_no_shingles: when output_unigrams is false and no shingle tokens can be produced, setting this to true outputs the original tokens instead; it has no effect when output_unigrams is true.
token_separator: the separator used when joining the original tokens into a new shingle token; defaults to a space.
filler_token: mainly useful together with a stop filter; tokens removed by the stop filter are replaced in shingles by this string, which defaults to "_". See the sketch below.
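
To see filler_token in action, a sketch combining it with a stop filter (the exact output is my guess; the point is that shingles spanning the removed word contain the + placeholder, e.g. something like "jumps +" instead of "jumps a"):

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    { "type": "stop", "stopwords": [ "a" ] },
    { "type": "shingle", "filler_token": "+" }
  ],
  "text": "fox jumps a lazy dog"
}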
