Elasticsearch ships with eight built-in analyzers, each suited to different scenarios.
When no analyzer is specified explicitly, the standard analyzer is used as the default. It tokenizes text by grammar, based on the Unicode Text Segmentation algorithm, and works well for most languages.
//Test the default tokenization of the standard analyzer
//Request
POST _analyze
{
"analyzer": "standard",
"text": "transimission control protocol is a transport layer protocol"
}
//Response
{
"tokens" : [
{
"token" : "transimission",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "control",
"start_offset" : 14,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "protocol",
"start_offset" : 22,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "is",
"start_offset" : 31,
"end_offset" : 33,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "a",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "transport",
"start_offset" : 36,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "layer",
"start_offset" : 46,
"end_offset" : 51,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "protocol",
"start_offset" : 52,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 7
}
]
}
The sentence above is tokenized into the following terms (the misspelling "transimission" comes from the sample text itself and is reproduced faithfully by the tokenizer):
[transimission, control, protocol, is, a, transport, layer, protocol]
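The default pipeline can be approximated with a short Python sketch (the function name is hypothetical, and the `\w+` regex is a simplification of the Unicode Text Segmentation algorithm the real standard tokenizer implements):

```python
import re

def standard_analyze(text):
    # Rough approximation of the standard analyzer:
    # split on word boundaries, then lowercase each token.
    # The real tokenizer follows UAX #29 text segmentation;
    # \w+ is a simplified stand-in.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(standard_analyze("transimission control protocol is a transport layer protocol"))
# ['transimission', 'control', 'protocol', 'is', 'a', 'transport', 'layer', 'protocol']
```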
No. | Parameter | Description |
---|---|---|
1 | max_token_length | Maximum length allowed for a single token. A token longer than this is split at max_token_length intervals, with the remainder emitted as additional tokens. Defaults to 255. |
2 | stopwords | A predefined stopword set such as _english_, or an array of stopwords. Defaults to _none_. |
3 | stopwords_path | Path to a file containing stopwords. |
The following example configures the max_token_length and stopwords parameters:
//Define a standard-type analyzer with custom parameters
PUT standard_analyzer_token_length_conf_index
{
"settings": {
"analysis": {
"analyzer": {
"english_analyzer":{
"type":"standard",
"max_token_length":5,
"stopwords":"_english_"
}
}
}
}
}
//Test the configured analyzer
POST standard_analyzer_token_length_conf_index/_analyze
{
"analyzer": "english_analyzer",
"text": "transimission control protocol is a transport layer protocol"
}
//Response
{
"tokens" : [
{
"token" : "trans",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "imiss",
"start_offset" : 5,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "ion",
"start_offset" : 10,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "contr",
"start_offset" : 14,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "ol",
"start_offset" : 19,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "proto",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "col",
"start_offset" : 27,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "trans",
"start_offset" : 36,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "port",
"start_offset" : 41,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "layer",
"start_offset" : 46,
"end_offset" : 51,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "proto",
"start_offset" : 52,
"end_offset" : 57,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "col",
"start_offset" : 57,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 13
}
]
}
The sentence above is tokenized into the following terms:
[trans, imiss, ion, contr, ol, proto, col, trans, port, layer, proto, col]
Note that the tokens "is" and "a" have been removed by the _english_ stopword list, which is why the position values in the response jump from 6 to 9.
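How the two parameters interact can be sketched in Python (hypothetical helper; the regex is a simplification of the real tokenizer, and chunking is applied in the tokenizer before the stop filter runs, which matches the position gaps in the response above):

```python
import re

def analyze(text, max_token_length=5, stopwords=("is", "a")):
    # Tokenize and lowercase (simplified stand-in for the standard tokenizer).
    words = [t.lower() for t in re.findall(r"\w+", text)]
    # The tokenizer splits any token longer than max_token_length
    # into consecutive chunks of at most max_token_length characters.
    chunks = [w[i:i + max_token_length]
              for w in words
              for i in range(0, len(w), max_token_length)]
    # The stop token filter then removes stopwords
    # (the _english_ set includes "is" and "a"; only those two
    # are modeled here).
    return [c for c in chunks if c not in stopwords]

print(analyze("transimission control protocol is a transport layer protocol"))
# ['trans', 'imiss', 'ion', 'contr', 'ol', 'proto', 'col', 'trans', 'port', 'layer', 'proto', 'col']
```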
No. | Component | Description |
---|---|---|
1 | Tokenizer | standard tokenizer |
2 | Token Filters | lowercase token filter, stop token filter (disabled by default) |
To define a custom analyzer that behaves like standard, simply set the configurable parameters in your own definition; everything else can copy standard's components directly, as in the following example:
//Rebuild the standard analyzer as a custom analyzer
PUT custom_rebuild_standard_analyzer_index
{
"settings": {
"analysis": {
"analyzer": {
"rebuild_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":["lowercase"]
}
}
}
}
}
//Request
POST custom_rebuild_standard_analyzer_index/_analyze
{
"analyzer": "rebuild_analyzer",
"text": "transimission control protocol is a transport layer protocol"
}
//Response
{
"tokens" : [
{
"token" : "transimission",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "control",
"start_offset" : 14,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "protocol",
"start_offset" : 22,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "is",
"start_offset" : 31,
"end_offset" : 33,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "a",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "transport",
"start_offset" : 36,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "layer",
"start_offset" : 46,
"end_offset" : 51,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "protocol",
"start_offset" : 52,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 7
}
]
}
If a rebuilt analyzer is meant to use the built-in standard analyzer's configuration parameters, its type must be standard; with type custom those parameters have no effect. Example:
//Custom analyzer definition
PUT custom_rebuild_standard_analyzer_index_1
{
"settings": {
"analysis": {
"analyzer": {
"rebuild_analyzer":{
//If type here were standard, max_token_length would take effect; with type custom it does not
"type":"custom",
"tokenizer":"standard",
"max_token_length":8,
"filter":["lowercase"]
}
}
}
}
}
//Verify
POST custom_rebuild_standard_analyzer_index_1/_analyze
{
"analyzer": "rebuild_analyzer",
"text": "transimission control protocol is a transport layer protocol"
}
All of the examples above can be verified on your own.
The simple analyzer splits text on any non-letter character and lowercases every resulting token.
//Test the default tokenization of the simple analyzer
//Request
POST _analyze
{
"analyzer": "simple",
"text": "Transimission Control Protocol is a transport layer protocol"
}
//Response
{
"tokens" : [
{
"token" : "transimission",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 0
},
{
"token" : "control",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 1
},
{
"token" : "protocol",
"start_offset" : 22,
"end_offset" : 30,
"type" : "word",
"position" : 2
},
{
"token" : "is",
"start_offset" : 31,
"end_offset" : 33,
"type" : "word",
"position" : 3
},
{
"token" : "a",
"start_offset" : 34,
"end_offset" : 35,
"type" : "word",
"position" : 4
},
{
"token" : "transport",
"start_offset" : 36,
"end_offset" : 45,
"type" : "word",
"position" : 5
},
{
"token" : "layer",
"start_offset" : 46,
"end_offset" : 51,
"type" : "word",
"position" : 6
},
{
"token" : "protocol",
"start_offset" : 52,
"end_offset" : 60,
"type" : "word",
"position" : 7
}
]
}
The sentence above is tokenized into the following terms; note that the capitalized words in the input have been lowercased:
[transimission, control, protocol, is, a, transport, layer, protocol]
No. | Component | Description |
---|---|---|
1 | Tokenizer | lowercase tokenizer |
To define a custom analyzer that behaves like simple, specify type as custom in the analyzer definition; everything else can copy simple's configuration directly, as in the following example:
//Rebuild the simple analyzer as a custom analyzer
PUT custom_rebuild_simple_analyzer_index
{
"settings": {
"analysis": {
"analyzer": {
"rebuild_simple":{
"type":"custom",
"tokenizer":"lowercase",
"filter":[]
}
}
}
}
}
//Request
POST custom_rebuild_simple_analyzer_index/_analyze
{
"analyzer": "rebuild_simple",
"text": "transimission control protocol is a transport layer protocol"
}
//Response
{
"tokens" : [
{
"token" : "transimission",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 0
},
{
"token" : "control",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 1
},
{
"token" : "protocol",
"start_offset" : 22,
"end_offset" : 30,
"type" : "word",
"position" : 2
},
{
"token" : "is",
"start_offset" : 31,
"end_offset" : 33,
"type" : "word",
"position" : 3
},
{
"token" : "a",
"start_offset" : 34,
"end_offset" : 35,
"type" : "word",
"position" : 4
},
{
"token" : "transport",
"start_offset" : 36,
"end_offset" : 45,
"type" : "word",
"position" : 5
},
{
"token" : "layer",
"start_offset" : 46,
"end_offset" : 51,
"type" : "word",
"position" : 6
},
{
"token" : "protocol",
"start_offset" : 52,
"end_offset" : 60,
"type" : "word",
"position" : 7
}
]
}
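The simple analyzer's split-on-any-non-letter behavior can also be sketched in Python (hypothetical function name; the regex is ASCII-only, while the real lowercase tokenizer recognizes all Unicode letters). Unlike the standard analyzer, digits and punctuation both act as separators:

```python
import re

def simple_analyze(text):
    # Approximation of the simple analyzer: split on any
    # non-letter character and lowercase the resulting tokens.
    # Digits are separators too, so "v4" yields just "v".
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(simple_analyze("TCP/IP-v4 is a Transport Layer Protocol"))
# ['tcp', 'ip', 'v', 'is', 'a', 'transport', 'layer', 'protocol']
```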