Field not sorted alphabetically in Elasticsearch

阳宾实

2023-03-14

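Question:

I have some documents with a name field. I use the analyzed version of the name field for searching and a not_analyzed version for sorting. Sorting works at one level: names are ordered alphabetically by their first letter. Within each letter, however, the names are sorted lexicographically rather than alphabetically. Here is the mapping I use: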

{
  "mappings": {
    "seing": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

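Can anyone provide a solution for this?

Answer:

Digging into the Elasticsearch documentation, I stumbled upon this:

Sorting and Collations

Case-Insensitive Sorting

Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we apply the technique described in String Sorting and Multifields, which uses a not_analyzed field for sorting: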

PUT /my_index
{
  "mappings": {
    "user": {
      "properties": {
        "name": {                    //1
          "type": "string",
          "fields": {
            "raw": {                 //2
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

(1) The analyzed name field is used for searching.
(2) The not_analyzed name.raw field is used for sorting.

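A search sorted on name.raw returns the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order, as opposed to alphabetical order. Essentially, the bytes used to represent uppercase letters have lower values than the bytes used to represent lowercase letters, so the names are sorted with the lowest bytes first.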
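The byte-order behaviour described above can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code: the helper names byte_order and case_insensitive_order are made up for illustration, and sorting by UTF-8 bytes is only an approximation of how terms in a not_analyzed field compare.

```python
# The three names from the example above.
names = ["Boffey", "BROWN", "bailey"]

def byte_order(values):
    """Sort by raw UTF-8 bytes (assumed approximation of not_analyzed term order)."""
    return sorted(values, key=lambda v: v.encode("utf-8"))

def case_insensitive_order(values):
    """Sort by the lowercased value, so letter case no longer affects the order."""
    return sorted(values, key=str.lower)

# Uppercase ASCII letters have smaller code points than lowercase ones.
print(ord("B"), ord("R"), ord("b"), ord("o"))   # 66 82 98 111

print(byte_order(names))               # ['BROWN', 'Boffey', 'bailey']
print(case_insensitive_order(names))   # ['bailey', 'Boffey', 'BROWN']
```

BROWN sorts before Boffey because the two names share the first byte (B) and differ at the second, where R (82) is lower than o (111).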

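That may make sense to a computer, but it makes little sense to human beings, who would reasonably expect these names to be sorted alphabetically regardless of case. To achieve this, we need to index each name in a way that makes the byte ordering correspond to the sort order we want.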

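In other words, we need an analyzer that emits a single lowercase token.

Following this logic, instead of indexing the raw value in the sort field, you need to lowercase it using a custom analyzer built on the keyword tokenizer: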

PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "case_insensitive_sort" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "seing" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "fields" : {
            "raw" : {
              "type" : "string",
              "analyzer" : "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Now, sorting on name.raw should order the results alphabetically rather than lexicographically.
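To make the effect of the case_insensitive_sort analyzer more concrete, here is a rough Python emulation of its two stages. This is only a sketch under stated assumptions: the function names are hypothetical stand-ins, the value "Mary BROWN" is an invented example, and Python's str.lower may differ from the lowercase token filter for some non-ASCII characters.

```python
def keyword_tokenizer(text):
    """Emit the whole input as a single token, without splitting on whitespace."""
    return [text]

def lowercase_filter(tokens):
    """Lowercase every token it receives."""
    return [token.lower() for token in tokens]

def case_insensitive_sort(text):
    """Hypothetical stand-in for the case_insensitive_sort analyzer."""
    return lowercase_filter(keyword_tokenizer(text))

print(case_insensitive_sort("BROWN"))        # ['brown']
print(case_insensitive_sort("Mary BROWN"))   # ['mary brown']
```

Because each value yields exactly one lowercase token, the sort compares whole names in a single case, which is why the byte order then matches the alphabetical order.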

Here is a quick test I ran on my local machine using Marvel:

Index structure:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "keyword": {
              "type": "string",
              "analyzer": "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Test data:

PUT /my_index/user/1
{
  "name": "Tim"
}

PUT /my_index/user/2
{
  "name": "TOM"
}

Query sorted on the raw field:

POST /my_index/user/_search
{
  "sort": "name.raw"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "TOM"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "Tim"
  ]
}

Query sorted on the lowercased field:

POST /my_index/user/_search
{
  "sort": "name.keyword"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "tim"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "tom"
  ]
}

I suspect that the second result is the correct one in your case.
