Elasticsearch copy_to field behaves unexpectedly in aggregations

马朝斑

2023-03-14

Question:

I have an index mapping with two string fields, field1 and field2, both declared with copy_to pointing at another field named all_fields.
all_fields is indexed as "not_analyzed".

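When I run a bucket aggregation on all_fields, I expect distinct buckets whose keys are the values of field1 and field2 concatenated. Instead, I get separate buckets keyed by the unconcatenated values of field1 and field2.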

Example mapping:

  {
    "mappings": {
      "myobject": {
        "properties": {
          "field1": {
            "type": "string",
            "index": "analyzed",
            "copy_to": "all_fields"
          },
          "field2": {
            "type": "string",
            "index": "analyzed",
            "copy_to": "all_fields"
          },
          "all_fields": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }

Documents:

  {
    "field1": "dinner carrot potato broccoli",
    "field2": "something here"
  }

and

  {
    "field1": "fish chicken something",
    "field2": "dinner"
  }

Aggregation:

{
  "aggs": {
    "t": {
      "terms": {
        "field": "all_fields"
      }
    }
  }
}

Result:

...
"aggregations": {
    "t": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {
                "key": "dinner",
                "doc_count": 1
            },
            {
                "key": "dinner carrot potato broccoli",
                "doc_count": 1
            },
            {
                "key": "fish chicken something",
                "doc_count": 1
            },
            {
                "key": "something here",
                "doc_count": 1
            }
        ]
    }
}

I was expecting only 2 buckets: "fish chicken somethingdinner" and "dinner carrot potato broccolisomething here".

What am I doing wrong?

Answer:

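What you are looking for is a concatenation of the two strings. copy_to does not do that, even though it may look as if it does. Conceptually, copy_to builds a set of values from field1 and field2; it does not join them together.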
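The difference can be sketched outside Elasticsearch. The Python below is only a conceptual model, not Elasticsearch code: `copy_to_values` and `terms_buckets` are hypothetical helpers that treat all_fields as a multi-valued, not_analyzed field and count documents per distinct value, the way a terms aggregation does.

```python
from collections import Counter

# The two sample documents from the question.
docs = [
    {"field1": "dinner carrot potato broccoli", "field2": "something here"},
    {"field1": "fish chicken something", "field2": "dinner"},
]

def copy_to_values(doc):
    # Model of copy_to: each source value arrives in all_fields as its own
    # entry, so the target field is multi-valued. not_analyzed keeps every
    # entry as one whole term.
    return [doc["field1"], doc["field2"]]

def terms_buckets(values_per_doc):
    # Model of a terms aggregation: for each distinct value, count the
    # documents that contain it.
    counts = Counter()
    for values in values_per_doc:
        counts.update(set(values))
    return dict(counts)

# What copy_to produces: four single-document buckets.
copy_to_buckets = terms_buckets(copy_to_values(d) for d in docs)

# What the question expected: one concatenated value per document.
concat_buckets = terms_buckets([d["field1"] + d["field2"]] for d in docs)

print(sorted(copy_to_buckets))
print(sorted(concat_buckets))
```

Under this model the copy_to case yields the same four keys as the result in the question, while the concatenated case yields one key per document.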

For your use case, you have two options:

Use a _source transform
Perform a script aggregation

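I would recommend the _source transform, because I think it is more efficient than scripting: you pay a small price once at indexing time instead of running a heavy script aggregation on every query.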

For the _source transform:

PUT /lastseen
{
  "mappings": {
    "test": {
      "transform": {
        "script": "ctx._source['all_fields'] = ctx._source['field1'] + ' ' + ctx._source['field2']"
      }, 
      "properties": {
        "field1": {
          "type": "string"
        },
        "field2": {
          "type": "string"
        },
        "lastseen": {
          "type": "long"
        },
        "all_fields": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

And the query:

GET /lastseen/test/_search
{
  "aggs": {
    "NAME": {
      "terms": {
        "field": "all_fields",
        "size": 10
      }
    }
  }
}
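To see which keys this aggregation would return once all_fields holds the concatenation, the transform can be mimicked in plain Python. This is a sketch under assumptions: `add_all_fields` is a hypothetical helper that mirrors the transform script (field1, a space, field2), and the same function could be applied client-side before indexing if a mapping transform is not available in your Elasticsearch version.

```python
from collections import Counter

def add_all_fields(source):
    # Mirrors the transform script:
    #   ctx._source['all_fields'] = field1 + ' ' + field2
    # Works on a copy so the caller's document is left untouched.
    enriched = dict(source)
    enriched["all_fields"] = source["field1"] + " " + source["field2"]
    return enriched

# The two sample documents from the question.
docs = [
    {"field1": "dinner carrot potato broccoli", "field2": "something here"},
    {"field1": "fish chicken something", "field2": "dinner"},
]

indexed = [add_all_fields(d) for d in docs]

# A terms aggregation on a single-valued not_analyzed field amounts to
# counting documents per exact value.
buckets = Counter(d["all_fields"] for d in indexed)

for key, doc_count in sorted(buckets.items()):
    print(key, doc_count)
```

Note that the script joins the values with a space, so the bucket keys contain that separator, unlike the keys written in the question.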

For the script aggregation, to keep execution cheap (that is, to use doc['field'].value rather than the more expensive _source.field), add a .raw sub-field to both field1 and field2:

PUT /lastseen
{
  "mappings": {
    "test": { 
      "properties": {
        "field1": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "field2": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "lastseen": {
          "type": "long"
        }
      }
    }
  }
}

The script then uses these .raw sub-fields:

{
  "aggs": {
    "NAME": {
      "terms": {
        "script": "doc['field1.raw'].value + ' ' + doc['field2.raw'].value", 
        "size": 10,
        "lang": "groovy"
      }
    }
  }
}

Without the .raw sub-fields (which are deliberately created as not_analyzed), you would have to do the following instead, which is more expensive:

{
  "aggs": {
    "NAME": {
      "terms": {
        "script": "_source.field1 + ' ' + _source.field2", 
        "size": 10,
        "lang": "groovy"
      }
    }
  }
}
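The two script variants differ only in how each field is read, so the request body can be generated. The sketch below is illustrative: `build_concat_terms_agg` is a hypothetical helper, not part of any Elasticsearch client, and it only builds the JSON bodies shown above. Sending the request is left out.

```python
import json

def build_concat_terms_agg(fields, use_raw=True, size=10):
    # use_raw=True reads the not_analyzed .raw sub-fields via doc[...].value;
    # use_raw=False falls back to the slower _source lookup.
    if use_raw:
        parts = ["doc['%s.raw'].value" % f for f in fields]
    else:
        parts = ["_source.%s" % f for f in fields]
    # Join the field accessors with a literal space, as in the answer's script.
    script = " + ' ' + ".join(parts)
    return {
        "aggs": {
            "NAME": {
                "terms": {"script": script, "size": size, "lang": "groovy"}
            }
        }
    }

raw_body = build_concat_terms_agg(["field1", "field2"])
source_body = build_concat_terms_agg(["field1", "field2"], use_raw=False)

print(json.dumps(raw_body, indent=2))
print(json.dumps(source_body, indent=2))
```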
