Question:

Elasticsearch: querying nested objects

訾稳

2023-03-14

Dear Elasticsearch experts,
I am having trouble querying nested objects. Let's use the following simplified mapping:

{
  "mappings" : {
    "_doc" : {
      "properties" : {
        "companies" : {
          "type": "nested",
          "properties" : {
            "company_id": { "type": "long" },
            "name": { "type": "text" }
          }
        },
        "title": { "type": "text" }
      }
    }
  }
}

And put some documents into the index:

PUT my_index/_doc/1
{
  "title" : "CPU release",
  "companies" : [
    { "company_id" : 1, "name" :  "AMD" },
    { "company_id" : 2, "name" :  "Intel" }
  ]
}

PUT my_index/_doc/2
{
  "title" : "GPU release 2018-01-10",
  "companies" : [
    { "company_id" : 1, "name" :  "AMD" },
    { "company_id" : 3, "name" :  "Nvidia" }
  ]
}

PUT my_index/_doc/3
{
  "title" : "GPU release 2018-03-01",
  "companies" : [
    { "company_id" : 3, "name" :  "Nvidia" }
  ]
}

PUT my_index/_doc/4
{
  "title" : "Chipset release",
  "companies" : [
    { "company_id" : 2, "name" :  "Intel" }
  ]
}

Now I would like to run a query like this:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "GPU" } },
        { "nested": {
            "path": "companies",
            "query": {
              "bool": {
                "must": [
                  { "match": { "companies.name": "AMD" } }
                ]
              }
            },
            "inner_hits" : {}
          }
        }
      ]
    }
  }
}

As a result, I want to get the matched companies together with the number of matching documents. So the query above should give me:

[
  { "company_id" : 1, "name" : "AMD", "matched_documents": 1 }
]

The following query:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "GPU" } },
        { "nested": {
            "path": "companies",
            "query": { "match_all": {} },
            "inner_hits" : {}
          }
        }
      ]
    }
  }
}

should give me all companies assigned to a document whose title contains "GPU", together with the number of matching documents:

[
  { "company_id" : 1, "name" : "AMD", "matched_documents": 1 },
  { "company_id" : 3, "name" : "Nvidia", "matched_documents": 2 }
]

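Is it possible to get this result with good performance? I am explicitly not interested in the matching documents themselves, only in the nested objects and the number of matching documents.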

Thanks for your help.

沈开畅

2023-03-14

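In Elasticsearch terms, what you need to do is:

filter the "parent" documents by the desired criteria (for example, GPU in the title, or Nvidia mentioned in the companies list);
group the "nested" documents into buckets by some criterion (for example, company_id);
count how many "nested" documents each bucket contains.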

Each nested object in the array is indexed as a separate hidden document, which complicates life a bit. Let's see how to aggregate on them.
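To make the three steps concrete, here is a minimal Python sketch that imitates them in memory on the four documents from the question. It is only a model of the behavior, not how Elasticsearch works internally. The helper names filter_parents and count_nested_by_company are made up for illustration, and the match query on title is approximated with a simple case-insensitive token check.

```python
from collections import Counter

# The four documents from the question, keyed by _id.
DOCS = {
    1: {"title": "CPU release",
        "companies": [{"company_id": 1, "name": "AMD"},
                      {"company_id": 2, "name": "Intel"}]},
    2: {"title": "GPU release 2018-01-10",
        "companies": [{"company_id": 1, "name": "AMD"},
                      {"company_id": 3, "name": "Nvidia"}]},
    3: {"title": "GPU release 2018-03-01",
        "companies": [{"company_id": 3, "name": "Nvidia"}]},
    4: {"title": "Chipset release",
        "companies": [{"company_id": 2, "name": "Intel"}]},
}


def filter_parents(docs, word):
    """Step 1: keep parents whose title contains `word` as a token
    (a rough stand-in for the `match` query, assumed for this sketch)."""
    return {doc_id: doc for doc_id, doc in docs.items()
            if word.lower() in doc["title"].lower().split()}


def count_nested_by_company(parents):
    """Steps 2 and 3: treat every array entry as its own hidden document,
    bucket the entries by company_id and count the entries per bucket."""
    counts = Counter()
    for doc in parents.values():
        for company in doc["companies"]:
            counts[company["company_id"]] += 1
    return dict(counts)


gpu_parents = filter_parents(DOCS, "GPU")
nested_counts = count_nested_by_company(gpu_parents)
print(sorted(gpu_parents))  # ids of the matching parent documents
print(nested_counts)        # company_id -> number of nested documents
```

The per-company numbers computed here count nested entries, not parent documents, which is the distinction the rest of this answer deals with.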

You can achieve this with a combination of the nested, terms and top_hits aggregations:

POST my_index/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "GPU"
          }
        },
        {
          "nested": {
            "path": "companies",
            "query": {
              "match_all": {}
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "Extract nested": {
      "nested": {
        "path": "companies"
      },
      "aggs": {
        "By company id": {
          "terms": {
            "field": "companies.company_id"
          },
          "aggs": {
            "Examples of such company_id": {
              "top_hits": {
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}

This gives the following output:

{
  ...
  "hits": { ... },
  "aggregations": {
    "Extract nested": {
      "doc_count": 3, <== how many "nested" documents there were in total
      "By company id": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": 3,  <== this bucket's key: "company_id": 3
            "doc_count": 2, <== how many "nested" documents there were with such company_id?
            "Examples of such company_id": {
              "hits": {
                "total": 2,
                "max_score": 1.5897496,
                "hits": [  <== an example, "top hit" for such company_id
                  {
                    "_nested": {
                      "field": "companies",
                      "offset": 1
                    },
                    "_score": 1.5897496,
                    "_source": {
                      "company_id": 3,
                      "name": "Nvidia"
                    }
                  }
                ]
              }
            }
          },
          {
            "key": 1,
            "doc_count": 1,
            "Examples of such company_id": {
              "hits": {
                "total": 1,
                "max_score": 1.5897496,
                "hits": [
                  {
                    "_nested": {
                      "field": "companies",
                      "offset": 0
                    },
                    "_score": 1.5897496,
                    "_source": {
                      "company_id": 1,
                      "name": "AMD"
                    }
                  }
                ]
              }
            }
          }
        ]
      }
    }
  }
}

Note that for Nvidia we have "doc_count": 2.

But what if we want to count how many "parent" documents have Nvidia versus Intel?

This can be achieved with the reverse_nested aggregation.

We need to modify our query slightly:

POST my_index/_doc/_search
{
  "query": { ... },
  "aggs": {
    "Extract nested": {
      "nested": {
        "path": "companies"
      },
      "aggs": {
        "By company id": {
          "terms": {
            "field": "companies.company_id"
          },
          "aggs": {
            "Examples of such company_id": {
              "top_hits": {
                "size": 1
              }
            },
            "original doc count": { <== we ask ES to count how many parent docs there are
              "reverse_nested": {}
            }
          }
        }
      }
    }
  }
}

The result will look like this:

{
  ...
  "hits": { ... },
  "aggregations": {
    "Extract nested": {
      "doc_count": 3,
      "By company id": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": 3,
            "doc_count": 2,
            "original doc count": {
              "doc_count": 2  <== how many "parent" documents have such company_id
            },
            "Examples of such company_id": {
              "hits": {
                "total": 2,
                "max_score": 1.5897496,
                "hits": [
                  {
                    "_nested": {
                      "field": "companies",
                      "offset": 1
                    },
                    "_score": 1.5897496,
                    "_source": {
                      "company_id": 3,
                      "name": "Nvidia"
                    }
                  }
                ]
              }
            }
          },
          {
            "key": 1,
            "doc_count": 1,
            "original doc count": {
              "doc_count": 1
            },
            "Examples of such company_id": {
              "hits": {
                "total": 1,
                "max_score": 1.5897496,
                "hits": [
                  {
                    "_nested": {
                      "field": "companies",
                      "offset": 0
                    },
                    "_score": 1.5897496,
                    "_source": {
                      "company_id": 1,
                      "name": "AMD"
                    }
                  }
                ]
              }
            }
          }
        ]
      }
    }
  }
}
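If the client needs the flat list asked for in the question, the response can be reshaped on the application side. The following Python sketch assumes the aggregation names used in this answer ("Extract nested", "By company id", "original doc count", "Examples of such company_id"); summarize_companies is a hypothetical helper, and sample_response is a trimmed copy of the output above that keeps only the fields the helper reads.

```python
def summarize_companies(response):
    """Turn the aggregation part of a search response into a flat list
    of {company_id, name, matched_documents} entries."""
    buckets = response["aggregations"]["Extract nested"]["By company id"]["buckets"]
    summary = []
    for bucket in buckets:
        # The single top hit is used only to recover the company name.
        example = bucket["Examples of such company_id"]["hits"]["hits"][0]["_source"]
        summary.append({
            "company_id": bucket["key"],
            "name": example["name"],
            # reverse_nested count: number of parent documents, not nested entries.
            "matched_documents": bucket["original doc count"]["doc_count"],
        })
    return summary


# Trimmed copy of the response shown above.
sample_response = {
    "aggregations": {
        "Extract nested": {
            "doc_count": 3,
            "By company id": {
                "buckets": [
                    {"key": 3, "doc_count": 2,
                     "original doc count": {"doc_count": 2},
                     "Examples of such company_id": {"hits": {"hits": [
                         {"_source": {"company_id": 3, "name": "Nvidia"}}]}}},
                    {"key": 1, "doc_count": 1,
                     "original doc count": {"doc_count": 1},
                     "Examples of such company_id": {"hits": {"hits": [
                         {"_source": {"company_id": 1, "name": "AMD"}}]}}},
                ]
            }
        }
    }
}

print(summarize_companies(sample_response))
```

Since only the aggregations are read, the search request itself could probably set "size": 0 to skip returning the matching documents, which the question says it does not need.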

To make the difference more obvious, let's change the data a bit and add another Nvidia item to the companies list of a document:

PUT my_index/_doc/2
{
  "title" : "GPU release 2018-01-10",
  "companies" : [
    { "company_id" : 1, "name" :  "AMD" },
    { "company_id" : 3, "name" :  "Nvidia" },
    { "company_id" : 3, "name" :  "Nvidia" }
  ]
}

The last query (the one with reverse_nested) will give the following:

  "By company id": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": 3,
        "doc_count": 3,    <== 3 "nested" documents with Nvidia
        "original doc count": {
          "doc_count": 2   <== but only 2 "parent" documents
        },
        "Examples of such company_id": {
          "hits": {
            "total": 3,
            "max_score": 1.5897496,
            "hits": [
              {
                "_nested": {
                  "field": "companies",
                  "offset": 2
                },
                "_score": 1.5897496,
                "_source": {
                  "company_id": 3,
                  "name": "Nvidia"
                }
              }
            ]
          }
        }
      },

As you can see, this is a subtle difference that is hard to grasp, yet it changes the semantics completely.
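The two counts can be reproduced with a few lines of Python. This is only an in-memory model of the semantics, under the assumption that the terms bucket counts nested entries while reverse_nested counts distinct parent documents. The data is the pair of GPU documents after document 2 received its second Nvidia entry.

```python
from collections import defaultdict

# company_id values of the two documents whose title matches "GPU",
# after document 2 got a second Nvidia entry.
gpu_docs = {
    2: [1, 3, 3],
    3: [3],
}

nested_count = defaultdict(int)  # models the terms bucket's doc_count
parent_ids = defaultdict(set)    # models reverse_nested: distinct parents per bucket

for doc_id, company_ids in gpu_docs.items():
    for company_id in company_ids:
        nested_count[company_id] += 1
        parent_ids[company_id].add(doc_id)

parent_count = {cid: len(ids) for cid, ids in parent_ids.items()}

print(dict(nested_count))  # nested entries per company_id
print(parent_count)        # parent documents per company_id
```

For company_id 3 the first mapping holds 3 and the second holds 2, matching the "doc_count" and "original doc count" values in the output above.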

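While in most cases the performance of nested queries and aggregations should be sufficient, it does come at a cost. It is therefore recommended to avoid nested or parent-child types when tuning for search speed.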

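In Elasticsearch, the best performance is usually achieved through denormalization, although there is no single recipe and you should choose the data model according to your needs.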
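As one possible illustration of denormalization for this data (an assumption for the sake of example, not a layout this answer prescribes), each parent could be indexed once per distinct company as a flat document. The helper denormalize and the field names parent_id and company_name are hypothetical.

```python
def denormalize(doc_id, doc):
    """Yield one flat document per distinct company of a parent document."""
    seen = set()
    for company in doc["companies"]:
        if company["company_id"] in seen:
            continue  # skip repeated entries so a parent counts once per company
        seen.add(company["company_id"])
        yield {
            "parent_id": doc_id,
            "title": doc["title"],
            "company_id": company["company_id"],
            "company_name": company["name"],
        }


# Document 2 from above, including the duplicated Nvidia entry.
parent = {
    "title": "GPU release 2018-01-10",
    "companies": [
        {"company_id": 1, "name": "AMD"},
        {"company_id": 3, "name": "Nvidia"},
        {"company_id": 3, "name": "Nvidia"},
    ],
}

flat_docs = list(denormalize(2, parent))
for flat in flat_docs:
    print(flat)
```

With such flat documents a plain terms aggregation on company_id would count parents directly, at the price of repeating the title in every copy and of keeping the copies in sync when a parent changes.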

Hope this clarifies the nested thing for you!
