Elasticsearch: Influence scoring with custom score field in document, pt.2

游高杰

2023-03-14

Question:

Given these documents:

{
  "created_at" : "2017-07-31T20:30:14-04:00",
  "description" : null,
  "height" : 3213,
  "id" : "1",
  "tags" : [
    {
      "confidence" : 65.48948436785749,
      "tag" : "beach"
    },
    {
      "confidence" : 57.31950504425406,
      "tag" : "sea"
    },
    {
      "confidence" : 43.58207236617374,
      "tag" : "coast"
    },
    {
      "confidence" : 35.6857910950816,
      "tag" : "sand"
    },
    {
      "confidence" : 33.660057321079655,
      "tag" : "landscape"
    },
    {
      "confidence" : 32.53252312423727,
      "tag" : "sky"
    }
  ],
  "width" : 5712,
  "color" : "#0C0A07",
  "boost_multiplier" : 1
}

and

{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "2",
  "tags" : [
    {
      "confidence" : 84.09123410403951,
      "tag" : "mountain"
    },
    {
      "confidence" : 56.412795342449456,
      "tag" : "valley"
    },
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    {
      "confidence" : 40.51100450186575,
      "tag" : "mountains"
    },
    {
      "confidence" : 33.14263528292239,
      "tag" : "sky"
    },
    {
      "confidence" : 31.064394646169404,
      "tag" : "peak"
    },
    {
      "confidence" : 29.372,
      "tag" : "natural elevation"
    }
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 1
}

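I want a _score computed from the confidence value of each tag. For example, a search for "mountain" should obviously return only the document with id 2, since it is the only one carrying that tag. For "landscape", document 2 should score higher than document 1, because the confidence of landscape is higher in 2 than in 1 (48.36 vs 33.66). For "coast landscape", document 1 should this time score higher than document 2, because doc 1 has both coast and landscape in its tags array. I also want to multiply the score by "boost_multiplier" so that certain documents can be boosted.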
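To make the intended ranking concrete, here is a minimal Python sketch of the behavior described above. It is not how Elasticsearch computes scores: it simply assumes the desired score is the sum of the confidences of the matching tags, multiplied by boost_multiplier. The names `DOCS`, `desired_score`, and `rank` are hypothetical, and tags are matched as whole single words.

```python
# Hypothetical model of the desired ranking (not Elasticsearch's own scoring):
# sum the confidence of every tag that matches a search term, then multiply
# by the document's boost_multiplier.
DOCS = {
    "1": {"boost_multiplier": 1,
          "tags": {"beach": 65.48948436785749, "sea": 57.31950504425406,
                   "coast": 43.58207236617374, "sand": 35.6857910950816,
                   "landscape": 33.660057321079655, "sky": 32.53252312423727}},
    "2": {"boost_multiplier": 1,
          "tags": {"mountain": 84.09123410403951, "valley": 56.412795342449456,
                   "landscape": 48.36547551196872, "mountains": 40.51100450186575,
                   "sky": 33.14263528292239, "peak": 31.064394646169404,
                   "natural elevation": 29.372}},
}


def desired_score(doc, terms):
    """Sum the confidences of matching tags and apply the boost."""
    matched = sum(conf for tag, conf in doc["tags"].items() if tag in terms)
    return matched * doc["boost_multiplier"]


def rank(query):
    """Return (id, score) pairs for matching documents, best score first."""
    terms = set(query.split())
    scored = {doc_id: desired_score(doc, terms) for doc_id, doc in DOCS.items()}
    # Keep only documents with at least one matching tag.
    return sorted(((i, s) for i, s in scored.items() if s > 0),
                  key=lambda pair: pair[1], reverse=True)


for query in ("mountain", "landscape", "coast landscape"):
    print(query, [doc_id for doc_id, _ in rank(query)])
```

Under this model "coast landscape" favors document 1 because two of its tags contribute (43.58 + 33.66), while document 2 contributes only its landscape confidence (48.37).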

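I found this question: Elasticsearch: Influence scoring with custom score field in document

However, when I try the accepted solution (I enabled scripting on my ES server), it returns both documents with a _score of 1.0 regardless of the search term. Here is the query I tried: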

{
  "query": {
    "nested": {
      "path": "tags",
      "score_mode": "sum",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "tags.tag": "coast landscape"
            }
          },
          "script_score": {
            "script": "doc[\"confidence\"].value"
          }
        }
      }
    }
  }
}

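I also tried what @yahermann suggested in the comments, replacing "script_score" with "field_value_factor": { "field": "confidence" }, and the result is still the same. Any idea why it fails, or is there a better way?

For completeness, here is the mapping definition I used: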

{
  "mappings": {
    "photo": {
      "properties": {
        "created_at": {
          "type": "date"
        },
        "description": {
          "type": "text"
        },
        "height": {
          "type": "short"
        },
        "id": {
          "type": "keyword"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "tag": { "type": "string" },
            "confidence": { "type": "float"}
          }
        },
        "width": {
          "type": "short"
        },
        "color": {
          "type": "string"
        },
        "boost_multiplier": {
          "type": "float"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}

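Update
After @Joanna's answer below I tried the query, but in practice, no matter what I put in the match query (coast, foo, bar), it always returns both documents with a _score of 1.0. I tried it on Elasticsearch 2.4.6, 5.3, and 5.5.1 in Docker. This is the response I get: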

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635

{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "2",
  "tags" : [
    {
      "confidence" : 84.09123410403951,
      "tag" : "mountain"
    },
    {
      "confidence" : 56.412795342449456,
      "tag" : "valley"
    },
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    {
      "confidence" : 40.51100450186575,
      "tag" : "mountains"
    },
    {
      "confidence" : 33.14263528292239,
      "tag" : "sky"
    },
    {
      "confidence" : 31.064394646169404,
      "tag" : "peak"
    },
    {
      "confidence" : 29.372,
      "tag" : "natural elevation"
    }
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
  "created_at" : "2017-07-31T20:30:14-04:00",
  "description" : null,
  "height" : 3213,
  "id" : "1",
  "tags" : [
    {
      "confidence" : 65.48948436785749,
      "tag" : "beach"
    },
    {
      "confidence" : 57.31950504425406,
      "tag" : "sea"
    },
    {
      "confidence" : 43.58207236617374,
      "tag" : "coast"
    },
    {
      "confidence" : 35.6857910950816,
      "tag" : "sand"
    },
    {
      "confidence" : 33.660057321079655,
      "tag" : "landscape"
    },
    {
      "confidence" : 32.53252312423727,
      "tag" : "sky"
    }
  ],
  "width" : 5712,
  "color" : "#0C0A07",
  "boost_multiplier" : 1
}
}]}}

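UPDATE-2 I found this on SO: Elasticsearch: "function_score" with "boost_mode":"replace" ignores function score

What it basically says is that if the function does not match, it returns 1. That makes sense, but I am running the query against the same documents. Confusing.

Final update
I finally found the problem, and it was a silly mistake on my part. ES101: if you send a GET request to the search API, it returns all documents with a score of 1.0 :) You should send a POST request... Thanks a lot @Joanna, it works perfectly!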
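A likely explanation, stated here as an assumption because the question does not name the HTTP client, is that the client dropped the body of the GET request, so Elasticsearch received an empty search and matched every document with a score of 1.0. The sketch below builds the search as an explicit POST with Python's standard library. The URL (a local node and the index name my_index from the response above) and the helper name `build_search_request` are illustrative assumptions.

```python
import json
import urllib.request

# Assumed endpoint: a local node and the index name seen in the response above.
SEARCH_URL = "http://localhost:9200/my_index/_search"

# Nested query in the style of the answer below, searching the "coast" tag.
QUERY_BODY = {
    "query": {
        "nested": {
            "path": "tags",
            "score_mode": "sum",
            "query": {
                "function_score": {
                    "query": {"match": {"tags.tag": "coast"}},
                    "field_value_factor": {
                        "field": "tags.confidence",
                        "factor": 1,
                        "missing": 0,
                    },
                }
            },
        }
    }
}


def build_search_request(query_body):
    """Build (but do not send) a POST request carrying the query as JSON."""
    payload = json.dumps(query_body).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(QUERY_BODY)
print(request.get_method(), request.full_url)

# To execute it against a running node:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```

Sending the request is left commented out so the snippet runs without a server. The point is only that the method is POST and the query travels in the request body.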

Answer:

You can try this query. It combines scoring by both the confidence and boost_multiplier fields:

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            {
              "nested": {
                "path": "tags",
                "score_mode": "sum",
                "query": {
                  "function_score": {
                    "query": {
                      "match": {
                        "tags.tag": "landscape"
                      }
                    },
                    "field_value_factor": {
                      "field": "tags.confidence",
                      "factor": 1,
                      "missing": 0
                    }
                  }
                }
              }
            }
          ]
        }
      },
      "field_value_factor": {
        "field": "boost_multiplier",
        "factor": 1,
        "missing": 0
      }
    }
  }
}

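When I search for the term coast, it returns:

only the document with id=1, since it is the only one containing this term, with "_score": 100.27469.

When I search for the term landscape, it returns two documents:

the document with id=2, with "_score": 85.83046
the document with id=1, with "_score": 59.7339

The document with id=2 scores higher because it has the higher confidence value for this tag.

When I search for coast landscape, it returns two documents:

the document with id=1, with "_score": 160.00859
the document with id=2, with "_score": 85.83046

Although the document with id=2 has the higher confidence value, the document with id=1 matches both words, so it scores higher. By changing the value of the "factor": 1 parameter, you can decide how much confidence should influence the result.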
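The reported numbers can be checked against a simple model of the query. This is a sketch under one assumption: function_score uses its default boost_mode, multiply, which the query does not override, so each matching tag contributes its text-relevance score times its confidence, and score_mode "sum" adds those contributions. The variable names are illustrative.

```python
import math

# Scores reported above for the two-document index.
coast_doc1 = 100.27469      # query "coast", id=1
landscape_doc1 = 59.7339    # query "landscape", id=1
landscape_doc2 = 85.83046   # query "landscape", id=2
both_doc1 = 160.00859       # query "coast landscape", id=1

# Confidence values of the "landscape" tag, copied from the documents.
conf_landscape_doc1 = 33.660057321079655
conf_landscape_doc2 = 48.36547551196872

# score_mode "sum": the two-term score should be the sum of the
# single-term scores for the same document.
summed = coast_doc1 + landscape_doc1
sum_matches = math.isclose(summed, both_doc1, abs_tol=1e-4)

# Dividing out the confidence leaves the text-relevance part of the score,
# which should be the same for both documents if the model holds.
relevance_doc1 = landscape_doc1 / conf_landscape_doc1
relevance_doc2 = landscape_doc2 / conf_landscape_doc2
relevance_matches = math.isclose(relevance_doc1, relevance_doc2, abs_tol=1e-4)

print(sum_matches, relevance_matches)
```

If both checks hold, the ranking for a single term is decided entirely by confidence, and a second matching term adds its own contribution on top.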

The boost_multiplier field

Something more interesting happens when I index a new document. Suppose it is almost identical to the document with id=2, but with "boost_multiplier" : 4 and "id": 3:

{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "3",
  "tags" : [
    ...
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    ...
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 4
}

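Running the same query with the term coast landscape returns three documents:

the document with id=3, with "_score": 360.02664
the document with id=1, with "_score": 182.09859
the document with id=2, with "_score": 90.00666

Although the document with id=3 has only one matching word (landscape), its boost_multiplier value raises its score considerably. Here, too, "factor": 1 lets you decide how much this value should increase the score, and "missing": 0 determines what happens if the field is not indexed.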
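The effect of the outer field_value_factor can be checked with the reported scores. The sketch assumes that the document with id=3 carries the same tags as id=2 (the listing elides them) and that the outer function_score also uses the default multiply boost_mode. The scores for id=1 and id=2 differ from the two-document run, presumably because indexing a third document changed the term statistics used for text relevance.

```python
import math

# Scores reported above for the query "coast landscape" on three documents.
score_doc2 = 90.00666    # boost_multiplier = 1
score_doc3 = 360.02664   # assumed same tags as id=2, boost_multiplier = 4

boost_doc3 = 4

# With boost_mode "multiply", the boosted score should be the unboosted
# score of an otherwise identical document times boost_multiplier.
expected_doc3 = score_doc2 * boost_doc3
ratio = score_doc3 / score_doc2

print(math.isclose(expected_doc3, score_doc3, abs_tol=1e-4))
print(math.isclose(ratio, boost_doc3, abs_tol=1e-4))
```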
