
A Study of Elasticsearch's Relevance Scoring Mechanism

2017-03-19  ginobefun

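By default, Elasticsearch full-text search ranks results with TF-IDF relevance scoring (the default similarity in the 2.x releases that the examples here target). In practice we use multi_match to weight individual fields, should clauses to boost particular documents, or the more advanced function_score to customize scoring. With the help of Elasticsearch's explain feature, we can study how each of these works in some depth.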

Creating an index

curl -s -XPUT 'http://localhost:9200/gino_test/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
         },
         "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}'

Insert the test data:

_index    | _type | _id | text                | fullname
gino_test | tweet | 1   | hello world         | gino zhang
gino_test | tweet | 2   | gino like world cup | gino li
gino_test | tweet | 3   | my cup              | jsper li
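
The article does not show how the rows were inserted. One possible way, sketched here as an assumption rather than the author's method, is to send them through the _bulk endpoint. The helper name build_bulk_body is made up for illustration; it only builds the newline-delimited request body, which could then be POSTed to /_bulk.

```python
import json

# The three test documents from the table above.
DOCS = [
    {"_id": "1", "text": "hello world", "fullname": "gino zhang"},
    {"_id": "2", "text": "gino like world cup", "fullname": "gino li"},
    {"_id": "3", "text": "my cup", "fullname": "jsper li"},
]


def build_bulk_body(docs, index="gino_test", doc_type="tweet"):
    """Build a _bulk request body: one action line plus one source line per document.

    Hypothetical helper for illustration; index and type names follow the article.
    """
    lines = []
    for doc in docs:
        source = {key: value for key, value in doc.items() if key != "_id"}
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["_id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


print(build_bulk_body(DOCS), end="")
```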

The simple case: scoring a single-field match

POST http://192.168.102.216:9200/gino_test/_search
{
  "explain": true,
  "query": {
    "match": {
      "text": "my cup"
    }
  }
}

Query result: score_simple.json
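
With "explain": true, each hit carries an _explanation object: a tree of nodes with value, description and details fields. The sketch below prints such a tree with indentation so that it is easier to read. The SAMPLE tree is a hand-written placeholder that only shows the shape; its numbers and descriptions are not taken from score_simple.json.

```python
def format_explanation(node, depth=0):
    """Render an explain node and its children as indented lines."""
    indent = "  " * depth
    lines = ["{}{:.4f}  {}".format(indent, node["value"], node["description"])]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Placeholder tree: illustrative shape only, NOT real Elasticsearch output.
SAMPLE = {
    "value": 1.0,
    "description": "sum of:",
    "details": [
        {"value": 0.75, "description": "placeholder child A", "details": []},
        {"value": 0.25, "description": "placeholder child B", "details": []},
    ],
}

print("\n".join(format_explanation(SAMPLE)))
```

In real use the node to pass in would be hit["_explanation"] for each hit in the search response.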

Scoring analysis:

[Figure: score_simple]

Elasticsearch's default relevance scoring currently uses Lucene's TF-IDF implementation.

TF-IDF

Let's take a closer look at this formula:

score(q,d)  =  queryNorm(q)  · coord(q,d)  · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d))    

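Note: the statistics used in this calculation (such as document counts and document frequencies) are those of the shard the document lives on, not of the whole index.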

score(q,d) = _score(q,d.f)                                               --------- ①
= queryNorm(q) · coord(q,d) · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d))
= coord(q,d) · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d) · queryNorm(q))
= coord(q,d.f) · ∑ _score(q.ti, d.f) [ti in q]                           --------- ②
= coord(q,d.f) · (_score(q.t1, d.f) + _score(q.t2, d.f))

multi_match multi-field scoring (best_fields mode)

POST http://192.168.102.216:9200/gino_test/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "gino cup",
      "fields": [
        "text^8",
        "fullname^5"
      ]
    }
  }
}

Query result: score_bestfields.json

Scoring analysis:

score(q,d) = max(_score(q, d.fi)) = max(_score(q, d.f1), _score(q, d.f2))
= max(coord(q,d.f1) · (_score(q.t1, d.f1) + _score(q.t2, d.f1)), coord(q,d.f2) · (_score(q.t1, d.f2) + _score(q.t2, d.f2)))
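
The max in this formula comes from the dis_max query that best_fields builds over one match query per field. A small sketch of that combination step, assuming the per-field scores (already including the ^8 and ^5 boosts) have been read from the explain output; the input values below are made-up illustrations, and the tie_breaker parameter defaults to 0, which gives the pure max used in the analysis above.

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times the scores of the other matching fields."""
    matching = [score for score in field_scores if score > 0.0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Illustrative per-field scores for one document (not taken from the article's results).
text_score, fullname_score = 0.8, 0.5
print(dis_max_score([text_score, fullname_score]))
print(dis_max_score([text_score, fullname_score], tie_breaker=0.3))
```

With the default tie_breaker only the strongest field counts; a non-zero tie_breaker lets the weaker fields contribute a fraction of their score.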

multi_match multi-field scoring (cross_fields mode)

POST http://192.168.102.216:9200/gino_test/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "gino cup",
      "type": "cross_fields",
      "fields": [
        "text^8",
        "fullname^5"
      ]
    }
  }
}

Query result: score_crossfields.json

Scoring analysis:

score(q, d) = ∑ (_score(q.ti, d.f)) = ∑ (_score(q.t1, d.f), _score(q.t2, d.f))
= ∑ (max(coord(q.t1,d.f) · _score(q.t1, d.f1), coord(q.t1,d.f) · _score(q.t1, d.f2)), max(coord(q.t2,d.f) · _score(q.t2, d.f1), coord(q.t2,d.f) · _score(q.t2, d.f2)))
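
In other words, cross_fields works term by term: for each query term it keeps the best score that term achieves in any field, then sums over the terms. The sketch below follows the article's formula only; it is a simplification, since the real implementation also blends term statistics across the fields before scoring. The input scores are made-up illustrations and the function name is invented.

```python
def cross_fields_score(term_field_scores):
    """Sum over query terms of the best per-field score of that term.

    term_field_scores maps term -> {field: score}; fields where the term
    does not match are simply absent.
    """
    return sum(
        max(per_field.values(), default=0.0)
        for per_field in term_field_scores.values()
    )


# Illustrative values: "gino" matches both fields, "cup" only matches text.
scores = {
    "gino": {"text": 0.4, "fullname": 0.6},
    "cup": {"text": 0.3},
}
print(cross_fields_score(scores))
```

Here "gino" is credited with its fullname score and "cup" with its text score, so a document can score well even when no single field contains every term.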

Boosting scores with should clauses

To have something to filter and boost on, add a tags field to gino_test/tweet.

PUT /gino_test/_mapping/tweet
{
  "properties": {
    "tags": {
      "type": "string",
      "analyzer": "fulltext_analyzer"
    }
  }
}

Add tag values to the documents:

_index    | _type | _id | text                | fullname   | tags
gino_test | tweet | 1   | hello world         | gino zhang | new, gino
gino_test | tweet | 2   | gino like world cup | gino li    | hobby, gino
gino_test | tweet | 3   | my cup              | jsper li   | goods, jasper

POST http://192.168.102.216:9200/gino_test/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": {
            "multi_match": {
              "query": "gino cup",
              "fields": [
                "text^8",
                "fullname^5"
              ],
              "type": "best_fields",
              "operator": "or"
            }
          },
          "should": [
            {
              "term": {
                "tags": {
                  "value": "goods",
                  "boost": 6
                }
              }
            },
            {
              "term": {
                "tags": {
                  "value": "hobby",
                  "boost": 3
                }
              }
            }
          ]
        }
      }
    }
  }
}

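Query result: score_should.json

Scoring analysis:

[Figure: score_should]

Adding boosted should clauses effectively adds one more term to the score: a document that matches a should clause has that clause's boosted score added to the score of the must clause. See the calculation above for the details.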
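
As a rough sketch of how the inner bool query combines its clauses under the classic TF-IDF similarity: the scores of the matching clauses are summed and multiplied by coord, the fraction of scoring clauses that matched. The clause scores here are made-up placeholders (the real ones come from explain), and the function name is invented for illustration.

```python
def bool_score(must_scores, should_scores):
    """Combine clause scores as coord · Σ(matching clause scores).

    should_scores has one entry per should clause; None means the clause
    did not match the document.
    """
    matched_should = [score for score in should_scores if score is not None]
    total_clauses = len(must_scores) + len(should_scores)
    matched_clauses = len(must_scores) + len(matched_should)
    coord = matched_clauses / total_clauses
    return coord * (sum(must_scores) + sum(matched_should))


# One must clause (the multi_match) and two should clauses (goods, hobby).
print(bool_score([1.0], [None, None]))  # neither tag matches
print(bool_score([1.0], [0.5, None]))   # only tags:goods matches
```

A matching should clause therefore helps twice: it adds its own score and it raises coord.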

Advanced scoring with function_score

DSL format:

{
    "function_score": {
        "query": {},
        "boost": "boost for the whole query",
        "functions": [
            {
                "filter": {},
                "FUNCTION": {}, 
                "weight": number
            },
            {
                "FUNCTION": {} 
            },
            {
                "filter": {},
                "weight": number
            }
        ],
        "max_boost": number,
        "score_mode": "(multiply|max|...)",
        "boost_mode": "(multiply|replace|...)",
        "min_score" : number
    }
}

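Four types of FUNCTION are supported; the experiment here uses field_value_factor.

Let's run an experiment. First add a view-count field to the index: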

PUT /gino_test/_mapping/tweet
{
  "properties": {
    "views": {
      "type": "long",
      "doc_values": true,
      "fielddata": {
        "format": "doc_values"
      }
    }
  }
}

Set a view count on each of the three documents (document 1 is shown; documents 2 and 3 are updated the same way):

POST gino_test/tweet/1/_update
{
    "doc" : {
        "views" : 56
    }
}

The final data looks like this:

_index    | _type | _id | text                | fullname   | tags          | views
gino_test | tweet | 1   | hello world         | gino zhang | new, gino     | 56
gino_test | tweet | 2   | gino like world cup | gino li    | hobby, gino   | 21
gino_test | tweet | 3   | my cup              | jsper li   | goods, jasper | 68

Run a query:

{
  "explain": true,
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "gino cup",
          "type": "cross_fields",
          "fields": [
            "text^8",
            "fullname^5"
          ]
        }
      },
      "boost": 2,
      "functions": [
        {
          "field_value_factor": {
            "field": "views",
            "factor": 1.2,
            "modifier": "sqrt",
            "missing": 1
          }
        },
        {
          "filter": {
            "term": {
              "tags": {
                "value": "goods"
              }
            }
          },
          "weight": 4
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}

Query result: score_function.json

Scoring analysis:

score(q,d) = score_query(q,d) * (score_fvf(`views`) * score_filter(`tags:goods`))
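
The function part of this formula can be worked out by hand. The sketch below assumes the settings of the query above: field_value_factor computes sqrt(factor * views) with factor 1.2 and falls back to missing = 1 when the field is absent, the filter function contributes weight 4 only when the document has the goods tag, score_mode multiply combines the functions, and boost_mode multiply applies the result to the query score. The query score itself is left as an input (it comes from explain), the top-level boost is not modeled, and the function names are invented for illustration.

```python
import math


def field_value_factor(views, factor=1.2, missing=1):
    """sqrt modifier applied to factor * field value, as configured in the query."""
    value = missing if views is None else views
    return math.sqrt(factor * value)


def function_factor(views, tags, weight=4.0):
    """score_mode=multiply: product of the functions that apply to the document."""
    result = field_value_factor(views)
    if "goods" in tags:  # the filter function only applies when the tag matches
        result *= weight
    return result


def final_score(query_score, views, tags):
    """boost_mode=multiply: query score times the combined function value."""
    return query_score * function_factor(views, tags)


# Function factors for the three test documents.
for doc_id, views, tags in [
    (1, 56, ["new", "gino"]),
    (2, 21, ["hobby", "gino"]),
    (3, 68, ["goods", "jasper"]),
]:
    print(doc_id, round(function_factor(views, tags), 4))
```

Document 3 gets by far the largest multiplier, because it has the most views and is the only one that passes the tags:goods filter.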

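Rescoring with rescore

Official ES documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-request-rescore.html

Rescoring is not applied to every matching document. For example, to return the top 10 hits, each shard first finds its own top 10 with the normal query, then applies the rescore rule to just those hits (the per-shard count is set by window_size) and returns them to the coordinating node, which merges and sorts the results before returning them to the user.

rescore supports chaining several rescore rules, and the rescore score can be combined with the original score (for example as a weighted sum).

Because rescore only scores a small number of documents, it should perform somewhat better than running the expensive scoring on every match. It does interact with global ordering, though, since each shard rescores only its own top hits before the results are merged, so take care when using it in practice.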
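
A minimal sketch of what happens on one shard, assuming the weighted-sum combination (score_mode total): only the top window_size hits are rescored, each new score is query_weight * original + rescore_query_weight * rescore, and the window is re-sorted while the remaining hits keep their order. The hits and the bonus function below are made-up illustrations, and the function name is invented.

```python
def rescore(hits, rescore_fn, window_size=10, query_weight=1.0, rescore_query_weight=1.0):
    """Rescore the top window_size hits of one shard with a weighted sum.

    hits is a list of (doc_id, score) sorted by descending score;
    rescore_fn returns the rescore query's score for a doc (0.0 if no match).
    """
    window, rest = hits[:window_size], hits[window_size:]
    rescored = [
        (doc_id, query_weight * score + rescore_query_weight * rescore_fn(doc_id))
        for doc_id, score in window
    ]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    # Hits outside the window are left untouched and stay behind the window.
    return rescored + rest


# Illustrative shard-level hits; only "b" matches the rescore query.
hits = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
bonus = {"b": 5.0}
print(rescore(hits, lambda doc_id: bonus.get(doc_id, 0.0), window_size=2))
```

Note that "c" is never rescored because it falls outside the window, which is exactly why a window that is too small can hide documents the rescore rule would have promoted.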

References

  1. Elasticsearch official documentation
  2. Elasticsearch: The Definitive Guide
  3. The Lucene TF-IDF algorithm