Lucene 8--magic WAND

2019-06-12  本文已影响0人  lionel880

ElasticSearch 7更新后,探究top-k的查询优化
参考资料:ES7更新notes
Magic WAND: Faster Retrieval of Top Hits in Elasticsearch

背景

探究优化,总要知道以前是怎么样的
以前如果要得到 top10,则在各个分片上计算 top10,在进行合并比较得到全局的top10,这也是深分页的原罪,分布式情况下,对于全局排序的困境

演变1

95年提出优化初步方案
方案内容:

还是回到刚刚的例子,一开始,查询所有文档,每个文档都要计算得分,随着文档数查询的增多,你需要的top10的最低分增大,当增大到3.0时,我们知道,必须要包含kibana单词,因此就可以将kibana从第一个候选set中移到接下去的查询条件中,使得查询更为精确,是的后续的查询变得更快。你可以想象一个例子
a:1,b:2,c:5,d:10
你一开始全算得分,得到最低分为2后,你知道要查b,c,d
最低分到5,你要查c,d
最低分到10,你要查d

演变2

之前的方案那么好,那么难点在哪呢?以至于过了几年才实现

按照官网的说法:

we were able to optimize collection of the top matches on dynamic indexes. 
With this change, disjunctions went between 40% and 13x faster when running Lucene's benchmark suite.

通过跳过一个无需计算分数的文档,使得查询的效率得到提高,但这有一个前提,每个term的得分不可以为负数

演变3

WAND算法哪些缺陷?
一个异常值,可能会使得所有的算法优化速度下降

Block-max WAND是WAND的变体,它利用了每个块可能具有不同最大分数的事实。
它将数据分为一个个的block,然后利用block的最大值而不是全局最大值,来避免异常值的影响。
也因为有block max,可以使得优化时,某些情况下,能直接跳过整个block

*   Scorer gets the same methods as ImpactsEnum: advanceShallow and getMaxScore, with the same contract
     - Disjunctions and conjunctions skip over entire blocks when the sum of max scores is not competitive.
     - Disjunctions now use the block max score instead of the global max score, which helps skip over more documents.
     - This is not documented in the paper, but when a minimum score is set, the score is computed on the fly in order to move to the next doc faster. I did this based on the observation that computing a score was often less costly than advancing another clause. This is especially useful due to the fact that on the contrary to term queries, the maximum score of a block is often not collected.
*   Another difference with the paper is the fact that we have information about blocks at multiple levels. This is one of the ideas described in a follow-up paper (see section 4 of [http://engineering.nyu.edu/~suel/papers/bmm.pdf](http://engineering.nyu.edu/~suel/papers/bmm.pdf)) which is based on the observation that you sometimes want to advance by more than ${block-size}.

Results on wikibigall need to be taken carefully as some tasks are either faster or slower depending on which query is run. Typically queries whose terms match lots of documents but the intersection matches few documents are a bit slower because it takes time to get a good lower bound for the score. Baseline is master, and patch is the combination of the patch on [~~LUCENE-4198~~](https://issues.apache.org/jira/browse/LUCENE-4198 "Allow codecs to index term impacts") and this patch.

对于ES和lucene的意义

2个改变

现在仍可以命中所有,但这会消耗性能

{
  "value": 1000,
  "relation": "gte" // can only be "eq" if equal of "gte" if `value` is a lower bound of the hit count
}

可以告诉Elasticsearch通过track_total_hits参数计算的命中数。如果实际命中数小于此数,则响应将返回准确的命中计数,否则响应将设置值为track_total_hits“gte”作为“关系”的命中数,这意味着实际命中count实际上大于或等于这个数字。值越低track_total_hits,查询越快

上一篇 下一篇

猜你喜欢

热点阅读