Semi-join DuplicateWeedout 子查询优化

2020-06-23 本文已影响0人轻松的鱼

本篇为子查询优化系列第四篇，参考自 MariaDB 博客： https://mariadb.com/kb/en/duplicateweedout-strategy/
，建议先看 MySQL子查询优化和 Semi-join Materialization 子查询优化策略

实现思路

DuplicateWeedout 是子查询 semijoin 优化的一种实现策略。它的思路很简单，就是执行普通的 join，然后把结果放入到临时表中，这个临时表有一个主键，包含所有字段，从而达到去重效果。因为我们前面的文章提到过，普通 join 和 semijoin 的区别就在于 semijoin 实现了子查询的没有重复结果的语义。

如下SQL 为例，要查找包含人口占总人口的33％以上并且大于100万的大城市的国家：

select * 
from Country 
where 
   Country.code IN (select City.Country
                    from City 
                    where 
                      City.Population > 0.33 * Country.Population and 
                      City.Population > 1*1000*1000);

首先对 Country 表和 City 表进行常规 join 操作：

join 会产生重复结果，如上图所示：德国有3个大城市，所以结果中有3个德国。接下来就是利用临时表去重：

临时表有一个主键，当把 join 结果写入到临时表时，写第 2 个 "Germany" 时会报主键冲突，这样最终结果就只有 1 个 "Germany"。

执行成本

在执行计划中，会有 "Start temporary" 和 "End temporary" 标示出现在 Extra 列中：

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: range
possible_keys: Population,Country
          key: Population
      key_len: 4
          ref: NULL
         rows: 238
        Extra: Using index condition; Start temporary
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: Country
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: world.City.Country
         rows: 1
        Extra: Using where; End temporary
2 rows in set (0.00 sec)

可以看到联接顺序是 City join Country，join 过程用的算法是 NJL（Index Nested-Loop Join）：

创建临时表，包含一个主键；
对驱动表 City 表使用 Population 索引查找满足条件的行，一共扫描 238 行；
每扫描一行，则取出这行数据，到被驱动表 Country 表中查找满足联接条件的行，走主键索引，每次扫描 1行；
重复 1、2，直到遍历完 238 行，得到结果集；
将结果集写入到临时表中，重复值将报错主键冲突，得到最终结果。

这里总扫描行数为 238+238*1=476.

小结

在执行计划中，DuplicateWeedout 显示为 "Start temporary/End temporary"；
通过参数 optimizer_switch 中的 semijoin=on 和 duplicateweedout=on 开启 Semi-join DuplicateWeedout，默认就是开启的。

Semi-join DuplicateWeedout 子查询优化

实现思路

执行成本

小结

猜你喜欢

热点阅读