join over mysql.8.0 2022-04-15

2022-04-15  本文已影响0人  9_SooHyun

ON clause and WHERE clause in join

从语义上看

但从实际上看
WHERE clause有可能会被mysql优化器放置在join之前执行,目的是在join之前减小表的体积。join的驱动表越小越好,小表驱动大表,可以极大提升join效率。

因此,实际情况是ON clause and WHERE clause:

小结:只在OUTER join时,ON和WHERE有作用时间和作用域的区别,影响执行计划和结果;其余类型的join两者等价

参考:https://stackoverflow.com/questions/354070/sql-join-where-clause-vs-on-clause

Nested Loop Join & Hash Join

Nested Loop Join

Nested Loop Join can be further categorized as Naive Nested Loop Join, Indexed Nested Loop Join and Temporary Index Nested Loop Join

NESTED LOOP JOIN

Assume that a join between three tables t1, t2, and t3 is to be executed using the following join types:

Table Join Type note
t1 range 最外层表,范围查找。mysql优化器往往会拿where条件先过滤部分数据得到小表,小表驱动大表
t2 ref 普通索引的等值查找
t3 ALL 最内层表,全表扫描

If a simple NLJ algorithm is used, the join is processed like this:

for each row in t1 matching range {
  for each row in t2 matching reference key {
    for each row in t3 {
      if row satisfies join conditions, send to client
    }
  }
}

如果内循环是全表扫描,时间复杂度就是O(m x n)
如果内循环是索引扫描,时间复杂度就是O(m x ㏒n)
比如,存在2张表,一个10条记录,一个1000万条记录
以小表为驱动表,则代价为:10(通过索引在大表查询一条记录的代价)
如果1000万的大表没有索引的时候,那么进行全表扫描COST很大
因此,在多表连接时,注意被驱动表的连接字段是否需要创建索引,或者连接字段与该表的其他约束条件字段上是否需要创建复合索引。
最好的情况就是被驱动表的连接字段上建有索引*

Hash Join

Beginning with MySQL 8.0.18, MySQL employs a hash join for any query for which [1.each join contains an equi-join condition], and in which [2.there are no indexes that can be applied to any join conditions]
连接条件包含等连接条件,并且没有索引可以应用于任何连接条件时,使用hash join
(注意,这不是指连接条件的字段上没建索引,而是【无法应用索引】。例如小表作为驱动表,也在连接字段上建了索引,但是大表的连接字段上没有建索引,那么这个连接条件就无法应用索引——驱动表总是全表扫描,不会使用驱动表的索引;这时如果被驱动表连接字段没有索引,被驱动表只能全表扫描,也就是“没有索引可以应用于任何连接条件”)

hash join 基本算法

The classic hash join algorithm for an inner join of two relations proceeds as follows:

The first phase is usually called the "build" phase, while the second is called the "probe" phase. Similarly, the join relation on which the hash table is built is called the "build" input, whereas the other input is called the "probe" input.

根据join的条件,对hash后能够得到更小hashtable的一个表进行hash,entries are mappings from the value of the (composite) join attribute to the remaining attributes of that row (whichever ones are needed)。然后遍历另一个表,拿该表的join条件去查hash table,如果match,则两个表的对应行就完成了join。从理论上看是很简单的算法

hash table大小超出可用内存

The amount of memory available is controlled by the system variable join_buffer_size, and can be adjusted at runtime.

If the build input is bigger than the available memory, it chunks some files to disk

也就是说,完整的hashtable,可能一部分在内存中,一部分在磁盘中。在probe的时候就需要分情况处理了:
During the probe phase,

归根到底,所有的匹配都是在内存中完成的,磁盘只是作为一个暂存的中转站,chunk最终要回到内存

more info see: https://dev.mysql.com/blog-archive/hash-join-in-mysql-8/

explain in mysql 8.0

上一篇 下一篇

猜你喜欢

热点阅读