Optimizing a query on a low-selectivity index 2021-12-06

2021-12-07 9_SooHyun

Background

1. Indexes

Only the index types supported by InnoDB are listed here:

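Primary key index (PRIMARY). A special unique index bound by the primary key constraint; NULL values are not allowed.
Unique index (UNIQUE). The indexed columns may contain NULL but not duplicate values.
Normal index (INDEX). The indexed columns may contain NULL and duplicate values.
Composite index. An index made up of several columns. Entries are sorted by the leftmost column first, then by the second column within equal values of the first, and so on. As a rule of thumb, the more selective a column is, the further left it should go in the composite index. Use of the index is subject to the leftmost-prefix rule: columns are matched from left to right, and matching stops at the first column used in a range condition (>, <, BETWEEN, LIKE); that column can still use the index, but the columns after it cannot narrow the search further.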
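The leftmost-prefix rule follows directly from the sort order of a composite index. Below is a minimal sketch in Python, assuming a hypothetical index on columns (a, b) modelled as a sorted list of tuples; the data and function names are invented for illustration, and this is not how InnoDB stores its B+ trees.

```python
import bisect

# Hypothetical rows (a, b, row_id). A composite index on (a, b) keeps the
# entries sorted by a first, then by b within equal values of a.
rows = [(2, 1, 10), (1, 3, 11), (1, 1, 12), (3, 2, 13), (2, 3, 14), (1, 2, 15)]
index_ab = sorted(rows)

def lookup_by_a(a):
    """Leftmost column given: binary search finds one contiguous slice."""
    lo = bisect.bisect_left(index_ab, (a,))
    hi = bisect.bisect_left(index_ab, (a + 1,))
    return index_ab[lo:hi]

def lookup_by_b(b):
    """Only the second column given: matches are scattered across the
    index, so every entry has to be checked."""
    return [entry for entry in index_ab if entry[1] == b]

print(lookup_by_a(1))
print(lookup_by_b(2))
```

Entries with the same `a` sit next to each other, while entries with the same `b` do not, which is why a condition on `b` alone cannot use the sort order.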

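All of these indexes are implemented as B+ trees; a table has one B+ tree per index.

Indexes fall into two broad categories: the primary key index, also called the clustered index, and all other indexes, called secondary indexes. A secondary index entry stores the primary key value rather than the full row, so fetching columns that are not in the secondary index requires a further lookup in the clustered index (referred to below as a lookup back into the table).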
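To make the relationship between the two kinds of index concrete, here is a toy model in Python. It is an assumption made for illustration, not InnoDB's real layout: the clustered index is a dict from primary key to full row, and a secondary index is a dict from indexed value to a list of primary keys. Reading a full row through the secondary index therefore costs one extra lookup in the clustered index per match (the "lookup back into the table", 回表).

```python
# Toy clustered index: primary key -> full row (hypothetical data).
clustered = {
    1: {"id": 1, "name": "a", "is_deleted": 0},
    2: {"id": 2, "name": "b", "is_deleted": 1},
    3: {"id": 3, "name": "c", "is_deleted": 0},
}

# Toy secondary index on is_deleted: indexed value -> list of primary keys.
secondary_is_deleted = {}
for pk, row in clustered.items():
    secondary_is_deleted.setdefault(row["is_deleted"], []).append(pk)

def find_by_is_deleted(value):
    """Return the matching rows and the number of clustered-index lookups."""
    pks = secondary_is_deleted.get(value, [])
    rows = [clustered[pk] for pk in pks]  # one lookup back per matching key
    return rows, len(pks)

rows, lookups = find_by_is_deleted(0)
print(lookups, [r["id"] for r in rows])
```

The number of lookups grows with the number of matching keys, which is why an index on a column with very few distinct values can be expensive to use.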

2. MySQL EXPLAIN

The EXPLAIN command returns the execution plan of a SQL statement. Its output columns are described below:
EXPLAIN Output Columns

Column	JSON Name	Meaning
`id`	`select_id`	The `SELECT` identifier
`select_type`	None	The `SELECT` type
`table`	`table_name`	The table for the output row
`partitions`	`partitions`	The matching partitions
`type`	`access_type`	The join type
`possible_keys`	`possible_keys`	The possible indexes to choose
`key`	`key`	The index actually chosen
`key_len`	`key_length`	The length of the chosen key
`ref`	`ref`	The columns compared to the index, shows which columns or constants are compared to the index named in the key column to select rows from the table
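`rows`	`rows`	Estimate of rows to be examined, i.e. the estimated number of rows probed by the access method shown in the type column
`filtered`	`filtered`	Percentage of rows filtered by table condition, i.e. the estimated percentage of the rows examined via the access method in type that remain after the table condition is applied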
`Extra`	None	Additional information

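First, MySQL reads the table using the access method shown in type and expects to examine the number of rows given in the rows column.
Next, MySQL applies the remaining query conditions (see Extra) to filter those rows a second time.
Finally, n rows satisfy the query, and filtered = n / rows × 100%.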
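The relationship between rows and filtered can be turned around to estimate how many rows survive the filter. A small sketch, where the function name and the sample numbers are hypothetical and not taken from the case below:

```python
def estimated_rows_after_filter(rows, filtered):
    """rows and filtered as shown by EXPLAIN; filtered is a percentage (0-100)."""
    return rows * filtered / 100

# Hypothetical EXPLAIN values: 1000 rows examined, 10% expected to remain.
print(estimated_rows_after_filter(1000, 10))  # 100.0
```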

EXPLAIN Join Types

The type column of EXPLAIN output describes how tables are joined.
https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain-join-types

The following list describes the join types, ordered from the best type to the worst

system: The table has only one row (= system table). This is a special case of the const type.
const: const is used when you compare all parts of a PRIMARY KEY or UNIQUE index to constant values.
e.g.:

SELECT * FROM tbl_name WHERE primary_key=1;

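eq_ref: Equality lookup on a unique index. For each key value, at most one row in the table matches. It is typical for joins on a primary key or unique index. Compared with const: const compares the primary or unique key with a constant, so the lookup happens once, whereas eq_ref compares it with a value taken from each row of the preceding table, so the lookup is repeated once per such row.
e.g.: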

SELECT * FROM ref_table,other_table
  WHERE ref_table.key_column=other_table.column;

SELECT * FROM ref_table,other_table
  WHERE ref_table.key_column_part1=other_table.column
  AND ref_table.key_column_part2=1;

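ref: Equality lookup on a non-unique index; returns all rows that match a given index value. ref is used if the join cannot select a single row based on the key value.
range: Range lookup, i.e. a partial scan of the index. Only rows in a given range are retrieved through the index. range can be used when an indexed column is compared to a constant using any of the =, <>, >, >=, <, <=, IS NULL, <=>, BETWEEN, LIKE, or IN() operators.
A range scan on an indexed column is better than a full index scan: it starts at one point in the index and ends at another, without scanning the whole index.
index: Full index scan; the data is read from the index. It differs from ALL in that only the index tree is traversed. This is usually faster than ALL because the index is usually smaller than the table data. (Both index and ALL read every entry, but index reads the entries of an index tree, whereas ALL reads the full rows of the table.)
For example, when the query condition uses part of a composite index but does not follow the leftmost-prefix rule, an index scan may be chosen. It is far less efficient than a lookup that follows the leftmost-prefix rule, because the index is walked entry by entry from the first one, checking each entry against the condition.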

From the MySQL 8.0 manual:

The index join type is the same as ALL, except that only the index tree is scanned. This occurs two ways:
That is, index is a full scan just like ALL, but it scans only the index tree.

Covering index. If the index is a covering index for the queries and can be used to satisfy all data required from the table, only the index tree is scanned. In this case, the Extra column says Using index. An index-only scan (a scan satisfied by a covering index, with no lookup back into the table) usually is faster than ALL because the size of the index usually is smaller than the table data.
Note that index and ALL both end up reading the data of the whole table. If the index is not covering, an index scan has to read the index first and then fetch each row from the table with random reads, and in that case index is no faster than ALL.

Full table scan in index order. A full table scan is performed using reads from the index to look up data rows in index order. Uses index does not appear in the Extra column.

MySQL can use this join type when the query uses only columns that are part of a single index.
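Whether an index covers a query depends only on which columns the query touches. The sketch below uses SQLite through Python's sqlite3 module purely as a runnable stand-in; MySQL's EXPLAIN output looks different (it reports Using index in Extra), and the table t, its columns and the index idx_a are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, c TEXT)")
conn.execute("CREATE INDEX idx_a ON t (a)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a textual description.
    cursor = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in cursor)

# Only column a is needed, so the index alone can answer the query.
covered = plan("SELECT a FROM t WHERE a = 1")
# Column c is not in the index, so each match needs a lookup in the table.
not_covered = plan("SELECT a, c FROM t WHERE a = 1")

print(covered)
print(not_covered)
```

The first plan mentions a covering index and the second does not, even though both queries filter on the same column.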

ALL: Full table scan; every row of the table is read to find the matching rows. A full table scan (also known as a sequential scan) is a scan made on a database where each row of the table is read in a sequential (serial) order and the columns encountered are checked for the validity of a condition. Full table scans are usually the slowest method of scanning a table due to the heavy amount of I/O reads required from the disk which consists of multiple seeks as well as costly disk to memory transfers (from Wikipedia). Although a full table scan reads the whole table, it walks the primary key (the clustered index) with sequential I/O, so it is sometimes faster than a plan that needs a large number of lookups back into the table, because those lookups are random I/O.

EXPLAIN Extra

Using index. Indicates a covering index: the requested columns are read directly from the index, with no lookup back into the table.
Using where. A WHERE clause is used to restrict which rows to match against the next table or send to the client. In other words, rows are filtered further at the server layer.
Using index condition. Indicates index condition pushdown. Tables are read by accessing index tuples and testing them first to determine whether to read full table rows. In this way, index information is used to defer (“push down”) reading full table rows unless it is necessary. See “Index Condition Pushdown Optimization”.

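Case study

Tables:

Three tables

host. 500,000 rows, primary key host.id; there is also a secondary index IDX_host_is_deleted on a column with only two distinct values
resource_pool. 100 rows, primary key resource_pool.id
biz_module. 30,000 rows, primary key biz_module.id

Goal:

Query each resource_pool together with the number of hosts in it: SELECT resource_pool.*, count(host.id) count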

Original SQL

The query takes 7 s.

SELECT resource_pool.*, count(host.id) count 
FROM `resource_pool` 
Inner JOIN biz_module 
  ON resource_pool.module = biz_module.path 
Left JOIN host 
  ON host.module_id = biz_module.id and host.is_deleted = 0 
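-- host has the two-valued (hence very low-selectivity) index IDX_host_is_deleted, and EXPLAIN shows that this query uses it.
-- That causes a large number of lookups back into the table, i.e. a lot of random I/O, which is very inefficient: far slower than a full table scan over the primary key (the primary key is the clustered index, so scanning it is sequential I/O).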
GROUP BY resource_pool.id 
ORDER BY resource_pool.name 
LIMIT 20

Execution plan

id	table	type	key	ref	rows	filtered	Extra
1	resource_pool	index	PRIMARY	---	11	100	Using temporary; Using filesort
1	biz_module	ref	IDX_biz_module_path	om2.resource_pool.module	1	100	Using where; Using index
1	host	ref	IDX_host_is_deleted	const	191385	100	Using where

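Optimization:

Adding +0 to is_deleted turns the column into an expression, so no index on is_deleted can be used for this predicate in this query.
In addition, after adding +0 the LEFT JOIN with host was executed as a hash join, which improved the query further.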

SELECT resource_pool.*, count(host.id) count 
FROM `resource_pool` 
Inner JOIN biz_module 
  ON resource_pool.module = biz_module.path 
Left JOIN host 
  ON host.module_id = biz_module.id and host.is_deleted +0 = 0 
GROUP BY resource_pool.id 
ORDER BY resource_pool.name 
LIMIT 20

or

SELECT resource_pool.*, count(host.id) count 
FROM `resource_pool` 
Inner JOIN biz_module 
  ON resource_pool.module = biz_module.path 
Left JOIN host 
  IGNORE INDEX (IDX_host_is_deleted)  -- ignore the IDX_host_is_deleted index
  ON host.module_id = biz_module.id and host.is_deleted = 0 
GROUP BY resource_pool.id 
ORDER BY resource_pool.name 
LIMIT 20

Execution plan

id	table	type	key	ref	rows	filtered	Extra
1	resource_pool	ALL			11	100	Using temporary; Using filesort
1	biz_module	ref	IDX_biz_module_path	om2.resource_pool.module	1	100	Using where; Using index
1	host	ALL			382770	100	Using where; Using join buffer (hash join)
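The effect of the +0 rewrite can be reproduced on a small scale. The sketch below uses SQLite through Python's sqlite3 module only as a stand-in, on the assumption that it behaves like MySQL in this one respect: a predicate on the bare column can use the index, while a predicate on the expression is_deleted + 0 cannot. The host table is reduced to the three columns that appear in the query, and the plan text is SQLite's, not MySQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host (id INTEGER PRIMARY KEY, module_id INTEGER, is_deleted INTEGER)"
)
conn.execute("CREATE INDEX IDX_host_is_deleted ON host (is_deleted)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a textual description.
    cursor = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in cursor)

# Bare column: the optimizer is free to pick the low-selectivity index.
with_index = plan("SELECT * FROM host WHERE is_deleted = 0")
# Column wrapped in an expression: the index no longer applies.
without_index = plan("SELECT * FROM host WHERE is_deleted + 0 = 0")

print(with_index)
print(without_index)
```

The rewrite does not change which rows match; it only removes the index from the set of access paths the optimizer can consider.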

Or move the is_deleted check into the aggregate, so that the filtering is done in the server layer while counting:

SELECT resource_pool.*, count(case when host.is_deleted = 0 then 1 end) count 
FROM `resource_pool` 
Inner JOIN biz_module 
  ON resource_pool.module = biz_module.path 
Left JOIN host 
  ON host.module_id = biz_module.id 
GROUP BY resource_pool.id 
ORDER BY resource_pool.name 
LIMIT 20

These two approaches have the same execution plan.
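It is worth checking that counting with CASE returns the same numbers as filtering in the ON clause. Here is a small check using SQLite through Python's sqlite3 module as a stand-in, with a reduced schema and made-up rows (two modules, three hosts); resource_pool is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE biz_module (id INTEGER PRIMARY KEY);
CREATE TABLE host (id INTEGER PRIMARY KEY, module_id INTEGER, is_deleted INTEGER);
INSERT INTO biz_module VALUES (1), (2);
INSERT INTO host VALUES (10, 1, 0), (11, 1, 1), (12, 2, 1);
""")

# Filter in the ON clause, as in the original query.
on_filter = conn.execute("""
SELECT m.id, count(h.id) FROM biz_module m
LEFT JOIN host h ON h.module_id = m.id AND h.is_deleted = 0
GROUP BY m.id ORDER BY m.id
""").fetchall()

# Join on module_id only and test is_deleted inside count().
case_count = conn.execute("""
SELECT m.id, count(CASE WHEN h.is_deleted = 0 THEN 1 END) FROM biz_module m
LEFT JOIN host h ON h.module_id = m.id
GROUP BY m.id ORDER BY m.id
""").fetchall()

print(on_filter)
print(case_count)
```

Both forms keep module 2 in the result with a count of 0, because count() ignores the NULL produced either by the unmatched LEFT JOIN row or by the CASE expression.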

Exploring moving the LEFT JOIN condition of the original SQL down into WHERE

To be continued

SELECT resource_pool.*, count(host.id) count 
FROM `resource_pool` 
Inner JOIN biz_module 
  ON resource_pool.module = biz_module.path 
Left JOIN host 
  ON host.module_id = biz_module.id 
WHERE host.is_deleted = 0 
GROUP BY resource_pool.id 
ORDER BY resource_pool.name 
LIMIT 20
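One point that can be stated before any plan analysis: moving the is_deleted test from ON to WHERE changes the result, not just the plan. A sketch using SQLite through Python's sqlite3 module as a stand-in, with a reduced schema and made-up rows; this illustrates general LEFT JOIN semantics and is not a result from the original tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE biz_module (id INTEGER PRIMARY KEY);
CREATE TABLE host (id INTEGER PRIMARY KEY, module_id INTEGER, is_deleted INTEGER);
INSERT INTO biz_module VALUES (1), (2), (3);
INSERT INTO host VALUES (10, 1, 0), (11, 1, 1), (12, 2, 1);
""")

# Condition in ON: modules without a live host still appear, with count 0.
on_filter = conn.execute("""
SELECT m.id, count(h.id) FROM biz_module m
LEFT JOIN host h ON h.module_id = m.id AND h.is_deleted = 0
GROUP BY m.id ORDER BY m.id
""").fetchall()

# Condition in WHERE: rows whose host side is NULL or deleted are removed
# after the join, so those modules disappear from the result.
where_filter = conn.execute("""
SELECT m.id, count(h.id) FROM biz_module m
LEFT JOIN host h ON h.module_id = m.id
WHERE h.is_deleted = 0
GROUP BY m.id ORDER BY m.id
""").fetchall()

print(on_filter)
print(where_filter)
```

With the condition in WHERE the LEFT JOIN effectively behaves like an inner join, so resource pools with no live hosts would be missing rather than reported with a count of 0.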