Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf-tuning: add docs for subquery optimizations #3191

Merged
merged 16 commits into from
May 25, 2020
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
55 changes: 55 additions & 0 deletions correlated-subquery-optimization.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,58 @@ category: performance
---

# 关联子查询去关联

在[子查询相关的优化](/subquery-optimization.md)中我们介绍了当没有关联列时,TiDB是如何处理子查询的。由于关联子查询解除关联依赖是一个比较复杂的工作,我们这里介绍一些简单的场景以及这个优化规则的适用范围。
winoros marked this conversation as resolved.
Show resolved Hide resolved

winoros marked this conversation as resolved.
Show resolved Hide resolved
以 `select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)` 为例,这里子查询 `t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)` 中涉及了关联列上的条件 `t2.b=t1.b`,不过恰好由于这是一个等值条件,因此我们可以将其等价的改写为 `select t1.* from t1, (select b, sum(a) sum_a from t2 group by b) t2 where t1.b = t2.b and t1.a < t2.sum_a;`这样一个关联子查询就被重新改写为 `JOIN` 的形式。

TiDB 之所以要进行这样的改写,是因为关联子查询每次子查询执行时都是要和它的外部查询结果绑定的。在上面的例子中如果 `t1.a` 有一千万个值,那这个子查询就要被重复执行一千万次,因为 `t2.b=t1.b` 这个条件会随着 `t1.a` 值的不同而发生变化。当我们通过一些手段将关联依赖解除后,这个子查询就只需要被执行一次了。

同时这种改写的弊端在于在关联没有被解除时,我们是可以使用关联列上的索引的,也就是说虽然这个子查询可能被重复执行多次,但是每次都可以使用索引过滤数据。而解除关联的变换上,通常是会导致关联列的位置发生改变而导致虽然子查询只被执行了一次,但是单次执行的时间会比没有解除关联时的单次执行时间长。因此在外部的值比较少的情况下,不解除关联依赖反而可能对执行性能更优帮助。我们可以通过[优化规则及表达式下推的黑名单](/blacklist-control-plan.md)中关闭`子查询去关联`优化规则的方式来关闭这个优化。

```
mysql> create table t1(a int, b int);
winoros marked this conversation as resolved.
Show resolved Hide resolved
Query OK, 0 rows affected (0.01 sec)

mysql> create table t2(a int, b int, index idx(b));
Query OK, 0 rows affected (0.02 sec)

mysql> explain select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b);
+----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+
| HashJoin_11 | 9990.00 | root | | inner join, equal:[eq(test.t1.b, test.t2.b)], other cond:lt(cast(test.t1.a), Column#7) |
| ├─HashAgg_23(Build) | 7992.00 | root | | group by:test.t2.b, funcs:sum(Column#8)->Column#7, funcs:firstrow(test.t2.b)->test.t2.b |
| │ └─TableReader_24 | 7992.00 | root | | data:HashAgg_16 |
| │ └─HashAgg_16 | 7992.00 | cop[tikv] | | group by:test.t2.b, funcs:sum(test.t2.a)->Column#8 |
| │ └─Selection_22 | 9990.00 | cop[tikv] | | not(isnull(test.t2.b)) |
| │ └─TableFullScan_21 | 10000.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo |
| └─TableReader_15(Probe) | 9990.00 | root | | data:Selection_14 |
| └─Selection_14 | 9990.00 | cop[tikv] | | not(isnull(test.t1.b)) |
| └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
+----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+
mysql> insert into mysql.opt_rule_blacklist values("decorrelate");
Query OK, 1 row affected (0.00 sec)

mysql> admin reload opt_rule_blacklist;
Query OK, 0 rows affected (0.00 sec)

mysql> explain select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b);
+----------------------------------------+----------+-----------+------------------------+------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------------------+----------+-----------+------------------------+------------------------------------------------------------------------------+
| Projection_10 | 10000.00 | root | | test.t1.a, test.t1.b |
| └─Apply_12 | 10000.00 | root | | CARTESIAN inner join, other cond:lt(cast(test.t1.a), Column#7) |
| ├─TableReader_14(Build) | 10000.00 | root | | data:TableFullScan_13 |
| │ └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
| └─MaxOneRow_15(Probe) | 1.00 | root | | |
| └─HashAgg_27 | 1.00 | root | | funcs:sum(Column#10)->Column#7 |
| └─IndexLookUp_28 | 1.00 | root | | |
| ├─IndexRangeScan_25(Build) | 10.00 | cop[tikv] | table:t2, index:idx(b) | range: decided by [eq(test.t2.b, test.t1.b)], keep order:false, stats:pseudo |
| └─HashAgg_17(Probe) | 1.00 | cop[tikv] | | funcs:sum(test.t2.a)->Column#10 |
| └─TableRowIDScan_26 | 10.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo |
+----------------------------------------+----------+-----------+------------------------+------------------------------------------------------------------------------+

```

上面的样例中展示了打开和关闭优化规则前后执行计划的不同。其中在关闭优化规则的情况下我们可以在 `IndexRangeScan_25(Build)` 的 `operator info` 中看到 `range: decided by [eq(test.t2.b, test.t1.b)]`,便是关联依赖未被接触时,可以使用关联条件进行索引范围查询的展示。
72 changes: 71 additions & 1 deletion subquery-optimization.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,74 @@ title: 子查询相关的优化
category: performance
---

# 子查询相关的优化
# 子查询相关的优化

lilin90 marked this conversation as resolved.
Show resolved Hide resolved
通常我们遇到的会是如下情况的子查询:

- `NOT IN (SELECT ... FROM ...)`
- `NOT EXISTS (SELECT ... FROM ...)`
- `IN (SELECT ... FROM ..)`
- `EXISTS (SELECT ... FROM ...)`
- `... >/>=/</<=/=/!= (SELECT ... FROM ...)`

有时,会遇到子查询中包含了非子查询中的列,如 `select * from t where t.a in (select * from t2 where t.b=t2.b)` 中子查询中的 `t.b` 便不是子查询中的列而是从子查询外面引入的列。这种子查询通常会被成为`关联子查询`,外部引入的列会被成为`关联列`,关联子查询相关的优化可以在[关联子查询去关联](/correlated-subquery-optimization.md)中查看。本节中我们主要关注不涉及关联列的子查询。

子查询默认会以[理解 TiDB 执行计划](/query-execution-plan.md)中提到的 `semi join` 作为默认的执行方式,同时对于一些特殊的子查询,TiDB 会做一些逻辑上的替换使得查询可以获得更好的执行性能

## `... < ALL (SELECT ... FROM ...)` 或者 `... > ANY (SELECT ... FROM ...)`

对于这种情况我们可以将 `ALL` 或者 `ANY` 用 `MAX` 以及 `MIN` 来代替。不过由于在表为空时,`MAX(EXPR)` 以及 `MIN(EXPR)` 的结果会为 NULL,其表现形式和 `EXPR` 是有 `NULL` 值的结果一样。以及外部表达式结果为 `NULL` 时也会影响表达式的最终结果,因此这里完整的改写会是如下的形式:

- `t.id < all(select s.id from s)` 会被改写为 `t.id < min(s.id) and if(sum(s.id is null) != 0, null, true)`。
- `t.id < any (select s.id from s)` 会被改写为 `t.id < max(s.id) or if(sum(s.id is null) != 0, null, false)`。

## `... != ANY (SELECT ... FROM ...)`

对于这种情况,当子查询中不同值的各种只有一种的话,那只要和这个值对比就可以了,如果子查询中不同值的个数多余1个,那么必然会有不相等的情况出现。因此这样的子查询可以采取如下的改写手段。
winoros marked this conversation as resolved.
Show resolved Hide resolved

- `select * from t where t.id != any (select s.id from s)` 会被改写为 `select t.* from t, (select s.id, count(distinct s.id) as cnt_distinct from s) where (t.id != s.id or cnt_distinct > 1)`

## `... = ALL (SELECT ... FROM ...)`

对于这种情况,当子查询中不同值的个数多于一种的话,那么这个表达式的结果必然为假。因此这样的子查询在 TiDB 中会改写为如下的形式。

- `select * from t where t.id = all (select s.id from s)` 会被改写为 `select t.* from t, (select s.id, count(distinct s.id) as cnt_distinct from s) where (t.id = s.id and cnt_distinct <= 1)`

## `... IN (SELECT ... FROM ...)`

对于这种情况,我们会将其改写为 `IN` 的子查询改写为 `SELECT ... FROM ... GROUP ...` 的形式然后将 `IN` 改写为普通的 `JOIN` 的形式。
winoros marked this conversation as resolved.
Show resolved Hide resolved
如 `select * from t1 where t1.a in (select t2.a from t2)` 会被改写为 `select t1.* from t1, (select distinct(a) a from t2) t2 where t1.a = t2.a` 的形式。同时这里的 `DISTINCT` 可以在 `t2.a` 具有 `UNIQUE` 属性时被自动消去。

```
mysql> explain select * from t1 where t1.a in (select t2.a from t2);
+------------------------------+---------+-----------+------------------------+----------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+------------------------------+---------+-----------+------------------------+----------------------------------------------------------------------------+
| IndexJoin_12 | 9990.00 | root | | inner join, inner:TableReader_11, outer key:test.t2.a, inner key:test.t1.a |
| ├─HashAgg_21(Build) | 7992.00 | root | | group by:test.t2.a, funcs:firstrow(test.t2.a)->test.t2.a |
| │ └─IndexReader_28 | 9990.00 | root | | index:IndexFullScan_27 |
| │ └─IndexFullScan_27 | 9990.00 | cop[tikv] | table:t2, index:idx(a) | keep order:false, stats:pseudo |
| └─TableReader_11(Probe) | 1.00 | root | | data:TableRangeScan_10 |
| └─TableRangeScan_10 | 1.00 | cop[tikv] | table:t1 | range: decided by [test.t2.a], keep order:false, stats:pseudo |
+------------------------------+---------+-----------+------------------------+----------------------------------------------------------------------------+
```

这个改写会在 `IN` 子查询相对较小,而外部查询相对较大时产生更好的执行性能。因为不经过改写的情况下,我们无法使用以 t2 为驱动表的 `index join`。同时这里的弊端便是当改写删成的聚合无法被自动消去且 `t2` 表比较大时,反而会影响查询的性能。目前 TiDB 中使用 [tidb\_opt\_insubq\_to\_join\_and\_agg](/tidb-specific-system-variables.md#tidb_opt_insubq_to_join_and_agg) 变量来控制这个优化的打开与否。当遇到不合适这个优化的情况可以手动关闭。

## `EXISTS` 子查询以及 `... >/>=/</<=/=/!= (SELECT ... FROM ...)`

当前对于这种场景的子查询,当它不是关联子查询时,TiDB 会在优化阶段提前展开它,将其直接替换为一个结果集直接判断结果。如下图中,`EXISTS` 会提前在优化阶段被执行为 `TRUE`,从而不会在最终的执行结果中看到它。

```
mysql> create table t1(a int);
mysql> create table t2(a int);
mysql> insert into t2 values(1);
mysql> explain select * from t where exists (select * from t2);
+------------------------+----------+-----------+---------------+--------------------------------+
| id | estRows | task | access object | operator info |
+------------------------+----------+-----------+---------------+--------------------------------+
| TableReader_12 | 10000.00 | root | | data:TableFullScan_11 |
| └─TableFullScan_11 | 10000.00 | cop[tikv] | table:t | keep order:false, stats:pseudo |
+------------------------+----------+-----------+---------------+--------------------------------+

```