GreatSQL最佳化技巧:半連線(semijoin)最佳化

GreatSQL發表於2024-04-17

何為半連線?

半連線是在GreatSQL內部採用的一種執行子查詢的方式,semi join不是語法關鍵字,不能像使用inner joinleft joinright join這種語法關鍵字一樣提供給使用者來編寫SQL語句。

兩個表t1表和t2表進行半連線的含義是:對於t1表的某條記錄來說,我們只關心在t2表中是否存在與之匹配的記錄,而不關心有多少條記錄與之匹配,最終的結果集中只保留t1表的記錄。

前面文章也提到過,含in、exists子查詢的語句通常會採用半連線方式執行查詢,但這不絕對,也有一些情況不適用半連線。比如:

(1)外查詢的where子句中,存在其他搜尋條件使用OR運算子與IN子查詢的條件連線起來

(2)IN子查詢位於Select子句中

(3)IN子查詢中含有union的情況

(4)IN子查詢中含group by、having或聚合函式的情況

GreatSQL執行半連線的最佳化策略

本文實驗使用資料庫版本為GreatSQL 8.0.32-25。

建立兩張實驗表來說明。

greatsql> create table t1(
c1 varchar(30),
c2 int
);
greatsql> create table t2(
id int primary key,
c1 varchar(30),
key idx_c1(c1)
);
--插入幾條測試資料
greatsql> insert into t1 values('a',1);
greatsql> insert into t1 values('b',3);
greatsql> insert into t1 values('a',5);
greatsql> insert into t1 values('c',7);
greatsql> insert into t1 values('d',9);
greatsql> insert into t2 values(1,'a');
greatsql> insert into t2 values(2,'a');
greatsql> insert into t2 values(3,'b');
greatsql> insert into t2 values(4,'b');
greatsql> insert into t2 values(5,'c');
greatsql> insert into t2 values(6,'b');

GreatSQL執行半連線的方式大致有以下5種:

1.Table pullout(子查詢中的表上拉)

當子查詢的查詢列表處只有主鍵或者唯一索引列時,可以直接把子查詢中的表上拉到外層查詢的FROM子句中,並把子查詢的查詢條件合併到外層查詢的搜尋條件中。所以選擇這種方式是有先決條件的,子查詢的查詢列表處必須只有主鍵或唯一索引列。有沒有選擇這種方式,可以透過執行explain展示計劃後,使用show warnings命令檢視最佳化器改寫後的語句。

例如下面這個語句:

select * from t1 where c2 in (select id from t2 where t2.c1='b');

這個語句種子查詢的id列是t2表的主鍵列,滿足這種方式的先決條件,看一下執行計劃。

greatsql> explain select * from t1 where c2 in (select id from t2 where t2.c1='b');
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key    | key_len | ref   | rows | filtered | Extra                                      |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | t2  | NULL       | ref  | PRIMARY,idx_c1 | idx_c1 | 123     | const |    3 |   100.00 | Using index                                |
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL           | NULL   | NULL    | NULL  |    4 |    25.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

greatsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                            |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t2` join `test`.`t1` where ((`test`.`t1`.`c2` = `test`.`t2`.`id`) and (`test`.`t2`.`c1` = 'b')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

從warning資訊可以看出,最佳化器改執行連線方式是,t1表與t2表透過內連線來關聯,原子查詢內部t2表的過濾條件放到了整個語句where條件的後面,原語句與最佳化器執行的語句之所以等價,是因為子查詢的查詢列id列是主鍵列,不會有重複值,跟外表t1使用inner join連線後,不會造成關聯後結果集資料量的放大。一般情況下子查詢的查詢列表處只有主鍵或者唯一索引列時都會轉化為這種方式來執行。對於這種業務,無論開發者怎麼編寫SQL,使用inner join 也好,exists也好,最後最佳化器執行方式可能都是一樣的。

可以看一下將原語句改造為inner join 與 exists語句的執行計劃,是不是都是一樣的。

greatsql> explain select * from t1 where exists (select 1 from t2 where t2.id=t1.c2 and t2.c1='b');
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key    | key_len | ref   | rows | filtered | Extra                                      |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | t2  | NULL       | ref  | PRIMARY,idx_c1 | idx_c1 | 123     | const |    3 |   100.00 | Using index                                |
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL           | NULL   | NULL    | NULL  |    4 |    25.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
2 rows in set, 2 warnings (0.00 sec)

greatsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                            |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1276 | Field or reference 'test.t1.c2' of SELECT #2 was resolved in SELECT #1                                                                                                                           |
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t2` join `test`.`t1` where ((`test`.`t1`.`c2` = `test`.`t2`.`id`) and (`test`.`t2`.`c1` = 'b')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

greatsql> explain select t1.* from t1 inner join t2 on t1.c2=t2.id where t2.c1='b';
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key    | key_len | ref   | rows | filtered | Extra                                      |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | t2  | NULL       | ref  | PRIMARY,idx_c1 | idx_c1 | 123     | const |    3 |   100.00 | Using index                                |
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL           | NULL   | NULL    | NULL  |    4 |    25.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+------+----------------+--------+---------+-------+------+----------+--------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

greatsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                            |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` join `test`.`t2` where ((`test`.`t1`.`c2` = `test`.`t2`.`id`) and (`test`.`t2`.`c1` = 'b')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

這種執行方式本質上已經轉換為內連線了。

2.FirstMatch(首次匹配)

這種方式先取外層查詢的一條記錄,到子查詢的表中尋找符合匹配條件的記錄,如果能找到一條,則將外層查詢的記錄放入到最終結果集中並且停止查詢匹配更多的記錄,如果找不到,則把該外層查詢的記錄丟棄掉,然後再開始取下一條外層查詢中的記錄,這個過程一直持續到外層查詢獲取不到記錄為止。

看一個簡單語句的執行計劃

select * from t1 where c1 in (select c1 from t2);
greatsql> explain select * from t1 where c1 in (select c1 from t2);
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+-------------------------------+
| id | select_type | table | partitions | type | possible_keys | key    | key_len | ref          | rows | filtered | Extra                         |
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+-------------------------------+
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL          | NULL   | NULL    | NULL         |    4 |   100.00 | Using where                   |
|  1 | SIMPLE      | t2  | NULL       | ref  | idx_c1        | idx_c1 | 123     | test.t1.c1 |    2 |   100.00 | Using index; FirstMatch(t1) |
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+-------------------------------+
2 rows in set, 1 warning (0.01 sec)

greatsql> show warnings;
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                  |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` semi join (`test`.`t2`) where (`test`.`t2`.`c1` = `test`.`t1`.`c1`) |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

從warning資訊可以看到 semi join 的字樣,最佳化器使用半連線方式執行的子查詢。從執行計劃可以看到 extra 列有FirstMatch(t1) 的字樣,表示對t1表外查詢傳入的每個c1值在t2表上都進行了首次匹配,這種方式也是我最初理解的in子查詢的含義,只關心有無匹配上,不關心匹配上多少。

3.LooseScan(鬆散掃描)

LooseScan是使用子查詢的查詢列上的索引,只針對相同索引列值的第一條記錄,去外查詢找對應的記錄。使用了這種最佳化方式的半連線,在explain的計劃的Extra列會有LooseScan字樣。

還是上面的語句,使用semijoin的hint干涉最佳化器,使其選擇LooseScan的最佳化策略。

select /*+ semijoin(@subq1 loosescan) */  * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2 );
greatsql> explain select /*+ semijoin(@subq1 loosescan) */  * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2 );            
+----+-------------+-------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key    | key_len | ref  | rows | filtered | Extra                                      |
+----+-------------+-------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | t2  | NULL       | index | idx_c1        | idx_c1 | 123     | NULL |    6 |    50.00 | Using index; LooseScan                     |
|  1 | SIMPLE      | t1  | NULL       | ALL   | NULL          | NULL   | NULL    | NULL |    5 |    20.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

greatsql> show warnings;
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                      |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select /*+ SEMIJOIN(@`subq1` LOOSESCAN) */ `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` semi join (`test`.`t2`) where (`test`.`t1`.`c1` = `test`.`t2`.`c1`) |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

從執行計劃可以看出,子查詢的表t2作為驅動表,t2表的c1列上有索引,對錶t2進行訪問時,使用其c1列的索引,對相同的索引列值只取第一條記錄去t1表中找對應記錄,將所有外查詢表t1對應的記錄都加入到最終結果集,可以理解為對子查詢t2表的索引掃描方式是跳躍式的。

4.Duplicate Weedout重複值消除

這種方式是藉助臨時表來消除重複值,explain展示計劃時,在extra列會出現Start temporaryEnd temporary的字樣。

還是上面的語句,我們使用semijoin的hint干涉最佳化器,使其選擇dupsweedout最佳化策略。

greatsql> explain select /*+ semijoin(@subq1 dupsweedout)*/ * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2);
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+---------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key    | key_len | ref          | rows | filtered | Extra                                       |
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+---------------------------------------------+
|  1 | SIMPLE      | t1  | NULL       | ALL  | NULL          | NULL   | NULL    | NULL         |    4 |   100.00 | Using where                                 |
|  1 | SIMPLE      | t2  | NULL       | ref  | idx_c1        | idx_c1 | 123     | test.t1.c1 |    2 |   100.00 | Using index; Start temporary; End temporary |
+----+-------------+-------+------------+------+---------------+--------+---------+--------------+------+----------+---------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

greatsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                        |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select /*+ SEMIJOIN(@`subq1` DUPSWEEDOUT) */ `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` semi join (`test`.`t2`) where (`test`.`t2`.`c1` = `test`.`t1`.`c1`) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

例如:t1表的記錄('b',3),可以匹配上t2表的兩條記錄(3,'b'),(4,'b'),為了消除關聯結果的重複值,可以想象建立這樣一個臨時表:

create table tmp(rowid int primary key);

當把t1表的記錄加入到結果集時,先把這條記錄的rowid加入到臨時表中,如果新增成功,說明這條記錄並沒有加入到最後的結果集,如果新增失敗,則說明t1表的這條記錄已經加入到最終結果集了

個人感覺這種方式比其他方式效率低。

5.Semi-join Materialization(半連線物化)

先把IN 子句中的不相關子查詢進行物化,然後再將外層查詢的表與物化表進行連線。子查詢內部有分組聚合運算時通常會先進行物化處理。

還是上面的語句,使用semijoin的hint干涉最佳化器,使其選擇materialization的最佳化策略。

select /*+ semijoin(@subq1 materialization) */  * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2 );
greatsql> explain select /*+ semijoin(@subq1 materialization) */  * from t1 where c1 in (select /*+ qb_name(subq1)*/ c1 from t2 );
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
| id | select_type  | table       | partitions | type   | possible_keys       | key                 | key_len | ref          | rows | filtered | Extra       |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
|  1 | SIMPLE       | t1        | NULL       | ALL    | NULL                | NULL                | NULL    | NULL         |    5 |   100.00 | Using where |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 123     | test.t1.c1 |    1 |   100.00 | NULL        |
|  2 | MATERIALIZED | t2        | NULL       | index  | idx_c1              | idx_c1              | 123     | NULL         |    6 |   100.00 | Using index |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)

greatsql> show warnings;
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                            |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select /*+ SEMIJOIN(@`subq1` MATERIALIZATION) */ `test`.`t1`.`c1` AS `c1`,`test`.`t1`.`c2` AS `c2` from `test`.`t1` semi join (`test`.`t2`) where (`<subquery2>`.`c1` = `test`.`t1`.`c1`) |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

從執行計劃可以看出,先對子查詢t2表做了物化表處理,物化表會生成自動索引<auto_distinct_key>,外查詢表t1再與物化表做Nest loop連線。

補充說明

對於上面的語句 select * from t1 where c1 in (select c1 from t2);,最佳化器預設選擇了firstmatch方式,其他方式都是使用hint來干涉的最佳化器的選擇,可以看到這個hint包含兩部分,一個是使用qb_name()給子查詢分配一個名稱,一個是使用semijoin([@query_block_name] [strategy]),指定子查詢塊使用半連線策略,可以指定多個策略。同時semijoin的最佳化策略的選擇還受最佳化開關引數optimize_switch的影響,該引數裡有semijoin,loosescan,firstmatch,duplicateweedout的開關,預設都是開啟的,所以也可以使用最佳化開關來干涉最佳化器的選擇。

最佳化舉例

select count(*)
  from t1 a
 where substr(a.modifytime, 1, 8) = '20240301'
   and a.sospecnumber in
       (select a.sospecnumber
          from t1 a
         where substr(a.modifytime, 1, 8) < '20240301');

這條SQL只涉及一張表t1,表中資料200萬左右,modify_time為字元型別,儲存從2009年開始的時間串。看一下該表的索引情況。

greatsql> show index from t1;
+-------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| t1    |          1 | idx_sospecnumber |            1 | SOSPECNUMBER | A         |         133 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| t1    |          1 | idx_modifytime   |            1 | MODIFYTIME   | A         |      634186 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+-------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.01 sec)

explain的執行計劃如下:

greatsql> explain
    -> select count(*)
    ->   from t1 a
    ->  where substr(a.modifytime, 1, 8) ='20240301'
    ->    and a.sospecnumber  in
    ->        (select a.sospecnumber
    ->           from t1 a
    ->          where substr(a.modifytime, 1, 8) < '20240301') ;
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+---------------------+---------+----------+-------------+
| id | select_type  | table       | partitions | type   | possible_keys       | key                 | key_len | ref                 | rows    | filtered | Extra       |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+---------------------+---------+----------+-------------+
|  1 | SIMPLE       | a           | NULL       | ALL    | idx_sospecnumber    | NULL                | NULL    | NULL                | 2426414 |   100.00 | Using where |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 131     | test.a.SOSPECNUMBER |       1 |   100.00 | NULL        |
|  2 | MATERIALIZED | a           | NULL       | ALL    | idx_sospecnumber    | NULL                | NULL    | NULL                | 2426414 |   100.00 | Using where |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+---------------------+---------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)

最佳化器選擇的半連線最佳化策略是物化的方式。

explain analyze的實際計劃如下:

greatsql> explain analyze
    -> select count(*)
    ->   from t1 a
    ->  where substr(a.modifytime, 1, 8) ='20240301'
    ->    and a.sospecnumber  in
    ->        (select a.sospecnumber
    ->           from t1 a
    ->          where substr(a.modifytime, 1, 8) < '20240301') \G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=1177497474524.58 rows=1) (actual time=4442.499..4442.500 rows=1 loops=1)
    -> Nested loop inner join  (cost=588748984584.98 rows=5887484899396) (actual time=4438.967..4442.408 rows=1346 loops=1)
        -> Filter: ((substr(a.MODIFYTIME,1,8) = '20240301') and (a.SOSPECNUMBER is not null))  (cost=252003.98 rows=2426414) (actual time=1550.096..1552.027 rows=1346 loops=1)
            -> Table scan on a  (cost=252003.98 rows=2426414) (actual time=0.050..1189.136 rows=2493198 loops=1)
        -> Single-row index lookup on <subquery2> using <auto_distinct_key> (sospecnumber=a.SOSPECNUMBER)  (cost=494645.48..494645.48 rows=1) (actual time=2.147..2.147 rows=1 loops=1346)
            -> Materialize with deduplication  (cost=494645.38..494645.38 rows=2426414) (actual time=2888.845..2888.845 rows=165 loops=1)
                -> Filter: (a.SOSPECNUMBER is not null)  (cost=252003.98 rows=2426414) (actual time=0.215..1927.315 rows=2487547 loops=1)
                    -> Filter: (substr(a.MODIFYTIME,1,8) < '20240301')  (cost=252003.98 rows=2426414) (actual time=0.214..1745.562 rows=2487547 loops=1)
                        -> Table scan on a  (cost=252003.98 rows=2426414) (actual time=0.211..1235.738 rows=2493198 loops=1)

1 row in set (4.45 sec)

最佳化分析:

這條SQL總體耗時4.45s,耗時主要分佈在兩處:

一處消耗在外表的查詢,對t1進行了全表掃描,回表過濾後剩餘1346行資料,耗時1552ms,此處雖然modifytime列有索引,但是因為在條件列上施加了substr函式,導致索引用不上,改為modifytime like '20240301%'的方式,也表示了查詢2024年3月1日的資料,同時用上了索引。

另一處消耗在子查詢的物化上,子查詢結果集有2487547行資料,表掃描、過濾、物化整個過程耗時約2888ms,對大結果集進行物化消耗比較大,同時IN子查詢的查詢列sospecnumber列上是有索引的,雖然選擇性不好,但是這個子查詢的含義是隻需要判斷子查詢結果集中有無記錄能匹配上,而不關心匹配上多少條,所以這種情況採用first match方式比較好。

SQL改寫如下:

select /*+ semijoin(@subq firstmatch)*/
 count(*)
  from t1 a
 where a.modifytime like '20240301%'
   and a.sospecnumber in
       (select /*+ qb_name(subq)*/
         a.sospecnumber
          from t1 a
         where substr(a.modifytime, 1, 8) < '20240301')

改寫後執行計劃如下:

*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)  (cost=11052513.72 rows=1) (actual time=157.570..157.570 rows=1 loops=1)
    -> Nested loop semijoin  (cost=8596909.70 rows=24556040) (actual time=0.203..157.450 rows=1346 loops=1)
        -> Filter: (a.SOSPECNUMBER is not null)  (cost=606.05 rows=1346) (actual time=0.057..7.610 rows=1346 loops=1)
            -> Index range scan on a using idx_modifytime over ('20240301' <= MODIFYTIME <= '20240301????????????????????????????????????????????????'), with index condition: (a.MODIFYTIME like '20240301%')  (cost=606.05 rows=1346) (actual time=0.055..7.406 rows=1346 loops=1)
        -> Filter: (substr(a.MODIFYTIME,1,8) < '20240301')  (cost=83255911.06 rows=18244) (actual time=0.111..0.111 rows=1 loops=1346)
            -> Index lookup on a using idx_sospecnumber (SOSPECNUMBER=a.SOSPECNUMBER)  (cost=83255911.06 rows=18244) (actual time=0.111..0.111 rows=1 loops=1346)

1 row in set, 1 warning (0.16 sec)

改寫後耗時0.16s,效能提升近30倍,在對子查詢透過索引idx_sospecnumber搜尋資料時,查到一條就會停止繼續搜尋了。

結語

GreatSQL的 IN 子查詢適用於半連線時,最佳化器提供了5種最佳化策略:Table pullout、FirstMatch、LooseScan、Duplicate weedout、materialize。

一般外查詢表結果集小,子查詢結果集太大時,不希望透過物化這種方式來執行連線,因為物化表的代價太大,可能透過FirstMatch或者LooseScan很快就可以執行出結果了。那反之外查詢結果集大,子查詢結果集小時,透過物化表這種方式可能就會取得很好的效果。很多時候都不用過多幹涉最佳化器做選擇,但是如果懂得原理,當最佳化器選錯的時候我們也可以透過hint來穩定計劃,讓SQL保持高效的執行。


Enjoy GreatSQL 😃

關於 GreatSQL

GreatSQL是適用於金融級應用的國內自主開源資料庫,具備高效能、高可靠、高易用性、高安全等多個核心特性,可以作為MySQL或Percona Server的可選替換,用於線上生產環境,且完全免費併相容MySQL或Percona Server。

相關連結: GreatSQL社群 Gitee GitHub Bilibili

GreatSQL社群:

社群部落格有獎徵稿詳情:https://greatsql.cn/thread-100-1-1.html

image-20230105161905827

技術交流群:

微信:掃碼新增GreatSQL社群助手微信好友,傳送驗證資訊加群

image-20221030163217640

相關文章