MYSQL 中的GROUP BY 的方式 (1)(loose index scan鬆散掃描 tight index scan緊湊掃描)

gaopengtttt發表於2016-12-27

水平有限有誤請指出,轉載請說明出處

測試指令碼:
create table tgrploose(p_id int primary key auto_increment,s_id1 int,s_id2 int,s_id3 int, key(s_id1,s_id2,s_id3));
create table tgrpnloose(p_id int primary key auto_increment,s_id1 int,s_id2 int,s_id3 int, key(s_id1,s_id2,s_id3));


 delimiter //
 create procedure inloose1()
     begin
    declare i int;
     set i=0;
     while i<20000 do
         insert into tgrploose(s_id1,s_id2,s_id3) values(FLOOR((RAND()*2)),FLOOR((RAND()*3)),FLOOR((RAND()*4)) );
         set i=i+1;
     end while;
  end;
//
delimiter ;




 delimiter //
 create procedure innloose()
     begin
    declare i int;
     set i=0;
     while i<20000 do
         insert into tgrpnloose(s_id1,s_id2,s_id3) values(FLOOR((RAND()*10000)),FLOOR((RAND()*10000)),FLOOR((RAND()*10000)) );
         set i=i+1;
     end while;
  end;
//
delimiter ;


call inloose();
call innloose();


一、MYSQL 中可能的GROUP BY 方式
1、loose index scan(鬆散索引掃描) 執行計劃必然出現Using index for group-by 
2、tight index scan(緊湊索引掃描) 執行計劃必然出現Using index但是不涉及Using temporary; Using filesort
3、常規方式掃描                         執行計劃涉及到Using temporary; Using filesort  

二、各種方式說明
1、loose index scan(鬆散索引掃描) 執行計劃必然出現Using index for group-by
   實際上這種掃描方式源於B+樹索引結構的,我們回顧一下索引葉子結點的結構
   首先考慮語句
     while i<20000 do
         insert into tgrploose(s_id1,s_id2,s_id3) values(FLOOR((RAND()*2)),FLOOR((RAND()*3)),FLOOR((RAND()*4)) );
   明顯這裡我是向tgrploose插入大約20000條資料,同時s_id1的值只有2個不同值,s_id2的值只有3個不同的值,s_id3的值只有
   4個不同的值,這樣做是為了將loose(鬆散)的影響儘量放大,我們知道在key(s_id1,s_id2,s_id3)中索引的排列為先按照
   s_id1的順序排列如果相同按照s_id2的順序排列如果相同按照s_id3排列,如果都相同按照該primary id排列就是我們這裡的
   p_id這個我早就證明過了,其實這也是B+樹索引的特性,驗證參考如下文章:
   (http://blog.itpub.net/7728585/viewspace-2128817/)
   實際上我們可以透過語句驗證索引的結構


mysql> explain  select s_id1,s_id2,s_id3,min(p_id),max(p_id),count(*) from tgrploose group by s_id1,s_id2,s_id3;
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table     | partitions | type  | possible_keys | key   | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | tgrploose | NULL       | index | s_id1         | s_id1 | 15      | NULL | 19982 |   100.00 | Using index |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.01 sec)


可以看到這裡沒有filesort沒有排序,並且使用TYPE=INDEX的方式進行訪問索引S_ID1,這種方式實際上就是group by中的緊湊掃描方式
mysql> select s_id1,s_id2,s_id3,min(p_id),max(p_id),count(*) from tgrploose group by s_id1,s_id2,s_id3 ;
+-------+-------+-------+-----------+-----------+----------+
| s_id1 | s_id2 | s_id3 | min(p_id) | max(p_id) | count(*) |
+-------+-------+-------+-----------+-----------+----------+
|     0 |     0 |     0 |        20 |     19991 |      882 |
|     0 |     0 |     1 |        10 |     19950 |      835 |
|     0 |     0 |     2 |         3 |     19979 |      830 |
|     0 |     0 |     3 |         1 |     19944 |      853 |
|     0 |     1 |     0 |        22 |     19999 |      762 |
|     0 |     1 |     1 |        26 |     19987 |      786 |
|     0 |     1 |     2 |        21 |     19968 |      867 |
|     0 |     1 |     3 |        19 |     19997 |      908 |
|     0 |     2 |     0 |         7 |     19916 |      848 |
|     0 |     2 |     1 |        14 |     19971 |      906 |
|     0 |     2 |     2 |         4 |     19988 |      870 |
|     0 |     2 |     3 |        80 |     19906 |      762 |
|     1 |     0 |     0 |        49 |     19990 |      779 |
|     1 |     0 |     1 |        38 |     19976 |      886 |
|     1 |     0 |     2 |         2 |     19981 |      857 |
|     1 |     0 |     3 |        37 |     19998 |      830 |
|     1 |     1 |     0 |        65 |     19993 |      839 |
|     1 |     1 |     1 |         5 |     19984 |      822 |
|     1 |     1 |     2 |         8 |     19996 |      808 |
|     1 |     1 |     3 |         6 |     19927 |      792 |
|     1 |     2 |     0 |        91 |     19992 |      797 |
|     1 |     2 |     1 |        24 |     20000 |      839 |
|     1 |     2 |     2 |         9 |     19965 |      779 |
|     1 |     2 |     3 |        12 |     19977 |      863 |
+-------+-------+-------+-----------+-----------+----------+
實際也是驗證了我們說法s_id1的順序排列如果相同按照s_id2的順序排列如果相同按照s_id3排列,如果都相同按照該primary id排列。
那麼考慮如下的group by查詢
select s_id1,s_id2,s_id3,min(p_id) from tgrploose group by s_id1,s_id2,s_id3 ;


mysql> explain select s_id1,s_id2,s_id3,min(p_id) from tgrploose group by s_id1,s_id2,s_id3 ;
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| id | select_type | table     | partitions | type  | possible_keys | key   | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | tgrploose | NULL       | range | s_id1         | s_id1 | 15      | NULL |   25 |   100.00 | Using index for group-by |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)


因為s_id1,s_id2,s_id3在本列子中一共才24個不同的值,相當的稀疏,那就可以根據索引結構訪問到葉子結點找到最小的那個p_id值即可,
其他的p_id就不用看了,然後跳到下一個s_id1,s_id2,s_id3組合,這就是稀疏掃描的優勢,如果非要給稀疏下一個定義那麼就是group by
後欄位組合相對於表總行數的比率,這裡表為20000行,s_id1,s_id2,s_id3為24個值,那麼比率為24/20000,如果這個值越大則越稠密,
如果越小越則稀疏(這讓我想到稀疏矩陣)。
其次我們要考慮一下loose稀疏索引掃描的效能問題(這部分為自己理解的,沒有參考資料)一般的情況下我們使用type=INDEX這樣的方式完成
group by,這種方式下訪問是比較順序的並且訪問葉子結點即可,而稀疏索引掃描不得不多次使用根結點分支結點來定位,每次跳過的距離,這
個可能隨機訪問,並且多次訪問跟節點和分支節點也是需要開銷的,索引他們之間就存在一個效能的綜合考慮,到底使用稀疏索引掃描還是緊湊
索引掃描其根源在於上面說的那個稀疏的比例問題。考慮前面給出的tgrpnloose這個表我插入資料時候將s_id1,s_id2,s_id3不同值都設定為10000
那麼這個時候就非常稠密了,比較執行計劃如下:
mysql> explain select s_id1,s_id2,s_id3,min(p_id) from tgrpnloose group by s_id1,s_id2,s_id3;
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table      | partitions | type  | possible_keys | key   | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | tgrpnloose | NULL       | index | s_id1         | s_id1 | 15      | NULL | 19982 |   100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)


mysql> explain select s_id1,s_id2,s_id3,min(p_id) from tgrploose group by s_id1,s_id2,s_id3;
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| id | select_type | table     | partitions | type  | possible_keys | key   | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | tgrploose | NULL       | range | s_id1         | s_id1 | 15      | NULL |   25 |   100.00 | Using index for group-by |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)


可以看到MYSQL做出了選擇,對tgrpnloose使用了緊湊索引掃描tight index scan(後面描述),當出現Using index for group-by為使用稀疏索引掃描完成group 
by。一些不能用到稀疏索引掃描來完成group by 的限制如下:
1、效能考慮(參考如上tgrpnloose的列子)
2、對單一資料表進行group by(參考tight index scan部分例子)
3、不能使用字首索引(prefix index)
4、僅僅可以使用max(),min()聚合函式,其他聚合函式如count(),sum()不支援
如:
mysql> explain select a.s_id1,a.s_id2,count(*) from tgrploose a where a.s_id1>30 group by a.s_id1,a.s_id2;
+----+-------------+-------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys | key   | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | a     | NULL       | range | s_id1         | s_id1 | 5       | NULL |    1 |   100.00 | Using where; Using index |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)
5、group by中必須滿足最左側列一致,如果不一致棄用鬆散掃描方式
如:
mysql> explain select a.s_id2,a.s_id3 from tgrploose a  group by a.s_id2,a.s_id3;
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key   | key_len | ref  | rows  | filtered | Extra                                        |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | a     | NULL       | index | s_id1         | s_id1 | 15      | NULL | 19982 |   100.00 | Using index; Using temporary; Using filesort |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+----------------------------------------------+
1 row in set, 1 warning (0.00 sec)

注意一下這個查詢,雖然使用了索引掃描,同時使用了Using temporary; Using filesort,也就是使用了臨時表來儲存a.s_id2,a.s_id3和值然後做了排序操作。
但這個不是緊湊索引掃描的方式,因為使用了臨時表和排序。

在官方文件中如下查詢都能使用到稀疏索引掃描(key(c1,c2,c3) table(c1,c2,c3,c4))
SELECT c1, c2 FROM t1 GROUP BY c1, c2;
SELECT DISTINCT c1, c2 FROM t1;
SELECT c1, MIN(c2) FROM t1 GROUP BY c1;
SELECT c1, c2 FROM t1 WHERE c1 < const GROUP BY c1, c2;
SELECT MAX(c3), MIN(c3), c1, c2 FROM t1 WHERE c2 > const GROUP BY c1, c2;
SELECT c2 FROM t1 WHERE c1 < const GROUP BY c1, c2;
SELECT c1, c2 FROM t1 WHERE c3 = const GROUP BY c1, c2;
SELECT COUNT(DISTINCT c1), SUM(DISTINCT c1) FROM t1;
SELECT COUNT(DISTINCT c1, c2), COUNT(DISTINCT c2, c1) FROM t1;
其實根據索引的結構和稀疏掃描的原理稍加考慮可以明白為什麼能夠使用到稀疏索引掃描

2、tight index scan(緊湊索引掃描) 執行計劃必然出現Using index但是不涉及Using temporary; Using filesort
   描述為僅對驅動表(注意是驅動表)中資料進行分組的時候,如果group by按照索引的順序給出,如果有缺失需要使用column=constant的情況
   比如:
   按照順序給出
mysql>  explain select b.s_id1,b.s_id2,max(a.s_id3) from  tgrpnloose b STRAIGHT_JOIN tgrploose a on a.s_id1=b.s_id1 group by b.s_id1,b.s_id2;
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys | key   | key_len | ref          | rows  | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+--------------------------+
|  1 | SIMPLE      | b     | NULL       | index | s_id1         | s_id1 | 15      | NULL         | 19982 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | a     | NULL       | ref   | s_id1         | s_id1 | 5       | test.b.s_id1 |  9991 |   100.00 | Using index              |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+--------------------------+
2 rows in set, 1 warning (0.01 sec)


  順序缺失但是使用s_id1=10
 mysql>  explain select b.s_id2,max(a.s_id3) from  tgrpnloose b STRAIGHT_JOIN tgrploose a on a.s_id1=b.s_id1 where b.s_id1=10 group by b.s_id2;
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref   | rows | filtered | Extra                    |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
|  1 | SIMPLE      | b     | NULL       | ref  | s_id1         | s_id1 | 5       | const |    2 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | a     | NULL       | ref  | s_id1         | s_id1 | 5       | const |    1 |   100.00 | Using index              |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
2 rows in set, 1 warning (0.02 sec)

我們能夠清楚的看到這種情況用不到索引稀疏掃描因為不是單表查詢,但是在 Extra 都沒有出現Using temporary; Using filesort,因為使用了tight 
index scan(緊湊索引掃描),注意type=ref 是b.s_id1=10因為索引不是唯一索引。
再看下面的例子:

mysql>  explain select b.s_id2,max(a.s_id3) from  tgrpnloose b STRAIGHT_JOIN tgrploose a on a.s_id1=b.s_id1 group by b.s_id2;
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key   | key_len | ref          | rows  | filtered | Extra                                                     |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | b     | NULL       | index | s_id1         | s_id1 | 15      | NULL         | 19982 |   100.00 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | a     | NULL       | ref   | s_id1         | s_id1 | 5       | test.b.s_id1 |  9991 |   100.00 | Using index                                               |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

明顯group by b.s_id2不滿足按照索引順序進行group by也就是不滿足最左原則,同時沒有s_id1=10這樣常量,使用了
Using index; Using temporary; Using filesort用到了臨時表和filesort排序。
還要明確一點這裡是驅動表一定要注意,如果不加STRAIGHT_JOIN MYSQL給出如下的執行計劃
mysql>  explain select b.s_id1,b.s_id2,max(a.s_id3) from  tgrpnloose b join tgrploose a on a.s_id1=b.s_id1 group by b.s_id1,b.s_id2;
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key   | key_len | ref          | rows  | filtered | Extra                                                     |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | a     | NULL       | index | s_id1         | s_id1 | 15      | NULL         | 19982 |   100.00 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | b     | NULL       | ref   | s_id1         | s_id1 | 5       | test.a.s_id1 |     2 |   100.00 | Using index                                               |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
和第一個列子的區別就是a變為了驅動表,b變為了被驅動表,使用了 Using index; Using temporary; Using filesort

另外tight index scan(緊湊索引掃描)當然也適用於單表的情況,如:
-- 不滿足最左原則,棄用loose index scan
mysql>  explain select b.s_id2 from  tgrploose b where b.s_id1=10  group by b.s_id2;
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref   | rows | filtered | Extra                    |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
|  1 | SIMPLE      | b     | NULL       | ref  | s_id1         | s_id1 | 5       | const |    1 |   100.00 | Using where; Using index |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
1 row in set, 1 warning (0.01 sec)
-- 聚合函式count,棄用loose index scan
mysql>  explain select b.s_id1,count(*) from  tgrploose b  group by b.s_id1;
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key   | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | b     | NULL       | index | s_id1         | s_id1 | 15      | NULL | 19982 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
-- 效能考慮,棄用loose index scan
mysql> explain select s_id1,s_id2,s_id3,min(p_id) from tgrpnloose group by s_id1,s_id2,s_id3;
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table      | partitions | type  | possible_keys | key   | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | tgrpnloose | NULL       | index | s_id1         | s_id1 | 15      | NULL | 19982 |   100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

簡而言之,如果不能使用loose index scan會優先考慮tight index scan,來避免可能的使用臨時表和排序操作

3、常規方式掃描,執行計劃涉及到Using temporary; Using filesort  
   簡單的說在沒有辦法使用索引規避排序的情況下 需要使用這種方式進行group by,需要使用這種常規的方式,先將group by的欄位放到臨時表
   然後進行排序去重操作。
   上面已經給出了一列子這裡再看一個
   -- //不滿足最左原則,棄用loose index scan,不滿足缺失部分column=constant棄用tight index scan
mysql>  explain select b.s_id3,count(*) from  tgrploose b  where b.s_id1=10 group by b.s_id3;
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref   | rows | filtered | Extra                                                     |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | b     | NULL       | ref  | s_id1         | s_id1 | 5       | const |    1 |   100.00 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-----------------------------------------------------------+


改為
mysql>  explain select b.s_id3,count(*) from  tgrploose b  where b.s_id1=10 and b.s_id2=10 group by b.s_id3;
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref         | rows | filtered | Extra                    |
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+--------------------------+
|  1 | SIMPLE      | b     | NULL       | ref  | s_id1         | s_id1 | 10      | const,const |    1 |   100.00 | Using where; Using index |
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+--------------------------+
可以使用tight index scan(緊湊索引掃描),這是前面說了的原則


三、oracle中的index skip scan
ORACLE中也有差不多的用法,但是範圍更廣不限於僅僅是group by,但是一般情況考慮為沒有正確的為謂詞建立字首索引,效率不是最優,而是
比全表好一些,它僅僅適用於前導列distinct值很少,非前導列選擇性比較高的情況,其實也就是前導列是稀疏的,而非前導列是稠密的。其效能的
問題應該也在於不斷的使用根結點和分支節點定位和可能的隨機讀上,這裡就不給出例子了。
四、總結
可以看到MYSQL三種group by 方式效率越來越低及(最佳化器選擇正確的情況下):
loose index scan(鬆散索引掃描)>tight index scan(緊湊索引掃描)>常規方式掃描
但是適用範圍越來越廣。

關於在trace中發現了代價的計算,這也是導致loose index scan(鬆散索引掃描)和tight index scan(緊湊索引掃描)切換的
根據
,隨後會在根據trace看看原始碼的判斷。

tight index scan:                                                     loose index scan:
T@2: | | | | | | | | | | opt: distinct_aggregate: 0             T@2: | | | | | | | | | | opt: distinct_aggregate: 0
T@2: | | | | | | | | | | opt: rows: 19983                         T@2: | | | | | | | | | | opt: rows: 7
T@2: | | | | | | | | | | opt: cost: 8040.2                         T@2: | | | | | | | | | | opt: cost: 9.8


這裡也給出一個12條資料key(s_id1,s_id2) primary key(p_id)的索引的排列方式,方便大家理 解
明顯先按照s_id1排序,s_id1相同按照s_id2排序,s_id2相同按照主鍵p_id排序

s_id1 0 0 0 0 0 0 1 1 1 1 1 1
s_id2 0 0 1 1 2 2 0 0 1 1 2 2
p_id 5 12 1 6 3 9 2 8 4 10 0 11

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/7728585/viewspace-2131478/,如需轉載,請註明出處,否則將追究法律責任。

相關文章