聊一聊MySQL的直方圖

darren__chan發表於2023-03-28

前一段時間和一個客戶在最佳化SQL,發現其實收集了直方圖以後看起來對執行計劃起不到多大影響。

MySQL 8.0後開始支援直方圖,我們可以透過analyze table時進行收集,

ANALYZE TABLE t UPDATE HISTOGRAM ON c1, c2, c3 WITH 10 BUCKETS;

ANALYZE TABLE t UPDATE HISTOGRAM ON c1, c3 WITH 10 BUCKETS;

ANALYZE TABLE t DROP HISTOGRAM ON c2;

其中可選的WITH N BUCKETS子句指定直方圖的buckets數。N必須為1 ~ 1024之間的整數。如果省略掉,則buckets數為100。

 

按照我們在ORACLE上的對直方圖的理解, CBO最佳化器可以根據直方圖收集的列值分佈資訊,讓選擇性高(返回資料行比例少)的列值使用索引,而選擇性低(返回資料行比例多)的列值不使用索引。尤其對於存在資料傾斜嚴重的列而言,直方圖很重要。

 

在MYSQL上我們可以測測看,比如我有個100w行的表,status這一列資料傾斜很嚴重。

master [localhost:22132] {msandbox} (test) > select status ,count(*) from t100w group by status;
+---------+----------+
| status  | count(*) |
+---------+----------+
| ONLINE  |   990000 |
| OFFLINE |    10000 |
+---------+----------+
2 rows in set (5.80 sec)

那麼當我在查詢時,沒有索引,沒有直方圖的情況下:

我們可以看到兩個條件filtered都是10%:

master [localhost:22132] {msandbox} (test) >  explain  select * from t100w where status='ONLINE';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t100w | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 994008 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) >  explain analyze select * from t100w where status='ONLINE';
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (t100w.`status` = 'ONLINE')  (cost=105265.82 rows=99401) (actual time=10.226..1625.808 rows=990000 loops=1)
    -> Table scan on t100w  (cost=105265.82 rows=994008) (actual time=0.033..1382.466 rows=1000000 loops=1)
 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (2.14 sec)
master [localhost:22132] {msandbox} (test) >  explain select * from t100w where status='OFFLINE';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t100w | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 994008 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) >  explain analyze select * from t100w where status='OFFLINE';
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                            |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (t100w.`status` = 'OFFLINE')  (cost=105265.82 rows=99401) (actual time=0.032..1488.844 rows=10000 loops=1)
    -> Table scan on t100w  (cost=105265.82 rows=994008) (actual time=0.028..1333.076 rows=1000000 loops=1)
 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.49 sec)

 

對status欄位收集直方圖:

master [localhost:22132] {msandbox} (test) > analyze table t100w  UPDATE HISTOGRAM ON status;
+------------+-----------+----------+---------------------------------------------------+
| Table      | Op        | Msg_type | Msg_text                                          |
+------------+-----------+----------+---------------------------------------------------+
| test.t100w | histogram | status   | Histogram statistics created for column 'status'. |
+------------+-----------+----------+---------------------------------------------------+
1 row in set (1.39 sec)
master [localhost:22132] {msandbox} (test) > SELECT
    ->          TABLE_NAME, COLUMN_NAME,
    ->          HISTOGRAM->>'$."data-type"' AS 'data-type',
    ->          JSON_LENGTH(HISTOGRAM->>'$."buckets"') AS 'bucket-count'
    ->        FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+------------+-------------+-----------+--------------+
| TABLE_NAME | COLUMN_NAME | data-type | bucket-count |
+------------+-------------+-----------+--------------+
| t100w      | status      | string    |            2 |
+------------+-------------+-----------+--------------+
1 row in set (0.01 sec)


再來對比執行情況:

確實有了直方圖之後,我們看到了filtered 確實更準確了。執行時間也有所降低。

master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='ONLINE';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t100w | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 994008 |    99.17 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='ONLINE';
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (t100w.`status` = 'ONLINE')  (cost=105265.82 rows=985725) (actual time=9.278..1479.325 rows=990000 loops=1)
    -> Table scan on t100w  (cost=105265.82 rows=994008) (actual time=0.028..1240.229 rows=1000000 loops=1)
 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.97 sec)
 
master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='OFFLINE';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t100w | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 994008 |     0.83 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='OFFLINE';
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (t100w.`status` = 'OFFLINE')  (cost=105265.82 rows=8283) (actual time=0.031..920.213 rows=10000 loops=1)
    -> Table scan on t100w  (cost=105265.82 rows=994008) (actual time=0.029..778.104 rows=1000000 loops=1)
 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.93 sec)

 

 

此時,我再給status欄位建立索引,那麼這種情況下,如果條件時 status='OFFLINE',則執行效率會比較高,但如果條件是status='ONLINE',走索引時得回表,執行效率不比全表掃描好。

master [localhost:22132] {msandbox} (test) > create index idx_t100w_s on t100w(status);
Query OK, 0 rows affected (9.07 sec)
Records: 0  Duplicates: 0  Warnings: 0
master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='ONLINE';
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key         | key_len | ref   | rows   | filtered | Extra |
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+
|  1 | SIMPLE      | t100w | NULL       | ref  | idx_t100w_s   | idx_t100w_s | 43      | const | 497004 |   100.00 | NULL  |
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='ONLINE';
+----------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| -> Index lookup on t100w using idx_t100w_s (status='ONLINE')  (cost=67295.45 rows=497004) (actual time=0.112..3988.694 rows=990000 loops=1)
 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (4.55 sec) 《《《《《《====
master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='OFFLINE';
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key         | key_len | ref   | rows  | filtered | Extra |
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+
|  1 | SIMPLE      | t100w | NULL       | ref  | idx_t100w_s   | idx_t100w_s | 43      | const | 18512 |   100.00 | NULL  |
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='OFFLINE';
+-------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| -> Index lookup on t100w using idx_t100w_s (status='OFFLINE')  (cost=17606.99 rows=18512) (actual time=0.074..80.192 rows=10000 loops=1)
 |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.09 sec)

 

 

可以看到,此時走了索引,和收集 的直方圖就沒什麼關係了。

看看官方文件怎麼說吧:

 

Histogram statistics are useful primarily for nonindexed columns. Adding an index to a column for which histogram statistics are applicable might also help the optimizer make row estimates. The tradeoffs are:

l   An index must be updated when table data is modified.

 

l   A histogram is created or updated only on demand, so it adds no overhead when table data is modified. On the other hand, the statistics become progressively more out of date when table modifications occur, until the next time they are updated.

 

The optimizer prefers range optimizer row estimates to those obtained from histogram statistics. If the optimizer determines that the range optimizer applies, it does not use histogram statistics.

For columns that are indexed, row estimates can be obtained for equality comparisons using index dives (see  ). In this case, histogram statistics are not necessarily useful because index dives can yield better estimates.

 

慢慢改進吧,我們太想任何其他資料庫都一下子能像ORACLE一樣強大了。


來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29863023/viewspace-2942060/,如需轉載,請註明出處,否則將追究法律責任。

相關文章