聊一聊MySQL的直方圖
前一段時間和一個客戶在最佳化SQL,發現其實收集了直方圖以後看起來對執行計劃起不到多大影響。
MySQL 8.0後開始支援直方圖,我們可以透過analyze table時進行收集,
ANALYZE TABLE t UPDATE HISTOGRAM ON c1, c2, c3 WITH 10 BUCKETS;
ANALYZE TABLE t UPDATE HISTOGRAM ON c1, c3 WITH 10 BUCKETS;
ANALYZE TABLE t DROP HISTOGRAM ON c2;
其中可選的WITH N BUCKETS子句指定直方圖的buckets數。N必須為1 ~ 1024之間的整數。如果省略掉,則buckets數為100。
按照我們在ORACLE上的對直方圖的理解, CBO最佳化器可以根據直方圖收集的列值分佈資訊,讓選擇性高(返回資料行比例少)的列值使用索引,而選擇性低(返回資料行比例多)的列值不使用索引。尤其對於存在資料傾斜嚴重的列而言,直方圖很重要。
在MYSQL上我們可以測測看,比如我有個100w行的表,status這一列資料傾斜很嚴重。
master [localhost:22132] {msandbox} (test) > select status ,count(*) from t100w group by status; +---------+----------+ | status | count(*) | +---------+----------+ | ONLINE | 990000 | | OFFLINE | 10000 | +---------+----------+ 2 rows in set (5.80 sec)
那麼當我在查詢時,沒有索引,沒有直方圖的情況下:
我們可以看到兩個條件filtered都是10%:
master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='ONLINE'; +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | 1 | SIMPLE | t100w | NULL | ALL | NULL | NULL | NULL | NULL | 994008 | 10.00 | Using where | +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='ONLINE'; +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (t100w.`status` = 'ONLINE') (cost=105265.82 rows=99401) (actual time=10.226..1625.808 rows=990000 loops=1) -> Table scan on t100w (cost=105265.82 rows=994008) (actual time=0.033..1382.466 rows=1000000 loops=1) | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (2.14 sec) master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='OFFLINE'; +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | 1 | SIMPLE | t100w | NULL | ALL | NULL | NULL | NULL | NULL | 994008 | 10.00 | Using where | +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='OFFLINE'; +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (t100w.`status` = 'OFFLINE') (cost=105265.82 rows=99401) (actual time=0.032..1488.844 rows=10000 loops=1) -> Table scan on t100w (cost=105265.82 rows=994008) (actual time=0.028..1333.076 rows=1000000 loops=1) | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (1.49 sec)
對status欄位收集直方圖:
master [localhost:22132] {msandbox} (test) > analyze table t100w UPDATE HISTOGRAM ON status; +------------+-----------+----------+---------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +------------+-----------+----------+---------------------------------------------------+ | test.t100w | histogram | status | Histogram statistics created for column 'status'. | +------------+-----------+----------+---------------------------------------------------+ 1 row in set (1.39 sec) master [localhost:22132] {msandbox} (test) > SELECT -> TABLE_NAME, COLUMN_NAME, -> HISTOGRAM->>'$."data-type"' AS 'data-type', -> JSON_LENGTH(HISTOGRAM->>'$."buckets"') AS 'bucket-count' -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS; +------------+-------------+-----------+--------------+ | TABLE_NAME | COLUMN_NAME | data-type | bucket-count | +------------+-------------+-----------+--------------+ | t100w | status | string | 2 | +------------+-------------+-----------+--------------+ 1 row in set (0.01 sec)
再來對比執行情況:
確實有了直方圖之後,我們看到了filtered 確實更準確了。執行時間也有所降低。
master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='ONLINE'; +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | 1 | SIMPLE | t100w | NULL | ALL | NULL | NULL | NULL | NULL | 994008 | 99.17 | Using where | +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='ONLINE'; +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (t100w.`status` = 'ONLINE') (cost=105265.82 rows=985725) (actual time=9.278..1479.325 rows=990000 loops=1) -> Table scan on t100w (cost=105265.82 rows=994008) (actual time=0.028..1240.229 rows=1000000 loops=1) | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (1.97 sec) master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='OFFLINE'; +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | 1 | SIMPLE | t100w | NULL | ALL | NULL | NULL | NULL | NULL | 994008 | 0.83 | Using where | +----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='OFFLINE'; +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (t100w.`status` = 'OFFLINE') (cost=105265.82 rows=8283) (actual time=0.031..920.213 rows=10000 loops=1) -> Table scan on t100w (cost=105265.82 rows=994008) (actual time=0.029..778.104 rows=1000000 loops=1) | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.93 sec)
此時,我再給status欄位建立索引,那麼這種情況下,如果條件時 status='OFFLINE',則執行效率會比較高,但如果條件是status='ONLINE',走索引時得回表,執行效率不比全表掃描好。
master [localhost:22132] {msandbox} (test) > create index idx_t100w_s on t100w(status); Query OK, 0 rows affected (9.07 sec) Records: 0 Duplicates: 0 Warnings: 0 master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='ONLINE'; +----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+ | 1 | SIMPLE | t100w | NULL | ref | idx_t100w_s | idx_t100w_s | 43 | const | 497004 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+ 1 row in set, 1 warning (0.00 sec) master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='ONLINE'; +----------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +----------------------------------------------------------------------------------------------------------------------------------------------+ | -> Index lookup on t100w using idx_t100w_s (status='ONLINE') (cost=67295.45 rows=497004) (actual time=0.112..3988.694 rows=990000 loops=1) | +----------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (4.55 sec) 《《《《《《==== master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='OFFLINE'; +----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+ | 1 | SIMPLE | t100w | NULL | ref | idx_t100w_s | idx_t100w_s | 43 | const | 18512 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+ 1 row in set, 1 warning (0.00 sec) master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='OFFLINE'; +-------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-------------------------------------------------------------------------------------------------------------------------------------------+ | -> Index lookup on t100w using idx_t100w_s (status='OFFLINE') (cost=17606.99 rows=18512) (actual time=0.074..80.192 rows=10000 loops=1) | +-------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.09 sec)
可以看到,此時走了索引,和收集 的直方圖就沒什麼關係了。
看看官方文件怎麼說吧:
Histogram statistics are useful primarily for nonindexed columns. Adding an index to a column for which histogram statistics are applicable might also help the optimizer make row estimates. The tradeoffs are:
l An index must be updated when table data is modified.
l A histogram is created or updated only on demand, so it adds no overhead when table data is modified. On the other hand, the statistics become progressively more out of date when table modifications occur, until the next time they are updated.
The optimizer prefers range optimizer row estimates to those obtained from histogram statistics. If the optimizer determines that the range optimizer applies, it does not use histogram statistics.
For columns that are indexed, row estimates can be obtained for equality comparisons using index dives (see ). In this case, histogram statistics are not necessarily useful because index dives can yield better estimates.
慢慢改進吧,我們太想任何其他資料庫都一下子能像ORACLE一樣強大了。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29863023/viewspace-2942060/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- 聊一聊MySQL的字符集MySql
- 聊一聊MySQL的儲存引擎MySql儲存引擎
- 聊一聊MySQL索引失效的問題MySql索引
- 聊一聊Oracle的Tablespace(一)Oracle
- 聊一聊 JVM 的 GCJVMGC
- 聊一聊 RestTemplateREST
- 聊一聊 cookieCookie
- 聊一聊圖資料庫的發展現狀資料庫
- 聊一聊遊戲的壓測遊戲
- 聊一聊 Javascript 中的 ASTJavaScriptAST
- 聊一聊 TLS/SSLTLS
- 聊一聊Java的列舉enumJava
- 聊一聊Redis的離線分析Redis
- 簡單聊一聊Vuex的原理Vue
- 聊一聊Javascript中的Promise物件JavaScriptPromise物件
- 聊一聊前端換膚前端
- 聊一聊Greenplum與PostgreSQLSQL
- 聊一聊模板方法模式模式
- 聊一聊測試流程
- 聊一聊session和cookieSessionCookie
- 聊一聊JWT與sessionJWTSession
- 聊一聊Iterable與Iterator的那些事!
- 聊一聊RocketMQ的註冊中心NameServerMQServer
- 聊一聊 SQLSERVER 的行不能跨頁SQLServer
- 簡單聊一聊FutureTask的實現
- 聊一聊隨機數安全隨機
- 面試官(7): 聊一聊 Babel?面試Babel
- 聊一聊前端業務開發前端
- 面試官:聊一聊索引吧面試索引
- 和手遊開發者聊一聊 iPhoneiPhone
- 聊一聊責任鏈模式模式
- 聊一聊介面卡模式模式
- 聊一聊裝飾者模式模式
- 聊一聊遊戲版本運營遊戲
- 聊一聊系統重構
- 簡單聊一聊ThreadPoolExecutorthread
- 聊一聊 php 程式碼提示PHP
- 來,聊一聊效能優化優化