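A Brief Analysis of How MySQL 8.0 Histograms Work

Published by Huawei Cloud Developer Alliance on 2024-05-27

This article is shared from the Huawei Cloud community post "[MySQL Technical Column] Introduction to MySQL 8.0 Histograms", by GaussDB Database.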

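Background

A database query optimizer is responsible for turning a SQL query into the most efficient execution plan it can. Because the data keeps changing, however, the optimizer may not know enough about the data being queried. It can then fail to produce the best plan, which hurts query performance. MySQL 8.0 introduced histograms to address this problem.

A histogram records how the values of a column are distributed and supplies that statistical information to the optimizer. With a histogram on a column, the optimizer can estimate the selectivity of the filter predicates in a WHERE clause, estimate row counts more accurately, and choose a more efficient query plan.

This article introduces the concept of histograms, shows how to use them through examples, gives a brief analysis of how histograms are created and dropped, and illustrates a typical use case with an example.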

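Introduction to MySQL 8.0 Histograms

In a database, the quality of the execution plan produced by the query optimizer determines how long a query takes. If the optimizer does not know how the data in a table is distributed, it may fail to produce the best plan and waste time during execution.

Suppose a SQL statement counts the travelers in two different time periods of equal length. Without knowing the count in each period, the optimizer assumes that travelers are evenly distributed across the two periods. If the actual counts differ widely, the optimizer's estimate is badly off and it may choose the wrong execution plan. So how can the optimizer learn enough about the data distribution to produce a good plan?

One solution is to create a histogram on the column, which gives an approximate picture of the data distribution in that column. Used well, histograms bring several benefits: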

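(1) Query optimization: statistics about the data distribution help optimize the query plan, choose a suitable index, and tune query statements, which improves query performance;

(2) Index design: analyzing the data distribution helps determine which columns are good candidates for an index, which improves query efficiency;

(3) Data analysis: the distribution gives users insight into the characteristics and trends of the data.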

MySQL histograms come in two types: singleton and equi-height. In a singleton histogram, each bucket stores one value and the cumulative frequency of that value:

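SCHEMA_NAME: xxx  // schema (database) name
TABLE_NAME: xxx   // table name
COLUMN_NAME: xxx  // column name
HISTOGRAM: {
  "buckets": [
    [
      xxx,  // value stored in the bucket
      xxx   // cumulative frequency of the value
    ],
    ......
  ],
  "data-type": "xxx",  // data type
  "null-values": xxx,  // fraction of NULL values
  "collation-id": xxx,
  "last-updated": "xxxx-xx-xx xx:xx:xx.xxxxxx",  // time of the last update
  "sampling-rate": xxx,  // sampling rate; 1 means all data was read
  "histogram-type": "singleton",  // histogram type: singleton
  "number-of-buckets-specified": xxx  // number of buckets specified
}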

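In an equi-height histogram, each bucket stores its lower and upper bounds, the cumulative frequency, and the number of distinct values: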

SCHEMA_NAME: xxx
TABLE_NAME: xxx
COLUMN_NAME: xxx
HISTOGRAM: {
  "buckets": [
    [
      xxx,  // lower bound (minimum value) of the bucket
      xxx,  // upper bound (maximum value) of the bucket
      xxx,  // cumulative frequency
      xxx   // number of distinct values in the bucket
    ],
    ......
  ],
  "data-type": "xxx",
  "null-values": xxx,
  "collation-id": xxx,
  "last-updated": "xxxx-xx-xx xx:xx:xx.xxxxxx",
  "sampling-rate": xxx,
  "histogram-type": "equi-height",  // histogram type: equi-height
  "number-of-buckets-specified": xxx
}
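To make the four fields of an equi-height bucket more concrete, here is a minimal Python sketch of how an equality predicate could be estimated from them. The bucket values and the function names are invented for illustration, and spreading a bucket's frequency evenly over its distinct values is a simplifying assumption rather than a description of MySQL's exact estimator.

```python
# Hypothetical equi-height buckets:
# [lower bound, upper bound, cumulative frequency, number of distinct values]
buckets = [
    [1, 10, 0.25, 10],
    [11, 40, 0.50, 30],
    [41, 45, 0.75, 5],
    [46, 100, 1.00, 55],
]


def bucket_frequency(buckets, index):
    """Frequency of one bucket = its cumulative frequency minus the previous one."""
    previous = buckets[index - 1][2] if index > 0 else 0.0
    return buckets[index][2] - previous


def equal_to_selectivity(buckets, value):
    """Estimate the fraction of rows where column = value.

    Assumption: rows inside a bucket are spread evenly over its distinct values.
    """
    for index, (lower, upper, _cumulative, distinct) in enumerate(buckets):
        if lower <= value <= upper:
            return bucket_frequency(buckets, index) / distinct
    return 0.0  # value lies outside every bucket


print(equal_to_selectivity(buckets, 43))   # value in the narrow, dense bucket
print(equal_to_selectivity(buckets, 20))   # value in a wide, sparse bucket
print(equal_to_selectivity(buckets, 200))  # value not covered by the histogram
```

Each bucket holds roughly the same share of rows, so a bucket with few distinct values (41 to 45 here) yields a much higher per-value estimate than a bucket with many.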

Using MySQL 8.0 Histograms

Histograms are created and dropped with the ANALYZE TABLE statement. The common syntax is:

Create a histogram:

ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] ... [WITH N BUCKETS]

Drop a histogram:

ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name] ...

A concrete example:

mysql> create table t1(c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int,c10 int,c11 int,c12 int,c13 datetime,c14 int,c15 int,c16 int,primary key(c1));

Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 values(1,2,3,4,5,6,7,8,9,10,11,12,'0000-01-01',14,15,16),(2,2,3,4,5,6,7,8,9,10,11,12,'0500-01-01',14,15,16),(3,2,3,4,5,6,7,8,9,10,11,12,'1000-01-01',14,15,16),(4,2,3,4,5,6,7,8,9,10,11,12,'1500-01-01',14,15,16),(5,2,3,4,5,6,7,8,9,10,11,12,'1500-01-01',14,15,16);

Query OK, 5 rows affected (0.00 sec)

Records: 5 Duplicates: 0 Warnings: 0

Create the histogram:

mysql> analyze table t1 update histogram on c13;
+---------+-----------+----------+------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                       |
+---------+-----------+----------+------------------------------------------------+
| test.t1 | histogram | status   | Histogram statistics created for column 'c13'. |
+---------+-----------+----------+------------------------------------------------+
1 row in set (0.01 sec)

View the histogram information:

mysql> select json_pretty(histogram) result from information_schema.column_statistics where table_name = 't1' and column_name = 'c13'\G
*************************** 1. row ***************************
result: {
  "buckets": [
    [
      "0000-01-01 00:00:00.000000",  // column value
      0.2  // cumulative frequency up to and including this value (same below)
    ],
    [
      "0500-01-01 00:00:00.000000",
      0.4
    ],
    [
      "1000-01-01 00:00:00.000000",
      0.6
    ],
    [
      "1500-01-01 00:00:00.000000",
      1.0
    ]
  ],
  "data-type": "datetime",  // data type of the column
  "null-values": 0.0,  // fraction of NULL values
  "collation-id": 8,  // collation ID of the histogram data
  "last-updated": "2023-09-30 16:05:28.533732",  // time the histogram was last updated
  "sampling-rate": 1.0,  // sampling rate used to build the histogram
  "histogram-type": "singleton",  // histogram type: singleton
  "number-of-buckets-specified": 100  // number of buckets specified
}
1 row in set (0.00 sec)
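Because a singleton histogram stores cumulative frequencies, the share of each individual value is the difference between neighboring buckets. The short Python sketch below applies that to the buckets shown above; the function name is an illustrative choice, not part of MySQL.

```python
import json

# The "buckets" array from the histogram above, with the annotations removed.
histogram_json = """
{
  "buckets": [
    ["0000-01-01 00:00:00.000000", 0.2],
    ["0500-01-01 00:00:00.000000", 0.4],
    ["1000-01-01 00:00:00.000000", 0.6],
    ["1500-01-01 00:00:00.000000", 1.0]
  ],
  "histogram-type": "singleton"
}
"""


def per_value_frequencies(buckets):
    """Turn cumulative frequencies into the frequency of each single value."""
    frequencies = {}
    previous = 0.0
    for value, cumulative in buckets:
        frequencies[value] = cumulative - previous
        previous = cumulative
    return frequencies


histogram = json.loads(histogram_json)
frequencies = per_value_frequencies(histogram["buckets"])
for value, frequency in frequencies.items():
    print(f"{value}: {frequency:.1f}")
```

The value '1500-01-01' appears in two of the five inserted rows, so its individual frequency comes out as 0.4 while the other three values each account for 0.2.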

Drop the histogram:

mysql> analyze table t1 drop histogram on c13;
+---------+-----------+----------+------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                       |
+---------+-----------+----------+------------------------------------------------+
| test.t1 | histogram | status   | Histogram statistics removed for column 'c13'. |
+---------+-----------+----------+------------------------------------------------+
1 row in set (0.00 sec)

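A Brief Analysis of How MySQL 8.0 Histograms Work

The overall framework of the histogram implementation can be summarized as shown in the figure below:

The histogram code lives mainly under sql/histograms. Files prefixed with equi_height implement equi-height histograms, files prefixed with singleton implement singleton histograms, files prefixed with value_map implement the structure that stores the collected values, and histogram.h/histogram.cc contain the histogram interfaces.

Sql_cmd_analyze_table::handle_histogram_command is the overall entry point for histogram operations; currently a histogram operation is supported on only one table at a time. The main call stack for creating a histogram is shown below, with update_histogram as the entry point for creation.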

mysql_execute_command
  ->Sql_cmd_analyze_table::execute
    ->Sql_cmd_analyze_table::handle_histogram_command
      ->Sql_cmd_analyze_table::update_histogram
        ->histograms::update_histogram
          ->prepare_value_maps
          ->fill_value_maps
          ->build_histogram
          ->store_histogram
            ->dd::cache::Dictionary_client::update
              ->dd::cache::Storage_adapter::store
                ->dd::Column_statistics_impl::store_attributes
                  ->histograms::Singleton<xxx>::histogram_to_json

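To walk through the creation flow: prepare_value_maps creates a value_map that matches the type of each histogram column. The sampling rate is then derived from histogram_generation_max_mem_size (the maximum amount of memory allowed for histogram generation) and the size of a single row. fill_value_maps repeatedly reads rows and fills the value_map of the corresponding type, where the key is the actual column value and the value is the number of times it occurs. build_histogram then builds the histogram: if the number of buckets (num_buckets) is greater than or equal to the number of distinct values (value_map.size()), a singleton histogram is created automatically; otherwise an equi-height histogram is created. The construction logic for the two types is in Singleton<T>::build_histogram and Equi_height<T>::build_histogram respectively.

Once the histogram is built, store_histogram saves the result as JSON in a system table, and it is exposed to users through INFORMATION_SCHEMA.COLUMN_STATISTICS. histogram_to_json converts the histogram into a Json_object: for example, last-updated is stored as Json_datetime, histogram-type as Json_string, and sampling-rate as Json_double. Each JSON field is then added in turn with json_object->add_clone.

The main call stack for dropping a histogram is shown below. Before dropping, drop_histograms first tries to fetch the histogram to check that it really exists; if it does not, the logic ends early, and if it does, the histogram is dropped.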
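The creation flow can be sketched in a few lines of Python. This is a simplified model for illustration only, not the MySQL source: the function names merely echo the C++ ones, the sampling formula is an assumption based on the description above, and the greedy equi-height builder is a stand-in for the more elaborate logic in Equi_height<T>::build_histogram.

```python
from collections import Counter


def sampling_rate(max_mem_size, row_size, row_count):
    """Assumed model: sample only as many rows as fit in the memory budget."""
    if row_size == 0 or row_count == 0:
        return 1.0
    return min(1.0, max_mem_size / (row_size * row_count))


def fill_value_map(values):
    """key = column value, value = number of occurrences; NULLs counted apart."""
    non_null = [v for v in values if v is not None]
    return Counter(non_null), len(values) - len(non_null)


def build_singleton(value_map, total_rows):
    """One bucket per distinct value: [value, cumulative frequency]."""
    buckets, cumulative = [], 0
    for value in sorted(value_map):
        cumulative += value_map[value]
        buckets.append([value, cumulative / total_rows])
    return buckets


def build_equi_height(value_map, total_rows, num_buckets):
    """Greedy stand-in: close a bucket once it holds about 1/num_buckets of the rows."""
    target = sum(value_map.values()) / num_buckets
    items = sorted(value_map.items())
    buckets, cumulative, in_bucket, distinct, lower = [], 0, 0, 0, None
    for index, (value, count) in enumerate(items):
        if lower is None:
            lower = value
        cumulative += count
        in_bucket += count
        distinct += 1
        if in_bucket >= target or index == len(items) - 1:
            buckets.append([lower, value, cumulative / total_rows, distinct])
            in_bucket, distinct, lower = 0, 0, None
    return buckets


def build_histogram(values, num_buckets):
    """Singleton if every distinct value can get its own bucket, else equi-height."""
    value_map, null_count = fill_value_map(values)
    total = len(values)
    if num_buckets >= len(value_map):
        kind, buckets = "singleton", build_singleton(value_map, total)
    else:
        kind, buckets = "equi-height", build_equi_height(value_map, total, num_buckets)
    return {
        "buckets": buckets,
        "null-values": null_count / total if total else 0.0,
        "histogram-type": kind,
        "number-of-buckets-specified": num_buckets,
    }


c13 = ["0000-01-01", "0500-01-01", "1000-01-01", "1500-01-01", "1500-01-01"]
print(build_histogram(c13, 100))             # 4 distinct values <= 100 buckets
print(build_histogram(list(range(1, 11)), 2))  # 10 distinct values > 2 buckets
```

With the five c13 values from the earlier example and the default of 100 buckets, the sketch chooses a singleton histogram, which matches the histogram-type in the output shown earlier.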

mysql_execute_command
  ->Sql_cmd_analyze_table::execute
    ->Sql_cmd_analyze_table::handle_histogram_command
      ->Sql_cmd_analyze_table::drop_histogram
        ->histograms::drop_histograms

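MySQL 8.0 Histogram Optimization Scenario

On the optimization side, as described earlier, histogram information is used to estimate the selectivity of each predicate in the WHERE clause, which helps the optimizer choose the best execution plan. For example, consider a table with the data skew shown below.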

mysql> select sys_id,order_status,count(*) from my_table_1 group by sys_id,order_status order by 1,2,3;
+--------+--------------+----------+
| sys_id | order_status | count(*) |
+--------+--------------+----------+
|      3 |            1 |        1 |
|      3 |            2 |   200766 |
|      3 |            3 |     3353 |
|      3 |            4 |     1325 |
|      5 |            1 |       13 |
|      5 |            2 |  2478373 |
|      5 |            3 |    43243 |
|      5 |            4 |    13529 |
|      6 |            2 |   171388 |
|      6 |            3 |      254 |
|      6 |            4 |      716 |
+--------+--------------+----------+
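To see how far off an even-distribution guess is on this data, the Python sketch below compares it with the real counts from the table above. The "uniform" estimate here simply divides the rows evenly among the distinct order_status values; it is a deliberately naive model for illustration and not necessarily the heuristic MySQL applies when no histogram exists.

```python
# (sys_id, order_status, row count) copied from the query result above.
rows = [
    (3, 1, 1), (3, 2, 200766), (3, 3, 3353), (3, 4, 1325),
    (5, 1, 13), (5, 2, 2478373), (5, 3, 43243), (5, 4, 13529),
    (6, 2, 171388), (6, 3, 254), (6, 4, 716),
]

total_rows = sum(count for _sys_id, _status, count in rows)

# Real number of rows per order_status value.
status_counts = {}
for _sys_id, status, count in rows:
    status_counts[status] = status_counts.get(status, 0) + count

# Naive assumption: every distinct order_status is equally likely.
uniform_selectivity = 1 / len(status_counts)

for status in sorted(status_counts):
    actual = status_counts[status] / total_rows
    print(
        f"order_status={status}: "
        f"uniform guess {uniform_selectivity * total_rows:,.0f} rows, "
        f"actual {status_counts[status]:,} rows ({actual:.6%})"
    )
```

For order_status = 1 the even-distribution guess is off by several orders of magnitude, which is the kind of gap a histogram on the column is meant to close.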

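When the following SQL statement is executed, the optimizer fails to estimate row counts accurately because of the data skew and chooses a poor execution plan; execution takes about 1.35 s.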

mysql> explain analyze select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (1) and t1.create_time >= '2022-09-10 00:00:00' and t1.create_time <= '2022-09-16 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=4163.10 rows=20) (actual time=1350.825..1350.825 rows=0 loops=1)
    -> Nested loop left join (cost=4163.10 rows=49) (actual time=1350.825..1350.825 rows=0 loops=1)
        -> Filter: ((t1.order_status = 1) and (t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2022-09-10 00:00:00') and (t1.create_time <= TIMESTAMP'2022-09-16 23:59:59')) (cost=215.79 rows=49) (actual time=1350.823..1350.823 rows=0 loops=1)
            -> Index scan on t1 using PRIMARY (reverse) (cost=215.79 rows=8828) (actual time=0.088..1209.201 rows=2910194 loops=1)
        -> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.63 rows=1) (never executed)

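After creating histograms with ANALYZE table my_table_1 UPDATE HISTOGRAM ON order_status, sys_id, create_time, running the same SQL statement again shows that the index used in the execution plan has changed, and execution takes about 0.11 s. The optimizer has used the more accurate distribution information to choose a better plan.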

mysql> explain analyze select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (1) and t1.create_time >= '2022-09-10 00:00:00' and t1.create_time <= '2022-09-16 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=38385.46 rows=20) (actual time=114.217..114.217 rows=0 loops=1)
    -> Nested loop left join (cost=38385.46 rows=62764) (actual time=114.216..114.216 rows=0 loops=1)
        -> Sort: t1.id DESC, limit input to 20 row(s) per chunk (cost=28200.86 rows=62668) (actual time=114.215..114.215 rows=0 loops=1)
            -> Filter: (t1.order_status = 1) (cost=28200.86 rows=62668) (actual time=114.207..114.207 rows=0 loops=1)
                -> Index range scan on t1 using idx_sys_id_create_time, with index condition: ((t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2022-09-10 00:00:00') and (t1.create_time <= TIMESTAMP'2022-09-16 23:59:59')) (cost=28200.86 rows=62668) (actual time=0.326..112.912 rows=31142 loops=1)
        -> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.62 rows=1) (never executed)

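In addition, when the values in the WHERE clause differ, the optimizer picks a plan suited to each value according to the data distribution, which improves execution efficiency.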

mysql> explain format=tree select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (2) and t1.create_time >= '2020-10-01 00:00:00' and t1.create_time <= '2020-10-09 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=13541.27 rows=20)
    -> Nested loop left join (cost=13541.27 rows=44)
        -> Filter: ((t1.order_status = 2) and (t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2020-10-01 00:00:00') and (t1.create_time <= TIMESTAMP'2020-10-09 23:59:59')) (cost=15.79 rows=44)
            -> Index scan on t1 using PRIMARY (reverse) (cost=15.79 rows=338)
        -> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.25 rows=1)
1 row in set (0.00 sec)

mysql> explain format=tree select t1.id, t1.order_number, t1.create_time, t1.order_status from my_table_1 t1 left join my_table_2 t2 on t1.id = t2.order_id WHERE t1.sys_id = 5 and t1.order_status in (4) and t1.create_time >= '2020-10-01 00:00:00' and t1.create_time <= '2020-10-09 23:59:59' order by t1.id desc LIMIT 20\G
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (cost=30559.31 rows=20)
    -> Nested loop left join (cost=30559.31 rows=55852)
        -> Sort: t1.id DESC, limit input to 20 row(s) per chunk (cost=24966.26 rows=55480)
            -> Filter: (t1.order_status = 4) (cost=24966.26 rows=55480)
                -> Index range scan on t1 using idx_sys_id_create_time, with index condition: ((t1.sys_id = 5) and (t1.create_time >= TIMESTAMP'2020-10-01 00:00:00') and (t1.create_time <= TIMESTAMP'2020-10-09 23:59:59')) (cost=24966.26 rows=55480)
        -> Index lookup on t2 using idx_order_id (order_id=t1.id) (cost=0.25 rows=1)
1 row in set (0.00 sec)

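So improving query performance by supplying statistics that lead to better query plans is, as described earlier, one of the benefits of using histograms.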
