資料庫的智慧之源-統計資訊之自定義多維統計

德哥發表於2017-09-11

資料庫

背景

PostgreSQL和Oracle一樣，優化器是基於成本的估算。

成本估算中很重要的一個環節是估計每個執行節點返回的記錄數。

例如兩張表JOIN的時候，如果走HASH JOIN那麼需要選擇記錄數少的那個作為雜湊表。

又比如求多個欄位的group by，評估返回多少條記錄給上層節點。

對於基於單列統計的的柱狀圖，估算單個欄位條件的選擇性是很準確的，而估算多個欄位時，PostgreSQL預設使用獨立屬性，直接以多個欄位選擇性相乘的方法計算多個欄位條件的選擇性。不是很準確。

PostgreSQL 10引入了一個黑科技功能，允許使用者自定義多個欄位的統計資訊，目前支援多列相關性和多列唯一值兩種統計。

由於多列統計涉及到許多組合（N階乘種組合），因此預設不會對所有欄位進行任意組合的統計，使用者可以根據實際的業務需求，對需要and查詢，組合group by的欄位(例如 where a xx and b xx, group by a,b)。建立對應的自定義統計資訊。

例子講解

1、建表，11個欄位。

postgres=# create table tbl(id int, c1 int, c2 text, c3 int, c4 int, c5 int, c6 int, c7 int, c8 int, c9 int, c10 int);    
CREATE TABLE

2、寫入測試資料，1000萬條。

postgres=# insert into tbl select     
postgres-# id,     
postgres-# random()*100, substring(md5(random()::text), 1, 4), random()*900, random()*10000, random()*10000000,     
postgres-# random()*100000, random()*100, random()*200000, random()*40000, random()*90000     
postgres-# from generate_series(1,10000000) t(id);    
INSERT 0 10000000

3、分析表

postgres=# analyze tbl;  
ANALYZE

得到reltuples為1e+07，後面要用於計算。

postgres=# select reltuples from pg_class where relname=`tbl`;  
-[ RECORD 1 ]----  
reltuples | 1e+07

4、SQL舉例

4.1 單個欄位條件

postgres=# explain (analyze) select * from tbl where c1=1;  
                                                 QUERY PLAN                                                    
-------------------------------------------------------------------------------------------------------------  
 Seq Scan on tbl  (cost=0.00..218458.08 rows=93865 width=45) (actual time=0.018..755.833 rows=99527 loops=1)  
   Filter: (c1 = 1)  
   Rows Removed by Filter: 9900473  
 Planning time: 0.077 ms  
 Execution time: 763.151 ms  
(5 rows)

可以推算得到c1=1的選擇性為: 93865/1e+07 。

postgres=# explain (analyze) select * from tbl where c2=`abc`;  
                                               QUERY PLAN                                                  
---------------------------------------------------------------------------------------------------------  
 Seq Scan on tbl  (cost=0.00..218458.08 rows=148 width=45) (actual time=874.473..874.473 rows=0 loops=1)  
   Filter: (c2 = `abc`::text)  
   Rows Removed by Filter: 10000000  
 Planning time: 0.080 ms  
 Execution time: 874.505 ms  
(5 rows)

可以推算得到c2=`abc`的選擇性為: 148/1e+07 。

4.2 多個欄位條件

postgres=# explain (analyze) select * from tbl where c1=1 and c2=`abc`;  
                                              QUERY PLAN                                                 
-------------------------------------------------------------------------------------------------------  
 Seq Scan on tbl  (cost=0.00..243458.09 rows=1 width=45) (actual time=802.347..802.347 rows=0 loops=1)  
   Filter: ((c1 = 1) AND (c2 = `abc`::text))  
   Rows Removed by Filter: 10000000  
 Planning time: 0.116 ms  
 Execution time: 802.374 ms  
(5 rows)

rows=1是怎麼得來的呢，在沒有自定義統計資訊時，是這麼算的，算這兩個條件完全不相干，所以選擇性直接相乘。

(93865/1e+07) * (148/1e+07) * 1e+07 = 1.389202 ~= 1

4.3 單個欄位條件求唯一值

postgres=# explain (analyze) select c1,count(*) from tbl group by c1;  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 HashAggregate  (cost=243458.09..243459.10 rows=101 width=12) (actual time=3256.458..3256.473 rows=101 loops=1)  
   Group Key: c1  
   ->  Seq Scan on tbl  (cost=0.00..193458.06 rows=10000006 width=4) (actual time=0.013..1252.169 rows=10000000 loops=1)  
 Planning time: 0.061 ms  
 Execution time: 3256.518 ms  
(5 rows)

rows=101來自pg_stats.n_distinct , tbl.c1列的統計。

n_distinct             | 101

4.4 多個欄位條件求唯一值

postgres=# explain (analyze) select c1,c2,count(*) from tbl group by c1,c2;  
                                                          QUERY PLAN                                                             
-------------------------------------------------------------------------------------------------------------------------------  
 GroupAggregate  (cost=1561215.43..1671215.50 rows=1000001 width=17) (actual time=11414.144..16549.549 rows=5147139 loops=1)  
   Group Key: c1, c2  
   ->  Sort  (cost=1561215.43..1586215.45 rows=10000006 width=9) (actual time=11414.132..13905.616 rows=10000000 loops=1)  
         Sort Key: c1, c2  
         Sort Method: external merge  Disk: 185984kB  
         ->  Seq Scan on tbl  (cost=0.00..193458.06 rows=10000006 width=9) (actual time=0.008..1567.909 rows=10000000 loops=1)  
 Planning time: 0.082 ms  
 Execution time: 16952.301 ms  
(8 rows)

5、自定義統計資訊語法講解

Command:     CREATE STATISTICS  
Description: define extended statistics  
Syntax:  
CREATE STATISTICS [ IF NOT EXISTS ] statistics_name  
    [ ( statistic_type [, ... ] ) ]  
    ON column_name, column_name [, ...]  
    FROM table_name

建立自定義統計資訊，指定需要自定義統計的欄位名，需要統計依賴性、唯一性（不指定則都統計）。

6、建立自定義統計資訊

我們建立c1 c2 c3這三個欄位的自定義統計資訊。

postgres=# create statistics s1 on c1,c2,c3 from tbl;  
CREATE STATISTICS

自定義統計資訊建立好之後，需要分析表，才會生成。

postgres=# analyze tbl;  
ANALYZE

7、自定義統計資訊解讀

postgres=# select * from pg_statistic_ext where stxname=`s1`;  
-[ RECORD 1 ]---+--------------------------------------------------------------------------------------------------------------  
stxrelid        | 16384   -- 表  
stxname         | s1  
stxnamespace    | 2200  
stxowner        | 10  
stxkeys         | 2 3 4  -- 表示第2,3,4列建立自定義統計資訊。  
stxkind         | {d,f}   -- 統計 欄位之間的依賴度（相關性）、唯一值個數。  
stxndistinct    | {"2, 3": 3747653, "2, 4": 87662, "3, 4": 9001205, "2, 3, 4": 10000006}   -- 組合唯一值個數  
stxdependencies | {"3 => 2": 0.642100, "3 => 4": 0.639567, "2, 3 => 4": 0.995000, "2, 4 => 3": 0.712033, "3, 4 => 2": 0.999667}    
-- 欄位之間的依賴性，當使用多個欄位AND條件時，用於代替多個孤立條件的選擇性相乘。選擇性乘以依賴度，選出最後計算結果最低的，作為最終選擇性。

stxndistinct，很好理解，就是欄位組合後的唯一值個數。

stxdependencies，當一個欄位確定後，另一個欄位是唯一值的比例有多少？例如a=1, b={1,2,3,4,5,….}; a=2,b=1，這裡只有後面這條算b依賴a。依賴條數除以總數即a => b的依賴值。可以用於評估兩個欄位都是等值條件時的選擇性。生成stxdependencies的演算法很簡單，a => b 等於 count(distinct a)/count(distinct a,b)；b=>a 等於 count(distinct b)/count(distinct a,b)；

postgres=# d t  
                 Table "public.t"  
 Column |  Type   | Collation | Nullable | Default   
--------+---------+-----------+----------+---------  
 c1     | integer |           |          |   
 c2     | integer |           |          |   
Statistics objects:  
    "public"."s1" (ndistinct, dependencies) ON c1, c2 FROM t  
  
postgres=# select * from pg_statistic_ext where stxname=`s2`;  
 stxrelid | stxname | stxnamespace | stxowner | stxkeys | stxkind |   stxndistinct   |             stxdependencies                
----------+---------+--------------+----------+---------+---------+------------------+------------------------------------------  
    16394 | s2      |         2200 |       10 | 1 2     | {d,f}   | {"1, 2": 105358} | {"1 => 2": 0.083733, "2 => 1": 0.916200}  
(1 row)  
  
postgres=# select count(distinct c1) from t;  
 count   
-------  
 10000  
(1 row)  
  
postgres=# select count(distinct c2) from t;  
 count    
--------  
 100001  
(1 row)  
  
postgres=# select count(distinct (c1,c2)) from t;  
s count    
--------  
 109999  
(1 row)  
  
postgres=# select 10000/109999.0;  
        ?column?          
------------------------  
 0.09090991736288511714  
(1 row)  
  
postgres=# select 100001/109999.0;  
        ?column?          
------------------------  
 0.90910826462058745989  
(1 row)  
  
postgres=# select 0.083733+0.916200;  
 ?column?   
----------  
 0.999933  
(1 row)

評估例子

a = ? and b = ? 的選擇性   
=   
min( "選擇性(a) * (a=>b)" , "選擇性(b) * (b=>a)" )

8、SQL舉例

8.1 多個欄位條件

postgres=# explain (analyze) select * from tbl where c1=1 and c2=`abc`;  
                                               QUERY PLAN                                                 
--------------------------------------------------------------------------------------------------------  
 Seq Scan on tbl  (cost=0.00..243458.09 rows=96 width=45) (actual time=802.182..802.182 rows=0 loops=1)  
   Filter: ((c1 = 1) AND (c2 = `abc`::text))  
   Rows Removed by Filter: 10000000  
 Planning time: 0.098 ms  
 Execution time: 802.203 ms  
(5 rows)

建立了多欄位統計資訊後，這兩個條件在統計資訊之列，所以可以用他們的依賴度來算組合AND條件的選擇性。

演算法：選擇性最低的條件的選擇性 * 與另一個欄位的依賴度，得到組合選擇性。

0.642100 * (148/1e+07) * 1e+07 = 95.0308 (這裡反推的選擇性有一點失真，大概原理就是這樣)

8.2 多個欄位條件求唯一值

postgres=# explain (analyze) select c1,c2,count(*) from tbl group by c1,c2;  
                                                          QUERY PLAN                                                             
-------------------------------------------------------------------------------------------------------------------------------  
 GroupAggregate  (cost=1561215.43..1698692.02 rows=3747653 width=17) (actual time=11632.613..16843.873 rows=5147139 loops=1)  
   Group Key: c1, c2  
   ->  Sort  (cost=1561215.43..1586215.45 rows=10000006 width=9) (actual time=11632.597..14202.457 rows=10000000 loops=1)  
         Sort Key: c1, c2  
         Sort Method: external merge  Disk: 185984kB  
         ->  Seq Scan on tbl  (cost=0.00..193458.06 rows=10000006 width=9) (actual time=0.008..1567.024 rows=10000000 loops=1)  
 Planning time: 0.215 ms  
 Execution time: 17246.889 ms  
(8 rows)

直接使用了多列統計資訊中的唯一值統計資訊3747653。

"2, 3": 3747653

9、如何利用自定義統計資訊統計多個欄位唯一值、多列依賴性。

9.1 PostgreSQL已有了單列唯一值的統計，我們可以通過pg_stats.n_distinct以及pg_class.reltuples查詢到。

通過create statistic，資料庫會自動收集多列值的統計資訊，我們查詢pg_statistic_ext.stxndistinct，可以得到多列唯一值的估計值。

9.2 多列依賴性指的是列與列之間值的依賴強度，是一個小於等於1的係數。1表示強依賴，

小結

1、PostgreSQL 10支援自定義多列統計資訊，目前支援多列組合唯一值、列與列的相關性。

2、多列唯一值可用於評估group by, count(distinct)等。

3、列與列相關性可用於估算多個列AND條件的選擇性。演算法

a = ? and b = ? 的選擇性   
=   
min( "選擇性(a) * (a=>b)" , "選擇性(b) * (b=>a)" )

4、由於多列統計資訊的組合很多，因此資料庫預設只統計單列的柱狀圖。當使用者意識到某些列會作為組合查詢列時，再建立自定義多列統計資訊即可。

參考

https://www.postgresql.org/docs/10/static/multivariate-statistics-examples.html

https://www.postgresql.org/docs/10/static/sql-createstatistics.html

https://www.postgresql.org/docs/10/static/planner-stats.html#planner-stats-extended

按自定義週期統計資料
2019-01-26
6 收集資料庫統計資訊
2020-04-27
資料庫
percona之資料庫資訊統計工具介紹
2015-03-30
資料庫
資料庫遷移後的統計資訊更新
2009-06-17
資料庫
達夢資料庫統計資訊詳解
2021-06-18
資料庫
收集資料庫統計資訊需要收集直方圖資訊.
2016-05-18
資料庫直方圖
資料庫效能調優之始： analyze統計資訊
2021-01-25
資料庫
Oracle多列統計資訊
2017-05-18
Oracle
達夢資料庫手動管理統計資訊方法
2021-06-24
資料庫
【統計資訊】Oracle統計資訊
2021-03-31
Oracle
Oracle統計資訊的收集和維護
2024-03-06
Oracle
Oracle資料庫的資料統計(Analyze)
2010-11-16
Oracle資料庫
資料庫統計資訊不更新導致的效能問題
2008-05-06
資料庫
收集全庫統計資訊
2020-09-16
Java之學生資訊管理系統升級版（資料庫程式設計）
2018-07-23
Java資料庫程式設計
【效能優化】Oracle 10g 資料庫之間複製統計資訊
2014-09-07
優化Oracle 10g資料庫
ORACLE表統計資訊與列統計資訊、索引統計資訊
2018-03-25
Oracle索引
oracle統計資訊使用dbms_stats包收集統計資料
2011-05-23
Oracle
python資料統計之禪道bug統計
2021-08-17
Python
請問這個資料庫統計資訊是什麼，
2022-02-03
資料庫
【PG執行計劃】Postgresql資料庫執行計劃統計資訊簡述
2019-01-10
SQL資料庫
統計資料庫中表大小
2019-09-19
資料庫
【統計資訊】Oracle常用的收集統計資訊方式
2020-11-30
Oracle
自定義統計結果指令碼mystat
2013-08-27
指令碼
資料庫：系統設計的核心
2023-11-17
資料庫
對資料庫的統計和分析。
2007-09-02
資料庫
android 自定義帶動畫的統計餅圖
2019-03-04
Android動畫
Oracle運維指令碼-收集統計資訊
2020-06-11
Oracle運維指令碼
oracle之autotrace統計資訊分析
2011-05-19
Oracle
統計資料庫每天的資料增長量
2014-11-05
資料庫
MapReduce程式設計實踐之自定義資料型別
2015-11-24
程式設計資料型別
資料庫系統設計概述
2020-08-02
資料庫
MySQL資料庫審計系統
2018-08-06
MySql資料庫
ORACLE表統計資訊與列統計資訊
2017-05-10
Oracle
Oracle 11g新特性之收集多列統計資訊
2017-09-21
Oracle
EBS系統資料庫統計資訊收集總結- gather_schema_stats [final]
2014-06-12
資料庫
Oracle 12c資料庫優化器統計資訊收集的最佳實踐（三）｜何時不需要收集統計資訊
2017-09-01
Oracle資料庫優化
PostgreSQL10.0preview功能增強-自由定義統計資訊維度
2017-03-30
SQLView

資料庫的智慧之源-統計資訊之自定義多維統計

標籤

背景

例子講解

小結

參考

相關文章