PostgreSQL多查詢條件,多個索引的選擇演算法與問題診斷方法

德哥發表於2018-04-18

標籤

PostgreSQL , 多列條件 , 多索引 , 單列索引 , 複合索引 , 聯合索引 , 優化器 , 評估 , 行評估 , 成本


背景

當一個SQL中涉及多個條件,並且多個條件有多種索引可選時,資料庫優化器是如何選擇使用哪個索引的?

例如

有一張表,有2個欄位,單列一個索引,雙列一個複合索引.

建表。  
postgres=# create table tbl(id int, gid int);  
CREATE TABLE  
  
插入1000萬記錄,其中ID唯一,GID只有10個值。  
postgres=# insert into tbl select generate_series(1,10000000), random()*9 ;  
INSERT 0 10000000  
  
建立兩個索引。  
postgres=# create index idx1 on tbl(id);  
CREATE INDEX  
postgres=# create index idx2 on tbl(gid,id);  
CREATE INDEX  

下面三條SQL,會如何選擇使用哪個索引呢?

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=123;  
  
select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=1;  
  
select * from tbl where id in (1,2,3,4,5,6,7,8,9,10);  

問題思考

人為選擇

這三條QUERY,實際上有三重含義:

1、gid=123的行根本不存在。

如果讓你來選索引,你肯定會選複合索引,馬上就能定位到資料不存在掃描最少的BLOCK。

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=123;  

2、gid=1存在,同時id裡面的條件也存在。

如果讓你來選索引,應該也是選擇複合索引,因為精確定位到了所有的行。

當然如果id in裡面很多記錄不存在,那麼你可能就會選擇id單列索引,因為這個索引本身更小,可能掃描更少的BLOCK。

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=1;  

3、只有id的條件。

此時,肯定選單列索引了。

select * from tbl where id in (1,2,3,4,5,6,7,8,9,10);  

實際情況如何呢?

1、資料庫執行計劃與預期一致

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=123;  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using idx2 on public.tbl  (cost=0.43..2.46 rows=1 width=8) (actual time=0.037..0.037 rows=0 loops=1)  
   Output: id, gid  
   Index Cond: (tbl.gid = 123)  
   Filter: (tbl.id = ANY (`{1,2,3,4,5,6,7,8,9,10}`::integer[]))  
   Heap Fetches: 0  
   Buffers: shared hit=3  
 Planning time: 0.829 ms  
 Execution time: 0.086 ms  
(8 rows)  

2、與預期一致

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=1;  
                                                      QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using idx2 on public.tbl  (cost=0.43..15.46 rows=1 width=8) (actual time=0.026..0.037 rows=2 loops=1)  
   Output: id, gid  
   Index Cond: ((tbl.gid = 1) AND (tbl.id = ANY (`{1,2,3,4,5,6,7,8,9,10}`::integer[])))  
   Heap Fetches: 2  
   Buffers: shared hit=31  
 Planning time: 0.121 ms  
 Execution time: 0.058 ms  
(7 rows)  

3、與預期一致

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10);  
                                                     QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx1 on public.tbl  (cost=0.43..15.52 rows=10 width=8) (actual time=0.021..0.035 rows=10 loops=1)  
   Output: id, gid  
   Index Cond: (tbl.id = ANY (`{1,2,3,4,5,6,7,8,9,10}`::integer[]))  
   Buffers: shared hit=31  
 Planning time: 0.104 ms  
 Execution time: 0.055 ms  
(6 rows)  

問題昇華

資料庫生成執行計劃靠的是統計資訊,如果統計資訊不準確,那麼執行計劃必然不準確。

例如我們人為關閉TBL的自動統計資訊收集,然後寫入一批新的資料。

postgres=# alter table tbl set (autovacuum_enabled =off);  
ALTER TABLE  
postgres=# insert into tbl select generate_series(1,10000000), 100;  
INSERT 0 10000000  

這個資料的特點是GID=100,在原有的統計資訊中,gid=100的行是不存在的,所以下面的SQL優化器顯然做出了錯誤的決定。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=100;  
                                                        QUERY PLAN                                                          
--------------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using idx2 on public.tbl  (cost=0.44..2.46 rows=1 width=8) (actual time=0.030..2051.851 rows=10 loops=1)  
   Output: id, gid  
   Index Cond: (tbl.gid = 100)  
   Filter: (tbl.id = ANY (`{1,2,3,4,5,6,7,8,9,10}`::integer[]))  
   Rows Removed by Filter: 9999990  
   Heap Fetches: 10000000  
   Buffers: shared hit=71574  
 Planning time: 0.130 ms  
 Execution time: 2051.900 ms  
(9 rows)  

更新統計資訊後,執行計劃就準確了。

postgres=# analyze tbl;  
ANALYZE  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where id in (1,2,3,4,5,6,7,8,9,10) and gid=100;  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using idx2 on public.tbl  (cost=0.44..20.57 rows=10 width=8) (actual time=0.027..0.043 rows=10 loops=1)  
   Output: id, gid  
   Index Cond: ((tbl.gid = 100) AND (tbl.id = ANY (`{1,2,3,4,5,6,7,8,9,10}`::integer[])))  
   Heap Fetches: 10  
   Buffers: shared hit=31  
 Planning time: 0.212 ms  
 Execution time: 0.067 ms  
(7 rows)  

如何自動收集統計資訊

開啟autovacuum , track_counts即可。

有幾個微調引數,決定了什麼時候掃描是否需要收集統計資訊,以及當前表的變化量。

track_counts = on  
  
#------------------------------------------------------------------------------  
# AUTOVACUUM PARAMETERS  
#------------------------------------------------------------------------------  
  
autovacuum = on                 # Enable autovacuum subprocess?  `on`  
                                        # requires track_counts to also be on.  
#log_autovacuum_min_duration = -1       # -1 disables, 0 logs all actions and  
                                        # their durations, > 0 logs only  
                                        # actions running at least this number  
                                        # of milliseconds.  
#autovacuum_max_workers = 3             # max number of autovacuum subprocesses  
                                        # (change requires restart)  
autovacuum_naptime = 3s         # time between autovacuum runs  
#autovacuum_vacuum_threshold = 50       # min number of row updates before  
                                        # vacuum  
#autovacuum_analyze_threshold = 50      # min number of row updates before  
                                        # analyze  
#autovacuum_vacuum_scale_factor = 0.2   # fraction of table size before vacuum  
#autovacuum_analyze_scale_factor = 0.1  # fraction of table size before analyze  
#autovacuum_freeze_max_age = 200000000  # maximum XID age before forced vacuum  
                                        # (change requires restart)  
#autovacuum_multixact_freeze_max_age = 400000000        # maximum multixact age  
                                        # before forced vacuum  
                                        # (change requires restart)  
autovacuum_vacuum_cost_delay = 0ms      # default vacuum cost delay for  
                                        # autovacuum, in milliseconds;  
                                        # -1 means use vacuum_cost_delay  
#autovacuum_vacuum_cost_limit = -1      # default vacuum cost limit for  
                                        # autovacuum, -1 means use  
                                        # vacuum_cost_limit  

PostgreSQL優化器是支援CBO與遺傳演算法

《資料庫優化器原理 – 如何治療選擇綜合症》

評估每個條件過濾多少行

《PostgreSQL pg_stats used to estimate top N freps values and explain rows》

統計資訊解讀

《PostgreSQL pg_stat_ pg_statio_ 統計資訊(scan,read,fetch,hit)原始碼解讀》

《PostgreSQL 統計資訊pg_statistic格式及匯入匯出dump_stat – 相容Oracle》

昇華-多列統計資訊

《PostgreSQL 10 黑科技 – 自定義統計資訊》

其他因統計資訊不準導致的效能問題

《Greenplum 統計資訊收集引數 – 暨統計資訊不準引入的broadcast motion一例》

其他參考文獻

《PostgreSQL 10 黑科技 – 自定義統計資訊》

《資料庫優化器原理 – 如何治療選擇綜合症》

《PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan》

《PostgreSQL 9種索引的原理和應用場景》

《Greenplum 統計資訊收集引數 – 暨統計資訊不準引入的broadcast motion一例》

《PostgreSQL pg_stat_ pg_statio_ 統計資訊(scan,read,fetch,hit)原始碼解讀》

《PostgreSQL 統計資訊pg_statistic格式及匯入匯出dump_stat – 相容Oracle》

《PostgreSQL pg_stats used to estimate top N freps values and explain rows》


相關文章