PostgreSQL任意列組合條件行數估算實踐-取樣估算

postgres=# create table tbl1(c1 int, c2 int, c3 int, c4 int, c5 int, c6 timestamp);  
CREATE TABLE  
postgres=# insert into tbl1 select random()*100, random()*10, random()*50, random()*20, random()*5000, clock_timestamp()  from generate_series(1,50000000);  
INSERT 0 50000000  
postgres=# analyze tbl1;  
ANALYZE

2、觀察多列、單列條件的行評估是否準確

兩列AND條件，非常準確（評估53562，實際50354 ）

postgres=# select count(*) from tbl1 where c1=1 and c2=1;  
 count   
-------  
 50354  
(1 row)  
  
postgres=# explain select * from tbl1 where c1=1 and c2=1;  
                                 QUERY PLAN                                    
-----------------------------------------------------------------------------  
 Gather  (cost=1000.00..686505.00 rows=53562 width=28)  
   Workers Planned: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..680148.80 rows=22318 width=28)  
         Filter: ((c1 = 1) AND (c2 = 1))  
(4 rows)

兩列OR條件，非常準確（評估5558119，實際5451751 ）

postgres=# explain select * from tbl1 where c1=1 or c2=1;  
                           QUERY PLAN                              
-----------------------------------------------------------------  
 Seq Scan on tbl1  (cost=0.00..1117649.92 rows=5558119 width=28)  
   Filter: ((c1 = 1) OR (c2 = 1))  
(2 rows)  
  
postgres=# select count(*) from tbl1 where c1=1 or c2=1;  
  count    
---------  
 5451751  
(1 row)

三列AND條件，非常準確（評估10，實際12 ）

postgres=# explain analyze select * from tbl1 where c1=1 and c2=1 and c5=99;  
                                                       QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------  
 Gather  (cost=1000.00..733233.27 rows=10 width=28) (actual time=41.413..1422.724 rows=12 loops=1)  
   Workers Planned: 2  
   Workers Launched: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..732232.27 rows=4 width=28) (actual time=273.134..1420.060 rows=4 loops=3)  
         Filter: ((c1 = 1) AND (c2 = 1) AND (c5 = 99))  
         Rows Removed by Filter: 16666663  
 Planning time: 0.126 ms  
 Execution time: 1424.440 ms  
(8 rows)

但是，所有情況都能準確嗎？

當資料出現傾斜時，可能就沒那麼準確了。

3、再次寫入5000萬資料，此次c2,c3,c4,c5使用固定值，使其傾斜。

postgres=# insert into tbl1 select random()*100, 10, 50, 20, 5000, clock_timestamp()  from generate_series(1,50000000);  
INSERT 0 50000000  
  
postgres=# analyze tbl1;  
ANALYZE

4、非傾斜值的評估，依舊準確

postgres=# explain analyze select * from tbl1 where c1=1 and c2=1 and c5=99;  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 Gather  (cost=1000.00..1465464.64 rows=21 width=28) (actual time=362.359..3084.858 rows=12 loops=1)  
   Workers Planned: 2  
   Workers Launched: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..1464462.54 rows=9 width=28) (actual time=154.661..3082.340 rows=4 loops=3)  
         Filter: ((c1 = 1) AND (c2 = 1) AND (c5 = 99))  
         Rows Removed by Filter: 33333329  
 Planning time: 0.099 ms  
 Execution time: 3086.687 ms  
(8 rows)

5、對於傾斜值，評估偏差很大（評估12702，實際0 ）

postgres=# explain analyze select * from tbl1 where c1=1 and c2=1 and c5=5000 and c4=20;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Gather  (cost=1000.00..1570899.53 rows=12702 width=28) (actual time=2834.241..2834.241 rows=0 loops=1)  
   Workers Planned: 2  
   Workers Launched: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..1568629.33 rows=5292 width=28) (actual time=2831.777..2831.777 rows=0 loops=3)  
         Filter: ((c1 = 1) AND (c2 = 1) AND (c5 = 5000) AND (c4 = 20))  
         Rows Removed by Filter: 33333333  
 Planning time: 0.144 ms  
 Execution time: 2836.049 ms  
(8 rows)

6、調整目標列的統計資訊柱狀圖，擴大到1000。

postgres=# alter table tbl1 alter COLUMN c5 set statistics 1000;  
ALTER TABLE  
postgres=# alter table tbl1 alter COLUMN c4 set statistics 1000;  
ALTER TABLE  
postgres=# alter table tbl1 alter COLUMN c1 set statistics 1000;  
ALTER TABLE  
postgres=# alter table tbl1 alter COLUMN c2 set statistics 1000;  
ALTER TABLE  
postgres=# analyze tbl1;  
ANALYZE

7、行評估依舊不準確（評估13034，實際0 ）

postgres=# explain analyze select * from tbl1 where c1=1 and c2=1 and c5=5000 and c4=20;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Gather  (cost=1000.00..1570932.73 rows=13034 width=28) (actual time=2852.150..2852.150 rows=0 loops=1)  
   Workers Planned: 2  
   Workers Launched: 2  
   ->  Parallel Seq Scan on tbl1  (cost=0.00..1568629.33 rows=5431 width=28) (actual time=2849.510..2849.510 rows=0 loops=3)  
         Filter: ((c1 = 1) AND (c2 = 1) AND (c5 = 5000) AND (c4 = 20))  
         Rows Removed by Filter: 33333333  
 Planning time: 0.181 ms  
 Execution time: 2853.844 ms  
(8 rows)

那麼這個問題無解了嗎？實際上PG提供了取樣查詢的介面，我們通過取樣查詢能夠實現高效、準確的評估。

《秒級任意維度分析1TB級大表 – 通過取樣估值滿足高效TOP N等統計分析需求》

內建了4種取樣介面，我們這裡用到的是基於BLOCK級別的百分比取樣。

取樣百分之一，COUNT值乘以100就是評估行數。如果取樣百分之5的話，COUNT值乘以20就是評估行數。

實際上取樣百分之一，評估已經非常準確了。

postgres=# select count(*) from ( SELECT * FROM tbl1 TABLESAMPLE SYSTEM(1)) t where c1=1 and c2=1 and c5=5000 and c4=20;  
 count   
-------  
     0  
(1 row)  
Time: 104.212 ms  
  
postgres=# select count(*) from ( SELECT * FROM tbl1 TABLESAMPLE SYSTEM(1)) t where c1=1 and c2=1 and c5=5000 ;  
 count   
-------  
     0  
(1 row)  
Time: 107.088 ms  
  
postgres=# select count(*) from tbl1 where c1=1 and c2=1 and c5=5000;  
 count   
-------  
     2  
(1 row)  
Time: 2838.372 ms (00:02.838)

postgres=# select count(*) from ( SELECT * FROM tbl1 TABLESAMPLE SYSTEM(1)) t where c1=1 and c2=1 ;  
 count   
-------  
   552  
(1 row)  
  
Time: 100.066 ms  
  
postgres=# select count(*) from tbl1 where c1=1 and c2=1 ;  
 count   
-------  
 50354  
(1 row)  
Time: 2546.602 ms (00:02.547)

1億條記錄，取樣百分之一，大概耗時100毫秒。相比直接全表掃描還是節約了不少成本的。

小結

通過取樣查詢，我們實現了任意維度條件過濾的行評估，相比傳統的行評估演算法，精確度提高了不少，在1億資料量的情況下，取樣百分之一，耗時約100毫秒。

參考

《PostgreSQL Oracle 相容性之 – 資料取樣與脫敏》

《PostgreSQL pg_stats used to estimate top N freps values and explain rows》

《PostgreSQL 優化器行評估演算法》

《妙用explain Plan Rows快速估算行 – 分頁數估算》

《秒級任意維度分析1TB級大表 – 通過取樣估值滿足高效TOP N等統計分析需求》

《PostgreSQL 10 黑科技 – 自定義統計資訊》

修改單列統計資訊柱狀圖精度，預設是100全域性default_statistics_target引數控制。

postgres=# alter table tbl1 alter COLUMN c5 set statistics 1000;  
ALTER TABLE  
Time: 0.786 ms  
  
postgres=# analyze tbl1;  
ANALYZE  
Time: 2157.194 ms (00:02.157)

《Greenplum 最佳實踐 – 估值外掛hll的使用(以及hll分式聚合函式優化)》

《PostgreSQL hll (HyperLogLog) extension for “State of The Art Cardinality Estimation Algorithm” – 3》

《PostgreSQL hll (HyperLogLog) extension for “State of The Art Cardinality Estimation Algorithm” – 2》

《PostgreSQL hll (HyperLogLog) extension for “State of The Art Cardinality Estimation Algorithm” – 1》

《PostgreSQL count-min sketch top-n 概率計算外掛 cms_topn (結合視窗實現同比、環比、滑窗分析等) – 流計算核心功能之一》

PostgreSQL實踐-實時廣告位推薦2(任意欄位組合、任意維度組合搜尋、輸出TOP-K)
2018-05-06
SQL
Transformer 估算 101
2023-05-19
ORM
如何實現條件組合元件
2024-03-21
元件
樣本量估算：隨機對照試驗（兩組均數）比較的樣本量計算方法
2020-12-22
隨機
1、擬合、預測、估算器、管道與模型評估
2024-08-22
模型
架構：軟體成本估算
2024-03-01
架構
ClickHouse最大QPS到底咋估算？
2024-03-21
NameNode堆記憶體估算
2024-05-08
記憶體
架構設計（九）：估算
2022-12-25
架構
專案成本估算快速指南
2022-04-13
敏捷實施中的估算與實際效果 - Ottinger
2021-05-14
敏捷
【Django】組合條件的搜尋功能實現
2018-04-12
Django
遊戲ROI估算模型（附工具）
2019-12-19
遊戲模型
演算法分析__時間估算
2019-03-05
演算法
軟體工程估算的技巧 - shubhro
2022-05-12
軟體工程
專案管理中的自下而上估算
2022-07-14
專案管理
如何合理地估算執行緒池大小？（轉載）
2018-05-21
執行緒
定性檢測的樣本量估算之精確概率法
2022-03-28
5G基站電源配置如何估算？
2020-06-02
你如何估算專案資源的成本？
2021-11-24
PostgreSQL生成任意基數數獨-3
2018-04-18
SQL
PostgreSQL生成任意基數數獨-2
2018-04-18
SQL
postgresql 多條記錄合併一條，或取最新的一條資料
2024-04-26
SQL
【DevCloud·敏捷智庫】如何利用故事點做估算
2020-09-10
devCloud敏捷
CO-標準成本估算深入研究系列
2022-05-16
Spring Boot @Condition 註解，組合條件你知道嗎
2019-09-09
Spring Boot
專案經理如何有效地進行專案工作量估算?
2022-07-19
分散式儲存系統可靠性如何估算？
2019-03-01
分散式
如何為 Cloud TPU 編寫自定義估算器模型
2018-09-19
Cloud模型
【GreatSQL最佳化器-03】查詢開銷估算
2024-11-20
SQL
敏捷開發：使用者故事估算方法介紹
2024-12-03
敏捷
PostgreSQL啟動恢復讀取checkpoint記錄失敗的條件
2018-08-06
SQL
TypeScript 條件型別精讀與實踐
2021-10-04
TypeScript型別
多執行緒06：條件變數
2022-05-13
執行緒變數
敏捷實戰分享：Runtastic停止了估算故事並改善衝刺，效率提高30%
2020-03-22
敏捷AST
組合數學筆記-特殊計數數列
2023-03-01
筆記
ReentrantLock的條件佇列
2020-12-07
ReentrantLock佇列
使用 TensorFlow Hub 和估算器構建文字分類模型
2018-09-05
文字分類模型

PostgreSQL任意列組合條件行數估算實踐-取樣估算

標籤

背景

例子

小結

參考

相關文章