HybridDB for PostgreSQL: the secret behind Sort, GROUP BY, DISTINCT aggregation, and JOIN not fearing data skew - multi-stage aggregation
Tags
PostgreSQL, Greenplum, JOIN, group by, distinct, aggregation, non-distribution key, data skew, multi-stage aggregation
Background
In a distributed system, data is spread across nodes, for example randomly or by hash.
The Greenplum database supports two data distribution modes (a DDL sketch follows the list):
1. Hash distribution (on one or more specified columns)
2. Random distribution (no column needs to be specified)
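For example, the distribution mode is declared at table creation time (a minimal sketch; the table and column names here are made up for illustration):
-- hash distribution on a single column
create table t_hash (id int, info text) distributed by (id);
-- hash distribution on multiple columns
create table t_hash2 (c1 int, c2 int, info text) distributed by (c1, c2);
-- random distribution, no column specified
create table t_rand (id int, info text) distributed randomly;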
Once data is distributed this way, several operations become challenging:
JOIN, sorting, GROUP BY, and DISTINCT.
1. JOIN on a non-distribution-key column
2. Sorting: how to guarantee that the output is globally ordered
3. GROUP BY on a non-distribution-key column
4. DISTINCT on a non-distribution-key column
Databases with incomplete functionality may not support these operations at all.
Greenplum has had more than a decade of commercial development and is very complete in this respect. So what is its secret weapon?
(HybridDB for PostgreSQL is derived from the open-source GPDB and already includes this capability.)
Non-distribution-key JOIN, sort, GROUP BY, DISTINCT
1. Non-distribution-key GROUP BY
Example:
The distribution key of tbl_ao_col is c1, but we GROUP BY the c398 column. How is this handled? See the annotated execution plan below.
postgres=# explain analyze select c398,count(*),sum(c399),avg(c399),min(c399),max(c399) from tbl_ao_col group by c398;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 48:1 (slice2; segments: 48) (cost=123364.18..123582.28 rows=9693 width=96)
// results returned to the master
Rows out: 10001 rows at destination with 120 ms to end, start offset by 1.921 ms.
-> HashAggregate (cost=123364.18..123582.28 rows=202 width=96)
// aggregate again after redistribution
Group By: tbl_ao_col.c398
Rows out: Avg 208.4 rows x 48 workers. Max 223 rows (seg17) with 0.001 ms to first row, 54 ms to end, start offset by 35 ms.
-> Redistribute Motion 48:48 (slice1; segments: 48) (cost=122928.00..123121.86 rows=202 width=96)
// after the first aggregation the row count has already dropped to a few thousand, so even if the redistribution is skewed it hardly matters
Hash Key: tbl_ao_col.c398
Rows out: Avg 8762.2 rows x 48 workers at destination. Max 9422 rows (seg46) with 31 ms to end, start offset by 63 ms.
-> HashAggregate (cost=122928.00..122928.00 rows=202 width=96)
// this step aggregates locally on each segment
Group By: tbl_ao_col.c398
Rows out: Avg 8762.2 rows x 48 workers. Max 8835 rows (seg2) with 0.004 ms to first row, 8.004 ms to end, start offset by 82 ms.
-> Append-only Columnar Scan on tbl_ao_col (cost=0.00..107928.00 rows=20834 width=16)
Rows out: 0 rows (seg0) with 28 ms to end, start offset by 64 ms.
Slice statistics:
(slice0) Executor memory: 377K bytes.
(slice1) Executor memory: 1272K bytes avg x 48 workers, 1272K bytes max (seg0).
(slice2) Executor memory: 414K bytes avg x 48 workers, 414K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Settings: optimizer=off
Optimizer status: legacy query optimizer
Total runtime: 122.173 ms
(22 rows)
Execution plan interpretation:
For a GROUP BY on a non-distribution key, each segment first groups its local data, the partial results are then redistributed by the GROUP BY column, each segment groups the redistributed rows again, and finally the results are gathered on the master and returned to the user.
Greenplum decides, based on the ratio of distinct values in the GROUP BY column, whether to redistribute the raw data directly or to aggregate locally first and then redistribute (to reduce the amount of data shipped).
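Logically, the plan above behaves like the following two-stage rewrite (a sketch only; partial_agg is a made-up alias, and avg() is simplified to sum/count):
select c398,
       sum(cnt)                   as cnt_all,
       sum(s)                     as sum_all,
       sum(s)::numeric / sum(cnt) as avg_all,
       min(mn)                    as min_all,
       max(mx)                    as max_all
from (
  select c398, count(*) as cnt, sum(c399) as s, min(c399) as mn, max(c399) as mx
  from tbl_ao_col
  group by c398                   -- stage 1: local aggregation on each segment
) partial_agg                     -- stage 2: redistribute the partial rows by c398
group by c398;                    -- stage 3: combine the partial states per group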
2. Non-distribution-key DISTINCT
Example:
tbl is randomly distributed.
postgres=# explain analyze select count(distinct c2) from tbl;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1549462.55..1549462.56 rows=1 width=8)
Rows out: 1 rows with 0.002 ms to first row, 0.645 ms to end, start offset by 1.681 ms.
-> Gather Motion 48:1 (slice2; segments: 48) (cost=1548947.03..1549450.04 rows=1001 width=4)
Rows out: 1001 rows at destination with 498 ms to end, start offset by 1.684 ms.
-> HashAggregate (cost=1548947.03..1548959.55 rows=21 width=4)
Group By: tbl.c2
Rows out: Avg 20.9 rows x 48 workers. Max 31 rows (seg17) with 0.002 ms to first row, 152 ms to end, start offset by 39 ms.
-> Redistribute Motion 48:48 (slice1; segments: 48) (cost=1548912.00..1548932.02 rows=21 width=4)
Hash Key: tbl.c2
Rows out: Avg 1001.0 rows x 48 workers at destination. Max 1488 rows (seg17) with 309 ms to end, start offset by 39 ms.
-> HashAggregate (cost=1548912.00..1548912.00 rows=21 width=4)
Group By: tbl.c2
Rows out: Avg 1001.0 rows x 48 workers. Max 1001 rows (seg0) with 0.006 ms to first row, 271 ms to end, start offset by 42 ms.
-> Append-only Columnar Scan on tbl (cost=0.00..1048912.00 rows=2083334 width=4)
Rows out: 0 rows (seg0) with 25 ms to end, start offset by 42 ms.
Slice statistics:
(slice0) Executor memory: 327K bytes.
(slice1) Executor memory: 764K bytes avg x 48 workers, 764K bytes max (seg0).
(slice2) Executor memory: 292K bytes avg x 48 workers, 292K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Settings: enable_bitmapscan=off; enable_seqscan=off; optimizer=off
Optimizer status: legacy query optimizer
Total runtime: 502.576 ms
(24 rows)
Execution plan interpretation:
For DISTINCT on a non-distribution key, each segment first performs a local hash aggregate, the data is then redistributed by the DISTINCT column, each segment hash-aggregates again, and finally the result is gathered on the master and returned to the user.
Greenplum decides, based on the ratio of distinct values in the column, whether to redistribute the data directly or to aggregate locally first and then redistribute (to reduce the amount of data shipped).
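Logically, count(distinct c2) is evaluated like this rewrite (a sketch only; dedup is a made-up alias, and NULL handling is ignored since count(distinct) skips NULLs):
select count(*)
from (
  select c2
  from tbl
  group by c2      -- deduplicated locally on each segment, and again after
                   -- the rows are redistributed by c2
) dedup;           -- the final count runs on the master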
3. Non-distribution-key DISTINCT + non-distribution-key GROUP BY
tbl is randomly distributed.
postgres=# explain analyze select count(distinct c2) from tbl group by c3;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 48:1 (slice2; segments: 48) (cost=1805483.56..1805484.83 rows=101 width=12)
Rows out: 101 rows at destination with 990 ms to end, start offset by 519 ms.
-> HashAggregate (cost=1805483.56..1805484.83 rows=3 width=12)
Group By: partial_aggregation.c3
Rows out: Avg 2.5 rows x 41 workers. Max 4 rows (seg9) with 0.005 ms to first row, 0.284 ms to end, start offset by 577 ms.
-> HashAggregate (cost=1802703.29..1803967.05 rows=2107 width=8)
Group By: tbl.c3, tbl.c2
Rows out: Avg 2465.9 rows x 41 workers. Max 4004 rows (seg9) with 0.001 ms to first row, 260 ms to end, start offset by 577 ms.
-> Redistribute Motion 48:48 (slice1; segments: 48) (cost=1798912.00..1800934.02 rows=2107 width=8)
Hash Key: tbl.c3
Rows out: Avg 118362.0 rows x 41 workers at destination. Max 192192 rows (seg9) with 663 ms to end, start offset by 577 ms.
-> HashAggregate (cost=1798912.00..1798912.00 rows=2107 width=8)
Group By: tbl.c3, tbl.c2
Rows out: Avg 101100.9 rows x 48 workers. Max 101101 rows (seg0) with 0.005 ms to first row, 747 ms to end, start offset by 562 ms.
-> Append-only Columnar Scan on tbl (cost=0.00..1048912.00 rows=2083334 width=8)
Rows out: 0 rows (seg0) with 40 ms to end, start offset by 560 ms.
Slice statistics:
(slice0) Executor memory: 327K bytes.
(slice1) Executor memory: 1117K bytes avg x 48 workers, 1117K bytes max (seg0).
(slice2) Executor memory: 435K bytes avg x 48 workers, 452K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Settings: enable_bitmapscan=off; enable_seqscan=off; optimizer=off
Optimizer status: legacy query optimizer
Total runtime: 1511.120 ms
(25 rows)
Here both the DISTINCT column and the GROUP BY column are non-distribution keys; Greenplum's distributed execution plan elegantly handles the network traffic that redistributing data for non-distribution-key GROUP BY and DISTINCT would otherwise cause.
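Logically the plan corresponds to a rewrite like the following (a sketch; partial_aggregation matches the alias shown in the plan above, and NULLs in c2 are ignored):
select c3, count(*)
from (
  select c3, c2
  from tbl
  group by c3, c2  -- stage 1: local (c3, c2) deduplication on each segment
) partial_aggregation
group by c3;       -- after redistribution by c3: dedup again, then count per c3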
4. Non-distribution-key JOIN
When two tables are joined on non-distribution-key columns, Greenplum automatically redistributes the data (or broadcasts the smaller table).
PS
Be careful when the join column is skewed.
In this example the join column contains 10 million rows with the same ID; after redistribution for the JOIN they all land on a single segment.
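A quick way to spot this kind of skew before joining is to look at the most frequent values of the join column (a sketch; what counts as "skewed" is up to you):
select id, count(*) as cnt
from a
group by id
order by cnt desc
limit 10;          -- a single id with millions of rows will all land on one
                   -- segment once the data is redistributed by id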
postgres=# explain analyze select a.c1,count(*) from a join b on (a.id=b.id) group by a.c1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 48:1 (slice3; segments: 48) (cost=0.00..2730.45 rows=1 width=12)
Rows out: 1 rows at destination with 7190 ms to end, start offset by 2.357 ms.
-> GroupAggregate (cost=0.00..2730.45 rows=1 width=12)
Group By: a.c1
Rows out: 1 rows (seg22) with 0.001 ms to first row, 0.320 ms to end, start offset by 54 ms.
-> Sort (cost=0.00..2730.44 rows=1 width=12)
Sort Key: a.c1
Rows out: 1 rows (seg22) with 0.001 ms to end, start offset by 54 ms.
Executor memory: 33K bytes avg, 33K bytes max (seg0).
Work_mem used: 33K bytes avg, 33K bytes max (seg0). Workfile: (0 spilling, 0 reused)
-> Redistribute Motion 48:48 (slice2; segments: 48) (cost=0.00..2730.44 rows=1 width=12)
Hash Key: a.c1
Rows out: 1 rows at destination (seg22) with 7138 ms to end, start offset by 54 ms.
-> Result (cost=0.00..2730.44 rows=1 width=12)
Rows out: 1 rows (seg42) with 0.003 ms to end, start offset by 77 ms.
-> GroupAggregate (cost=0.00..2730.44 rows=1 width=12)
Group By: a.c1
Rows out: 1 rows (seg42) with 0.002 ms to first row, 1054 ms to end, start offset by 77 ms.
-> Sort (cost=0.00..2730.44 rows=1 width=4)
Sort Key: a.c1
Rows out: 10000000 rows (seg42) with 0.003 ms to end, start offset by 77 ms.
Executor memory: 1400K bytes avg, 65676K bytes max (seg42).
Work_mem used: 1400K bytes avg, 65676K bytes max (seg42). Workfile: (1 spilling, 0 reused)
Work_mem wanted: 481337K bytes avg, 481337K bytes max (seg42) to lessen workfile I/O affecting 1 workers.
-> Hash Join (cost=0.00..2730.44 rows=1 width=4)
Hash Cond: b.id = a.id
Rows out: 10000000 rows (seg42) with 0.014 ms to first row, 4989 ms to end, start offset by 77 ms.
Executor memory: 6511K bytes avg, 6513K bytes max (seg18).
Work_mem used: 6511K bytes avg, 6513K bytes max (seg18). Workfile: (0 spilling, 0 reused)
-> Redistribute Motion 48:48 (slice1; segments: 48) (cost=0.00..471.72 rows=208130 width=4)
Hash Key: b.id
Rows out: 10000000 rows at destination (seg42) with 0.004 ms to end, start offset by 77 ms.
-> Table Scan on b (cost=0.00..436.27 rows=208130 width=4)
Rows out: Avg 208333.3 rows x 48 workers. Max 208430 rows (seg17) with 4.815 ms to first row, 824 ms to end, start offset by 92 ms.
-> Hash (cost=436.27..436.27 rows=208475 width=8)
Rows in: (No row requested) 0 rows (seg0) with 0 ms to end.
-> Table Scan on a (cost=0.00..436.27 rows=208475 width=8)
Rows out: Avg 208333.3 rows x 48 workers. Max 208401 rows (seg18) with 34 ms to first row, 46 ms to end, start offset by 63 ms.
Slice statistics:
(slice0) Executor memory: 330K bytes.
(slice1) Executor memory: 1129K bytes avg x 48 workers, 1129K bytes max (seg0).
(slice2) * Executor memory: 2139K bytes avg x 48 workers, 66504K bytes max (seg42). Work_mem: 65676K bytes max, 481337K bytes wanted.
(slice3) Executor memory: 372K bytes avg x 48 workers, 388K bytes max (seg22). Work_mem: 33K bytes max.
Statement statistics:
Memory used: 128000K bytes
Memory wanted: 1444908K bytes
Settings: enable_bitmapscan=on; enable_seqscan=on; optimizer=on
Optimizer status: PQO version 1.602
Total runtime: 7193.902 ms
(49 rows)
JOIN on two non-distribution keys
When two tables are joined on non-distribution-key columns, Greenplum automatically redistributes the data (or broadcasts the smaller table). In this example both inputs are redistributed by the join column.
postgres=# explain analyze select a.c1,count(*) from a join b on (a.id=b.id) group by a.c1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 48:1 (slice4; segments: 48) (cost=0.00..990.85 rows=101 width=12)
Rows out: 101 rows at destination with 752 ms to first row, 753 ms to end, start offset by 732 ms.
-> GroupAggregate (cost=0.00..990.85 rows=3 width=12)
Group By: a.c1
Rows out: Avg 2.5 rows x 41 workers. Max 4 rows (seg9) with 746 ms to end, start offset by 738 ms.
-> Sort (cost=0.00..990.85 rows=3 width=12)
Sort Key: a.c1
Rows out: Avg 118.2 rows x 41 workers. Max 192 rows (seg9) with 746 ms to end, start offset by 738 ms.
Executor memory: 58K bytes avg, 58K bytes max (seg0).
Work_mem used: 58K bytes avg, 58K bytes max (seg0). Workfile: (0 spilling, 0 reused)
-> Redistribute Motion 48:48 (slice3; segments: 48) (cost=0.00..990.85 rows=3 width=12)
Hash Key: a.c1
Rows out: Avg 118.2 rows x 41 workers at destination. Max 192 rows (seg9) with 594 ms to first row, 746 ms to end, start offset by 738 ms.
-> Result (cost=0.00..990.85 rows=3 width=12)
Rows out: Avg 101.0 rows x 48 workers. Max 101 rows (seg0) with 675 ms to first row, 676 ms to end, start offset by 740 ms.
-> HashAggregate (cost=0.00..990.85 rows=3 width=12)
Group By: a.c1
Rows out: Avg 101.0 rows x 48 workers. Max 101 rows (seg0) with 675 ms to first row, 676 ms to end, start offset by 740 ms.
Executor memory: 4185K bytes avg, 4185K bytes max (seg0).
-> Hash Join (cost=0.00..964.88 rows=208191 width=4)
Hash Cond: a.id = b.id
Rows out: Avg 208333.3 rows x 48 workers. Max 208401 rows (seg18) with 282 ms to first row, 661 ms to end, start offset by 767 ms.
Executor memory: 4883K bytes avg, 4885K bytes max (seg18).
Work_mem used: 4883K bytes avg, 4885K bytes max (seg18). Workfile: (0 spilling, 0 reused)
(seg18) Hash chain length 1.3 avg, 4 max, using 159471 of 262151 buckets.
-> Redistribute Motion 48:48 (slice1; segments: 48) (cost=0.00..444.59 rows=208378 width=8)
Hash Key: a.id
Rows out: Avg 208333.3 rows x 48 workers at destination. Max 208401 rows (seg18) with 0.112 ms to first row, 104 ms to end, start offset by 1048 ms.
-> Table Scan on a (cost=0.00..436.27 rows=208378 width=8)
Rows out: Avg 208333.3 rows x 48 workers. Max 208422 rows (seg31) with 0.117 ms to first row, 64 ms to end, start offset by 749 ms.
-> Hash (cost=440.42..440.42 rows=208191 width=4)
Rows in: Avg 208333.3 rows x 48 workers. Max 208401 rows (seg18) with 250 ms to end, start offset by 798 ms.
-> Redistribute Motion 48:48 (slice2; segments: 48) (cost=0.00..440.42 rows=208191 width=4)
Hash Key: b.id
Rows out: Avg 208333.3 rows x 48 workers at destination. Max 208401 rows (seg18) with 0.219 ms to first row, 132 ms to end, start offset by 798 ms.
-> Table Scan on b (cost=0.00..436.27 rows=208191 width=4)
Rows out: Avg 208333.3 rows x 48 workers. Max 208388 rows (seg3) with 0.146 ms to first row, 77 ms to end, start offset by 760 ms.
Slice statistics:
(slice0) Executor memory: 313K bytes.
(slice1) Executor memory: 1096K bytes avg x 48 workers, 1096K bytes max (seg0).
(slice2) Executor memory: 1096K bytes avg x 48 workers, 1096K bytes max (seg0).
(slice3) Executor memory: 25518K bytes avg x 48 workers, 25518K bytes max (seg0). Work_mem: 4885K bytes max.
(slice4) Executor memory: 374K bytes avg x 48 workers, 382K bytes max (seg0). Work_mem: 58K bytes max.
Statement statistics:
Memory used: 128000K bytes
Settings: enable_bitmapscan=on; enable_seqscan=on; optimizer=on
Optimizer status: PQO version 1.602
Total runtime: 1486.335 ms
(48 rows)
Non-distribution-key sort
1. Merge sort
To guarantee a globally ordered result while keeping the sort efficient, Greenplum uses a merge sort: each segment first sorts its local data (all segments in parallel), then the master pulls the sorted streams from the segments and merges them.
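For example, ordering by a column that is not the distribution key (a sketch; only the expected plan shape matters here, costs will differ):
explain select * from tbl order by c2;
-- expected shape: each segment runs a local Sort, and the Gather Motion on top
-- carries a "Merge Key: c2", i.e. the master merges the pre-sorted streams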
Principles of non-distribution-key GROUP BY and DISTINCT
For grouped aggregation on non-distribution-key columns, Greenplum uses multi-stage aggregation, as follows (see the sketch after this list):
Stage 1: aggregate locally on each segment. (Greenplum decides, based on the ratio of distinct values in the column, whether to redistribute the data directly or to aggregate locally first and then redistribute, reducing the amount of data shipped.)
Stage 2: redistribute the partial results by the grouping column(s).
Stage 3: aggregate again locally on each segment.
Stage 4: return the results to the master; if necessary the master applies the aggregate's final function (by this point the row count and the amount of work are tiny).
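The choice described in stage 1 can be observed with EXPLAIN (a sketch; low_card_col and high_card_col are hypothetical columns, and the actual decision depends on the optimizer's statistics):
-- few distinct values: expect a HashAggregate below the Redistribute Motion,
-- i.e. aggregate first, then ship only the small partial result
explain select low_card_col, count(*) from tbl_ao_col group by low_card_col;
-- many distinct values: the planner may skip the local pre-aggregation and
-- redistribute the raw rows instead, aggregating only after the motion
explain select high_card_col, count(*) from tbl_ao_col group by high_card_col;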
Principles of non-distribution-key JOIN
1. For a table whose join column is not its distribution key, Greenplum chooses, based on the table's size, either to redistribute that table by the join column (large table) or to broadcast it (small table).
2. Once redistribution (or broadcast) is complete, the segments perform the join locally in parallel.
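Whether a given input is redistributed or broadcast is visible directly in the plan (a sketch; big_tbl and small_tbl are hypothetical, and the choice depends on the optimizer's size estimates):
explain select *
from big_tbl b join small_tbl s on (b.non_dk_col = s.non_dk_col);
-- expect either a "Redistribute Motion" on both inputs (both rehashed by the
-- join column), or a "Broadcast Motion" that ships the smaller table to every
-- segment while the larger one stays in place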
References
《Greenplum row store vs. column store, heap tables vs. AO tables: principles and how to choose》
《Causes of and solutions to data skew in a distributed database (Greenplum): Alibaba Cloud HybridDB for PostgreSQL best practices》
Window functions, forced redistribution:
《Trillion-plus daily increments: real-time analytics and data consolidation: Alibaba Cloud HybridDB for PostgreSQL best practices》