Greenplum計算能力估算-暨多大表需要分割槽，單個分割槽多大適宜

德哥發表於2018-04-18

原文網址 : https://flycode.co/archives/185825

背景

在資料倉儲業務中，單表的資料量通常是非常巨大的。Greenplum在這方面做了很多的優化

1、支援列儲存

2、支援向量計算

3、支援分割槽表

4、支援btree, bitmap, gist索引介面

5、執行平行計算

6、支援HASH JOIN

等

提高資料篩選的效率是一個較為低廉有效的優化手段，比如表分割槽。

但是分割槽是不是越多越好呢？

實際上分割槽過多也會引入導致優化器生成執行計劃較慢，後設資料過多，SYSCACHE過大等問題。

設定多大分割槽應該權衡影響，同時又要考慮計算能力。

單個SEGMENT多大資料量合適

GPDB是一個分散式資料庫，執行一條複雜QUERY時，所有的SEGMENT可能並行參與計算。

那麼最慢的SEGMENT就成為了整個SQL的瓶頸，單個SEGMENT多少記錄合適呢？

可以做一個簡單的測試，生成一份測試報告，以供參考。

建立3種常用欄位型別，分別測試這幾種型別的聚合統計能力，JOIN能力。

1、int8型別

postgres=> create temp table t1 (id int8) with (APPENDONLY=true, ORIENTATION=column);  
NOTICE:  Table doesn`t have `DISTRIBUTED BY` clause -- Using column named `id` as the Greenplum Database data distribution key for this table.  
HINT:  The `DISTRIBUTED BY` clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.  
CREATE TABLE  
postgres=> create temp table t2 (id int8) with (APPENDONLY=true, ORIENTATION=column);  
NOTICE:  Table doesn`t have `DISTRIBUTED BY` clause -- Using column named `id` as the Greenplum Database data distribution key for this table.  
HINT:  The `DISTRIBUTED BY` clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.  
CREATE TABLE  
postgres=> insert into t1 select generate_series(1,100000000);  
INSERT 0 100000000  
postgres=> insert into t2 select * from t1;  
INSERT 0 100000000

2、text型別

postgres=> create temp table tt1 (id text) with (APPENDONLY=true, ORIENTATION=column);  
NOTICE:  Table doesn`t have `DISTRIBUTED BY` clause -- Using column named `id` as the Greenplum Database data distribution key for this table.  
HINT:  The `DISTRIBUTED BY` clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.  
CREATE TABLE  
postgres=> create temp table tt2 (id text) with (APPENDONLY=true, ORIENTATION=column);  
NOTICE:  Table doesn`t have `DISTRIBUTED BY` clause -- Using column named `id` as the Greenplum Database data distribution key for this table.  
HINT:  The `DISTRIBUTED BY` clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.  
CREATE TABLE  
  
postgres=> insert into tt1 select id from t1;  
INSERT 0 100000000  
postgres=> insert into tt2 select id from tt1;  
INSERT 0 100000000

3、numeric型別

postgres=> create temp table ttt1 (id numeric) with (APPENDONLY=true, ORIENTATION=column);  
NOTICE:  Table doesn`t have `DISTRIBUTED BY` clause -- Using column named `id` as the Greenplum Database data distribution key for this table.  
HINT:  The `DISTRIBUTED BY` clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.  
CREATE TABLE  
postgres=> create temp table ttt2 (id numeric) with (APPENDONLY=true, ORIENTATION=column);  
NOTICE:  Table doesn`t have `DISTRIBUTED BY` clause -- Using column named `id` as the Greenplum Database data distribution key for this table.  
HINT:  The `DISTRIBUTED BY` clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.  
CREATE TABLE  
  
postgres=> insert into ttt1 select id from t1;  
INSERT 0 100000000  
postgres=> insert into ttt2 select id from t1;  
INSERT 0 100000000

測試環境為單物理機（64執行緒機器），48個SEGMENT。1億記錄。

1 聚合

1、int8型別

postgres=> explain analyze select count(*),sum(id),avg(id),min(id),max(id),stddev(id) from t1;  
                                                                QUERY PLAN                                                                  
------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=2524480.55..2524480.56 rows=1 width=120)  
   Rows out:  1 rows with 0.002 ms to first row, 159 ms to end, start offset by 1.624 ms.  
   ->  Gather Motion 48:1  (slice1; segments: 48)  (cost=2524480.02..2524480.52 rows=1 width=120)  
         Rows out:  48 rows at destination with 221 ms to end, start offset by 1.626 ms.  
         ->  Aggregate  (cost=2524480.02..2524480.02 rows=1 width=120)  
               Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.001 ms to first row, 134 ms to end, start offset by 10 ms.  
               ->  Append-only Columnar Scan on t1  (cost=0.00..1024480.00 rows=2083334 width=8)  
                     Rows out:  0 rows (seg0) with 10 ms to end, start offset by 37 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 315K bytes.  
   (slice1)    Executor memory: 378K bytes avg x 48 workers, 378K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 382.093 ms  
(16 rows)

2、text型別

postgres=> explain analyze select count(*),sum(t1.id::int8),avg(t1.id::int8),min(t1.id::int8),max(t1.id::int8),stddev(t1.id::int8) from tt1 t1;  
                                                                QUERY PLAN                                                                  
------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=2527178.57..2527178.58 rows=1 width=120)  
   Rows out:  1 rows with 0.003 ms to first row, 798 ms to end, start offset by 1.382 ms.  
   ->  Gather Motion 48:1  (slice1; segments: 48)  (cost=2527178.02..2527178.53 rows=1 width=120)  
         Rows out:  48 rows at destination with 1006 ms to end, start offset by 1.385 ms.  
         ->  Aggregate  (cost=2527178.02..2527178.04 rows=1 width=120)  
               Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.003 ms to first row, 926 ms to end, start offset by 14 ms.  
               ->  Append-only Columnar Scan on tt1 t1  (cost=0.00..1027178.00 rows=2083334 width=8)  
                     Rows out:  0 rows (seg0) with 16 ms to end, start offset by 36 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 315K bytes.  
   (slice1)    Executor memory: 378K bytes avg x 48 workers, 378K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 1805.789 ms  
(16 rows)

3、numeric型別

postgres=> explain analyze select count(*),sum(id),avg(id),min(id),max(id),stddev(id) from ttt1;  
                                                                QUERY PLAN                                                                  
------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=2527512.55..2527512.56 rows=1 width=168)  
   Rows out:  1 rows with 0.001 ms to first row, 1712 ms to end, start offset by 1.292 ms.  
   ->  Gather Motion 48:1  (slice1; segments: 48)  (cost=2527512.02..2527512.52 rows=1 width=168)  
         Rows out:  48 rows at destination with 1926 ms to end, start offset by 1.293 ms.  
         ->  Aggregate  (cost=2527512.02..2527512.02 rows=1 width=168)  
               Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0 ms to first row, 1849 ms to end, start offset by 4.436 ms.  
               ->  Append-only Columnar Scan on ttt1  (cost=0.00..1027512.00 rows=2083334 width=8)  
                     Rows out:  0 rows (seg0) with 7.385 ms to end, start offset by 53 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 315K bytes.  
   (slice1)    Executor memory: 378K bytes avg x 48 workers, 378K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 3640.108 ms  
(16 rows)

2 JOIN 聚合

1、int8型別

postgres=> explain analyze select count(*),sum(t1.id),avg(t1.id),min(t1.id),max(t1.id),stddev(t1.id) from t1 join t2 using (id);  
                                                                          QUERY PLAN                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=6689588.56..6689588.57 rows=1 width=120)  
   Rows out:  1 rows with 0.003 ms to first row, 908 ms to end, start offset by 1.505 ms.  
   ->  Gather Motion 48:1  (slice1; segments: 48)  (cost=6689588.02..6689588.52 rows=1 width=120)  
         Rows out:  48 rows at destination with 1517 ms to end, start offset by 1.508 ms.  
         ->  Aggregate  (cost=6689588.02..6689588.03 rows=1 width=120)  
               Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.002 ms to first row, 81 ms to end, start offset by 11 ms.  
               ->  Hash Join  (cost=2372137.00..5189588.00 rows=2083334 width=8)  
                     Hash Cond: t1.id = t2.id  
                     Rows out:  Avg 2083333.3 rows x 48 workers.  Max 2083479 rows (seg42) with 0.013 ms to first row, 1359 ms to end, start offset by 38 ms.  
                     Executor memory:  65105K bytes avg, 65109K bytes max (seg42).  
                     Work_mem used:  65105K bytes avg, 65109K bytes max (seg42). Workfile: (0 spilling, 0 reused)  
                     ->  Append-only Columnar Scan on t1  (cost=0.00..1024480.00 rows=2083334 width=8)  
                           Rows out:  0 rows (seg0) with 0.003 ms to end, start offset by 38 ms.  
                     ->  Hash  (cost=1024480.00..1024480.00 rows=2083334 width=8)  
                           Rows in:  (No row requested) 0 rows (seg0) with 0 ms to end.  
                           ->  Append-only Columnar Scan on t2  (cost=0.00..1024480.00 rows=2083334 width=8)  
                                 Rows out:  0 rows (seg0) with 30 ms to end, start offset by 54 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 315K bytes.  
   (slice1)    Executor memory: 378K bytes avg x 48 workers, 378K bytes max (seg0).  Work_mem: 65109K bytes max.  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 2426.790 ms  
(25 rows)

2、text型別

postgres=> explain analyze select count(*),sum(t1.id::int8),avg(t1.id::int8),min(t1.id::int8),max(t1.id::int8),stddev(t1.id::int8) from tt1 t1 join tt2 using (id);  
                                                                         QUERY PLAN                                                                            
-------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=6694984.57..6694984.58 rows=1 width=120)  
   Rows out:  1 rows with 0.001 ms to first row, 2068 ms to end, start offset by 1.423 ms.  
   ->  Gather Motion 48:1  (slice1; segments: 48)  (cost=6694984.02..6694984.53 rows=1 width=120)  
         Rows out:  48 rows at destination with 3169 ms to end, start offset by 1.425 ms.  
         ->  Aggregate  (cost=6694984.02..6694984.04 rows=1 width=120)  
               Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.004 ms to first row, 1049 ms to end, start offset by 11 ms.  
               ->  Hash Join  (cost=2374835.00..5194984.00 rows=2083334 width=8)  
                     Hash Cond: t1.id = tt2.id  
                     Rows out:  Avg 2083333.3 rows x 48 workers.  Max 2084068 rows (seg4) with 0.012 ms to first row, 2240 ms to end, start offset by 60 ms.  
                     Executor memory:  65105K bytes avg, 65128K bytes max (seg4).  
                     Work_mem used:  65105K bytes avg, 65128K bytes max (seg4). Workfile: (0 spilling, 0 reused)  
                     ->  Append-only Columnar Scan on tt1 t1  (cost=0.00..1027178.00 rows=2083334 width=8)  
                           Rows out:  0 rows (seg0) with 0.003 ms to end, start offset by 11 ms.  
                     ->  Hash  (cost=1027178.00..1027178.00 rows=2083334 width=8)  
                           Rows in:  (No row requested) 0 rows (seg0) with 0 ms to end.  
                           ->  Append-only Columnar Scan on tt2  (cost=0.00..1027178.00 rows=2083334 width=8)  
                                 Rows out:  0 rows (seg0) with 37 ms to end, start offset by 43 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 315K bytes.  
   (slice1)    Executor memory: 378K bytes avg x 48 workers, 378K bytes max (seg0).  Work_mem: 65128K bytes max.  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 5249.571 ms  
(25 rows)

3、numeric型別

postgres=> explain analyze select count(*),sum(t1.id),avg(t1.id),min(t1.id),max(t1.id),stddev(t1.id) from ttt1 t1 join ttt2 using (id);  
                                                                          QUERY PLAN                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=6695652.56..6695652.57 rows=1 width=168)  
   Rows out:  1 rows with 0.003 ms to first row, 2661 ms to end, start offset by 1.406 ms.  
   ->  Gather Motion 48:1  (slice1; segments: 48)  (cost=6695652.02..6695652.52 rows=1 width=168)  
         Rows out:  48 rows at destination with 4696 ms to end, start offset by 1.409 ms.  
         ->  Aggregate  (cost=6695652.02..6695652.03 rows=1 width=168)  
               Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.004 ms to first row, 2770 ms to end, start offset by 4.078 ms.  
               ->  Hash Join  (cost=2375169.00..5195652.00 rows=2083334 width=8)  
                     Hash Cond: t1.id = ttt2.id  
                     Rows out:  Avg 2083333.3 rows x 48 workers.  Max 2083627 rows (seg10) with 0.015 ms to first row, 3745 ms to end, start offset by 35 ms.  
                     Executor memory:  65105K bytes avg, 65114K bytes max (seg10).  
                     Work_mem used:  65105K bytes avg, 65114K bytes max (seg10). Workfile: (0 spilling, 0 reused)  
                     ->  Append-only Columnar Scan on ttt1 t1  (cost=0.00..1027512.00 rows=2083334 width=8)  
                           Rows out:  0 rows (seg0) with 0.012 ms to end, start offset by 45 ms.  
                     ->  Hash  (cost=1027512.00..1027512.00 rows=2083334 width=8)  
                           Rows in:  (No row requested) 0 rows (seg0) with 0 ms to end.  
                           ->  Append-only Columnar Scan on ttt2  (cost=0.00..1027512.00 rows=2083334 width=8)  
                                 Rows out:  0 rows (seg0) with 30 ms to end, start offset by 46 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 315K bytes.  
   (slice1)    Executor memory: 378K bytes avg x 48 workers, 378K bytes max (seg0).  Work_mem: 65114K bytes max.  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 7369.522 ms  
(25 rows)

效能指標

型別	總記錄數	segment數	單segment記錄數	單表聚合耗時	多表JOIN+聚合耗時	每segment每秒聚合記錄數	每segment每秒JOIN+聚合記錄數
INT8	1億行	48	208萬行	0.38秒	2.4秒	547萬行	86萬行 * 2
TEXT	1億行	48	208萬行	1.8秒	5.2秒	115萬行	40萬行 * 2
NUMERIC	1億行	48	208萬行	3.6秒	7.37秒	57萬行	28萬行 * 2

小結

設定多少個分割槽，除了業務邏輯層面的因素（比如按日、月、年，或者按LIST等），另外還應該考慮兩方面的因素：

1、分割槽過多也會引入導致優化器生成執行計劃較慢，後設資料過多，SYSCACHE過大等問題。

2、單個SEGMENT的計算能力。（將分割槽後單個SEGMENT的單個分割槽內的記錄數壓縮到可以接受的範圍。）例如:

100億條記錄，1000個SEGMENT，不分割槽的情況下，一個SEGMENT有1000萬條記錄。如果要滿足在輸入WHERE條件過濾資料後(假設過濾後要計算的記錄數小於50億條)INT8型別欄位聚合1秒響應，根據以上效能測試資料，建議至少分成2個區。

數值型別的選擇，除非精度要求，建議不要使用numeric。建議使用int, int8, float, float8等型別。從以上測試可以看出效能差異巨大。

參考

《PostgreSQL 11 preview – 分割槽表智慧並行JOIN (已類似MPP架構，效能暴增)》

《PostgreSQL 11 preview – 分割槽表智慧並行聚合、分組計算(已類似MPP架構，效能暴增)》

oracle分割槽表和分割槽表exchange
2024-03-15
Oracle
oracle分割槽表和非分割槽表exchange
2024-03-15
Oracle
【Linux】MBR磁碟分割槽表只能有四個分割槽？
2018-04-26
Linux
PostgreSQL/LightDB 分割槽表之分割槽裁剪
2022-07-14
SQL
Oracle drop分割槽表單個分割槽無法透過閃回恢復
2021-02-24
Oracle
非分割槽錶轉換成分割槽表
2022-07-09
[oracle] expdp 匯出分割槽表的分割槽
2021-01-28
Oracle
PG的非分割槽表線上轉分割槽表
2022-07-12
Oracle分割槽表基礎運維-07增加分割槽(3列表分割槽)
2020-05-18
Oracle運維
Oracle分割槽表基礎運維-07增加分割槽(2 HASH分割槽)
2020-05-18
Oracle運維
【MYSQL】分割槽表
2021-09-09
MySql
oracle 分割槽表move和包含分割槽表的lob move
2019-03-11
Oracle
移動分割槽表和分割槽索引的表空間
2018-05-23
索引
Oracle分割槽表基礎運維-07增加分割槽(1範圍分割槽)
2020-05-18
Oracle運維
windows10 c盤多大合適_win10 c盤分割槽的詳細教程
2019-11-25
WindowsWin10
Oracle分割槽表基礎運維-04列表分割槽
2020-05-15
Oracle運維
MySQL 分割槽表探索
2019-01-28
MySql
分割槽表-實戰
2021-09-05
Linux分割槽方案、分割槽建議
2024-11-04
Linux
Oracle分割槽表基礎運維-09刪除分割槽
2020-05-30
Oracle運維
Oracle分割槽表基礎運維-05組合分割槽
2020-05-15
Oracle運維
Oracle分割槽表基礎運維-02範圍分割槽
2020-05-14
Oracle運維
Oracle分割槽表基礎運維-03HASH分割槽
2020-05-14
Oracle運維
分割槽表之自動增加分割槽（11G）
2018-04-26
Oracle分割槽表基礎運維-06分割槽表索引
2020-05-18
Oracle運維索引
oracle 線上重新定義，普通表改變分割槽表，分割槽表可以更改型別、分割槽欄位等
2020-08-19
Oracle型別
PostgreSQL 並行vacuum patch - 暨為什麼需要並行vacuum或分割槽表
2018-04-18
SQL並行
HGDB的分割槽表實現SQL Server的分割槽檢視
2021-11-22
SQLServer
非分割槽錶轉換成分割槽表以及注意事項
2022-07-05
Oracle分割槽表基礎運維-01分割槽表分類
2020-05-14
Oracle運維
SQL Server大分割槽表沒有空分割槽的情況下如何擴充套件分割槽的方法
2022-09-30
SQLServer套件
Linux 分割槽擴容（根分割槽擴容，SWAP 分割槽擴容，掛載新分割槽為目錄）
2021-08-27
Linux
SQL SERVER之分割槽表
2019-04-17
SQLServer
ORACLE分割槽表梳理系列
2019-04-17
Oracle
Spark操作Hive分割槽表
2018-12-07
SparkHive
OceaBase 分割槽表建立技巧
2024-05-28
Mysql表分割槽實現
2020-10-25
MySql
mysql 進行表分割槽
2019-01-04
MySql

Greenplum計算能力估算-暨多大表需要分割槽，單個分割槽多大適宜

標籤

背景

單個SEGMENT多大資料量合適

1 聚合

2 JOIN 聚合

效能指標

小結

參考

相關文章