PostgreSQL 11 preview - Parallel Sort and Parallel Index Builds (near-linear performance scaling): TOP-K over 10 billion rows in 40 seconds on a single instance

Published by 德哥 (digoal) on 2018-02-28

Tags

PostgreSQL, parallel sort, external sort, merge sort, TOP-K, parallel index build, partition-wise parallelism, ranking, percentile, median, data distribution


Background

In OLAP workloads, sorting is a very important capability, and a key indicator of whether a database is suitable for OLAP at all. Typical scenarios:

1. TOP-K: for example, after grouping the data, return the TOP-K rows of each group (a small sketch for this and scenario 6 follows the list).

2. Median: the data must be sorted so that the value sitting at the middle position can be returned.

3. Histograms: the data must be sorted and split evenly by row count into a number of buckets to obtain the bucket boundaries.

4. Ranking: sort the data and return the TOP-K.

5. Windowed grouping: compute data distributions, grouped aggregates, and so on over windows.

https://www.postgresql.org/docs/devel/static/functions-window.html

6. Percentiles over massive data sets, e.g. the daily website access latency percentiles (99%, 95%, ...).

7. Index builds (in TP workloads with frequent updates and deletes, indexes gradually bloat; with PostgreSQL 11's parallel index build, index maintenance gets dramatically faster, and the non-DML-blocking form, CREATE INDEX CONCURRENTLY, supports parallelism as well).
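
As a quick illustration of scenarios 1 and 6, a minimal sketch (the table access_log and its columns are made-up names, purely for illustration):

-- Hypothetical table: access_log(dt date, grp int, latency_ms float8)  
  
-- Scenario 1: TOP-3 rows per group, using a ranking window function.  
select grp, latency_ms from (  
  select grp, latency_ms,  
         row_number() over (partition by grp order by latency_ms desc) as rn  
  from access_log  
) t  
where rn <= 3;  
  
-- Scenario 6: daily 99th / 95th percentile latency, using ordered-set aggregates.  
select dt,  
       percentile_cont(0.99) within group (order by latency_ms) as p99,  
       percentile_cont(0.95) within group (order by latency_ms) as p95  
from access_log  
group by dt;  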

Sorting really is that important; no wonder so many companies enter computer sorting benchmarks.

PostgreSQL 11 makes a big leap here: it supports parallel sorting.

The parallel sort / index build flow is: parallel scan -> parallel sort (in-memory quicksort runs spilled to an external sort) -> merge the runs and write the index file. For TOP-K queries a top-N heapsort is used instead, with no merge-to-disk phase at all, which makes it extremely fast:

[figure: parallel scan -> parallel sort -> merge / index write flow]

Below are a set of test results, how to control the degree of parallelism of a parallel sort, and a brief look at the underlying mechanics. Two numbers are worth highlighting (and show how quickly PostgreSQL is advancing):

Single instance, TOP-K over 10 billion rows: 40 seconds.  
  
Single instance, index build over 10 billion rows: 26 minutes.  

Environment

64 threads, 512 GB memory, NVMe SSD

CentOS 7.x x64

PostgreSQL 11

Some baseline parameters that help control the degree of parallelism.

max_worker_processes = 128  
max_parallel_workers = 64  
max_parallel_workers_per_gather = 64  
max_parallel_maintenance_workers = 64  
  
# Parallel index builds use an external sort, so work_mem matters little here  
#   (a serial sort needs a big enough work_mem to stay in memory with quicksort;  
#    hash agg and hash join also need a big enough work_mem).  
work_mem='128MB'   
  
# Give each worker at least 32MB, otherwise the actual parallel degree may be lower than expected.  
maintenance_work_mem='16GB'    

Test data sizes: 100 million, 1 billion, and 10 billion rows. The sort column is of type INT8.

Preparing the test data

1. Table structure

-- 100 million  
create unlogged table ttt(id int8, info text);  
  
-- 1 billion  
create unlogged table t(id int8, info text);  
  
-- 10 billion  
create unlogged table tbl(id int8, info text);  

2. Load the data in parallel using asynchronous dblink calls.

postgres=# create extension dblink;  
CREATE EXTENSION  

3. Create a helper function that opens a dblink connection and does not raise an error if it already exists.

create or replace function conn(        
  name,   -- dblink connection name        
  text    -- connection string (URL)        
) returns void as $$          
declare          
begin          
  perform dblink_connect($1, $2);         
  return;          
exception when others then          
  return;          
end;          
$$ language plpgsql strict;          

4. Load 100 million rows in parallel

do language plpgsql $$     
declare        
begin        
for i in 0..99 loop         
  perform conn('link'||i,  'hostaddr=127.0.0.1 port=9999 user=postgres dbname=postgres');         
  perform 1 from dblink_get_result('link'||i) as t(id int, c1 int);        
  perform dblink_send_query('link'||i, format('insert into ttt select generate_series(%s::int8, %s::int8), ''test''', i*1000000::int8+1, (i+1)*1000000::int8));        
end loop;           
end;        
$$;        

5. Load 1 billion rows in parallel

do language plpgsql $$     
declare        
begin        
for i in 0..99 loop         
  perform conn('link'||i,  'hostaddr=127.0.0.1 port=9999 user=postgres dbname=postgres');         
  perform 1 from dblink_get_result('link'||i) as t(id int, c1 int);        
  perform dblink_send_query('link'||i, format('insert into t select generate_series(%s::int8, %s::int8), ''test''', i*10000000::int8+1, (i+1)*10000000::int8));        
end loop;           
end;        
$$;        

6. Load 10 billion rows in parallel

do language plpgsql $$     
declare        
begin        
for i in 0..99 loop         
  perform conn('link'||i,  'hostaddr=127.0.0.1 port=9999 user=postgres dbname=postgres');         
  perform 1 from dblink_get_result('link'||i) as t(id int, c1 int);        
  perform dblink_send_query('link'||i, format('insert into tbl select generate_series(%s::int8, %s::int8), ''test''', i*100000000::int8+1, (i+1)*100000000::int8));        
end loop;           
end;        
$$;        
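
dblink_send_query returns immediately, so the loops above only dispatch the inserts. A minimal sketch (assuming the same 'link'||i connection names as above) to wait for every asynchronous insert to finish and then close the connections:

do language plpgsql $$     
declare        
begin        
for i in 0..99 loop         
  -- dblink_get_result blocks until the pending insert on this connection has completed        
  perform 1 from dblink_get_result('link'||i) as t(res text);        
  perform dblink_disconnect('link'||i);        
end loop;           
end;        
$$;        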

Parallel TOP-K

1. 100 million rows

Non-parallel

postgres=# select * from ttt order by id limit 10;  
 id | info   
----+------  
  1 | test  
  2 | test  
  3 | test  
  4 | test  
  5 | test  
  6 | test  
  7 | test  
  8 | test  
  9 | test  
 10 | test  
(10 rows)  
  
Time: 9374.943 ms (00:09.375)  

Parallel (forcing N workers)

alter table ttt set (parallel_workers =64);  
set min_parallel_index_scan_size =0;  
set min_parallel_table_scan_size =0;  
set parallel_tuple_cost =0;  
set parallel_setup_cost =0;  
  
postgres=# select * from ttt order by id limit 10;  
 id | info   
----+------  
  1 | test  
  2 | test  
  3 | test  
  4 | test  
  5 | test  
  6 | test  
  7 | test  
  8 | test  
  9 | test  
 10 | test  
(10 rows)  
  
Time: 501.723 ms  

Let's look at an execution plan. This query does not write an index to disk, so it is not controlled by max_parallel_maintenance_workers; only the other parallel parameters matter.

postgres=# show max_parallel_maintenance_workers ;  
 max_parallel_maintenance_workers   
----------------------------------  
 0  
(1 row)  
  
postgres=# explain select * from ttt order by id limit 10;  
                                        QUERY PLAN                                          
------------------------------------------------------------------------------------------  
 Limit  (cost=589933.02..589933.35 rows=10 width=13)  
   ->  Gather Merge  (cost=589933.02..3855023.18 rows=100000000 width=13)  
         Workers Planned: 64  
         ->  Sort  (cost=589931.06..593837.31 rows=1562500 width=13)  
               Sort Key: id  
               ->  Parallel Seq Scan on ttt  (cost=0.00..556166.00 rows=1562500 width=13)  
(6 rows)  

2. 1 billion rows

Non-parallel

postgres=# select * from t order by id limit 10;  
 id | info   
----+------  
  1 | test  
  2 | test  
  3 | test  
  4 | test  
  5 | test  
  6 | test  
  7 | test  
  8 | test  
  9 | test  
 10 | test  
(10 rows)  
  
Time: 95162.659 ms (01:35.163)  

Parallel (forcing N workers)

alter table t set (parallel_workers =64);  
set min_parallel_index_scan_size =0;  
set min_parallel_table_scan_size =0;  
set parallel_tuple_cost =0;  
set parallel_setup_cost =0;  
  
postgres=# select * from t order by id limit 10;  
 id | info   
----+------  
  1 | test  
  2 | test  
  3 | test  
  4 | test  
  5 | test  
  6 | test  
  7 | test  
  8 | test  
  9 | test  
 10 | test  
(10 rows)  
  
Time: 5112.469 ms (00:05.112)  

3. 10 billion rows

Non-parallel

postgres=# select * from tbl order by id limit 10;  
 id | info   
----+------  
  1 | test  
  2 | test  
  3 | test  
  4 | test  
  5 | test  
  6 | test  
  7 | test  
  8 | test  
  9 | test  
 10 | test  
(10 rows)  
  
Time: 1107549.801 ms (18:27.550)  

Parallel (forcing N workers)

alter table tbl set (parallel_workers =64);  
set min_parallel_index_scan_size =0;  
set min_parallel_table_scan_size =0;  
set parallel_tuple_cost =0;  
set parallel_setup_cost =0;  
  
postgres=# select * from tbl order by id limit 10;  
 id | info   
----+------  
  1 | test  
  2 | test  
  3 | test  
  4 | test  
  5 | test  
  6 | test  
  7 | test  
  8 | test  
  9 | test  
 10 | test  
(10 rows)  
  
Time: 40033.507 ms (00:40.034)  

Parallel index builds

To guarantee each worker at least 32MB of memory for the parallel sort, maintenance_work_mem must be set large enough (64 workers + 1 leader needs at least 2080MB).

postgres=# select 65*32||' MB';  
 ?column?   
----------  
 2080 MB  
(1 row)  
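
To make that explicit (the environment above already uses 16GB, comfortably over the 2080MB floor):

-- at least (64 workers + 1 leader) * 32MB = 2080MB; these tests simply keep the 16GB environment setting  
set maintenance_work_mem = '16GB';  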

1. 100 million rows

Non-parallel

alter table ttt reset ( parallel_workers );  
set max_parallel_maintenance_workers =0;  
  
postgres=# create index idx_ttt on ttt (id);  
CREATE INDEX  
Time: 26244.362 ms (00:26.244)  

Parallel

alter table ttt set ( parallel_workers=64 );  
set max_parallel_maintenance_workers =64;  
  
postgres=# create index idx_ttt on ttt (id);  
CREATE INDEX  
Time: 15550.082 ms (00:15.550)  

2. 1 billion rows

Non-parallel

alter table t reset ( parallel_workers );  
set max_parallel_maintenance_workers =0;  
  
postgres=# create index idx_t on t(id);  
CREATE INDEX  
Time: 1442771.066 ms (24:02.771)  

Parallel

alter table t set ( parallel_workers=64 );  
set max_parallel_maintenance_workers =64;  
  
postgres=# create index idx_t on t(id);  
CREATE INDEX  
Time: 160773.885 ms (02:40.774)  

3. 10 billion rows

Non-parallel

alter table tbl reset ( parallel_workers );  
set max_parallel_maintenance_workers =0;  
  
postgres=# create index idx_tbl on tbl(id);  
CREATE INDEX  
Time: 7456891.022 ms (02:04:16.891)  

Parallel

alter table tbl set ( parallel_workers=64 );  
set max_parallel_maintenance_workers =64;  
  
postgres=# create index idx on tbl(id) ;  
CREATE INDEX  
Time: 1869483.741 ms (31:09.484)  

Extra - building indexes on partitioned tables

When PostgreSQL builds an index on a partitioned table, it does not build all partition indexes simultaneously (at least, starting all partitions in parallel is not implemented in core yet).

However, with parallel index builds we can control the degree of parallelism used within each partition.

create unlogged table p(id int8, info text) partition by hash(id);   
  
CREATE unlogged TABLE p0 PARTITION OF p FOR VALUES WITH (MODULUS 4, REMAINDER 0);    
CREATE unlogged TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 4, REMAINDER 1);    
CREATE unlogged TABLE p2 PARTITION OF p FOR VALUES WITH (MODULUS 4, REMAINDER 2);    
CREATE unlogged TABLE p3 PARTITION OF p FOR VALUES WITH (MODULUS 4, REMAINDER 3);    
  
insert into p select generate_series(1,100000000), 'test';  

Build the index with 16 parallel workers per partition.

set max_parallel_maintenance_workers =16;  
alter table p0 set ( parallel_workers=16 );  
alter table p1 set ( parallel_workers=16 );  
alter table p2 set ( parallel_workers=16 );  
alter table p3 set ( parallel_workers=16 );  
  
create index idx_p on p(id);  
CREATE INDEX  
Time: 15786.299 ms (00:15.786)  

The elapsed time is about the same as the unpartitioned table with 64 workers. (Why not 64 workers per partition? With 4 partitions, each partition is already fairly small, and 16 workers are plenty.)

Extrapolating from the earlier tests: split the 10-billion-row table into 100 partitions of 100 million rows each, and the index build should take only about 1555 seconds (26 minutes). A sketch of that layout follows.
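
A sketch of that layout (not actually run here; the names p100 / idx_p100 are made up, and the 100 partitions are generated with a DO block):

create unlogged table p100(id int8, info text) partition by hash(id);   
  
do language plpgsql $$     
declare        
begin        
for i in 0..99 loop         
  -- 100 hash partitions, each holding ~100 million of the 10 billion rows        
  execute format('create unlogged table p100_%s partition of p100 for values with (modulus 100, remainder %s)', i, i);        
  -- 16 parallel workers per partition index build        
  execute format('alter table p100_%s set ( parallel_workers=16 )', i);        
end loop;           
end;        
$$;        
  
set max_parallel_maintenance_workers = 16;  
create index idx_p100 on p100(id);  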

How to control the degree of parallelism of a parallel sort

Parallel sorting is governed only by these parameters:

# Worker processes available instance-wide  
max_worker_processes = 128  
  
# Worker processes usable for parallel operations, instance-wide  
max_parallel_workers = 64  
  
# Worker processes per Gather node  
max_parallel_workers_per_gather = 64  
  
# Cost-model parameters that parallel plans depend on  
set parallel_tuple_cost =0;  
set parallel_setup_cost =0;  
  
# Minimum relation size before a parallel scan is considered  
set min_parallel_index_scan_size =0;  
set min_parallel_table_scan_size =0;  
  
# The table size itself is used to compute the parallel degree automatically  
  
# The per-table reloption overrides the degree computed by the cost model  
  
alter table p1 set ( parallel_workers=16 );  

For the exact algorithm that derives the parallel degree, see:

《PostgreSQL 9.6 平行計算 優化器演算法淺析 – 以及如何強制並行度》

How to control the degree of parallelism of a parallel index build

1. Base parameters: make sure enough parallel worker processes can be started

# Worker processes available instance-wide  
max_worker_processes = 128  
  
# Worker processes usable for parallel operations, instance-wide  
max_parallel_workers = 64  

2. Cost model: the parallel degree is computed automatically

# Cost-model parameters that parallel plans depend on  
set parallel_tuple_cost =0;  
set parallel_setup_cost =0;  
  
# Minimum relation size before a parallel scan is considered  
set min_parallel_index_scan_size =0;  
set min_parallel_table_scan_size =0;  
  
# The table size itself is used to compute the parallel degree automatically  

3. Cap on the cost-model-derived degree: the maximum number of parallel workers a single CREATE INDEX statement may use

max_parallel_maintenance_workers  

4. The per-table parallel_workers reloption overrides the degree computed by the cost model (the maintenance_work_mem-based limit no longer applies, though max_parallel_maintenance_workers still acts as an upper cap)

alter table p1 set ( parallel_workers=16 );  

5. Memory budget: each index-build worker needs at least 32MB, and there is also a leader process, so guaranteeing N workers requires (N+1)*32MB.

maintenance_work_mem  

So, to force an index build to use N parallel workers, once condition 1 is satisfied you mainly need to control maintenance_work_mem and the table-level parallel_workers reloption (keeping max_parallel_maintenance_workers at N or higher), as sketched below.
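
For example, a minimal sketch that requests 8 workers for a single index build (table ttt from earlier; the index name idx_force and the value 8 are arbitrary):

set max_parallel_maintenance_workers = 8;      -- cap; must be >= the requested degree  
set maintenance_work_mem = '2GB';              -- comfortably above (8+1)*32MB  
alter table ttt set ( parallel_workers = 8 );  -- bypasses the cost model  
create index idx_force on ttt(id);  
drop index idx_force;                          -- drop the example index again  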

Documentation

See the documentation for the details:

1.

https://www.postgresql.org/docs/devel/static/runtime-config-resource.html#GUC-MAX-PARALLEL-WORKERS-MAINTENANCE

max_parallel_maintenance_workers (integer)  
  
Sets the maximum number of parallel workers that can be started by a single utility command.   
  
Currently, the only parallel utility command that supports the use of parallel workers is CREATE INDEX,   
and only when building a B-tree index.   
  
Parallel workers are taken from the pool of processes established by max_worker_processes,   
limited by max_parallel_workers.   
  
Note that the requested number of workers may not actually be available at runtime.   
  
If this occurs, the utility operation will run with fewer workers than expected.   
  
The default value is 2.   
  
Setting this value to 0 disables the use of parallel workers by utility commands.  
  
Note that parallel utility commands should not consume substantially more memory   
than equivalent non-parallel operations.   
  
This strategy differs from that of parallel query,   
where resource limits generally apply per worker process.   
  
Parallel utility commands treat the resource limit maintenance_work_mem   
as a limit to be applied to the entire utility command,   
regardless of the number of parallel worker processes.   
  
However, parallel utility commands may still consume substantially   
more CPU resources and I/O bandwidth.  

2.

https://www.postgresql.org/docs/devel/static/sql-createindex.html

PostgreSQL can build indexes while leveraging multiple CPUs in order to process the table rows faster.   
  
This feature is known as parallel index build.   
  
For index methods that support building indexes in parallel (currently, only B-tree),   
maintenance_work_mem specifies the maximum amount of memory that can be used by each   
index build operation as a whole, regardless of how many worker processes were started.   
  
Generally, a cost model automatically determines how many worker processes should be requested, if any.  
  
Parallel index builds may benefit from increasing maintenance_work_mem where an   
equivalent serial index build will see little or no benefit.   
  
Note that maintenance_work_mem may influence the number of worker processes requested,   
since parallel workers must have at least a 32MB share of the total maintenance_work_mem budget.   
  
There must also be a remaining 32MB share for the leader process.   
  
Increasing max_parallel_maintenance_workers may allow more workers to be used,   
which will reduce the time needed for index creation,   
so long as the index build is not already I/O bound.   
  
Of course, there should also be sufficient CPU capacity that would otherwise lie idle.  
  
Setting a value for parallel_workers via ALTER TABLE directly controls how many   
parallel worker processes will be requested by a CREATE INDEX against the table.   
  
This bypasses the cost model completely, and prevents maintenance_work_mem from affecting   
how many parallel workers are requested.   
  
Setting parallel_workers to 0 via ALTER TABLE will disable parallel index builds on the table in all cases.  

Tracing the sort with log output

Enable trace_sort to trace the sort process. Example:

postgres=# set trace_sort =on;  
SET  
postgres=# set client_min_messages =log;  
SET  
  
set maintenance_work_mem='8GB';  
alter table ttt set ( parallel_workers=4 );  
  
postgres=# drop index idx_ttt;  
DROP INDEX  
postgres=# create index idx_ttt on ttt(id);  
LOG:  begin index sort: unique = f, workMem = 1677721, randomAccess = f  
LOG:  begin index sort: unique = f, workMem = 1677721, randomAccess = f  
LOG:  begin index sort: unique = f, workMem = 1677721, randomAccess = f  
LOG:  begin index sort: unique = f, workMem = 1677721, randomAccess = f  
LOG:  begin index sort: unique = f, workMem = 1677721, randomAccess = f  
LOG:  performsort of 0 starting: CPU: user: 3.90 s, system: 0.51 s, elapsed: 4.41 s  
LOG:  performsort of 4 starting: CPU: user: 3.74 s, system: 0.66 s, elapsed: 4.40 s  
LOG:  4 switching to external sort with 7 tapes: CPU: user: 3.74 s, system: 0.66 s, elapsed: 4.40 s  
LOG:  performsort of 3 starting: CPU: user: 3.78 s, system: 0.62 s, elapsed: 4.40 s  
LOG:  3 switching to external sort with 7 tapes: CPU: user: 3.78 s, system: 0.62 s, elapsed: 4.40 s  
LOG:  performsort of 2 starting: CPU: user: 3.74 s, system: 0.66 s, elapsed: 4.41 s  
LOG:  2 switching to external sort with 7 tapes: CPU: user: 3.74 s, system: 0.66 s, elapsed: 4.41 s  
LOG:  performsort of 1 starting: CPU: user: 3.76 s, system: 0.64 s, elapsed: 4.41 s  
LOG:  1 switching to external sort with 7 tapes: CPU: user: 3.76 s, system: 0.64 s, elapsed: 4.41 s  
LOG:  1 starting quicksort of run 1: CPU: user: 3.76 s, system: 0.65 s, elapsed: 4.41 s  
LOG:  0 switching to external sort with 7 tapes: CPU: user: 3.90 s, system: 0.51 s, elapsed: 4.41 s  
LOG:  3 starting quicksort of run 1: CPU: user: 3.78 s, system: 0.62 s, elapsed: 4.40 s  
LOG:  2 starting quicksort of run 1: CPU: user: 3.74 s, system: 0.66 s, elapsed: 4.41 s  
LOG:  0 starting quicksort of run 1: CPU: user: 3.90 s, system: 0.51 s, elapsed: 4.41 s  
LOG:  4 starting quicksort of run 1: CPU: user: 3.74 s, system: 0.66 s, elapsed: 4.40 s  
LOG:  1 finished quicksort of run 1: CPU: user: 3.90 s, system: 0.65 s, elapsed: 4.55 s  
LOG:  3 finished quicksort of run 1: CPU: user: 3.92 s, system: 0.62 s, elapsed: 4.55 s  
LOG:  2 finished quicksort of run 1: CPU: user: 3.88 s, system: 0.66 s, elapsed: 4.55 s  
LOG:  4 finished quicksort of run 1: CPU: user: 3.88 s, system: 0.66 s, elapsed: 4.55 s  
LOG:  0 finished quicksort of run 1: CPU: user: 4.05 s, system: 0.51 s, elapsed: 4.56 s  
LOG:  0 finished writing run 1 to tape 0: CPU: user: 5.04 s, system: 1.88 s, elapsed: 6.91 s  
LOG:  4 finished writing run 1 to tape 0: CPU: user: 4.57 s, system: 1.81 s, elapsed: 6.38 s  
LOG:  performsort of 4 done: CPU: user: 4.57 s, system: 1.85 s, elapsed: 6.43 s  
LOG:  parallel external sort of 4 ended, 48940 disk blocks used: CPU: user: 4.57 s, system: 1.85 s, elapsed: 6.43 s  
LOG:  3 finished writing run 1 to tape 0: CPU: user: 4.61 s, system: 1.79 s, elapsed: 6.41 s  
LOG:  performsort of 3 done: CPU: user: 4.61 s, system: 1.83 s, elapsed: 6.45 s  
LOG:  parallel external sort of 3 ended, 48363 disk blocks used: CPU: user: 4.61 s, system: 1.83 s, elapsed: 6.45 s  
LOG:  2 finished writing run 1 to tape 0: CPU: user: 4.58 s, system: 1.78 s, elapsed: 6.37 s  
LOG:  performsort of 2 done: CPU: user: 4.58 s, system: 1.82 s, elapsed: 6.41 s  
LOG:  parallel external sort of 2 ended, 48558 disk blocks used: CPU: user: 4.58 s, system: 1.82 s, elapsed: 6.41 s  
LOG:  1 finished writing run 1 to tape 0: CPU: user: 4.58 s, system: 1.82 s, elapsed: 6.41 s  
LOG:  performsort of 1 done: CPU: user: 4.58 s, system: 1.86 s, elapsed: 6.44 s  
LOG:  parallel external sort of 1 ended, 48234 disk blocks used: CPU: user: 4.58 s, system: 1.86 s, elapsed: 6.44 s  
LOG:  performsort of 0 done: CPU: user: 5.04 s, system: 1.91 s, elapsed: 6.95 s  
LOG:  parallel external sort of 0 ended, 50526 disk blocks used: CPU: user: 5.04 s, system: 1.91 s, elapsed: 6.95 s  
LOG:  begin index sort: unique = f, workMem = 8388608, randomAccess = f  
LOG:  performsort of -1 starting: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s  
LOG:  -1 using 8388553 KB of memory for read buffers among 5 input tapes  
LOG:  performsort of -1 done (except 5-way final merge): CPU: user: 0.11 s, system: 1.08 s, elapsed: 1.20 s  
LOG:  parallel external sort of -1 ended, 244621 disk blocks used: CPU: user: 8.52 s, system: 3.98 s, elapsed: 12.49 s  
CREATE INDEX  

Note that the workMem values shown above are all derived from maintenance_work_mem, divided by (parallel degree + 1).

postgres=# select 8*1024*1024/5||' KB';  
  ?column?    
------------  
 1677721 KB  
(1 row)  

The size of maintenance_work_mem has a major impact on parallel index build performance.

Inspecting index pages - pageinspect

To examine an index's internal structure and statistics, use the pageinspect extension.

《深入淺出PostgreSQL B-Tree索引結構》

https://www.postgresql.org/docs/devel/static/pageinspect.html

《Use pageinspect EXTENSION view PostgreSQL Page's raw infomation》

postgres=# create extension pageinspect ;  
CREATE EXTENSION  
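
For orientation, a minimal example against the idx_ttt index built earlier (bt_metap and bt_page_stats are standard pageinspect functions):

-- B-tree metapage: root block number, tree level, etc.  
select * from bt_metap('idx_ttt');  
  
-- statistics for a single index page (block 1 here)  
select * from bt_page_stats('idx_ttt', 1);  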

Full usage is not repeated here; see the documents above.

Sort methods

PostgreSQL 11 has four built-in sort methods.

src/backend/utils/sort/tuplesort.c

/*  
 * Convert TuplesortMethod to a string.  
 */  
const char *  
tuplesort_method_name(TuplesortMethod m)  
{  
        switch (m)  
        {  
                case SORT_TYPE_STILL_IN_PROGRESS:  
                        return "still in progress";  
                case SORT_TYPE_TOP_N_HEAPSORT:  
                        return "top-N heapsort";  
                case SORT_TYPE_QUICKSORT:  
                        return "quicksort";  
                case SORT_TYPE_EXTERNAL_SORT:  
                        return "external sort";  
                case SORT_TYPE_EXTERNAL_MERGE:  
                        return "external merge";  
        }  
  
        return "unknown";  
}  
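
To see which method a given query actually used, EXPLAIN (ANALYZE) reports it on the Sort node. For example, against the earlier test table:

-- the plan output contains a "Sort Method:" line, e.g. top-N heapsort for this  
-- LIMIT query, or external merge when a sort spills to disk  
explain (analyze, buffers)  
select * from ttt order by id limit 10;  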

Summary

1. Space usage

Rows         Schema      Table size   Index size (parallel build)
100 million  INT8+TEXT   4.2 GB       2.1 GB
1 billion    INT8+TEXT   41 GB        21 GB
10 billion   INT8+TEXT   413 GB       209 GB

2. TOP-K

Hardware        Rows         Non-parallel TOP-K   64-way parallel TOP-K   Speed-up
64-thread host  100 million  9.375 s              0.5 s                   18.75x
64-thread host  1 billion    95.16 s              5.112 s                 18.615x
64-thread host  10 billion   1107.55 s            40 s                    27.69x

TOP-K uses the top-N heapsort method.

Sort Method: top-N heapsort  Memory: 25kB  

3. Index builds

maintenance_work_mem='16GB';  
Hardware        Rows         Non-parallel build   64-way parallel build   Speed-up
64-thread host  100 million  26.244 s             15.55 s                 1.69x
64-thread host  1 billion    1442.7 s             160.77 s                8.97x
64-thread host  10 billion   7456.9 s             1869.5 s                4x

Parallel index builds use the external sorting algorithm (external sort + external merge).

Sort Method: external sort  

Because a parallel index build goes through parallel scan -> parallel sort -> merge (during the merge phase only the leader process is working), the time does not scale as if 100 million rows with one process equalled 6.4 billion rows with 64 workers.

Without parallelism, sort time grows faster than linearly with the row count (26 -> 1442 -> 7456 seconds).

With parallelism, the time grows roughly linearly with the row count, so index build times become predictable (16 -> 161 -> 1870 seconds).

Notably, at 10 billion rows (a 31-minute build) the parallel phase already puts pressure on I/O (all processes together write more than 400 GB of files), yet the merge phase still accounts for the largest share of the time.

Every major PostgreSQL release brings genuinely impressive progress. A single instance can now build an index over 10 billion rows in roughly 26 minutes (partitioned table) or 31 minutes (single table).

The size of maintenance_work_mem has a considerable impact on parallel index build performance.

For now PostgreSQL only supports parallel builds of B-tree indexes. Hopefully parallel builds for the other index access methods (gin, hash, gist, sp-gist, brin, bloom) will follow.

《PostgreSQL 9種索引的原理和應用場景》

References

1. src/backend/utils/sort/tuplesort.c

 * This module supports parallel sorting.  Parallel sorts involve coordination  
 * among one or more worker processes, and a leader process, each with its own  
 * tuplesort state.  The leader process (or, more accurately, the  
 * Tuplesortstate associated with a leader process) creates a full tapeset  
 * consisting of worker tapes with one run to merge; a run for every  
 * worker process.  This is then merged.  Worker processes are guaranteed to  
 * produce exactly one output run from their partial input.  

2. src/include/utils/tuplesort.h

A step-by-step breakdown of how a parallel sort for an index build is performed (from the header comments):

 * Callers must do the following to perform a sort in parallel using multiple  
 * worker processes:  
 *  
 * 1. Request tuplesort-private shared memory for n workers.  Use  
 *    tuplesort_estimate_shared() to get the required size.  
 * 2. Have leader process initialize allocated shared memory using  
 *    tuplesort_initialize_shared().  Launch workers.  
 * 3. Initialize a coordinate argument within both the leader process, and  
 *    for each worker process.  This has a pointer to the shared  
 *    tuplesort-private structure, as well as some caller-initialized fields.  
 *    Leader's coordinate argument reliably indicates number of workers  
 *    launched (this is unused by workers).  
 * 4. Begin a tuplesort using some appropriate tuplesort_begin* routine,  
 *    (passing the coordinate argument) within each worker.  The workMem  
 *    arguments need not be identical.  All other arguments should match  
 *    exactly, though.  
 * 5. tuplesort_attach_shared() should be called by all workers.  Feed tuples  
 *    to each worker, and call tuplesort_performsort() within each when input  
 *    is exhausted.  
 * 6. Call tuplesort_end() in each worker process.  Worker processes can shut  
 *    down once tuplesort_end() returns.  
 * 7. Begin a tuplesort in the leader using the same tuplesort_begin*  
 *    routine, passing a leader-appropriate coordinate argument (this can  
 *    happen as early as during step 3, actually, since we only need to know  
 *    the number of workers successfully launched).  The leader must now wait  
 *    for workers to finish.  Caller must use own mechanism for ensuring that  
 *    next step isn't reached until all workers have called and returned from  
 *    tuplesort_performsort().  (Note that it's okay if workers have already  
 *    also called tuplesort_end() by then.)  
 * 8. Call tuplesort_performsort() in leader.  Consume output using the  
 *    appropriate tuplesort_get* routine.  Leader can skip this step if  
 *    tuplesort turns out to be unnecessary.  
 * 9. Call tuplesort_end() in leader.  

3. 《PostgreSQL 9.6 平行計算 優化器演算法淺析 – 以及如何強制並行度》

4. https://commitfest.postgresql.org/16/690/

