PostgreSQLjdbcbatchinsert

德哥發表於2017-03-29

標籤

PostgreSQL , jdbc , batch , addbatch , executebatch , insert


背景

如何快速的將資料匯入資料庫?

比如ETL程式,資料還原程式,資料初始化,資料同步等場景都會有這樣的訴求。

從幾個方面來分析

1. 統計資訊

PostgreSQL會自動統計表的統計資訊,包括柱狀圖等。會有一定的開銷,所以在做批量匯入時,可以先關閉表的autovacuum.

2. 索引

構造索引,會有一定的CPU和IO開銷,影響匯入速度,所以可以在資料匯入後再建索引。

3. 約束

約束會耗費一定的CPU,也可以在資料匯入後再加索引。

4. 檢查點

資料庫匯入勢必產生WAL日誌,如果WAL很多,可能會產生檢查點,影響IO。因此可以把檢查點的時間拉長(設定ckpt target, timeout)。

5. COMMIT fsync

如果是單條INSERT的匯入方式,並且使用了autocommit,那麼每條COMMIT都會導致FSYNC,flush wal。從而降低速度。

可以設定為非同步提交,降低RT。

或者使用批量提交,例如100條提交一次。

6. parser開銷

insert需要parser, 即使使用繫結變數,也有諸多tuple form的工作。

使用copy 介面可以提高效率。

7. 互動次數

如果使用的是單條insert,也有批量的方式提高速度,例如insert into table values (),(),(),…..;

相比以下,減少了互動次數,可以大幅提升效能,與COPY效率類似。

begin;  
insert into table values ();  
insert into table values ();  
...  
end;  

8. 10.0即將推出的一個新特性,將允許libpq有批量提交QUERY的功能。

《PostgreSQL 10.0 preview 效能增強 – libpq支援pipeline batch模式減少網路互動提升效能》

本文將要講一下jdbc的batch insert.

jdbc batch insert

目前,使用jdbc prepared statement insert與addbatch, execute batch,你可能會發現,最後SQL還是沒有封裝成這樣insert into table values (),(),(),…..;而是一條一條的insert。

社群也有類似的討論帖子

Hi,  
  
As document said, in the extended query protocol, "The query string  
contained in a Parse message cannot include more than one SQL  
statement".  
  
So to support batch in prepared statement, I think the only way is to  
determine the batch size in advance and then create the appropriate  
prepared statement, e.g.  
  
Given the batch size is fixed to 3, then prepare below statement:  
-- create table foobar(a int, b text);  
insert into foobar values($1, $2), ($3, $4), ($5, $6);  
  
Then this prepared statement must be bound with 3 set of arguments.  
  
The limitation is obvious: the batch size is fixed, so if you need to  
do batch with size of 4, the previous prepared statement is useless  
and you need to recreate it.  
  
On the other hand, in JDBC, it seems that you just need to prepare  
below statement:  
  
insert into foobar values($1, $2);  
  
And then calls addBatch() repeatedly until you think the batch size is enough.  
  
What`s the final statement does postgresql jdbc driver convert to? I`m  
so curious.  
  
I`m not familiar with jdbc, and although I check the driver source  
codes, but I still cannot understand it.  
  
Anybody knows the answer? Thanks.  
  
  
Regards,  
Jinhua Luo  

github裡面有一個JDBC batch insert rewrite功能,可以將single batch轉換為multi-row batch

https://github.com/whitingjr/batch-rewrite-statements-perf

The first measures INSERT statements in a single batch  
  
 batch begin  
  | INSERT  
  | INSERT  
  | INSERT  
  | INSERT  
  | n INSERT  
 batch end  
the second uses an individual multi-row INSERT statement.  
  
INSERT INTO orderline VALUES (?,?),(?,?),(?,?),(?,?),(n,n)  
Both types has 3 individual benchmarks with varying numbers of statement/row. There is a benchmark called SMALL, MEDIUM and LARGE. The count for each is configurable. See Configuration section later for details.  

下面是測試對比,很顯然multi-row的效果好很多。

pic

pic

pic

建議要麼使用multi-row batch,要麼使用copy.

jdbc 版本

可以看到multi-row batch已經支援了,在這

https://github.com/pgjdbc/pgjdbc/commit/510e6e0fec5e3a44ffda854f5cf808ce9f72ee2e

fix: improve insert values(...) batch rewrite
Now it supports VALUES (?,1), and splits very long batch into series of smaller ones
to keep overall number of bind variables <32768 (that is PG`s limit in v3 protocol)

Trigger property: reWriteBatchedInserts=true. Default value is still "false".

closes #580
closes #584

用法

reWriteBatchedInserts=true

參考

《PostgreSQL 10.0 preview 效能增強 – libpq支援pipeline batch模式減少網路互動提升效能》

https://jdbc.postgresql.org/

https://github.com/pgjdbc/pgjdbc/

https://www.postgresql.org/message-id/55130DC8.2070508@redhat.com

https://github.com/pgjdbc/pgjdbc/pull/491

https://www.postgresql.org/message-id/flat/CAAc9rOwTZ3d6%3DYUV-vJPndebVyUGAz_Pk8WV1fYkmpVykttLug%40mail.gmail.com#CAAc9rOwTZ3d6=YUV-vJPndebVyUGAz_Pk8WV1fYkmpVykttLug@mail.gmail.com

https://jdbc.postgresql.org/documentation/publicapi/index.html