如何遏制 PostgreSQL WAL 的瘋狂增長

發表於2017-11-07

SQL

前言

PostgreSQL在寫入頻繁的場景中，可能會產生大量的WAL日誌，而且WAL日誌量遠遠超過實際更新的資料量。我們可以把這種現象起個名字,叫做“WAL寫放大”，造成WAL寫放大的主要原因有2點。

在checkpoint之後第一次修改頁面，需要在WAL中輸出整個page，即全頁寫(full page writes)。全頁寫的目的是防止在意外當機時出現的資料塊部分寫導致資料庫無法恢復。
更新記錄時如果新記錄位置(ctid)發生變更，索引記錄也要相應變更，這個變更也要記入WAL。更嚴重的是索引記錄的變更又有可能導致索引頁的全頁寫，進一步加劇了WAL寫放大。

過量的WAL輸出會對系統資源造成很大的消耗，因此需要進行適當的優化。

磁碟IO
WAL寫入是順序寫,通常情況下硬碟對付WAL的順序寫入是綽綽有餘的。所以一般可以忽略。
網路IO
對區域網內的複製估計還不算問題，遠端複製就難說了。
磁碟空間
如果做WAL歸檔，需要的磁碟空間也是巨大的。

WAL記錄的構成

每條WAL記錄的構成大致如下：

src/include/access/xlogrecord.h:

 * The overall layout of an XLOG record is:
 *		Fixed-size header (XLogRecord struct)
 *		XLogRecordBlockHeader struct
 *		XLogRecordBlockHeader struct
 *		...
 *		XLogRecordDataHeader[Short|Long] struct
 *		block data
 *		block data
 *		...
 *		main data

* The overall layout of an XLOG record is:

* Fixed-size header (XLogRecord struct)

* XLogRecordBlockHeader struct

* ...

* XLogRecordDataHeader[Short|Long] struct

* block data

* ...

* main data

主要佔空間是上面的”block data”，再往上的XLogRecordBlockHeader是”block data”的後設資料。一條WAL記錄可能不涉及資料塊，也可能涉及多個資料塊，因此WAL記錄中可能沒有”block data”也可能有多個”block data”。

“block data”的內容可能是下面幾種情況之一

full page image
如果是checkpoint之後第一次修改頁面，則輸出整個page的內容(即full page image，簡稱FPI)。但是page中沒有資料的hole部分會被排除，如果設定了wal_compression = on還會對這page上的資料進行壓縮。
buffer data
不需要輸出FPI時，就只輸出page中指定的資料。
full page image + buffer data
邏輯複製時，即使輸出了FPI，也要輸出指定的資料。

究竟”block data”中存的是什麼內容，通過前面的XLogRecordBlockHeader中的fork_flags進行描述。這裡的XLogRecordBlockHeader其實也只是個概括的說法，實際上後面還跟了一些其它的Header。完整的結構如下:

XLogRecordBlockHeader 
XLogRecordBlockImageHeader (可選，包含FPI時存在）
XLogRecordBlockCompressHeader (可選，對FPI壓縮時存在)
RelFileNode  (可選，和之前的"block data"的file node不一樣時才存在)
BlockNumber

XLogRecordBlockHeader

XLogRecordBlockImageHeader (可選，包含FPI時存在）

XLogRecordBlockCompressHeader (可選，對FPI壓縮時存在)

RelFileNode (可選，和之前的"block data"的file node不一樣時才存在)

BlockNumber

下面以insert作為例子說明。

src/backend/access/heap/heapam.c:

Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
			int options, BulkInsertState bistate)
{
		...
		xl_heap_insert xlrec;
		xl_heap_header xlhdr;
		...

		xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
		...

		XLogBeginInsert();
		XLogRegisterData((char *) &xlrec, SizeOfHeapInsert); //1)記錄tuple的位置到WAL記錄裡的"main data"。

		xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
		xlhdr.t_infomask = heaptup->t_data->t_infomask;
		xlhdr.t_hoff = heaptup->t_data->t_hoff;

		/*
		 * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
		 * write the whole page to the xlog, we don't need to store
		 * xl_heap_header in the xlog.
		 */
		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
		XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);//2)記錄tuple的head到WAL記錄裡的"block data"。
		/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
		XLogRegisterBufData(0,
							(char *) heaptup->t_data + SizeofHeapTupleHeader,
							heaptup->t_len - SizeofHeapTupleHeader);//3)記錄tuple的內容到WAL記錄裡的"block data"。
		...
}

Oid

heap_insert(Relation relation, HeapTuple tup, CommandId cid,

int options, BulkInsertState bistate)

{

...

xl_heap_insert xlrec;

xl_heap_header xlhdr;

...

xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);

...

XLogBeginInsert();

XLogRegisterData((char *) &xlrec, SizeOfHeapInsert); //1)記錄tuple的位置到WAL記錄裡的"main data"。

xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;

xlhdr.t_infomask = heaptup->t_data->t_infomask;

xlhdr.t_hoff = heaptup->t_data->t_hoff;

* note we mark xlhdr as belonging to buffer; if XLogInsert decides to

* write the whole page to the xlog, we don't need to store

* xl_heap_header in the xlog.

XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);

XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);//2)記錄tuple的head到WAL記錄裡的"block data"。

/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */

XLogRegisterBufData(0,

(char *) heaptup->t_data + SizeofHeapTupleHeader,

heaptup->t_len - SizeofHeapTupleHeader);//3)記錄tuple的內容到WAL記錄裡的"block data"。

...

}

WAL的解析

PostgreSQL的安裝目錄下有個叫做pg_xlogdump的命令可以解析WAL檔案，下面看一個例子。

-bash-4.1$ pg_xlogdump /pgsql/data/pg_xlog/0000000100000555000000D5 -b
...
rmgr: Heap        len (rec/tot):     14/   171, tx:  301170263, lsn: 555/D5005080, prev 555/D50030A0, desc: UPDATE off 30 xmax 301170263 ; new off 20 xmax 0
	blkref #0: rel 1663/13269/54349226 fork main blk 1640350
	blkref #1: rel 1663/13269/54349226 fork main blk 1174199
...

-bash-4.1$ pg_xlogdump /pgsql/data/pg_xlog/0000000100000555000000D5 -b

...

rmgr: Heap len (rec/tot): 14/ 171, tx: 301170263, lsn: 555/D5005080, prev 555/D50030A0, desc: UPDATE off 30 xmax 301170263 ; new off 20 xmax 0

blkref #0: rel 1663/13269/54349226 fork main blk 1640350

blkref #1: rel 1663/13269/54349226 fork main blk 1174199

...

這條WAL記錄的解釋如下:

rmgr: Heap
PostgreSQL內部將WAL日誌歸類到20多種不同的資源管理器。這條WAL記錄所屬資源管理器為Heap,即堆表。除了Heap還有Btree,Transaction等。
len (rec/tot): 14/ 171
WAL記錄的總長度是171位元組，其中main data部分是14位元組(只計數main data可能並不合理，本文的後面會有說明)。
tx: 301170263
事務號
lsn: 555/D5005080
本WAL記錄的LSN
prev 555/D50030A0
上條WAL記錄的LSN
desc: UPDATE off 30 xmax 301170263 ; new off 20 xmax 0
這是一條UPDATE型別的記錄(每個資源管理器最多包含16種不同的WAL記錄型別，)，舊tuple在page中的位置為30(即ctid的後半部分)，新tuple在page中的位置為20。

blkref #0: rel 1663/13269/54349226 fork main blk 1640350
引用的第一個page(新tuple所在page)所屬的堆表檔案為1663/13269/54349226,塊號為1640350(即ctid的前半部分)。通過oid2name可以查到是哪個堆表。

 -bash-4.1$ oid2name -f 54349226
  From database "postgres":
    Filenode        Table Name
  ----------------------------
    54349226  pgbench_accounts

-bash-4.1$ oid2name -f 54349226

From database "postgres":

Filenode Table Name

----------------------------

54349226 pgbench_accounts

blkref #1: rel 1663/13269/54349226 fork main blk 1174199
引用的第二個page(舊tuple所在page)所屬的堆表檔案及塊號

UPDATE語句除了產生UPDATE型別的WAL記錄，實際上還會在前面產生一條LOCK記錄，可選的還可能在後面產生若干索引更新的WAL記錄。

-bash-4.1$ pg_xlogdump /pgsql/data/pg_xlog/0000000100000555000000D5 -b
...
rmgr: Heap        len (rec/tot):      8/  8135, tx:  301170263, lsn: 555/D50030A0, prev 555/D5001350, desc: LOCK off 30: xid 301170263: flags 0 LOCK_ONLY EXCL_LOCK 
	blkref #0: rel 1663/13269/54349226 fork main blk 1174199 (FPW); hole: offset: 268, length: 116

rmgr: Heap        len (rec/tot):     14/   171, tx:  301170263, lsn: 555/D5005080, prev 555/D50030A0, desc: UPDATE off 30 xmax 301170263 ; new off 20 xmax 0
	blkref #0: rel 1663/13269/54349226 fork main blk 1640350
	blkref #1: rel 1663/13269/54349226 fork main blk 1174199
...

-bash-4.1$ pg_xlogdump /pgsql/data/pg_xlog/0000000100000555000000D5 -b

...

rmgr: Heap len (rec/tot): 8/ 8135, tx: 301170263, lsn: 555/D50030A0, prev 555/D5001350, desc: LOCK off 30: xid 301170263: flags 0 LOCK_ONLY EXCL_LOCK

blkref #0: rel 1663/13269/54349226 fork main blk 1174199 (FPW); hole: offset: 268, length: 116

rmgr: Heap len (rec/tot): 14/ 171, tx: 301170263, lsn: 555/D5005080, prev 555/D50030A0, desc: UPDATE off 30 xmax 301170263 ; new off 20 xmax 0

blkref #0: rel 1663/13269/54349226 fork main blk 1640350

blkref #1: rel 1663/13269/54349226 fork main blk 1174199

...

上面的LOCK記錄的例子中，第一個引用page裡有PFW標識，表示包含FPI，這也是這條WAL記錄長度很大的原因。後面的hole: offset: 268, length: 116表示page中包含hole,以及這個hole的偏移位置和長度。可以算出FPI的大小為8196-116=8080, WAL記錄中除FPI以外的資料長度8135-8080=55。

WAL的統計

PostgreSQL 9.5以後的pg_xlogdump都帶有統計功能，可以檢視不同型別的WAL記錄的數量，大小以及FPI的比例。例子如下:

postgres.conf配置

下面是一個未經特別優化的配置

shared_buffers = 32GB
checkpoint_completion_target = 0.9
checkpoint_timeout = 5min
min_wal_size = 1GB
max_wal_size = 4GB
full_page_writes = on
wal_log_hints = on
wal_level = replica
wal_keep_segments = 1000

shared_buffers = 32GB

checkpoint_completion_target = 0.9

checkpoint_timeout = 5min

min_wal_size = 1GB

max_wal_size = 4GB

full_page_writes = on

wal_log_hints = on

wal_level = replica

wal_keep_segments = 1000

測試

先手動執行checkpoint，再利用pgbench做一個10秒鐘的壓測

-bash-4.1$ psql -c "checkpoint;select pg_switch_xlog(),pg_current_xlog_location()"
 pg_switch_xlog | pg_current_xlog_location 
----------------+--------------------------
 556/48000270   | 556/49000000
(1 row)

-bash-4.1$ pgbench -n -c 64 -j 64 -T 10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 64
number of threads: 64
duration: 10 s
number of transactions actually processed: 123535
latency average = 5.201 ms
tps = 12304.460572 (including connections establishing)
tps = 12317.916235 (excluding connections establishing)

-bash-4.1$ psql -c "select pg_current_xlog_location()"
 pg_current_xlog_location 
--------------------------
 556/B8B40CA0
(1 row)

-bash-4.1$ psql -c "checkpoint;select pg_switch_xlog(),pg_current_xlog_location()"

pg_switch_xlog | pg_current_xlog_location

----------------+--------------------------

556/48000270 | 556/49000000

(1 row)

-bash-4.1$ pgbench -n -c 64 -j 64 -T 10

transaction type: <builtin: TPC-B (sort of)>

scaling factor: 1000

query mode: simple

number of clients: 64

number of threads: 64

duration: 10 s

number of transactions actually processed: 123535

latency average = 5.201 ms

tps = 12304.460572 (including connections establishing)

tps = 12317.916235 (excluding connections establishing)

-bash-4.1$ psql -c "select pg_current_xlog_location()"

pg_current_xlog_location

--------------------------

556/B8B40CA0

(1 row)

日誌統計

統計壓測期間產生的WAL

-bash-4.1$ pg_xlogdump --stats=record -s 556/49000000 -e 556/B8B40CA0
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
XLOG/FPI_FOR_HINT                            650 (  0.06)                15600 (  0.05)              5262532 (  0.29)              5278132 (  0.29)
Transaction/COMMIT                        123535 ( 11.54)              3953120 ( 11.46)                    0 (  0.00)              3953120 (  0.22)
CLOG/ZEROPAGE                                  4 (  0.00)                  112 (  0.00)                    0 (  0.00)                  112 (  0.00)
Standby/RUNNING_XACTS                          2 (  0.00)                  232 (  0.00)                    0 (  0.00)                  232 (  0.00)
Heap/INSERT                               122781 ( 11.47)              3315087 (  9.61)              1150064 (  0.06)              4465151 (  0.25)
Heap/UPDATE                               220143 ( 20.57)              8365434 ( 24.24)              1110312 (  0.06)              9475746 (  0.52)
Heap/HOT_UPDATE                           147169 ( 13.75)              5592422 ( 16.21)               275568 (  0.02)              5867990 (  0.32)
Heap/LOCK                                 228031 ( 21.31)              7296992 ( 21.15)            975914004 ( 54.70)            983210996 ( 54.06)
Heap/INSERT+INIT                             754 (  0.07)                20358 (  0.06)                    0 (  0.00)                20358 (  0.00)
Heap/UPDATE+INIT                            3293 (  0.31)               125134 (  0.36)                    0 (  0.00)               125134 (  0.01)
Btree/INSERT_LEAF                         223003 ( 20.84)              5798078 ( 16.80)            800409940 ( 44.86)            806208018 ( 44.33)
Btree/INSERT_UPPER                           433 (  0.04)                11258 (  0.03)                32576 (  0.00)                43834 (  0.00)
Btree/SPLIT_L                                218 (  0.02)                 6976 (  0.02)                26040 (  0.00)                33016 (  0.00)
Btree/SPLIT_R                                216 (  0.02)                 6912 (  0.02)                27220 (  0.00)                34132 (  0.00)
                                        --------                      --------                      --------                      --------
Total                                    1070232                      34507715 [1.90%]            1784208256 [98.10%]           1818715971 [100%]

-bash-4.1$ pg_xlogdump --stats=record -s 556/49000000 -e 556/B8B40CA0

Type N (%) Record size (%) FPI size (%) Combined size (%)

---- - --- ----------- --- -------- --- ------------- ---

XLOG/FPI_FOR_HINT 650 ( 0.06) 15600 ( 0.05) 5262532 ( 0.29) 5278132 ( 0.29)

Transaction/COMMIT 123535 ( 11.54) 3953120 ( 11.46) 0 ( 0.00) 3953120 ( 0.22)

CLOG/ZEROPAGE 4 ( 0.00) 112 ( 0.00) 0 ( 0.00) 112 ( 0.00)

Standby/RUNNING_XACTS 2 ( 0.00) 232 ( 0.00) 0 ( 0.00) 232 ( 0.00)

Heap/INSERT 122781 ( 11.47) 3315087 ( 9.61) 1150064 ( 0.06) 4465151 ( 0.25)

Heap/UPDATE 220143 ( 20.57) 8365434 ( 24.24) 1110312 ( 0.06) 9475746 ( 0.52)

Heap/HOT_UPDATE 147169 ( 13.75) 5592422 ( 16.21) 275568 ( 0.02) 5867990 ( 0.32)

Heap/LOCK 228031 ( 21.31) 7296992 ( 21.15) 975914004 ( 54.70) 983210996 ( 54.06)

Heap/INSERT+INIT 754 ( 0.07) 20358 ( 0.06) 0 ( 0.00) 20358 ( 0.00)

Heap/UPDATE+INIT 3293 ( 0.31) 125134 ( 0.36) 0 ( 0.00) 125134 ( 0.01)

Btree/INSERT_LEAF 223003 ( 20.84) 5798078 ( 16.80) 800409940 ( 44.86) 806208018 ( 44.33)

Btree/INSERT_UPPER 433 ( 0.04) 11258 ( 0.03) 32576 ( 0.00) 43834 ( 0.00)

Btree/SPLIT_L 218 ( 0.02) 6976 ( 0.02) 26040 ( 0.00) 33016 ( 0.00)

Btree/SPLIT_R 216 ( 0.02) 6912 ( 0.02) 27220 ( 0.00) 34132 ( 0.00)

-------- -------- -------- --------

Total 1070232 34507715 [1.90%] 1784208256 [98.10%] 1818715971 [100%]

這個統計結果顯示FPI的比例佔到了98.10%。但是這個資料並不準確，因為上面的Record size只包含了WAL記錄中”main data”的大小，Combined size則是”main data”與FPI的合計，漏掉了FPI以外的”block data”。這是一個Bug，社群正在進行修復，參考BUG #14687

作為臨時對策，可以在pg_xlogdump.c中新增了一行程式碼，重新計算Record size使之等於WAL總記錄長度減去FPI的大小。為便於區分，修改後編譯的二進位制檔案改名為pg_xlogdump_ex。

src/bin/pg_xlogdump/pg_xlogdump.c：

fpi_len = 0;
for (block_id = 0; block_id <= record->max_block_id; block_id++)
{
	if (XLogRecHasBlockImage(record, block_id))
		fpi_len += record->blocks[block_id].bimg_len;
}
rec_len = XLogRecGetTotalLen(record) - fpi_len;/* 新增這一行，重新計算rec_len */

fpi_len = 0;

for (block_id = 0; block_id <= record->max_block_id; block_id++)

{

if (XLogRecHasBlockImage(record, block_id))

fpi_len += record->blocks[block_id].bimg_len;

}

rec_len = XLogRecGetTotalLen(record) - fpi_len;/* 新增這一行，重新計算rec_len */

修改後，重新統計WAL的結果如下:

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 556/49000000 -e 556/B8B40CA0
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
XLOG/FPI_FOR_HINT                            650 (  0.06)                31850 (  0.04)              5262532 (  0.29)              5294382 (  0.28)
Transaction/COMMIT                        123535 ( 11.54)              4200190 (  5.14)                    0 (  0.00)              4200190 (  0.23)
CLOG/ZEROPAGE                                  4 (  0.00)                  120 (  0.00)                    0 (  0.00)                  120 (  0.00)
Standby/RUNNING_XACTS                          2 (  0.00)                  236 (  0.00)                    0 (  0.00)                  236 (  0.00)
Heap/INSERT                               122781 ( 11.47)              9694899 ( 11.86)              1150064 (  0.06)             10844963 (  0.58)
Heap/UPDATE                               220143 ( 20.57)             29172042 ( 35.67)              1110312 (  0.06)             30282354 (  1.62)
Heap/HOT_UPDATE                           147169 ( 13.75)             10591360 ( 12.95)               275568 (  0.02)             10866928 (  0.58)
Heap/LOCK                                 228031 ( 21.31)             12917849 ( 15.80)            975914004 ( 54.70)            988831853 ( 52.99)
Heap/INSERT+INIT                             754 (  0.07)                59566 (  0.07)                    0 (  0.00)                59566 (  0.00)
Heap/UPDATE+INIT                            3293 (  0.31)               455778 (  0.56)                    0 (  0.00)               455778 (  0.02)
Btree/INSERT_LEAF                         223003 ( 20.84)             13080672 ( 16.00)            800409940 ( 44.86)            813490612 ( 43.60)
Btree/INSERT_UPPER                           433 (  0.04)                31088 (  0.04)                32576 (  0.00)                63664 (  0.00)
Btree/SPLIT_L                                218 (  0.02)               775610 (  0.95)                26040 (  0.00)               801650 (  0.04)
Btree/SPLIT_R                                216 (  0.02)               765118 (  0.94)                27220 (  0.00)               792338 (  0.04)
                                        --------                      --------                      --------                      --------
Total                                    1070232                      81776378 [4.38%]            1784208256 [95.62%]           1865984634 [100%]

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 556/49000000 -e 556/B8B40CA0

Type N (%) Record size (%) FPI size (%) Combined size (%)

---- - --- ----------- --- -------- --- ------------- ---

XLOG/FPI_FOR_HINT 650 ( 0.06) 31850 ( 0.04) 5262532 ( 0.29) 5294382 ( 0.28)

Transaction/COMMIT 123535 ( 11.54) 4200190 ( 5.14) 0 ( 0.00) 4200190 ( 0.23)

CLOG/ZEROPAGE 4 ( 0.00) 120 ( 0.00) 0 ( 0.00) 120 ( 0.00)

Standby/RUNNING_XACTS 2 ( 0.00) 236 ( 0.00) 0 ( 0.00) 236 ( 0.00)

Heap/INSERT 122781 ( 11.47) 9694899 ( 11.86) 1150064 ( 0.06) 10844963 ( 0.58)

Heap/UPDATE 220143 ( 20.57) 29172042 ( 35.67) 1110312 ( 0.06) 30282354 ( 1.62)

Heap/HOT_UPDATE 147169 ( 13.75) 10591360 ( 12.95) 275568 ( 0.02) 10866928 ( 0.58)

Heap/LOCK 228031 ( 21.31) 12917849 ( 15.80) 975914004 ( 54.70) 988831853 ( 52.99)

Heap/INSERT+INIT 754 ( 0.07) 59566 ( 0.07) 0 ( 0.00) 59566 ( 0.00)

Heap/UPDATE+INIT 3293 ( 0.31) 455778 ( 0.56) 0 ( 0.00) 455778 ( 0.02)

Btree/INSERT_LEAF 223003 ( 20.84) 13080672 ( 16.00) 800409940 ( 44.86) 813490612 ( 43.60)

Btree/INSERT_UPPER 433 ( 0.04) 31088 ( 0.04) 32576 ( 0.00) 63664 ( 0.00)

Btree/SPLIT_L 218 ( 0.02) 775610 ( 0.95) 26040 ( 0.00) 801650 ( 0.04)

Btree/SPLIT_R 216 ( 0.02) 765118 ( 0.94) 27220 ( 0.00) 792338 ( 0.04)

-------- -------- -------- --------

Total 1070232 81776378 [4.38%] 1784208256 [95.62%] 1865984634 [100%]

這上面可以看出，有95.62%的WAL空間都被FPI佔據了（也就是說WAL至少被放大了20倍)，這個比例是相當高的。

如果不修改pg_xlogdump的程式碼,也可以通過計算WAL距離的方式，算出準確的FPI比例。

postgres=# select pg_xlog_location_diff('556/B8B40CA0','556/49000000');
 pg_xlog_location_diff 
-----------------------
            1874070688
(1 row)

postgres=# select 1784208256.0 / 1874070688;
        ?column?        
------------------------
 0.95204960379808256197
(1 row)

postgres=# select pg_xlog_location_diff('556/B8B40CA0','556/49000000');

pg_xlog_location_diff

-----------------------

1874070688

(1 row)

postgres=# select 1784208256.0 / 1874070688;

?column?

------------------------

0.95204960379808256197

(1 row)

WAL的優化

在應用的寫負載不變的情況下，減少WAL生成量主要有下面幾種辦法。

延長checkpoint時間間隔
FPI產生於checkpoint之後第一次變髒的page，在下次checkpoint到來之前，已經輸出過PFI的page是不需要再次輸出FPI的。因此checkpoint時間間隔越長，FPI產生的頻度會越低。增大checkpoint_timeout和max_wal_size可以延長checkpoint時間間隔。
增加HOT_UPDATE比例
普通的UPDATE經常需要更新2個資料塊，並且可能還要更新索引page，這些又都有可能產生FPI。而HOT_UPDATE只修改1個資料塊，需要寫的WAL量也會相應減少。
壓縮
PostgreSQL9.5新增加了一個wal_compression引數，設為on可以對FPI進行壓縮，削減WAL的大小。另外還可以在外部通過SSL/SSH的壓縮功能減少主備間的通訊流量，以及自定義歸檔指令碼對歸檔的WAL進行壓縮。
關閉全頁寫
這是一個立竿見影但也很危險的辦法，如果底層的檔案系統或儲存支援原子寫可以考慮。因為很多部署環境都不具備安全的關閉全頁寫的條件，下文不對該方法做展開。

延長checkpoint時間

首先優化checkpoint相關引數

postgres.conf:

shared_buffers = 32GB
checkpoint_completion_target = 0.1
checkpoint_timeout = 60min
min_wal_size = 4GB
max_wal_size = 64GB
full_page_writes = on
wal_log_hints = on
wal_level = replica
wal_keep_segments = 1000

shared_buffers = 32GB

checkpoint_completion_target = 0.1

checkpoint_timeout = 60min

min_wal_size = 4GB

max_wal_size = 64GB

full_page_writes = on

wal_log_hints = on

wal_level = replica

wal_keep_segments = 1000

然後，手工發起一次checkpoint

-bash-4.1$ psql -c "checkpoint"
CHECKPOINT

1 2	-bash-4.1$ psql -c "checkpoint" CHECKPOINT

再壓測10w個事務，並連續測試10次

-bash-4.1$ psql -c "select pg_current_xlog_location()"  ; pgbench -n -c 100 -j 100 -t 1000 ;psql -c "select pg_current_xlog_location()"
 pg_current_xlog_location 
--------------------------
 558/47542B08
(1 row)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 100
number of threads: 100
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
latency average = 7.771 ms
tps = 12868.123227 (including connections establishing)
tps = 12896.084970 (excluding connections establishing)
 pg_current_xlog_location 
--------------------------
 558/A13DF908
(1 row)

-bash-4.1$ psql -c "select pg_current_xlog_location()" ; pgbench -n -c 100 -j 100 -t 1000 ;psql -c "select pg_current_xlog_location()"

pg_current_xlog_location

--------------------------

558/47542B08

(1 row)

transaction type: <builtin: TPC-B (sort of)>

scaling factor: 1000

query mode: simple

number of clients: 100

number of threads: 100

number of transactions per client: 1000

number of transactions actually processed: 100000/100000

latency average = 7.771 ms

tps = 12868.123227 (including connections establishing)

tps = 12896.084970 (excluding connections establishing)

pg_current_xlog_location

--------------------------

558/A13DF908

(1 row)

測試結果如下

第1次執行

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 558/47542B08 -e 558/A13DF908 
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
XLOG/FPI_FOR_HINT                           1933 (  0.23)                94717 (  0.15)             15612140 (  1.09)             15706857 (  1.05)
Transaction/COMMIT                        100000 ( 11.89)              3400000 (  5.26)                    0 (  0.00)              3400000 (  0.23)
CLOG/ZEROPAGE                                  3 (  0.00)                   90 (  0.00)                    0 (  0.00)                   90 (  0.00)
Standby/RUNNING_XACTS                          1 (  0.00)                  453 (  0.00)                    0 (  0.00)                  453 (  0.00)
Heap/INSERT                                99357 ( 11.82)              7849103 ( 12.14)                25680 (  0.00)              7874783 (  0.52)
Heap/UPDATE                               163254 ( 19.42)             22354169 ( 34.58)               351364 (  0.02)             22705533 (  1.51)
Heap/HOT_UPDATE                           134045 ( 15.94)              9646593 ( 14.92)               384948 (  0.03)             10031541 (  0.67)
Heap/LOCK                                 172576 ( 20.52)              9800924 ( 15.16)            778259316 ( 54.15)            788060240 ( 52.47)
Heap/INSERT+INIT                             643 (  0.08)                50797 (  0.08)                    0 (  0.00)                50797 (  0.00)
Heap/UPDATE+INIT                            2701 (  0.32)               371044 (  0.57)                    0 (  0.00)               371044 (  0.02)
Btree/INSERT_LEAF                         165561 ( 19.69)              9643359 ( 14.92)            642548940 ( 44.70)            652192299 ( 43.42)
Btree/INSERT_UPPER                           394 (  0.05)                28236 (  0.04)                56324 (  0.00)                84560 (  0.01)
Btree/SPLIT_L                                228 (  0.03)               811172 (  1.25)                57280 (  0.00)               868452 (  0.06)
Btree/SPLIT_R                                168 (  0.02)               595137 (  0.92)                64740 (  0.00)               659877 (  0.04)
                                        --------                      --------                      --------                      --------
Total                                     840864                      64645794 [4.30%]            1437360732 [95.70%]           1502006526 [100%]

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 558/47542B08 -e 558/A13DF908

Type N (%) Record size (%) FPI size (%) Combined size (%)

---- - --- ----------- --- -------- --- ------------- ---

XLOG/FPI_FOR_HINT 1933 ( 0.23) 94717 ( 0.15) 15612140 ( 1.09) 15706857 ( 1.05)

Transaction/COMMIT 100000 ( 11.89) 3400000 ( 5.26) 0 ( 0.00) 3400000 ( 0.23)

CLOG/ZEROPAGE 3 ( 0.00) 90 ( 0.00) 0 ( 0.00) 90 ( 0.00)

Standby/RUNNING_XACTS 1 ( 0.00) 453 ( 0.00) 0 ( 0.00) 453 ( 0.00)

Heap/INSERT 99357 ( 11.82) 7849103 ( 12.14) 25680 ( 0.00) 7874783 ( 0.52)

Heap/UPDATE 163254 ( 19.42) 22354169 ( 34.58) 351364 ( 0.02) 22705533 ( 1.51)

Heap/HOT_UPDATE 134045 ( 15.94) 9646593 ( 14.92) 384948 ( 0.03) 10031541 ( 0.67)

Heap/LOCK 172576 ( 20.52) 9800924 ( 15.16) 778259316 ( 54.15) 788060240 ( 52.47)

Heap/INSERT+INIT 643 ( 0.08) 50797 ( 0.08) 0 ( 0.00) 50797 ( 0.00)

Heap/UPDATE+INIT 2701 ( 0.32) 371044 ( 0.57) 0 ( 0.00) 371044 ( 0.02)

Btree/INSERT_LEAF 165561 ( 19.69) 9643359 ( 14.92) 642548940 ( 44.70) 652192299 ( 43.42)

Btree/INSERT_UPPER 394 ( 0.05) 28236 ( 0.04) 56324 ( 0.00) 84560 ( 0.01)

Btree/SPLIT_L 228 ( 0.03) 811172 ( 1.25) 57280 ( 0.00) 868452 ( 0.06)

Btree/SPLIT_R 168 ( 0.02) 595137 ( 0.92) 64740 ( 0.00) 659877 ( 0.04)

-------- -------- -------- --------

Total 840864 64645794 [4.30%] 1437360732 [95.70%] 1502006526 [100%]

第5次執行

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 559/6312AD98 -e 559/94AC4148
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
XLOG/FPI_FOR_HINT                           1425 (  0.17)                69825 (  0.11)             11508300 (  1.51)             11578125 (  1.40)
Transaction/COMMIT                        100000 ( 12.13)              3400000 (  5.37)                    0 (  0.00)              3400000 (  0.41)
CLOG/ZEROPAGE                                  3 (  0.00)                   90 (  0.00)                    0 (  0.00)                   90 (  0.00)
Standby/RUNNING_XACTS                          1 (  0.00)                  453 (  0.00)                    0 (  0.00)                  453 (  0.00)
Heap/INSERT                                99296 ( 12.05)              7844384 ( 12.38)                    0 (  0.00)              7844384 (  0.95)
Heap/UPDATE                               155408 ( 18.85)             21689908 ( 34.24)                    0 (  0.00)             21689908 (  2.62)
Heap/HOT_UPDATE                           142042 ( 17.23)             10222825 ( 16.14)                    0 (  0.00)             10222825 (  1.23)
Heap/LOCK                                 164776 ( 19.99)              9274729 ( 14.64)            608647740 ( 79.60)            617922469 ( 74.63)
Heap/INSERT+INIT                             704 (  0.09)                55616 (  0.09)                    0 (  0.00)                55616 (  0.01)
Heap/UPDATE+INIT                            2550 (  0.31)               355951 (  0.56)                    0 (  0.00)               355951 (  0.04)
Btree/INSERT_LEAF                         157807 ( 19.14)              9886864 ( 15.61)            144491940 ( 18.90)            154378804 ( 18.64)
Btree/INSERT_UPPER                           151 (  0.02)                10872 (  0.02)                    0 (  0.00)                10872 (  0.00)
Btree/SPLIT_L                                128 (  0.02)               455424 (  0.72)                    0 (  0.00)               455424 (  0.06)
Btree/SPLIT_R                                 23 (  0.00)                81466 (  0.13)                    0 (  0.00)                81466 (  0.01)
                                        --------                      --------                      --------                      --------
Total                                     824314                      63348407 [7.65%]             764647980 [92.35%]            827996387 [100%]

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 559/6312AD98 -e 559/94AC4148

Type N (%) Record size (%) FPI size (%) Combined size (%)

---- - --- ----------- --- -------- --- ------------- ---

XLOG/FPI_FOR_HINT 1425 ( 0.17) 69825 ( 0.11) 11508300 ( 1.51) 11578125 ( 1.40)

Transaction/COMMIT 100000 ( 12.13) 3400000 ( 5.37) 0 ( 0.00) 3400000 ( 0.41)

CLOG/ZEROPAGE 3 ( 0.00) 90 ( 0.00) 0 ( 0.00) 90 ( 0.00)

Standby/RUNNING_XACTS 1 ( 0.00) 453 ( 0.00) 0 ( 0.00) 453 ( 0.00)

Heap/INSERT 99296 ( 12.05) 7844384 ( 12.38) 0 ( 0.00) 7844384 ( 0.95)

Heap/UPDATE 155408 ( 18.85) 21689908 ( 34.24) 0 ( 0.00) 21689908 ( 2.62)

Heap/HOT_UPDATE 142042 ( 17.23) 10222825 ( 16.14) 0 ( 0.00) 10222825 ( 1.23)

Heap/LOCK 164776 ( 19.99) 9274729 ( 14.64) 608647740 ( 79.60) 617922469 ( 74.63)

Heap/INSERT+INIT 704 ( 0.09) 55616 ( 0.09) 0 ( 0.00) 55616 ( 0.01)

Heap/UPDATE+INIT 2550 ( 0.31) 355951 ( 0.56) 0 ( 0.00) 355951 ( 0.04)

Btree/INSERT_LEAF 157807 ( 19.14) 9886864 ( 15.61) 144491940 ( 18.90) 154378804 ( 18.64)

Btree/INSERT_UPPER 151 ( 0.02) 10872 ( 0.02) 0 ( 0.00) 10872 ( 0.00)

Btree/SPLIT_L 128 ( 0.02) 455424 ( 0.72) 0 ( 0.00) 455424 ( 0.06)

Btree/SPLIT_R 23 ( 0.00) 81466 ( 0.13) 0 ( 0.00) 81466 ( 0.01)

-------- -------- -------- --------

Total 824314 63348407 [7.65%] 764647980 [92.35%] 827996387 [100%]

第10次執行

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 55A/3347F298 -e 55A/5420F700
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
XLOG/FPI_FOR_HINT                           1151 (  0.13)                56399 (  0.09)              9295592 (  1.93)              9351991 (  1.71)
Transaction/COMMIT                        100000 ( 11.61)              3400000 (  5.15)                    0 (  0.00)              3400000 (  0.62)
CLOG/ZEROPAGE                                  3 (  0.00)                   90 (  0.00)                    0 (  0.00)                   90 (  0.00)
Standby/RUNNING_XACTS                          1 (  0.00)                   62 (  0.00)                    0 (  0.00)                   62 (  0.00)
Heap/INSERT                                99322 ( 11.53)              7846438 ( 11.88)                    0 (  0.00)              7846438 (  1.43)
Heap/UPDATE                               173901 ( 20.19)             23253149 ( 35.21)                    0 (  0.00)             23253149 (  4.25)
Heap/HOT_UPDATE                           123452 ( 14.33)              8884888 ( 13.45)                    0 (  0.00)              8884888 (  1.62)
Heap/LOCK                                 183501 ( 21.30)             10187069 ( 15.43)            449049828 ( 93.22)            459236897 ( 83.84)
Heap/INSERT+INIT                             678 (  0.08)                53562 (  0.08)                    0 (  0.00)                53562 (  0.01)
Heap/UPDATE+INIT                            2647 (  0.31)               365259 (  0.55)                    0 (  0.00)               365259 (  0.07)
Btree/INSERT_LEAF                         176343 ( 20.47)             11251588 ( 17.04)             23338600 (  4.85)             34590188 (  6.32)
Btree/INSERT_UPPER                           205 (  0.02)                14760 (  0.02)                    0 (  0.00)                14760 (  0.00)
Btree/SPLIT_L                                172 (  0.02)               611976 (  0.93)                    0 (  0.00)               611976 (  0.11)
Btree/SPLIT_R                                 33 (  0.00)               116886 (  0.18)                    0 (  0.00)               116886 (  0.02)
Btree/VACUUM                                   1 (  0.00)                   50 (  0.00)                    0 (  0.00)                   50 (  0.00)
                                        --------                      --------                      --------                      --------
Total                                     861410                      66042176 [12.06%]            481684020 [87.94%]            547726196 [100%]

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 55A/3347F298 -e 55A/5420F700

Type N (%) Record size (%) FPI size (%) Combined size (%)

---- - --- ----------- --- -------- --- ------------- ---

XLOG/FPI_FOR_HINT 1151 ( 0.13) 56399 ( 0.09) 9295592 ( 1.93) 9351991 ( 1.71)

Transaction/COMMIT 100000 ( 11.61) 3400000 ( 5.15) 0 ( 0.00) 3400000 ( 0.62)

CLOG/ZEROPAGE 3 ( 0.00) 90 ( 0.00) 0 ( 0.00) 90 ( 0.00)

Standby/RUNNING_XACTS 1 ( 0.00) 62 ( 0.00) 0 ( 0.00) 62 ( 0.00)

Heap/INSERT 99322 ( 11.53) 7846438 ( 11.88) 0 ( 0.00) 7846438 ( 1.43)

Heap/UPDATE 173901 ( 20.19) 23253149 ( 35.21) 0 ( 0.00) 23253149 ( 4.25)

Heap/HOT_UPDATE 123452 ( 14.33) 8884888 ( 13.45) 0 ( 0.00) 8884888 ( 1.62)

Heap/LOCK 183501 ( 21.30) 10187069 ( 15.43) 449049828 ( 93.22) 459236897 ( 83.84)

Heap/INSERT+INIT 678 ( 0.08) 53562 ( 0.08) 0 ( 0.00) 53562 ( 0.01)

Heap/UPDATE+INIT 2647 ( 0.31) 365259 ( 0.55) 0 ( 0.00) 365259 ( 0.07)

Btree/INSERT_LEAF 176343 ( 20.47) 11251588 ( 17.04) 23338600 ( 4.85) 34590188 ( 6.32)

Btree/INSERT_UPPER 205 ( 0.02) 14760 ( 0.02) 0 ( 0.00) 14760 ( 0.00)

Btree/SPLIT_L 172 ( 0.02) 611976 ( 0.93) 0 ( 0.00) 611976 ( 0.11)

Btree/SPLIT_R 33 ( 0.00) 116886 ( 0.18) 0 ( 0.00) 116886 ( 0.02)

Btree/VACUUM 1 ( 0.00) 50 ( 0.00) 0 ( 0.00) 50 ( 0.00)

-------- -------- -------- --------

Total 861410 66042176 [12.06%] 481684020 [87.94%] 547726196 [100%]

彙總如下:

No	tps	非FPI大小	WAL總量(位元組)	FPI比例(%)	每事務產生的WAL(位元組)
1	12896	64645794	1502006526	95.70	15020
5	12896	63348407	827996387	92.35	8279
10	12896	66042176	547726196	87.94	5477

不難看出非FPI大小是相對固定的，FPI的大小越來越小，這也證實了延長checkpoint間隔對削減WAL大小的作用。

增加HOT_UPDATE比例

HOT_UPDATE比例過低的一個很常見的原因是更新頻繁的表的fillfactor設定不恰當。fillfactor的預設值為100%，可以先將其調整為90%。

對於寬表，要進一步減小fillfactor使得至少可以保留一個tuple的空閒空間。可以查詢pg_class系統表估算平均tuple大小，並算出合理的fillfactor值。

postgres=# select 1 - relpages/reltuples max_fillfactor from pg_class where relname='big_tb';
    max_fillfactor    
----------------------
 0.69799901185770750988
(1 row)

postgres=# select 1 - relpages/reltuples max_fillfactor from pg_class where relname='big_tb';

max_fillfactor

----------------------

0.69799901185770750988

(1 row)

再上面估算出的69%的基礎上，可以把fillfactor再稍微設小一點，比如設成65% 。

在前面優化過的引數的基礎上，先保持fillfactor=100不變，執行100w事務的壓測

-bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()"  ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()"
 pg_current_xlog_location 
--------------------------
 55A/66715CC0
(1 row)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 100
number of threads: 100
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
latency average = 7.943 ms
tps = 12589.895315 (including connections establishing)
tps = 12592.623734 (excluding connections establishing)
 pg_current_xlog_location 
--------------------------
 55C/7C747F20
(1 row)

-bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()" ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()"

pg_current_xlog_location

--------------------------

55A/66715CC0

(1 row)

transaction type: <builtin: TPC-B (sort of)>

scaling factor: 1000

query mode: simple

number of clients: 100

number of threads: 100

number of transactions per client: 10000

number of transactions actually processed: 1000000/1000000

latency average = 7.943 ms

tps = 12589.895315 (including connections establishing)

tps = 12592.623734 (excluding connections establishing)

pg_current_xlog_location

--------------------------

55C/7C747F20

(1 row)

生成的WAL統計如下：

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 55A/66715CC0 -e 55C/7C747F20
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
XLOG/FPI_FOR_HINT                          30699 (  0.36)              1504251 (  0.23)            248063160 (  3.00)            249567411 (  2.80)
Transaction/COMMIT                       1000000 ( 11.80)             34000000 (  5.15)                    0 (  0.00)             34000000 (  0.38)
Transaction/COMMIT                             3 (  0.00)                  502 (  0.00)                    0 (  0.00)                  502 (  0.00)
CLOG/ZEROPAGE                                 31 (  0.00)                  930 (  0.00)                    0 (  0.00)                  930 (  0.00)
Standby/RUNNING_XACTS                          6 (  0.00)                 2226 (  0.00)                    0 (  0.00)                 2226 (  0.00)
Standby/INVALIDATIONS                          3 (  0.00)                  414 (  0.00)                    0 (  0.00)                  414 (  0.00)
Heap/INSERT                               993655 ( 11.72)             78496345 ( 11.90)               135164 (  0.00)             78631509 (  0.88)
Heap/UPDATE                              1658858 ( 19.57)            225826642 ( 34.23)               455368 (  0.01)            226282010 (  2.54)
Heap/HOT_UPDATE                          1314890 ( 15.51)             94634083 ( 14.35)               344324 (  0.00)             94978407 (  1.07)
Heap/LOCK                                1757258 ( 20.73)             98577892 ( 14.94)           5953842520 ( 72.12)           6052420412 ( 67.89)
Heap/INPLACE                                   9 (  0.00)                 1730 (  0.00)                 6572 (  0.00)                 8302 (  0.00)
Heap/INSERT+INIT                            6345 (  0.07)               501255 (  0.08)                    0 (  0.00)               501255 (  0.01)
Heap/UPDATE+INIT                           26265 (  0.31)              3635102 (  0.55)                    0 (  0.00)              3635102 (  0.04)
Btree/INSERT_LEAF                        1680195 ( 19.82)            104535607 ( 15.85)           2052212660 ( 24.86)           2156748267 ( 24.19)
Btree/INSERT_UPPER                          4928 (  0.06)               354552 (  0.05)               129128 (  0.00)               483680 (  0.01)
Btree/SPLIT_L                               4854 (  0.06)             17269109 (  2.62)                22080 (  0.00)             17291189 (  0.19)
Btree/SPLIT_R                                 95 (  0.00)               336650 (  0.05)                    0 (  0.00)               336650 (  0.00)
Btree/VACUUM                                   3 (  0.00)                  155 (  0.00)                 2220 (  0.00)                 2375 (  0.00)
                                        --------                      --------                      --------                      --------
Total                                    8478097                     659677445 [7.40%]            8255213196 [92.60%]           8914890641 [100%]

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 55A/66715CC0 -e 55C/7C747F20

Type N (%) Record size (%) FPI size (%) Combined size (%)

---- - --- ----------- --- -------- --- ------------- ---

XLOG/FPI_FOR_HINT 30699 ( 0.36) 1504251 ( 0.23) 248063160 ( 3.00) 249567411 ( 2.80)

Transaction/COMMIT 1000000 ( 11.80) 34000000 ( 5.15) 0 ( 0.00) 34000000 ( 0.38)

Transaction/COMMIT 3 ( 0.00) 502 ( 0.00) 0 ( 0.00) 502 ( 0.00)

CLOG/ZEROPAGE 31 ( 0.00) 930 ( 0.00) 0 ( 0.00) 930 ( 0.00)

Standby/RUNNING_XACTS 6 ( 0.00) 2226 ( 0.00) 0 ( 0.00) 2226 ( 0.00)

Standby/INVALIDATIONS 3 ( 0.00) 414 ( 0.00) 0 ( 0.00) 414 ( 0.00)

Heap/INSERT 993655 ( 11.72) 78496345 ( 11.90) 135164 ( 0.00) 78631509 ( 0.88)

Heap/UPDATE 1658858 ( 19.57) 225826642 ( 34.23) 455368 ( 0.01) 226282010 ( 2.54)

Heap/HOT_UPDATE 1314890 ( 15.51) 94634083 ( 14.35) 344324 ( 0.00) 94978407 ( 1.07)

Heap/LOCK 1757258 ( 20.73) 98577892 ( 14.94) 5953842520 ( 72.12) 6052420412 ( 67.89)

Heap/INPLACE 9 ( 0.00) 1730 ( 0.00) 6572 ( 0.00) 8302 ( 0.00)

Heap/INSERT+INIT 6345 ( 0.07) 501255 ( 0.08) 0 ( 0.00) 501255 ( 0.01)

Heap/UPDATE+INIT 26265 ( 0.31) 3635102 ( 0.55) 0 ( 0.00) 3635102 ( 0.04)

Btree/INSERT_LEAF 1680195 ( 19.82) 104535607 ( 15.85) 2052212660 ( 24.86) 2156748267 ( 24.19)

Btree/INSERT_UPPER 4928 ( 0.06) 354552 ( 0.05) 129128 ( 0.00) 483680 ( 0.01)

Btree/SPLIT_L 4854 ( 0.06) 17269109 ( 2.62) 22080 ( 0.00) 17291189 ( 0.19)

Btree/SPLIT_R 95 ( 0.00) 336650 ( 0.05) 0 ( 0.00) 336650 ( 0.00)

Btree/VACUUM 3 ( 0.00) 155 ( 0.00) 2220 ( 0.00) 2375 ( 0.00)

-------- -------- -------- --------

Total 8478097 659677445 [7.40%] 8255213196 [92.60%] 8914890641 [100%]

設定fillfactor=90

postgres=# alter table pgbench_accounts set (fillfactor=90);
ALTER TABLE
postgres=# vacuum full pgbench_accounts;
VACUUM
postgres=# alter table pgbench_tellers set (fillfactor=90);
ALTER TABLE
postgres=# vacuum full pgbench_tellers;
VACUUM
postgres=# alter table pgbench_branches set (fillfactor=90);
ALTER TABLE
postgres=# vacuum full pgbench_branches;
VACUUM

postgres=# alter table pgbench_accounts set (fillfactor=90);

ALTER TABLE

postgres=# vacuum full pgbench_accounts;

VACUUM

postgres=# alter table pgbench_tellers set (fillfactor=90);

ALTER TABLE

postgres=# vacuum full pgbench_tellers;

VACUUM

postgres=# alter table pgbench_branches set (fillfactor=90);

ALTER TABLE

postgres=# vacuum full pgbench_branches;

VACUUM

再次測試

-bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()"  ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()"
 pg_current_xlog_location 
--------------------------
 561/78BD2460
(1 row)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 100
number of threads: 100
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
latency average = 7.570 ms
tps = 13210.665959 (including connections establishing)
tps = 13212.956814 (excluding connections establishing)
 pg_current_xlog_location 
--------------------------
 562/F91436D8
(1 row)

-bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()" ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()"

pg_current_xlog_location

--------------------------

561/78BD2460

(1 row)

transaction type: <builtin: TPC-B (sort of)>

scaling factor: 1000

query mode: simple

number of clients: 100

number of threads: 100

number of transactions per client: 10000

number of transactions actually processed: 1000000/1000000

latency average = 7.570 ms

tps = 13210.665959 (including connections establishing)

tps = 13212.956814 (excluding connections establishing)

pg_current_xlog_location

--------------------------

562/F91436D8

(1 row)

生成的WAL統計如下：

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 561/78BD2460 -e 562/F91436D8
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
XLOG/FPI_FOR_HINT                          13529 (  0.22)               662921 (  0.16)             99703804 (  1.66)            100366725 (  1.57)
Transaction/COMMIT                       1000000 ( 16.09)             34000000 (  8.07)                    0 (  0.00)             34000000 (  0.53)
Transaction/COMMIT                             4 (  0.00)                 1035 (  0.00)                    0 (  0.00)                 1035 (  0.00)
CLOG/ZEROPAGE                                 30 (  0.00)                  900 (  0.00)                    0 (  0.00)                  900 (  0.00)
Standby/RUNNING_XACTS                          5 (  0.00)                 1913 (  0.00)                    0 (  0.00)                 1913 (  0.00)
Standby/INVALIDATIONS                          2 (  0.00)                  276 (  0.00)                    0 (  0.00)                  276 (  0.00)
Heap/INSERT                               993629 ( 15.98)             78494191 ( 18.63)               362908 (  0.01)             78857099 (  1.23)
Heap/DELETE                                    1 (  0.00)                   59 (  0.00)                 7972 (  0.00)                 8031 (  0.00)
Heap/UPDATE                               553073 (  8.90)             47100570 ( 11.18)                48188 (  0.00)             47148758 (  0.74)
Heap/HOT_UPDATE                          2438157 ( 39.22)            170238869 ( 40.40)           5809935900 ( 96.97)           5980174769 ( 93.25)
Heap/LOCK                                 635714 ( 10.23)             34328566 (  8.15)                16200 (  0.00)             34344766 (  0.54)
Heap/INPLACE                                  10 (  0.00)                 1615 (  0.00)                22692 (  0.00)                24307 (  0.00)
Heap/INSERT+INIT                            6372 (  0.10)               503388 (  0.12)                    0 (  0.00)               503388 (  0.01)
Heap/UPDATE+INIT                            8804 (  0.14)               741136 (  0.18)                    0 (  0.00)               741136 (  0.01)
Btree/INSERT_LEAF                         556456 (  8.95)             35492624 (  8.42)             81089180 (  1.35)            116581804 (  1.82)
Btree/INSERT_UPPER                          5422 (  0.09)               389735 (  0.09)               328108 (  0.01)               717843 (  0.01)
Btree/SPLIT_L                               5036 (  0.08)             17918305 (  4.25)               154980 (  0.00)             18073285 (  0.28)
Btree/SPLIT_R                                414 (  0.01)              1466691 (  0.35)                22140 (  0.00)              1488831 (  0.02)
Btree/VACUUM                                   2 (  0.00)                  100 (  0.00)                    0 (  0.00)                  100 (  0.00)
                                        --------                      --------                      --------                      --------
Total                                    6216660                     421342894 [6.57%]            5991692072 [93.43%]           6413034966 [100%]

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 561/78BD2460 -e 562/F91436D8

Type N (%) Record size (%) FPI size (%) Combined size (%)

---- - --- ----------- --- -------- --- ------------- ---

XLOG/FPI_FOR_HINT 13529 ( 0.22) 662921 ( 0.16) 99703804 ( 1.66) 100366725 ( 1.57)

Transaction/COMMIT 1000000 ( 16.09) 34000000 ( 8.07) 0 ( 0.00) 34000000 ( 0.53)

Transaction/COMMIT 4 ( 0.00) 1035 ( 0.00) 0 ( 0.00) 1035 ( 0.00)

CLOG/ZEROPAGE 30 ( 0.00) 900 ( 0.00) 0 ( 0.00) 900 ( 0.00)

Standby/RUNNING_XACTS 5 ( 0.00) 1913 ( 0.00) 0 ( 0.00) 1913 ( 0.00)

Standby/INVALIDATIONS 2 ( 0.00) 276 ( 0.00) 0 ( 0.00) 276 ( 0.00)

Heap/INSERT 993629 ( 15.98) 78494191 ( 18.63) 362908 ( 0.01) 78857099 ( 1.23)

Heap/DELETE 1 ( 0.00) 59 ( 0.00) 7972 ( 0.00) 8031 ( 0.00)

Heap/UPDATE 553073 ( 8.90) 47100570 ( 11.18) 48188 ( 0.00) 47148758 ( 0.74)

Heap/HOT_UPDATE 2438157 ( 39.22) 170238869 ( 40.40) 5809935900 ( 96.97) 5980174769 ( 93.25)

Heap/LOCK 635714 ( 10.23) 34328566 ( 8.15) 16200 ( 0.00) 34344766 ( 0.54)

Heap/INPLACE 10 ( 0.00) 1615 ( 0.00) 22692 ( 0.00) 24307 ( 0.00)

Heap/INSERT+INIT 6372 ( 0.10) 503388 ( 0.12) 0 ( 0.00) 503388 ( 0.01)

Heap/UPDATE+INIT 8804 ( 0.14) 741136 ( 0.18) 0 ( 0.00) 741136 ( 0.01)

Btree/INSERT_LEAF 556456 ( 8.95) 35492624 ( 8.42) 81089180 ( 1.35) 116581804 ( 1.82)

Btree/INSERT_UPPER 5422 ( 0.09) 389735 ( 0.09) 328108 ( 0.01) 717843 ( 0.01)

Btree/SPLIT_L 5036 ( 0.08) 17918305 ( 4.25) 154980 ( 0.00) 18073285 ( 0.28)

Btree/SPLIT_R 414 ( 0.01) 1466691 ( 0.35) 22140 ( 0.00) 1488831 ( 0.02)

Btree/VACUUM 2 ( 0.00) 100 ( 0.00) 0 ( 0.00) 100 ( 0.00)

-------- -------- -------- --------

Total 6216660 421342894 [6.57%] 5991692072 [93.43%] 6413034966 [100%]

設定fillfactor=90後，生成的WAL量從8914890641減少到6413034966。

設定WAL壓縮

修改postgres.conf，開啟WAL壓縮

wal_compression = on

1	wal_compression = on

再次測試

-bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()"  ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()"
 pg_current_xlog_location 
--------------------------
 562/F91B5978
(1 row)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 100
number of threads: 100
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
latency average = 8.295 ms
tps = 12056.091399 (including connections establishing)
tps = 12059.453725 (excluding connections establishing)
 pg_current_xlog_location 
--------------------------
 563/39880390
(1 row)

-bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()" ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()"

pg_current_xlog_location

--------------------------

562/F91B5978

(1 row)

transaction type: <builtin: TPC-B (sort of)>

scaling factor: 1000

query mode: simple

number of clients: 100

number of threads: 100

number of transactions per client: 10000

number of transactions actually processed: 1000000/1000000

latency average = 8.295 ms

tps = 12056.091399 (including connections establishing)

tps = 12059.453725 (excluding connections establishing)

pg_current_xlog_location

--------------------------

563/39880390

(1 row)

生成的WAL統計如下：

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 562/F91B5978 -e 563/39880390
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
XLOG/FPI_FOR_HINT                           7557 (  0.12)               385375 (  0.09)              5976157 (  0.94)              6361532 (  0.60)
Transaction/COMMIT                       1000000 ( 15.55)             34000000 (  7.97)                    0 (  0.00)             34000000 (  3.20)
Transaction/COMMIT                             2 (  0.00)                  356 (  0.00)                    0 (  0.00)                  356 (  0.00)
CLOG/ZEROPAGE                                 31 (  0.00)                  930 (  0.00)                    0 (  0.00)                  930 (  0.00)
Standby/RUNNING_XACTS                          5 (  0.00)                 1937 (  0.00)                    0 (  0.00)                 1937 (  0.00)
Standby/INVALIDATIONS                          4 (  0.00)                  504 (  0.00)                    0 (  0.00)                  504 (  0.00)
Heap/INSERT                               993632 ( 15.45)             78494714 ( 18.40)               205874 (  0.03)             78700588 (  7.40)
Heap/UPDATE                               663845 ( 10.32)             56645461 ( 13.28)                39548 (  0.01)             56685009 (  5.33)
Heap/HOT_UPDATE                          2326238 ( 36.17)            163847160 ( 38.41)            604564022 ( 94.97)            768411182 ( 72.27)
Heap/LOCK                                 747342 ( 11.62)             40358851 (  9.46)              1713055 (  0.27)             42071906 (  3.96)
Heap/INPLACE                                   9 (  0.00)                 1425 (  0.00)                 5160 (  0.00)                 6585 (  0.00)
Heap/INSERT+INIT                            6368 (  0.10)               503072 (  0.12)                    0 (  0.00)               503072 (  0.05)
Heap/UPDATE+INIT                            9927 (  0.15)               839135 (  0.20)                    0 (  0.00)               839135 (  0.08)
Btree/INSERT_LEAF                         671387 ( 10.44)             42884429 ( 10.05)             19691394 (  3.09)             62575823 (  5.89)
Btree/INSERT_UPPER                          2385 (  0.04)               170946 (  0.04)               210384 (  0.03)               381330 (  0.04)
Btree/SPLIT_L                               1438 (  0.02)              5107876 (  1.20)              2613608 (  0.41)              7721484 (  0.73)
Btree/SPLIT_R                                947 (  0.01)              3360714 (  0.79)              1563260 (  0.25)              4923974 (  0.46)
Btree/VACUUM                                   3 (  0.00)                  150 (  0.00)                    0 (  0.00)                  150 (  0.00)
                                        --------                      --------                      --------                      --------
Total                                    6431120                     426603035 [40.12%]            636582462 [59.88%]           1063185497 [100%]

-bash-4.1$ ./pg_xlogdump_ex --stats=record -s 562/F91B5978 -e 563/39880390

Type N (%) Record size (%) FPI size (%) Combined size (%)

---- - --- ----------- --- -------- --- ------------- ---

XLOG/FPI_FOR_HINT 7557 ( 0.12) 385375 ( 0.09) 5976157 ( 0.94) 6361532 ( 0.60)

Transaction/COMMIT 1000000 ( 15.55) 34000000 ( 7.97) 0 ( 0.00) 34000000 ( 3.20)

Transaction/COMMIT 2 ( 0.00) 356 ( 0.00) 0 ( 0.00) 356 ( 0.00)

CLOG/ZEROPAGE 31 ( 0.00) 930 ( 0.00) 0 ( 0.00) 930 ( 0.00)

Standby/RUNNING_XACTS 5 ( 0.00) 1937 ( 0.00) 0 ( 0.00) 1937 ( 0.00)

Standby/INVALIDATIONS 4 ( 0.00) 504 ( 0.00) 0 ( 0.00) 504 ( 0.00)

Heap/INSERT 993632 ( 15.45) 78494714 ( 18.40) 205874 ( 0.03) 78700588 ( 7.40)

Heap/UPDATE 663845 ( 10.32) 56645461 ( 13.28) 39548 ( 0.01) 56685009 ( 5.33)

Heap/HOT_UPDATE 2326238 ( 36.17) 163847160 ( 38.41) 604564022 ( 94.97) 768411182 ( 72.27)

Heap/LOCK 747342 ( 11.62) 40358851 ( 9.46) 1713055 ( 0.27) 42071906 ( 3.96)

Heap/INPLACE 9 ( 0.00) 1425 ( 0.00) 5160 ( 0.00) 6585 ( 0.00)

Heap/INSERT+INIT 6368 ( 0.10) 503072 ( 0.12) 0 ( 0.00) 503072 ( 0.05)

Heap/UPDATE+INIT 9927 ( 0.15) 839135 ( 0.20) 0 ( 0.00) 839135 ( 0.08)

Btree/INSERT_LEAF 671387 ( 10.44) 42884429 ( 10.05) 19691394 ( 3.09) 62575823 ( 5.89)

Btree/INSERT_UPPER 2385 ( 0.04) 170946 ( 0.04) 210384 ( 0.03) 381330 ( 0.04)

Btree/SPLIT_L 1438 ( 0.02) 5107876 ( 1.20) 2613608 ( 0.41) 7721484 ( 0.73)

Btree/SPLIT_R 947 ( 0.01) 3360714 ( 0.79) 1563260 ( 0.25) 4923974 ( 0.46)

Btree/VACUUM 3 ( 0.00) 150 ( 0.00) 0 ( 0.00) 150 ( 0.00)

-------- -------- -------- --------

Total 6431120 426603035 [40.12%] 636582462 [59.88%] 1063185497 [100%]

設定`wal_compression = on後，生成的WAL量從6413034966減少到1063185497。

優化結果彙總

wal_compression	fillfactor	tps	非FPI大小	WAL總量(位元組)	FPI比例(%)	HOT_UPDATE比例(%)	每事務產生的WAL(位元組)
off	100	12592	659677445	8255213196	92.60	44	8255
off	90	13212	421342894	6413034966	93.43	81	6413
on	90	12059	426603035	1063185497	59.88	78	1063

僅僅調整wal_compression和fillfactor就削減了87%的WAL，這還沒有算上延長checkpoint間隔帶來的收益。

總結

PostgreSQL在未經優化的情況下，20倍甚至更高的WAL寫放大是很常見的，適當的優化之後應該可以減少到3倍以下。引入SSL/SSH壓縮或歸檔壓縮等外部手段還可以進一步減少WAL的生成量。

如何判斷是否需要優化WAL？

關於如何判斷是否需要優化WAL，可以通過分析WAL，然後檢查下面的條件，做一個粗略的判斷：

FPI比例高於70%
HOT_UPDATE比例低於70%

以上僅僅是粗略的經驗值，僅供參考。並且這個FPI比例可能不適用於低寫負載的系統，低寫負載的系統FPI比例一定非常高，但是，低寫負載系統由於寫操作少，因此FPI比例即使高一點也沒太大影響。

優化WAL的副作用

前面用到了3種優化手段，如果設定不當，也會產生副作用，具體如下：

延長checkpoint時間間隔
導致crash恢復時間變長。crash恢復時需要回放的WAL日誌量一般小於max_wal_size的一半，WAL回放速度(wal_compression=on時)一般是50MB/s~150MB/s之間。可以根據可容忍的最大crash恢復時間（有備機時，切備機可能比等待crash恢復更快），估算出允許的max_wal_size的最大值。
調整fillfactor
過小的設定會浪費儲存空間，這個不難理解。另外，對於頻繁更新的表，即使把fillfactor設成100%，每個page裡還是要有一部分空間被dead tuple佔據，不會比設定一個合適的稍小的fillfactor更節省空間。
設定wal_compression=on
需要額外佔用CPU資源進行壓縮，但根據實測的結果影響不大。

其他

去年Uber放出了一篇把PostgreSQL說得一無是處的文章為什麼Uber宣佈從PostgreSQL切換到MySQL?給PostgreSQL帶來了很大負面影響。Uber文章中提到了PG的幾個問題，每一個都被描述成無法逾越的“巨坑”。但實際上這些問題中，除了“寫放大”，其它幾個問題要麼是無關痛癢要麼是隨著PG的版本升級早就不存在了。至於“寫放大”，也是有解的。Uber的文章裡沒有提到他們在優化WAL寫入量上做過什麼有益的嘗試，並且他們使用的PostgreSQL 9.2也是不支援wal_compression的，因此推斷他們PG資料庫很可能一直執行在20倍以上WAL寫放大的完全未優化狀態下。

參考

PostgreSQL的wal_buffers
2022-12-12
SQL
瘋狂的沙王
2018-03-20
《使命召喚：戰區》爆炸性增長開發商Raven瘋狂招人
2021-04-25
PostgreSQL WAL解析：構建WAL記錄準備
2019-09-08
SQL
“瘋狂”的數字孿生
2022-09-09
瘋狂學習——DP！
2024-04-12
瘋狂ajax講義
2020-10-20
小丑路人瘋狂吧
2019-05-11
瘋狂的長租公寓：自如、蛋殼們2020的生死決戰
2020-03-31
《蛋仔派對》：各大主播的技能秀，瘋狂亂鬥徹底瘋狂
2023-05-02
PostgreSQL email list：nvm wal buffer
2020-12-30
SQLAI
瘋狂的遊戲代言人！
2020-03-11
遊戲
Ivan Zanotti 與他瘋狂的作品
2021-03-01
瘋狂的 Vue3 之 Setup
2021-04-22
Vue
【Luogu1616】瘋狂的採藥
2018-04-30
京東正在瘋狂招人。。。
2024-09-11
騰訊“瘋狂”開源！
2020-01-22
PostgreSQL DBA(14) - WAL基本術語
2018-12-13
SQL
瘋狂刪除tomcat日誌
2024-07-18
Tomcat
93-臭蟲也瘋狂
2020-04-04
PostgreSQL的xlog/Wal歸檔及日誌清理
2021-07-21
SQL
PostgreSQL 原始碼解讀（104）- WAL#1（Insert & WAL-heap_i...
2018-12-08
SQL原始碼
PostgreSQL 原始碼解讀（105）- WAL#2（Insert & WAL-heap_i...
2018-12-10
SQL原始碼
PostgreSQL 原始碼解讀（106）- WAL#3（Insert & WAL-heap_i...
2018-12-11
SQL原始碼
PostgreSQL 原始碼解讀（107）- WAL#4（Insert & WAL-heap_i...
2018-12-12
SQL原始碼
PostgreSQL 原始碼解讀（110）- WAL#6（Insert&WAL - XLogRe...
2018-12-27
SQL原始碼
PostgreSQL 原始碼解讀（111）- WAL#7（Insert&WAL - XLogRe...
2018-12-28
SQL原始碼
PostgreSQL 原始碼解讀（113）- WAL#9（Insert&WAL - CopyXL...
2018-12-29
SQL原始碼
瘋狂遊戲CPO孫勁超：用社交做長線運營
2020-08-20
遊戲
更新了！帶Agent的Cursor太瘋狂了
2024-11-25
那些年我之瘋狂的超能力
2019-04-03
瘋狂java07--物件導向的陷阱
2018-08-07
Java物件
PostgreSQL DBA(172) - PG 13（WAL activity in EXPLAIN）
2020-04-14
SQLAI
PostgreSQL DBA(15) - WAL檔案結構
2018-12-17
SQL
數字馬力正在瘋狂招人。。
2024-10-28
特斯拉人，瘋狂買車配RZC
2022-03-11
《瘋狂的動物》專案詳情介紹
2020-03-23
【PG】PostgreSQL 預寫日誌(WAL)、checkpoint、LSN
2024-05-18
SQL
PostgreSQL mail list patch：Non-volatile WAL BUFFER
2020-02-25
SQLAI

如何遏制 PostgreSQL WAL 的瘋狂增長

前言

WAL記錄的構成

WAL的解析

WAL的統計

postgres.conf配置

測試

日誌統計

WAL的優化

延長checkpoint時間

增加HOT_UPDATE比例

設定WAL壓縮

優化結果彙總

總結

如何判斷是否需要優化WAL？

優化WAL的副作用

其他

參考

相關文章