KingbaseES 叢集運維典型案例 03 --“雙主” sys_rewind恢復機制

KINGBASE研究院發表於2024-07-26

案例說明:
主庫主機系統重啟,觸發failover切換,原主庫系統啟動後,資料庫服務被啟動,出現“雙主”,使用“repmgr node rejoin --force-rewind”恢復叢集。透過本案例瞭解,叢集“雙主”產生的原因及解決方案,並熟悉sys_rewind在叢集恢復中的應用。

資料庫版本:
KingbaseES V8R6

一、叢集架構
如下所示,叢集原節點狀態:

[kingbase@node202 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                         
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 17       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 17       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

二、主庫系統重啟
[root@node201 ~]# reboot

三、主庫重啟後叢集狀態
如下所示,原主庫主機系統重啟後,資料庫服務自動啟動,導致出現“雙主”:

1、failover後叢集節點狀態

[kingbase@node202 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                         
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | primary | * running |          | default  | 100      | 18       |         | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)

2、檢視節點上timeline變化

如下所示,新主庫觸發promote後,timeline為(18=1*16^1 + 2 * 16^0),原主庫的timeline為(17):

新主庫:
[kingbase@node202 bin]$ ls -lh ../data/sys_wal
-rw------- 1 kingbase kingbase 16M Jan 31 10:52 0000001100000000000000EB
-rw------- 1 kingbase kingbase 16M Jan 31 10:57 0000001100000000000000EC.partial
-rw------- 1 kingbase kingbase 694 Jan 26 09:41 00000011.history
-rw------- 1 kingbase kingbase 16M Jan 31 11:08 0000001200000000000000EC
-rw------- 1 kingbase kingbase 738 Jan 31 11:07 00000012.history

原主庫:
[kingbase@node201 bin]$ ls -lh ../data/sys_wal
-rw------- 1 kingbase kingbase 16M Jan 31 10:52 0000001100000000000000EB
-rw------- 1 kingbase kingbase 16M Jan 31 11:06 0000001100000000000000EC
-rw------- 1 kingbase kingbase 16M Jan 31 11:06 0000001100000000000000ED
-rw------- 1 kingbase kingbase 694 Jan  3 16:27 00000011.history
drwx------ 2 kingbase kingbase 16K Jan 31 11:06 archive_status

四、模擬“雙主”故障

1、觸發failover切換

# 叢集引數配置
[kingbase@node201 bin]$ cat ../etc/repmgr.conf |grep -E 'recovery|failover'
failover='automatic'
recovery='manual'

# 關閉主庫資料庫服務
[kingbase@node201 bin]$ ./sys_ctl stop -D ../data/
主庫資料庫服務被關閉後,將觸發叢集主備failover切換,在引數recovery=manual情況下,原主庫將需要人工參與恢復為備庫加入叢集。

2、誤啟動原主庫資料庫服務
如下所示,如果需要人工將原主庫恢復為新的備庫加入到叢集,需要在原主庫data目錄下建立standby.signal檔案後,再啟動資料庫服務,資料庫服務將啟動到備庫狀態,如果在沒有建立standby.signal標識檔案下,直接啟動原主庫資料庫服務,將直接進入主節點模式。

[kingbase@node201 bin]$ ./sys_ctl start -D ../data/

2、檢視資料庫服務程序
如下所示,原主庫啟動後,資料庫服務程序啟動為主庫,叢集出現“雙主”。

[kingbase@node201 bin]$ ps -ef |grep kingbase

kingbase  4137     1  0 11:14 ?        00:00:00 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kingbase -D ../data
kingbase  4138  4137  0 11:14 ?        00:00:00 kingbase: logger
kingbase  4140  4137  0 11:14 ?        00:00:00 kingbase: checkpointer
kingbase  4141  4137  0 11:14 ?        00:00:00 kingbase: background writer
kingbase  4142  4137  0 11:14 ?        00:00:00 kingbase: walwriter
kingbase  4143  4137  0 11:14 ?        00:00:00 kingbase: autovacuum launcher
kingbase  4144  4137  0 11:14 ?        00:00:00 kingbase: archiver   last was 0000001100000000000000EC
kingbase  4145  4137  0 11:14 ?        00:00:00 kingbase: stats collector
kingbase  4146  4137  0 11:14 ?        00:00:00 kingbase: kwr collector
kingbase  4147  4137  0 11:14 ?        00:00:00 kingbase: ksh writer
kingbase  4148  4137  0 11:14 ?        00:00:00 kingbase: ksh collector
kingbase  4149  4137  0 11:14 ?        00:00:00 kingbase: logical replication launcher

# 資料庫狀態
test=# select sys_is_in_recovery();
 sys_is_in_recovery()
----------------------
 f
(1 row)

3、檢視資料庫日誌

[kingbase@node201 sys_log]$ tail -1000 kingbase-2024-01-31_111429.csv
2024-01-31 11:14:29.129 CST,,,4137,,65b9bb15.1029,1,,2024-01-31 11:14:29 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2024-01-31 11:14:29.149 CST,,,4139,,65b9bb15.102b,1,,2024-01-31 11:14:29 CST,,0,LOG,00000,"database system was shut down at 2024-01-31 11:06:32 CST",,,,,,,,,""
2024-01-31 11:14:29.164 CST,,,4137,,65b9bb15.1029,2,,2024-01-31 11:14:29 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2024-01-31 11:14:31.775 CST,"esrep","esrep",4156,"192.168.1.201:32776",65b9bb17.103c,1,"ALTER SYSTEM",2024-01-31 11:14:31 CST,4/21,0,LOG,XX000,"attention:user esrep is modifying synchronous_standby_names by ALTER SYSTEM SET statement",,,,,,"ALTER SYSTEM SET synchronous_standby_names = ''",,,"internal_rwcmgr"
2024-01-31 11:14:31.783 CST,,,4137,,65b9bb15.1029,3,,2024-01-31 11:14:29 CST,,0,LOG,00000,"received SIGHUP, reloading configuration files",,,,,,,,,""
2024-01-31 11:14:31.784 CST,,,4137,,65b9bb15.1029,4,,2024-01-31 11:14:29 CST,,0,LOG,00000,"parameter ""synchronous_standby_names"" changed to """"",,,,,,,,,""

4、叢集節點狀態
如下所示,叢集出現“雙主”狀態:

[kingbase@node202 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                         
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | primary | * running |          | default  | 100      | 18       |         | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)

五、恢復叢集
如下所示,將原主庫透過node rejoin方式作為備庫加入到叢集,以下為日誌資訊:

1、執行‘repmgr node rejoin’

[kingbase@node201 bin]$ ./repmgr node rejoin  -h 192.168.1.202 -U esrep -d esrep --force-rewind
......
[NOTICE] rejoin target is node "node2" (ID: 2)
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] local tli: 17; local_xlogpos: 0/EE000028; follow_target_history->tli: 17; follow_target_history->end: 0/EC000FA0
[NOTICE] sys_rewind execution required for this node to attach to rejoin target node 2
[DETAIL] rejoin target server's timeline 18 forked off current database system timeline 17 before current recovery point 0/EE000028
[NOTICE] executing sys_rewind
[DETAIL] sys_rewind command is "/home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C8/HAC8/kingbase/data' --source-server='host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/EC000FA0 on timeline 17
sys_rewind: rewinding from last common checkpoint at 0/EB000088 on timeline 17
sys_rewind: find last common checkpoint start time from 2024-01-31 11:18:07.003968 CST to 2024-01-31 11:18:07.204522 CST, in "0.200554" seconds.
sys_rewind: collect the number of WAL files to be processed:3, start time from 2024-01-31 11:18:07.204754 CST to 2024-01-31 11:18:07.320593 CST, cost "0.115839" seconds.
sys_rewind: read WAL in target server from 0/EB000088 to 0/EE0000A0 (endpoint 0/EE0000A0)
sys_rewind: read the local Wal file information, start time from 2024-01-31 11:18:07.204754 CST to 2024-01-31 11:18:07.321418 CST, cost "0.000793" seconds.
sys_rewind: file replication start time from 2024-01-31 11:18:07.321453 CST to 2024-01-31 11:18:07.827427 CST, cost "0.505974" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/EC013BE0', minRecoveryPointTLI is '18', and database state is 'in archive recovery'
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6C8/HAC8/kingbase/data/sys_replslot/repmgr_slot_2.rewind' and all the file/dir in it.
sys_rewind: rewind start wal location 0/EB000058 (file 0000001100000000000000EB), end wal location 0/EC013BE0 (file 0000001200000000000000EC). wal data increment:16462(kB). time from 2024-01-31 11:18:07.321453 CST to 2024-01-31 11:18:08.543478 CST, in "1.539510" seconds.
sys_rewind: Done!
[NOTICE] 0 files copied to /home/kingbase/cluster/R6C8/HAC8/kingbase/data
[INFO] creating replication slot as user "esrep"
[DEBUG] CreateSlotBySQL(): creating slot "repmgr_slot_1" on upstream
[NOTICE] setting node 1's upstream to node 2
[WARNING] unable to ping "host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000"
[DETAIL] KCIping() returned "KCIPING_NO_RESPONSE"
[NOTICE] begin to start server at 2024-01-31 11:18:08.556667
[NOTICE] starting server using "/home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6C8/HAC8/kingbase/data' -l /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/logfile start"
[NOTICE] start server finish at 2024-01-31 11:18:09.265062
[NOTICE] NODE REJOIN successful
[DETAIL] node 1 is now attached to node 2

2、rewind對比原主庫和新主庫history檔案
如下所示,sys_rewind透過對比源和目標庫的history檔案,查詢節點之間的分叉點。

原主庫:
[kingbase@node201 sys_wal]$ cat 00000011.history
1       0/180000A0      no recovery target specified
2       0/190000A0      no recovery target specified
3       0/250000A0      no recovery target specified
4       0/260000A0      no recovery target specified
5       0/AB0000A0      no recovery target specified
6       0/AD0000A0      no recovery target specified 
7       0/B10000A0      no recovery target specified
8       0/B30000A0      no recovery target specified
9       0/B60000A0      no recovery target specified
10      0/B70000A0      no recovery target specified
11      0/BE0000A0      no recovery target specified
12      0/BF0000A0      no recovery target specified
13      0/C00000A0      no recovery target specified
14      0/C10000A0      no recovery target specified
15      0/C90000A0      no recovery target specified
16      0/CA0000A0      no recovery target specified

新主庫:
[kingbase@node202 bin]$ cat ../data/sys_wal/00000012.history
1       0/180000A0      no recovery target specified
2       0/190000A0      no recovery target specified
3       0/250000A0      no recovery target specified
4       0/260000A0      no recovery target specified
5       0/AB0000A0      no recovery target specified
6       0/AD0000A0      no recovery target specified
7       0/B10000A0      no recovery target specified
8       0/B30000A0      no recovery target specified
9       0/B60000A0      no recovery target specified
10      0/B70000A0      no recovery target specified
11      0/BE0000A0      no recovery target specified
12      0/BF0000A0      no recovery target specified
13      0/C00000A0      no recovery target specified
14      0/C10000A0      no recovery target specified
15      0/C90000A0      no recovery target specified
16      0/CA0000A0      no recovery target specified
17      0/EC000FA0      no recovery target specified  --history分叉點

源庫和目標庫的分叉點:

如下所示,在rewind過程中,源庫和目標庫之間透過history檔案獲取分叉點,然後從分叉點之前最近的checkpoint開始對wal日誌執行recovery(必須保證從checkpoint開始,源庫上的wal日誌必須存在,否則將會因為缺失wal日誌,rewind失敗。):

3、檢視新主庫資料庫日誌

2024-01-31 11:18:09.261 CST,"esrep","",13596,"192.168.1.201:32307",65b9bbf1.351c,1,"idle",2024-01-31 11:18:09 CST,4/0,0,LOG,00000,"received replication command: IDENTIFY_SYSTEM",,,,,,,,,"node1"
2024-01-31 11:18:09.261 CST,"esrep","",13596,"192.168.1.201:32307",65b9bbf1.351c,2,"idle",2024-01-31 11:18:09 CST,4/0,0,LOG,00000,"received replication command: START_REPLICATION SLOT ""repmgr_slot_1"" 0/EC000000 TIMELINE 18",,,,,,,,,"node1"
2024-01-31 11:18:11.237 CST,"esrep","esrep",3552,"192.168.1.202:49745",65b9b688.de0,3,"ALTER SYSTEM",2024-01-31 10:55:04 CST,2/4414,0,LOG,XX000,"attention:user esrep is modifying synchronous_standby_names by ALTER SYSTEM SET statement",,,,,,"ALTER SYSTEM SET synchronous_standby_names = 'ANY 1(node1)'",,,"internal_rwcmgr"
2024-01-31 11:18:11.239 CST,,,2823,,65b9b66f.b07,25,,2024-01-31 10:54:39 CST,,0,LOG,00000,"received SIGHUP, reloading configuration files",,,,,,,,,""
2024-01-31 11:18:11.241 CST,,,2823,,65b9b66f.b07,26,,2024-01-31 10:54:39 CST,,0,LOG,00000,"parameter ""synchronous_standby_names"" changed to ""ANY 1(node1)""",,,,,,,,,""
2024-01-31 11:18:11.402 CST,"esrep","",13596,"192.168.1.201:32307",65b9bbf1.351c,3,"streaming 0/EC015300",2024-01-31 11:18:09 CST,4/0,0,LOG,00000,"standby ""node1"" is now a candidate for quorum synchronous standby",,,,,,,,,"node1"

4、叢集恢復完成

[kingbase@node202 bin]$ ./repmgr cluster show


 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                         
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 17       | 0 bytes | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | primary | * running |          | default  | 100      | 18       |         | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

六、總結
sys_rewind檢查源和目標叢集的時間線歷史以確定它們的分叉點,從分叉點之前最近的checkpoint位置開始解析WAL,解析出變動的資料頁,然後將變動的資料頁複製過來,並從分叉點最近的checkpoint開始應用wal日誌,最終保證源庫和目標庫資料一致。
sys_rewind 的使用不限於故障轉移,例如,可以提升備用伺服器主庫,執行一些寫入事務,然後重新回滾再次成為備用伺服器。

相關文章