Handling timeline mismatches between cluster nodes

Posted by openGaussbaby on 2024-03-30

Original URL: https://www.cnblogs.com/helloopenGauss/p/18105293

Source: https://www.modb.pro/db/400223

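During routine maintenance of PostgreSQL/MogDB/openGauss databases, repeated role switches can leave the nodes on inconsistent timelines, so that a standby can no longer join the cluster. This article uses PostgreSQL to reproduce the situations that can arise and summarizes how to handle each one.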

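About timelines
To distinguish the WAL record sequence generated after a point-in-time recovery from the one produced in the original database history, to keep the original WAL files from being overwritten, and to avoid administrative confusion, PostgreSQL introduced the concept of a "timeline". Timelines make it possible to recover from a backup to any prior state, including states on timeline branches that were abandoned earlier.

Whenever an archive recovery completes, a new timeline is created to identify the WAL records generated after that recovery. The timeline ID is part of the WAL segment file name, so a new timeline does not overwrite the WAL data generated by previous timelines.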
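
To make the file-name convention concrete, here is a minimal Python sketch (not part of the original article) that splits a WAL segment file name into its parts. The function names `parse_wal_segment_name` and `history_file_name` are made up for this illustration. The layout assumed is the standard one: 24 hexadecimal digits, of which the first 8 are the timeline ID.

```python
import re

# A WAL segment file name is 24 hex digits: 8 for the timeline ID,
# 8 for the high part of the segment number, 8 for the low part.
WAL_NAME = re.compile(r"([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})")


def parse_wal_segment_name(name):
    """Return (timeline_id, log_id, segment_id) for a WAL segment file name."""
    match = WAL_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a WAL segment file name: {name!r}")
    timeline_id, log_id, segment_id = (int(part, 16) for part in match.groups())
    return timeline_id, log_id, segment_id


def history_file_name(timeline_id):
    """Name of the history file that records how a timeline branched off."""
    return f"{timeline_id:08X}.history"


if __name__ == "__main__":
    print(parse_wal_segment_name("000000010000000000000008"))  # (1, 0, 8)
    print(history_file_name(2))  # 00000002.history
```

Two segments with the same segment number but different timeline IDs therefore get different file names, which is why a new timeline never overwrites older WAL.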

Scenario 1
-- Primary log
ERROR: requested starting point 0/8000000 on timeline 1 is not in this server's history
DETAIL: This server's history forked from timeline 1 at 0/6018D98.
STATEMENT: START_REPLICATION 0/8000000 TIMELINE 1

-- Standby log
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/80000A0
FATAL: could not start WAL streaming: ERROR: requested starting point 0/8000000 on timeline 1 is not in this server's history
DETAIL: This server's history forked from timeline 1 at 0/6018D98.
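When it occurs

The standby is promoted to primary, and the former primary rejoins the cluster as a standby
A new primary is restored from a backup, and the former primary joins the cluster as a standby
How to fix it

Rebuild the standby; suitable for databases with a small amount of data
Use the pg_rewind tool; this is the recommended approach. pg_rewind overwrites all configuration files with the copies from the source, so back them up beforehand, and add a recovery.conf (PostgreSQL 11 and earlier) or standby.signal (PostgreSQL 12 and later) file before starting the node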
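
As a rough illustration of the pg_rewind approach, the Python sketch below (not from the original article) only assembles the command line and picks the standby marker file. It does not run anything. The helper names and the connection string are hypothetical. The data directory `/data/pgdata14` is borrowed from a command later in this article. The sketch assumes the target instance has been shut down cleanly, which pg_rewind expects.

```python
import shlex


def build_pg_rewind_command(target_pgdata, source_conninfo, dry_run=True):
    """Build the argument list for a pg_rewind run against a stopped target."""
    command = [
        "pg_rewind",
        f"--target-pgdata={target_pgdata}",
        f"--source-server={source_conninfo}",
        "--progress",
    ]
    if dry_run:
        # --dry-run goes through the motions without modifying the target
        command.append("--dry-run")
    return command


def standby_marker_file(major_version):
    """File that makes the node start as a standby after the rewind."""
    # PostgreSQL 12 replaced recovery.conf with the empty standby.signal file
    return "standby.signal" if major_version >= 12 else "recovery.conf"


if __name__ == "__main__":
    # Hypothetical connection string: adjust host, port and user as needed.
    cmd = build_pg_rewind_command(
        "/data/pgdata14",
        "host=192.0.2.10 port=5432 user=postgres dbname=postgres",
    )
    print(shlex.join(cmd))
    print(standby_marker_file(14))
```

Starting with a dry run is a cautious choice here: it lets you see whether pg_rewind finds a common ancestor before any file in the target directory is touched.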
pg_rewind errors

pg_rewind: fatal: target server needs to use either data checksums or "wal_log_hints = on"
This error can appear even when wal_log_hints = on is already set. In that case, restart the database once as a primary.

pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
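The primary and standby are on the same timeline, so pg_rewind cannot be used directly. Run the target node as a standby first, promote it to primary so that its timeline increases, and then run pg_rewind again.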

pg_rewind: fatal: could not find common ancestor of the source and target cluster's timelines
In this case, rebuilding the standby directly is recommended.
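
The three pg_rewind errors and their remedies can be collected into a small lookup. This is only a convenience sketch in Python. The function name `suggest_action` and the table are made up for illustration, and the advice strings paraphrase the notes above rather than any official guidance.

```python
# Each entry pairs a fragment of a pg_rewind message with the remedy
# described in this article.
PG_REWIND_HINTS = [
    (
        'needs to use either data checksums or "wal_log_hints = on"',
        "restart the target once as a primary, then run pg_rewind again",
    ),
    (
        "source and target cluster are on the same timeline",
        "run the target as a standby, promote it, then run pg_rewind again",
    ),
    (
        "could not find common ancestor",
        "rebuild the standby",
    ),
]


def suggest_action(message):
    """Return the remedy for a known pg_rewind message, or None."""
    for fragment, action in PG_REWIND_HINTS:
        if fragment in message:
            return action
    return None


if __name__ == "__main__":
    line = "pg_rewind: source and target cluster are on the same timeline"
    print(suggest_action(line))
```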
Scenario 2
-- Standby fails to start
LOG: entering standby mode
FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/8000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/6018D98.
LOG: startup process (PID 1059) exited with exit code 1

When it occurs

Starting the database in Scenario 1 transfers the new primary's 00000002.history to the standby's local pg_wal directory
[postgres@bogon pg_wal]$ ls -l
total 49160
-rw-------. 1 postgres postgres 332 May 5 20:52 000000010000000000000004.00000028.backup
-rw-------. 1 postgres postgres 16777216 May 6 08:54 000000010000000000000008
-rw-------. 1 postgres postgres 16777216 May 6 08:49 000000010000000000000009
-rw-------. 1 postgres postgres 16777216 May 6 08:54 00000001000000000000000A
-rw-------. 1 postgres postgres 32 May 6 08:58 00000002.history
drwx------. 2 postgres postgres 88 May 6 08:58 archive_status
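
The FATAL message in this scenario boils down to comparing the standby's latest checkpoint with the fork point recorded in the history file. The Python sketch below (not from the original article) reproduces that comparison under stated assumptions. Each history line is taken to hold a parent timeline ID, a switch point LSN, and a free-text reason separated by whitespace. `SAMPLE_HISTORY` is illustrative content built from the fork point in the log above, not the actual file from this system. The sketch ignores the case where the checkpoint is already on the requested timeline.

```python
def parse_lsn(text):
    """Convert an LSN such as '0/6018D98' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def parse_history(content):
    """Parse timeline history text into a list of (parent_tli, switch_lsn)."""
    entries = []
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 2)  # parent TLI, switch point, reason
        entries.append((int(fields[0]), parse_lsn(fields[1])))
    return entries


def is_child_of(checkpoint_tli, checkpoint_lsn, history):
    """True if a node at checkpoint_lsn on checkpoint_tli can follow history."""
    for parent_tli, switch_lsn in history:
        if parent_tli == checkpoint_tli:
            # The node must not have gone past the point where the
            # requested timeline forked off from its own timeline.
            return checkpoint_lsn <= switch_lsn
    return False  # checkpoint timeline is not an ancestor in this file


# Illustrative content only, using the fork point 0/6018D98 from the log.
SAMPLE_HISTORY = "1\t0/6018D98\tno recovery target specified\n"

if __name__ == "__main__":
    history = parse_history(SAMPLE_HISTORY)
    print(is_child_of(1, parse_lsn("0/8000028"), history))  # False
```

With the values from the log, the checkpoint at 0/8000028 lies beyond the fork point 0/6018D98, so the standby's own history has diverged from timeline 2 and startup is refused.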

How to fix it

Simply delete 00000002.history from pg_wal, archive_status, and the archive directory
[postgres@bogon pg_wal]$ rm -f 00000002.history
[postgres@bogon pg_wal]$ cd archive_status/
[postgres@bogon archive_status]$ ls -l
total 0
-rw-------. 1 postgres postgres 0 May 5 20:52 000000010000000000000004.00000028.backup.done
-rw-------. 1 postgres postgres 0 May 6 08:58 00000002.history.done
[postgres@bogon archive_status]$ rm -rf *
[postgres@bogon archive_status]$

Scenario 3
LOG: started streaming WAL from primary at 0/7000000 on timeline 2
FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/7000000 is ahead of the WAL flush position of this server 0/601A5D8
cp: cannot stat ‘/data/pgarchive/00000003.history’: No such file or directory
cp: cannot stat ‘/data/pgarchive/000000020000000000000007’: No such file or directory
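
The numbers in this log can be checked by hand. The Python sketch below (not from the original article) maps an LSN to the WAL segment file that contains it and compares the requested start point with the primary's flush position. It assumes the default 16 MB segment size, which matches the 16777216-byte files listed in Scenario 2. The function names are made up for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed: the default 16 MB segment size


def parse_lsn(text):
    """Convert an LSN such as '0/7000000' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline_id, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Name of the WAL segment that contains lsn on the given timeline."""
    segment_number = lsn // segment_size
    segments_per_log = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline_id,
        segment_number // segments_per_log,
        segment_number % segments_per_log,
    )


def is_ahead_of_primary(requested_start, primary_flush):
    """True if the standby asks for WAL the primary has not flushed yet."""
    return parse_lsn(requested_start) > parse_lsn(primary_flush)


if __name__ == "__main__":
    print(wal_file_name(2, parse_lsn("0/7000000")))  # 000000020000000000000007
    print(is_ahead_of_primary("0/7000000", "0/601A5D8"))  # True
```

The first result matches the segment that the cp command in the log fails to find, and the second confirms that the standby is requesting a position the primary has not yet reached.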

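When it occurs

The standby was at some point started standalone (outside the cluster, in the primary role). The timeline did not change, but the WAL files have already diverged
How to fix it
The standby needs to start replay from 0/7000000, which is ahead of the primary's flush position 0/601A5D8, so the two databases are no longer consistent. I tried changing the timeline with pg_resetwal and switching WAL files with pg_switch_wal(), but pg_rewind still could not handle the node because the WAL is not contiguous. The only option is to rebuild the standby.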

-- Change the timeline
postgres=# SELECT timeline_id,redo_wal_file FROM pg_control_checkpoint();
timeline_id | redo_wal_file
-------------+--------------------------
2 | 00000002000000000000000F
(1 row)

$ pg_resetwal -l 000000030000000000000010 /data/pgdata14/
Write-ahead log reset

-- Check the timeline after the change
postgres=# SELECT timeline_id,redo_wal_file FROM pg_control_checkpoint();
timeline_id | redo_wal_file
-------------+--------------------------
3 | 000000030000000000000012
(1 row)

-- Switch WAL
postgres=# select pg_switch_wal();
$ pg_ctl promote -D /data/pgdata14

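Summary
If a running standby was promoted to primary, pg_rewind can roll it back even if data was written afterwards, as long as the WAL is complete. Before starting the node after pg_rewind finishes, remember to fix the parameter file and the hba file, clean up the archived logs, and add standby.signal/recovery.conf
If a running standby was restarted as a primary, it cannot be rolled back even if nothing was done on it, and it has to be rebuilt. Once a node has run as a primary in between, its WAL can no longer be contiguous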
