環境

症狀

問題原因

解決方案

環境

系統平臺： N/A

版本： 5.6.5

症狀

本文件旨在介紹出現叢集出現雙主問題時如何處理。

問題原因

repmgr 出現切換後，原主庫停止服務。手動啟動原主庫會導致資料庫叢集狀態異常，變為雙主。

解決方案

原主庫（192.168.80.221）查詢叢集狀態，發現備庫1（192.168.80.222）已經成為新的主庫，但是原主庫（192.168.80.221）仍然以主庫的模式啟動，這時候叢集就出現了雙主的問題。檢查後發現，備庫2（192.168.80.236）已經連線到了新主庫（192.168.80.222），此時可將原主庫（192.168.80.221）重新加入叢集作為備庫執行。

[highgo@localhost HighGo5.6.5-cluster]$ repmgr cluster show
ID | Name           | Role    | Status               | Upstream       | Location | Priority | Replication lag | Last replayed LSN
----+----------------+---------+----------------------+----------------+----------+----------+-----------------+-------------------
1  | 192.168.80.221 | primary | * running            |                | default  | 100      | n/a             | none            
2  | 192.168.80.222 | standby | ! running as primary | 192.168.80.221 | default  | 100      | 96 MB           | 2/D2013F28      
3  | 192.168.80.236 | standby |   running            | 192.168.80.221 | default  | 100      | 96 MB           | 2/D2013F28      
WARNING: following issues were detected
  - node "192.168.80.222" (ID: 2) is registered as standby but running as primary
[highgo@localhost HighGo5.6.5-cluster]$ pg_ctl stop -m f
等待伺服器程式關閉 ..... 完成
伺服器程式已經關閉

測試是否可以將本節點重新加入叢集：

[highgo@localhost HighGo5.6.5-cluster]$ repmgr node rejoin  -d 'host=192.168.80.222 dbname=highgo user=highgo' --force-rewind --config-files=postgresql.conf --verbose --dry-run  
INFO: looking for configuration file in "/opt/HighGo5.6.5-cluster"
INFO: configuration file found at: "/opt/HighGo5.6.5-cluster/conf/hg_repmgr.conf"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_primary_node_id():
SELECT node_id                   FROM repmgr.nodes     WHERE type = 'primary'    AND active IS TRUE 
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name   FROM repmgr.nodes n  WHERE n.node_id = 2
DEBUG: connecting to: "user=highgo connect_timeout=2 dbname=highgo host=192.168.80.222 port=5866 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 6823598896860049498
DEBUG: local timeline: 25; rejoin target timeline: 26
DEBUG: get_timeline_history():
TIMELINE_HISTORY 26
DEBUG: local tli: 25; local_xlogpos: 2/DE000028; follow_target_history->tli: 25; follow_target_history->end: 2/D2013F28
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 26 forked off current database system timeline 25 before current recovery point 2/DE000028
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'full_page_writes' AND setting = 'off'
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'wal_log_hints' AND setting = 'on'
INFO: prerequisites for using pg_rewind are met
DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
INFO: temporary archive directory "/tmp/repmgr-config-archive-192.168.80.221" created
INFO: file "postgresql.conf" would be copied to "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf"
INFO: 1 files would have been copied to "/tmp/repmgr-config-archive-192.168.80.221"
INFO: temporary archive directory "/tmp/repmgr-config-archive-192.168.80.221" deleted
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
  /opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'
INFO: prerequisites for executing NODE REJOIN are met

正式將節點重新加入叢集：

[highgo@localhost HighGo5.6.5-cluster]$ repmgr node rejoin  -d 'host=192.168.80.222 dbname=highgo user=highgo' --force-rewind --config-files=postgresql.conf --verbose
INFO: looking for configuration file in "/opt/HighGo5.6.5-cluster"
INFO: configuration file found at: "/opt/HighGo5.6.5-cluster/conf/hg_repmgr.conf"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_primary_node_id():
SELECT node_id                   FROM repmgr.nodes     WHERE type = 'primary'    AND active IS TRUE 
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name   FROM repmgr.nodes n  WHERE n.node_id = 2
DEBUG: connecting to: "user=highgo connect_timeout=2 dbname=highgo host=192.168.80.222 port=5866 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: local timeline: 25; rejoin target timeline: 26
DEBUG: get_timeline_history():
TIMELINE_HISTORY 26
DEBUG: local tli: 25; local_xlogpos: 2/DE000028; follow_target_history->tli: 25; follow_target_history->end: 2/D2013F28
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 26 forked off current database system timeline 25 before current recovery point 2/DE000028
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'full_page_writes' AND setting = 'off'
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'wal_log_hints' AND setting = 'on'
INFO: prerequisites for using pg_rewind are met
DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
DEBUG: copying "postgresql.conf" to "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf"
INFO: 1 files copied to "/tmp/repmgr-config-archive-192.168.80.221"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'"
DEBUG: executing:
  /opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'
DEBUG: result of command was 0 (13)
DEBUG: local_command(): output returned was:
servers diverged at WAL location 2/D2013F28 on timeline 25
DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
DEBUG: copying "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf" to "/opt/HighGo5.6.5-cluster/data/postgresql.conf"
NOTICE: 1 files copied to /opt/HighGo5.6.5-cluster/data
INFO: directory "/tmp/repmgr-config-archive-192.168.80.221" deleted
INFO: deleting "recovery.done"
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name   FROM repmgr.nodes n  WHERE n.node_id = 1
NOTICE: setting node 1's upstream to node 2
DEBUG: create_recovery_file(): creating "/opt/HighGo5.6.5-cluster/data/recovery.conf"...
DEBUG: recovery file is:
standby_mode = 'on'
primary_conninfo = 'user=highgo connect_timeout=2 host=192.168.80.222 port=5866 application_name=192.168.80.221'
recovery_target_timeline = 'latest'
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_NO_RESPONSE
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/opt/HighGo5.6.5-cluster/bin/pg_ctl  -w -D '/opt/HighGo5.6.5-cluster/data' start"
DEBUG: executing:
  /opt/HighGo5.6.5-cluster/bin/pg_ctl  -w -D '/opt/HighGo5.6.5-cluster/data' start
DEBUG: result of command was 0 (13)
DEBUG: local_command(): output returned was:
等待伺服器程式啟動 ....2020-06-25 09:41:54.523 CST [5308] WARNING:  01000: gdb version should large than 7.10
DEBUG: update_node_record_status():
    UPDATE repmgr.nodes      SET type = 'standby',          upstream_node_id = 2,          active = TRUE    WHERE node_id = 1
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
INFO: waiting for node 1 to respond to pings; 1 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
DEBUG: sleeping 1 second waiting for node 1 to respond to pings; 2 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
DEBUG: sleeping 1 second waiting for node 1 to respond to pings; 3 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_OK
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
DEBUG: _create_event(): event is "node_rejoin" for node 1
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2020-06-30 10:15:47.522104+08"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

重新查詢叢集狀態，叢集狀態正常。

[highgo@localhost HighGo5.6.5-cluster]$ repmgr cluster show
ID | Name           | Role    | Status               | Upstream       | Location | Priority | Replication lag | Last replayed LSN
----+----------------+---------+----------------------+----------------+----------+----------+-----------------+-------------------
1  | 192.168.80.221 | standby |   running            |  192.168.80.221 | default  | 100      | 0 MB           | 2/D4013F28            
2  | 192.168.80.222 | primary |   running            |                 | default  | 100      | n/a            | none      
3  | 192.168.80.236 | standby |   running            | 192.168.80.221  | default  | 100      | 0 MB           | 2/D4013F28

repmgr 叢集雙主問題處理

相關文章