Resolving Data Guard ORA-16825 and ORA-16661: A Case Study

Posted by liupzmin on 2016-04-05
Around 5 a.m. one morning I was woken by a phone call from the customer's ops team: a network switch had rebooted unexpectedly, and the resulting heartbeat loss had in turn rebooted the two Linux servers running RHCS (not the Oracle servers).

I logged in right away to check on Oracle and found the original primary sitting in MOUNT state while the standby had already become the new primary. Clearly the heartbeat problem had triggered an automatic failover (the production Data Guard environment has Fast-Start Failover configured).

Connect through the broker and check the status:

  DGMGRL> SHOW CONFIGURATION VERBOSE;

  Configuration - jsconf

    Protection Mode: MaxAvailability
    Databases:
      jsread - Primary database
        Error: ORA-16825: multiple errors or warnings, including fast-start failover-related errors or warnings, detected for the database

      jsrw   - (*) Physical standby database (disabled)
        ORA-16661: the standby database needs to be reinstated

  (*) Fast-Start Failover target


  Properties:
    FastStartFailoverThreshold      = '30'
    OperationTimeout                = '30'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'


Fast-Start Failover: ENABLED


  Threshold:          30 seconds
  Target:             jsrw
  Observer:           CAHW50
  Lag Limit:          30 seconds (not in use)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE


Configuration Status:
ERROR
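
For context, all of the broker output in this post comes from a DGMGRL session. A typical way to connect is sketched below; the connect string and account are placeholders, not details taken from the customer's environment:

  $ dgmgrl
  DGMGRL> connect sys@jsread
  Password:
  Connected.
  DGMGRL> SHOW CONFIGURATION VERBOSE;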


Check the status of the primary and the standby individually:

  DGMGRL> show database jsrw


    Database - jsrw


      Role:            PHYSICAL STANDBY
      Intended State:  APPLY-ON
      Transport Lag:   (unknown)
      Apply Lag:       (unknown)
      Apply Rate:      (unknown)
      Real Time Query: OFF
      Instance(s):
        jsrw


    Database Status:
    ORA-16661: the standby database needs to be reinstated


    DGMGRL> show database jsread


    Database - jsread


      Role:            PRIMARY
      Intended State:  TRANSPORT-ON
      Instance(s):
        jsread


      Database Error(s):
        ORA-16820: fast-start failover observer is no longer observing this database


      Database Warning(s):
        ORA-16817: unsynchronized fast-start failover configuration


    Database Status:
    ERROR


It looks as though the original primary has been fenced off. Time to check the alert logs.

Alert log of the original primary:

  Tue Dec 15 04:14:45 2015
  ORA-16198: LGWR received timedout error from KSR
  LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
  LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
  Error 16198 for archive log file 6 to 'jsread'
  ORA-16198: LGWR received timedout error from KSR
  LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'jsread'
  Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
  Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.
  It is likely an automatic failover has already occurred. Primary is shutting down.
  Errors in file /orasystem_readwrite/oracle/oraWR/diag/rdbms/jsrw/jsrw/trace/jsrw_lgwr_73912.trc:
  ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down
The log shows that the primary lost contact with both the standby and the observer for longer than the 30-second threshold. It therefore concluded that an automatic failover had most likely already taken place, and shut itself down.
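
That 30-second window is the FastStartFailoverThreshold property shown in the configuration above. If short network interruptions such as a switch reboot are a known risk, the threshold can be raised through the broker so that a brief outage does not trigger a failover (the value 120 below is purely illustrative, not a recommendation from this incident):

  DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 120;
  Property "faststartfailoverthreshold" updated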

Alert log of the original standby:

  Tue Dec 15 04:14:47 2015
  RFS[44]: Assigned to RFS process 41247
  RFS[44]: Possible network disconnect with primary database
  Tue Dec 15 04:14:47 2015
  RFS[40]: Possible network disconnect with primary database
  Tue Dec 15 04:14:47 2015
  RFS[42]: Possible network disconnect with primary database
  Tue Dec 15 04:14:50 2015
  Attempting Fast-Start Failover because the threshold of 30 seconds has elapsed.
  Tue Dec 15 04:14:50 2015
  Data Guard Broker: Beginning failover
  Tue Dec 15 04:14:50 2015
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
  Tue Dec 15 04:14:50 2015
  MRP0: Background Media Recovery cancelled with status 16037
  Errors in file /orasystem_readonly/oracle/oraread/diag/rdbms/jsread/jsread/trace/jsread_pr00_59244.trc:
  ORA-16037: user requested cancel of managed recovery operation
  Managed Standby Recovery not using Real Time Apply
  Tue Dec 15 04:14:50 2015
  ALTER SYSTEM SET service_names='jsread' SCOPE=MEMORY SID='jsread';
  Recovery interrupted!
  Recovered data files to a consistent state at change 229593026
  Tue Dec 15 04:14:51 2015
  MRP0: Background Media Recovery process shutdown (jsread)
  Managed Standby Recovery Canceled (jsread)
  Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
  Attempt to do a Terminal Recovery (jsread)
  Media Recovery Start: Managed Standby Recovery (jsread)
  started logmerger process
  Tue Dec 15 04:14:51 2015
  Managed Standby Recovery not using Real Time Apply
  Parallel Media Recovery started with 64 slaves
  Media Recovery Waiting for thread 1 sequence 3176 (in transit)
  Killing 1 processes with pids 41252 (all RFS, wait for I/O) in order to disallow current and future RFS connections. Requested by OS process 130750
  Begin: Standby Redo Logfile archival
  End: Standby Redo Logfile archival
  Terminal Recovery timestamp is '12/15/2015 04:14:57'
  Terminal Recovery: applying standby redo logs.
  Terminal Recovery: thread 1 seq# 3176 redo required
  Terminal Recovery:
  Recovery of Online Redo Log: Thread 1 Group 9 Seq 3176 Reading mem 0
  Mem# 0: /oradata_readonly/jsread/stb_redo02.log
  Identified End-Of-Redo (failover) for thread 1 sequence 3176 at SCN 0xffff.ffffffff
  Incomplete Recovery applied until change 229593027 time 12/15/2015 04:14:14
  Media Recovery Complete (jsread)
  Terminal Recovery: successful completion
  Forcing ARSCN to IRSCN for TR 0:229593027
  Tue Dec 15 04:14:57 2015
  ARCH: Archival stopped, error occurred. Will continue retrying
  Attempt to set limbo arscn 0:229593027 irscn 0:229593027
  ORACLE Instance jsread - Archival ErrorResetting standby activation ID 486853183 (0x1d04ca3f)
  ORA-16014: log 9 sequence# 3176 not archived, no available destinations
  ORA-00312: online log 9 thread 1: '/oradata_readonly/jsread/stb_redo02.log'
  Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
  ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
  ALTER DATABASE SWITCHOVER TO PRIMARY (jsread)
  Maximum wait for role transition is 15 minutes.
  All dispatchers and shared servers shutdown
  CLOSE: killing server sessions.
  Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
  [... the same "Active process 117884" message repeated several dozen times ...]
  Tue Dec 15 04:15:00 2015
  Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
  CLOSE: all sessions shutdown successfully.
  Tue Dec 15 04:15:01 2015
  SMON: disabling cache recovery
  Backup controlfile written to trace file /orasystem_readonly/oracle/oraread/diag/rdbms/jsread/jsread/trace/jsread_rsm0_79936.trc
  Standby terminal recovery start SCN: 229593026
  RESETLOGS after incomplete recovery UNTIL CHANGE 229593027
  Online log /oradata_readonly/jsread/redo01.log: Thread 1 Group 1 was previously cleared
  Online log /oradata_readonly/jsread/redo02.log: Thread 1 Group 2 was previously cleared
  Online log /oradata_readonly/jsread/redo03.log: Thread 1 Group 3 was previously cleared
  Online log /oradata_readonly/jsread/redo04.log: Thread 1 Group 4 was previously cleared
  Online log /oradata_readonly/jsread/redo05.log: Thread 1 Group 5 was previously cleared
  Online log /oradata_readonly/jsread/redo06.log: Thread 1 Group 6 was previously cleared
  Online log /oradata_readonly/jsread/redo07.log: Thread 1 Group 7 was previously cleared
  Online log /oradata_readonly/jsread/redo08.log: Thread 1 Group 16 was previously cleared
  Online log /oradata_readonly/jsread/redo09.log: Thread 1 Group 17 was previously cleared
  Online log /oradata_readonly/jsread/redo10.log: Thread 1 Group 18 was previously cleared
  Standby became primary SCN: 229593025
  Tue Dec 15 04:15:09 2015
  Setting recovery target incarnation to 5
  AUDIT_TRAIL initialization parameter is changed back to its original value as specified in the parameter file.
  Switchover: Complete - Database mounted as primary
  Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
  ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
  Completed: ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
  ALTER DATABASE OPEN
  Data Guard Broker initializing...
  Tue Dec 15 04:15:09 2015
  Assigning activation ID 488981393 (0x1d254391)
  LGWR: Primary database is in MAXIMUM AVAILABILITY mode
  Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
  LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
  Thread 1 advanced to log sequence 2 (thread open)
  Tue Dec 15 04:15:09 2015
  ARC2: Becoming the 'no SRL' ARCH
  Tue Dec 15 04:15:09 2015
  ARC3: Becoming the 'no SRL' ARCH
  Thread 1 opened at log sequence 2
  Current log# 2 seq# 2 mem# 0: /oradata_readonly/jsread/redo02.log
  Successful open of redo thread 1
  MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
  ARC2: Becoming the 'no SRL' ARCH
  SMON: enabling cache recovery
  Archived Log entry 3278 added for thread 1 sequence 1 ID 0x1d254391 dest 1:
  Archiver process freed from errors. No longer stopped
  Tue Dec 15 04:15:10 2015
  PING[ARC0]: Heartbeat failed to connect to standby 'jsrw'. Error is 16058.
  [79936] Successfully onlined Undo Tablespace 2.
  Undo initialization finished serial:0 start:1846831906 end:1846832126 diff:220 (2 seconds)
  Dictionary check beginning
  PING[ARC0]: Heartbeat failed to connect to standby 'jsrw'. Error is 16058.
  Dictionary check complete
  Verifying file header compatibility for 11g tablespace encryption..
  Verifying 11g file header compatibility for tablespace encryption completed
  SMON: enabling tx recovery
  Starting background process SMCO
  Database Characterset is ZHS16GBK
  Tue Dec 15 04:15:10 2015
  idle dispatcher 'D000' terminated, pid = (24, 1)
  Tue Dec 15 04:15:10 2015
  SMCO started with pid=24, OS id=131236
  No Resource Manager plan active
  Tue Dec 15 04:15:11 2015
  Starting background process QMNC
  Tue Dec 15 04:15:11 2015
  QMNC started with pid=35, OS id=131244
  LOGSTDBY: Validating controlfile with logical metadata
  LOGSTDBY: Validation complete
  Completed: ALTER DATABASE OPEN

The standby did indeed start a fast-start failover after losing the primary, and the failover completed successfully.
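
Besides the broker view, the role change can be double-checked from SQL*Plus on the new primary; once the failover and ALTER DATABASE OPEN have completed, this query should report PRIMARY and READ WRITE:

  SQL> SELECT database_role, open_mode FROM v$database;

  DATABASE_ROLE    OPEN_MODE
  ---------------- --------------------
  PRIMARY          READ WRITE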

One thing worth noting: in theory, once the switch came back Data Guard should have detected the failed primary and automatically reinstated it as a standby. That clearly did not happen this time. Remember the two RHCS servers that rebooted over the network problem? The customer's observer happened to be deployed on one of them. Logging in, I could not even find an observer log. Why had the observer written no log at all?
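
One plausible explanation for the missing log: by default the observer writes its output only to the terminal it was started from, so nothing survives a server reboot. Starting the observer with an explicit log file and observer state file keeps a trace of its decisions on disk (the file paths below are hypothetical examples, not the customer's actual paths):

  $ dgmgrl -logfile /home/oracle/observer.log sys@jsread
  DGMGRL> START OBSERVER FILE='/home/oracle/fsfo.dat';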

I tried starting the observer and restarting the failed primary, hoping the observer would automatically convert the failed primary back into a standby. But trouble appeared the moment I issued START OBSERVER:


  DGMGRL> start observer
  Error: ORA-16647: could not start more than one observer
  Failed.
  DGMGRL> stop observer
  Error:
  ORA-01034: ORACLE not available
  Process ID: 0
  Session ID: 0 Serial number: 0

  Observer started
  [W000 12/15 05:30:13.80] Observer started.
  Observer stopped
  Error:
  ORA-01034: ORACLE not available
  Process ID: 0
  Session ID: 0 Serial number: 0
  [W000 12/15 05:32:05.86] Failed to start the Observer.
  Error: ORA-16647: could not start more than one observer
  [W000 12/15 05:33:40.06] Failed to start the Observer.

The observer could neither be stopped nor started; presumably the server reboot had left its state inconsistent. The situation was now more complicated: the switch reboot had cut the network and triggered the fast-start failover, and before the failover bookkeeping could fully finish, the observer's server was itself rebooted, leaving the Data Guard configuration in a bad state.

With the observer unable to reinstate the failed primary automatically, it had to be recovered by hand. The new primary's alert log confirmed that the failover itself had completed, so all that remained was to restore the failed primary. How? There were several options: rebuild the standby from scratch, use Flashback Database, or use the broker's REINSTATE command.

Since REINSTATE uses flashback under the hood anyway, the simplest route was just to use REINSTATE:
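
A note of caution before running REINSTATE: because it is driven by Flashback Database, it can only succeed if flashback logging was enabled on the failed primary before the failover. This is easy to verify once the failed database is mounted; if it reports NO, the remaining option is to rebuild the standby, e.g. from an RMAN backup:

  SQL> SELECT flashback_on FROM v$database;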


  DGMGRL> reinstate database jsrw;
  Reinstating database "jsrw", please wait...
  Operation requires shutdown of instance "jsrw" on database "jsrw"
  Shutting down instance "jsrw"...
  ORA-01109: database not open
  Database dismounted.
  ORACLE instance shut down.
  Operation requires startup of instance "jsrw" on database "jsrw"
  Starting instance "jsrw"...
  ORACLE instance started.
  Database mounted.
  Continuing to reinstate database "jsrw" ...
  Reinstatement of database "jsrw" succeeded

Check the status again:

  DGMGRL> show configuration verbose;


    Configuration - jsconf


      Protection Mode: MaxAvailability
      Databases:
        jsread - Primary database
          Error: ORA-16820: fast-start failover observer is no longer observing this database


        jsrw   - (*) Physical standby database
          Error: ORA-16820: fast-start failover observer is no longer observing this database


      (*) Fast-Start Failover target


      Properties:
        FastStartFailoverThreshold      = '30'
        OperationTimeout                = '30'
        FastStartFailoverLagLimit       = '30'
        CommunicationTimeout            = '180'
        ObserverReconnect               = '0'
        FastStartFailoverAutoReinstate  = 'TRUE'
        FastStartFailoverPmyShutdown    = 'TRUE'
        BystandersFollowRoleChange      = 'ALL'
        ObserverOverride                = 'FALSE'
        ExternalDestination1            = ''
        ExternalDestination2            = ''
        PrimaryLostWriteAction          = 'CONTINUE'


    Fast-Start Failover: ENABLED


      Threshold:          30 seconds
      Target:             jsrw
      Observer:           CAHW50
      Lag Limit:          30 seconds (not in use)
      Shutdown Primary:   TRUE
      Auto-reinstate:     TRUE
      Observer Reconnect: (none)
      Observer Override:  FALSE


    Configuration Status:
    ERROR

The reinstatement clearly succeeded. At this point STOP OBSERVER and START OBSERVER both worked again without errors; to be on the safe side, the observer was stopped and restarted anyway:

  DGMGRL> stop observer;
  Done.
  DGMGRL> start observer;
  Observer started

This was plainly a false-alarm failover caused by the switch reboot; no failover was actually needed, and the high-availability setup ended up causing trouble of its own. We later found a window to switch back to the original primary, and automatic failover has not been re-enabled since: a manual failover is more flexible, and the customer's application is not flexible enough to follow an automatic one. The application was in any case completely down during this incident, because the two rebooted servers also hosted the Redis message brokers.
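
For anyone who reaches the same conclusion, that automatic failover causes more trouble than it prevents in their environment, it can be switched off through the broker while leaving the rest of the Data Guard configuration intact:

  DGMGRL> DISABLE FAST_START FAILOVER;
  Disabled.
  DGMGRL> STOP OBSERVER;
  Done.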


From "ITPUB Blog". Link: http://blog.itpub.net/26838672/viewspace-2075586/. Please credit the source when reposting; unauthorized reproduction may be pursued legally.
