Data Guard ORA-16825、ORA-16661解決一例
某天凌晨5點左右,被客戶方運維人員電話吵醒,客戶的交換機因故重啟,導致做了RHCS的兩臺Linux伺服器(非oracle伺服器)因心跳問題也導致重啟
第一時間登上去檢視oracle執行情況,發現主庫已是mount狀態,備庫已變為主庫,可以肯定是因為心跳問題導致了自動切換(生產上的DG環境配置了Fast-Start Failover)
BROKER連上去檢視狀態
jsrw - (*) Physical standby database (disabled)
ORA-16661: the standby database needs to be reinstated
(*) Fast-Start Failover target
Properties:
FastStartFailoverThreshold = '30'
OperationTimeout = '30'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
ObserverReconnect = '0'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
ObserverOverride = 'FALSE'
ExternalDestination1 = ''
ExternalDestination2 = ''
PrimaryLostWriteAction = 'CONTINUE'
Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: jsrw
Observer: CAHW50
Lag Limit: 30 seconds (not in use)
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Observer Reconnect: (none)
Observer Override: FALSE
Configuration Status:
ERROR
檢視主備庫的狀態
看樣子原主庫已被隔離,檢視alert日誌
原主庫日誌:
日誌中可以看到,主庫失去了備庫和observer的連結,已超時30秒,因此主庫猜測此時極有可能已經發生了自動故障轉移,然後將自己關掉
原備庫日誌:
備庫確實在失去主庫之後開始自動故障轉移,並最終轉移成功。
這裡要說一下,按理交換機恢復之後DG檢測到原主庫會自動將其轉換為備庫,但這次顯然沒有,還記得有兩個配置了RHCS的伺服器因網路問題重啟了,而客戶的observer剛好部署在上面,登上去檢視甚至沒有找到observer的日誌,為什麼observer沒有打日誌出來的?
嘗試開啟observer,重啟原主庫,寄希望於observer自動將失敗的主庫轉為備庫,但是在start observer的時候就遇到了問題
即無法停止,也無法啟動,猜測應該是伺服器重啟導致的狀態錯亂,此時情況變 的複雜,因交換機重啟導致網路不通,導致快速故障轉移,推測轉移工作未完全結束,observer的伺服器又被重啟,使得DG處在一個錯誤的狀態下,
現在已無法通過observer自動恢復失敗的主庫,只能通過手動解決,而日誌裡確認新主庫已轉移成功,那麼恢復失敗的主庫即可,那麼如何恢復呢?有很多選擇重建?利用flashback databse?或者利用broker的reinstate?
其實reinstate也是使用的flashback,那麼幹脆使用簡單的reinstate吧
檢視狀態
可見已經恢復成功,此時stop observer,start observer就沒有問題了,但為了保險起見,停止了observer,
這次顯然是交換機導致的烏龍failover,根本不需要failover的,本來的高可用措施反而帶來了些許麻煩,後來找機會又切換回原來的主庫了,自動故障轉移也一直沒有再啟動,屆時手動faiover更為靈活,因為客戶的應用並不夠靈活,此次應用也全面癱瘓,因為重啟的兩臺伺服器是redis的訊息中轉伺服器。
第一時間登上去檢視oracle執行情況,發現主庫已是mount狀態,備庫已變為主庫,可以肯定是因為心跳問題導致了自動切換(生產上的DG環境配置了Fast-Start Failover)
BROKER連上去檢視狀態
- DGMGRL> SHOW CONFIGURATION VERBOSE;
- Configuration - jsconf
- Protection Mode: MaxAvailability
- Databases:
- jsread - Primary database
- Error: ORA-16825: multiple errors or warnings, including fast-start failover-related errors or warnings, detected for the database
jsrw - (*) Physical standby database (disabled)
ORA-16661: the standby database needs to be reinstated
(*) Fast-Start Failover target
Properties:
FastStartFailoverThreshold = '30'
OperationTimeout = '30'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
ObserverReconnect = '0'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
ObserverOverride = 'FALSE'
ExternalDestination1 = ''
ExternalDestination2 = ''
PrimaryLostWriteAction = 'CONTINUE'
Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: jsrw
Observer: CAHW50
Lag Limit: 30 seconds (not in use)
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Observer Reconnect: (none)
Observer Override: FALSE
Configuration Status:
ERROR
檢視主備庫的狀態
-
DGMGRL> show database jsrw
Database - jsrw
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: (unknown)
Apply Lag: (unknown)
Apply Rate: (unknown)
Real Time Query: OFF
Instance(s):
jsrw
Database Status:
ORA-16661: the standby database needs to be reinstated
DGMGRL> show database jsread
Database - jsread
Role: PRIMARY
Intended State: TRANSPORT-ON
Instance(s):
jsread
Database Error(s):
ORA-16820: fast-start failover observer is no longer observing this database
Database Warning(s):
ORA-16817: unsynchronized fast-start failover configuration
Database Status:
ERROR
看樣子原主庫已被隔離,檢視alert日誌
原主庫日誌:
- Tue Dec 15 04:14:45 2015
- ORA-16198: LGWR received timedout error from KSR
- LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
- LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
- Error 16198 for archive log file 6 to 'jsread'
- ORA-16198: LGWR received timedout error from KSR
- LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'jsread'
- Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
- Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.
- It is likely an automatic failover has already occurred. Primary is shutting down.
- Errors in file /orasystem_readwrite/oracle/oraWR/diag/rdbms/jsrw/jsrw/trace/jsrw_lgwr_73912.trc:
- ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down
原備庫日誌:
- Tue Dec 15 04:14:47 2015
- RFS[44]: Assigned to RFS process 41247
- RFS[44]: Possible network disconnect with primary database
- Tue Dec 15 04:14:47 2015
- RFS[40]: Possible network disconnect with primary database
- Tue Dec 15 04:14:47 2015
- RFS[42]: Possible network disconnect with primary database
- Tue Dec 15 04:14:50 2015
- Attempting Fast-Start Failover because the threshold of 30 seconds has elapsed.
- Tue Dec 15 04:14:50 2015
- Data Guard Broker: Beginning failover
- Tue Dec 15 04:14:50 2015
- ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
- Tue Dec 15 04:14:50 2015
- MRP0: Background Media Recovery cancelled with status 16037
- Errors in file /orasystem_readonly/oracle/oraread/diag/rdbms/jsread/jsread/trace/jsread_pr00_59244.trc:
- ORA-16037: user requested cancel of managed recovery operation
- Managed Standby Recovery not using Real Time Apply
- Tue Dec 15 04:14:50 2015
- ALTER SYSTEM SET service_names='jsread' SCOPE=MEMORY SID='jsread';
- Recovery interrupted!
- Recovered data files to a consistent state at change 229593026
- Tue Dec 15 04:14:51 2015
- MRP0: Background Media Recovery process shutdown (jsread)
- Managed Standby Recovery Canceled (jsread)
- Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
- ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
- Attempt to do a Terminal Recovery (jsread)
- Media Recovery Start: Managed Standby Recovery (jsread)
- started logmerger process
- Tue Dec 15 04:14:51 2015
- Managed Standby Recovery not using Real Time Apply
- Parallel Media Recovery started with 64 slaves
- Media Recovery Waiting for thread 1 sequence 3176 (in transit)
- Killing 1 processes with pids 41252 (all RFS, wait for I/O) in order to disallow current and future RFS connections. Requested by OS process 130750
- Begin: Standby Redo Logfile archival
- End: Standby Redo Logfile archival
- Terminal Recovery timestamp is '12/15/2015 04:14:57'
- Terminal Recovery: applying standby redo logs.
- Terminal Recovery: thread 1 seq# 3176 redo required
- Terminal Recovery:
- Recovery of Online Redo Log: Thread 1 Group 9 Seq 3176 Reading mem 0
- Mem# 0: /oradata_readonly/jsread/stb_redo02.log
- Identified End-Of-Redo (failover) for thread 1 sequence 3176 at SCN 0xffff.ffffffff
- Incomplete Recovery applied until change 229593027 time 12/15/2015 04:14:14
- Media Recovery Complete (jsread)
- Terminal Recovery: successful completion
- Forcing ARSCN to IRSCN for TR 0:229593027
- Tue Dec 15 04:14:57 2015
- ARCH: Archival stopped, error occurred. Will continue retrying
- Attempt to set limbo arscn 0:229593027 irscn 0:229593027
- ORACLE Instance jsread - Archival ErrorResetting standby activation ID 486853183 (0x1d04ca3f)
- ORA-16014: log 9 sequence# 3176 not archived, no available destinations
- ORA-00312: online log 9 thread 1: '/oradata_readonly/jsread/stb_redo02.log'
- Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
- ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
- ALTER DATABASE SWITCHOVER TO PRIMARY (jsread)
- Maximum wait for role transition is 15 minutes.
- All dispatchers and shared servers shutdown
- CLOSE: killing server sessions.
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- Tue Dec 15 04:15:00 2015
- Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
- CLOSE: all sessions shutdown successfully.
- Tue Dec 15 04:15:01 2015
- SMON: disabling cache recovery
- Backup controlfile written to trace file /orasystem_readonly/oracle/oraread/diag/rdbms/jsread/jsread/trace/jsread_rsm0_79936.trc
- Standby terminal recovery start SCN: 229593026
- RESETLOGS after incomplete recovery UNTIL CHANGE 229593027
- Online log /oradata_readonly/jsread/redo01.log: Thread 1 Group 1 was previously cleared
- Online log /oradata_readonly/jsread/redo02.log: Thread 1 Group 2 was previously cleared
- Online log /oradata_readonly/jsread/redo03.log: Thread 1 Group 3 was previously cleared
- Online log /oradata_readonly/jsread/redo04.log: Thread 1 Group 4 was previously cleared
- Online log /oradata_readonly/jsread/redo05.log: Thread 1 Group 5 was previously cleared
- Online log /oradata_readonly/jsread/redo06.log: Thread 1 Group 6 was previously cleared
- Online log /oradata_readonly/jsread/redo07.log: Thread 1 Group 7 was previously cleared
- Online log /oradata_readonly/jsread/redo08.log: Thread 1 Group 16 was previously cleared
- Online log /oradata_readonly/jsread/redo09.log: Thread 1 Group 17 was previously cleared
- Online log /oradata_readonly/jsread/redo10.log: Thread 1 Group 18 was previously cleared
- Standby became primary SCN: 229593025
- Tue Dec 15 04:15:09 2015
- Setting recovery target incarnation to 5
- AUDIT_TRAIL initialization parameter is changed back to its original value as specified in the parameter file.
- Switchover: Complete - Database mounted as primary
- Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
- ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
- Completed: ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
- ALTER DATABASE OPEN
- Data Guard Broker initializing...
- Tue Dec 15 04:15:09 2015
- Assigning activation ID 488981393 (0x1d254391)
- LGWR: Primary database is in MAXIMUM AVAILABILITY mode
- Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
- LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
- Thread 1 advanced to log sequence 2 (thread open)
- Tue Dec 15 04:15:09 2015
- ARC2: Becoming the 'no SRL' ARCH
- Tue Dec 15 04:15:09 2015
- ARC3: Becoming the 'no SRL' ARCH
- Thread 1 opened at log sequence 2
- Current log# 2 seq# 2 mem# 0: /oradata_readonly/jsread/redo02.log
- Successful open of redo thread 1
- MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
- ARC2: Becoming the 'no SRL' ARCH
- SMON: enabling cache recovery
- Archived Log entry 3278 added for thread 1 sequence 1 ID 0x1d254391 dest 1:
- Archiver process freed from errors. No longer stopped
- Tue Dec 15 04:15:10 2015
- PING[ARC0]: Heartbeat failed to connect to standby 'jsrw'. Error is 16058.
- [79936] Successfully onlined Undo Tablespace 2.
- Undo initialization finished serial:0 start:1846831906 end:1846832126 diff:220 (2 seconds)
- Dictionary check beginning
- PING[ARC0]: Heartbeat failed to connect to standby 'jsrw'. Error is 16058.
- Dictionary check complete
- Verifying file header compatibility for 11g tablespace encryption..
- Verifying 11g file header compatibility for tablespace encryption completed
- SMON: enabling tx recovery
- Starting background process SMCO
- Database Characterset is ZHS16GBK
- Tue Dec 15 04:15:10 2015
- idle dispatcher 'D000' terminated, pid = (24, 1)
- Tue Dec 15 04:15:10 2015
- SMCO started with pid=24, OS id=131236
- No Resource Manager plan active
- Tue Dec 15 04:15:11 2015
- Starting background process QMNC
- Tue Dec 15 04:15:11 2015
- QMNC started with pid=35, OS id=131244
- LOGSTDBY: Validating controlfile with logical metadata
- LOGSTDBY: Validation complete
- Completed: ALTER DATABASE OPEN
備庫確實在失去主庫之後開始自動故障轉移,並最終轉移成功。
這裡要說一下,按理交換機恢復之後DG檢測到原主庫會自動將其轉換為備庫,但這次顯然沒有,還記得有兩個配置了RHCS的伺服器因網路問題重啟了,而客戶的observer剛好部署在上面,登上去檢視甚至沒有找到observer的日誌,為什麼observer沒有打日誌出來的?
嘗試開啟observer,重啟原主庫,寄希望於observer自動將失敗的主庫轉為備庫,但是在start observer的時候就遇到了問題
- DGMGRL> start observer
- Error: ORA-16647: could not start more than one observer
- Failed.
- DGMGRL> stop observer
- Error:
- ORA-01034: ORACLE not available
- Process ID: 0
- Session ID: 0 Serial number: 0
- Observer started
- [W000 12/15 05:30:13.80] Observer started.
- Observer stopped
- Error:
- ORA-01034: ORACLE not available
- Process ID: 0
- Session ID: 0 Serial number: 0
- [W000 12/15 05:32:05.86] Failed to start the Observer.
- Error: ORA-16647: could not start more than one observer
- [W000 12/15 05:33:40.06] Failed to start the Observer.
即無法停止,也無法啟動,猜測應該是伺服器重啟導致的狀態錯亂,此時情況變 的複雜,因交換機重啟導致網路不通,導致快速故障轉移,推測轉移工作未完全結束,observer的伺服器又被重啟,使得DG處在一個錯誤的狀態下,
現在已無法通過observer自動恢復失敗的主庫,只能通過手動解決,而日誌裡確認新主庫已轉移成功,那麼恢復失敗的主庫即可,那麼如何恢復呢?有很多選擇重建?利用flashback databse?或者利用broker的reinstate?
其實reinstate也是使用的flashback,那麼幹脆使用簡單的reinstate吧
- DGMGRL> reinstate database jsrw;
- Reinstating database "jsrw", please wait...
- Operation requires shutdown of instance "jsrw" on database "jsrw"
- Shutting down instance "jsrw"...
- ORA-01109: database not open
- Database dismounted.
- ORACLE instance shut down.
- Operation requires startup of instance "jsrw" on database "jsrw"
- Starting instance "jsrw"...
- ORACLE instance started.
- Database mounted.
- Continuing to reinstate database "jsrw" ...
- Reinstatement of database "jsrw" succeeded
檢視狀態
-
DGMGRL> show configuration verbose;
Configuration - jsconf
Protection Mode: MaxAvailability
Databases:
jsread - Primary database
Error: ORA-16820: fast-start failover observer is no longer observing this database
jsrw - (*) Physical standby database
Error: ORA-16820: fast-start failover observer is no longer observing this database
(*) Fast-Start Failover target
Properties:
FastStartFailoverThreshold = '30'
OperationTimeout = '30'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
ObserverReconnect = '0'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
ObserverOverride = 'FALSE'
ExternalDestination1 = ''
ExternalDestination2 = ''
PrimaryLostWriteAction = 'CONTINUE'
Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: jsrw
Observer: CAHW50
Lag Limit: 30 seconds (not in use)
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Observer Reconnect: (none)
Observer Override: FALSE
Configuration Status:
ERROR
- DGMGRL> stop observer;
- Done.
- DGMGRL> start observer;
- Observer started
這次顯然是交換機導致的烏龍failover,根本不需要failover的,本來的高可用措施反而帶來了些許麻煩,後來找機會又切換回原來的主庫了,自動故障轉移也一直沒有再啟動,屆時手動faiover更為靈活,因為客戶的應用並不夠靈活,此次應用也全面癱瘓,因為重啟的兩臺伺服器是redis的訊息中轉伺服器。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/26838672/viewspace-2075586/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- DATA GUARD部署模式——DATA GUARD概念和管理模式
- 介紹ORACLE DATA GUARD——DATA GUARD概念和管理Oracle
- Data guard搭建
- oracle data guard!!Oracle
- Data Guard配置Broker解決ORA-16664、ORA-16792
- Data Guard Broker系列之二:Data Guard Broker配置實戰
- DATA GUARD 簡介
- Data Guard 建立(ASM)ASM
- Oracle Data Guard配置Oracle
- Oracle 11g Data Guard Enabling Active Data GuardOracle
- [Data Guard]Oracle10g Data Guard學習筆記(一)Oracle筆記
- [Data Guard]Oracle10g Data Guard學習筆記(二)Oracle筆記
- [Data Guard]Oracle10g Data Guard學習筆記(三)Oracle筆記
- Oracle 11g Data Guard搭建過程中問題解決兩例Oracle
- 【DataGuard】部署Data Guard相關引數詳解
- Oracle Data Guard Broker元件Oracle元件
- Oracle Data Guard簡介Oracle
- 單機搭建Data Guard
- Active Data Guard初探(一)
- DATA GUARD架構(一)架構
- Oracle Data Guard 介紹Oracle
- Data Guard Wait EventsAI
- physical data guard 的原理
- ORACLE Data Guard--IOracle
- data guard switchover on solaris 10
- 技術白皮書:Oracle Data Guard 11gOracle Data Guard 理論知識OracleGo
- 1 關於 Oracle Data GuardOracle
- 2 Oracle Data Guard 安裝Oracle
- 1 Oracle Data Guard Broker 概念Oracle
- 【DG】Data Guard搭建(physical standby)
- Oracle Data Guard和Broker概述Oracle
- RedHat搭建物理Data GuardRedhat
- 搭建Active Data Guard環境
- flashback database 結合 data guardDatabase
- Oracle 11g Data GuardOracle
- 【DATAGUARD】Data Guard Wait EventsAI
- DATA GUARD 中alter database 命令Database
- Oracle11g Data GuardOracle