Resolving Data Guard ORA-16825 and ORA-16661: A Case Study

Posted by liupzmin on 2016-04-05
Around 5 a.m. one morning I was woken by a phone call from the customer's ops team: a network switch had rebooted unexpectedly, and the resulting heartbeat loss had in turn rebooted the two Linux servers running RHCS (not the Oracle servers).

I logged in right away to check on Oracle and found the original primary sitting in MOUNT state while the standby had already become the new primary. Clearly the heartbeat problem had triggered an automatic failover (the production Data Guard environment has Fast-Start Failover configured).

Connect through the broker and check the status:

  DGMGRL> SHOW CONFIGURATION VERBOSE;

  Configuration - jsconf

    Protection Mode: MaxAvailability
    Databases:
      jsread - Primary database
        Error: ORA-16825: multiple errors or warnings, including fast-start failover-related errors or warnings, detected for the database

      jsrw   - (*) Physical standby database (disabled)
        ORA-16661: the standby database needs to be reinstated

  (*) Fast-Start Failover target


  Properties:
    FastStartFailoverThreshold      = '30'
    OperationTimeout                = '30'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'


Fast-Start Failover: ENABLED


  Threshold:          30 seconds
  Target:             jsrw
  Observer:           CAHW50
  Lag Limit:          30 seconds (not in use)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE


Configuration Status:
ERROR
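
For context, all of the broker output in this post comes from a DGMGRL session. A typical way to connect is sketched below; the connect string and account are placeholders, not details taken from the customer's environment:

  $ dgmgrl
  DGMGRL> connect sys@jsread
  Password:
  Connected.
  DGMGRL> SHOW CONFIGURATION VERBOSE;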


Check the status of the primary and the standby individually:

  DGMGRL> show database jsrw


    Database - jsrw


      Role:            PHYSICAL STANDBY
      Intended State:  APPLY-ON
      Transport Lag:   (unknown)
      Apply Lag:       (unknown)
      Apply Rate:      (unknown)
      Real Time Query: OFF
      Instance(s):
        jsrw


    Database Status:
    ORA-16661: the standby database needs to be reinstated


    DGMGRL> show database jsread


    Database - jsread


      Role:            PRIMARY
      Intended State:  TRANSPORT-ON
      Instance(s):
        jsread


      Database Error(s):
        ORA-16820: fast-start failover observer is no longer observing this database


      Database Warning(s):
        ORA-16817: unsynchronized fast-start failover configuration


    Database Status:
    ERROR


It looks as though the original primary has been fenced off. Time to check the alert logs.

Alert log of the original primary:

  Tue Dec 15 04:14:45 2015
  ORA-16198: LGWR received timedout error from KSR
  LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
  LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
  Error 16198 for archive log file 6 to 'jsread'
  ORA-16198: LGWR received timedout error from KSR
  LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'jsread'
  Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
  Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.
  It is likely an automatic failover has already occurred. Primary is shutting down.
  Errors in file /orasystem_readwrite/oracle/oraWR/diag/rdbms/jsrw/jsrw/trace/jsrw_lgwr_73912.trc:
  ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down
The log shows that the primary lost contact with both the standby and the observer for longer than the 30-second threshold. It therefore concluded that an automatic failover had most likely already taken place, and shut itself down.
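
That 30-second window is the FastStartFailoverThreshold property shown in the configuration above. If short network interruptions such as a switch reboot are a known risk, the threshold can be raised through the broker so that a brief outage does not trigger a failover (the value 120 below is purely illustrative, not a recommendation from this incident):

  DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 120;
  Property "faststartfailoverthreshold" updated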

Alert log of the original standby:

  Tue Dec 15 04:14:47 2015
  RFS[44]: Assigned to RFS process 41247
  RFS[44]: Possible network disconnect with primary database
  Tue Dec 15 04:14:47 2015
  RFS[40]: Possible network disconnect with primary database
  Tue Dec 15 04:14:47 2015
  RFS[42]: Possible network disconnect with primary database
  Tue Dec 15 04:14:50 2015
  Attempting Fast-Start Failover because the threshold of 30 seconds has elapsed.
  Tue Dec 15 04:14:50 2015
  Data Guard Broker: Beginning failover
  Tue Dec 15 04:14:50 2015
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
  Tue Dec 15 04:14:50 2015
  MRP0: Background Media Recovery cancelled with status 16037
  Errors in file /orasystem_readonly/oracle/oraread/diag/rdbms/jsread/jsread/trace/jsread_pr00_59244.trc:
  ORA-16037: user requested cancel of managed recovery operation
  Managed Standby Recovery not using Real Time Apply
  Tue Dec 15 04:14:50 2015
  ALTER SYSTEM SET service_names='jsread' SCOPE=MEMORY SID='jsread';
  Recovery interrupted!
  Recovered data files to a consistent state at change 229593026
  Tue Dec 15 04:14:51 2015
  MRP0: Background Media Recovery process shutdown (jsread)
  Managed Standby Recovery Canceled (jsread)
  Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
  Attempt to do a Terminal Recovery (jsread)
  Media Recovery Start: Managed Standby Recovery (jsread)
  started logmerger process
  Tue Dec 15 04:14:51 2015
  Managed Standby Recovery not using Real Time Apply
  Parallel Media Recovery started with 64 slaves
  Media Recovery Waiting for thread 1 sequence 3176 (in transit)
  Killing 1 processes with pids 41252 (all RFS, wait for I/O) in order to disallow current and future RFS connections. Requested by OS process 130750
  Begin: Standby Redo Logfile archival
  End: Standby Redo Logfile archival
  Terminal Recovery timestamp is '12/15/2015 04:14:57'
  Terminal Recovery: applying standby redo logs.
  Terminal Recovery: thread 1 seq# 3176 redo required
  Terminal Recovery:
  Recovery of Online Redo Log: Thread 1 Group 9 Seq 3176 Reading mem 0
  Mem# 0: /oradata_readonly/jsread/stb_redo02.log
  Identified End-Of-Redo (failover) for thread 1 sequence 3176 at SCN 0xffff.ffffffff
  Incomplete Recovery applied until change 229593027 time 12/15/2015 04:14:14
  Media Recovery Complete (jsread)
  Terminal Recovery: successful completion
  Forcing ARSCN to IRSCN for TR 0:229593027
  Tue Dec 15 04:14:57 2015
  ARCH: Archival stopped, error occurred. Will continue retrying
  Attempt to set limbo arscn 0:229593027 irscn 0:229593027
  ORACLE Instance jsread - Archival ErrorResetting standby activation ID 486853183 (0x1d04ca3f)
  ORA-16014: log 9 sequence# 3176 not archived, no available destinations
  ORA-00312: online log 9 thread 1: '/oradata_readonly/jsread/stb_redo02.log'
  Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
  ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
  ALTER DATABASE SWITCHOVER TO PRIMARY (jsread)
  Maximum wait for role transition is 15 minutes.
  All dispatchers and shared servers shutdown
  CLOSE: killing server sessions.
  Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
  [... the same "Active process 117884" message repeated several dozen times ...]
  Tue Dec 15 04:15:00 2015
  Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
  CLOSE: all sessions shutdown successfully.
  Tue Dec 15 04:15:01 2015
  SMON: disabling cache recovery
  Backup controlfile written to trace file /orasystem_readonly/oracle/oraread/diag/rdbms/jsread/jsread/trace/jsread_rsm0_79936.trc
  Standby terminal recovery start SCN: 229593026
  RESETLOGS after incomplete recovery UNTIL CHANGE 229593027
  Online log /oradata_readonly/jsread/redo01.log: Thread 1 Group 1 was previously cleared
  Online log /oradata_readonly/jsread/redo02.log: Thread 1 Group 2 was previously cleared
  Online log /oradata_readonly/jsread/redo03.log: Thread 1 Group 3 was previously cleared
  Online log /oradata_readonly/jsread/redo04.log: Thread 1 Group 4 was previously cleared
  Online log /oradata_readonly/jsread/redo05.log: Thread 1 Group 5 was previously cleared
  Online log /oradata_readonly/jsread/redo06.log: Thread 1 Group 6 was previously cleared
  Online log /oradata_readonly/jsread/redo07.log: Thread 1 Group 7 was previously cleared
  Online log /oradata_readonly/jsread/redo08.log: Thread 1 Group 16 was previously cleared
  Online log /oradata_readonly/jsread/redo09.log: Thread 1 Group 17 was previously cleared
  Online log /oradata_readonly/jsread/redo10.log: Thread 1 Group 18 was previously cleared
  Standby became primary SCN: 229593025
  Tue Dec 15 04:15:09 2015
  Setting recovery target incarnation to 5
  AUDIT_TRAIL initialization parameter is changed back to its original value as specified in the parameter file.
  Switchover: Complete - Database mounted as primary
  Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
  ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
  Completed: ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
  ALTER DATABASE OPEN
  Data Guard Broker initializing...
  Tue Dec 15 04:15:09 2015
  Assigning activation ID 488981393 (0x1d254391)
  LGWR: Primary database is in MAXIMUM AVAILABILITY mode
  Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
  LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
  Thread 1 advanced to log sequence 2 (thread open)
  Tue Dec 15 04:15:09 2015
  ARC2: Becoming the 'no SRL' ARCH
  Tue Dec 15 04:15:09 2015
  ARC3: Becoming the 'no SRL' ARCH
  Thread 1 opened at log sequence 2
  Current log# 2 seq# 2 mem# 0: /oradata_readonly/jsread/redo02.log
  Successful open of redo thread 1
  MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
  ARC2: Becoming the 'no SRL' ARCH
  SMON: enabling cache recovery
  Archived Log entry 3278 added for thread 1 sequence 1 ID 0x1d254391 dest 1:
  Archiver process freed from errors. No longer stopped
  Tue Dec 15 04:15:10 2015
  PING[ARC0]: Heartbeat failed to connect to standby 'jsrw'. Error is 16058.
  [79936] Successfully onlined Undo Tablespace 2.
  Undo initialization finished serial:0 start:1846831906 end:1846832126 diff:220 (2 seconds)
  Dictionary check beginning
  PING[ARC0]: Heartbeat failed to connect to standby 'jsrw'. Error is 16058.
  Dictionary check complete
  Verifying file header compatibility for 11g tablespace encryption..
  Verifying 11g file header compatibility for tablespace encryption completed
  SMON: enabling tx recovery
  Starting background process SMCO
  Database Characterset is ZHS16GBK
  Tue Dec 15 04:15:10 2015
  idle dispatcher 'D000' terminated, pid = (24, 1)
  Tue Dec 15 04:15:10 2015
  SMCO started with pid=24, OS id=131236
  No Resource Manager plan active
  Tue Dec 15 04:15:11 2015
  Starting background process QMNC
  Tue Dec 15 04:15:11 2015
  QMNC started with pid=35, OS id=131244
  LOGSTDBY: Validating controlfile with logical metadata
  LOGSTDBY: Validation complete
  Completed: ALTER DATABASE OPEN

The standby did indeed start a fast-start failover after losing the primary, and the failover completed successfully.
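
Besides the broker view, the role change can be double-checked from SQL*Plus on the new primary; once the failover and ALTER DATABASE OPEN have completed, this query should report PRIMARY and READ WRITE:

  SQL> SELECT database_role, open_mode FROM v$database;

  DATABASE_ROLE    OPEN_MODE
  ---------------- --------------------
  PRIMARY          READ WRITE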

One thing worth noting: in theory, once the switch came back Data Guard should have detected the failed primary and automatically reinstated it as a standby. That clearly did not happen this time. Remember the two RHCS servers that rebooted over the network problem? The customer's observer happened to be deployed on one of them. Logging in, I could not even find an observer log. Why had the observer written no log at all?
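
One plausible explanation for the missing log: by default the observer writes its output only to the terminal it was started from, so nothing survives a server reboot. Starting the observer with an explicit log file and observer state file keeps a trace of its decisions on disk (the file paths below are hypothetical examples, not the customer's actual paths):

  $ dgmgrl -logfile /home/oracle/observer.log sys@jsread
  DGMGRL> START OBSERVER FILE='/home/oracle/fsfo.dat';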

I tried starting the observer and restarting the failed primary, hoping the observer would automatically convert the failed primary back into a standby. But trouble appeared the moment I issued START OBSERVER:


  DGMGRL> start observer
  Error: ORA-16647: could not start more than one observer
  Failed.
  DGMGRL> stop observer
  Error:
  ORA-01034: ORACLE not available
  Process ID: 0
  Session ID: 0 Serial number: 0

  Observer started
  [W000 12/15 05:30:13.80] Observer started.
  Observer stopped
  Error:
  ORA-01034: ORACLE not available
  Process ID: 0
  Session ID: 0 Serial number: 0
  [W000 12/15 05:32:05.86] Failed to start the Observer.
  Error: ORA-16647: could not start more than one observer
  [W000 12/15 05:33:40.06] Failed to start the Observer.

The observer could neither be stopped nor started; presumably the server reboot had left its state inconsistent. The situation was now more complicated: the switch reboot had cut the network and triggered the fast-start failover, and before the failover bookkeeping could fully finish, the observer's server was itself rebooted, leaving the Data Guard configuration in a bad state.

With the observer unable to reinstate the failed primary automatically, it had to be recovered by hand. The new primary's alert log confirmed that the failover itself had completed, so all that remained was to restore the failed primary. How? There were several options: rebuild the standby from scratch, use Flashback Database, or use the broker's REINSTATE command.

Since REINSTATE uses flashback under the hood anyway, the simplest route was just to use REINSTATE:
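
A note of caution before running REINSTATE: because it is driven by Flashback Database, it can only succeed if flashback logging was enabled on the failed primary before the failover. This is easy to verify once the failed database is mounted; if it reports NO, the remaining option is to rebuild the standby, e.g. from an RMAN backup:

  SQL> SELECT flashback_on FROM v$database;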


  DGMGRL> reinstate database jsrw;
  Reinstating database "jsrw", please wait...
  Operation requires shutdown of instance "jsrw" on database "jsrw"
  Shutting down instance "jsrw"...
  ORA-01109: database not open
  Database dismounted.
  ORACLE instance shut down.
  Operation requires startup of instance "jsrw" on database "jsrw"
  Starting instance "jsrw"...
  ORACLE instance started.
  Database mounted.
  Continuing to reinstate database "jsrw" ...
  Reinstatement of database "jsrw" succeeded

Check the status again:

  DGMGRL> show configuration verbose;


    Configuration - jsconf


      Protection Mode: MaxAvailability
      Databases:
        jsread - Primary database
          Error: ORA-16820: fast-start failover observer is no longer observing this database


        jsrw   - (*) Physical standby database
          Error: ORA-16820: fast-start failover observer is no longer observing this database


      (*) Fast-Start Failover target


      Properties:
        FastStartFailoverThreshold      = '30'
        OperationTimeout                = '30'
        FastStartFailoverLagLimit       = '30'
        CommunicationTimeout            = '180'
        ObserverReconnect               = '0'
        FastStartFailoverAutoReinstate  = 'TRUE'
        FastStartFailoverPmyShutdown    = 'TRUE'
        BystandersFollowRoleChange      = 'ALL'
        ObserverOverride                = 'FALSE'
        ExternalDestination1            = ''
        ExternalDestination2            = ''
        PrimaryLostWriteAction          = 'CONTINUE'


    Fast-Start Failover: ENABLED


      Threshold:          30 seconds
      Target:             jsrw
      Observer:           CAHW50
      Lag Limit:          30 seconds (not in use)
      Shutdown Primary:   TRUE
      Auto-reinstate:     TRUE
      Observer Reconnect: (none)
      Observer Override:  FALSE


    Configuration Status:
    ERROR

The reinstatement clearly succeeded. At this point STOP OBSERVER and START OBSERVER both worked again without errors; to be on the safe side, the observer was stopped and restarted anyway:

  DGMGRL> stop observer;
  Done.
  DGMGRL> start observer;
  Observer started

This was plainly a false-alarm failover caused by the switch reboot; no failover was actually needed, and the high-availability setup ended up causing trouble of its own. We later found a window to switch back to the original primary, and automatic failover has not been re-enabled since: a manual failover is more flexible, and the customer's application is not flexible enough to follow an automatic one. The application was in any case completely down during this incident, because the two rebooted servers also hosted the Redis message brokers.
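
For anyone who reaches the same conclusion, that automatic failover causes more trouble than it prevents in their environment, it can be switched off through the broker while leaving the rest of the Data Guard configuration intact:

  DGMGRL> DISABLE FAST_START FAILOVER;
  Disabled.
  DGMGRL> STOP OBSERVER;
  Done.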


From "ITPUB Blog". Link: http://blog.itpub.net/26838672/viewspace-2075586/. Please credit the source when reposting; unauthorized reproduction may be pursued legally.
