MAA switchover失敗經歷分享

Allen2312發表於2009-01-04
    一直想把前段時間做的MAA switchover失敗的經歷總結下,結果趕上一個專案實施,再加上元旦放假,所以都沒來得及總結!今天總算騰出時間來了,必須總結出來!
    應客戶要求要對已經上線的一套MAA系統做一次switchover測試,由於是上線系統,所以必須晚上+週末來做,可是苦了我了~~~
    依然按慣例介紹系統環境:
    4節點RAC+2節點Physical Standby,DB release: Oracle 10.2.0.3.0,採用real time apply方式同步physical standby database,沒有安裝Broker
    參考文件:metalink Note:751600.1(2008年12月4號剛剛更新的一篇文件,以下用文件代替)
    swtichover過程:
    1、預先檢查
     根據文件建議,先要檢查整個環境,這裡就不多寫了,省得大家看的累,我放到附件裡了,感興趣的下載看吧!
     我們的環境沒有下載最新的patch bundle,我看了看就patch 6081547有些關係,其他應該沒事,又加上上網不方便,所以抱著僥倖心理就沒有去安裝patch
    2、Pre-switchover檢查
    這裡分成兩部分,一部分是使用broker管理的,一種是沒有使用broker管理的
    我是沒用過broker,就知道是Oracle用來圖形化管理Data guard的工具,這種switchover我也更傾向於使用命令方式,這樣能更好的定位錯誤,便於解決。
    a、Verify Managed Recovery is Running (non-broker) on the standby
    SQL> select process from v$managed_standby where process like 'MRP%';
    檢查是否還有MRP在執行,我們的結果是0;
    b、Cancel apply delay for the target standby using SQL(如果有延時應用引數,取消之)
    On the Standby capture the current value
    select delay_mins from v$managed_standby where process = 'MRP0';
    On the standby turn off delay if  > 0
    SQL> RECOVER MANAGED STANDBY DATABASE CANCEL
    SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE NODELAY USING CURRENT LOGFILE DISCONNECT FROM SESSION;
    我們這裡沒有設定delay引數,跳過之
    c、Verify there are no large GAPS (檢查是否存在gaps,我覺得這裡有點問題)
    文件步驟如下:
1.      Identify the current sequence number for each thread

SQL> SELECT THREAD#, SEQUENCE# FROM V$THREAD;

2.      Verify  the standby has applied up to, but not including the logs from the primary query.  On the standby the following query should be no more than 1-2 less than the primary query result.

SQL> SELECT THREAD#, MAX(SEQUENCE#) FROM V$ARCHIVED_LOG val, V$DATABASE vdb
     WHERE APPLIED = 'YES' AND val.RESETLOGS_CHANGE#=vdb.RESETLOGS_CHANGE# GROUP BY THREAD#;

3.      If large gaps exist (more than 3 logs) then consult the “Oracle Data Guard Concepts and Administration, 10g Release 2 (10.2) guide”: Section 5.8 “Managing Archive Gaps” and Section 12.11 “Resolving Archive Gaps Manually”.  If the gap is not resolved by Data Guard automatically then consult, “Manually Determining and Resolving Archive Gaps”.
我們這裡剛剛最大的是3個,我記得,所以我想可以忽略,但是下面一步就又提出新問題
4.      If a large redo apply lag (greater than 2 logs) persists then review the MAA best practice paper, “Data Guard Redo Apply & Media Recovery” and also consult the “Oracle Data Guard Concepts and Administration 11g Release 1 (11.1)  guide” to monitor in more detail, 9.5 Monitoring Primary, Physical Standby.
我當時把這裡看成和第3步一樣的意思了,實際上這裡說的是redo apply lag,所以我決定去做recover,但是這裡失敗了,報錯如下:
alter database recover automatic standby database
Sat Dec 13 21:57:51 2008
Media Recovery Start
Sat Dec 13 21:57:51 2008
Managed Standby Recovery not using Real Time Apply
Sat Dec 13 21:57:52 2008
 parallel recovery started with 7 processes
Sat Dec 13 21:57:52 2008
Media Recovery Log +DG1/standby/1_1163_664662264.dbf
Errors with log +DG1/standby/1_1163_664662264.dbf
ORA-279 signalled during: alter database recover automatic standby database...
這個1_1163_664662264.dbf還沒有產生,應該是找不到的,估計是在主庫的redo log日誌組裡面,當時沒有考慮到這個問題,就又想讓standby自動apply試試,就去執行:
alter database recover managed standby database using current logfile disconnect
最後是完成了,但是中間報錯:
Managed Standby Recovery starting Real Time Apply
Sat Dec 13 22:02:10 2008
MRP0: Some datafile enqueues are still held! Retry recovery...
Sat Dec 13 22:02:10 2008
Errors in file /home/oracle/admin/standby/bdump/standby1_mrp0_30118.trc:
ORA-01124: cannot recover data file 1 - file is in use or recovery
ORA-01110: data file 1: '+DG1/standby/datafile/system.3984.665174909'
Sat Dec 13 22:02:10 2008
Managed Standby Recovery not using Real Time Apply
Sat Dec 13 22:02:10 2008
Errors in file /home/oracle/admin/standby/bdump/standby1_mrp0_30118.trc:
ORA-01124: cannot recover data file 1 - file is in use or recovery
ORA-01110: data file 1: '+DG1/standby/datafile/system.3984.665174909'
Sat Dec 13 22:02:30 2008
MRP0: Background Media Recovery terminated with error 1153
Sat Dec 13 22:02:30 2008
Errors in file /home/oracle/admin/standby/bdump/standby1_mrp0_30118.trc:
ORA-01153: an incompatible media recovery is active
Sat Dec 13 22:02:30 2008
Errors in file /home/oracle/admin/standby/bdump/standby1_mrp0_30118.trc:
ORA-01153: an incompatible media recovery is active
Sat Dec 13 22:02:30 2008
MRP0: Background Media Recovery process shutdown (standby1)
Sat Dec 13 22:02:30 2008
Completed: alter database recover managed standby database using current logfile disconnect
Sat Dec 13 22:09:21 2008
alter database recover managed standby database noparallel using current logfile disconnect
Sat Dec 13 22:09:21 2008
Attempt to start background Managed Standby Recovery process (standby1)
MRP0 started with pid=41, OS id=1271
Sat Dec 13 22:09:21 2008
MRP0: Background Managed Standby Recovery process started (standby1)
Sat Dec 13 22:09:26 2008
Managed Standby Recovery starting Real Time Apply
Sat Dec 13 22:09:26 2008
MRP0: Some datafile enqueues are still held! Retry recovery...
最後shutdown standby DB,然後重新嘗試recover,依然和上面一樣錯誤,最後只好使用real time apply方式自動去應用歸檔,然後不再報錯,就沒再做什麼別的操作,最後失敗我現在分析就跟這步有關!但是現在又說不太明白,哪位有相關經驗,幫我指點下,謝謝啊!

    d、Use “THROUGH ALL SWITCHOVER” on Bystander Standbys(如果有多個standby database,這步需要關注,如果我沒理解錯的話這步也沒什麼問題)
    e、Verify Primary and Standby TEMP Files Match(檢查主庫和備庫temp files是否匹配)
    這步在我們的環境是不匹配的,因為是rman複製方式建立的備庫,所以備庫沒有temp files,文件給出的建議是:
    If the queries do not match then you can correct the mismatch now or immediately after the open of the new primary.

·        To correct now: add or delete a tempfile now requires managed recovery to be stopped and the standby to be open read only.  Opening the standby read-only will require a database close and open before becoming the new primary, see “Open the new primary database”.  

·        To correct post-primary-open: see “Correct any tempfile mismatch” step of Switchover
     我選擇的是第二種方法: create temp tablespace immediately after the open of the new primary,這裡應該也不是錯誤的關鍵!(個人感覺temp files沒有那麼大影響)
     f、Verify that there is no issue with V$LOG_HISTORY on the Standby(確認v$log_history中resetlogs的sequence#和v$archived_log中的sequence#的匹配程度,不能超過3),詳細步驟說明:
     6010833 , 10.2.0.3 patch available on Linux 32-bit, this is included in the 6081547 patch bundle listed above under “Apply Latest Patch Bundle”)(這裡還特別提到了6081547這個patch,metalink中該patch未公開,但是能找到6081547 Tracking Bug for Inventory of Data Guard Physical  Standby for 10.2.0.3 includes Bug 5399901 (archived logs can take a long time to search [control file]) > fixes Bug 5399901的說法,懷疑後面失敗跟這個可能也有點關係


1.Get the Primary RESETLOGS_CHANGE#.  On the primary execute:
SQL> select RESETLOGS_CHANGE# from V$DATABASE;

2.On the standby get the maximum sequence numbers for each thread from V$LOG_HISTORY:
SQL> select thread#, max(sequence#) from V$LOG_HISTORY where resetlogs_change#=< resetlogs_change# from the primary V$DATABASE.RESETLOGS_CHANGE# >  group by thread#;

3.Get current sequence numbers on standby:
SQL> select thread#, max(sequence#) from v$archived_log alog,  v$database db where alog.resetlogs_change#=db.resetlogs_change# group by thread#;

4.The last sequence# for each thread# from V$LOG_HISTORY should be close (the difference in log sequences < 3) to the last sequence# for each thread# from V$ARCHIVED_LOG.

5.If there is an issue with V$LOG_HISTORY then recreate the standby controlfile.  See Note 459411.1.  If backups are being done on the standby without an RMAN Catalog then backup history will be lost.  It is highly recommended to use an RMAN Catalog for all backups.
這裡我們檢查結果都符合條件,就沒有做第5步的操作,時間有點長,記不清了都有點!

     g、Verify no old partial Standby Redo Logs on the Standby
(bug 7159505, fixed in 10.2.0.5 and 11.1.0.7; 10.2.0.3 patch available on Solaris Sparc64 and can be requested for other platforms.  This patch conflicts with the 6081547 patch bundle and would require a patch merge request if you want to apply this on top of the 6081547 patch bundle.)

1.  Identify any active standby redo logs (SRL’s)
SQL> SELECT GROUP#, THREAD#, SEQUENCE# FROM V$STANDBY_LOG WHERE STATUS = 'ACTIVE' ORDER BY THREAD#,SEQUENCE#;

2.      Identify maximum applied sequence number(s).
SQL> select thread#, max(sequence#) from V$LOG_HISTORY where resetlogs_change#=< resetlogs_change# from the primary V$DATABASE.RESETLOGS_CHANGE# >  group by thread#;

3.      If there are any active SRL's that have a thread#/sequence# less than the thread#/sequence# returned from the V$LOG_HISTORY (meaning the recovery has progressed beyond the active SRL) query then clear them.

SQL> RECOVER MANAGED STANDBY DATABASE CANCEL

SQL> ALTER DATABASE CLEAR LOGFILE GROUP ;
這裡檢查的時候,我已經把其他節點都關閉了,所以是不是這裡也出現問題,現在都沒法檢查了,不過我當時覺得是沒關係的,現在又沒環境了!
      3、Switchover(到了最關鍵的步驟了,其實前面準備過程最重要)
      a、關閉執行的jobs
      b、Shutdown all mid-tiers (optional,如果有的話)
      c、Turn on Data Guard tracing on primary and standby(開啟DG trace)
      d、Tail Primary and Standby alert logs on all instances(輸出logs)
      e、Create Guaranteed Restore Points (optional,如果有環境的話)
      f、Verify that the primary database can be switched to the standby role(檢查主庫是否具備switch條件)
     SQL> SELECT SWITCHOVER_STATUS FROM V$DATABASE;
SWITCHOVER_STATUS
-----------------
TO STANDBY
     
TO STANDBY or SESSIONS ACTIVE狀態是可以switch的
       g、If RAC, then shutdown all secondary primary instance(關閉其他次主節點)
       h、Switchover the primary to a standby database
 SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO STANDBY WITH SESSION SHUTDOWN;
1.      If an ORA-16139 is encountered, as long as V$DATABASE.DATABASE_ROLE=’PHYSICAL STANDBY’, then you can proceed.  A common case where this can occur is when there are a large number of data files, greater than 1,000, the apply of the EOR log  will timeout..  Once managed recovery is started on the new standby it will recover. 
2.      If the role was not changed then you need to cancel the switchover and review the alert logs and trace files further.
這一步操作成功了,日誌太多就不貼出來了!
       i、 If the standby is a RAC configuration, then shutdown all secondary standby instances(關閉其他次備庫)
       j、Verify that the standby database can be switched to the primary role(檢查備庫是否具備switch到主庫的條件,passed)
SQL> SELECT SWITCHOVER_STATUS FROM V$DATABASE;
SWITCHOVER_STATUS
----------------
TO PRIMARY
TO PRIMARY or SESSIONS ACTIVE 模式是可以switch的
       k、Check if the standby has ever been open read-only (passed)
1.On the standby run this query:
SQL> SELECT VALUE FROM V$DATAGUARD_STATS WHERE NAME='standby has been open';
2.      If the standby was open read-only then restart the standby
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT
       l、Switchover the standby database to a primary(失敗了)
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
看日誌看的頭暈了~~~~~還真沒那麼容易看來!

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/122290/viewspace-526608/,如需轉載,請註明出處,否則將追究法律責任。

相關文章