log_archive_dest_n引數中reopen、max_failure、net_timeout三個屬性的組合效果測試

oliseh發表於2015-09-14
log_archive_dest_n引數中有三個屬性都是用來指導在目標歸檔路徑故障的情況下,歸檔過程應如何繼續下去,這三個屬性是:


reopen:能用於DG或者非DG的環境下,在非DG環境下當出現歸檔目錄滿或者因許可權問題無法正常寫入時,在DG環境下當prmy與stdby間的網路中斷時,用來控制隔多長時間(以秒為單位)進行下一次的嘗試


max_failure:能用於DG或者非DG的環境下,當歸檔路徑的寫入失敗次數超過指定次數時,就不會再次嘗試繼續歸檔到這個路徑下了


net_timeout:僅能用在保護模式為maximum availability的DG環境下,正常情況下prmy產生的redo透過LNS程式傳輸給Stdby的RFS程式,RFS程式接收到redo後回給prmy的LNS一個acknowledge,LNS進一步通知LGWR commit可以繼續;當prmy與stdby間的網路中斷時,LGWR如果在net_timeout定義的時間內無法收到LNS的通知,也能讓commit繼續下去。


在實際使用過程中往往需要對這三個引數進行組合應用才能達到應有的效果,且在DG和非DG環境下略有差別。



>>>>>>>>> 第一部分:非DG環境下reopen+max_failures三個引數的組合效果測試 <<<<<<<<<
///////////////////////////////////////////////////////////////////////
//////第一組測試:僅一個archive destination,設定了reopen=10 max_failure=3
///////////////////////////////////////////////////////////////////////
###archive dest當前設定
col dest_name format a30
col destination format a50
set linesize 150
select dest_name,destination,status,binding,reopen_secs,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    VALID     OPTIONAL           10           3


修改/oradata06/arch目錄許可權為root.system
ls -ld /oradata06/arch 


chown root.system /oradata06/arch 


ls -ld /oradata06/arch 
drwxr-xr-x    2 root     system           96 Sep 10 09:26 /oradata06/arch


###switch logfile
alter system switch logfile;


select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                      STATUS    BINDING   REOPEN_SECS MAX_FAILURE FAIL_DATE         FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4             ERROR     OPTIONAL           10           3 20150910 09:28:01             1


###連續執行三次switch logfile,觀察failure_count增長情況
alter system switch logfile;   --09:31:36 執行
alter system switch logfile;   --09:31:36 執行
alter system switch logfile;   --09:32:32 執行


***但是直到09:33:48 failure_count 才從1增長為2,並沒有按照reopen=10的設定每隔10秒鐘探測一次歸檔目錄是否可用
select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                      STATUS    BINDING   REOPEN_SECS MAX_FAILURE FAIL_DATE         FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4             ERROR     OPTIONAL           10           3 20150910 09:33:48             2


###再次執行switch logfile,但是直到09:38:48 failure_count才從2增長為3
alter system switch logfile;   --09:36:44


***但是直到09:33:48 failure_count 才從2增長為3,也遠超過了reopen=10的設定
SQL> select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';


DEST_NAME                      STATUS    BINDING   REOPEN_SECS MAX_FAILURE FAIL_DATE         FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4             ERROR     OPTIONAL           10           3 20150910 09:38:48             3


從上面的記錄的fail_date可以看出每次探測距離上一次失敗的間隔是5分鐘


###因為LOG_ARCHIVE_DEST_4是唯一個可用的歸檔目錄,所以儘管Failure_count達到了max_failure定義的上限3,ARCH程式依然不斷嘗試該路徑下的歸檔是否可用
alter system switch logfile;   --09:51:50再次執行switch logfile


***alert.log裡於09:53:49報出無法歸檔的錯誤
Thu Sep 10 09:53:49 2015
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance tstdb1_stdby1 - Archival Error
ORA-16014: log 3 sequence# 305 not archived, no available destinations
ORA-00312: online log 3 thread 1: '/oradata06/teststdby/testaaaaa/redo03a.log'
ORA-00312: online log 3 thread 1: '/oradata06/teststdby/testaaaaa/redo03b.log'


***但v$archive_dest裡的failure_count還是停留在3,因為其已經到了max_failure定義的最大值,所以不再更新
select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';


DEST_NAME                      STATUS    BINDING   REOPEN_SECS MAX_FAILURE FAIL_DATE         FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4             ERROR     OPTIONAL           10           3 20150910 09:38:48             3


###直到我們重新將failure_count置為0,FAILURE_COUNT才會繼續計數
SQL> alter system set log_archive_dest_4='location=/oradata06/arch reopen=10 max_failure=3' scope=both;  <---重新設定log_archive_Dest_n就能將failure_count復位成0


System altered.


SQL> select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';


DEST_NAME                      STATUS    BINDING   REOPEN_SECS MAX_FAILURE FAIL_DATE         FAILURE_COUNT
------------------------------ --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4             VALID     OPTIONAL           10           3                               0


alter system switch logfile;
alter system switch logfile;
alter system switch logfile;


SQL> select dest_name,status,binding,reopen_secs,max_failure,fail_date,failure_count from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';


DEST_NAME                                          STATUS    BINDING   REOPEN_SECS MAX_FAILURE FAIL_DATE         FAILURE_COUNT
-------------------------------------------------- --------- --------- ----------- ----------- ----------------- -------------
LOG_ARCHIVE_DEST_4                                 ERROR     OPTIONAL           10           3 20150910 10:07:07             3


reopen time並沒有向文件上描述的那樣精確的控制兩次探測archive dest間的時間間隔,對於mandatory的archive dest或者僅有的一個optional的archive dest,達到failure_count後,依然會被ARCH程式探測


///////////////////////////////////////////////////////////////////////
//////第二組測試:有兩個archive destination,其中一個是mandatory,一個是optional,兩個destination均設定reopen=20 max_failure=5;
//////故意製造optional的destination無法寫入的場景

///////////////////////////////////////////////////////////////////////
alter system set log_archive_dest_4='location=/oradata06/arch mandatory reopen=20 max_failure=5' scope=both;
alter system set log_archive_dest_5='location=/oradata06/arch1 optional reopen=20 max_failure=5' scope=both;


alter system switch logfile;


select dest_name,destination,status,binding,reopen_secs,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';


DEST_NAME            DESTINATION                    STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
-------------------- ------------------------------ --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4   /oradata06/arch                VALID     MANDATORY          10             0                             3
LOG_ARCHIVE_DEST_5   /oradata06/arch1               VALID     OPTIONAL           10             0                             3


SQL> show parameter log_archive_min_succeed_dest


NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_min_succeed_dest         integer     1


###修改/oradata06/arch1目錄許可權為root.system
root@qcp570717a:/oradata06/arch1>chown root.system /oradata06/arch1
root@qcp570717a:/oradata06/arch1>ls -ld /oradata06/arch1           
drwxr-xr-x    2 root     system           96 Sep 10 10:56 /oradata06/arch1


###連續的執行switch logfile,邊執行邊觀察v$archive_dest檢視的變化情況
***session 1:以每秒鐘1次的頻率執行switch logfile
alter system switch logfile;
。。。。。 <----執行n遍


***session 2:觀察v$archive_dest的重新整理情況
select dest_name,destination,status,binding,reopen_secs,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    VALID     MANDATORY          20             0                             5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   ERROR     OPTIONAL           20             1 20150910 12:59:09           5


DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    VALID     MANDATORY          20             0                             5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   ERROR     OPTIONAL           20             2 20150910 12:59:29           5


DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    VALID     MANDATORY          20             0                             5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   ERROR     OPTIONAL           20             3 20150910 12:59:51           5


DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    VALID     MANDATORY          20             0                             5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   ERROR     OPTIONAL           20             4 20150910 13:00:08           5


DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    VALID     MANDATORY          20             0                             5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   ERROR     OPTIONAL           20             5 20150910 13:00:28           5


DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    VALID     MANDATORY          20             0                             5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   ERROR     OPTIONAL           20             5 20150910 13:00:28           5


DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    VALID     MANDATORY          20             0                             5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   ERROR     OPTIONAL           20             5 20150910 13:00:28           5


以上輸出清楚的看出,對於optional的archive dest的確能按照reopen和max_failure引數的指定值,儘管我們以每秒1次的頻率執行"alter system switch logfile",但ARCH程式基本以每20秒1次的頻率探測LOG_ARCHIVE_DEST_5是否可用,直到failure_count達到max_failure所指定的5次,便停止探測。
如果此時LOG_ARCHIVE_DEST_5對應的/oradata06/arch1目錄寫許可權恢復,必須人工重設log_archive_dest_5引數使得ARCH程式重新開始對歸檔目錄的探測: alter system set log_archive_dest_5='location=/oradata06/arch1 optional reopen=20 max_failure=5' scope=both;
當ARCH發現log_archive_dest_5對應的/oradata06/arch1目錄能夠正常寫入了,後面的歸檔就能放到該目錄下了

///////////////////////////////////////////////////////////////////////
////第三組測試:有兩個archive destination,其中一個是mandatory,一個是optional,兩個destination均設定reopen=20 max_failure=5;
////故意製造mandatory的destination無法寫入的場景

///////////////////////////////////////////////////////////////////////
alter system set log_archive_dest_4='location=/oradata06/arch mandatory reopen=20 max_failure=5' scope=both;
alter system set log_archive_dest_5='location=/oradata06/arch1 optional reopen=20 max_failure=5' scope=both;


alter system switch logfile;


ls -ld /oradata06/arch
drwxr-xr-x    2 oracle   oinstall       8192 Sep 10 13:43 /oradata06/arch


chown root.system /oradata06/arch


ls -ld /oradata06/arch
drwxr-xr-x    2 root     system         8192 Sep 10 13:43 /oradata06/arch


##每10秒鐘執行1次switch logfile,共執行10次
alter system switch logfile;
。。。 <---共執行10次


###觀察v$archive_dest中關於LOG_ARCHIVE_DEST_4的fail_date按照5分鐘1次的頻率進行更新,並不受reopen=20的約束
select dest_name,destination,status,binding,reopen_secs,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    ERROR     MANDATORY          20             3 20150910 13:50:59           5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   VALID     OPTIONAL           20             0                             5


DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    ERROR     MANDATORY          20             3 20150910 13:50:59           5
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   VALID     OPTIONAL           20             0                             5


。。。。省略了重複的輸出


DEST_NAME                      DESTINATION                                        STATUS    BINDING   REOPEN_SECS FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------------ -------------------------------------------------- --------- --------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_4             /oradata06/arch                                    ERROR     MANDATORY          20             4 20150910 13:55:59           5       <---這一次更新距離上一次是5分鐘
LOG_ARCHIVE_DEST_5             /oradata06/arch1                                   VALID     OPTIONAL           20             0                             5


透過上面三組測試,得出如下結論:
對於mandatory的archive dest,或者僅有的一個optional的archive dest,reopen與max_failure引數並沒有起到文件中所描述的作用,而是按照每5分鐘一次的間隔永久的嘗試進行歸檔;
reopen與max_failure的作用對於optional的archive dest,完全遵循文件上的解釋


>>>>>>>>>  第二部分:DG環境下net_timeout+reopen+max_failures三個引數的組合效果測試 <<<<<<<<<
測試的DG處於maximum availability保護模式下,測試中製造網路中斷來模擬prmy與stdby間無法通訊的場景
///////////////////////////////////////////////////////////////////////
//////第一組測試:net_timeout + reopen的效果測試(在net_timeout時間內網路沒有恢復正常)
///////////////////////////////////////////////////////////////////////
***prmy設定reopen、net_timeout
SQL> select db_unique_name,database_role,protection_mode,protection_level,open_mode,SWITCHOVER_STATUS from v$database;


DB_UNIQUE_NAME  DATABASE_ROLE    PROTECTION_MODE      PROTECTION_LEVEL     OPEN_MODE            SWITCHOVER_STATUS
--------------- ---------------- -------------------- -------------------- -------------------- --------------------
tstdb1_stdby1   PRIMARY          MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY READ WRITE           TO STANDBY


alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;


 DEST_NAME                 DESTINATION                              STATUS   BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID    OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID    OPTIONAL           10          40             0                             0


***故意製造prmy與stdby網路中斷,中斷後立即在prmy上執行update
select sysdate from dual;    <---網路中斷時間,同時也可以近似認為是下面發起commit的時間
SYSDATE
-----------------
20150910 16:54:56


update scott.t0908_1 set username='FFF' where username='EEE';


SQL> commit;  <--- Hang住


Commit complete.


SQL> select sysdate from dual;   <---commit 完成時間
SYSDATE
-----------------
20150910 16:55:37


可以看出Hang住的時間大約是41秒,符合net_timeout=40的設定


***這時prmy側的nss程式尚未重啟
ps -ef|grep nss
    root  4457152  3801892   0 16:55:25  pts/8  0:00 grep nss
  oracle  4326204        1   0 16:46:57      -  0:00 ora_nss2_tstdb1_stdby1


###觀察lgwr的.trc檔案能清晰的看出網路中斷後Redo Tranport Service的應對過程:
*** 2015-09-10 16:55:02.486                                <---網路中斷後LGWR首次長時間未收到NSS的響應
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-10 16:55:12.507
LGWR found NSS2 alive.. waiting for msg           <---10秒後,LGWR仍未收到NSS的響應,這個10秒就是由reopen引數定義的


*** 2015-09-10 16:55:22.528
LGWR found NSS2 alive.. waiting for msg           <---又過了10秒,LGWR仍未收到NSS的響應


*** 2015-09-10 16:55:32.000                                <---又過了10秒,LGWR仍未收到NSS的響應,這時總的超時時間已經達到了net_timeout所定義的40秒
Current time <09/10/2015 16:55:32> exceeds begin time <09/10/2015 16:54:52> by  <40> seconds. Return timedout error
*** 2015-09-10 16:55:32.000 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
*** 2015-09-10 16:55:32.000 4320 krsh.c
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
*** 2015-09-10 16:55:32.000 4320 krsh.c
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned     <----放棄reconnect
ERROR: Receive timedout!
*** 2015-09-10 16:55:32.000 4320 krsh.c
Error 16198 for archive log file 1 to 'tstdb1'
LGWR had received a timeout previously. return timeout again
Error 16198 detaching RFS from standby instance at host 'tstdb1'     <---中斷stdby上的RFS與prmy上的NSS之間的連線
*** 2015-09-10 16:55:32.001 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/10/2015 16:54:52> and NET_TIMEOUT <40> seconds  <---網路中斷的精確時間是16:54:52
NetServer pid:4326204


*** 2015-09-10 16:55:36.001
*** 2015-09-10 16:55:36.001 8830 krsu.c
Cleaning up NSS2 [pid 4326204] because it failed to enter 'alive after detach' state
*** 2015-09-10 16:55:36.001 978 krsw.c
NSS2 [pid 4326204] receiving termination signal..   <---NSS程式被kill
  .... killed successfully


*** 2015-09-10 16:55:37.001
 .. pmon posted for async lns cleanup
*** 2015-09-10 16:55:37.001 7080 krsu.c
  upidhs done status 0
krsl_reap_io: ignoring detach error 16198
*** 2015-09-10 16:55:37.001 2932 krsi.c             <----LOG_ARCHIVE_DEST_2對應的destination被置為了failed
krsi_dst_fail: dest:2 err:16198 force:0 blast:1


*** 2015-09-10 16:55:37.015                        
Warning: log write elapsed time 44541ms, size 0KB   <---44s才完成寫入


*** 2015-09-10 16:55:37.151
*** 2015-09-10 16:55:37.151 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.


###從v$archive_dest可以看出LOG_ARCHIVE_DEST_2被置為fail的時間是16:55:37,與上面lgwr日誌反映出來的時間一致
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';


DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   ERROR     OPTIONAL           10          40             1 20150910 16:55:37           0


***路由恢復
SQL> SQL> select sysdate from dual;    <---17:00:57路由恢復


SYSDATE
-----------------
20150910 17:00:57


***路由恢復後等待了近兩分鐘LOG_ARCHIVE_DEST_2 並沒有變為VALID,需要有Transaction執行來觸發一下LGWR
SQL> update scott.t0908_1 set username='GGG' where username='FFF';


1 row updated.


SQL> commit;


Commit complete.


***隨即LGWR程式日誌有了輸出:
*** 2015-09-10 17:02:52.224
NSS2 is not running anymore. 
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/10/2015 17:02:52>
*** 2015-09-10 17:02:52.224 7723 krsu.c
Starting NSS2 ...
Waiting for NSS2 [pid 5112496] to initialize itself        <---重啟NSS程式


*** 2015-09-10 17:02:55.301
*** 2015-09-10 17:02:55.301 7890 krsu.c
Netserver NSS2 [pid 5112496] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 5112496] for dest tstdb1 mode SYNC ocis=0x11169e420          <----LOG_ARCHIVE_DEST_2對應的destination重新恢復成VALID狀態


*** 2015-09-10 17:02:55.511
*** 2015-09-10 17:02:55.511 3561 krsu.c
   upiahm connect done status is 0


*** 2015-09-10 17:02:58.681
*** 2015-09-10 17:02:58.681 7119 krsu.c
Making upinblc request to NSS2 (ocis 0x11169e420). Begin time is <09/10/2015 17:02:58> and NET_TIMEOUT is <40> seconds
NetServer pid:5112496


###v$archive_dest裡的failure_count、fail_date清零,以下過程可以看出先清fail_date,再清failure_count
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   ERROR     OPTIONAL           10          40             1 20150910 16:55:37           0




DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID     OPTIONAL           10          40             1                             0




DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID     OPTIONAL           10          40             1                             0




DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID     OPTIONAL           10          40             1                             0




DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID     OPTIONAL           10          40             1                             0




DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID     OPTIONAL           10          40             1                             0




DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID     OPTIONAL           10          40             0                             0




DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID     OPTIONAL           10          40             0                             0


###重新啟動後的NSS
oracle@qcp570717a:/home/oracle>ps -ef|grep 5112496
  oracle  3998006  2883966   0 17:41:21  pts/7  0:00 grep 5112496
  oracle  5112496        1   0 17:02:52      -  0:00 ora_nss2_tstdb1_stdby1    <---重新啟動後的NSS


///////////////////////////////////////////////////////////////////////
//////第二組測試:net_timeout + reopen的效果測試(在net_timeout時間內網路恢復正常)
///////////////////////////////////////////////////////////////////////
***prmy設定reopen、net_timeout的引數同上
SQL> select db_unique_name,database_role,protection_mode,protection_level,open_mode,SWITCHOVER_STATUS from v$database;


DB_UNIQUE_NAME  DATABASE_ROLE    PROTECTION_MODE      PROTECTION_LEVEL     OPEN_MODE            SWITCHOVER_STATUS
--------------- ---------------- -------------------- -------------------- -------------------- --------------------
tstdb1_stdby1   PRIMARY          MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY READ WRITE           TO STANDBY


alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;


 DEST_NAME                 DESTINATION                              STATUS   BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID    OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID    OPTIONAL           10          40             0                             0


***故意製造prmy與stdby網路中斷,中斷後立即在prmy上執行update
select sysdate from dual;    <---網路中斷時間,同時也可以近似認為是下面發起commit的時間


SYSDATE
-----------------
20150913 05:13:49


update scott.t0908_1 set username='HHH' where username='GGG';


SQL> commit;  <--- Hang住之後,在net_timeout定義的40秒時間內,在05:14:08恢復prmy、stdby間的網路通訊


Commit complete.


SQL> select sysdate from dual;   <---commit 完成時間


SYSDATE
-----------------
20150913 05:14:10


可以看出Hang住的時間大約是21秒,中斷時長 < net_timeout


###這時lgwr的.trc檔案能看出經歷了兩次10秒鐘的判斷後,遠端log write恢復正常,期間沒有發生NSS程式重啟等動作:
*** 2015-09-13 05:13:58.368
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 05:14:08.389
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 05:14:10.667
Warning: log write elapsed time 22304ms, size 0KB


***nss程式還是保持上一次的重啟時間
ps -ef|grep nss
  oracle 14155968  4981746   0 05:16:17  pts/4  0:00 grep nss
  oracle  5964580        1   0 05:10:18      -  0:00 ora_nss2_tstdb1_stdby1


###因為在net_timeout時間內網路恢復,所以LOG_ARCHIVE_DEST_2的狀態一直是VALID,與上面lgwr日誌反映出來的時間一致
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID     OPTIONAL           10          40             0                             0


///////////////////////////////////////////////////////////////////////
//////第三組測試:net_timeout + reopen + max_failure的效果測試(未在net_timeout時間內網路恢復正常)
///////////////////////////////////////////////////////////////////////  
在原來的基礎上加上了max_failure=3
alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 max_failure=3 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;


col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';


DEST_NAME                 DESTINATION                              STATUS   BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID    OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID    OPTIONAL           10          40             0                             3


select sysdate from dual;    <---網路中斷時間,同時也可以近似認為是下面發起commit的時間
SYSDATE
-----------------
20150913 05:47:29


update scott.t0908_1 set username='III' where username='HHH';


SQL> commit;  <--- Hang住


Commit complete.


SQL> select sysdate from dual;   <---commit 完成時間
SYSDATE
-----------------
20150913 05:48:12


###觀察lgwr的.trc檔案能看出網路中斷後Redo Tranport Service的應對過程:經歷了兩次40s的等待,NSS程式經歷被kill->重啟->kill的過程
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 05:47:47.267
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 05:47:57.288
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 05:48:07.008
Current time <09/13/2015 05:48:07> exceeds begin time <09/13/2015 05:47:27> by  <40> seconds. Return timedout error
*** 2015-09-13 05:48:07.008 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
*** 2015-09-13 05:48:07.008 4320 krsh.c
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
*** 2015-09-13 05:48:07.008 4320 krsh.c
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
ERROR: Receive timedout!
*** 2015-09-13 05:48:07.009 4320 krsh.c
Error 16198 for archive log file 6 to 'tstdb1'
LGWR had received a timeout previously. return timeout again
Error 16198 detaching RFS from standby instance at host 'tstdb1'
*** 2015-09-13 05:48:07.009 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 05:47:27> and NET_TIMEOUT <40> seconds     <--第一次timeout
NetServer pid:5964580


*** 2015-09-13 05:48:11.009
*** 2015-09-13 05:48:11.009 8830 krsu.c
Cleaning up NSS2 [pid 5964580] because it failed to enter 'alive after detach' state
*** 2015-09-13 05:48:11.009 978 krsw.c
NSS2 [pid 5964580] receiving termination signal..
  .... killed successfully


*** 2015-09-13 05:48:12.009
 .. pmon posted for async lns cleanup
*** 2015-09-13 05:48:12.010 7080 krsu.c
  upidhs done status 0
krsl_reap_io: ignoring detach error 16198
*** 2015-09-13 05:48:12.010 2932 krsi.c
krsi_dst_fail: dest:2 err:16198 force:0 blast:1


*** 2015-09-13 05:48:12.014
Warning: log write elapsed time 44779ms, size 0KB
NSS2 is not running anymore. 
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/13/2015 05:48:12>
*** 2015-09-13 05:48:12.045 7723 krsu.c
Starting NSS2 ...
Waiting for NSS2 [pid 2883910] to initialize itself                                                              <---NSS被重啟


*** 2015-09-13 05:48:15.130
*** 2015-09-13 05:48:15.130 7890 krsu.c
Netserver NSS2 [pid 2883910] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 2883910] for dest tstdb1 mode SYNC ocis=0x11169e420


*** 2015-09-13 05:48:25.160
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 05:48:35.181
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 05:48:45.202
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 05:48:52.000
Current time <09/13/2015 05:48:52> exceeds begin time <09/13/2015 05:48:12> by  <40> seconds. Return timedout error
*** 2015-09-13 05:48:52.000 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
ERROR: timedout
*** 2015-09-13 05:48:52.000 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 05:48:12> and NET_TIMEOUT <40> seconds   <--第二次timeout
NetServer pid:2883910


*** 2015-09-13 05:48:56.000
*** 2015-09-13 05:48:56.000 8830 krsu.c
Cleaning up NSS2 [pid 2883910] because it failed to enter 'alive after detach' state
*** 2015-09-13 05:48:56.000 978 krsw.c
NSS2 [pid 2883910] receiving termination signal..
  .... killed successfully                                               <---NSS被第二次kill


*** 2015-09-13 05:48:57.001
 .. pmon posted for async lns cleanup
*** 2015-09-13 05:48:57.001 7080 krsu.c
  upidhs done status 0
*** 2015-09-13 05:48:57.001 915 krsu.c
Error 16198 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'tstdb1'
*** 2015-09-13 05:48:57.001 6904 krsu.c
No LNS exists (ocis 0x11169e420) that requires a disconnect
Error 16198 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'tstdb1'
*** 2015-09-13 05:48:57.001 4320 krsh.c
Error 16198 for archive log file 2 to 'tstdb1'
*** 2015-09-13 05:48:57.001 6904 krsu.c
No LNS exists (ocis 0x11169e420) that requires a disconnect
*** 2015-09-13 05:48:57.001 2932 krsi.c
krsi_dst_fail: dest:2 err:16198 force:0 blast:1


###從v$archive_dest可以看出LOG_ARCHIVE_DEST_2被置為fail的時間是05:48:57,與上面lgwr日誌反映出來的時間一致,failure_count=1
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   ERROR     OPTIONAL           10          40             1 20150913 05:48:57           3


***再執行若干次switch logfile
alter system switch logfile;
alter system switch logfile;
alter system switch logfile;

同時觀察LGWR程式新的輸出,並沒有重連的資訊:
*** 2015-09-13 05:57:42.850
*** 2015-09-13 05:57:42.850 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.


*** 2015-09-13 05:59:16.776
*** 2015-09-13 05:59:16.776 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.


*** 2015-09-13 05:59:31.318
*** 2015-09-13 05:59:31.318 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.


***v$archive_dest裡的failure_count始終保持1,沒有繼續往上漲
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                 DESTINATION                              STATUS    BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- --------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID     OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   ERROR     OPTIONAL           10          40             1 20150913 05:48:57           3


***06:02:40網路恢復,但是LGWR日誌卻到了06:03:33才開始恢復Redo Transport Service
NSS2 is not running anymore. 


*** 2015-09-13 06:03:33.062
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/13/2015 06:03:33>
*** 2015-09-13 06:03:33.062 7723 krsu.c
Starting NSS2 ...
Waiting for NSS2 [pid 5374506] to initialize itself


*** 2015-09-13 06:03:36.143
*** 2015-09-13 06:03:36.142 7890 krsu.c
Netserver NSS2 [pid 5374506] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 5374506] for dest tstdb1 mode SYNC ocis=0x11169e420


*** 2015-09-13 06:03:36.323
*** 2015-09-13 06:03:36.323 3561 krsu.c
   upiahm connect done status is 0


*** 2015-09-13 06:03:42.508
*** 2015-09-13 06:03:42.508 7119 krsu.c
Making upinblc request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 06:03:42> and NET_TIMEOUT is <40> seconds
NetServer pid:5374506


***failure_count重置為0
col dest_name format a25
col destination format a40
set linesize 180
select dest_name,destination,status,binding,reopen_secs,net_timeout,failure_count,fail_date,max_failure from v$archive_dest where destination is not null and STATUS <> 'DEFERRED' AND STATUS <> 'INACTIVE';
DEST_NAME                 DESTINATION                              STATUS   BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID    OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID    OPTIONAL           10          40             0                             3


至此發現max_failure=3的設定並沒有實際意義,因為failure_count=1之後,只要網路不恢復,NSS程式就不會去連第2、3次,而當網路恢復後,failure_count就置0了,因此failure_count永遠不會達到3


仍然是上面的例子我們把max_failure設定為1,效果就不一樣了
alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 max_failure=1 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;


DEST_NAME                 DESTINATION                              STATUS   BINDING   REOPEN_SECS NET_TIMEOUT FAILURE_COUNT FAIL_DATE         MAX_FAILURE
------------------------- ---------------------------------------- -------- --------- ----------- ----------- ------------- ----------------- -----------
LOG_ARCHIVE_DEST_1        USE_DB_RECOVERY_FILE_DEST                VALID    OPTIONAL          300           0             0                             0
LOG_ARCHIVE_DEST_2        tstdb1                                   VALID    OPTIONAL           10          40             0                             1


這裡我們只透過LGWR的日誌來解讀網路中斷->網路恢復->redo transport service恢復的過程
<<< 網路中斷後的第一次等待>>>
*** 2015-09-13 06:11:12.040
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 06:11:22.061
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 06:11:32.082
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 06:11:42.002
Current time <09/13/2015 06:11:42> exceeds begin time <09/13/2015 06:11:02> by  <40> seconds. Return timedout error
*** 2015-09-13 06:11:42.003 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
*** 2015-09-13 06:11:42.003 4320 krsh.c
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
*** 2015-09-13 06:11:42.003 4320 krsh.c
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
ERROR: Receive timedout!
*** 2015-09-13 06:11:42.003 4320 krsh.c
Error 16198 for archive log file 1 to 'tstdb1'
LGWR had received a timeout previously. return timeout again
Error 16198 detaching RFS from standby instance at host 'tstdb1'
*** 2015-09-13 06:11:42.003 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 06:11:02> and NET_TIMEOUT <40> seconds
NetServer pid:5374506


<<< NSS 第一次被kill >>>
*** 2015-09-13 06:11:46.004
*** 2015-09-13 06:11:46.004 8830 krsu.c
Cleaning up NSS2 [pid 5374506] because it failed to enter 'alive after detach' state
*** 2015-09-13 06:11:46.004 978 krsw.c
NSS2 [pid 5374506] receiving termination signal..
  .... killed successfully


*** 2015-09-13 06:11:47.004
 .. pmon posted for async lns cleanup
*** 2015-09-13 06:11:47.004 7080 krsu.c
  upidhs done status 0
krsl_reap_io: ignoring detach error 16198
*** 2015-09-13 06:11:47.004 2932 krsi.c
krsi_dst_fail: dest:2 err:16198 force:0 blast:1


<<< NSS 重啟 >>>
*** 2015-09-13 06:11:47.010
Warning: log write elapsed time 44986ms, size 17KB
NSS2 is not running anymore. 
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/13/2015 06:11:47>
*** 2015-09-13 06:11:47.040 7723 krsu.c
Starting NSS2 ...
Waiting for NSS2 [pid 2884044] to initialize itself


*** 2015-09-13 06:11:50.118
*** 2015-09-13 06:11:50.118 7890 krsu.c
Netserver NSS2 [pid 2884044] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 2884044] for dest tstdb1 mode SYNC ocis=0x11169e420


<<< 網路中斷後的第二次等待開始:LGWR等NSS >>>
*** 2015-09-13 06:12:00.130
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 06:12:10.146
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 06:12:20.167
LGWR found NSS2 alive.. waiting for msg


*** 2015-09-13 06:12:27.004
Current time <09/13/2015 06:12:27> exceeds begin time <09/13/2015 06:11:47> by  <40> seconds. Return timedout error
*** 2015-09-13 06:12:27.005 4320 krsh.c
ORA-16198: LGWR received timedout error from KSR
ERROR: timedout
*** 2015-09-13 06:12:27.005 6904 krsu.c
Making upidhs request to NSS2 (ocis 0x11169e420). Begin time is <09/13/2015 06:11:47> and NET_TIMEOUT <40> seconds
NetServer pid:2884044


<<< NSS 第二次被kill >>>
*** 2015-09-13 06:12:31.005
*** 2015-09-13 06:12:31.005 8830 krsu.c
Cleaning up NSS2 [pid 2884044] because it failed to enter 'alive after detach' state
*** 2015-09-13 06:12:31.005 978 krsw.c
NSS2 [pid 2884044] receiving termination signal..
  .... killed successfully


<<< 遠端歸檔目錄被置為ERROR >>>
*** 2015-09-13 06:12:32.005
 .. pmon posted for async lns cleanup
*** 2015-09-13 06:12:32.005 7080 krsu.c
  upidhs done status 0
*** 2015-09-13 06:12:32.005 915 krsu.c
Error 16198 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'tstdb1'
*** 2015-09-13 06:12:32.005 6904 krsu.c
No LNS exists (ocis 0x11169e420) that requires a disconnect
Error 16198 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'tstdb1'
*** 2015-09-13 06:12:32.006 4320 krsh.c
Error 16198 for archive log file 4 to 'tstdb1'
*** 2015-09-13 06:12:32.006 6904 krsu.c
No LNS exists (ocis 0x11169e420) that requires a disconnect
*** 2015-09-13 06:12:32.006 2932 krsi.c
krsi_dst_fail: dest:2 err:16198 force:0 blast:1


<<< 網路恢復後在prmy上switch logfile,但redo Transport一直未恢復>>>
*** 2015-09-13 06:13:01.616
*** 2015-09-13 06:13:01.616 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.


*** 2015-09-13 06:14:12.220
*** 2015-09-13 06:14:12.220 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.


*** 2015-09-13 06:16:35.440
*** 2015-09-13 06:16:35.440 2682 krsl.c
No standby database destinations have been configured
as being archived by the LGWR process
This instance will operate at a reduced protection mode until
network connectivity to the standby databases is restored and
all archivelog gaps have been resolved.


<<< 直到手工重新設定了LOG_ARCHIVE_DEST_2: "alter system set log_archive_dest_2='service=tstdb1 SYNC affirm reopen=10 net_timeout=40 max_failure=3 valid_for=(online_logfiles,primary_roles) db_unique_name=tstdb1' scope=both;">>>
*** 2015-09-13 06:17:37.716
NSS2 is not running anymore. 
Waiting for subscriber count on LGWR-NSS2 channel to go to zero
Subscriber count went to zero - time now is <09/13/2015 06:17:37>
*** 2015-09-13 06:17:37.716 7723 krsu.c
Starting NSS2 ...


*** 2015-09-13 06:17:37.769
Waiting for NSS2 [pid 17105102] to initialize itself


*** 2015-09-13 06:17:40.782
*** 2015-09-13 06:17:40.782 7890 krsu.c
Netserver NSS2 [pid 17105102] for mode SYNC has been initialized
Performing a channel reset to ignore previous responses
Successfully started NSS2 [pid 17105102] for dest tstdb1 mode SYNC ocis=0x11169e420
*** 2015-09-13 06:17:41.046 3561 krsu.c
   upiahm connect done status is 0



由此看出在DG環境中網路中斷引起的本地Redo無法傳輸到遠端的情況下,max_failure引數如果設定為>1並沒有實際意義,因為只要網路沒有恢復,LGWR就不會去進行第2次的連線嘗試,等到網路恢復failure_count又會被複位成0。因此只有在max_failure=1的情況下,當出現一次網路中斷後,遠端歸檔路徑才會被禁用,即便之後網路恢復正常了,也需要人工的重置log_archive_dest_n來恢復Redo向遠端的傳輸


來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/53956/viewspace-1798959/,如需轉載,請註明出處,否則將追究法律責任。

相關文章