Analyzing and Resolving the ORA-16198 Database Error

Posted by dawn009 on 2014-10-13
1. First, what the official documentation says about ORA-16198

........................
One possible cause is that net_timeout is set too low. In older versions the default was 10; changing it to 30 is recommended.
……………………………
The net_timeout attribute in the log_archive_dest_2 on the primary is
set too low so that
LNS couldn't finish sending redo block in 10 seconds in this example.
…………………………….
If the error persists with the value set to 30, check disk I/O usage and network transfer conditions.
…………………………..
Note: If NET_TIMEOUT attribute has already been set to 30, and you still get ORA-16198, that means LNS couldn't finish sending redo block in 30 seconds.
The slowness may caused by:
1. Operating System. Please keep track of OS usage (like iostat).
2. Network. Please keep track network flow (like tcpdump).
……………………………
It may also be a bug. The versions confirmed as affected are 11.2.0.1 and 10.2.0.4; upgrading to 11.2.0.2 or later is recommended.
…………………………..
Bug 9259587  Multiple LGWR reconnect attempts in Data Guard MAXIMUM_AVAILABILITY
 This note gives a brief overview bug 9259587. 
Affects:
Product (Component) Oracle Server (Rdbms)
Range of versions believed to be affected Versions BELOW 12.1
Versions confirmed as being affected 11.2.0.1 10.2.0.4
Platforms affected Generic (all / most platforms affected)
Fixed:
This issue is fixed in 12.1 (Future Release) 11.2.0.2 (Server Patch Set)
Symptoms:
Related To:
Hang (Process Spins)
Active Dataguard (ADG)
Physical Standby Database / Dataguard
Description
…………………………………………………
The errors that appear look roughly like the following:
…………………………………………………
Rediscovery Notes:
 Alert log contains messages like:
  ORA-16198: LGWR received timedout error from KSR
  LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
  LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
  Errors in file /app/oracle/diag/rdbms/ora11g_dga/ora11g/trace/ora11g_lgwr_290838.trc:
  ORA-16198: Timeout incurred on internal channel during remote archival
  LGWR: Network asynch I/O wait error 16198 log 2 service 'ora11g_DGb'
  LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'ora11g_DGb'
  Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
  LGWR: Failed to archive log 2 thread 1 sequence 1422 (16198)
…………………………………………………
In a Data Guard configuration using LGWR SYNC transport on one or more LOG_ARCHIVE_DEST_n parameters, and using a protection mode of MAXIMUM_AVAILABILITY, then if the primary database becomes disconnected from the standby database, LGWR continues to attempt to reconnect to the standby database. It should instead avoid attempts to reconnect until an ARCH process has re-established communication with the standby database.
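So what we can establish is this:
The error occurs mainly in Data Guard configurations. The cause is that redo from the primary was not shipped to the standby within the allotted time, or could not be sent to the standby at all. The sections below go through the main causes in turn: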
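Before going through the causes, it helps to know how often the error fires and which destination it names. The following is a minimal sketch, not an Oracle tool: it scans alert-log lines like the ones quoted above, counts the ORA-16198 entries, and tallies the destinations named in LGWR messages that carry the 16198 code. The function name `summarize_ora16198` and the matching rules are assumptions made for illustration; real alert logs may wrap lines differently.

```python
import re
from collections import Counter

# Matches the destination name in LGWR messages such as
# "LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)".
DEST_RE = re.compile(r"LOG_ARCHIVE_DEST_\d+")


def summarize_ora16198(lines):
    """Count ORA-16198 lines and the destinations named in LGWR 16198 messages."""
    ora_count = 0
    dests = Counter()
    for line in lines:
        if "ORA-16198" in line:
            ora_count += 1
        # Only LGWR lines that mention the 16198 code are tallied per destination.
        if "LGWR" in line and "16198" in line:
            for dest in DEST_RE.findall(line):
                dests[dest] += 1
    return ora_count, dict(dests)


# Sample lines taken from the alert-log excerpt quoted in the note above.
sample = [
    "ORA-16198: LGWR received timedout error from KSR",
    "LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)",
    "LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned",
    "ORA-16198: Timeout incurred on internal channel during remote archival",
    "LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'ora11g_DGb'",
]
count, by_dest = summarize_ora16198(sample)
print(count, by_dest)
```

To run it against a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.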


2. Failure caused by a parameter set too low

The NET_TIMEOUT value of LOG_ARCHIVE_DEST_2 may be set too low, so redo cannot be shipped within the allotted time. Setting it to 30 is recommended.
Query NET_TIMEOUT:
SQL> select DEST_NAME,NET_TIMEOUT FROM V$ARCHIVE_DEST;
DEST_NAME            NET_TIMEOUT
-------------------- -----------
LOG_ARCHIVE_DEST_1             0
LOG_ARCHIVE_DEST_2            30
... (remaining output omitted)
Check the LOG_ARCHIVE_DEST_2 parameter:
SQL> show parameter log_archive_dest_2
The value is 'service=orcl_std reopen=120 lgwr sync valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std'
I did not set NET_TIMEOUT explicitly, yet the value in effect is 30, because my version is 11.2.0.3 and 30 is its default.
If your value is not 30, change it along these lines:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='service=orcl_std reopen=120 lgwr sync net_timeout=30 valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std';
Then watch whether the error recurs.
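When there are several destinations to check, the attribute string can be inspected and rewritten with a small script before it is pasted into ALTER SYSTEM. This is only a sketch under stated assumptions: the helper names `get_net_timeout` and `with_net_timeout` are hypothetical, the string is treated as plain text, and a missing attribute is appended at the end on the assumption that attribute order does not matter here.

```python
import re

# Finds an explicit net_timeout=<n> attribute, case-insensitively.
NET_TIMEOUT_RE = re.compile(r"\bnet_timeout\s*=\s*(\d+)", re.IGNORECASE)


def get_net_timeout(dest_value):
    """Return the explicit NET_TIMEOUT in the string, or None if it is not set."""
    match = NET_TIMEOUT_RE.search(dest_value)
    return int(match.group(1)) if match else None


def with_net_timeout(dest_value, seconds=30):
    """Return dest_value with net_timeout set to `seconds`."""
    if NET_TIMEOUT_RE.search(dest_value):
        # Replace the existing value in place.
        return NET_TIMEOUT_RE.sub(f"net_timeout={seconds}", dest_value)
    # Assumption: appending at the end is acceptable when the attribute is absent.
    return f"{dest_value} net_timeout={seconds}"


dest = ("service=orcl_std reopen=120 lgwr sync "
        "valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std")
print(get_net_timeout(dest))   # no explicit setting in this string
print(with_net_timeout(dest))
```

A result of None only means the attribute is absent from the string; the value actually in effect still has to be read from V$ARCHIVE_DEST as shown above.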

3. Failures caused by a congested network, busy storage I/O, or other factors

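If the error comes from network congestion or busy storage, use operating system tools such as tcpdump, iostat, or vmstat to examine the relevant resource usage, or ask the network and storage administrators to help with the analysis.
If none of the above turns up a problem, another possibility is that the SYS password was changed on the primary or the standby alone, without the same change being made on the other side. In that case authentication from the primary to the standby fails, which can also produce this error.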
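As a rough first check before reaching for tcpdump, the time needed to open a TCP connection from the primary host to the standby listener can be measured. The sketch below is an assumption-laden probe, not a substitute for proper network tracing: the function name `tcp_connect_seconds` is hypothetical, and the host and port must be supplied by the reader. It only measures connection setup, not redo transfer throughput.

```python
import socket
import time


def tcp_connect_seconds(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

A call would look like `tcp_connect_seconds("standby-host", 1521)`, where both the host name and the listener port are placeholders. A result of None, or a value that is large compared with NET_TIMEOUT, suggests handing the problem to the network team.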

4. A database bug

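If none of the above solves the problem and you cannot pin down a specific cause, and your database version happens to be 11.2.0.1 or 10.2.0.4, then it is time to upgrade, my friend.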
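The version information in the note for bug 9259587 quoted in section 1 can be turned into a quick check. This is a sketch based only on that excerpt (confirmed on 11.2.0.1 and 10.2.0.4, fixed in 11.2.0.2 and 12.1). The function name `bug_9259587_status` and the three labels are made up for illustration, and the check assumes later patch sets carry the fix forward.

```python
# Versions the quoted note lists as confirmed affected.
CONFIRMED_AFFECTED = {(11, 2, 0, 1), (10, 2, 0, 4)}
# First patch set the quoted note lists as containing the fix.
FIRST_FIXED_PATCH_SET = (11, 2, 0, 2)


def parse_version(text):
    """Turn '11.2.0.3.0' into a tuple of ints: (11, 2, 0, 3, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def bug_9259587_status(version_text):
    """Classify a version string against the ranges given in the bug note."""
    version = parse_version(version_text)[:4]  # compare on the first four fields
    if version in CONFIRMED_AFFECTED:
        return "confirmed affected"
    if version >= FIRST_FIXED_PATCH_SET:
        # Assumption: patch sets at or above 11.2.0.2 include the fix.
        return "fix included"
    # The note says versions below 12.1 are believed affected.
    return "believed affected"


for v in ("11.2.0.1", "10.2.0.4", "11.2.0.3.0"):
    print(v, "->", bug_9259587_status(v))
```

The version string itself can be read from the banner in V$VERSION.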

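5. Summary
Approach this kind of problem from several angles: a parameter value that is too low, storage usage, network transfer conditions, a changed SYS password, a database bug, and so on.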

From the "ITPUB Blog", link: http://blog.itpub.net/29119536/viewspace-1296992/. Please credit the source when reposting; otherwise legal liability will be pursued.
