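IPC Send timeout: symptoms
"IPC Send timeout" is one of the more common problems on RAC databases. Once "IPC Send timeout" appears in the database alert log, it is often accompanied by ORA-29740 or "Waiting for clusterware split-brain resolution", and the database instance then terminates abnormally or is evicted from the cluster.
For example:
Instance 1 ALERT LOG: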
Thu Jul 02 05:24:50 2012
IPC Send timeout detected.Sender: ospid 6143755 <==sender
Receiver: inst 2 binc 1323620776 ospid 49715160 <==receiver
Thu Jul 02 05:24:51 2012
IPC Send timeout to 1.7 inc 120 for msg type 65516 from opid 13
Thu Jul 02 05:24:51 2012
Communications reconfiguration: instance_number 2
Waiting for clusterware split-brain resolution <==split-brain occurs
Thu Jul 02 05:24:51 2012
Trace dumping is performing id=[cdmp_20120702052451]
Thu Jul 02 05:34:51 2012
Evicting instance 2 from cluster <==10 minutes later, instance 2 is evicted from the cluster
Instance 2 ALERT LOG:
Thu Jul 02 05:24:50 2012
IPC Send timeout detected. Receiver ospid 49715160 <==receiver
Thu Jul 02 05:24:50 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms6_49715160.trc:
Thu Jul 02 05:24:51 2012
Waiting for clusterware split-brain resolution
Thu Jul 02 05:24:51 2012
Trace dumping is performing id=[cdmp_20120702052451]
Thu Jul 02 05:35:02 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lmon_6257780.trc:
ORA-29740: evicted by member 0, group incarnation 122 <==instance 2 hits ORA-29740 and is evicted from the cluster
Thu Jul 02 05:35:02 2012
LMON: terminating instance due to error 29740
Thu Jul 02 05:35:02 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms7_49453031.trc:
ORA-29740: evicted by member , group incarnation
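The main processes that communicate between RAC instances are LMON, LMD and LMS. Normally, after a message is sent to another instance, the sender expects the receiver to reply with an acknowledgement. If that acknowledgement does not arrive within the allotted time (300 seconds by default), the sender concludes that the message never reached the receiver and reports "IPC Send timeout".
This problem usually has one of the following causes:
1. A network problem that causes packet loss or abnormal communication.
2. Host resource (CPU, memory) problems that keep these processes from being scheduled or leave them unresponsive.
3. An Oracle bug.
4. On AIX, the fix for APAR IZ97457 has not been applied.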
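Whichever cause applies, diagnosis starts by reading the sender and receiver out of alert log entries like those shown above. The following is a minimal sketch of that step, not an Oracle tool: the function name find_ipc_send_timeouts and the regular expressions are assumptions modelled only on the log excerpts in this article, and other versions may format these lines differently.

```python
import re

# Patterns modelled on the alert log excerpts in this article (assumption:
# other Oracle versions may word or space these lines differently).
SENDER_RE = re.compile(r"IPC Send timeout detected\.\s*Sender: ospid (\d+)")
RECEIVER_RE = re.compile(r"Receiver: inst (\d+) binc (-?\d+) ospid (\d+)")


def find_ipc_send_timeouts(lines):
    """Return one dict per sender-side 'IPC Send timeout detected' entry."""
    events = []
    pending = None  # sender entry still waiting for its Receiver line
    for line in lines:
        match = SENDER_RE.search(line)
        if match:
            pending = {"sender_ospid": int(match.group(1))}
            events.append(pending)
            continue
        match = RECEIVER_RE.search(line)
        if match and pending is not None:
            pending["receiver_inst"] = int(match.group(1))
            pending["receiver_binc"] = int(match.group(2))
            pending["receiver_ospid"] = int(match.group(3))
            pending = None
    return events


# Lines copied from the instance 1 alert log above (annotations removed).
sample = [
    "Thu Jul 02 05:24:50 2012",
    "IPC Send timeout detected.Sender: ospid 6143755",
    "Receiver: inst 2 binc 1323620776 ospid 49715160",
]
events = find_ipc_send_timeouts(sample)
print(events)
```

The receiver instance and ospid tell you which node's OSWatcher data and which process trace file to look at next.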
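Example of "IPC Send timeout" caused by a network problem
The alert log of instance 1 shows that the receiver is process 49715160 on node 2:
Thu Jul 02 05:24:50 2012
IPC Send timeout detected.Sender: ospid 6143755 <==sender
Receiver: inst 2 binc 1323620776 ospid 49715160 <==receiver
The OSWatcher vmstat output for node 2 at that time shows no CPU or memory pressure. The OSWatcher netstat output, however, shows a large volume of packets on the private-network interface in the few minutes before the problem occurred.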
Node2:
zzz Thu Jul 02 05:12:38 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.2 4073847798 0 512851119 0 0 <==4073847798 - 4073692530 = 155268 packets per 30 seconds
zzz Thu Jul 02 05:13:08 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.2 4074082951 0 513107924 0 0 <==4074082951 - 4073847798 = 235153 packets per 30 seconds
Node1:
zzz Thu Jul 02 05:12:54 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.1 502159550 0 4079190700 0 0 <==502159550 - 501938658 = 220892 packets per 30 seconds
zzz Thu Jul 02 05:13:25 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.1 502321317 0 4079342048 0 0 <==502321317 - 502159550 = 161767 packets per 30 seconds
When this system is running normally, it transfers only a few thousand packets per 30 seconds:
zzz Thu Jul 02 04:14:09 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.2 4074126796 0 513149195 0 0 <==4074126796 - 4074122374 = 4422 packets per 30 seconds
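Such a sudden burst of network traffic can cause transmission problems, and loss of UDP or IP packets on the network can also produce this error. In this situation, ask the network administrator to check the network. In some cases the problem stopped occurring after the private-network switch was restarted or replaced. (Note that normal traffic volume varies with the hardware and the workload.)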
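The per-interval packet counts in the annotations above are simply differences between consecutive Ipkts values. A minimal sketch of that arithmetic follows. The helper names ipkts_deltas and is_burst and the factor of 10 are illustrative assumptions, not OSWatcher features, and a suitable threshold depends on the system's own normal traffic.

```python
def ipkts_deltas(samples):
    """Given (timestamp, Ipkts) pairs in time order, return (timestamp, delta) per interval."""
    deltas = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        deltas.append((ts, cur - prev))
    return deltas


def is_burst(delta, baseline, factor=10):
    """True when an interval carried at least `factor` times the baseline packet count.

    `factor` is a hypothetical threshold chosen for illustration only.
    """
    return delta >= factor * baseline


# Ipkts values for en1 on node 2, copied from the netstat output above.
node2 = [("05:12:38", 4073847798), ("05:13:08", 4074082951)]
baseline = 4422  # packets per 30 seconds during the normal period shown above

for ts, delta in ipkts_deltas(node2):
    print(ts, delta, "burst" if is_burst(delta, baseline) else "normal")
```

For the two node 2 samples this gives a delta of 235153 packets, matching the annotation in the netstat output, which is roughly 50 times the normal-period figure of 4422.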
Example of "IPC Send timeout" caused by high CPU load
Fri Aug 01 02:04:29 2008
IPC Send timeout detected.Sender: ospid 1506825 <==sender
Receiver: inst 2 binc -298848812 ospid 1596935 <==receiver
The OSWatcher vmstat output for node 2 at that time:
zzz ***Fri Aug 01 02:01:51 CST 2008
System Configuration: lcpu=32 mem=128000MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
25 1 7532667 19073986 0 0 0 0 5 0 9328 88121 20430 32 10 47 11
58 0 7541201 19065392 0 0 0 0 0 0 11307 177425 10440 87 13 0 0 <==idle CPU is 0, meaning the CPU is 100% utilized
61 1 7552592 19053910 0 0 0 0 0 0 11122 206738 10970 85 15 0 0
zzz ***Fri Aug 01 02:03:52 CST 2008
System Configuration: lcpu=32 mem=128000MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
25 1 7733673 18878037 0 0 0 0 5 0 9328 88123 20429 32 10 47 11
81 0 7737034 18874601 0 0 0 0 0 0 9081 209529 14509 87 13 0 0 <==the CPU run queue is very high
80 0 7736142 18875418 0 0 0 0 0 0 9765 156708 14997 91 9 0 0 <==idle CPU is 0, meaning the CPU is 100% utilized
This example shows that when host CPU load is very high, the receiving process cannot respond to the sender, which triggers "IPC Send timeout".
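The two signals highlighted in the vmstat output above, an idle value of 0 and a run queue well above the number of logical CPUs, can be checked mechanically. The sketch below assumes the 17-column AIX layout shown in this article (r b avm fre re pi po fr sr cy in sy cs us sy id wa). The function names are hypothetical, and the rule "run queue greater than lcpu" is an illustrative threshold rather than an Oracle or IBM recommendation. Also note that the first data row of a vmstat report generally reflects averages since boot, so it is usually the later rows that matter.

```python
def parse_vmstat_row(row):
    """Parse one AIX vmstat data row; return None for headers and other text."""
    fields = row.split()
    if len(fields) != 17 or not all(f.isdigit() for f in fields):
        return None
    values = [int(f) for f in fields]
    return {
        "r": values[0],    # run queue (kthr r)
        "b": values[1],    # blocked threads (kthr b)
        "us": values[13],  # user CPU %
        "sy": values[14],  # system CPU %
        "id": values[15],  # idle CPU %
        "wa": values[16],  # I/O wait %
    }


def cpu_saturated(sample, lcpu):
    """Illustrative rule: no idle CPU, or more runnable threads than logical CPUs."""
    return sample["id"] == 0 or sample["r"] > lcpu


# Rows copied from the 02:01:51 sample above (annotations removed); lcpu=32.
rows = [
    "r b avm fre re pi po fr sr cy in sy cs us sy id wa",
    "25 1 7532667 19073986 0 0 0 0 5 0 9328 88121 20430 32 10 47 11",
    "58 0 7541201 19065392 0 0 0 0 0 0 11307 177425 10440 87 13 0 0",
]
for row in rows:
    sample = parse_vmstat_row(row)
    if sample is not None:
        print(sample["r"], sample["id"], cpu_saturated(sample, lcpu=32))
```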
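Common bugs that cause IPC Send timeout
On 10g, the common bugs behind this problem are Bug 5190596 and Bug 6200820. Both mostly appear in 10.2.0.3 and 10.2.0.4 and are fixed in 10.2.0.5. For details, see the following MOS articles: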
LMON dumps LMS0 too often during DRM leading to IPC send timout [ID 5190596.8]
'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration [ID 458912.1]
On 11g, the common bugs are Bug 6200820 and Bug 7653579. For details, see the following MOS articles:
Bug 6200820 AQ node affinity not reconfigured after RAC reconfiguration (QMNC timeouts)
Bug 7653579 - IPC send timeout in RAC after only short period [ID 7653579.8]
IPC Send timeout caused by the missing fix for APAR IZ97457 on AIX
On this point, the MOS article
AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]
describes the issue as follows:
Applies to:
Oracle Server - Enterprise Edition - Version 9.2.0.2 and later
IBM AIX on POWER Systems (64-bit)
Symptoms
Environment with IBM AIX VIO experiences one or some or all of the following symptoms:
Packet Loss
Cache Fusion "block lost"
IPC Send timeout
Instance Eviction
SKGXPSEGRCV: MESSAGE TRUNCATED user data nnnn bytes payload nnnn bytes
Cause
AIX issue APAR IZ97457 - A VIOS Server will not forward traffic from its VIO Clients to the external network
Solution
Please engage your OS vendor for fix.
References: https://blogs.oracle.com/Database4CN/entry/%E5%A6%82%E4%BD%95%E8%AF%8A%E6%96%ADrac%E6%95%B0%E6%8D%AE%E5%BA%93%E4%B8%8A%E7%9A%84_ipc_send_timeout_%E9%97%AE%E9%A2%98
AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]
LMON dumps LMS0 too often during DRM leading to IPC send timout [ID 5190596.8]
'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration [ID 458912.1]
Bug 6200820 AQ node affinity not reconfigured after RAC reconfiguration (QMNC timeouts)
Bug 7653579 - IPC send timeout in RAC after only short period [ID 7653579.8]
Top 5 issues for Instance Eviction (Doc ID 1374110.1)
From the ITPUB blog, link: http://blog.itpub.net/28612416/viewspace-2127261/. If you repost this article, please credit the source; otherwise legal action may be taken.