IPC Send timeout Failure Symptoms

Posted by 流浪的野狼 on 2016-10-28

Note: this article is reposted from http://czmmiao.iteye.com/blog/1763055

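One of the more common problems on RAC databases is "IPC Send timeout". After "IPC Send timeout" appears in the database alert log, it is often accompanied by ORA-29740 or "Waiting for clusterware split-brain resolution", and the database instance then terminates abnormally or is evicted from the cluster.

For example:

Alert log of instance 1: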

Thu Jul 02 05:24:50 2012

IPC Send timeout detected.Sender: ospid 6143755      <== sender

Receiver: inst 2 binc 1323620776 ospid 49715160        <== receiver

Thu Jul 02 05:24:51 2012

IPC Send timeout to 1.7 inc 120 for msg type 65516 from opid 13

Thu Jul 02 05:24:51 2012

Communications reconfiguration: instance_number 2

Waiting for clusterware split-brain resolution       <== split-brain occurs

Thu Jul 02 05:24:51 2012

Trace dumping is performing id=[cdmp_20120702052451]

Thu Jul 02 05:34:51 2012

Evicting instance 2 from cluster   <== 10 minutes later, instance 2 is evicted from the cluster

Alert log of instance 2:

Thu Jul 02 05:24:50 2012

IPC Send timeout detected. Receiver ospid 49715160       <== receiver

Thu Jul 02 05:24:50 2012

Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms6_49715160.trc:

Thu Jul 02 05:24:51 2012

Waiting for clusterware split-brain resolution

Thu Jul 02 05:24:51 2012

Trace dumping is performing id=[cdmp_20120702052451]

Thu Jul 02 05:35:02 2012

Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lmon_6257780.trc:

ORA-29740: evicted by member 0, group incarnation 122  <== instance 2 hits ORA-29740 and is evicted from the cluster

Thu Jul 02 05:35:02 2012

LMON: terminating instance due to error 29740

Thu Jul 02 05:35:02 2012

Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms7_49453031.trc:

ORA-29740: evicted by member , group incarnation
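The roughly 10-minute gap between "Waiting for clusterware split-brain resolution" and "Evicting instance 2 from cluster" in the instance 1 log can be checked from the timestamps. This is a minimal Python sketch, assuming the timestamp format shown in the excerpts above; the helper name `parse_alert_ts` is hypothetical.

```python
from datetime import datetime

def parse_alert_ts(line):
    """Parse an alert log timestamp such as 'Thu Jul 02 05:24:51 2012'."""
    # Drop the weekday name and parse the remaining fields.
    _, rest = line.strip().split(" ", 1)
    return datetime.strptime(rest, "%b %d %H:%M:%S %Y")

# Timestamps copied from the instance 1 alert log above.
split_brain_wait = parse_alert_ts("Thu Jul 02 05:24:51 2012")
eviction = parse_alert_ts("Thu Jul 02 05:34:51 2012")

delay_seconds = (eviction - split_brain_wait).total_seconds()
print(delay_seconds)  # 600.0
```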

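The main processes that communicate between RAC instances are LMON, LMD, and LMS. Normally, after a message is sent to another instance, the sender expects the receiver to reply with an acknowledgement. If that acknowledgement is not received within the allowed time (300 seconds by default), the sender concludes that the message never reached the receiver, and "IPC Send timeout" is reported.

This problem usually has one of the following causes:

1. A network problem that causes packet loss or communication errors.

2. Host resource (CPU, memory) problems that prevent these processes from being scheduled or leave them unresponsive.

3. An Oracle bug.

4. On the AIX platform, the fix for APAR IZ97457 has not been applied.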
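The acknowledgement rule can be pictured with a toy model. The Python sketch below is only an illustration of the timeout idea and is not Oracle's implementation; the function name `ack_overdue` and the constant are hypothetical, and the 300-second value is the default quoted in this article.

```python
IPC_ACK_TIMEOUT_SECONDS = 300  # default quoted in the text; not read from Oracle

def ack_overdue(sent_at, now, ack_received, timeout=IPC_ACK_TIMEOUT_SECONDS):
    """Return True when a message has waited longer than `timeout`
    seconds without an acknowledgement (times are in seconds)."""
    if ack_received:
        return False
    return (now - sent_at) > timeout

print(ack_overdue(sent_at=0, now=299, ack_received=False))  # False
print(ack_overdue(sent_at=0, now=301, ack_received=False))  # True
print(ack_overdue(sent_at=0, now=301, ack_received=True))   # False
```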

Example of "IPC Send timeout" caused by a network problem

The alert log of instance 1 shows that the receiver is process 49715160 on node 2:

Thu Jul 02 05:24:50 2012

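IPC Send timeout detected.Sender: ospid 6143755       <== sender

Receiver: inst 2 binc 1323620776 ospid 49715160       <== receiver

The OSWatcher vmstat output for node 2 at that time shows no CPU or memory pressure. The OSWatcher netstat output, however, shows a large volume of packets on the private interconnect interface in the few minutes before the problem occurred.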
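Pulling the sender and receiver out of an alert log can be scripted. The following Python sketch assumes the two message formats shown in the excerpt above; the regular expressions and the function name `find_ipc_timeouts` are illustrative and may need adjusting for other Oracle versions.

```python
import re

# Patterns modeled on the alert log lines quoted in this article.
SENDER_RE = re.compile(r"IPC Send timeout detected\.\s*Sender: ospid (\d+)")
RECEIVER_RE = re.compile(r"Receiver: inst (\d+) binc (-?\d+) ospid (\d+)")

def find_ipc_timeouts(lines):
    """Return one dict per sender/receiver pair found in alert log lines."""
    events = []
    sender = None
    for line in lines:
        m = SENDER_RE.search(line)
        if m:
            sender = int(m.group(1))
            continue
        m = RECEIVER_RE.search(line)
        if m and sender is not None:
            events.append({
                "sender_ospid": sender,
                "receiver_inst": int(m.group(1)),
                "receiver_ospid": int(m.group(3)),
            })
            sender = None
    return events

sample = [
    "Thu Jul 02 05:24:50 2012",
    "IPC Send timeout detected.Sender: ospid 6143755",
    "",
    "Receiver: inst 2 binc 1323620776 ospid 49715160",
]
events = find_ipc_timeouts(sample)
print(events)
```

The receiver instance number and OS process ID identify which node's OSWatcher data to examine for the same time window.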

Node2:

zzz Thu Jul 02 05:12:38 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.2       4073847798     0 512851119     0     0 <== 4073847798 - 4073692530 = 155268 packets per 30 seconds

zzz Thu Jul 02 05:13:08 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.2       4074082951     0 513107924     0     0 <== 4074082951 - 4073847798 = 235153 packets per 30 seconds

Node1:

zzz Thu Jul 02 05:12:54 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.1       502159550     0 4079190700     0     0 <== 502159550 - 501938658 = 220892 packets per 30 seconds

zzz Thu Jul 02 05:13:25 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.1       502321317     0 4079342048     0     0 <== 502321317 - 502159550 = 161767 packets per 30 seconds

When this system is running normally, it transfers only a few thousand packets every 30 seconds:

zzz Thu Jul 02 04:14:09 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

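en1   1500  10.182.3    10.182.3.2       4074126796     0 513149195     0     0 <== 4074126796 - 4074122374 = 4422 packets per 30 seconds

A sudden burst of network traffic like this can cause transmission errors, and lost UDP or IP packets on the network can also cause this error. In this situation, ask the network administrator to check the network. In some cases the problem stopped after the private-network switch was restarted or replaced. (Note that normal traffic volume varies with hardware and workload.)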
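The packet-rate comparison above is a subtraction of consecutive Ipkts counters. Here is a minimal Python sketch using the node 2 values quoted in this article; the `is_spike` helper and its factor of 10 are arbitrary assumptions for illustration, not an Oracle or OSWatcher threshold, and counter wraparound is not handled.

```python
def packets_per_interval(previous_ipkts, current_ipkts):
    """Packets received between two consecutive netstat snapshots."""
    return current_ipkts - previous_ipkts

def is_spike(rate, baseline, factor=10):
    """Flag a rate that exceeds the baseline by an assumed factor."""
    return rate > factor * baseline

# Ipkts values for en1 on node 2, copied from the OSWatcher output above.
busy = packets_per_interval(4073847798, 4074082951)    # 05:12:38 -> 05:13:08
normal = packets_per_interval(4074122374, 4074126796)  # interval ending 04:14:09

print(busy, normal, round(busy / normal, 1))  # 235153 4422 53.2
print(is_spike(busy, normal))                 # True
```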

Example of "IPC Send timeout" caused by excessive CPU load

The alert log of instance 1 shows that the receiver is process 1596935 on node 2:

Fri Aug 01 02:04:29 2008 

 IPC Send timeout detected.Sender: ospid 1506825 <== sender

 Receiver: inst 2 binc -298848812 ospid 1596935  <== receiver

The OSWatcher vmstat output for node 2 at that time:

 zzz ***Fri Aug 01 02:01:51 CST 2008 

 System Configuration: lcpu=32 mem=128000MB 

 kthr     memory             page              faults        cpu     

 ----- ----------- ------------------------ ------------ ----------- 

  r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa 

 25  1 7532667 19073986   0   0   0   0    5   0 9328 88121 20430 32 10 47 11 

58  0 7541201 19065392   0   0   0   0    0   0 11307 177425 10440 87 13  0  0 <== idle CPU is 0, meaning the CPU is 100% utilized

61  1 7552592 19053910   0   0   0   0    0   0 11122 206738 10970 85 15  0  0 

 zzz ***Fri Aug 01 02:03:52 CST 2008 

   System Configuration: lcpu=32 mem=128000MB 

   kthr     memory             page              faults        cpu     

 ----- ----------- ------------------------ ------------ ----------- 

  r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa 

 25  1 7733673 18878037   0   0   0   0    5   0 9328 88123 20429 32 10 47 11 

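81  0 7737034 18874601   0   0   0   0    0   0 9081 209529 14509 87 13  0  0 <== the CPU run queue is very high

80  0 7736142 18875418   0   0   0   0    0   0 9765 156708 14997 91  9  0  0 <== idle CPU is 0, meaning the CPU is 100% utilized

This example shows that when host CPU load is very high, the receiving process cannot respond to the sender, which triggers "IPC Send timeout".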
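The same check can be applied to vmstat samples programmatically. This Python sketch assumes the 17-column AIX vmstat layout shown above; the column name `sy_calls` (for the first of the two `sy` columns) and the saturation rule (idle at 0 and run queue above the logical CPU count) are assumptions made for illustration.

```python
# Column order follows the vmstat header above; the first "sy" (system calls)
# is renamed here so it does not clash with the CPU "sy" column.
COLUMNS = ["r", "b", "avm", "fre", "re", "pi", "po", "fr", "sr", "cy",
           "in", "sy_calls", "cs", "us", "sy", "id", "wa"]

def parse_vmstat_line(line):
    """Turn one vmstat data line into a dict keyed by column name."""
    values = [int(v) for v in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError("unexpected number of vmstat columns")
    return dict(zip(COLUMNS, values))

def is_cpu_saturated(sample, lcpu):
    """Assumed rule: no idle CPU and more runnable threads than logical CPUs."""
    return sample["id"] == 0 and sample["r"] > lcpu

# Data lines copied from the 02:03:52 snapshot above (lcpu=32).
lines = [
    "25  1 7733673 18878037   0   0   0   0    5   0 9328 88123 20429 32 10 47 11",
    "81  0 7737034 18874601   0   0   0   0    0   0 9081 209529 14509 87 13  0  0",
    "80  0 7736142 18875418   0   0   0   0    0   0 9765 156708 14997 91  9  0  0",
]
flags = [is_cpu_saturated(parse_vmstat_line(l), lcpu=32) for l in lines]
print(flags)  # [False, True, True]
```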

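Common bugs that cause IPC Send timeout

On 10g, the bugs commonly behind this problem are Bug 5190596 and Bug 6200820. Both appear mostly in 10.2.0.3 and 10.2.0.4 and are fixed in 10.2.0.5. For details, see the following MOS articles: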

LMON dumps LMS0 too often during DRM leading to IPC send timeout [ID 5190596.8]

'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration [ID 458912.1]

On 11g, the common bugs are Bug 6200820 and Bug 7653579. For details, see the following MOS articles:

Bug 6200820  AQ node affinity not reconfigured after RAC reconfiguration (QMNC timeouts)

Bug 7653579 - IPC send timeout in RAC after only short period [ID 7653579.8]

IPC Send timeout caused by the missing IZ97457 fix on AIX

On this point, the MOS article

AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]

states the following:

Applies to:

Oracle Server - Enterprise Edition - Version 9.2.0.2 and later

IBM AIX on POWER Systems (64-bit)

Symptoms

Environment with IBM AIX VIO experiences one or some or all of the following symptoms:

Packet Loss

Cache Fusion "block lost"

IPC Send timeout

Instance Eviction

SKGXPSEGRCV: MESSAGE TRUNCATED user data nnnn bytes payload nnnn bytes

Cause

AIX issue APAR IZ97457 - A VIOS Server will not forward traffic from its VIO Clients to the external network

Solution

Please engage your OS vendor for fix.

Oracle recommends applying the fix. The description of APAR IZ97457 is as follows:
Error description
A VIOS Server will not forward traffic from its VIO Clients to the external network.
Packets from the VIO Client travel to the hypervisor(phype) but the packets are dropped by the hypervisor as it attempts to deliver the packet to the VIO Server's trunk adapter.
The hypervisor will have dropped the packets because there are no buffers to place the data in. On the VIOServer,interrupts are not activating the trunk adapter to read and remove data from its buffers. This results in having full buffers at the trunk adapter.
Since the trunk adapter's buffers are full, phype cannot deliver the data and so VIO Clients cannot get packets through the SEA adapter and out to the network.
The problem was discovered on P7 systems where Vlans on the SEA are used.
"Hypervisor Receive" errors on the trunk adapter will increase as this problem occurs and the VIO Clients are not able to reach the outside network.
Problem summary
Unresponsive VIO Clients with traffice not forwarded to external network.
Problem conclusion
Ensure proper locking around receive scheduling operations.
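As the description shows, the IZ97457 fix addresses the case where the network buffers fill up. AIX users are advised to check whether this fix has been applied.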

References:

https://blogs.oracle.com/Database4CN/entry/%E5%A6%82%E4%BD%95%E8%AF%8A%E6%96%ADrac%E6%95%B0%E6%8D%AE%E5%BA%93%E4%B8%8A%E7%9A%84_ipc_send_timeout_%E9%97%AE%E9%A2%98

AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]

LMON dumps LMS0 too often during DRM leading to IPC send timeout [ID 5190596.8]

'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration [ID 458912.1]

Bug 6200820  AQ node affinity not reconfigured after RAC reconfiguration (QMNC timeouts)

Bug 7653579 - IPC send timeout in RAC after only short period [ID 7653579.8]

Top 5 issues for Instance Eviction (Doc ID 1374110.1)

From the ITPUB blog, link: http://blog.itpub.net/28612416/viewspace-2127261/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
