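Problem Description

Yesterday the database instance on one node of a production RAC cluster went down on its own and then quickly recovered by itself. I checked the database alert log, the ASM alert log, and the clusterware alert log in turn. The ASM and clusterware alert logs showed nothing significant around the time of the crash, but the database alert log contained the following: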

IPC Send timeout detected. Receiver ospid 31266 [

Sun Jan 24 06:07:59 2021

Errors in file /u01/app/oracle/diag/rdbms/racdb/racdb3/trace/racdb3_lmon_31266.trc:

Sun Jan 24 06:08:48 2021

Detected an inconsistent instance membership by instance 4

Sun Jan 24 06:08:48 2021

Received an instance abort message from instance 4

Please check instance 4 alert and LMON trace files for detail.

LMS0 (ospid: 31270): terminating the instance due to error 481

Sun Jan 24 06:08:48 2021

System state dump requested by (instance=3, osid=31270 (LMS0)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/racdb/racdb3/trace/racdb3_diag_31256_20210124060848.trc

Instance terminated by LMS0, pid = 31270

Sun Jan 24 06:09:00 2021

Starting ORACLE instance (normal)

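The database alert log shows that an IPC send timeout was detected, that the local instance (instance 3) was terminated by LMS0 with error 481 after receiving an abort message from instance 4, and that the local instance then started up again normally. In other words, the instance was evicted from the cluster and rejoined on its own.

When "IPC Send timeout detected" occurs, the "netstat" statistics may show a growing number of "packet reassembles failed" entries. This can be checked in the OSWatcher netstat archive as follows: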

$ cd /oracle/app/grid/tfa/repository/suptools/x38503/oswbb/grid/archive/oswnetstat/

Check whether there are many "packet reassembles failed" entries:

$ grep -ni 'packet reassembles failed' x38503_netstat_21.01.24.0600.dat

3283: 2671396 packet reassembles failed

3438: 2672065 packet reassembles failed

3593: 2672586 packet reassembles failed

3748: 2673032 packet reassembles failed

3903: 2673516 packet reassembles failed

4058: 2674057 packet reassembles failed

4213: 2674852 packet reassembles failed

4368: 2675658 packet reassembles failed

4523: 2675980 packet reassembles failed

4678: 2676232 packet reassembles failed

.........

8708: 2680307 packet reassembles failed

8863: 2680666 packet reassembles failed
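The counter reported by netstat is cumulative, so what matters is how fast it grows between samples rather than its absolute value. A minimal sketch for turning the grep output above into per-sample increases follows; the function name `reasm_deltas` is hypothetical, and it assumes the input lines look like the ones shown above (with or without the `grep -n` line-number prefix).

```shell
# Hypothetical helper: print the increase of the "packet reassembles failed"
# counter between consecutive samples read from standard input.
reasm_deltas() {
  awk '/packet reassembles failed/ {
         # With "grep -n" output the counter is field 2 (field 1 is "NNNN:");
         # in raw "netstat -s" output it is field 1.
         n = ($1 ~ /:$/) ? $2 : $1
         if (seen) print n - prev
         prev = n; seen = 1
       }'
}

# Example using the first three samples from the output above.
# Prints 669 and then 521.
printf '%s\n' \
  '3283: 2671396 packet reassembles failed' \
  '3438: 2672065 packet reassembles failed' \
  '3593: 2672586 packet reassembles failed' | reasm_deltas
```

Against the real archive, the same function could be fed with `grep -i 'packet reassembles failed' x38503_netstat_21.01.24.0600.dat | reasm_deltas`. Steady increases of several hundred per sample, as seen here, suggest that fragments are being dropped continuously.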

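A large number of "packet reassembles failed" errors can lead to two symptoms:

1. Node eviction

2. After a node eviction, the instance or node will not rejoin the cluster automatically unless the node that is producing the "packet reassembles failed" errors is restarted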

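Solution

Depending on the MTU (Maximum Transmission Unit) size, a large UDP packet may be fragmented and sent in multiple frames. These fragments have to be reassembled on the receiving node. High CPU utilization (sustained or in frequent spikes) or a reassembly buffer that is too small can cause reassembly to fail. When that happens, the "IP Statistics" section of the 'netstat -s' output on the receiving node reports a large number of "reassembles failed". Fragments must be reassembled within a fixed time; fragments that are not reassembled in time are dropped and have to be retransmitted. Fragments that have been received but cannot be reassembled for lack of buffer space are dropped outright.

The fix is to increase the reassembly buffer size so that more memory is available for reassembly: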
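To make the fragmentation arithmetic concrete, the sketch below estimates how many IPv4 fragments a single UDP datagram is split into. It is only an illustration: the function name `udp_fragments` is hypothetical, the 8192-byte payload is an example value and not taken from this system, and the calculation assumes a 20-byte IPv4 header without options and an 8-byte UDP header.

```shell
# Number of IPv4 fragments for a UDP payload of $1 bytes on a link with MTU $2.
# Assumes a 20-byte IPv4 header (no options) and an 8-byte UDP header.
udp_fragments() {
  payload=$1
  mtu=$2
  per_frag=$(( (mtu - 20) / 8 * 8 ))  # data per fragment must be a multiple of 8 bytes
  total=$(( payload + 8 ))            # the UDP header is part of the IP payload
  echo $(( (total + per_frag - 1) / per_frag ))  # round up
}

udp_fragments 8192 1500   # prints 6
udp_fragments 8192 9000   # prints 1
```

With an MTU of 1500, every such datagram depends on six fragments arriving and being reassembled in time; losing any one of them discards the whole datagram, which is why the reassembly buffer matters on a busy interconnect.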

# vi /etc/sysctl.conf

# Commonly cited defaults on older kernels: ipfrag_high_thresh = 262144, ipfrag_low_thresh = 196608
net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640

# sysctl -p -- apply the new parameters
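After reloading, it is worth confirming that the running kernel actually picked up the new thresholds. The following is a minimal sketch, assuming a Linux host where the values are exposed under /proc/sys/net/ipv4; the function name `check_ipfrag` and the `PROC_ROOT` variable are hypothetical and exist only so the check can be tried against a test directory.

```shell
# Hypothetical check: compare the running values with the targets set above.
# PROC_ROOT defaults to the standard Linux location of the ipfrag settings.
check_ipfrag() {
  root=${PROC_ROOT:-/proc/sys/net/ipv4}
  want_high=16777216
  want_low=15728640
  high=$(cat "$root/ipfrag_high_thresh") || return 2
  low=$(cat "$root/ipfrag_low_thresh") || return 2
  if [ "$high" -eq "$want_high" ] && [ "$low" -eq "$want_low" ]; then
    echo "OK: high=$high low=$low"
  else
    echo "MISMATCH: high=$high low=$low"
    return 1
  fi
}
```

Running `check_ipfrag` on each RAC node should print an OK line once the settings are active; the same values can also be read directly with `sysctl -n net.ipv4.ipfrag_high_thresh`. The check should be repeated on every node, since the setting is local to each host.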

---- end ----
