Background

The VIP of a customer's Oracle 10g RAC database went down, causing the VIP to fail over to the other node.

Incident details

On 2013-10-28 at 04:29:56.378 the VIP on node 1 reported "went OFFLINE unexpectedly". To find out why the VIP had gone down, debug level 5 was enabled for the VIP resource the same day:

./crsctl debug log res "ora.hostname1.vip:5"

On 2013-11-01 at 04:38:36.047 the VIP on node 1 again reported "went OFFLINE unexpectedly".

According to ora.hostname1.vip.log, the VIP outage was almost certainly caused by poor connectivity between the public IP and the default gateway.

Following Oracle's official recommendation, the parameter in the racgvip script was changed from FAIL_WHEN_DEFAULTGW_NO_FOUND=1 to

FAIL_WHEN_DEFAULTGW_NO_FOUND=0

However, the failure persisted after the change:

04:17:37.822: [ CRSRES][11025]32ora.hostname1.vip on hostname1 went OFFLINE unexpectedly

To pin down the cause, ora.hostname1.vip.log and the racgvip script were collected again for analysis.

The results of the analysis are as follows.

The racgvip script contains the following code:

# Check the status of the interface thro' pinging gateway
if [ -n "$DEFAULTGW" ]
then
  _RET=1
  # get base IP address of the interface
  tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
  # get RX packets numbers (bug8341569,9157855->bug9743421)
  _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$(NF-4); exit}}"`
  x=$CHECK_TIMES
  while [ $x -gt 0 ]
  do
    if [ -n "$tmpIP" ]
    then
      logx "About to execute command: $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW"
      $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
    else
      logx "About to execute command: $PING $PING_TIMEOUT $DEFAULTGW"
      $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
    fi
    _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$(NF-4); exit}}"`
    if [ "$_O1" != "$_O2" ]
    then
      # RX packets numbers changed
      _RET=0
      break
    fi
    $SLEEP 1
    x=`$EXPR $x - 1`
  done
  if [ $_RET -ne 0 ]
  then
    logx "IsIfAlive: RX packets checked if=$_IF failed"
  else
    logx "IsIfAlive: RX packets checked if=$_IF OK"
  fi
else
  logx "IsIfAlive: Default gateway is not defined (host=$HOSTNAME)"
  if [ $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]
  then
    _RET=1
  else
    _RET=0
  fi
fi

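The source code shows how the default gateway check works:

1. If a default gateway is detected, the gateway check logic is executed.

2. _O1 records the received (RX) packet count of the interface.

3. $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW pings the gateway.

4. _O2 records the RX packet count of the interface again.

5. If _O1 and _O2 differ, communication between the interface and the default gateway is considered normal, and _RET is set to 0.

6. If _O1 and _O2 are still equal after all CHECK_TIMES attempts, _RET is never reassigned and keeps its initial value of 1, and the script logs logx "IsIfAlive: RX packets checked if=$_IF failed", meaning the interface is judged to have failed.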

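Changing FAIL_WHEN_DEFAULTGW_NO_FOUND from 1 to 0 was intended to skip the gateway ping check. The source shows, however, that the parameter is only consulted when $DEFAULTGW is empty: _RET becomes 0 only if the host has no default gateway configured and FAIL_WHEN_DEFAULTGW_NO_FOUND is set to a value other than 1.

In our environment DEFAULTGW is obtained successfully and is therefore non-empty, so the script never reaches the branch that tests whether FAIL_WHEN_DEFAULTGW_NO_FOUND is 1.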
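
The branch structure described above can be condensed into a small stand-alone sketch. It is not Oracle's code: the function name gw_check_ret and its three arguments are hypothetical stand-ins for DEFAULTGW, FAIL_WHEN_DEFAULTGW_NO_FOUND, and the outcome of the _O1/_O2 comparison.

```shell
#!/bin/sh
# Hypothetical model of the 10g decision logic quoted above; not Oracle's code.
# $1 = default gateway ("" if none)
# $2 = value of FAIL_WHEN_DEFAULTGW_NO_FOUND
# $3 = "changed" if the RX counter moved after the ping, anything else if not
# Prints the value _RET would end up with (0 = interface alive, 1 = failed).
gw_check_ret() {
  gw=$1
  flag=$2
  rx=$3
  if [ -n "$gw" ]; then
    # Gateway known: only the RX counter comparison matters.
    if [ "$rx" = "changed" ]; then echo 0; else echo 1; fi
  else
    # No gateway: this is the only place the flag is consulted.
    if [ "$flag" -eq 1 ]; then echo 1; else echo 0; fi
  fi
}

# With a gateway present, flag=1 and flag=0 give the same result.
gw_check_ret 10.0.241.254 1 unchanged
gw_check_ret 10.0.241.254 0 unchanged
# Without a gateway, the flag decides.
gw_check_ret "" 0 unchanged
```

Run as written, the three calls print 1, 1 and 0: with a gateway configured, setting the flag to 0 does not change the outcome.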

The debug output captured during the failure is as follows:

2013-11-06 04:17:37.776: [ RACG][1] [18219068][1][ora.s9lp1.vip]: Wed Nov 6 04:17:33 CST 2013 [ 6422696 ] checkIf: start for if=en5

Wed Nov 6 04:17:33 CST 2013 [ 6422696 ] IsIfAlive: start for if=en5

Wed Nov 6 04:17:33 CST 2013 [ 6422696 ] defaultgw: started

2013-11-06 04:17:37.776: [ RACG][1] [18219068][1][ora.s9lp1.vip]: Wed Nov 6 04:17:33 CST 2013 [ 6422696 ] defaultgw: completed with 10.0.241.254 (the gateway was obtained successfully: 10.0.241.254)

Wed Nov 6 04:17:33 CST 2013 [ 6422696 ] About to execute command: /usr/sbin/ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254

2013-11-06 04:17:37.777: [ RACG][1] [18219068][1][ora.s9lp1.vip]: Wed Nov 6 04:17:35 CST 2013 [ 6422696 ] About to execute command: /usr/sbin/ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254 (ping the gateway)

Wed Nov 6 04:17:37 CST 2013 [ 6422696 ] IsIfAlive: RX packets checked if=en5 failed (the RX packet count on en5 did not change within 2 seconds, so en5 was judged to have failed)

1. Every failure occurred at around 04:00 in the early morning, at the following times:

2013-10-28 04:29:56

2013-11-01 04:38:36

2013-11-06 04:17:37
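2. Judging from the source code, the RX packet count on interface en5 did not change for a full second while the failure was occurring.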
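
To confirm a time-of-day pattern like the one in item 1 across a longer log, the offline events can be tallied by hour. The following is a minimal sketch, not part of the original investigation: the function name offline_by_hour is hypothetical, and it assumes each event line starts with a "YYYY-MM-DD HH:MM:SS" timestamp. The sample lines are assembled from the dates and times listed above.

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin and prints "HH count" for
# every hour in which a "went OFFLINE unexpectedly" event was logged.
# Assumes field 1 is the date and field 2 starts with HH:MM:SS.
offline_by_hour() {
  grep 'went OFFLINE unexpectedly' |
    awk '{ split($2, t, ":"); n[t[1]]++ }
         END { for (h in n) print h, n[h] }' |
    sort
}

# Sample input built from the three failure times in this incident.
offline_by_hour <<'EOF'
2013-10-28 04:29:56.378: ora.hostname1.vip on hostname1 went OFFLINE unexpectedly
2013-11-01 04:38:36.047: ora.hostname1.vip on hostname1 went OFFLINE unexpectedly
2013-11-06 04:17:37.822: ora.hostname1.vip on hostname1 went OFFLINE unexpectedly
EOF
```

For the sample input this prints "04 3", i.e. all three events fall in the 04:00 hour.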

Possible cause:

ping -S 10.0.241.150 -c 1 -w 1 10.0.241.254

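When Oracle checked the gateway, poor network quality at that moment prevented the ping from returning within 1 second.

As a result, the packet count on en5 was the same before and after the ping.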

Based on the analysis above, we recommend:

1. Modify the racgvip source so that the gateway check is skipped.

Before:

# Check the status of the interface thro' pinging gateway

if [ -n "$DEFAULTGW" ]

After:

# Check the status of the interface thro' pinging gateway
if [ -n "$DEFAULTGW" -a $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]
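
One way to apply this one-line edit while keeping a copy of the original script is sketched below. This is only an illustration: patch_racgvip is a hypothetical helper rather than an Oracle tool, the .orig backup name is arbitrary, and the path in the usage comment is an assumption about where racgvip lives.

```shell
#!/bin/sh
# Hypothetical helper, not an Oracle tool.
# patch_racgvip FILE: saves FILE as FILE.orig, then rewrites the line
#   if [ -n "$DEFAULTGW" ]
# so that it also requires FAIL_WHEN_DEFAULTGW_NO_FOUND to be 1.
patch_racgvip() {
  f=$1
  cp -p "$f" "$f.orig" || return 1
  # Leading indentation is preserved via \1; trailing blanks are tolerated.
  sed 's/^\([[:space:]]*\)if \[ -n "\$DEFAULTGW" ][[:space:]]*$/\1if [ -n "$DEFAULTGW" -a $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]/' \
    "$f.orig" > "$f"
}

# Usage (path is an assumption; adjust to the local CRS home):
#   patch_racgvip "$ORA_CRS_HOME/bin/racgvip"
#   diff "$ORA_CRS_HOME/bin/racgvip.orig" "$ORA_CRS_HOME/bin/racgvip"
```

Reviewing the diff before the script is used again helps confirm that exactly one line changed.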

The racgvip code shipped with Oracle 11.2.0.3 shows that Oracle made the same change.

The following is the racgvip code from Oracle 11g:

if [ -n "$DEFAULTGW" -a $FAIL_WHEN_DEFAULTGW_NOT_FOUND -eq 1 ]
then
  _RET=1
  # get RX packets numbers
  _O1=`$IFCONFIG $_IF | $AWK '{ if (/RX packets:/) { sub("packets:", "", $2); print $2}}'`
  x=$CHECK_TIMES
  while [ $x -gt 0 ]
  do
    logx "About to execute $PING -r -I $_IF $DEFAULTGW $PING_TIMEOUT"
    $PING -r -I $_IF $DEFAULTGW $PING_TIMEOUT > /dev/null 2>&1
    rc=$?
    if [ $rc -eq 0 ]
    then
      _RET=0
      break
    else
      echo "ping to $DEFAULTGW via $_IF failed, rc = $rc (host=$HOSTNAME)"
    fi
    x=$(($x-1))
  done

Conclusion and solution

Modify the racgvip code as described above.

After the change, watch ora.s9lp1.vip.log for the following message:

IsIfAlive: Default gateway is not defined (host=$HOSTNAME)

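If it appears, the change has taken effect: with FAIL_WHEN_DEFAULTGW_NO_FOUND=0 the modified test is false, so the script takes the else branch, which writes this message and sets _RET=0.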
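
A quick way to look for that message is to grep the VIP log. The sketch below is only a convenience wrapper; the function name vip_gw_message_logged is hypothetical, and the log file name in the usage comment is the one mentioned above, with its directory left to the reader.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if the "Default gateway is not defined"
# message from the else branch of racgvip appears in the given log file.
vip_gw_message_logged() {
  grep -q 'IsIfAlive: Default gateway is not defined' "$1"
}

# Usage (run from the directory that holds the VIP log):
#   if vip_gw_message_logged ora.s9lp1.vip.log; then
#     echo "else branch reached"
#   else
#     echo "message not found yet"
#   fi
```

Because the message is only written when the VIP check runs with debug logging on, an empty result shortly after the change may simply mean that no check has been logged yet.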

------------------------------------------------------------------------------------

Original blog: http://blog.itpub.net/23732248/
Original author: 應以峰 (frank-ying)
-------------------------------------------------------------------------------------

Diagnosing and fixing a RAC VIP failover
