With only one voting disk, a bug can crash CSSD when it cannot find the voting disk, cutting the DB off from ASM and shutting down the instance

Posted by yeahokay on 2012-08-15

CSSD aborting from thread clssnmvDiskPingMonitorThread

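When only one copy of the voting disk (votedisk) is configured, a bug may be triggered: the CSSD process crashes because it cannot find the voting disk, the DB can then no longer communicate with ASM, and the instance eventually shuts down.

This is an Oracle bug.

What triggers the bug is unknown.

The voting disk is stored in ASM, and only one was created at installation because the external redundancy policy was used. With a single voting disk, the bug can cause communication between the CSSD process and the voting disk to fail. CSSD then shuts down, ASM and the DB can no longer communicate, and the DB goes down.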

Related documents:
ASM Crashed Due to CSS Crash With Voting File Checks [ID 1468826.1]
11.2.0.3 Node Reboot With "CSSD aborting from thread clssnmvDiskPingMonitorThread" if Only one Voting Disk/File is Configured [ID 1466639.1]


Error logs:

ocssd_node1.log
----------------------
2012-08-13 03:17:42.765: [ CSSD][1109031232]clssnmSendingThread: sent 5 status msgs to all nodes
2012-08-13 03:17:47.766: [ CSSD][1109031232]clssnmSendingThread: sending status msg to all nodes
2012-08-13 03:17:47.766: [ CSSD][1109031232]clssnmSendingThread: sent 5 status msgs to all nodes
2012-08-13 03:17:49.983: [ CSSD][1091426624](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
2012-08-13 03:17:49.984: [ CSSD][1091426624]###################################
2012-08-13 03:17:49.984: [ CSSD][1091426624]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
2012-08-13 03:17:49.984: [ CSSD][1091426624]###################################
2012-08-13 03:17:49.984: [ CSSD][1091426624](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2012-08-13 03:17:49.984: [ CSSD][1091426624]
2012-08-13 03:17:49.984: [ CSSD][1091426624]calling call entry argument values in hex
2012-08-13 03:17:49.984: [ CSSD][1091426624]location type point (? means dubious value)
2012-08-13 03:17:49.984: [ CSSD][1091426624]-------------------- -------- -------------------- ----------------------------
2012-08-13 03:17:49.989: [ CSSD][1091426624]clssscExit()+740 call kgdsdst() 000000000 ? 000000000 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 0410D8568 ? 000000001 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 000000001 ? 000000003 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]clssnmvDiskCheck()+ call clssscExit() 7FC21424A8A0 ? 000000002 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]3356 0410D8568 ? 000000001 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 000000001 ? 000000003 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]clssnmvDiskPingMoni call clssnmvDiskCheck() 7FC21424A8A0 ? 7FC2140A3C40 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]torThread()+423 0410DD0B8 ? 000000000 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 000000001 ? 000000003 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]clssscthrdmain()+25 call clssnmvDiskPingMoni 7FC21424A8A0 ? 7FC2140A3C40 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]3 torThread() 0410DD0B8 ? 000000000 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 000000001 ? 000000003 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]start_thread()+221 call clssscthrdmain() 7FC21424A8A0 ? 7FC2140A3C40 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 7FC2140A3C40 ? 000000000 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 000000001 ? 000000003 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]clone()+109 call start_thread() 0410DD940 ? 7FC2140A3C40 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 7FC2140A3C40 ? 000000000 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 000000001 ? 000000003 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624]0000000000000000 call clone() 0410DD940 ? 7FC2140A3C40 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 7FC2140A3C40 ? 000000000 ?
2012-08-13 03:17:49.990: [ CSSD][1091426624] 000000001 ? 000000003 ?

...

ocssd_node2.log
----------------------
2012-08-13 03:17:47.716: [ CSSD][1110337856]clssnmSendingThread: sending status msg to all nodes
2012-08-13 03:17:47.717: [ CSSD][1110337856]clssnmSendingThread: sent 5 status msgs to all nodes
2012-08-13 03:17:50.011: [ CSSD][1113491776]clssnmHandleMeltdownStatus: node node1, number 1, has experienced a failure in thread number 9 and is shutting down
2012-08-13 03:17:52.336: [GIPCHAUP][1102911808] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x7f86184a7b40 [0000000000000845] { gipchaEndpoint : port 'gm2_crs/f483-4e28-b94b-8942', peer 'node1:9a12-5b0a-0d30-d102', srcCid 00000000-00000845, dstCid 00000000-00007cb4, numSend 0, maxSend 100, groupListType 1, hagroup 0x7f86100468a0, usrFlags 0x4000, flags 0x204 }
2012-08-13 03:17:52.336: [ CSSD][1113491776]clssnmHandleManualShut: Manual shutdown of node nodename node1 nodenum 1
2012-08-13 03:17:52.337: [ CSSD][1113491776]clssnmMarkNodeForRemoval: node 1, node1 marked for removal
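The abort signature in ocssd_node1.log can be picked out of a large log with a short script. The following is a minimal sketch in Python, assuming the log lines have exactly the layout shown above; the function name find_voting_disk_aborts and the returned field names are hypothetical and chosen only for illustration.

```python
import re

# Matches the CSSNM00018 abort line as it appears in ocssd_node1.log above.
# Assumption: timestamp, message tag, and wording follow that exact layout.
ABORT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):.*"
    r"\(:CSSNM00018:\)clssnmvDiskCheck: Aborting, "
    r"(?P<avail>\d+) of (?P<configured>\d+) configured voting disks available, "
    r"need (?P<need>\d+)"
)


def find_voting_disk_aborts(lines):
    """Return one dict per CSSNM00018 abort line found in an iterable of log lines."""
    hits = []
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            hits.append({
                "timestamp": match.group("ts"),
                "available": int(match.group("avail")),
                "configured": int(match.group("configured")),
                "needed": int(match.group("need")),
            })
    return hits


if __name__ == "__main__":
    # Two lines copied from the node1 log above: one heartbeat, one abort.
    sample = [
        "2012-08-13 03:17:47.766: [ CSSD][1109031232]clssnmSendingThread: sent 5 status msgs to all nodes",
        "2012-08-13 03:17:49.983: [ CSSD][1091426624](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1",
    ]
    for hit in find_voting_disk_aborts(sample):
        print(hit)
```

To scan a real log, pass an open file object instead of the sample list, since the function only iterates over lines.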

Solutions:

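Option 1: add voting disks so that there are 3 to 5 of them
or
Option 2: apply the patch. Oracle published this bug on its website on 18 June 2012 and announced on 19 July 2012 that the latest 11.2.0.3.2 and 11.2.0.3.3 are also affected, with a fix planned for release 12.1.
Finally, on 3 August a patch of about 350 MB, <13869978>, was released. It is bundled with the latest 11.2.0.3.2 or 11.2.0.3.3, which means the PSU is applied along with it.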
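Why 3 to 5 voting disks help can be seen from the abort message in the log ("0 of 1 configured voting disks available, need 1"). The sketch below assumes the strict-majority rule that Oracle Clusterware documents for voting disks, namely that a node must reach more than half of the configured disks; the function names are hypothetical.

```python
def voting_disks_needed(configured: int) -> int:
    """Disks a node must reach: a strict majority of the configured voting disks."""
    if configured < 1:
        raise ValueError("at least one voting disk must be configured")
    return configured // 2 + 1


def failures_tolerated(configured: int) -> int:
    """How many voting disks can become unreachable before CSSD has to abort."""
    return configured - voting_disks_needed(configured)


if __name__ == "__main__":
    # Columns: configured, needed, tolerated failures.
    for n in (1, 3, 5):
        print(n, voting_disks_needed(n), failures_tolerated(n))
```

With one voting disk, no failure is tolerated, so a single failed check (real or caused by the bug) is enough to abort CSSD. With three or five disks, one or two can be unreachable without triggering the abort.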

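The current environment uses the ASM external redundancy policy.

Note:
ASM has three redundancy modes
1. External redundancy: only one copy of the data is kept, and data safety depends entirely on the redundancy of the external RAID.
2. Normal redundancy: two copies of the data are kept, so usable space is reduced to one half.
3. High redundancy: three copies of the data are kept, so usable space is only one third.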
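The space cost of the three modes can be worked out directly from the number of copies. This is a minimal sketch based only on the copy counts listed above; it ignores ASM metadata overhead and failure-group layout, and the names MIRROR_COPIES and usable_capacity_gb are hypothetical.

```python
# Copies of the data kept in each redundancy mode, as listed above.
MIRROR_COPIES = {"external": 1, "normal": 2, "high": 3}


def usable_capacity_gb(raw_gb: float, redundancy: str) -> float:
    """Approximate usable space: raw disk space divided by the number of copies."""
    copies = MIRROR_COPIES[redundancy.lower()]
    return raw_gb / copies


if __name__ == "__main__":
    for mode in ("external", "normal", "high"):
        print(mode, usable_capacity_gb(300, mode))
```

For 300 GB of raw disk, this gives 300 GB usable with external redundancy, 150 GB with normal redundancy, and 100 GB with high redundancy.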

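In this redundancy mode no voting disk can be added, because the number of voting disks ASM keeps in a disk group is fixed by the disk group's redundancy policy (one for external redundancy). The only remaining way to avoid this bug is therefore to apply the patch.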

From "ITPUB Blog", link: http://blog.itpub.net/786540/viewspace-1059183/. Please credit the source when reposting; otherwise legal action will be pursued.
