Failure Analysis: First RAC Node Evicted from the Cluster

Posted by zr2095 on 2016-07-13

Any DBA who manages RAC has seen one node get rebooted by another. There are many possible causes, but in most cases some resource on a node (CPU, memory, disk, network) runs into trouble and blocks inter-node communication; to protect data integrity and preserve RAC semantics, the troubled node is evicted from the cluster. Since I started maintaining RAC I have run into this many times: sometimes the OS gets rebooted, sometimes the database resource gets restarted, and it happens at night, in the early morning, or in broad daylight. Below is the analysis and resolution of the most recent incident.
The incident occurred at 21:24 on the evening of 2013-12-10, when the first node of a two-node RAC cluster was rebooted.
1. Check the RAC alert log (alert$HOSTNAME.log)
2013-12-10 21:16:05.917
[/u01/11.2.0/grid/bin/oraagent.bin(4278)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF001
13:) {1:43062:2} in /u01/11.2.0/grid/log/rac1/agent/crsd/oraagent_grid/oraagent_grid.log.
----Here a CRS-5818 error appears. RAC checks the state of each resource roughly once a minute, and raises an error when a check fails. The listener resource tends to be checked first, which is why every one of our incidents starts with a listener check failure.
2013-12-10 21:16:05.491
[/u01/11.2.0/grid/bin/oraagent.bin(4278)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRS
AGF00113:) {1:43062:2} in /u01/11.2.0/grid/log/rac1/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-12-10 21:25:47.436
[ohasd(2989)]CRS-2112:The OLR service started on node rac1.--by this point the node had already rebooted and come back up
2013-12-10 21:25:47.622
[ohasd(2989)]CRS-1301:Oracle High Availability Service started on node rac1.
2013-12-10 21:25:47.864
[ohasd(2989)]CRS-8017:location: /var/opt/oracle/lastgasp has 34 reboot advisory log files, 0 were announced and 0 errors occurred
2013-12-10 21:25:52.660
[/u01/11.2.0/grid/bin/oraagent.bin(3270)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/11.2.0/gr
id/log/rac1/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-12-10 21:26:06.605
[gpnpd(3622)]CRS-2328:GPNPD started on node rac1.
2013-12-10 21:26:09.620
[cssd(3640)]CRS-1713:CSSD daemon is started in clustered mode
2013-12-10 21:26:11.437
[ohasd(2989)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2013-12-10 21:27:28.324
[cssd(3640)]CRS-1707:Lease acquisition for node rac1 number 1 completed
2013-12-10 21:27:29.718
[cssd(3640)]CRS-1605:CSSD voting file is online: /dev/rdisk/disk104; details in /u01/11.2.0/grid/log/rac1/cssd/ocssd.log.
2013-12-10 21:27:35.244
[cssd(3640)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2013-12-10 21:27:37.720
[ctssd(3970)]CRS-2403:The Cluster Time Synchronization Service on host rac1 is in observer mode.
[ctssd(3970)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac2.
2013-12-10 21:27:38.091
[ctssd(3970)]CRS-2401:The Cluster Time Synchronization Service started on host rac1.
2013-12-10 21:27:48.864
[/u01/11.2.0/grid/bin/oraagent.bin(3613)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/11.2.0/gr
id/log/rac1/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-12-10 21:27:51.323
[ctssd(3970)]CRS-2409:The clock on host rac1 is not synchronous with the mean cluster time. No action has been taken as the Clus
ter Time Synchronization Service is running in observer mode.
2013-12-10 21:27:52.351
[/u01/11.2.0/grid/bin/oraagent.bin(3613)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/11.2.0/gr
id/log/rac1/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-12-10 21:28:23.621
[crsd(4274)]CRS-1012:The OCR service started on node rac1.
2013-12-10 21:28:23.765
[evmd(3974)]CRS-1401:EVMD started on node rac1.
2013-12-10 21:28:26.161
[crsd(4274)]CRS-1201:CRSD started on node rac1.
2013-12-10 21:28:29.194
[/u01/11.2.0/grid/bin/oraagent.bin(4325)]CRS-5016:Process "/u01/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/11.2.0/gr
id/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/rac1/agent/crsd/oraagent_g
rid/oraagent_grid.log"
2013-12-10 21:28:31.115
[/u01/11.2.0/grid/bin/oraagent.bin(4325)]CRS-5016:Process "/u01/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/11.2.0/grid/bi
n/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/rac1/agent/crsd/oraagent_grid/o
raagent_grid.log"
Check the RAC alert log on the second node:
2013-12-10 21:16:31.893
[ohasd(2898)]CRS-8011:reboot advisory message from host: rac1, component: cssagent, with time stamp: L-2013-12-10-21:16:31.851---here you can see the reboot advisory for node 1
[ohasd(2898)]CRS-8013:reboot advisory message text: Rebooting after limit 28050 exceeded; disk timeout 28050, network timeout 27848,
 last heartbeat from CSSD at epoch seconds 1386681351.547, 40303 milliseconds ago based on invariant clock value of 460929194
2013-12-10 21:16:46.333
[cssd(3579)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.615 seconds---do not jump to the conclusion that this is a network problem: node 1 is rebooting during this window, so of course the network is unreachable.
2013-12-10 21:16:54.413
[cssd(3579)]CRS-1611:Network communication with node rac1 (1) missing for 75% of timeout interval.  Removal of this node from cl
uster in 6.535 seconds
2013-12-10 21:16:58.453
[cssd(3579)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval.  Removal of this node from cl
uster in 2.495 seconds
2013-12-10 21:17:00.955
[cssd(3579)]CRS-1632:Node rac1 is being removed from the cluster in cluster incarnation 209651233
2013-12-10 21:17:00.969
[cssd(3579)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac2 .
2013-12-10 21:17:01.688
[crsd(5250)]CRS-5504:Node down event reported for node 'rac1'.
2013-12-10 21:17:07.875
[crsd(5250)]CRS-2773:Server 'rac1' has been removed from pool 'Generic'.
2013-12-10 21:25:45.235
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN3.lsnr'. Details at (:CR
SAGF00113:) {2:31202:2} in /u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-12-10 21:25:43.653
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00
113:) {2:31202:2} in /u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-12-10 21:25:48.567
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CR
SAGF00113:) {2:31202:55563} in /u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-12-10 21:26:13.939
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5014:Agent "/u01/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/
11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid
/oraagent_grid.log"
2013-12-10 21:26:13.939
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5014:Agent "/u01/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/
11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid
/oraagent_grid.log"
2013-12-10 21:26:16.053
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5014:Agent "/u01/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/
11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid
/oraagent_grid.log"
2013-12-10 21:26:23.751
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5818:Aborted command 'check' for resource 'ora.ons'. Details at (:CRSAGF00113:) {0:5
:2} in /u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-12-10 21:26:24.082
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5014:Agent "/u01/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/
11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00009:)" in "/u01/11.2.0/grid/log/rac2/agent/crsd/oraagent
_grid/oraagent_grid.log"
2013-12-10 21:26:53.757
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN2.lsnr'. Details at (:CR
SAGF00113:) {2:31202:2} in /u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-12-10 21:26:53.772
[/u01/11.2.0/grid/bin/oraagent.bin(10256)]CRS-5014:Agent "/u01/11.2.0/grid/bin/oraagent.bin" timed out starting process "/u01/
11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/u01/11.2.0/grid/log/rac2/agent/crsd/oraagent_grid
/oraagent_grid.log"
2013-12-10 21:27:35.294
[cssd(3579)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .---node 1 has finished rebooting and rejoined the cluster.
2. Check the crsd and ocssd logs
At this point, the crsd, ocssd, and ctssd logs can be examined to narrow down what triggered the reboot of node 1.
For example, node 1's ocssd log contained messages such as "Core file size limit extended". I will not list them all here.

3. Check the OSWatcher logs
Another important diagnostic aid is OSWatcher (OSW).
Because my OSW data is retained for only two days, the logs from the 10th are already gone, but I remember examining the oswvmstat output at the time: just before the reboot, the node had only 3 MB of free memory. Reviewing the earlier crashes showed the same pattern of critically low free memory.
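The kind of scan I did by hand against the oswvmstat archives can be sketched in a few lines. This is a minimal illustration, not OSWatcher tooling: the `zzz ***` marker-line format and the position of the `free` column follow Linux `vmstat` output and are assumptions that vary by platform.

```python
# Minimal sketch: scan OSWatcher vmstat output for dangerously low free memory.
# Assumptions (not from the original article): the snapshot interleaves
# "zzz ***<timestamp>" marker lines with standard vmstat rows, and the
# 4th column ("free") is in kilobytes -- layout differs by platform.

LOW_FREE_KB = 100 * 1024  # flag anything under ~100 MB free

def low_memory_events(lines):
    """Return (timestamp, free_kb) pairs where free memory fell below the threshold."""
    timestamp = None
    events = []
    for line in lines:
        if line.startswith("zzz"):
            timestamp = line.split("***")[-1].strip()
            continue
        fields = line.split()
        # A vmstat data row starts with a number (runnable process count).
        if fields and fields[0].isdigit() and len(fields) > 4:
            free_kb = int(fields[3])
            if free_kb < LOW_FREE_KB:
                events.append((timestamp, free_kb))
    return events

sample = [
    "zzz ***Tue Dec 10 21:15:30 CST 2013",
    " 2  0  51200  812344  9920 120000 0 0 1 2 100 200 5 2 93 0 0",
    "zzz ***Tue Dec 10 21:16:00 CST 2013",
    "12  3 655360    3072   128   4096 0 9 80 90 900 800 60 30 0 10 0",
]
print(low_memory_events(sample))
```

A scan like this over the retained snapshots makes the "memory starved just before each crash" pattern visible without eyeballing every file.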
4. Check the system logs
Examine the system logs (dmesg, /var/log/messages on Linux, syslog.log on HP-UX) to determine whether the reboot was caused by the OS itself; other analysis tools, such as crashinfo on HP-UX, can also identify why a machine restarted. In my case, everything ultimately pointed to Oracle triggering the reboot.
5. Check what the database was doing at the time
From oswtop and oswps you can see which processes were consuming resources (here, specifically memory) just before the reboot. I found many server processes in run state, meaning the database was very busy, and disk utilization had reached 80%; from this one can infer that SQL was running heavily, quite possibly full table scans. If your AWR snapshot interval is short, you can see exactly what was executing at that moment; mine was the default 1 hour, so I could not, and had to keep asking the developers. Since most developers tend to deflect responsibility, I could only take the useful parts of their answers.
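The oswps triage step can likewise be sketched. This is a minimal illustration assuming a simplified snapshot layout of `STATE RSS_KB PID COMMAND` per line; real oswps output follows whatever `ps` flags the platform uses, so the column positions here are entirely an assumption.

```python
# Sketch: from a ps-style snapshot, list Oracle server processes that are
# in run state ("R"), largest resident set first. The "STATE RSS_KB PID
# COMMAND" layout is an assumed simplification of oswps output.

def busy_oracle_procs(ps_lines):
    """Return (rss_kb, command) for running oracle* processes, biggest first."""
    procs = []
    for line in ps_lines:
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        state, rss, _pid, cmd = fields
        if state == "R" and cmd.startswith("oracle"):
            procs.append((int(rss), cmd))
    return sorted(procs, reverse=True)

snapshot = [
    "S 204800 4278 oraagent.bin",
    "R 1572864 9001 oracleorcl1 (LOCAL=NO)",
    "R 2097152 9002 oracleorcl1 (LOCAL=NO)",
    "S 51200 3579 ocssd.bin",
]
for rss, cmd in busy_oracle_procs(snapshot):
    print(rss, cmd)
```

Filtering to run-state `oracle*` processes and sorting by RSS is exactly the "who is eating the memory" question; many large entries at the top means the database, not the OS, is the consumer.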
6. Diagnosis
Because the earlier incidents happened at irregular times, the root cause stayed elusive until I had narrowed it down to memory. This reboot happened at 21:24 on the 10th. After a round of communication, I learned that at 20:50 the application connected to this database had been restarted to deploy a new release. At 9 pm there is hardly any business traffic, since users work during the day, so the obvious question was whether the application runs a batch of automatic tasks at startup, and whether those tasks hammer the disks and exhaust memory. To confirm it, I spent an evening restarting the application and watching what the database executed. Sure enough, at least 8 large SQL statements ran concurrently; these statements only need to run once a night, and could just as well be implemented as a database job executed at the DB level.

I had also been making one mistake all along. The server has 64 GB of physical memory, and to let Oracle use as much of it as possible I had set memory_target to 50 GB; in steady state the SGA sat around 38 GB and the PGA around 2 GB. The 8 large sorting SQL statements drove PGA usage up sharply, physical memory ran out, and the RAC resources started misbehaving in all sorts of ways. I say "all sorts of ways" because a CRS-5818 error does not always lead to an OS reboot; sometimes it shows up as an ASM resource reporting an abnormal state, a database resource restart, or some other oddity.
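The memory arithmetic behind that mistake is worth spelling out. The memory_target, SGA, and PGA figures below are the ones from this incident; the OS overhead and per-sort PGA numbers are assumptions used purely for illustration.

```python
# Back-of-the-envelope headroom check for the 64 GB node.
# From the incident: memory_target = 50 GB, steady SGA ~38 GB, steady PGA ~2 GB.
# Assumed for illustration: ~6 GB for OS/clusterware, ~2 GB of PGA per large sort.

physical_ram = 64          # GB
os_and_other = 6           # GB, assumed: kernel, clusterware, agents
steady_sga   = 38          # GB
steady_pga   = 2           # GB

headroom_steady = physical_ram - os_and_other - (steady_sga + steady_pga)

# 8 large sorting SQL statements fired together by the application restart:
concurrent_sorts = 8
pga_per_sort     = 2       # GB, assumed
peak_pga = steady_pga + concurrent_sorts * pga_per_sort

headroom_peak = physical_ram - os_and_other - (steady_sga + peak_pga)

print(f"steady-state headroom: {headroom_steady} GB")
print(f"peak PGA demand: {peak_pga} GB -> headroom: {headroom_peak} GB")
```

Under the steady-state numbers there is comfortable spare memory, but the concurrent sorts push Oracle's demand toward the full 50 GB target and beyond, and the headroom shrinks to a sliver, consistent with the near-zero free memory seen in oswvmstat.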
7. Resolution
Reduce memory_target to 42 GB, and convert the automatic tasks that ran at application restart into a DB-level job, guaranteed to run once per day rather than many times concurrently.
That resolved the problem; the system has now been running smoothly for 7 days.
8. Lessons learned
I used to see memory usage steady at around 90% and swap at around 60% and assume that was fine: as long as memory use was stable, nothing needed changing. That is not true. On a RAC system this can easily cause cluster failures, because RAC needs enough spare resources for inter-node communication and the like. If the database carries many connections, I recommend keeping memory_target at roughly 50% of physical memory.
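That closing rule of thumb is trivial to state in code; the 50% ratio is the article's own recommendation, and the helper name is mine.

```python
# Rule of thumb from this incident: on a connection-heavy RAC node, cap
# memory_target near half of physical RAM so the OS, clusterware, and
# interconnect always keep headroom.

def recommended_memory_target_gb(physical_ram_gb, ratio=0.5):
    """Suggested memory_target in GB; ratio defaults to the 50% rule of thumb."""
    return physical_ram_gb * ratio

print(recommended_memory_target_gb(64))
```

For the 64 GB node in this article this yields 32 GB, even more conservative than the 42 GB the author settled on; treat it as a starting point to be validated against the workload, not a hard limit.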

From the ITPUB blog: http://blog.itpub.net/30208428/viewspace-2121972/. Please credit the source when reposting.
