A Detailed Analysis of a RAC Instance Eviction and Its Solution
1 Problem Description
Following an email from a colleague, we urgently took on the analysis of an instance eviction on a customer's extended RAC cluster. Based on the logs provided by the customer, combined with the specifics of the environment and the detailed trace files, we analysed the fault and drew conclusions as far as the evidence allowed. Because the key trace files from the time of the failure are missing, the conclusion below is a predictive one, based on the logs that do exist and on the technical characteristics of the environment itself.
2 Key Points in the Fault-Handling Timeline
Time            | Key step in handling the fault
03-17 15:28:00  | Received the email and began detailed analysis of the fault logs
03-17 16:00:00  | Gave a preliminary conclusion based on the available logs
03-17 16:20:00  | Replied by email with the technical details
03-17 17:00:00  | Submitted the detailed technical analysis
3 Detailed Analysis of the Fault Logs
3.1 Node 1 Cluster Alert Log at the Time of the Fault
2017-03-04 01:28:01.282: [ctssd(265396)]CRS-2409:The clock on host ggudb-kl1p4-14 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2017-03-04 01:58:01.915: [ctssd(265396)]CRS-2409:The clock on host ggudb-kl1p4-14 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2017-03-04 02:02:43.306: [/u01/11.2.0/grid/bin/oraagent.bin(267603)]CRS-5011:Check of resource "oltp" failed: details at "(:CLSN00007:)" in "/u01/11.2.0/grid/log/ggudb-kl1p4-14/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2017-03-04 02:03:17.323: [/u01/11.2.0/grid/bin/oraagent.bin(267603)]CRS-5818:Aborted command 'check' for resource 'ora.oltp.mynode1.svc'. Details at (:CRSAGF00113:) {0:9:19} in /u01/11.2.0/grid/log/ggudb-kl1p4-14/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2017-03-04 02:05:16.514: [/u01/11.2.0/grid/bin/oraagent.bin(267603)]CRS-5818:Aborted command 'check' for resource 'ora.oltp.mynode1.svc'. Details at (:CRSAGF00113:) {0:9:19} in /u01/11.2.0/grid/log/ggudb-kl1p4-14/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2017-03-04 02:28:02.558: [ctssd(265396)]CRS-2409:The clock on host ggudb-kl1p4-14 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2017-03-04 02:58:05.193:
From node 1's cluster alert log, at the time of the fault (2017-03-04 02:02:43) the CRS stack received notification that the local database instance and its database service had crashed. From this we can conclude that, during the fault window, the clusterware itself did not crash; only the node 1 database instance was evicted.
3.2 Node 2 Cluster Alert Log at the Time of the Fault
[crsd(243374)]CRS-2772:Server 'ggudb-kl1p4-14' has been assigned to pool 'ora.oltp_mynode1'.
2017-03-04 02:02:47.120: [crsd(243374)]CRS-2765:Resource 'ora.oltp.db' has failed on server 'ggudb-kl1p4-14'.
2017-03-04 02:05:34.367: [crsd(243374)]CRS-2765:Resource 'ora.oltp.mynode1.svc' has failed on server 'ggudb-kl1p4-14'.
2017-03-04 02:05:34.367: [crsd(243374)]CRS-2771:Maximum restart attempts reached for resource 'ora.oltp.mynode1.svc'; will not restart.
2017-03-05 00:12:21.690:
We can see that, at the same moment, node 2 also logged the information that the node 1 instance had crashed.
3.3 Node 1 RDBMS Alert Log at the Time of the Fault
IPC Send timeout detected. Sender: ospid 267775 [oracle@ggudb-kl1p4-14 (PING)]
Receiver: inst 2 binc 429635545 ospid 244165
Sat Mar 04 02:02:34 2017
minact-scn master exiting with err:12751
Sat Mar 04 02:02:41 2017
IPC Send timeout detected. Sender: ospid 267880 [oracle@ggudb-kl1p4-14 (LMD0)]
Receiver: inst 2 binc 429635551 ospid 244173
IPC Send timeout to 2.0 inc 4 for msg type 65521 from opid 12
Sat Mar 04 02:02:42 2017
Suppressed nested communications reconfiguration: instance_number 2
Detected an inconsistent instance membership by instance 2
Sat Mar 04 02:02:42 2017
Received an instance abort message from instance 2
Received an instance abort message from instance 2
Please check instance 2 alert and LMON trace files for detail.
Please check instance 2 alert and LMON trace files for detail.
LMS0 (ospid: 267888): terminating the instance due to error 481
Sat Mar 04 02:02:42 2017
System state dump requested by (instance=1, osid=267888 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/oltp/oltp1/trace/oltp1_diag_267757_20170304020242.trc
Sat Mar 04 02:02:47 2017
ORA-1092 : opitsk aborting process
Sat Mar 04 02:02:47 2017
License high water mark = 438
Instance terminated by LMS0, pid = 267888
USER (ospid: 715453): terminating the instance
Instance terminated by USER, pid = 715453
Sun Mar 05 22:51:42 2017
We can see that, when the fault occurred, node 1 received an instance abort message from node 2 ("Received an instance abort message from instance 2"), which then brought the local instance down.
3.4 Node 2 RDBMS Alert Log at the Time of the Fault
Sat Mar 04 02:00:33 2017
IPC Send timeout detected. Receiver ospid 244324 [
Sat Mar 04 02:00:33 2017
Errors in file /u01/app/oracle/diag/rdbms/oltp/oltp2/trace/oltp2_lck0_244324.trc:
Sat Mar 04 02:02:10 2017
########## Node 2 detects that it can no longer communicate with node 1's LMD process ###############
LMD0 (ospid: 244173) has detected no messaging activity from instance 1
LMD0 (ospid: 244173) issues an IMR to resolve the situation
Please check LMD0 trace file for more detail.
Sat Mar 04 02:02:10 2017
######## Communications to node 1 are reconfigured ################
Communications reconfiguration: instance_number 1
Sat Mar 04 02:02:41 2017
IPC Send timeout detected. Receiver ospid 244173 [
Sat Mar 04 02:02:41 2017
Errors in file /u01/app/oracle/diag/rdbms/oltp/oltp2/trace/oltp2_lmd0_244173.trc:
Sat Mar 04 02:02:42 2017
############## Node 2 initiates the eviction of node 1 ######################
Detected an inconsistent instance membership by instance 2
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Dumping diagnostic data in directory=[cdmp_20170304020242], requested by (instance=1, osid=267888 (LMS0)), summary=[abnormal instance termination].
Reconfiguration started (old inc 4, new inc 8)
List of instances: 2 (myinst: 2)
Global Resource Directory frozen
Node 2's RDBMS log shows in detail that, because the LMD process timed out while communicating, node 2 initiated an eviction of node 1. Put simply, during the fault window node 1's LMD process, for some reason, could no longer talk to node 2, and to protect data consistency the cluster initiated an eviction at the database layer (instance eviction).
3.5 Node 1 Cluster Heartbeat (ocssd) Log
2017-03-04 02:02:37.347: [ CSSD][3417790208]clssnmSendingThread: sent 5 status msgs to all nodes
2017-03-04 02:02:42.349: [ CSSD][3417790208]clssnmSendingThread: sending status msg to all nodes
2017-03-04 02:02:42.349: [ CSSD][3417790208]clssnmSendingThread: sent 5 status msgs to all nodes
2017-03-04 02:02:42.694: [ CSSD][3646260992]clssgmDeadProc: proc 0x7f42d030a6a0
2017-03-04 02:02:42.694: [ CSSD][3646260992]clssgmDestroyProc: cleaning up proc(0x7f42d030a6a0) con(0x47f7) skgpid 267966 ospid 267966 with 0 clients, refcount 0
2017-03-04 02:02:42.694: [ CSSD][3646260992]clssgmDiscEndpcl: gipcDestroy 0x47f7
2017-03-04 02:02:42.703: [ CSSD][3646260992]clssgmDeadProc: proc 0x7f42d02f21f0
2017-03-04 02:02:42.703: [ CSSD][3646260992]clssgmFenceClient: fencing client(0x7f42d02e77f0), member(3) in group(GR+GCR1), death fence(1), SAGE fence(0)
2017-03-04 02:02:42.703: [ CSSD][3646260992]clssgmUnreferenceMember: global grock GR+GCR1 member 3 refcount is 1
2017-03-04 02:02:42.703: [ CSSD][3646260992]clssgmFenceProcessDeath: client (0x7f42d02e77f0) pid 267914 undead
2017-03-04 02:02:42.703: [ CSSD][3646260992]clssgmQueueFenceForCheck: (0x7f42d02f5730) Death check for object type 3, pid 267914
2017-03-04 02:02:42.703: [ CSSD][3646260992]clssgmDiscEndpcl: gipcDestroy 0x471d
2017-03-04 02:02:42.703: [ CSSD][2348807936]clssgmFenceCompletion: (0x7f42d02f5730) process death fence completed for process 267914, object type 3
2017-03-04 02:02:42.703: [ CSSD][3646260992]clssgmDestroyProc: cleaning up proc(0x7f42d02f21f0) con(0x46ee) skgpid 267914 ospid 267914 with 0 clients, refcount 0
2017-03-04 02:02:42.703: [ CSSD][3646260992]clssgmDiscEndpcl: gipcDestroy 0x46ee
2017-03-04 02:02:42.703: [ CSSD][2348807936]clssgmTermMember: Terminating member 3 (0x7f42d02f29e0) in grock GR+GCR1
2017-03-04 02:02:42.703: [ CSSD][2348807936]clssgmQueueGrockEvent: groupName(GR+GCR1) count(4) master(0) event(2), incarn 6, mbrc 4, to member 1, events 0x280, state 0x0
2017-03-04 02:02:42.703: [ CSSD][2348807936]clssgmUnreferenceMember: global grock GR+GCR1 member 3 refcount is 0
2017-03-04 02:02:42.703: [ CSSD][2348807936]clssgmAllocateRPCIndex: allocated rpc 121 (0x7f42d96d2f78)
2017-03-04 02:02:42.703: [ CSSD][2348807936]clssgmRPC: rpc 0x7f42d96d2f78 (RPC#121) tag(79002a) sent to node 2
2017-03-04 02:02:42.704: [ CSSD][3646260992]clssgmDeadProc: proc 0x7f42d039c680
2017-03-04 02:02:42.704: [ CSSD][3646260992]clssgmFenceClient: fencing client(0x7f42d0399e40), member(0) in group(DG_LOCAL_DATA), death fence(1), SAGE fence(0)
2017-03-04 02:02:42.704: [ CSSD][3646260992]clssgmUnreferenceMember: local grock DG_LOCAL_DATA member 0 refcount is 23
2017-03-04 02:02:42.704: [ CSSD][3646260992]clssgmFenceProcessDeath: client (0x7f42d0399e40) pid 267964 undead
2017-03-04 02:02:42.704: [ CSSD][3646260992]clssgmQueueFenceForCheck: (0x7f42d0394610) Death check for object type 3, pid 267964
2017-03-04 02:02:42.704: [ CSSD][3646260992]clssgmFenceClient: fencing client(0x7f42d0391770), member(0) in group(DG_LOCAL_REDO), death fence(1), SAGE fence(0)
2017-03-04 02:02:42.704: [ CSSD][3646260992]clssgmUnreferenceMember: local grock DG_LOCAL_REDO member 0 refcount is 2
2017-03-04 02:02:42.704: [ CSSD][3646260992]clssgmFenceProcessDeath: client (0x7f42d0391770) pid 267964 undead
We can see that the CSSD process had already detected a number of dead client processes; by cross-checking the database alert log, these operating-system PIDs can be mapped to the corresponding Oracle background processes.
From the log analysis above, at the time of the fault the LMSx-related processes had hung for some reason, which caused the communication failures, and node 2 therefore initiated the crash of node 1 at the instance level.
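When a similar problem recurs, it is worth checking at the operating-system level whether the LMS/LMD/LMON processes are actually hung before the eviction happens. A minimal sketch, assuming the node 1 instance name oltp1 seen in the logs above; the sampling interval and repetition count are arbitrary:

# Show state (STAT), CPU usage and kernel wait channel of the RAC lock-manager processes.
# A persistent "D" state (uninterruptible sleep) across several samples usually points to
# I/O or memory pressure; a process that never accumulates CPU may simply not be getting scheduled.
for i in 1 2 3; do
    date
    ps -eo pid,stat,pcpu,wchan:25,cmd | grep -E 'ora_(lmon|lmd|lms)[0-9]*_oltp1' | grep -v grep
    sleep 10
done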
4 Fault Summary
4.1 Root-Cause Analysis
Combining the analysis in chapter 3, the available logs indicate that a hung or timed-out LMS process is the primary reason the database instance was evicted. The main inter-instance communication processes in RAC are LMON, LMD and LMSx. Normally, after a message is sent to another instance the sender expects an acknowledgement from the receiver; if that acknowledgement does not arrive within the specified time (300 seconds by default), the sender assumes the message never reached the receiver and an "IPC Send timeout" is reported. The reasons these processes hang or time out essentially come down to the following (a quick diagnostic sketch follows the list):
- Network problems causing packet loss or abnormal communication.
- Host resource problems (CPU, memory, I/O, etc.) that prevent these processes from being scheduled or leave them unresponsive.
- Oracle bugs.
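As a first differentiator between these causes, it helps to check whether the IPC timeouts were reported on both nodes or only on one. A minimal sketch, assuming the default alert log names under the diag paths shown in the RDBMS logs above:

# Node 1: list every IPC send timeout with its line number for time-window correlation.
grep -n "IPC Send timeout" /u01/app/oracle/diag/rdbms/oltp/oltp1/trace/alert_oltp1.log

# Node 2: the same check. Timeouts reported on both sides usually point to the interconnect
# rather than to a single hung background process.
grep -n "IPC Send timeout" /u01/app/oracle/diag/rdbms/oltp/oltp2/trace/alert_oltp2.log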
As things stand, an Oracle bug is possible but comparatively unlikely. The reason a bug cannot be ruled out is that no PSU has been installed in this Oracle home at all:
There are no Interim patches installed in this Oracle Home.
Rac system comprising of multiple nodes
Local node = ggudb-kl1p4-14
Remote node = xhmdb-kl1p4-24
--------------------------------------------------------------------------------
OPatch succeeded.
As for host resources, in principle the system should not be particularly busy during this time window, unless nightly batch jobs, backups or other scheduled tasks were consuming large amounts of host resources (CPU, memory, I/O) at the time.
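If operating-system statistics for the 02:00 window exist (for example from sysstat), they can settle the host-resource question directly. A minimal sketch, assuming sysstat is installed and kept its data file for March 4th as /var/log/sa/sa04:

# CPU, memory and swap activity between 01:50 and 02:10 on the day of the fault.
sar -u -f /var/log/sa/sa04 -s 01:50:00 -e 02:10:00
sar -r -f /var/log/sa/sa04 -s 01:50:00 -e 02:10:00
sar -W -f /var/log/sa/sa04 -s 01:50:00 -e 02:10:00

# Check whether batch jobs or backups are scheduled around that time.
crontab -l -u oracle
cat /etc/crontab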
Finally the network, which is the most likely cause: during the fault window a large number of "IPC Send timeout detected" messages were logged, which to a certain extent shows that latency on the interconnect heartbeat was already quite severe at the time of the fault.
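To back up the network hypothesis, the usual OS-level counters are worth collecting on both nodes; UDP receive errors and IP reassembly failures are classic symptoms of interconnect packet loss in RAC. A minimal sketch; eth1 and the peer alias oltp-priv2 are placeholders for the real private interface and private address of the other node:

# UDP statistics: a growing "packet receive errors" counter means GCS/GES messages are being dropped.
netstat -su

# IP reassembly failures suggest fragmentation / MTU problems on the private network.
netstat -s | grep -i reassembl

# Errors and drops on the private interconnect interface.
ifconfig eth1

# Latency over the private network with a near-MTU sized packet.
ping -s 1472 -c 20 oltp-priv2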
Of the three possibilities above, the bug is the easiest to confirm and to rule out, whereas resource exhaustion and network latency require further logs. Unfortunately, after checking with the customer we were told that OSWatcher was not deployed during the fault window, so the operating-system traces for that period are missing.
4.2 Recommendations
Based on the analysis in section 4.1, we make the following recommendations:
1. Enable OSWatcher. A brief note on how to set it up:
Enable OSWatcher autostart on both nodes. On Linux, add the following to /etc/rc.local (adjust the script path to suit your environment):
##########################
/bin/sh /osw/autostartosw.sh
#######################################################################
Then edit that script so that it contains the following (again, adjust the paths to suit your environment):
cd /osw/oswbb
/usr/bin/nohup sh startOSWbb.sh 10 48 gzip /osw/archive &
Also enable monitoring of the private interconnect:
# vi private.net    (add the following lines, one per private host alias)
traceroute -r -F hisrac01-priv
traceroute -r -F hisrac02-priv
# ll private.net
-rwx------ 1 root root 94 May 25 16:16 private.net
Make sure the file has execute permission, then run it manually once to test:
# ./private.net
traceroute to hisrac01-priv (10.10.10.31), 30 hops max, 60 byte packets
 1  hisrac01-priv (10.10.10.31)  0.111 ms  0.021 ms  0.016 ms
traceroute to hisrac02-priv (10.10.10.32), 30 hops max, 60 byte packets
 1  hisrac02-priv (10.10.10.32)  0.215 ms  0.200 ms  0.189 ms
To start OSWatcher manually:
nohup sh startOSWbb.sh 10 48 gzip /osw/archive &
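Finally, confirm that OSWatcher is actually running and collecting. A minimal sketch, assuming the /osw/archive directory configured above (the archive sub-directory names are those oswbb normally creates):

# The collector should be visible as a running shell process.
ps -ef | grep -i oswatcher | grep -v grep

# New snapshots should keep appearing in the archive as collection proceeds.
ls -lt /osw/archive/oswvmstat | head
ls -lt /osw/archive/oswprvtnet | head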
2. As this is an extended RAC, create dedicated services so that each application connects only to its own database instance, reducing pressure on the interconnect. If ordinary normal-redundancy storage is used, it is also advisable for each node to read from its local storage, to reduce the load on the dark-fibre links (see the srvctl sketch below).
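A minimal sketch of such a service layout using srvctl; the service names oltp_app1 and oltp_app2 are hypothetical, while the database and instance names are taken from the logs above:

# One service per application, each preferred on its own instance and only
# failing over to the other instance when necessary.
srvctl add service -d oltp -s oltp_app1 -r oltp1 -a oltp2
srvctl add service -d oltp -s oltp_app2 -r oltp2 -a oltp1

# Start the services and verify where they are running.
srvctl start service -d oltp -s oltp_app1
srvctl start service -d oltp -s oltp_app2
srvctl status service -d oltp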
3. At present neither the database nor the clusterware has any PSU installed. For a production environment this is simply not acceptable; we recommend patching to the latest PSU (a brief sketch follows).
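For reference, a minimal sketch of checking and applying a PSU on 11.2; the patch staging directory /tmp/psu is hypothetical, and the exact procedure must follow the README of the PSU actually being applied:

# Confirm what is currently installed (run against both the grid and the RDBMS homes).
$ORACLE_HOME/OPatch/opatch lsinventory

# On 11.2 a GI PSU is normally applied as root with "opatch auto", which patches the
# grid home and the database home on the local node in one pass:
# /u01/11.2.0/grid/OPatch/opatch auto /tmp/psu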
------------------------------------------------------------------------------------
<Copyright reserved. This article may be reproduced, but the source must be cited with a link to the original; otherwise legal liability will be pursued!>
Original blog: http://blog.itpub.net/23732248/
Author: 應以峰 (frank-ying)
-------------------------------------------------------------------------------------
Source: ITPUB blog, http://blog.itpub.net/23732248/viewspace-2155709/