Oracle 19c RAC Split-Brain Problem Analysis
Symptom:
A two-node Oracle 19c RAC runs three instances on each node. The three instances on node 2 suddenly went down one after another, then restarted automatically almost immediately.
Analysis:
1: Check the database alert logs
After the crash, check the alert logs of the three instances on node 2 (chendb2, judb2, chaodb2).
---1 The IPC0 process of instance chendb2 was killed, terminating the instance.
[oracle@cjcdb02 trace]$ vim alert_chendb2.log
2020-12-29T20:07:44.176617+08:00
PMON (ospid: ): terminating the instance due to ORA error
2020-12-29T20:07:44.184140+08:00
Cause - 'Instance is being terminated due to fatal process death (pid: 5, ospid: 20103, IPC0)'
---2 The DBW0 process of instance judb2 was killed, terminating the instance.
[oracle@cjcdb02 trace]$ vim alert_judb2.log
2020-12-29T20:07:43.495906+08:00
PMON (ospid: ): terminating the instance due to ORA error
2020-12-29T20:07:43.506822+08:00
Cause - 'Instance is being terminated due to fatal process death (pid: 44, ospid: 11307, DBW0)'
2020-12-29T20:07:44.649325+08:00
---3 The LGWR process of instance chaodb2 was killed, terminating the instance.
[oracle@cjcdb02 trace]$ vim alert_chaodb2.log
2020-12-29T20:07:43.056011+08:00
PMON (ospid: ): terminating the instance due to ORA error
2020-12-29T20:07:43.059794+08:00
Cause - 'Instance is being terminated due to fatal process death (pid: 78, ospid: 4197, LGWR)'
2020-12-29T20:07:44.132532+08:00
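All three instances died within roughly one second of each other (20:07:43 to 20:07:44), which already points at a cluster-wide cause rather than a per-instance bug. As a rough illustration of that check (the helper below is a hypothetical sketch, not part of any Oracle tooling), the termination events can be pulled out of the alert logs and sorted into a single timeline:

```python
import re

# Match the ISO timestamp lines and the PMON termination messages seen
# in the alert logs above; pair each message with the last timestamp.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)")
TERM_RE = re.compile(r"terminating the instance")

def termination_events(lines):
    """Return sorted (timestamp, message) pairs for termination entries."""
    events, last_ts = [], None
    for line in lines:
        m = TS_RE.match(line)
        if m:
            last_ts = m.group(1)
        elif TERM_RE.search(line) and last_ts:
            events.append((last_ts, line.strip()))
    return sorted(events)

# Feeding it the lines of all three alert logs would print the
# one-second-wide cluster of failures in time order.
sample = [
    "2020-12-29T20:07:44.176617+08:00",
    "PMON (ospid: ): terminating the instance due to ORA error",
]
print(termination_events(sample))
```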
2: Check the Grid Infrastructure alert log
As the log below shows, communication between the nodes failed, a split-brain occurred, and node 2 was forcibly evicted and rebooted.
CRS-1612: Network communication with node cjcdb01 (1) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.240 seconds
The detailed log is as follows:
[grid@cjcdb02 trace]$ pwd
/oracle/grid/diag/crs/cjcdb02/crs/trace
[grid@cjcdb02 trace]$ vim alert.log
......
2020-12-29 19:40:57.560 [OCTSSD(158770)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /oracle/grid/diag/crs/cjcdb02/crs/trace/octssd.trc.
2020-12-29 20:07:36.698 [OCSSD(156435)]CRS-1612: Network communication with node cjcdb01 (1) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.240 seconds
2020-12-29 20:07:40.287 [OCSSD(156435)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc.
2020-12-29 20:07:40.288 [OCSSD(156435)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc
2020-12-29 20:07:40.288 [OCSSD(156435)]CRS-1608: This node was evicted by node 1, cjcdb01; details at (:CSSNM00005:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc.
2020-12-29 20:07:40.289 [OCSSD(156435)]CRS-1608: This node was evicted by node 1, cjcdb01; details at (:CSSNM00005:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc.
2020-12-29 20:07:40.326 [OCSSD(156435)]CRS-1652: Starting clean up of CRSD resources.
2020-12-29 20:07:40.469 [OCSSD(156435)]CRS-1608: This node was evicted by node 1, cjcdb01; details at (:CSSNM00005:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc.
2020-12-29 20:07:42.128 [OCSSD(156435)]CRS-1654: Clean up of CRSD resources finished successfully.
2020-12-29 20:07:42.133 [OCSSD(156435)]CRS-1655: CSSD on node cjcdb02 detected a problem and started to shutdown.
2020-12-29 20:07:42.160 [ORAROOTAGENT(159654)]CRS-5822: Agent '/oracle/crs/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:2:87} in /oracle/grid/diag/crs/cjcdb02/crs/trace/crsd_orarootagent_root.trc.
2020-12-29 20:07:42.507 [CRSD(77885)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 77885
2020-12-29 20:07:44.922 [CSSDMONITOR(88318)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 88318
2020-12-29 20:07:46.783 [OCSSD(156435)]CRS-8503: Oracle Clusterware process OCSSD with operating system process ID 156435 experienced fatal signal or exception code 6.
2020-12-29T20:07:46.788757+08:00
Errors in file /oracle/grid/diag/crs/cjcdb02/crs/trace/ocssd.trc (incident=1):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/grid/diag/crs/cjcdb02/crs/incident/incdir_1/ocssd_i1.trc
2020-12-29 20:07:47.827 [CRSD(77885)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-23: Error in cluster services layer Cluster services error [ [3]]. Details at (:CRSD00111:) in /oracle/grid/diag/crs/cjcdb02/crs/trace/crsd.trc.
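The CRS-1612 message is worth a second look: it fires once 50% of the CSS network timeout ("misscount", 30 s by default; the configured value can be checked with `crsctl get css misscount`) has elapsed, and here it reports 14.240 s remaining. A quick back-of-the-envelope check, as a sketch, is consistent with the default timeout:

```python
# CRS-1612 fired at the 50% mark of the CSS timeout with 14.240 s left
# before eviction, so the full timeout interval is roughly:
remaining_s = 14.240        # from the CRS-1612 message
elapsed_fraction = 0.50     # CRS-1612 is logged at the 50% threshold
estimated_misscount = remaining_s / (1 - elapsed_fraction)
print(round(estimated_misscount, 2))  # 28.48
# Slightly under 30 s, because the message is logged a moment after the
# 50% threshold is crossed -- consistent with the 30 s default misscount.
```

So the interconnect had already been silent for about 15 s when the warning appeared, and node 2 was evicted roughly 30 s after the last successful heartbeat.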
3: Check the OS messages log
The messages logs show:
When node 2 went down, node 2's VIP interface was deleted ("Deleting interface") and node 1 took over node 2's VIP.
After node 2 came back up, the VIP failed back from node 1 to node 2.
[root@cjcdb02 ~]# vim /var/log/messages
Dec 29 20:07:41 cjcdb02 ntpd[76526]: Deleting interface #15 bond0:1, 192.100.100.54#123, interface stats: received=0, sent=0, dropped=0, active_time=379716 secs
Dec 29 20:07:47 cjcdb02 abrt-hook-ccpp: Process 156435 (ocssd.bin) of user 501 killed by SIGABRT - dumping core
Dec 29 20:07:47 cjcdb02 abrt-hook-ccpp: Failed to create core_backtrace: waitpid failed: No child processes
Dec 29 20:07:47 cjcdb02 abrt-server: Executable '/oracle/crs/bin/ocssd.bin' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Dec 29 20:07:47 cjcdb02 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2020-12-29-20:07:47-156435' exited with 1
Dec 29 20:07:47 cjcdb02 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2020-12-29-20:07:47-156435'
[root@cjcdb01 ~]# vim /var/log/messages
Dec 29 20:07:47 sy-fkjx-db01 ntpd[76447]: Listen normally on 14 bond0:3 192.100.100.54 UDP 123
Dec 29 20:08:01 sy-fkjx-db01 systemd: Started Session 3644 of user root.
Dec 29 20:08:02 sy-fkjx-db01 su: (to oracle) root on none
Dec 29 20:08:13 sy-fkjx-db01 ntpd[76447]: Deleting interface #14 bond0:3, 192.100.100.54#123, interface stats: received=0, sent=0, dropped=0, active_time=26 secs
Dec 29 20:09:19 sy-fkjx-db01 ntpd[76447]: Listen normally on 15 bond0:3 192.100.100.54 UDP 123
Dec 29 20:09:43 sy-fkjx-db01 ntpd[76447]: Deleting interface #15 bond0:3, 192.100.100.54#123, interface stats: received=0, sent=0, dropped=0, active_time=24 secs
4: Problem analysis
The logs show that the two nodes lost communication with each other, which caused node 2 to be evicted and restart automatically. But what caused the communication failure itself?
Such failures are usually caused by some resource hitting a bottleneck, such as the network, memory, CPU, or I/O.
Check the corresponding OSWatcher (OSW) logs:
During the problem window, memory, CPU, and I/O were all nearly idle; there was no significant load.
The private-interconnect samples for the problem window, however, show roughly 30 seconds of lost connectivity:
[root@cjcdb02 oswprvtnet]# pwd
/oracle/grid/tfa/repository/suptools/cjcdb02/oswbb/grid/archive/oswprvtnet
[root@cjcdb02 oswprvtnet]# ll -rth
total 2.1M
-rw-r--r-- 1 grid oinstall 38K Dec 28 19:59 cjcdb02_prvtnet_20.12.28.1900.dat
......
-rw-r--r-- 1 grid oinstall 38K Dec 29 19:59 cjcdb02_prvtnet_20.12.29.1900.dat
-rw-r--r-- 1 grid oinstall 38K Dec 29 20:59 cjcdb02_prvtnet_20.12.29.2000.dat
-rw-r--r-- 1 grid oinstall 38K Dec 29 21:59 cjcdb02_prvtnet_20.12.29.2100.dat
-rw-r--r-- 1 grid oinstall 38K Dec 29 22:59 cjcdb02_prvtnet_20.12.29.2200.dat
-rw-r--r-- 1 grid oinstall 38K Dec 29 23:59 cjcdb02_prvtnet_20.12.29.2300.dat
-rw-r--r-- 1 grid oinstall 38K Dec 30 00:59 cjcdb02_prvtnet_20.12.30.0000.dat
-rw-r--r-- 1 grid oinstall 38K Dec 30 01:59 cjcdb02_prvtnet_20.12.30.0100.dat
......
-rw-r--r-- 1 grid oinstall 30K Dec 30 19:46 cjcdb02_prvtnet_20.12.30.1900.dat
[root@cjcdb02 oswprvtnet]# vim cjcdb02_prvtnet_20.12.29.2000.dat
......
traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets
 1  cjcdb01-priv (10.116.59.113)  0.053 ms  0.039 ms  0.033 ms
traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets
 1  cjcdb02-priv (10.116.59.114)  0.016 ms  0.004 ms  0.003 ms
zzz ***Tue Dec 29 20:06:45 CST 2020
traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets
 1  cjcdb01-priv (10.116.59.113)  0.066 ms  0.054 ms  0.041 ms
traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets
 1  cjcdb02-priv (10.116.59.114)  0.020 ms  0.005 ms  0.004 ms
zzz ***Tue Dec 29 20:07:15 CST 2020
traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * cjcdb01-priv (10.116.59.113)  0.039 ms *
traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets
 1  cjcdb02-priv (10.116.59.114)  0.015 ms  0.004 ms  0.003 ms
zzz ***Tue Dec 29 20:07:45 CST 2020
traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets
 1  cjcdb01-priv (10.116.59.113)  0.057 ms  0.041 ms  0.027 ms
traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets
 1  cjcdb02-priv (10.116.59.114)  0.015 ms  0.004 ms  0.003 ms
zzz ***Tue Dec 29 20:08:15 CST 2020
traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets
 1  cjcdb01-priv (10.116.59.113)  0.062 ms  0.047 ms  0.042 ms
traceroute to 10.116.59.114 (10.116.59.114), 30 hops max, 60 byte packets
 1  cjcdb02-priv (10.116.59.114)  0.016 ms  0.004 ms  0.004 ms
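In the sample stamped 20:07:15, the traceroute from node 2 to node 1's private IP got no answer at hop 1 (the peer only appeared at hop 22), while the loopback traceroute to node 2's own private IP was fine: the interconnect, not the local stack, was dead for that roughly 30-second polling interval. Scanning a whole oswprvtnet archive for such intervals by eye is tedious; a minimal sketch of an automated scan (a hypothetical helper, assuming the `zzz ***<timestamp>` sample markers and traceroute layout shown above) could look like this:

```python
def failed_intervals(lines, peer_ip="10.116.59.113"):
    """Return timestamps of samples where hop 1 to the peer got no reply."""
    failures, ts, expect_hop1 = [], None, False
    for line in lines:
        line = line.strip()
        if line.startswith("zzz ***"):
            # Sample marker: remember the timestamp for the traceroutes below.
            ts = line.split("***", 1)[1]
        elif line.startswith(f"traceroute to {peer_ip}"):
            expect_hop1 = True
        elif expect_hop1 and line.startswith("1 "):
            # Hop 1 should resolve directly to the peer's private IP;
            # "* * *" here means the interconnect did not answer.
            if peer_ip not in line and ts:
                failures.append(ts)
            expect_hop1 = False
    return failures

sample = [
    "zzz ***Tue Dec 29 20:07:15 CST 2020",
    "traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets",
    " 1  * * *",
    " 2  * * *",
    "zzz ***Tue Dec 29 20:07:45 CST 2020",
    "traceroute to 10.116.59.113 (10.116.59.113), 30 hops max, 60 byte packets",
    " 1  cjcdb01-priv (10.116.59.113)  0.057 ms  0.041 ms  0.027 ms",
]
print(failed_intervals(sample))  # ['Tue Dec 29 20:07:15 CST 2020']
```

Run over all the .dat files, this pinpoints every sample interval in which the interconnect was unreachable, which is exactly the ~30 s gap preceding the 20:07:40 eviction.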
Check the ocssd trace files:
[grid@cjcdb02 trace]$ pwd
/oracle/grid/diag/crs/cjcdb02/crs/trace
[grid@cjcdb02 trace]$ ls -lhrt ocssd*.trc
-rw-rw---- 1 grid oinstall 51M Dec 20 19:04 ocssd_16.trc
-rw-rw---- 1 grid oinstall 51M Dec 21 23:52 ocssd_17.trc
-rw-rw---- 1 grid oinstall 51M Dec 23 04:25 ocssd_18.trc
-rw-rw---- 1 grid oinstall 51M Dec 24 09:14 ocssd_19.trc
-rw-rw---- 1 grid oinstall 51M Dec 25 10:40 ocssd_20.trc
-rw-rw---- 1 grid oinstall 51M Dec 26 14:54 ocssd_21.trc
-rw-rw---- 1 grid oinstall 51M Dec 27 19:15 ocssd_22.trc
-rw-rw---- 1 grid oinstall 51M Dec 28 23:47 ocssd_23.trc
-rw-rw---- 1 grid oinstall 51M Dec 29 23:50 ocssd_24.trc
-rw-rw---- 1 grid oinstall 36M Dec 30 19:44 ocssd.trc
[grid@cjcdb02 trace]$ vim ocssd_24.trc
......
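Each ocssd trace file here is around 50 MB, so before reading it is worth narrowing to the eviction window. A minimal sketch of such a filter (a hypothetical helper, assuming the `YYYY-MM-DD HH:MM:SS.mmm` timestamp prefix used in the CRS traces shown earlier):

```python
from datetime import datetime

TS_FMT = "%Y-%m-%d %H:%M:%S.%f"

def lines_in_window(lines, start, end):
    """Keep only trace lines whose leading timestamp falls in [start, end]."""
    out = []
    for line in lines:
        try:
            # First 23 chars: "YYYY-MM-DD HH:MM:SS.mmm"
            ts = datetime.strptime(line[:23], TS_FMT)
        except ValueError:
            continue  # continuation lines without a timestamp prefix
        if start <= ts <= end:
            out.append(line)
    return out

start = datetime(2020, 12, 29, 20, 7, 30)
end = datetime(2020, 12, 29, 20, 7, 50)
sample = [
    "2020-12-29 20:07:40.288 [OCSSD(156435)]CRS-1608: This node was evicted by node 1, cjcdb01",
    "2020-12-29 19:40:57.560 [OCTSSD(158770)]CRS-2412: clock drift warning",
]
print(lines_in_window(sample, start, end))  # keeps only the CRS-1608 line
```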
5: Conclusion
The initial suspicion is that a problem on the cluster's private interconnect caused node 2 to be evicted. However, the network engineers checked the setup (both private NICs connect to the same switch) and found nothing wrong. Recording the findings for now; this will be updated if there is further progress.
chenjuchao 2020-12-31 21:30
From the ITPUB blog, link: http://blog.itpub.net/29785807/viewspace-2747831/. Please cite the source when republishing; otherwise legal liability will be pursued.