Oracle RAC命中ORA-7445只能開啟一個節點故障案例分析
一、故障描述
某次,使用者反饋HIS資料庫在完成容災切換之後,在RACDG端只能正常開啟一個節點例項,此時如果手動啟動另一個節點例項,該例項能夠正常開啟,但是會導致之前開啟的例項命中ORA-7445,進而引起例項異常當機。由於RACDG端只能讀寫開啟一個例項,業務雖然可以正常接管,但考慮到業務的穩定性、連續性和效能,建議使用者儘快回切。隨後,HIS資料庫完成回切。隨後,針對RACDG端兩個例項讀寫開啟為什麼會命中ORA-7445,進行遠端根因排查,詳細分析過程如下。
二、問題時間點關鍵描述
(1)25號17點:容災切換後,RACDG端兩節點例項讀寫開啟後,其中一節點命中ORA-7445進而導致例項異常當機。
(2)25號18點:HIS資料庫完成回切。
三、問題詳細診斷過程
(1)2節點orcl2例項告警日誌分析
Fri Mar 25 19:46:01 2022 Starting ORACLE instance (normal) ************************ Large Pages Information ******************* Per process system memlock (soft) limit = UNLIMITED Total Shared Global Region in Large Pages = 133 GB (100%) Large Pages used by this instance: 68097 (133 GB) Large Pages unused system wide = 28347 (55 GB) Large Pages configured system wide = 96444 (188 GB) Large Page size = 2048 KB ******************************************************************** LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Initial number of CPU is 64 Number of processor cores in the system is 64 Number of processor sockets in the system is 4 Private Interface 'eth1:1' configured from GPnP for use as a private interconnect. [name='eth1:1', type=1, ip=169.254.24.180, mac=00-50-56-aa-ce-92, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62] Public Interface 'eth0' configured from GPnP for use as a public interface. [name='eth0', type=1, ip=192.200.54.32, mac=00-50-56-aa-fa-17, net=192.200.54.0/24, mask=255.255.255.0, use=public/1] Picked latch-free SCN scheme 3 Autotune of undo retention is turned on. LICENSE_MAX_USERS = 0 SYS auditing is disabled NUMA system with 2 nodes detected Starting up: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP, Data Mining and Real Application Testing options. ... Deprecated system parameters with specified values: standby_archive_dest End of deprecated system parameter listing Cluster communication is configured to use the following interface(s) for this instance 169.254.24.180 cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2 ... starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... starting up 1 shared server(s) ... NOTE: initiating MARK startup Starting background process MARK Fri Mar 25 19:46:11 2022 MARK started with pid=37, OS id=58633 NOTE: MARK has subscribed lmon registered with NM - instance number 2 (internal mem no 1) Reconfiguration started (old inc 0, new inc 16) List of instances: 1 2 (myinst: 2) Global Resource Directory frozen * allocate domain 0, invalid = TRUE Communication channels reestablished * domain 0 valid = 1 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Fri Mar 25 19:46:13 2022 LCK0 started with pid=39, OS id=58653 ... Fri Mar 25 19:46:21 2022 SMON: enabling tx recovery Database Characterset is ZHS16GBK Starting background process GTX0 Fri Mar 25 19:46:21 2022 GTX0 started with pid=49, OS id=58745 Starting background process RCBG Fri Mar 25 19:46:21 2022 RCBG started with pid=50, OS id=58747 replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Fri Mar 25 19:46:22 2022 QMNC started with pid=51, OS id=58751 Completed: ALTER DATABASE OPEN /* db agent *//* {0:7:9} */ Fri Mar 25 19:46:36 2022 Dumping diagnostic data in directory=[cdmp_20220325194624], requested by (instance=1, osid=40273 (LMS3)), summary=[incident=1648121]. Dumping diagnostic data in directory=[cdmp_20220325194625], requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination]. Fri Mar 25 19:46:38 2022 Reconfiguration started (old inc 16, new inc 18) List of instances: 2 (myinst: 2) Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Fri Mar 25 19:46:38 2022 Fri Mar 25 19:46:38 2022 Fri Mar 25 19:46:38 2022 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Fri Mar 25 19:46:38 2022 LMS 3: 1 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete
在後續驗證中,從以上alert_orcl2.log日誌不難發現,在1節點例項正常執行的前提下,在19點46分01秒,手動啟動2節點例項,在啟動過程中會第一次觸發重新配置,將2節點例項加入叢集。在19點46分22秒,2節點例項正常開啟,但隨後不久在19點46分36秒,提示1節點例項異常當機,並轉儲DIAG故障診斷TRACE資訊。之後,在19點46分38秒,第二次觸發重新配置,將1節點移出叢集。
(2)1節點orcl1例項告警日誌分析
Fri Mar 25 19:46:23 2022 Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1] Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lms3_40273.trc (incident=1648121): ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] [] Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_1648121/orcl1_lms3_40273_i1648121.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Fri Mar 25 19:46:24 2022 Dumping diagnostic data in directory=[cdmp_20220325194624], requested by (instance=1, osid=40273 (LMS3)), summary=[incident=1648121]. Fri Mar 25 19:46:24 2022 Sweep [inc][1648121]: completed Sweep [inc2][1648121]: completed Fri Mar 25 19:46:25 2022 PMON (ospid: 40198): terminating the instance due to error 484 System state dump requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_40249_20220325194625.trc Dumping diagnostic data in directory=[cdmp_20220325194625], requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination]. Instance terminated by PMON, pid = 40198
從以上alert_orcl1.log日誌不難發現,在19點46分23秒,1節點例項很快就命中ORA-7445報錯;在19點46分25秒,1節點例項被PMON程式殺掉造成例項當機。透過以上3.1日誌分析,我們知道2節點例項在19點46分22秒完成正常開啟,但緊接著1節點就立馬當機。結合該現象與處理ORA-7445的經驗,懷疑很可能是命中Oracle軟體BUG。
(3)1節點orcl1例項lms3程式trace日誌分析
*** 2022-03-25 19:46:23.680 SKGXP:[7f248be356c8.0]{0}: SKGXP Assertion msize <= SKGXP_ATTR_DGRAM_MTU_MAX(ctx) Failed at location skgxpseginit2 line_num: 31934 *** 2022-03-25 19:46:23.681 Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1] Incident 1648121 created, dump file: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_1648121/orcl1_lms3_40273_i1648121.trc ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] [] ssexhd: crashing the process... Background_Core_Dump = partial ksdbgcra: writing core file to directory '/u01/app/oracle/diag/rdbms/orcl/orcl1/cdump'
從以上orcl1_lms3_40273.trc日誌不難發現,lms3程式在執行過程中命中ORA-7445報錯,其作用是用於叢集之間資源管理和控制。該程式在命中ORA-7445報錯之前,已經建立相應的事件(1648121)跟蹤檔案。因此,需要進一步分析orcl1_lms3_40273_i1648121.trc檔案。
(4)1節點orcl1例項lms3程式incident日誌分析
*** 2022-03-25 19:46:23.695 *** SESSION ID:(3527.1) 2022-03-25 19:46:23.695 *** CLIENT ID:() 2022-03-25 19:46:23.695 *** SERVICE NAME:(SYS$BACKGROUND) 2022-03-25 19:46:23.695 *** MODULE NAME:() 2022-03-25 19:46:23.695 *** ACTION NAME:() 2022-03-25 19:46:23.695 Dump continued from file: /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lms3_40273.trc ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] [] ========= Dump for incident 1648121 (ORA 7445 [gsignal()+53]) ======== ----- Beginning of Customized Incident Dump(s) ----- Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1] Registers: %rax: 0x0000000000000000 %rbx: 0x00007f2487d41000 %rcx: 0xffffffffffffffff %rdx: 0x0000000000000006 %rdi: 0x0000000000009d51 %rsi: 0x0000000000009d51 %rsp: 0x00007ffdcda48b78 %rbp: 0x00007f248d9a5bf0 %r8: 0x00000000ffffffff %r9: 0xfeff092d63646b68 %r10: 0x0000000000000008 %r11: 0x0000000000000202 %r12: 0x00007f248d9a74ac %r13: 0x00007f248d9a5be0 %r14: 0x00007f248d9ba68c %r15: 0x0000000000007cbe %rip: 0x00000039e5432495 %efl: 0x0000000000000202 gsignal()+40 (0x39e5432488) movsxdq %esi,%rsi gsignal()+43 (0x39e543248b) movsxdq %eax,%rdi gsignal()+46 (0x39e543248e) mov $0xea,%eax gsignal()+51 (0x39e5432493) syscall > gsignal()+53 (0x39e5432495) cmp $0xfffff000,%rax gsignal()+59 (0x39e543249b) ja 0x39e54324af gsignal()+61 (0x39e543249d) (repe) ret gsignal()+63 (0x39e543249f) nop gsignal()+64 (0x39e54324a0) test %eax,%eax
從以上orcl1_lms3_40273_i1648121.trc日誌不難發現,有很多ORA-7445報錯相關的轉儲資訊,考慮到7445是Oracle內部的錯誤程式碼,這類報錯命中BUG的機率較大,因此在Oracle官方MOS平臺上進行匹配,發現Bug 25722720 - RAC LMS getting ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] (Doc ID 25722720.8),與我們的報錯現象高度類似,具體匹配情況如下:
從目前日誌分析結果來看,RACDG兩節點例項開啟後,其中一個例項立馬命中ORA-7445導致例項當機,雖然懷疑跟Bug 25722720相關,但是官方針對該BUG只發布了12.2.0.1版本的補丁,因此需要進一步定位。
(5)備節點例項模擬測試復現ORA-7445
由於切換問題,此時備節點已經強制開啟,後續測試均在備節點上進行。
測試步驟如下:
1、啟動備庫一節點,此時資料庫正常
2、啟動備庫二節點,此時情況復現,例項節點當機,且後臺日誌出現相同ORA-7445報錯
3、查詢切換過程中出現的等待事件,結果如下:
SQL> select instance_number,event 2 from DBA_HIST_ACTIVE_SESS_HISTORY 3 where to_char(sample_time, 'yyyy-mm-dd hh24:mi') >= '2022-03-28 22:00' and 4 to_char(sample_time, 'yyyy-mm-dd hh24:mi') <= '2022-03-28 23:00' 5 and event in('enq: US - contention','row cache lock');
發現存在大量 row cache lock、enq: US – contention相關等待,此時想到之前曾經在HIS庫上出現過由於回滾段的爭用導致業務卡頓問題,曾對資料庫做過隱含引數層面調整。
--高併發環境下,設定以下引數儘可能多的保留online回滾段數量 ALTER SYSTEM SET "_rollback_segment_count"=12000 sid='*'; --關閉undo retention自動調整功能 ALTER SYSTEM SET "_undo_autotune" = false sid='*'; --設定_highthreshold_undoretention控制undo retention alter system set "_highthreshold_undoretention"=21600 sid='*';
排查主備庫兩端引數檔案設定有差異,發現備庫端未設定相關隱含引數。一般我們建議主備兩端引數需保持一致,故思路為備庫端設定隱藏引數排查引數影響。
重新進行如下步驟測試:
1、啟動備庫一節點
2、啟動備庫二節點,檢視資料庫狀態
3、查詢切換過程中出現的等待事件。發現,設定該引數後,備庫能正常開啟,且未出現ORA-7445問題。重複測試基本能匹配到Oracle MOS官方文件IF: Undo Wait Event - Enq: US - contention(Doc ID 1951680.1)。
比對主備兩端補丁號資訊,主備兩端補丁版本不同。
四、解決辦法和建議
根本原因還在Oracle軟體自身軟體BUG引起,命中ORA-7445報錯。當前可透過設定下列隱含引數,提前分配undo段,規避ORA-7445報錯。搭建DG環境時,強烈建議主備兩端環境要保持一致,包括資料庫軟體版本和補丁(PSU)版本以及引數(含隱含引數)。
ALTER SYSTEM SET "_rollback_segment_count"=12000 sid='*';
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/30310891/viewspace-2927562/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- RAC資料庫只能啟動一個節點的故障資料庫
- Oracle 19c rac安裝,只能啟動一個節點的ASMOracleASM
- RAC第一個節點被剔除叢集故障分析
- oracle11GR2 RAC節點crash故障分析Oracle
- 3節點RAC資料庫夯故障分析資料庫
- 關於Oracle 11G RAC雙節點之間存在防火牆導致只能一個節點執行Oracle防火牆
- Oracle RAC 一個節點不能自動啟動 怪問題Oracle
- oracle 10g rac,刪除故障節點並新增新節點Oracle 10g
- oracle rac 其中第一個節點監聽偶爾中斷處理案例Oracle
- Oracle RAC常見啟動失敗故障分析Oracle
- RAC一個節點記憶體故障當機,無法訪問記憶體
- Oracle優化案例-新增RAC節點(二十九)Oracle優化
- ORACLE 10G增加一個節點rac3Oracle 10g
- 【kingsql分享】RAC節點故障修復一例SQL
- oracle兩節點RAC,由於gipc導致某節點crs無法啟動問題分析Oracle
- Oracle RAC新增節點Oracle
- Oracle RAC 新增節點Oracle
- Oracle RAC Database 11.1.0.6監聽故障案例OracleDatabase
- Oracle Rac 刪除節點Oracle
- oracle 11gR2 rac 兩節點有一個節點down掉問題處理Oracle
- Oracle RAC啟動失敗(DNS故障)OracleDNS
- RAC一個節點恢復另一個節點在帶庫上的備份
- Oracle10g RAC 單獨關閉一個節點(維護)Oracle
- [網摘] Oracle RAC新增節點Oracle
- 檢視oracle rac的節點Oracle
- RAC一個節點OHASD啟動不了waiting for init.ohasd to be startedAI
- 【原創】Oracle RAC故障分析與處理Oracle
- oracle11g RAC新增節點Oracle
- Oracle10g RAC 加節點Oracle
- 故障分析 | MySQL 從機故障重啟後主從同步報錯案例分析MySql主從同步
- 【RAC】Oracle10g RAC 節點重配的方式Oracle
- oracle rac_cssd程式故障重啟相關OracleCSS
- RAC一個節點錯誤CRS-4047: No Oracle Clusterware components configuredOracle
- Oracle 9.2.0.4 RAC + Linux AS3.0 如何動態加入一個新的節點OracleLinuxS3
- oracle 11g rac新增節點前之清除節點資訊Oracle
- 刪除oracle10g rac(rhel4)節點_節點Oracle
- ORACLE RAC環境下節點自動重啟問題總結Oracle
- 記一次oracle 19c RAC叢集重啟單節點DB啟動異常(二)Oracle