一、故障描述

某次，使用者反饋HIS資料庫在完成容災切換之後，在RACDG端只能正常開啟一個節點例項，此時如果手動啟動另一個節點例項，該例項能夠正常開啟，但是會導致之前開啟的例項命中ORA-7445，進而引起例項異常當機。由於RACDG端只能讀寫開啟一個例項，業務雖然可以正常接管，但考慮到業務的穩定性、連續性和效能，建議使用者儘快回切。隨後，HIS資料庫完成回切。隨後，針對RACDG端兩個例項讀寫開啟為什麼會命中ORA-7445，進行遠端根因排查，詳細分析過程如下。

二、問題時間點關鍵描述

（1）25號17點：容災切換後，RACDG端兩節點例項讀寫開啟後，其中一節點命中ORA-7445進而導致例項異常當機。

（2）25號18點：HIS資料庫完成回切。

三、問題詳細診斷過程

（1）2節點orcl2例項告警日誌分析

Fri Mar 25 19:46:01 2022
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
Total Shared Global Region in Large Pages = 133 GB (100%)
Large Pages used by this instance: 68097 (133 GB)
Large Pages unused system wide = 28347 (55 GB)
Large Pages configured system wide = 96444 (188 GB)
Large Page size = 2048 KB
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 64
Number of processor cores in the system is 64
Number of processor sockets in the system is 4
Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
  [name='eth1:1', type=1, ip=169.254.24.180, mac=00-50-56-aa-ce-92, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'eth0' configured from GPnP for use as a public interface.
  [name='eth0', type=1, ip=192.200.54.32, mac=00-50-56-aa-fa-17, net=192.200.54.0/24, mask=255.255.255.0, use=public/1]
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NUMA system with 2 nodes detected
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
...
Deprecated system parameters with specified values:
  standby_archive_dest     
End of deprecated system parameter listing
Cluster communication is configured to use the following interface(s) for this instance
  169.254.24.180
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
...
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
starting up 1 shared server(s) ...
NOTE: initiating MARK startup
Starting background process MARK
Fri Mar 25 19:46:11 2022
MARK started with pid=37, OS id=58633
NOTE: MARK has subscribed
lmon registered with NM - instance number 2 (internal mem no 1)
Reconfiguration started (old inc 0, new inc 16)
List of instances:
 1 2 (myinst: 2) 
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 1
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Fri Mar 25 19:46:13 2022
LCK0 started with pid=39, OS id=58653
...
Fri Mar 25 19:46:21 2022
SMON: enabling tx recovery
Database Characterset is ZHS16GBK
Starting background process GTX0
Fri Mar 25 19:46:21 2022
GTX0 started with pid=49, OS id=58745
Starting background process RCBG
Fri Mar 25 19:46:21 2022
RCBG started with pid=50, OS id=58747
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Fri Mar 25 19:46:22 2022
QMNC started with pid=51, OS id=58751
Completed: ALTER DATABASE OPEN /* db agent *//* {0:7:9} */
Fri Mar 25 19:46:36 2022
Dumping diagnostic data in directory=[cdmp_20220325194624], requested by (instance=1, osid=40273 (LMS3)), summary=[incident=1648121].
Dumping diagnostic data in directory=[cdmp_20220325194625], requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination].
Fri Mar 25 19:46:38 2022
Reconfiguration started (old inc 16, new inc 18)
List of instances:
 2 (myinst: 2) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Mar 25 19:46:38 2022
Fri Mar 25 19:46:38 2022
Fri Mar 25 19:46:38 2022
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Mar 25 19:46:38 2022
 LMS 3: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete

在後續驗證中，從以上alert_orcl2.log日誌不難發現，在1節點例項正常執行的前提下，在19點46分01秒，手動啟動2節點例項，在啟動過程中會第一次觸發重新配置，將2節點例項加入叢集。在19點46分22秒，2節點例項正常開啟，但隨後不久在19點46分36秒，提示1節點例項異常當機，並轉儲DIAG故障診斷TRACE資訊。之後，在19點46分38秒，第二次觸發重新配置，將1節點移出叢集。

（2）1節點orcl1例項告警日誌分析

Fri Mar 25 19:46:23 2022
Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1]
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lms3_40273.trc  (incident=1648121):
ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_1648121/orcl1_lms3_40273_i1648121.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri Mar 25 19:46:24 2022
Dumping diagnostic data in directory=[cdmp_20220325194624], requested by (instance=1, osid=40273 (LMS3)), summary=[incident=1648121].
Fri Mar 25 19:46:24 2022
Sweep [inc][1648121]: completed
Sweep [inc2][1648121]: completed
Fri Mar 25 19:46:25 2022
PMON (ospid: 40198): terminating the instance due to error 484
System state dump requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_40249_20220325194625.trc
Dumping diagnostic data in directory=[cdmp_20220325194625], requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 40198

從以上alert_orcl1.log日誌不難發現，在19點46分23秒，1節點例項很快就命中ORA-7445報錯；在19點46分25秒，1節點例項被PMON程式殺掉造成例項當機。透過以上3.1日誌分析，我們知道2節點例項在19點46分22秒完成正常開啟，但緊接著1節點就立馬當機。結合該現象與處理ORA-7445的經驗，懷疑很可能是命中Oracle軟體BUG。

（3）1節點orcl1例項lms3程式trace日誌分析

*** 2022-03-25 19:46:23.680
SKGXP:[7f248be356c8.0]{0}: SKGXP Assertion msize <= SKGXP_ATTR_DGRAM_MTU_MAX(ctx) Failed at location skgxpseginit2 line_num: 31934
 
*** 2022-03-25 19:46:23.681
Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1]
Incident 1648121 created, dump file: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_1648121/orcl1_lms3_40273_i1648121.trc
ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] []
 
ssexhd: crashing the process...
Background_Core_Dump = partial
ksdbgcra: writing core file to directory '/u01/app/oracle/diag/rdbms/orcl/orcl1/cdump'

從以上orcl1_lms3_40273.trc日誌不難發現，lms3程式在執行過程中命中ORA-7445報錯，其作用是用於叢集之間資源管理和控制。該程式在命中ORA-7445報錯之前，已經建立相應的事件（1648121）跟蹤檔案。因此，需要進一步分析orcl1_lms3_40273_i1648121.trc檔案。

（4）1節點orcl1例項lms3程式incident日誌分析

*** 2022-03-25 19:46:23.695
*** SESSION ID:(3527.1) 2022-03-25 19:46:23.695
*** CLIENT ID:() 2022-03-25 19:46:23.695
*** SERVICE NAME:(SYS$BACKGROUND) 2022-03-25 19:46:23.695
*** MODULE NAME:() 2022-03-25 19:46:23.695
*** ACTION NAME:() 2022-03-25 19:46:23.695
 
Dump continued from file: /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lms3_40273.trc
ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] []
 
========= Dump for incident 1648121 (ORA 7445 [gsignal()+53]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1]
Registers:
%rax: 0x0000000000000000 %rbx: 0x00007f2487d41000 %rcx: 0xffffffffffffffff
%rdx: 0x0000000000000006 %rdi: 0x0000000000009d51 %rsi: 0x0000000000009d51
%rsp: 0x00007ffdcda48b78 %rbp: 0x00007f248d9a5bf0  %r8: 0x00000000ffffffff
 %r9: 0xfeff092d63646b68 %r10: 0x0000000000000008 %r11: 0x0000000000000202
%r12: 0x00007f248d9a74ac %r13: 0x00007f248d9a5be0 %r14: 0x00007f248d9ba68c
%r15: 0x0000000000007cbe %rip: 0x00000039e5432495 %efl: 0x0000000000000202
  gsignal()+40 (0x39e5432488) movsxdq %esi,%rsi
  gsignal()+43 (0x39e543248b) movsxdq %eax,%rdi
  gsignal()+46 (0x39e543248e) mov $0xea,%eax
  gsignal()+51 (0x39e5432493) syscall
> gsignal()+53 (0x39e5432495) cmp $0xfffff000,%rax
  gsignal()+59 (0x39e543249b) ja 0x39e54324af
  gsignal()+61 (0x39e543249d) (repe) ret
  gsignal()+63 (0x39e543249f) nop
  gsignal()+64 (0x39e54324a0) test %eax,%eax

從以上orcl1_lms3_40273_i1648121.trc日誌不難發現，有很多ORA-7445報錯相關的轉儲資訊，考慮到7445是Oracle內部的錯誤程式碼，這類報錯命中BUG的機率較大，因此在Oracle官方MOS平臺上進行匹配，發現Bug 25722720 - RAC LMS getting ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] (Doc ID 25722720.8)，與我們的報錯現象高度類似，具體匹配情況如下：

從目前日誌分析結果來看，RACDG兩節點例項開啟後，其中一個例項立馬命中ORA-7445導致例項當機，雖然懷疑跟Bug 25722720相關，但是官方針對該BUG只發布了12.2.0.1版本的補丁，因此需要進一步定位。

（5）備節點例項模擬測試復現ORA-7445

由於切換問題，此時備節點已經強制開啟，後續測試均在備節點上進行。

測試步驟如下：

1、啟動備庫一節點，此時資料庫正常

2、啟動備庫二節點，此時情況復現，例項節點當機，且後臺日誌出現相同ORA-7445報錯

3、查詢切換過程中出現的等待事件，結果如下：

SQL> select instance_number,event
  2        from DBA_HIST_ACTIVE_SESS_HISTORY
  3       where to_char(sample_time, 'yyyy-mm-dd hh24:mi') >= '2022-03-28 22:00' and
  4      to_char(sample_time, 'yyyy-mm-dd hh24:mi') <= '2022-03-28 23:00'
  5  and event in('enq: US - contention','row cache lock');

發現存在大量 row cache lock、enq: US – contention相關等待，此時想到之前曾經在HIS庫上出現過由於回滾段的爭用導致業務卡頓問題，曾對資料庫做過隱含引數層面調整。

--高併發環境下，設定以下引數儘可能多的保留online回滾段數量
ALTER SYSTEM SET "_rollback_segment_count"=12000 sid='*';
--關閉undo retention自動調整功能
ALTER SYSTEM SET "_undo_autotune" = false sid='*';
--設定_highthreshold_undoretention控制undo retention
alter system set "_highthreshold_undoretention"=21600 sid='*';

排查主備庫兩端引數檔案設定有差異，發現備庫端未設定相關隱含引數。一般我們建議主備兩端引數需保持一致，故思路為備庫端設定隱藏引數排查引數影響。

重新進行如下步驟測試：

1、啟動備庫一節點

2、啟動備庫二節點，檢視資料庫狀態

3、查詢切換過程中出現的等待事件。發現，設定該引數後，備庫能正常開啟，且未出現ORA-7445問題。重複測試基本能匹配到Oracle MOS官方文件IF: Undo Wait Event - Enq: US - contention（Doc ID 1951680.1）。

比對主備兩端補丁號資訊，主備兩端補丁版本不同。

四、解決辦法和建議

根本原因還在Oracle軟體自身軟體BUG引起，命中ORA-7445報錯。當前可透過設定下列隱含引數，提前分配undo段，規避ORA-7445報錯。搭建DG環境時，強烈建議主備兩端環境要保持一致，包括資料庫軟體版本和補丁（PSU）版本以及引數（含隱含引數）。

ALTER SYSTEM SET "_rollback_segment_count"=12000 sid='*';

Oracle RAC命中ORA-7445只能開啟一個節點故障案例分析

一、故障描述

二、問題時間點關鍵描述

三、問題詳細診斷過程

四、解決辦法和建議

相關文章