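On May 18, 2018, a customer's Oracle 11.2.0.3 RAC on HP-UX had the host of one node hang, which caused the other node to hit an ORA-07445 error and crash. The ORA-07445 crash is related to the database parameter parallel_max_servers being set above 3600, which triggers an error when the cluster's DR changes the CPU count. The detailed analysis follows.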
Environment:
Operating system version: HP-UX B.11.31 U ia64 3938805652 unlimited-user license

Database version: Oracle 11.2.0.3 RAC
Alert log from the abnormal database crash:
Fri May 18 10:37:46 2018
Archived Log entry 511495 added for thread 1 sequence 260260 ID 0xf640cdf dest 1:
Fri May 18 10:52:29 2018
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550, kxfpnid()+1728] [flags: 0x0, count: 1]
Errors in file /u01/app/oracle/diag/rdbms/**/**1/trace/**1_lmon_8174.trc (incident=4200673):
ORA-07445: exception encountered: core dump [kxfpnid()+1728] [SIGSEGV] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550] [Address not mapped to object] []
Incident details in: /u01/app/oracle/diag/rdbms/**/**1/incident/incdir_4200673/**1_lmon_8174_i4200673.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri May 18 10:52:31 2018
Dumping diagnostic data in directory=[cdmp_20180518105231], requested by (instance=1, osid=8174 (LMON)), summary=[incident=4200673].
Fri May 18 10:52:34 2018
PMON (ospid: 8152): terminating the instance due to error 481
Fri May 18 10:52:34 2018
opiodr aborting process unknown ospid (428) as a result of ORA-1092
Fri May 18 10:52:34 2018
ORA-1092 : opitsk aborting process
Fri May 18 10:52:34 2018
ORA-1092 : opitsk aborting process
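
When scanning a long alert log, it can help to pull the failing function, signal, and address out of the ORA-07445 line programmatically. The following is a minimal sketch, assuming the message keeps the bracketed layout shown in the log above; the pattern and the helper name parse_ora_7445 are illustrative and not part of any Oracle tool.

```python
import re

# Pattern for the ORA-07445 line in the layout seen in the alert log above.
# Assumption: the fields always appear in this order and in square brackets.
ORA_7445 = re.compile(
    r"ORA-07445: exception encountered: core dump "
    r"\[(?P<func>[^()\]]+)\(\)\+(?P<offset>\d+)\] "
    r"\[(?P<signal>[A-Z]+)\] "
    r"\[ADDR:(?P<addr>0x[0-9A-Fa-f]+)\] "
    r"\[PC:(?P<pc>0x[0-9A-Fa-f]+)\] "
    r"\[(?P<reason>[^\]]*)\]"
)


def parse_ora_7445(line):
    """Return the fields of an ORA-07445 line as a dict, or None if no match."""
    match = ORA_7445.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["offset"] = int(fields["offset"])  # byte offset inside the function
    return fields


# The line from the alert log above.
sample = (
    "ORA-07445: exception encountered: core dump [kxfpnid()+1728] [SIGSEGV] "
    "[ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550] "
    "[Address not mapped to object] []"
)
info = parse_ora_7445(sample)
print(info["func"], info["offset"], info["signal"], info["reason"])
```

The function name (kxfpnid here) is the most useful field to search for on My Oracle Support; the offset and signal can differ between platforms for the same underlying bug.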
Contents of the trace file related to ORA-07445, /u01/app/oracle/diag/rdbms/cbsprd/cbsprd1/trace/**_lmon_8174.trc:
$[/home/grid] bsprd1/incident/incdir_4200673/**_lmon_8174_i4200673.trc <
Dump file /u01/app/oracle/diag/rdbms/**/**1/incident/incdir_4200673/**1_lmon_8174_i4200673.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name: HP-UX
Node name: **
Release: B.11.31
Version: U
Machine: ia64
Instance name: **
Redo thread mounted by this instance: 1
Oracle process number: 11
Unix process pid: 8174, image: oracle@**1 (LMON)
*** 2018-05-18 10:52:29.072
*** SESSION ID:(694.1) 2018-05-18 10:52:29.072
*** CLIENT ID:() 2018-05-18 10:52:29.072
*** SERVICE NAME:(SYS$BACKGROUND) 2018-05-18 10:52:29.072
*** MODULE NAME:() 2018-05-18 10:52:29.072
*** ACTION NAME:() 2018-05-18 10:52:29.072

Dump continued from file: /u01/app/oracle/diag/rdbms/**/**1/trace/**1_lmon_8174.trc
ORA-07445: exception encountered: core dump [kxfpnid()+1728] [SIGSEGV] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550] [Address not mapped to object] []

========= Dump for incident 4200673 (ORA 7445 [kxfpnid()+1728]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550, kxfpnid()+1728] [flags: 0x0, count: 1]
r1: 600000000014b480 r20: c0000001bfe1beec br5: 0
r2: c0000001bfde6000 r21: c000002ef7838670 br6: c000000000651f30
r3: 9fffffff5ffe7c00 r22: c000002f35c07b10 br7: c0000000004b53a0
r4: 0 r23: 0 ip: 400000000d060550
r5: c000000000000408 r24: 0 iipa: 0
r6: c00000000006fb60 r25: 1 cfm: ca1
r7: 9ffffffffd7f7350 r26: 5175657269657320 um: 1a
r8: ffffffff00000094 r27: 2900000000000000 rsc: 1f
$ more /u01/app/oracle/diag/rdbms/**/**1/trace/**1_lmon_8174.trc
Trace file /u01/app/oracle/diag/rdbms/**/**1/trace/**1_lmon_8174.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name: HP-UX
Node name: **
Release: B.11.31
Version: U
Machine: ia64
Instance name: **
Redo thread mounted by this instance: 1
Oracle process number: 11
Unix process pid: 8174, image: oracle@**1 (LMON)

*** 2018-05-14 04:12:05.408
*** SESSION ID:(694.1) 2018-05-14 04:12:05.408
*** CLIENT ID:() 2018-05-14 04:12:05.408
*** SERVICE NAME:(SYS$BACKGROUND) 2018-05-14 04:12:05.408
*** MODULE NAME:() 2018-05-14 04:12:05.408
*** ACTION NAME:() 2018-05-14 04:12:05.408

*** TRACE FILE RECREATED AFTER BEING REMOVED ***

kjfc_TaskScheduler_Execute_wTime: timer wraps at 0xffffff44 max 0xffffffd6

*** 2018-05-18 10:52:28.415
kjxggpoll: change db group poll time to 50 ms

*** 2018-05-18 10:52:28.495
kjxgmpoll reconfig instance map: 1

*** 2018-05-18 10:52:28.495
kjxgmrcfg: Reconfiguration started, type 1
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 26 0.

*** 2018-05-18 10:52:28.514
Name Service frozen
kjxgmcs: Setting state to 26 1.
kjxgrdecidever: No old version members in the cluster
kjxgrssvote: reconfig bitmap chksum 0x1a8c9 cnt 1 master 1 ret 0
kjxgrpropmsg: SSMEMI: inst 1 - no disk vote
kjxgrpropmsg: SSVOTE: Master indicates no Disk Voting
2018-05-18 10:52:28.514499 : kjxgrDiskVote: nonblocking method is chosen
kjxgrDiskVote: Only one inst in the cluster - no disk vote
2018-05-18 10:52:28.686421 : kjxgrDiskVote: Obtained RR update lock for sequence 27, RR seq 26
2018-05-18 10:52:28.814554 : kjxgrDiskVote: derive membership from CSS (no disk votes)
2018-05-18 10:52:28.814589 : proposed membership: 1
2018-05-18 10:52:28.875275 : kjxgrDiskVote: new membership is updated by inst 1, seq 28
2018-05-18 10:52:28.875300 : kjxgrDiskVote: bitmap: 1
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmmeminfo: can not invalidate inst 2
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 28 2.
kjfmSendAbortInstMsg: send an abort message to instance 2
kjfmuin: inst bitmap 1
kjfmmhi: received msg from inst 1 (inc 22)
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 28 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 28 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 28 5.
Name Service normal
Name Service recovery done

*** 2018-05-18 10:52:29.026
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 28 6.
kjxgmcs: total reconfig time 0.508 seconds (from 58968544 to 58969052) (old dlminc 26, new dlminc 28)
kjxggpoll: change db group poll time to 600 ms
kjfmact: call ksimdic on instance (2)

*** 2018-05-18 10:52:29.026
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550, kxfpnid()+1728] [flags: 0x0, count: 1]
Incident 4200673 created, dump file: /u01/app/oracle/diag/rdbms/**/**1/incident/incdir_4200673/**1_lmon_8174_i4200673.trc
ORA-07445: exception encountered: core dump [kxfpnid()+1728] [SIGSEGV] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550] [Address not mapped to object] []

ssexhd: crashing the process...
Background_Core_Dump = partial
ksdbgcra: writing core file to directory '/u01/app/oracle/diag/rdbms/**/**1/cdump'
A search of the Oracle support site shows that the note Doc ID 1505057.1 describes a failure similar to this one:

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.2 and later
Information in this document applies to any platform.

SYMPTOMS

The following symptoms were seen:

The following error was seen in the alert log, and then PMON terminated the instance:
ORA-07445: exception encountered: core dump [kxfpnid()+632] [SIGBUS] [ADDR:0x95] [PC:0x103AACCD8] [Invalid address alignment] []
The value for parallel_max_servers was NOT explicitly set in the spfile or pfile.
The LMON tracefile showed a value for parallel_max_servers > 3600 (3600 is the maximum allowed in 11.2).
The call stack contained the following:
kxfpnid <- ksimdic <- kjfmact <- kjfcln <- ksbrdp <- opirip <- opidrv <- sou2o <- opimai_real <- ssthrdmain

CAUSE

- RAC NODE CRASHED AFTER ORA-7445 [KXFPNID()+632] IN LMON was filed for this issue and was closed as unpublished Bug 13743987 - ASM INSTANCE TERMINATES WITH DR CHANGE IN CPU COUNT PARALLE_MAX_SERVERS. More information about unpublished Bug 13743987 is contained in the note Document 13743987.8 - Bug 13743987 - A high CPU_COUNT can cause ORA-68 for parallel_max_servers.

SOLUTION

The following solutions are available:

Apply the 11.2.0.4 Patch Set, when available.
For an 11.2.0.3 Exadata database only, apply Bundle Patch 11.
Workaround:
Explicitly set the value for parallel_max_servers to a value < 3601. A reasonable value at which to start setting this parameter is (number of physical CPUs) * (parallel_threads_per_cpu) * 4. So, for example, if you have 4 quad-core CPUs and parallel_threads_per_cpu is set to 2, you would set your starting value at 16 * 2 * 4 = 128 per instance.
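
The workaround arithmetic can be sketched as follows. This is only an illustration of the formula quoted in the note; the function names and the constant MAX_ALLOWED are hypothetical, and the 3600 ceiling is taken from the note's statement about 11.2.

```python
# Upper limit for parallel_max_servers in 11.2, as stated in the MOS note.
MAX_ALLOWED = 3600


def starting_parallel_max_servers(cpu_cores, parallel_threads_per_cpu):
    """Starting value suggested by the note: cores * threads_per_cpu * 4.

    The result is capped at MAX_ALLOWED so it never exceeds the limit.
    """
    value = cpu_cores * parallel_threads_per_cpu * 4
    return min(value, MAX_ALLOWED)


def exceeds_limit(current_value):
    """True if a parallel_max_servers value is above the 11.2 limit."""
    return current_value > MAX_ALLOWED


# The note's example: 4 quad-core CPUs (16 cores), parallel_threads_per_cpu = 2.
print(starting_parallel_max_servers(16, 2))  # 16 * 2 * 4 = 128
```

The computed number is only a starting point per instance; the right value still depends on the parallel workload the instance actually runs.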

The parameter parallel_max_servers on the local Oracle RAC cluster is 3840, which matches the description in MOS Doc ID 1505057.1. The fix is simple: set the parameter to 3600 or lower.
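
As a rough sketch of how the change could be scripted, the helper below validates a target value against the 3600 limit and builds the corresponding ALTER SYSTEM statement. The helper name and the target value 128 are assumptions chosen for illustration, and SCOPE = BOTH with SID = '*' assumes the instances use an spfile and that all of them should receive the same value.

```python
# Limit from MOS Doc ID 1505057.1: parallel_max_servers must stay below 3601.
MAX_ALLOWED = 3600


def build_alter_system(new_value, sid="*"):
    """Build the ALTER SYSTEM statement for a validated parallel_max_servers.

    Raises ValueError if the value is outside 0..MAX_ALLOWED.
    """
    if not 0 <= new_value <= MAX_ALLOWED:
        raise ValueError(
            f"parallel_max_servers must be between 0 and {MAX_ALLOWED}, "
            f"got {new_value}"
        )
    return (
        f"ALTER SYSTEM SET parallel_max_servers = {new_value} "
        f"SCOPE = BOTH SID = '{sid}';"
    )


# Hypothetical target value; pick one that fits the real workload.
print(build_alter_system(128))
```

The printed statement would then be run in SQL*Plus as a privileged user, and the result can be confirmed with SHOW PARAMETER parallel_max_servers on each instance.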
