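Oracle 11.2.0.3 RAC node crash caused by ORA-07445

Posted by 清風艾艾 on 2018-05-22
    On May 18, 2018, one node of a customer's Oracle 11.2.0.3 RAC on HP-UX hung at the host level, which caused the other node to hit ORA-07445 and crash. The ORA-07445 crash is related to the database parameter parallel_max_servers having a value above 3600, combined with an error raised when the cluster's DR changed the CPU COUNT value. The detailed analysis follows.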
    Environment:
OS version: HP-UX B.11.31 U ia64 3938805652 unlimited-user license

Database version: Oracle 11.2.0.3 RAC
    Alert log from the crash:
Fri May 18 10:37:46 2018
Archived Log entry 511495 added for thread 1 sequence 260260 ID 0xf640cdf dest 1:
Fri May 18 10:52:29 2018
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550, kxfpnid()+1728] [flags: 0x0, count: 1]
Errors in file /u01/app/oracle/diag/rdbms/**/**1/trace/**1_lmon_8174.trc  (incident=4200673):
ORA-07445: exception encountered: core dump [kxfpnid()+1728] [SIGSEGV] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550] [Address not mapped to object] []
Incident details in: /u01/app/oracle/diag/rdbms/**/**1/incident/incdir_4200673/**1_lmon_8174_i4200673.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri May 18 10:52:31 2018
Dumping diagnostic data in directory=[cdmp_20180518105231], requested by (instance=1, osid=8174 (LMON)), summary=[incident=4200673].
Fri May 18 10:52:34 2018
PMON (ospid: 8152): terminating the instance due to error 481
Fri May 18 10:52:34 2018
opiodr aborting process unknown ospid (428) as a result of ORA-1092
Fri May 18 10:52:34 2018
ORA-1092 : opitsk aborting process
Fri May 18 10:52:34 2018
ORA-1092 : opitsk aborting process
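
The failing function, offset and signal in the ORA-07445 line above are the details to compare against My Oracle Support notes. The following is a minimal sketch of pulling them out of an alert log line with a regular expression. The helper name `parse_ora_07445` and the pattern are illustrative assumptions, not part of any Oracle tooling, and the pattern only covers the line layout shown in this log.

```python
import re
from typing import Optional

# Assumed layout, taken from the alert log line in this post:
# ORA-07445: exception encountered: core dump [func()+offset] [SIGNAL] ...
ORA_07445 = re.compile(
    r"ORA-07445: exception encountered: core dump "
    r"\[(?P<function>\w+)\(\)\+(?P<offset>\d+)\] \[(?P<signal>\w+)\]"
)


def parse_ora_07445(line: str) -> Optional[dict]:
    """Return function, offset and signal from an ORA-07445 line, or None."""
    match = ORA_07445.search(line)
    if match is None:
        return None
    return {
        "function": match["function"],
        "offset": int(match["offset"]),
        "signal": match["signal"],
    }


# Line copied from the alert log above.
sample = (
    "ORA-07445: exception encountered: core dump [kxfpnid()+1728] [SIGSEGV] "
    "[ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550] "
    "[Address not mapped to object] []"
)

if __name__ == "__main__":
    print(parse_ora_07445(sample))
```

The function name (here kxfpnid) matters more than the offset when searching for a known issue, since the offset differs between platforms and patch levels.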
    Contents of the ORA-07445 trace file /u01/app/oracle/diag/rdbms/cbsprd/cbsprd1/trace/**_lmon_8174.trc:
$[/home/grid] bsprd1/incident/incdir_4200673/**_lmon_8174_i4200673.trc                  <
Dump file /u01/app/oracle/diag/rdbms/**/**1/incident/incdir_4200673/**1_lmon_8174_i4200673.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name:    HP-UX
Node name:      **
Release:        B.11.31
Version:        U
Machine:        ia64
Instance name: **
Redo thread mounted by this instance: 1
Oracle process number: 11
Unix process pid: 8174, image: oracle@**1 (LMON)
*** 2018-05-18 10:52:29.072
*** SESSION ID:(694.1) 2018-05-18 10:52:29.072
*** CLIENT ID:() 2018-05-18 10:52:29.072
*** SERVICE NAME:(SYS$BACKGROUND) 2018-05-18 10:52:29.072
*** MODULE NAME:() 2018-05-18 10:52:29.072
*** ACTION NAME:() 2018-05-18 10:52:29.072
 
Dump continued from file: /u01/app/oracle/diag/rdbms/**/**1/trace/**1_lmon_8174.trc
ORA-07445: exception encountered: core dump [kxfpnid()+1728] [SIGSEGV] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550] [Address not mapped to object] []

========= Dump for incident 4200673 (ORA 7445 [kxfpnid()+1728]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550, kxfpnid()+1728] [flags: 0x0, count: 1]
  r1: 600000000014b480       r20: c0000001bfe1beec       br5:                0
  r2: c0000001bfde6000       r21: c000002ef7838670       br6: c000000000651f30
  r3: 9fffffff5ffe7c00       r22: c000002f35c07b10       br7: c0000000004b53a0
  r4:                0       r23:                0        ip: 400000000d060550
  r5: c000000000000408       r24:                0      iipa:                0
  r6: c00000000006fb60       r25:                1       cfm:              ca1
  r7: 9ffffffffd7f7350       r26: 5175657269657320        um:               1a
  r8: ffffffff00000094       r27: 2900000000000000       rsc:               1f
$ more /u01/app/oracle/diag/rdbms/**/**1/trace/**1_lmon_8174.trc
Trace file /u01/app/oracle/diag/rdbms/**/**1/trace/**1_lmon_8174.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name:    HP-UX
Node name:      **
Release:        B.11.31
Version:        U
Machine:        ia64
Instance name: **
Redo thread mounted by this instance: 1
Oracle process number: 11
Unix process pid: 8174, image: oracle@**1 (LMON)

*** 2018-05-14 04:12:05.408
*** SESSION ID:(694.1) 2018-05-14 04:12:05.408
*** CLIENT ID:() 2018-05-14 04:12:05.408
*** SERVICE NAME:(SYS$BACKGROUND) 2018-05-14 04:12:05.408
*** MODULE NAME:() 2018-05-14 04:12:05.408
*** ACTION NAME:() 2018-05-14 04:12:05.408
 
*** TRACE FILE RECREATED AFTER BEING REMOVED ***

kjfc_TaskScheduler_Execute_wTime: timer wraps at 0xffffff44 max 0xffffffd6

*** 2018-05-18 10:52:28.415
kjxggpoll: change db group poll time to 50 ms

*** 2018-05-18 10:52:28.495
kjxgmpoll reconfig instance map: 1 

*** 2018-05-18 10:52:28.495
kjxgmrcfg: Reconfiguration started, type 1
CGS/IMR TIMEOUTS:
  CSS recovery timeout = 31 sec (Total CSS waittime = 65)
  IMR Reconfig timeout = 75 sec
  CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 26 0.

*** 2018-05-18 10:52:28.514
     Name Service frozen
kjxgmcs: Setting state to 26 1.
kjxgrdecidever: No old version members in the cluster
kjxgrssvote: reconfig bitmap chksum 0x1a8c9 cnt 1 master 1 ret 0
kjxgrpropmsg: SSMEMI: inst 1 - no disk vote
kjxgrpropmsg: SSVOTE: Master indicates no Disk Voting
2018-05-18 10:52:28.514499 : kjxgrDiskVote: nonblocking method is chosen
kjxgrDiskVote: Only one inst in the cluster - no disk vote
2018-05-18 10:52:28.686421 : kjxgrDiskVote: Obtained RR update lock for sequence 27, RR seq 26
2018-05-18 10:52:28.814554 : kjxgrDiskVote: derive membership from CSS (no disk votes)
2018-05-18 10:52:28.814589 : proposed membership: 1 
2018-05-18 10:52:28.875275 : kjxgrDiskVote: new membership is updated by inst 1, seq 28
2018-05-18 10:52:28.875300 : kjxgrDiskVote: bitmap: 1 
CGS/IMR TIMEOUTS:
  CSS recovery timeout = 31 sec (Total CSS waittime = 65)
  IMR Reconfig timeout = 75 sec
  CGS rcfg timeout = 85 sec
kjxgmmeminfo: can not invalidate inst 2
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 28 2.
kjfmSendAbortInstMsg: send an abort message to instance 2
 kjfmuin: inst bitmap 1 
 kjfmmhi: received msg from inst 1 (inc 22)
     Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 28 3.
     Name Service recovery started
     Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 28 4.
     Multicasted all local name entries for publish
     Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 28 5.
     Name Service normal
     Name Service recovery done

*** 2018-05-18 10:52:29.026
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 28 6.
kjxgmcs: total reconfig time 0.508 seconds (from 58968544 to 58969052) (old dlminc 26, new dlminc 28)
kjxggpoll: change db group poll time to 600 ms
kjfmact: call ksimdic on instance (2)

*** 2018-05-18 10:52:29.026
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550, kxfpnid()+1728] [flags: 0x0, count: 1]
Incident 4200673 created, dump file: /u01/app/oracle/diag/rdbms/**/**1/incident/incdir_4200673/**1_lmon_8174_i4200673.trc
ORA-07445: exception encountered: core dump [kxfpnid()+1728] [SIGSEGV] [ADDR:0xFFFFFFFF00000094] [PC:0x400000000D060550] [Address not mapped to object] []

ssexhd: crashing the process...
Background_Core_Dump = partial
ksdbgcra: writing core file to directory '/u01/app/oracle/diag/rdbms/**/**1/cdump'
    On My Oracle Support, the note Doc ID 1505057.1 describes a failure similar to this one:

APPLIES TO:

  Oracle Database - Enterprise Edition - Version 11.2.0.2 and later
Information in this document applies to any platform.

SYMPTOMS

The following symptoms were seen:

  1. The following error was seen in the alert log, and then PMON terminated the instance:
    ORA-07445: exception encountered: core dump [kxfpnid()+632] [SIGBUS] [ADDR:0x95] [PC:0x103AACCD8] [Invalid address alignment] []
  2. The value for parallel_max_servers was NOT explicitly set in the spfile or pfile.
  3. The LMON tracefile showed a value for parallel_max_servers > 3600 (3600 is the maximum allowed in 11.2).
  4. The call stack contained the following: 
    kxfpnid <- ksimdic <- kjfmact <- kjfcln <- ksbrdp <- opirip <- opidrv <- sou2o <- opimai_real <- ssthrdmain

CAUSE

  - RAC NODE CRASHED AFTER ORA-7445 [KXFPNID()+632] IN LMON was filed for this issue and was closed as unpublished Bug 13743987 - ASM INSTANCE TERMINATES WITH DR CHANGE IN CPU COUNT PARALLE_MAX_SERVERS.   More information about unpublished Bug 13743987 is contained in the note Document 13743987.8 - Bug 13743987 - A high CPU_COUNT can cause ORA-68 for parallel_max_servers.  

SOLUTION

The following solutions are available:

  1. Apply the 11.2.0.4 Patch Set, when available.
  2. For an 11.2.0.3 Exadata database only, apply Bundle Patch 11.
  3. Workaround:  
    Explicitly set the value for parallel_max_servers to a value < 3601. A reasonable value at which to start setting this parameter is (number of physical CPUs) * (parallel_threads_per_cpu) * 4. So, for example, if you have 4 quad-core CPUs and parallel_threads_per_cpu is set to 2, you would set your starting value at 16 * 2 * 4 = 128 per instance.
    A query against the local Oracle RAC cluster showed parallel_max_servers at 3840, which matches the description in MOS Doc ID 1505057.1. The fix is simple: set the parameter to 3600 or lower.
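
The rule of thumb quoted from the MOS note can be sketched as a small calculation. This is only an illustration of the arithmetic, capped at the 3600 limit that the note gives for 11.2. The function names `suggested_parallel_max_servers` and `exceeds_limit` are hypothetical helpers, and the inputs have to come from your own system.

```python
# Upper limit for parallel_max_servers in 11.2, as stated in MOS Doc ID 1505057.1.
PARALLEL_MAX_SERVERS_LIMIT = 3600


def suggested_parallel_max_servers(physical_cpus: int,
                                   parallel_threads_per_cpu: int) -> int:
    """Starting value from the note's rule of thumb, capped at the 11.2 limit.

    Hypothetical helper: (physical CPUs) * (parallel_threads_per_cpu) * 4.
    """
    if physical_cpus < 1 or parallel_threads_per_cpu < 1:
        raise ValueError("both inputs must be positive integers")
    starting_value = physical_cpus * parallel_threads_per_cpu * 4
    return min(starting_value, PARALLEL_MAX_SERVERS_LIMIT)


def exceeds_limit(current_value: int) -> bool:
    """True when a parallel_max_servers value is above the 11.2 limit."""
    return current_value > PARALLEL_MAX_SERVERS_LIMIT


if __name__ == "__main__":
    # The note's example: 4 quad-core CPUs (16 cores), parallel_threads_per_cpu = 2.
    print(suggested_parallel_max_servers(16, 2))  # 128
    # The value found on this cluster.
    print(exceeds_limit(3840))  # True
```

The cap only matters on hosts with many CPUs, which is the situation in this incident: the derived value of 3840 is above the limit, so the parameter has to be set explicitly.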




Source: ITPUB Blog, link: http://blog.itpub.net/29357786/viewspace-2154858/. If you repost this article, please credit the source; otherwise legal action may be pursued.
