紅色警報 ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因叢集心跳丟失重啟

清風艾艾發表於2019-05-23

      ORACLE RAC 11.2.0.4 FOR SOLARIS 10 ASM 和DB因叢集心跳丟失重啟。該問題是BUG 10194190 18740837

導致的,修復該問題需要打patch 25142535。有一客戶的solaris oracle 11.2.0.4 rac所有節點的DB已經打上補丁,但是

主機執行248天后,跑在上邊的Oracle 11.2.0.4依然因叢集心跳丟失而重啟。根據ORACLE MOS官方回覆,修復該BUG

的一個補丁包 patch 18740837是需要同時在GI軟體上應用。

      下邊是相關客戶案例的問題分析及解決處理總結。

       1、問題節點DB告警日誌報錯提示

    2、問題節點ASM告警日誌報錯提示

      3、 +ASM2_lmhb_4609_i78497.trc檔案內容

      4、OCSSD日誌提示心跳超時

      5、+ASM2_lmhb_4609_i78497.trc檔案區域性資訊

===[ Session State Object ]===

  ----------------------------------------

  SO: 0x3ffdb38d8, type: 4, owner: 0x400b0c258, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

   proc=0x400b0c258, name=session, file=ksu.h LINE:12729 ID:, pg=0

  (session) sid: 145 ser: 1 trans: 0x0, creator: 0x400b0c258

            flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-

            flags2: (0x409) -/-/INC

            DID: , short-term DID:

            txn branch: 0x0

            edition#: 0            oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS

  ksuxds FALSE at location: 0

  service name: SYS$BACKGROUND

  Current Wait Stack:

   0: waiting for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476951 seq_num=11521 snap_id=1

      wait times: snap=2 min 46 sec, exc=2 min 46 sec, total=2 min 46 sec

      wait times: max=0.100000 sec, heur=2 min 46 sec

      wait counts: calls=1 os=1

      in_wait=1 iflags=0x5a8

  Wait State:

    fixed_waits=0 flags=0x22 boundary=0x0/-1

  Session Wait History:

      elapsed time of 0.000015 sec since current wait

   0: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476950 seq_num=11520 snap_id=1

      wait times: snap=0.000027 sec, exc=0.000027 sec, total=0.000027 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000138 sec of elapsed time

   1: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476949 seq_num=11519 snap_id=1

      wait times: snap=0.102094 sec, exc=0.102094 sec, total=0.102094 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000015 sec of elapsed time

   2: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476948 seq_num=11518 snap_id=1

      wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000133 sec of elapsed time

   3: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476947 seq_num=11517 snap_id=1

      wait times: snap=0.102074 sec, exc=0.102074 sec, total=0.102074 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000013 sec of elapsed time

   4: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476946 seq_num=11516 snap_id=1

      wait times: snap=0.000023 sec, exc=0.000023 sec, total=0.000023 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000119 sec of elapsed time

   5: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476945 seq_num=11515 snap_id=1

      wait times: snap=0.103086 sec, exc=0.103086 sec, total=0.103086 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000012 sec of elapsed time

   6: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476944 seq_num=11514 snap_id=1

      wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000152 sec of elapsed time

   7: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476943 seq_num=11513 snap_id=1

      wait times: snap=0.103090 sec, exc=0.103090 sec, total=0.103090 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000016 sec of elapsed time

   8: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476942 seq_num=11512 snap_id=1

      wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000272 sec of elapsed time

   9: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476941 seq_num=11511 snap_id=1

      wait times: snap=0.102084 sec, exc=0.102084 sec, total=0.102084 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000013 sec of elapsed time

  Sampled Session History of session 145 serial 1

  ---------------------------------------------------

  The sampled session history is constructed by sampling

  the target session every 1 second. The sampling process

  captures at each sample if the session is in a non-idle wait,

  an idle wait, or not in a wait. If the session is in a

  non-idle wait then one interval is shown for all the samples

  the session was in the same non-idle wait. If the

  session is in an idle wait or not in a wait for

  consecutive samples then one interval is shown for all

  the consecutive samples. Though we display these consecutive

  samples  in a single interval the session may NOT be continuously

  idle or not in a wait (the sampling process does not know).

  The history is displayed in reverse chronological order.

  sample interval: 1 sec, max history 120 sec

  ---------------------------------------------------

    [121 samples,                                          21:08:20 - 21:10:20]

      idle wait at each sample

  temporary object counter: 0

    ----------------------------------------

    Virtual Thread:

    kgskvt: 3fc434ce8, sess: 3ffdb38d8 sid: 145 ser: 1

    vc: 0, proc: 400b0c258, id: 145

    consumer group cur:  (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:

    vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1

    inwait: 0, short wait event: 0 posted_run: 0

    location where insched last set: kgskthrrun

    location where insched last cleared: kgskthrrun1

    location where inwait last set: NULL

    location where inwait last cleared: NULL

    is_assigned: 0, in_sched: 0 (0)

    qcls: 0, qlink: FALSE

    vt_active: 0 (pending: 0)

    vt_pq_active: 0, dop: 0

    used quanta (usecs):

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu start time: 0

    idle time: 0, active time: 0 (cg: 0)

    cpu yields:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu waits:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu wait time (usecs):

    stmt: 0, accum: 0, mapped: 0, tot: 0

    io waits:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    io wait time:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    ASL queued time outs: 0, time: 0 (cur 0, cg 0)

    PQQ queued time outs: 0, time: 0 (cur 0, cg 0)

    Queue timeout violation: 0

    calls aborted: 0, num est exec limit hit: 0

    undo current: 0k max: 0k

    I/O credits acquired:small=0 large=0

    I/O credits waiting for:small=0 large=0

    KTU Session Commit Cache Dump for IDLs:

    KTU Session Commit Cache Dump for Non-IDLs:

    ----------------------------------------

    KKS-UOL used : 0 locks(used=0, free=0)

    KGX Atomic Operation Log 3eae04a58

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04aa8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04af8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04b48

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    ----------------------------------------

    KGL-UOL SO Cache(total=0, free=0)

    KGX Atomic Operation Log 3eae047a0

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae047f8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04850

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae048a8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04900

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae04958

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    KGX Atomic Operation Log 3eae049b0

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4=0 pt5=0 ub4=0

    ----------------------------------------

    SO: 0x3f464e838, type: 57, owner: 0x3ffdb38d8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

     proc=0x400b0c258, name=dummy, file=ktccts.h LINE:415 ID:, pg=0

    (dummy) nxc=0, nlb=0  

===[ Callstack ]===

*** 2019-05-21 21:10:20.275

Process diagnostic dump for oracle@sm0ora10 (LMON), OS id=4604,

pid: 9, proc_ser: 1, sid: 145, sess_ser: 1

-------------------------------------------------------------------------------

os thread scheduling delay history: (sampling every 1.000000 secs)

  0.000000 secs at [ 21:10:19 ]

    NOTE: scheduling delay has not been sampled for 0.505539 secs  0.000000 secs from [ 21:10:15 - 21:10:20 ], 5 sec avg

  0.000000 secs from [ 21:09:20 - 21:10:20 ], 1 min avg

  0.000000 secs from [ 21:05:21 - 21:10:20 ], 5 min avg

*** 2019-05-21 21:10:21.374

loadavg : 20.48 16.00 13.13

swap info: free_mem = 194124.91M rsv = 258509.04M

           alloc = 239225.71M avail = 235374.84M swap_free = 254658.16M

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD

 0 O     grid  4604  4596   1  79 20        ? 168184            Sep 15 ?         781:33 asm_lmon_+ASM2

Short stack dump:

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<- thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224 <-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

*** 2019-05-21 21:10:22.421

==============================

LMON (ospid: 4604) has not moved for 176 sec (1558444222.1558444046)

kjzduptcctx: Notifying DIAG for crash event

----- Abridged Call Stack Trace -----

ksedsts()+1320<-kjzdicrshnfy()+388<-ksuitm()+1084<-kjgcr_KillInstance()+148<-kjgcr_Main()+7564<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

----- End of Abridged Call Stack Trace -----

    6、主機啟動時間

root@0ora10# hostname

ora10

root@ora10# uname -a

SunOS ora10 5.11 11.3 sun4v sparc sun4v

root@# uptime

 11:28pm  up 248 day(s) , 15:48,  8 users,  load average: 6.21, 7.27, 8.09

    6、根據ORA-15046和ORA-29770查詢和TRC 堆疊呼叫資訊提示:

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<- thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224 <-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380 ORACLE MOS官網,發現完全匹配上: (Doc ID 2159643.1)

Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770 

    7、後續檢視問題系統補丁安裝情況,發現rac所有節點的DB已經應用上 Doc ID 2159643.1 提到的補丁 25142535

oracle@ora10$opatch lsinventory|grep 25142535

Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018

oracle@ora10$

oracle@ora11$opatch lsinventory|grep 25142535

Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018

oracle@ora11$

      但是GRID GI沒有應用補丁  25142535

grid@ora10$ opatch lsinventory|grep  25142535

grid@ora10$ 

grid@ora11$ opatch lsinventory|grep  25142535

grid@ora11$ 

      8、就此問題諮詢MOS,ORACLE MOS給出的官方回覆也沒明確提示補丁 25142535確實是否需要在GI上應用

      9、補丁 25142535是補丁 18740837和  10194190的合集,其中 18740837小補丁的readme中有提示RAC環境需要在GI應用

    10、問題分析結論:

    由目前的實際問題情況及8和9的資訊彙總,可以確定 補丁 25142535需要在oracle 11.2.0.4 rac for solaris的所有節點的

GI叢集軟體上應用。可以肯定的是:當前的 oracle 11.2.0.4 rac for solaris asm and db crash and restart確實是rac在

solaris上執行248天導致的,在叢集所有節點DB軟體上打補丁25142535不能解決,官方也沒有明確說在GI軟體上同時應

用補丁25142535後到底能否解決:

Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For >  248  Days by LMHB with ORA-29770 

    

    如果有類似問題的客戶看到博文,如果有通過GI打補丁25142535解決此問題,或者通過其他途徑解決此問題歡迎:

評論留言,感謝!

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29357786/viewspace-2645319/,如需轉載,請註明出處,否則將追究法律責任。

相關文章