One RAC node crashed with a memory fault and became inaccessible

Posted by liuzhen_basis on 2014-08-26

    Environment:

    Application: SAP ERP 6.0

    Database: Oracle 11g RAC with ASM

    During a routine health check, transaction code DB02 (the SAP database monitor) was run in SAP:

    (screenshot: SAP transaction DB02 output)



    Checking the clusterware status shows only one node remaining:

    bash-3.00$ ./crsctl status resource -t

    --------------------------------------------------------------------------------

    NAME TARGET STATE SERVER STATE_DETAILS

    --------------------------------------------------------------------------------

    Local Resources

    --------------------------------------------------------------------------------

    ora.ACFS.dg

    ONLINE ONLINE r3prddb01

    ora.ARCH.dg

    ONLINE ONLINE r3prddb01

    ora.DATA.dg

    ONLINE ONLINE r3prddb01

    ora.LISTENER.lsnr

    ONLINE ONLINE r3prddb01

    ora.MLOG.dg

    ONLINE ONLINE r3prddb01

    ora.OLOG.dg

    ONLINE ONLINE r3prddb01

    ora.RECO.dg

    ONLINE ONLINE r3prddb01

    ora.VCR.dg

    ONLINE ONLINE r3prddb01

    ora.acfs.acfs.acfs

    ONLINE ONLINE r3prddb01

    ora.asm

    ONLINE ONLINE r3prddb01

    ora.gsd

    OFFLINE OFFLINE r3prddb01

    ora.net1.network

    ONLINE ONLINE r3prddb01

    ora.ons

    ONLINE ONLINE r3prddb01

    ora.registry.acfs

    ONLINE ONLINE r3prddb01

    --------------------------------------------------------------------------------

    Cluster Resources

    --------------------------------------------------------------------------------

    ora.LISTENER_SCAN1.lsnr

    1 ONLINE ONLINE r3prddb01

    ora.cvu

    1 ONLINE ONLINE r3prddb01

    ora.oc4j

    1 ONLINE ONLINE r3prddb01

    ora.p01.db

    1 ONLINE ONLINE r3prddb01 Open

    2 ONLINE OFFLINE

    ora.r3prddb01.vip

    1 ONLINE ONLINE r3prddb01

    ora.r3prddb02.vip

    1 ONLINE INTERMEDIATE r3prddb01 FAILED OVER

    ora.scan1.vip

    1 ONLINE ONLINE r3prddb01

    bash-3.00$
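
    Two quick cross-checks give the same picture (a sketch; crsctl is run from the Grid Infrastructure bin directory as above, the database name p01 is taken from the ora.p01.db resource, and srvctl for a database is normally run from the database home):

    # CRS/CSS/EVM health on every cluster node
    ./crsctl check cluster -all

    # which instances of database p01 are running, and on which nodes
    srvctl status database -d p01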

    Database instance alert log on db01:

    Sat Aug 23 11:17:20 2014

    Reconfiguration started (old inc 4, new inc 6)

    List of instances:

    1 (myinst: 1)

    Global Resource Directory frozen

    * dead instance detected - domain 0 invalid = TRUE

    Communication channels reestablished

    Master broadcasted resource hash value bitmaps

    Non-local Process blocks cleaned out

    Sat Aug 23 11:17:20 2014

    LMS 1: 6 GCS shadows cancelled, 0 closed, 0 Xw survived

    Sat Aug 23 11:17:20 2014

    LMS 0: 5 GCS shadows cancelled, 1 closed, 0 Xw survived

    Sat Aug 23 11:17:20 2014

    LMS 2: 5 GCS shadows cancelled, 2 closed, 0 Xw survived

    Set master node info

    Submitted all remote-enqueue requests

    Dwn-cvts replayed, VALBLKs dubious

    All grantable enqueues granted

    Post SMON to start 1st pass IR

    Sat Aug 23 11:17:20 2014

    Instance recovery: looking for dead threads

    Beginning instance recovery of 1 threads

    Submitted all GCS remote-cache requests

    Post SMON to start 1st pass IR

    Fix write in gcs resources

    Reconfiguration complete

    parallel recovery started with 31 processes

    Started redo scan

    Completed redo scan

    read 58319 KB redo, 11434 data blocks need recovery

    Started redo application at

    Thread 2: logseq 49560, block 81359

    Recovery of Online Redo Log: Thread 2 Group 44 Seq 49560 Reading mem 0

    Mem# 0: +OLOG/p01/onlinelog/log_g44m1.dbf

    Mem# 1: +MLOG/p01/onlinelog/log_g44m2.dbf

    Sat Aug 23 11:17:25 2014

    Setting Resource Manager plan SCHEDULER[0x447BF]:DEFAULT_MAINTENANCE_PLAN via scheduler window

    Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter

    Sat Aug 23 11:17:25 2014

    minact-scn: master found reconf/inst-rec before recscn scan old-inc#:6 new-inc#:6

    Completed redo application of 48.31MB

    Completed instance recovery at

    Thread 2: logseq 49560, block 197998, scn 11518227963

    10738 data blocks read, 11483 data blocks written, 58319 redo k-bytes read

    Thread 2 advanced to log sequence 49561 (thread recovery)

    Redo thread 2 internally disabled at seq 49561 (SMON)

    Sat Aug 23 11:17:27 2014

    Archived Log entry 91800 added for thread 2 sequence 49560 ID 0x592ddd4a dest 1:

    Sat Aug 23 11:17:27 2014

    ARC0: Archiving disabled thread 2 sequence 49561

    Archived Log entry 91801 added for thread 2 sequence 49561 ID 0x592ddd4a dest 1:

    minact-scn: master continuing after IR

    minact-scn: Master considers inst:2 dead

    Sat Aug 23 11:17:28 2014

    Beginning log switch checkpoint up to RBA [0xa4e4.2.10], SCN: 11518240393

    Thread 1 advanced to log sequence 42212 (LGWR switch)

    Current log# 35 seq# 42212 mem# 0: +OLOG/p01/onlinelog/log_g35m1.dbf

    Current log# 35 seq# 42212 mem# 1: +MLOG/p01/onlinelog/log_g35m2.dbf

    Archived Log entry 91802 added for thread 1 sequence 42211 ID 0x592ddd4a dest 1:
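
    From the surviving instance, the end state of instance recovery can be verified with a quick query (a sketch; thread 2 is the redo thread of the failed instance, per the log above):

    sqlplus -S / as sysdba <<'EOF'
    -- instances currently registered in the cluster database
    SELECT inst_id, instance_name, status FROM gv$instance;
    -- redo threads: thread 2 should no longer show STATUS = OPEN after instance recovery
    SELECT thread#, status, enabled, instance FROM v$thread;
    EOF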



    ASM instance alert log on DB01 (ASM01):

    Sat Aug 23 11:17:20 2014

    Reconfiguration started (old inc 4, new inc 6)

    List of instances:

    1 (myinst: 1)

    Global Resource Directory frozen

    * dead instance detected - domain 1 invalid = TRUE

    * dead instance detected - domain 2 invalid = TRUE

    * dead instance detected - domain 3 invalid = TRUE

    * dead instance detected - domain 4 invalid = TRUE

    * dead instance detected - domain 5 invalid = TRUE

    * dead instance detected - domain 6 invalid = TRUE

    * dead instance detected - domain 7 invalid = TRUE

    Communication channels reestablished

    Master broadcasted resource hash value bitmaps

    Non-local Process blocks cleaned out

    Sat Aug 23 11:17:20 2014

    LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived

    Set master node info

    Submitted all remote-enqueue requests

    Dwn-cvts replayed, VALBLKs dubious

    All grantable enqueues granted

    Post SMON to start 1st pass IR

    Sat Aug 23 11:17:20 2014

    NOTE: SMON starting instance recovery for group ACFS domain 1 (mounted)

    NOTE: F1X0 found on disk 1 au 49 fcn 0.12248

    Submitted all GCS remote-cache requests

    Post SMON to start 1st pass IR

    Fix write in gcs resources

    Reconfiguration complete

    NOTE: starting recovery of thread=2 ckpt=19.43 group=1 (ACFS)

    NOTE: SMON waiting for thread 2 recovery enqueue

    NOTE: SMON about to begin recovery lock claims for diskgroup 1 (ACFS)

    NOTE: SMON successfully validated lock domain 1

    NOTE: advancing ckpt for thread=2 ckpt=19.43

    NOTE: SMON did instance recovery for group ACFS domain 1

    Sat Aug 23 11:17:20 2014

    ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

    NOTE: SMON starting instance recovery for group ARCH domain 2 (mounted)

    NOTE: F1X0 found on disk 9 au 113 fcn 0.41343439

    NOTE: starting recovery of thread=2 ckpt=77.3254 group=2 (ARCH)

    NOTE: SMON waiting for thread 2 recovery enqueue

    NOTE: SMON about to begin recovery lock claims for diskgroup 2 (ARCH)

    NOTE: SMON successfully validated lock domain 2

    NOTE: advancing ckpt for thread=2 ckpt=77.3254

    NOTE: SMON did instance recovery for group ARCH domain 2

    ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

    NOTE: SMON starting instance recovery for group DATA domain 3 (mounted)

    NOTE: F1X0 found on disk 15 au 60241 fcn 0.5143392

    NOTE: starting recovery of thread=2 ckpt=22.3858 group=3 (DATA)

    NOTE: SMON waiting for thread 2 recovery enqueue

    NOTE: SMON about to begin recovery lock claims for diskgroup 3 (DATA)

    NOTE: SMON successfully validated lock domain 3

    NOTE: advancing ckpt for thread=2 ckpt=22.3858

    NOTE: SMON did instance recovery for group DATA domain 3

    ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

    NOTE: SMON starting instance recovery for group MLOG domain 4 (mounted)

    NOTE: F1X0 found on disk 3 au 639 fcn 0.120137

    NOTE: starting recovery of thread=2 ckpt=23.2161 group=4 (MLOG)

    NOTE: SMON waiting for thread 2 recovery enqueue

    NOTE: SMON about to begin recovery lock claims for diskgroup 4 (MLOG)

    NOTE: SMON successfully validated lock domain 4

    NOTE: advancing ckpt for thread=2 ckpt=23.2161

    NOTE: SMON did instance recovery for group MLOG domain 4

    ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

    NOTE: SMON starting instance recovery for group OLOG domain 5 (mounted)

    NOTE: F1X0 found on disk 3 au 637 fcn 0.121291

    NOTE: starting recovery of thread=2 ckpt=23.2261 group=5 (OLOG)

    NOTE: SMON waiting for thread 2 recovery enqueue

    NOTE: SMON about to begin recovery lock claims for diskgroup 5 (OLOG)

    NOTE: SMON successfully validated lock domain 5

    NOTE: advancing ckpt for thread=2 ckpt=23.2261

    NOTE: SMON did instance recovery for group OLOG domain 5

    ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

    NOTE: SMON starting instance recovery for group RECO domain 6 (mounted)

    NOTE: F1X0 found on disk 11 au 11 fcn 0.2264

    NOTE: starting recovery of thread=2 ckpt=19.6 group=6 (RECO)

    NOTE: SMON waiting for thread 2 recovery enqueue

    NOTE: SMON about to begin recovery lock claims for diskgroup 6 (RECO)

    NOTE: SMON successfully validated lock domain 6

    NOTE: advancing ckpt for thread=2 ckpt=19.6

    NOTE: SMON did instance recovery for group RECO domain 6

    ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

    NOTE: SMON starting instance recovery for group VCR domain 7 (mounted)

    NOTE: F1X0 found on disk 2 au 177 fcn 0.1216

    NOTE: starting recovery of thread=2 ckpt=16.13 group=7 (VCR)

    NOTE: SMON waiting for thread 2 recovery enqueue

    NOTE: SMON about to begin recovery lock claims for diskgroup 7 (VCR)

    NOTE: SMON successfully validated lock domain 7

    NOTE: advancing ckpt for thread=2 ckpt=16.13

    NOTE: SMON did instance recovery for group VCR domain 7
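
    On the surviving node, a quick check that every diskgroup stayed mounted after the ASM-side recovery (a sketch; run with the environment pointed at the local ASM instance, e.g. +ASM1):

    sqlplus -S / as sysasm <<'EOF'
    -- all diskgroups should report STATE = MOUNTED on the surviving ASM instance
    SELECT name, state, type, total_mb, free_mb FROM v$asm_diskgroup ORDER BY name;
    EOF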



    From the logs above, the failure occurred at 11:17 on August 23rd. We contacted colleagues in the data center, who found a hardware alarm for that machine on the M9000: a memory fault. A repair request was filed.

    The machine went in for repair...

    The next day the memory fault on node 02 was fixed and its operating system was booted. After a short wait, the clusterware came up automatically and the database instance on node 02 recovered as well.

    On this machine the clusterware is configured to start automatically with the operating system, which is also the default setting after installation.

    To turn automatic startup off, run crsctl disable crs; use crsctl enable crs to switch it back on.

    With autostart disabled, the stack must then be started manually with crsctl start crs; a minimal check-and-set sequence is sketched below.
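
    For reference (a sketch using standard 11.2 crsctl commands, run as root from the Grid Infrastructure bin directory):

    # show whether Oracle High Availability Services autostart is enabled
    ./crsctl config crs

    # disable autostart, so the stack no longer starts with the OS
    ./crsctl disable crs

    # re-enable autostart (the default after installation)
    ./crsctl enable crs

    # with autostart disabled, start the stack manually after a reboot
    ./crsctl start crs

    Clusterware status after node 02 came back: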

    bash-3.00$ ./crsctl status resource -t

    --------------------------------------------------------------------------------

    NAME TARGET STATE SERVER STATE_DETAILS

    --------------------------------------------------------------------------------

    Local Resources

    --------------------------------------------------------------------------------

    ora.ACFS.dg

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.ARCH.dg

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.DATA.dg

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.LISTENER.lsnr

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.MLOG.dg

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.OLOG.dg

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.RECO.dg

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.VCR.dg

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.acfs.acfs.acfs

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.asm

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.gsd

    OFFLINE OFFLINE r3prddb01

    OFFLINE OFFLINE r3prddb02

    ora.net1.network

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.ons

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    ora.registry.acfs

    ONLINE ONLINE r3prddb01

    ONLINE ONLINE r3prddb02

    --------------------------------------------------------------------------------

    Cluster Resources

    --------------------------------------------------------------------------------

    ora.LISTENER_SCAN1.lsnr

    1 ONLINE ONLINE r3prddb01

    ora.cvu

    1 ONLINE ONLINE r3prddb01

    ora.oc4j

    1 ONLINE ONLINE r3prddb01

    ora.p01.db

    1 ONLINE ONLINE r3prddb01 Open

    2 ONLINE ONLINE r3prddb02 Open

    ora.r3prddb01.vip

    1 ONLINE ONLINE r3prddb01

    ora.r3prddb02.vip

    1 ONLINE ONLINE r3prddb02

    ora.scan1.vip

    1 ONLINE ONLINE r3prddb01


    The DB02 view in SAP now shows both instances:

    (screenshot: SAP transaction DB02 output showing both instances)
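
    The same can be confirmed from the database side (a sketch; both instances should now report OPEN):

    sqlplus -S / as sysdba <<'EOF'
    SELECT inst_id, instance_name, host_name, status FROM gv$instance;
    EOF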

From the ITPUB blog. Link: http://blog.itpub.net/27771627/viewspace-1257887/. Please credit the source when reposting; otherwise legal action may be taken.
