ORACLE 11.2.0.4 rac for linux 鏈路宕導致的單節點異常當機

清風艾艾發表於2018-06-23
    2018年5月24日,一客戶申訴其一套oracle 11.2.0.4 rac for linux異常當機,從ASM告警日誌發現有磁碟IO報錯,申請協助處理。
客戶環境資訊:
作業系統版本:rhel 6.9 X86_64
資料庫版本:oracle rac 11.2.0.4

客戶的疑問:鏈路宕為什麼導致節點2當機而節點1卻沒有受到影響

客戶反饋的ASM報錯資訊:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Handle 0x7fffe009f230 from lib :UFS:: for disk :/dev/mapper/ssdredo_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Handle 0x7fffe009fe90 from lib :UFS:: for disk :/dev/mapper/data_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe00993b0 for disk :/dev/mapper/data_2:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe0099d50 for disk :/dev/mapper/data_3:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009a970 for disk :/dev/mapper/fra_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009b590 for disk :/dev/mapper/ocr_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009c1b0 for disk :/dev/mapper/ocr_2:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009cdd0 for disk :/dev/mapper/ocr_3:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009d9f0 for disk :/dev/mapper/ssddata_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009e610 for disk :/dev/mapper/ssddata_2:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009f230 for disk :/dev/mapper/ssdredo_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
     本次客戶反饋的oracle rac單節點當機相關分析如下:
     由以上ASM告警日誌提示,需要觀察故障前叢集的作業系統日誌是否有儲存相關鏈路異常資訊,節點1作業系統日誌提示如下:
May 23 03:37:02 soadb1 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
May 23 16:21:19 soadb1 kernel: lpfc 0000:c4:00.1: 1:1305 Link Down Event xa received Data: xa x20 x800110 x0 x0
May 23 16:21:23 soadb1 kernel: sd 2:0:0:1: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:3: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:4: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:6: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:7: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:9: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 8:176.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 8:208.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 8:224.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 65:0.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 65:16.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 65:208.
    觀察故障前,節點2的作業系統日誌:
May 23 03:44:03 oadb2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
May 23 16:17:45 oadb2 kernel: lpfc 0000:c4:00.1: 1:1305 Link Down Event x2 received Data: x2 x20 x800110 x0 x0
May 23 16:17:49 oadb2 multipathd: 8:176: mark as failed
May 23 16:17:49 oadb2 multipathd: data_2: Entering recovery mode: max_retries=6
May 23 16:17:49 oadb2 multipathd: data_2: remaining active paths: 0
May 23 16:17:49 oadb2 kernel: sd 2:0:0:1: rejecting I/O to offline device
May 23 16:17:49 oadb2 kernel: device-mapper: multipath: Failing path 8:176.
May 23 16:17:49 oadb2 kernel: sd 2:0:0:4: rejecting I/O to offline device
May 23 16:17:49 oadb2 kernel: sd 2:0:0:4: [sdo] killing request
May 23 16:17:49 oadb2 kernel: sd 2:0:0:5: rejecting I/O to offline device
May 23 16:17:49 oadb2 kernel: sd 2:0:0:5: [sdp] killing request
May 23 16:17:49 oadb2 kernel: sd 2:0:0:6: rejecting I/O to offline device
May 23 16:17:49 oadb2 kernel: sd 2:0:0:6: [sdq] killing request
May 23 16:17:49 oadb2 kernel: sd 2:0:0:4: [sdo] 
May 23 16:17:49 oadb2 kernel: sd 2:0:0:5: [sdp] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 23 16:17:49 oadb2 kernel: sd 2:0:0:4: [sdo] CDB: Write(10): 2a 00 00 05 00 12 00 00 01 00
May 23 16:17:49 oadb2 kernel: end_request: I/O error, dev sdo, sector 327698
     結合作業系統日誌和ASM告警日誌,可判斷本次農信oracle 節點2當機的原因是儲存鏈路宕,導致節點2的OCR出現IO異常而當機,故障發生流程如下:
1、5月23 16:17:45節點2作業系統檢測到鏈路宕,資訊如下
May 23 16:17:45 oadb2 kernel: lpfc 0000:c4:00.1: 1:1305 Link Down Event x2 received Data: x2 x20 x800110 x0 x0
2、5月23 16:17:49 節點2作業系統日誌檢測到核心IO請求錯誤,位置:磁碟機代號sdo,扇區 327698
May 23 16:17:49 oadb2 kernel: end_request: I/O error, dev sdo, sector 327698
3、5月23 16:17:50 節點2作業系統日誌檢測到多路徑對映失敗
May 23 16:17:50 oadb2 kernel: device-mapper: multipath: Failing path 8:192.
May 23 16:17:50 oadb2 kernel: rport-2:0-8: blocked FC remote port time out: removing target and saving binding
4、5月23 16:17:51節點2作業系統日誌提示多路徑檢測到無法完成flush操作
May 23 16:17:51 oadb2 multipathd: ocr_1: map in use
May 23 16:17:51 oadb2 multipathd: ocr_1: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_1: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_1: Entering recovery mode: max_retries=6
May 23 16:17:51 oadb2 multipathd: sdo [8:224]: path removed from map ocr_1
May 23 16:17:51 oadb2 multipathd: sdp: remove path (uevent)
May 23 16:17:51 oadb2 multipathd: ocr_2: map in use
May 23 16:17:51 oadb2 multipathd: ocr_2: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_2: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_2: Entering recovery mode: max_retries=6
May 23 16:17:51 oadb2 multipathd: sdp [8:240]: path removed from map ocr_2
May 23 16:17:51 oadb2 multipathd: sdq: remove path (uevent)
May 23 16:17:51 oadb2 multipathd: ocr_3: map in use
May 23 16:17:51 oadb2 multipathd: ocr_3: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_3: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_3: Entering recovery mode: max_retries=6
May 23 16:17:51 oadb2 multipathd: sdq [65:0]: path removed from map ocr_3
5、5月23 16:18節點2的作業系統多路徑設定ocr相關的磁碟不可用
May 23 16:18:20 oadb2 multipathd: ocr_1: Disable queueing
May 23 16:18:20 oadb2 multipathd: ocr_2: Disable queueing
May 23 16:18:21 oadb2 multipathd: ocr_3: Disable queueing
6、5月 23 16:21節點2作業系統日誌提示oracle rac叢集核心程式ocssd程式出現異常
May 23 16:21:08 oadb2 abrt[52037]: Saved core dump of pid 27479 (/u01/app/11.2.0/grid/bin/ocssd.bin) to /var/spool/abrt/ccpp-2018-05-23-16:21:07-27479 (101912576 bytes)
May 23 16:21:08 oadb2 abrtd: Directory 'ccpp-2018-05-23-16:21:07-27479' creation detected
May 23 16:21:09 oadb2 abrtd: Executable '/u01/app/11.2.0/grid/bin/ocssd.bin' doesn't belong to any package and ProcessUnpackaged is set to 'no'
May 23 16:21:09 oadb2 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2018-05-23-16:21:07-27479' exited with 1
May 23 16:21:09 oadb2 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2018-05-23-16:21:07-27479'
May 23 16:21:11 oadb2 ntpd[8087]: Deleting interface #8 bond1:1, 169.254.211.6#123, interface stats: received=0, sent=0, dropped=0, active_time=3515469 secs

7、5月 23 16:21節點2ASM日誌提示,由於ASM數次嘗試上線DATA磁碟組失敗例項被PMON程式終止
Wed May 23 16:18:52 2018
ERROR: no read quorum in group: required 2, found 0 disks
NOTE: cache dismounting (clean) group 1/0x5ABDEE06 (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 29399, image: oracle@soadb2 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 1/0x5ABDEE06 (DATA) 
NOTE: cache ending mount (fail) of group DATA number=1 incarn=0x5abdee06
NOTE: cache deleting context for group DATA 1/0x5abdee06
GMON dismounting group 1 at 37 for pid 29, osid 29399
ERROR: diskgroup DATA was not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15040: diskgroup is incomplete
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
ERROR: ALTER DISKGROUP DATA MOUNT /* asm agent *//* {0:7:10794} */
Wed May 23 16:21:07 2018
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Wed May 23 16:21:07 2018
PMON (ospid: 28806): terminating the instance due to error 481
Wed May 23 16:21:08 2018
ORA-1092 : opitsk aborting process
Wed May 23 16:21:08 2018
License high water mark = 12
Instance terminated by PMON, pid = 28806
USER (ospid: 52046): terminating the instance
Instance terminated by USER, pid = 52046

    總結:OA系統,觀察節點1的日誌,相比節點2的日誌OCR磁碟組沒有出現Disable queueing,因此節點1的
DB和GI例項沒有收到影響。

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29357786/viewspace-2156568/,如需轉載,請註明出處,否則將追究法律責任。

相關文章