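I had just finished restoring a customer's 20 TB database. After it had been running for a few days, the customer suddenly called to say that one node was down, so I connected over VPN and checked the CRS log: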
[oracle@rac2 crsd]$ tail -100f crsd.log
2014-07-19 21:37:02.223: [ CSSCLNT][1073453488]clsssInitNative: connect failed, rc 9
2014-07-19 21:37:02.225: [ CRSRTI][1073453488]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
2014-07-19 21:37:03.599: [ COMMCRS][1110501696]clsc_connect: (0xb438700) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac2_crs))
2014-07-19 21:37:03.599: [ CSSCLNT][1073453488]clsssInitNative: connect failed, rc 9
2014-07-19 21:37:03.600: [ CRSRTI][1073453488]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
The heartbeat (private interconnect) network turned out to be unreachable. Restarting the network interface fixed it:
[root@rac2 client]# ping rac1-priv
PING rac1-priv (192.168.2.81) 56(84) bytes of data.
From rac2-priv (192.168.2.83) icmp_seq=10 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=11 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=12 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=14 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=15 Destination Host Unreachable
From rac2-priv (192.168.2.83) icmp_seq=16 Destination Host Unreachable
--- rac1-priv ping statistics ---
19 packets transmitted, 0 received, +6 errors, 100% packet loss, time 18000ms, pipe 3
[root@rac2 client]# ifconfig bond1
bond1     Link encap:Ethernet  HWaddr 78:2B:CB:0D:32:49
          inet addr:192.168.2.83  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::7a2b:cbff:fe0d:3249/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:105 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21457 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6720 (6.5 KiB)  TX bytes:1373758 (1.3 MiB)
[root@rac2 client]# ifdown bond1
[root@rac2 client]# ifup bond1
[root@rac2 client]# ping rac1-priv
PING rac1-priv (192.168.2.81) 56(84) bytes of data.
64 bytes from rac1-priv (192.168.2.81): icmp_seq=1 ttl=64 time=0.146 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=2 ttl=64 time=0.102 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=3 ttl=64 time=0.085 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=4 ttl=64 time=0.095 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=5 ttl=64 time=0.146 ms
64 bytes from rac1-priv (192.168.2.81): icmp_seq=6 ttl=64 time=0.099 ms
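The interconnect check above can be turned into something scriptable. This is only a rough sketch, assuming the classic Linux ping summary line with an integer percentage as in the transcript; `ping_loss_pct` and `interconnect_ok` are hypothetical helper names, not Oracle or OS tools.

```shell
# Extract the packet-loss percentage from the summary line of Linux ping output.
# Reads ping output on stdin and prints the integer percentage, e.g. 100.
# Assumption: the summary looks like "... 100% packet loss ..." (integer value).
ping_loss_pct() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Succeed only when the peer answered every probe.
# Hypothetical helper: sends 5 probes with a 10 second overall deadline.
interconnect_ok() {
    local host=$1 loss
    loss=$(ping -c 5 -w 10 "$host" 2>/dev/null | ping_loss_pct)
    [ "$loss" = "0" ]
}
```

For example, `interconnect_ok rac1-priv || echo "private interconnect is down"` could run from cron on each node, so that a dead heartbeat link is noticed before CSS evicts the node.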
After starting CRS, I found that the listener and the database instance on this node would not start:
[oracle@rac2 ~]$ crs_stat -t -v
Name           Type         R/RA  F/FT  Target  State    Host
----------------------------------------------------------------------
ora.master.db  application  0/0   0/1   ONLINE  ONLINE   rac1
ora....rtdb.cs application  0/0   0/1   ONLINE  ONLINE   rac1
ora....er1.srv application  0/0   0/0   ONLINE  ONLINE   rac1
ora....r1.inst application  0/5   0/0   ONLINE  ONLINE   rac1
ora....r2.inst application  0/5   0/0   ONLINE  OFFLINE
ora....pcdb.cs application  0/0   0/1   ONLINE  ONLINE   rac1
ora....er1.srv application  0/0   0/0   ONLINE  ONLINE   rac1
ora....SM1.asm application  0/5   0/0   ONLINE  ONLINE   rac1
ora....N1.lsnr application  0/5   0/0   ONLINE  ONLINE   rac1
ora....bn1.gsd application  0/5   0/0   ONLINE  ONLINE   rac1
ora....bn1.ons application  0/3   0/0   ONLINE  ONLINE   rac1
ora....bn1.vip application  0/0   0/0   ONLINE  ONLINE   rac1
ora....SM2.asm application  0/5   0/0   ONLINE  ONLINE   rac2
ora....N2.lsnr application  0/5   0/0   ONLINE  OFFLINE
ora....bn2.gsd application  0/5   0/0   ONLINE  ONLINE   rac2
ora....bn2.ons application  0/3   0/0   ONLINE  ONLINE   rac2
ora....bn2.vip application  0/0   0/0   ONLINE  ONLINE   rac2
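With this many resources it is easy to overlook the two OFFLINE rows. A small filter, sketched here on the assumption that the column layout is exactly the one printed by `crs_stat -t -v` above (Target in column 5, State in column 6), lists only the resources that should be up but are not. `offline_resources` is a hypothetical name.

```shell
# Print the names of resources whose Target is ONLINE but whose State is OFFLINE.
# Reads `crs_stat -t -v` output on stdin; header and separator lines never match.
offline_resources() {
    awk '$5 == "ONLINE" && $6 == "OFFLINE" { print $1 }'
}
```

Usage would be `crs_stat -t -v | offline_resources`. On the output above it would print the instance and listener resources of the second node.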
Starting the listener manually failed with an error:
[oracle@rac2 ~]$ srvctl start listener -n rac2
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:LSNRCTL for Linux: Version 10.2.0.3.0 - Production on 19-JUL-2014 22:05:06
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Copyright (c) 1991, 2006, Oracle. All rights reserved.
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Starting /opt/oracle/app/database/bin/tnslsnr: please wait...
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:TNSLSNR for Linux: Version 10.2.0.3.0 - Production
rac2:ora.rac2.LISTENER_rac2.lsnr:System parameter file is /opt/oracle/app/database/network/admin/listener.ora
rac2:ora.rac2.LISTENER_rac2.lsnr:Log messages written to /opt/oracle/app/database/network/log/listener_rac2.log
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-01151: Missing listener name, LISTENER_rac2, in LISTENER.ORA
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Listener failed to start. See the error message(s) above...
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:LSNRCTL for Linux: Version 10.2.0.3.0 - Production on 19-JUL-2014 22:05:06
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Copyright (c) 1991, 2006, Oracle. All rights reserved.
rac2:ora.rac2.LISTENER_rac2.lsnr:
rac2:ora.rac2.LISTENER_rac2.lsnr:Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac2-vip)(PORT=1521)(IP=FIRST)))
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-12541: TNS:no listener
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-12560: TNS:protocol adapter error
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-00511: No listener
rac2:ora.rac2.LISTENER_rac2.lsnr: Linux Error: 111: Connection refused
rac2:ora.rac2.LISTENER_rac2.lsnr:Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.83)(PORT=1521)(IP=FIRST)))
rac2:ora.rac2.LISTENER_rac2.lsnr:TNS-12541: TNS:no listener
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-12560: TNS:protocol adapter error
rac2:ora.rac2.LISTENER_rac2.lsnr: TNS-00511: No listener
rac2:ora.rac2.LISTENER_rac2.lsnr: Linux Error: 111: Connection refused
CRS-0215: Could not start resource 'ora.rac2.LISTENER_rac2.lsnr'.
TNS-01151 made me suspect the listener configuration file, listener.ora, so I took a look:
[oracle@standbydbn2 ~]$ cd /opt/oracle/app/database/network/admin/
[oracle@standbydbn2 admin]$ ls -l
total 88
-rw-r--r-- 1 oracle oinstall  240 Jul 27  2011 1
-rw-r--r-- 1 oracle oinstall  378 Jul 27  2011 listener1107271PM5834.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 16:09 listener1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 19:00 listener1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall  448 Jul 17 19:01 listener1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall    0 Jul 17 22:10 listener.ora
-rw-r--r-- 1 oracle oinstall  553 Jul 27  2011 listener.ora.20110727bak
-rw-r--r-- 1 oracle oinstall  402 Oct  8  2011 listener.ora.bak
drwxr-x--- 2 oracle oinstall 4096 Jul 26  2011 samples
-rw-r----- 1 oracle oinstall  172 Dec 26  2003 shrept.lst
-rw-r--r-- 1 oracle oinstall   35 Jul 17 16:09 sqlnet1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall   35 Jul 17 19:00 sqlnet1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall   35 Jul 17 19:01 sqlnet1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall 4130 May  7 15:17 sqlnet.log
-rw-r--r-- 1 oracle oinstall   35 Jul 17 22:10 sqlnet.ora
-rw-r--r-- 1 oracle oinstall  416 Jul 27  2011 tnsnames1107271PM5834.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 16:09 tnsnames1407174PM0951.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 19:00 tnsnames1407177PM0037.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 19:01 tnsnames1407177PM0113.bak
-rw-r--r-- 1 oracle oinstall 2393 Jul 17 22:10 tnsnames.ora
-rw-r--r-- 1 oracle oinstall 1736 Jul 27  2011 tnsnames.ora.20110727bak
-rw-r--r-- 1 oracle oinstall 1977 Aug  2  2011 tnsnames.ora.20110802bak
[oracle@standbydbn2 admin]$ cat listener.ora
[oracle@standbydbn2 admin]$ cat listener.ora
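The zero-byte listener.ora is easy to miss in a long listing like this one. A quick check for empty Oracle Net configuration files might look like the sketch below; `empty_net_configs` is a hypothetical helper, and the three file names are simply the ones visible in the directory above.

```shell
# Print every Oracle Net config file in the given directory that exists
# but is empty (zero bytes), as listener.ora was in this incident.
empty_net_configs() {
    local dir=$1 f
    for f in listener.ora sqlnet.ora tnsnames.ora; do
        # -e: the file exists; ! -s: it has zero length
        if [ -e "$dir/$f" ] && [ ! -s "$dir/$f" ]; then
            echo "$dir/$f"
        fi
    done
}
```

Running `empty_net_configs /opt/oracle/app/database/network/admin` on the broken node would have flagged listener.ora straight away.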
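The file really was empty. I recreated it from the copy on the healthy node, and the listener started normally. Starting the database manually, however, produced errors in the alert log: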
Errors in file /opt/oracle/app/admin/master/bdump/master2_dbw0_25946.trc:
ORA-01157: cannot identify/lock data file 772 - see DBWR trace file
ORA-01110: data file 772: '+DATA5/xxx/datafile/xxx.dbf'
ORA-17503: ksfdopn:2 Failed to open file +DATA5/standby/datafile/xx.dbf
ORA-15001: diskgroup "DATA5" does not exist or is not mounted
ORA-15001: diskgroup "DATA5" does not exist or is not mounted
In asmcmd, the DATA5 disk group turned out to be missing:
[oracle@rac2 admin]$ export ORACLE_SID=+ASM2
[oracle@rac2 admin]$ asmcmd
ASMCMD> lsdg
State    TYPE    Rebal  Unbal  Sector  Block  AU       Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Name
MOUNTED  EXTERN  N      N      512     4096   1048576  6277049   494723   0                494723          0              DATA2/
MOUNTED  EXTERN  N      N      512     4096   1048576  6655909   388388   0                388388          0              DATA3/
MOUNTED  EXTERN  N      N      512     4096   1048576  8011682   546051   0                546051          0              DATA4/
MOUNTED  EXTERN  N      N      512     4096   1048576  6578238   339569   0                339569          0              NEW_DG/
MOUNTED  EXTERN  N      N      512     4096   1048576  307196    262681   0                262681          0              REDO01/
MOUNTED  EXTERN  N      N      512     4096   1048576  307196    262681   0                262681          0              REDO02/
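The same check can be scripted against the `lsdg` output. This is a minimal sketch that assumes the 10.2-style layout shown above, where the state is the first column and the name is the last column with a trailing slash. `dg_mounted` is a hypothetical helper name.

```shell
# Succeed if the named disk group appears as MOUNTED in `lsdg` output on stdin.
# Assumption: first column is the state, last column is the name followed by "/".
dg_mounted() {
    local dg=$1
    awk -v dg="$dg" '
        $1 == "MOUNTED" && $NF == (dg "/") { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

With ORACLE_SID set to the ASM instance as above, `asmcmd lsdg | dg_mounted DATA5 || echo "DATA5 is not mounted"` would report the problem found here.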
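On the healthy node I looked up which disks belong to the DATA5 disk group. The system uses ASMLib, so I checked them with oracleasm and found that none of the DATA5 disks could be recognized: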
[root@rac1 admin]# oracleasm listdisks
NEW1
NEW2
NEW3
NEW4
NEW5
NEW6
VOL1
VOL10
VOL11
VOL12
VOL13
VOL14
VOL15
VOL16
VOL17
VOL18
VOL2
VOL3
VOL4
VOL5
VOL6
VOL7
VOL8
VOL9
VOLDATA1
VOLDATA10
VOLDATA11
VOLDATA12
VOLDATA13
VOLDATA14
VOLDATA15
VOLDATA16
VOLDATA17
VOLDATA18
VOLDATA19
VOLDATA2
VOLDATA20
VOLDATA21
VOLDATA22
VOLDATA23
VOLDATA24
VOLDATA25
VOLDATA26
VOLDATA27
VOLDATA28
VOLDATA29
VOLDATA3
VOLDATA30
VOLDATA31
VOLDATA32
VOLDATA33
VOLDATA34
VOLDATA4
VOLDATA5
VOLDATA6
VOLDATA7
VOLDATA8
VOLDATA9
VOLREDO1
VOLREDO2
[root@rac1 admin]# oracleasm querydisk -p VOL1
Disk "VOL1" defines a device with no label
The ASMLib label of disk VOL1 had been lost.
[root@rac1 admin]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Cleaning disk "VOL1"
Cleaning disk "VOL16"
Cleaning disk "VOL17"
Cleaning disk "VOL2"
Cleaning disk "VOL3"
Cleaning disk "VOL4"
Cleaning disk "VOL5"
Cleaning disk "VOL7"
Cleaning disk "VOLDATA1"
Cleaning disk "VOLDATA30"
Cleaning disk "VOLDATA33"
Cleaning disk "VOLDATA4"
Cleaning disk "VOLDATA7"
Scanning system for ASM disks...
I rescanned the disks. Ouch: the labels of all these disks were gone.
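Before running `scandisks`, which cleans up the stale entries, it can help to record exactly which ASMLib disks have lost their labels. The filter below is a sketch that matches the message text shown in the `querydisk` output above; `unlabeled_disks` is a hypothetical helper name.

```shell
# Read `oracleasm querydisk` messages on stdin and print the names of
# the disks reported as having no label. Other lines are ignored.
unlabeled_disks() {
    sed -n 's/^Disk "\([^"]*\)" defines a device with no label.*/\1/p'
}

# Possible usage on a node (as root), before any rescan:
#   for d in $(oracleasm listdisks); do
#       oracleasm querydisk -p "$d" 2>&1
#   done | unlabeled_disks > /tmp/unlabeled_disks.txt
```

Saving that list first gives a record of the affected disk names to match against the storage devices later.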
I dug out earlier records to identify the exact devices, then checked the device state with powermt:
[root@rac2 ~]# powermt display dev=emcpowerc
Pseudo name=emcpowerc
CLARiiON ID=CKM00111201809 [R910]
Logical device ID=6006016025B12C00069BFB6DF996E011 [LUN 6]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0;
Owner: default=SP B, current=SP B       Array failover mode: 4
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
==============================================================================
   3 qla2xxx                  sde       SP A1     active  alive       0      0
   4 qla2xxx                  sdm       SP B1     active  alive       0      0
The paths were healthy, so I used kfed to read the ASM disk header and confirm the disk name:
[root@rac1 admin]# /opt/oracle/app/database/bin/kfed read /dev/emcpowerc1
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:              2147483648 ; 0x008: TYPE=0x8 NUMB=0x0
kfbh.check:                  1205065909 ; 0x00c: 0x47d3d8b5
kfbh.fcn.base:                      173 ; 0x010: 0x000000ad
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8
kfdhdb.driver.reserved[0]:            0 ; 0x008: 0x00000000
kfdhdb.driver.reserved[1]:            0 ; 0x00c: 0x00000000
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                168820736 ; 0x020: 0x0a100000
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:                    VOL1 ; 0x028: length=4
kfdhdb.grpname:                   DATA5 ; 0x048: length=5
kfdhdb.fgname:                     VOL1 ; 0x068: length=4
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             33005137 ; 0x0a8: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.crestmp.lo:           3051740160 ; 0x0ac: USEC=0x0 MSEC=0x177 SECS=0x1e MINS=0x2d
kfdhdb.mntstmp.hi:             33005137 ; 0x0b0: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.mntstmp.lo:           3060555776 ; 0x0b4: USEC=0x0 MSEC=0x318 SECS=0x26 MINS=0x2d
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80
kfdhdb.dsksize:                  511993 ; 0x0c4: 0x0007cff9
kfdhdb.pmcnt:                         6 ; 0x0c8: 0x00000006
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn:                      2 ; 0x0d4: 0x00000002
kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
kfdhdb.redomirrors[1]:            65535 ; 0x0da: 0xffff
kfdhdb.redomirrors[2]:            65535 ; 0x0dc: 0xffff
kfdhdb.redomirrors[3]:            65535 ; 0x0de: 0xffff
kfdhdb.dbcompat:              168820736 ; 0x0e0: 0x0a100000
kfdhdb.grpstmp.hi:             33005137 ; 0x0e4: HOUR=0x11 DAYS=0x12 MNTH=0x7 YEAR=0x7de
kfdhdb.grpstmp.lo:           3051595776 ; 0x0e8: USEC=0x0 MSEC=0xea SECS=0x1e MINS=0x2d
kfdhdb.ub4spare[0]:                   0 ; 0x0ec: 0x00000000
kfdhdb.ub4spare[1]:                   0 ; 0x0f0: 0x00000000
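Only a few fields of the kfed dump matter for this repair: `kfdhdb.dskname` (the ASM disk name), `kfdhdb.grpname` (the disk group) and `kfdhdb.hdrsts` (the header status). The sketch below pulls a single field out of `kfed read` output, assuming the "name: value ; offset" line layout shown above. `kfed_field` is a hypothetical helper, not part of kfed.

```shell
# Print the value of one kfdhdb.* field from `kfed read` output on stdin.
# Example field names: dskname, grpname, hdrsts.
# Assumption: the value is a single word immediately after "kfdhdb.<field>:".
kfed_field() {
    local field=$1
    awk -v f="kfdhdb.$field:" '$1 == f { print $2 }'
}
```

For the device above, `kfed read /dev/emcpowerc1 | kfed_field dskname` would give the name to pass to `oracleasm renamedisk`, and the `grpname` field confirms that the device really belongs to DATA5.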
Back up the disk header before changing anything:
[root@rac2 admin]# dd if=/dev/emcpowerc1 of=/tmp/VOL1.50m.dd bs=1M count=50
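Since the next step rewrites the disk header, it seems worth confirming that the backup is readable and matches the device. The following sketch wraps the same `dd` call and compares the copy with the source. It assumes GNU `dd` and GNU `cmp` (for `bs=1M` and `-n`), and `backup_disk_header` is a hypothetical helper name.

```shell
# Copy the first <mb> MiB (default 50) of <src> to <dst>, then verify that
# the copy is byte-identical to the start of the source.
backup_disk_header() {
    local src=$1 dst=$2 mb=${3:-50} bytes
    dd if="$src" of="$dst" bs=1M count="$mb" 2>/dev/null || return 1
    # Number of bytes actually written to the backup file.
    bytes=$(wc -c < "$dst")
    # cmp -n limits the comparison to the bytes that were copied (GNU cmp).
    cmp -s -n "$((bytes))" "$src" "$dst"
}
```

A call such as `backup_disk_header /dev/emcpowerc1 /tmp/VOL1.50m.dd 50` returns non-zero if either the copy or the comparison fails, so the rename can be skipped in that case.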
Use oracleasm renamedisk to write the label back; the -f option forces the change:
[root@rac2 disks]# oracleasm renamedisk -f /dev/emcpowerc1 VOL1
Writing disk header: done
Instantiating disk "VOL1": done
Scan and list the disks on both nodes:
[root@rac2 disks]# oracleasm listdisks
VOL1
... (rest omitted)
[root@rac1 admin]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "VOL1"
I repaired every affected disk with the same steps, mounted the disk group manually, and the database opened normally.
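With more than a dozen disks to fix, typing each command by hand invites mistakes. One cautious approach, sketched here as an assumption rather than what was actually done, is to keep a reviewed "device label" mapping file and only print the commands first. `rename_cmds` and the mapping file are hypothetical; the only real command used is `oracleasm renamedisk -f` from the step above.

```shell
# Read "device label" pairs on stdin and print (not run) the matching
# renamedisk commands, so they can be reviewed before execution.
rename_cmds() {
    local dev label
    while read -r dev label; do
        # Skip blank lines and comment lines in the mapping.
        case $dev in
            ''|'#'*) continue ;;
        esac
        printf 'oracleasm renamedisk -f %s %s\n' "$dev" "$label"
    done
}
```

A mapping line such as `/dev/emcpowerc1 VOL1` yields the exact command used earlier. Each label should first be confirmed against the `kfdhdb.dskname` value read by kfed from that device, and each header backed up with dd, before any printed command is run.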
A strange problem: this database had been restored only a few days earlier, and the ASMLib labels were lost while it was up and running...