Summary of an Oracle RAC failure caused by a disk array going down (original)

Posted by coolhe on 2010-08-28

Environment: Oracle 10g RAC + ASM + AIX

Nodes: 192.168.5.15 (db01), 192.168.5.16 (db02)

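The day before yesterday a colleague told me the database would not start and asked me to take a look. Running crs_stat showed that db02 (.16) was online while db01, the primary node, was offline. I stopped everything with crs_stop -all and restarted it with crs_start -all, which reported resources as missing or failed. I had not seen that message before, and it also pointed to a VIP failure. Checking the IP addresses, I found that the VIP on db01 was gone, while db02 was still normal. So I configured a virtual IP with the AIX command smitty mkinetvi, then stopped and restarted CRS again, but the problem remained.

After more than an hour of searching, I finally ran lspv and found that the original PVs were gone: four PVs were missing. The disks had disappeared, so of course the database could not start. I went to the machine room to check whether the fibre channel card or a fibre cable had been knocked loose, but everything looked normal. I then connected to the disk enclosure with the IBM400 enclosure management software and saw a warning light with a "logical path" error, so it really was an enclosure problem.

We contacted the storage vendor, who sent an engineer. He checked the system and fixed it. He did not explain how; he simply rebooted the fibre channel switch and reseated the fibre channel card, and that was that. I still do not know what the root cause was.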

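Today my colleague told me the disks were back. I ran lspv and, sure enough, they had all returned. The hdisks listed with none are raw devices and have no PVID.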

# lspv

hdisk0          00cc85bf3d2db424                    rootvg          active

hdisk1          00cc85bf404044eb                    rootvg          active

hdisk2          none                                None

hdisk3          none                                None

hdisk4          none                                None

hdisk5          none                                None

hdisk6          none                                None

hdisk7          none                                None

hdisk8          none                                None

hdisk9          none                                None

hdisk10         none                                None

hdisk11         none                                None

hdisk12         00cc85bf8266c2a8                    datavg          active

#
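A quick way to spot this kind of storage problem earlier is to filter the lspv output for disks without a PVID and compare the list against what ASM expects. The following is only a minimal sketch: the function name list_disks_without_pvid is made up for illustration, and it assumes the three- or four-column lspv layout shown above.

```shell
# List the hdisks that lspv reports without a PVID (second column is "none").
# Reads lspv-style text on stdin, so it can be tried without an AIX host.
# The function name is hypothetical, not part of AIX.
list_disks_without_pvid() {
    awk '$1 ~ /^hdisk/ && $2 == "none" { print $1 }'
}

# Assumed usage on the node itself:
#   lspv | list_disks_without_pvid
```

If the list is shorter than the number of raw disks the cluster normally sees, the disks are probably missing at the storage or fibre layer rather than in Oracle.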

Then I started the services with crs_start -all and got the following errors:

bash-3.00$ crs_start -all

Attempting to start `ora.db01.vip` on member `db01`

Attempting to start `ora.db02.vip` on member `db02`

Start of `ora.db02.vip` on member `db02` succeeded.

Attempting to start `ora.db02.ASM2.asm` on member `db02`

Start of `ora.db01.vip` on member `db01` failed.

Attempting to start `ora.db01.vip` on member `db02`

Start of `ora.db01.vip` on member `db02` succeeded.

db02 : CRS-1019: Resource ora.db01.ASM1.asm (application) cannot run on db02

db02 : CRS-1019: Resource ora.db01.ASM1.asm (application) cannot run on db02

db02 : CRS-1019: Resource ora.db01.LISTENER_DB01.lsnr (application) cannot run on db02

db02 : CRS-1019: Resource ora.db01.ASM1.asm (application) cannot run on db02

Start of `ora.db02.ASM2.asm` on member `db02` succeeded.

Attempting to start `ora.GASDB.GASDB2.inst` on member `db02`

Start of `ora.GASDB.GASDB2.inst` on member `db02` succeeded.

Attempting to start `ora.db02.LISTENER_DB02.lsnr` on member `db02`

Start of `ora.db02.LISTENER_DB02.lsnr` on member `db02` succeeded.

Attempting to start `ora.racdb.racdb2.inst` on member `db02`

Start of `ora.racdb.racdb2.inst` on member `db02` succeeded.

CRS-1002: Resource 'ora.db02.ons' is already running on member 'db02'

CRS-1002: Resource 'ora.GASDB.db' is already running on member 'db01'

Attempting to start `ora.db01.gsd` on member `db01`

Attempting to start `ora.db01.ons` on member `db01`

Attempting to start `ora.db02.gsd` on member `db02`

Attempting to start `ora.racdb.db` on member `db01`

Start of `ora.racdb.db` on member `db01` succeeded.

Start of `ora.db01.gsd` on member `db01` succeeded.

Start of `ora.db02.gsd` on member `db02` succeeded.

Start of `ora.db01.ons` on member `db01` succeeded.

CRS-0223: Resource 'ora.GASDB.GASDB1.inst' has placement error.

CRS-0223: Resource 'ora.GASDB.db' has placement error.

CRS-0223: Resource 'ora.db01.ASM1.asm' has placement error.

CRS-0223: Resource 'ora.db01.LISTENER_DB01.lsnr' has placement error.

CRS-0223: Resource 'ora.db02.ons' has placement error.

CRS-0223: Resource 'ora.racdb.racdb1.inst' has placement error.

 

bash-3.00$ crs_stat -t

Name           Type           Target    State     Host

------------------------------------------------------------

ora....B1.inst application    OFFLINE   OFFLINE

ora....B2.inst application    ONLINE    ONLINE    db02

ora.GASDB.db   application    ONLINE    ONLINE    db01

ora....SM1.asm application    OFFLINE   OFFLINE

ora....01.lsnr application    OFFLINE   OFFLINE

ora.db01.gsd   application    ONLINE    ONLINE    db01

ora.db01.ons   application    ONLINE    ONLINE    db01

ora.db01.vip   application    ONLINE    ONLINE    db02

ora....SM2.asm application    ONLINE    ONLINE    db02

ora....02.lsnr application    ONLINE    ONLINE    db02

ora.db02.gsd   application    ONLINE    ONLINE    db02

ora.db02.ons   application    ONLINE    ONLINE    db02

ora.db02.vip   application    ONLINE    ONLINE    db02

ora.racdb.db   application    ONLINE    ONLINE    db01

ora....b1.inst application    OFFLINE   OFFLINE

ora....b2.inst application    ONLINE    ONLINE    db02
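With this many resources it is easy to miss one that is down. As a small sketch, the OFFLINE rows can be pulled out of the crs_stat -t table with awk. The function name list_offline_resources is an assumption for illustration, and the names it prints are the truncated ones that crs_stat -t displays.

```shell
# Print the (truncated) names of resources whose State column is OFFLINE.
# Expects crs_stat -t style rows on stdin: Name Type Target State [Host].
# The function name is hypothetical.
list_offline_resources() {
    awk '$2 == "application" && $4 == "OFFLINE" { print $1 }'
}

# Assumed usage on a cluster node:
#   crs_stat -t | list_offline_resources
```

For the table above this would list the db01 instances, the db01 ASM instance and the db01 listener, which are exactly the resources that cannot start while the db01 VIP sits on the wrong node.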

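So it still looked like a VIP error. Had I configured the virtual IP incorrectly? The VIP on db02 was fine, so I compared the IP configuration of the two nodes, and sure enough, it was wrong: on db01 the VIP 192.168.5.17 sits on the virtual interface vi0, whereas db02 carries its own VIP 192.168.5.18 as an alias on en0 (and is currently hosting 192.168.5.17 on en0 as well).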

--15

bash-3.00# ifconfig -a

en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>

        inet 192.168.5.15 netmask 0xffffff00 broadcast 192.168.5.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>

        inet 10.168.5.15 netmask 0xff000000 broadcast 10.255.255.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

vi0: flags=84000041

        inet 192.168.5.17 netmask 0xffffff00

lo0: flags=e08084b

        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255

        inet6 ::1/0

         tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

--16

bash-3.00# ifconfig -a

en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>

        inet 192.168.5.16 netmask 0xffffff00 broadcast 192.168.5.255

        inet 192.168.5.18 netmask 0xffffff00 broadcast 192.168.5.255

        inet 192.168.5.17 netmask 0xffffff00 broadcast 192.168.5.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>

        inet 10.168.5.16 netmask 0xff000000 broadcast 10.255.255.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

lo0: flags=e08084b

        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255

        inet6 ::1/0

         tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
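Comparing two long ifconfig -a listings by eye is error-prone. The sketch below reports which interface currently holds a given address, which makes the misplacement obvious. The function name find_interface_for_ip is invented for this example, and the parsing assumes the AIX layout shown above (interface name at the start of a line, indented inet lines beneath it).

```shell
# Print the interface(s) that carry the given IPv4 address.
# $1: address to look for; stdin: ifconfig -a style output.
# The function name is hypothetical.
find_interface_for_ip() {
    awk -v ip="$1" '
        # A line such as "en0: flags=..." starts a new interface section.
        /^[a-z]+[0-9]+:/ { iface = $1; sub(/:$/, "", iface) }
        # Indented "inet <addr> ..." lines belong to the current interface.
        $1 == "inet" && $2 == ip { print iface }
    '
}

# Assumed usage on each node:
#   ifconfig -a | find_interface_for_ip 192.168.5.17
```

Fed the two listings above, it would print vi0 for node 15 and en0 for node 16, that is, the db01 VIP is on a virtual interface on its home node and aliased on the other node.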

 

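I then googled for solutions but could not find a step-by-step fix. Still, since it was clearly a VIP problem, fixing the IP configuration should be enough.

The steps were as follows:

1. ----ping db01, db02, db01_vip and db02_vip: all of them respond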

 

2. ----Stop the racdb database service

bash-3.00$ crs_stop ora.racdb.db

Attempting to stop `ora.racdb.db` on member `db01`

Stop of `ora.racdb.db` on member `db01` succeeded.

 

3. ----Start the db01 node applications with srvctl; the following messages appear

bash-3.00$ srvctl start nodeapps -n db01

db01:ora.db01.vip:IP:192.168.5.17 is not configured as alias (host=db01)

db01:ora.db01.vip:IP:192.168.5.17 is not configured as alias (host=db01)

CRS-0215: Could not start resource 'ora.db01.LISTENER_DB01.lsnr'.

 

4. ---Check CRS

bash-3.00$ crsctl check crs

CSS appears healthy

CRS appears healthy

EVM appears healthy
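If this check is to be repeated in a script, the three status lines can be turned into an exit code. This is only a sketch: crs_is_healthy is a made-up name, and it assumes the 10g-style output above with exactly three "appears healthy" lines (CSS, CRS, EVM).

```shell
# Succeed only if all three daemons (CSS, CRS, EVM) report "appears healthy".
# Reads crsctl check crs style output on stdin. The function name is hypothetical.
crs_is_healthy() {
    [ "$(grep -c 'appears healthy')" -eq 3 ]
}

# Assumed usage:
#   crsctl check crs | crs_is_healthy && echo "clusterware daemons OK"
```

A healthy result here only says the clusterware daemons are up; as this incident shows, individual resources such as the VIP can still be failing.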

 

5. ---Check the VIP

crs_stat -p ora.db01.vip

 

6. ---Stop all services

#crs_stop -all

 

7. ---On db01, delete the address from the virtual interface vi0 and add it as an IP alias on en0

#ifconfig vi0 192.168.5.17 delete
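Step 7 mentions adding the address as an alias on en0, but only the delete command is shown. A possible form of the pair of commands is sketched below as a dry run that prints them instead of executing them. The function name print_move_to_alias is hypothetical, and the netmask 255.255.255.0 is taken from the 0xffffff00 mask in the listings above; the alias syntax should be checked against the ifconfig documentation of the AIX release in use.

```shell
# Print (do not run) the commands that would move an address from one
# interface to an alias on another. All names here are illustrative.
# $1: IP address, $2: interface to remove it from,
# $3: interface to alias it on, $4: netmask in dotted form.
print_move_to_alias() {
    ip=$1; from_if=$2; to_if=$3; netmask=$4
    echo "ifconfig $from_if $ip delete"
    echo "ifconfig $to_if alias $ip netmask $netmask"
}

# Assumed usage for db01 in this incident:
#   print_move_to_alias 192.168.5.17 vi0 en0 255.255.255.0
```

Printing the commands first gives a chance to review them before running anything as root on a production node.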

 

8. ---On db02, delete the 192.168.5.17 address from en0

#ifconfig en0 192.168.5.17 delete

 

9. ---Run ifconfig -a on both nodes to check the IP addresses

--15

# ifconfig -a

en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>

        inet 192.168.5.15 netmask 0xffffff00 broadcast 192.168.5.255

        inet 192.168.5.17 netmask 0xffffff00 broadcast 192.168.5.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>

        inet 10.168.5.15 netmask 0xff000000 broadcast 10.255.255.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

vi0: flags=84000041

lo0: flags=e08084b

        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255

        inet6 ::1/0

         tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

--16

bash-3.00# ifconfig -a

en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>

        inet 192.168.5.16 netmask 0xffffff00 broadcast 192.168.5.255

        inet 192.168.5.18 netmask 0xffffff00 broadcast 192.168.5.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>

        inet 10.168.5.16 netmask 0xff000000 broadcast 10.255.255.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

vi0: flags=84000000<64BIT>

lo0: flags=e08084b

        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255

        inet6 ::1/0

         tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

10. ---Restart the services

#crs_start -all

bash-3.00$ crs_stat -t

Name           Type           Target    State     Host

------------------------------------------------------------

ora....B1.inst application    ONLINE    ONLINE    db01

ora....B2.inst application    ONLINE    ONLINE    db02

ora.GASDB.db   application    ONLINE    ONLINE    db01

ora....SM1.asm application    ONLINE    ONLINE    db01

ora....01.lsnr application    ONLINE    ONLINE    db01

ora.db01.gsd   application    ONLINE    ONLINE    db01

ora.db01.ons   application    ONLINE    ONLINE    db01

ora.db01.vip   application    ONLINE    ONLINE    db01

ora....SM2.asm application    ONLINE    ONLINE    db02

ora....02.lsnr application    ONLINE    ONLINE    db02

ora.db02.gsd   application    ONLINE    ONLINE    db02

ora.db02.ons   application    ONLINE    ONLINE    db02

ora.db02.vip   application    ONLINE    ONLINE    db02

ora.racdb.db   application    ONLINE    ONLINE    db01

ora....b1.inst application    ONLINE    ONLINE    db01

ora....b2.inst application    ONLINE    ONLINE    db02

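Finally solved! The main lesson from this incident is that a solid grasp of the RAC architecture and its concepts really matters: once you understand them, you can see where the problem lies and fix it.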

 

From the ITPUB Blog, link: http://blog.itpub.net/3090/viewspace-672035/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
