AIX 5.3: VG PERMISSION Changed After a System Reboot, Causing Oracle 10.2.0.5 Cluster Startup to Fail

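Posted by snowdba on 2014-09-06
A customer's database was being upgraded from a two-node Oracle 9.2.0.4 RAC to a two-node Oracle 10.2.0.5 RAC. The operating system is AIX 5.3, running on two P740 servers.
Coordination among the server, storage, and database vendors was poor, so problems kept cropping up during the installation. I was responsible for installing the database, and it wore me out. Hosts were rebooted without notice, and disks, backplanes, and fibre cables were replaced. This string of uncontrolled, unplanned outages hit the Oracle RAC again and again, and tested my patience.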

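Any DBA who has been through this will recognize the scene: hands spread, an innocent look, and the words "I didn't do anything." All I could do was use the error timestamps in the various logs to offer a "reminder": the system was shut down unexpectedly on day xx at xx:xx. Does that ring a bell?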

This post records a case in which the cluster software failed to start after a host reboot.

Environment

OS:       AIX 5.3
DB:       Oracle 10.2.0.5 RAC
Instance: scg1, scgl2
Storage:  ASM


Symptoms

The hosts were rebooted. The cluster started successfully on node 1 but failed to start on node 2.

Starting the cluster on node 2
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly

The log shows no output at all
tail -f /u01/app/oracle/product/10.2.0/crs_1/log/scgl2/crsd/crsd.log

Stopping the cluster on node 2
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs

The system reports an error
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Read-only file system] [30]

Root Cause Analysis

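The cluster failed to start because the disk holding the OCR could not be written to.
The owner, group, and permissions of the OCR disks all checked out.
Checking the VG PERMISSION attribute on the host revealed the cause: its value was passive-only, whereas the correct value is read/write. A volume group that is varied on in passive mode does not accept writes to its logical volumes, which is consistent with the "Read-only file system" error above.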
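
The check described above can be reduced to a single field. The following is a minimal sketch, not part of the original procedure: the helper name vg_permission is hypothetical, and it assumes the `lsvg` output layout shown in this post, where the value is the third whitespace-separated field of the "VG PERMISSION:" line.

```shell
# Hypothetical helper: print the VG PERMISSION value from `lsvg <vg>` output
# read on stdin (for example "passive-only" or "read/write").
vg_permission() {
    awk '/^VG PERMISSION:/ { print $3; exit }'
}

# Assumed usage on the AIX host:
#   lsvg datavg | vg_permission
```

Running this on every node right after a reboot is a quick way to notice a volume group that has come up in passive mode before trying to start CRS.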

Solution

Run the following on node 2
# lsvg datavg
VOLUME GROUP:       datavg                   VG IDENTIFIER:  00f7639d00004c00000001482d5c96b7
VG STATE:           active                   PP SIZE:        256 megabyte(s)
VG PERMISSION:      passive-only             TOTAL PPs:      3196 (818176 megabytes)
MAX LVs:            256                      FREE PPs:       1176 (301056 megabytes)
LVs:                10                       USED PPs:       2020 (517120 megabytes)
OPEN LVs:           0                        QUORUM:         3 (Enabled)
TOTAL PVs:          4                        VG DESCRIPTORS: 4
STALE PVs:          0                        STALE PPs:      0
ACTIVE PVs:         4                        AUTO ON:        no
Concurrent:         Enhanced-Capable         Auto-Concurrent: Disabled
VG Mode:            Concurrent
Node ID:            2                        Active Nodes:       1
MAX PPs per VG:     32512
MAX PPs per PV:     1016                     MAX PVs:        32
LTG size (Dynamic): 256 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:          no                       BB POLICY:      relocatable

0. Stop the cluster software on each of the two nodes (run the command once per node)
/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs -f

/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs -f

1. Vary off the VG
varyoffvg datavg

2. Vary on the VG
varyonvg datavg

3. Change the permission of the 10 LVs in datavg to read/write
# lsvg -l datavg
datavg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
ora_raw1_100gb      raw        400     400     1    closed/syncd  N/A
ora_raw2_100gb      raw        400     400     1    closed/syncd  N/A
ora_raw3_100gb      raw        400     400     1    closed/syncd  N/A
ora_raw4_100gb      raw        400     400     1    closed/syncd  N/A
ora_raw5_100gb      raw        400     400     2    closed/syncd  N/A
ora_raw6_1gb        raw        4       4       1    closed/syncd  N/A
ora_raw7_1gb        raw        4       4       1    closed/syncd  N/A
ora_raw8_1gb        raw        4       4       1    closed/syncd  N/A
ora_raw9_1gb        raw        4       4       1    closed/syncd  N/A
ora_raw10_1gb       raw        4       4       1    closed/syncd  N/A

chlv -p w ora_raw1_100gb
chlv -p w ora_raw2_100gb
chlv -p w ora_raw3_100gb
chlv -p w ora_raw4_100gb
chlv -p w ora_raw5_100gb
chlv -p w ora_raw6_1gb
chlv -p w ora_raw7_1gb
chlv -p w ora_raw8_1gb
chlv -p w ora_raw9_1gb
chlv -p w ora_raw10_1gb
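
The ten chlv commands above can also be generated from the `lsvg -l` listing rather than typed by hand. This is a sketch under assumptions: the helper names lv_names and gen_chlv are hypothetical, and the parsing assumes the `lsvg -l` layout shown in this post (one "datavg:" line, one column-header line, then one LV per line with the name in the first column).

```shell
# Hypothetical helper: print LV names from `lsvg -l <vg>` output on stdin,
# skipping the "<vg>:" line and the column-header line.
lv_names() {
    awk 'NR > 2 && NF > 0 { print $1 }'
}

# Hypothetical helper: print one "chlv -p w <lv>" command per LV.
# It only prints the commands; nothing is changed until they are executed.
gen_chlv() {
    lv_names | while read -r lv; do
        echo "chlv -p w $lv"
    done
}

# Assumed usage on the AIX host: review the output first, then run it.
#   lsvg -l datavg | gen_chlv
#   lsvg -l datavg | gen_chlv | sh
```

Printing the commands instead of executing them directly keeps a review step in place before any LV attribute is changed.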

4. Vary off the VG again
varyoffvg datavg

5. Vary on the VG again, this time in concurrent mode (-c)
varyonvg -c datavg

6. Check the permission: it has changed from passive-only to read/write
# lsvg datavg
VOLUME GROUP:       datavg                   VG IDENTIFIER:  00f7639d00004c00000001482d5c96b7
VG STATE:           active                   PP SIZE:        256 megabyte(s)
VG PERMISSION:      read/write               TOTAL PPs:      3196 (818176 megabytes)
MAX LVs:            256                      FREE PPs:       1176 (301056 megabytes)
LVs:                10                       USED PPs:       2020 (517120 megabytes)
OPEN LVs:           0                        QUORUM:         3 (Enabled)
TOTAL PVs:          4                        VG DESCRIPTORS: 4
STALE PVs:          0                        STALE PPs:      0
ACTIVE PVs:         4                        AUTO ON:        no
Concurrent:         Enhanced-Capable         Auto-Concurrent: Disabled
VG Mode:            Concurrent
Node ID:            2                        Active Nodes:       1
MAX PPs per VG:     32512
MAX PPs per PV:     1016                     MAX PVs:        32
LTG size (Dynamic): 256 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:          no                       BB POLICY:      relocatable

7. Start the cluster on each of the two hosts
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly

# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly

# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
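
The health check in this step can be turned into a pass/fail test when it has to be repeated on several nodes. A minimal sketch, assuming the three-line `crsctl check crs` output format shown above; the helper name crs_healthy is hypothetical.

```shell
# Hypothetical helper: read `crsctl check crs` output on stdin and succeed
# (exit status 0) only if CSS, CRS, and EVM are all reported healthy.
crs_healthy() {
    awk '
        /^CSS appears healthy/ { css = 1 }
        /^CRS appears healthy/ { crs = 1 }
        /^EVM appears healthy/ { evm = 1 }
        END { exit !(css && crs && evm) }
    '
}

# Assumed usage on each node:
#   /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl check crs | crs_healthy \
#       && echo "CRS stack healthy" || echo "CRS stack NOT healthy"
```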

8. Start the database, then verify the open mode on both instances
SQL> select open_mode from gv$database;

OPEN_MODE
----------
READ WRITE
READ WRITE

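This incident cost me a lot of time. The lessons learned:
Troubleshoot Oracle precisely. Once you have established that Oracle itself is not at fault, do not hesitate to hand the problem over and have the server and storage teams check their side.
Improve your own server and storage knowledge so that you can run more of these checks without depending on the server and storage engineers.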

Source: "ITPUB Blog", link: http://blog.itpub.net/29047826/viewspace-1265183/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
