After an AIX 5.3 reboot, a changed VG PERMISSION caused the Oracle 10.2.0.5 cluster to fail to start
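A customer's database was being upgraded from a two-node Oracle 9.2.0.4 RAC to a two-node Oracle 10.2.0.5 RAC. The operating system is AIX 5.3, running on two P740 servers.
Coordination among the host, storage, and database vendors was poor, so problems kept cropping up throughout the installation. I was responsible for installing the database, and it wore me out. Hosts were rebooted without notice, and disks, backplanes, and fibre cables were swapped. One uncontrolled, unplanned outage after another hammered the Oracle RAC, and my patience with it.
Any DBA who has been through this has seen the scene: hands spread, an innocent look, and "I didn't do anything." All I could do was use the error timestamps in the various logs as a "reminder": the system was shut down unexpectedly on day xx at xx:xx. Does that ring a bell....?
This post records one case in which the cluster software failed to start after a host reboot.
Environment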
OS: AIX 5.3
DB: Oracle 10.2.0.5 RAC
Instance: scg1, scgl2
Storage: ASM
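Symptoms
The hosts were rebooted. The cluster on node 1 started successfully; the cluster on node 2 failed to start.
Start the cluster on node 2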
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
The log shows no output at all
tail -f /u01/app/oracle/product/10.2.0/crs_1/log/scgl2/crsd/crsd.log
Stop the cluster on node 2
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs
The system reports an error
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Read-only file system] [30]
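Analysis
The cluster failed to start because the disk holding the OCR could not be written to.
The owner, group, and permissions of the OCR disk devices were all checked and found to be correct.
Checking the host's VG PERMISSION revealed the cause: the value was passive-only, whereas the correct value is read/write.
Solution
Run the following on node 2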
# lsvg datavg
VOLUME GROUP: datavg VG IDENTIFIER: 00f7639d00004c00000001482d5c96b7
VG STATE: active PP SIZE: 256 megabyte(s)
VG PERMISSION: passive-only TOTAL PPs: 3196 (818176 megabytes)
MAX LVs: 256 FREE PPs: 1176 (301056 megabytes)
LVs: 10 USED PPs: 2020 (517120 megabytes)
OPEN LVs: 0 QUORUM: 3 (Enabled)
TOTAL PVs: 4 VG DESCRIPTORS: 4
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 4 AUTO ON: no
Concurrent: Enhanced-Capable Auto-Concurrent: Disabled
VG Mode: Concurrent
Node ID: 2 Active Nodes: 1
MAX PPs per VG: 32512
MAX PPs per PV: 1016 MAX PVs: 32
LTG size (Dynamic): 256 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
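The check above can be scripted so that it runs after every reboot, before CRS is started. The following is only a minimal sketch: the function names vg_permission and check_vg_writable are hypothetical, and it assumes the lsvg output has the same layout as shown above, with the permission value as the third whitespace-separated field on the VG PERMISSION line.

```shell
#!/bin/sh
# Print the VG PERMISSION value from `lsvg <vg>` output read on stdin.
# Assumes the value is the third field on the "VG PERMISSION:" line.
vg_permission() {
    awk '/VG PERMISSION:/ { print $3; exit }'
}

# Return 0 only when the permission is read/write; warn and return 1 otherwise.
check_vg_writable() {
    perm=$(vg_permission)
    if [ "$perm" = "read/write" ]; then
        echo "OK: VG PERMISSION is $perm"
    else
        echo "WARNING: VG PERMISSION is ${perm:-unknown}, expected read/write"
        return 1
    fi
}

# Intended use on the node (not executed here):
#   lsvg datavg | check_vg_writable
```

The functions read from stdin rather than calling lsvg themselves, so they can be tried against saved output on any machine.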
0. Stop the cluster software on each of the two nodes
/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs -f
/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs -f
1. Vary off the VG
varyoffvg datavg
2. Vary on the VG
varyonvg datavg
3. Change the permission of the 10 LVs in datavg
# lsvg -l datavg
datavg:
LV NAME TYPE LPs PPs PVs LV STATE MOUNT POINT
ora_raw1_100gb raw 400 400 1 closed/syncd N/A
ora_raw2_100gb raw 400 400 1 closed/syncd N/A
ora_raw3_100gb raw 400 400 1 closed/syncd N/A
ora_raw4_100gb raw 400 400 1 closed/syncd N/A
ora_raw5_100gb raw 400 400 2 closed/syncd N/A
ora_raw6_1gb raw 4 4 1 closed/syncd N/A
ora_raw7_1gb raw 4 4 1 closed/syncd N/A
ora_raw8_1gb raw 4 4 1 closed/syncd N/A
ora_raw9_1gb raw 4 4 1 closed/syncd N/A
ora_raw10_1gb raw 4 4 1 closed/syncd N/A
chlv -p w ora_raw1_100gb
chlv -p w ora_raw2_100gb
chlv -p w ora_raw3_100gb
chlv -p w ora_raw4_100gb
chlv -p w ora_raw5_100gb
chlv -p w ora_raw6_1gb
chlv -p w ora_raw7_1gb
chlv -p w ora_raw8_1gb
chlv -p w ora_raw9_1gb
chlv -p w ora_raw10_1gb
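Typing ten chlv commands by hand is error-prone, so the list can be generated from the lsvg -l output instead. This is a sketch only: lv_names and print_chlv_commands are hypothetical helper names, and it assumes the first two lines of lsvg -l output are the "datavg:" line and the column header, as in the listing above. It prints the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Print LV names from `lsvg -l <vg>` output on stdin, skipping the
# "<vg>:" line and the column-header line.
lv_names() {
    awk 'NR > 2 && NF > 0 { print $1 }'
}

# Print (do not run) one `chlv -p w` command per LV.
print_chlv_commands() {
    lv_names | while read -r lv; do
        echo "chlv -p w $lv"
    done
}

# Intended use on the node (not executed here):
#   lsvg -l datavg | print_chlv_commands        # review the commands
#   lsvg -l datavg | print_chlv_commands | sh   # run them as root
```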
4. Vary off the VG again
varyoffvg datavg
5. Vary on the VG again, this time in concurrent mode
varyonvg -c datavg
6. Check the permission: it has changed from passive-only to read/write
# lsvg datavg
VOLUME GROUP: datavg VG IDENTIFIER: 00f7639d00004c00000001482d5c96b7
VG STATE: active PP SIZE: 256 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 3196 (818176 megabytes)
MAX LVs: 256 FREE PPs: 1176 (301056 megabytes)
LVs: 10 USED PPs: 2020 (517120 megabytes)
OPEN LVs: 0 QUORUM: 3 (Enabled)
TOTAL PVs: 4 VG DESCRIPTORS: 4
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 4 AUTO ON: no
Concurrent: Enhanced-Capable Auto-Concurrent: Disabled
VG Mode: Concurrent
Node ID: 2 Active Nodes: 1
MAX PPs per VG: 32512
MAX PPs per PV: 1016 MAX PVs: 32
LTG size (Dynamic): 256 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
7. Start the cluster on each of the two hosts
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
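The three "appears healthy" lines can also be checked by a script rather than by eye, for example before moving on to start the database. A minimal sketch follows, assuming the crsctl check crs output has exactly the wording shown above; crs_healthy is a hypothetical helper name.

```shell
#!/bin/sh
# Return 0 only when CSS, CRS and EVM all report "appears healthy"
# in `crsctl check crs` output read on stdin.
crs_healthy() {
    awk '
        /^CSS appears healthy/ { css = 1 }
        /^CRS appears healthy/ { crs = 1 }
        /^EVM appears healthy/ { evm = 1 }
        END { exit ((css && crs && evm) ? 0 : 1) }
    '
}

# Intended use on each node (not executed here):
#   /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl check crs | crs_healthy \
#       && echo "CRS stack is up"
```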
8. Start the database
SQL> select open_mode from gv$database;
OPEN_MODE
----------
READ WRITE
READ WRITE
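This incident cost me a lot of time. The lessons I took from it:
Be precise when troubleshooting Oracle. Once you have established that Oracle itself is not at fault, be willing to push the problem out and have the host and storage teams check their side.
Build up your own host and storage knowledge, so that you can run more checks without depending on the host and storage engineers.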
Source: "ITPUB Blog", link: http://blog.itpub.net/29047826/viewspace-1265183/. If you repost this article, please credit the source; otherwise legal liability will be pursued.