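AIX 5.3: VG PERMISSION changed after a system reboot, causing Oracle 10.2.0.5 cluster startup to fail
A customer's database was being upgraded from a two-node Oracle 9.2.0.4 RAC to a two-node Oracle 10.2.0.5 RAC. The operating system is AIX 5.3, running on two P740 servers.
Coordination between the server, storage, and database vendors was poor, so problems kept cropping up during the installation. I was responsible for installing the database, and it wore me out. Hosts were rebooted, disks replaced, backplanes replaced, and fibre cables replaced, all without notice. This string of uncontrolled, unexpected outages hit the Oracle RAC again and again, and wore down my patience.
Any DBA who has been through this knows the scene: hands spread, an innocent look, and "I didn't do anything." All I could do was take the error timestamps from the various logs and offer a "reminder": the system was shut down unexpectedly on day xx at xx:xx. Does that ring a bell?
This post records a case in which the cluster software failed to start after a host reboot.
Environment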
OS: AIX 5.3
DB: Oracle 10.2.0.5 RAC
Instance: scg1, scgl2
Storage: ASM
Symptoms
The hosts were rebooted. The cluster started successfully on host 1 but failed to start on host 2.
Start the cluster on host 2
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
The log shows no output
tail -f /u01/app/oracle/product/10.2.0/crs_1/log/scgl2/crsd/crsd.log
Stop the cluster on host 2
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs
The system reports an error
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Read-only file system] [30]
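Analysis
The cluster failed to start because the disk holding the OCR could not be written to.
The owner, group, and permissions of the OCR disk devices were checked and were all fine.
Checking the host's VG PERMISSION revealed the cause: the attribute was passive-only, when the correct value is read/write.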
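The check described above could be scripted roughly as follows. This is only a sketch: the helper name vg_permission is hypothetical, and the parsing assumes the two-column lsvg layout shown in this post (VG PERMISSION on the left, TOTAL PPs on the right).

```shell
#!/bin/sh
# Hypothetical helper: read `lsvg <vg>` output on stdin and print
# the value of the VG PERMISSION field (e.g. passive-only or read/write).
vg_permission() {
    awk '/VG PERMISSION:/ {
        sub(/.*VG PERMISSION:[ \t]*/, "")   # drop everything up to the label
        sub(/[ \t]+TOTAL PPs:.*/, "")       # drop the right-hand column
        print
        exit
    }'
}

# Only runs where the AIX lsvg command exists; datavg is the VG from this post.
if command -v lsvg >/dev/null 2>&1; then
    perm=$(lsvg datavg | vg_permission)
    [ "$perm" = "read/write" ] || echo "WARNING: datavg VG PERMISSION is '$perm'"
fi
```

Running a check like this on each node after any unplanned reboot, before starting CRS, would surface a passive-only VG earlier than the PROC-26 error does.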
Solution
Run the following on host 2
# lsvg datavg
VOLUME GROUP: datavg VG IDENTIFIER: 00f7639d00004c00000001482d5c96b7
VG STATE: active PP SIZE: 256 megabyte(s)
VG PERMISSION: passive-only TOTAL PPs: 3196 (818176 megabytes)
MAX LVs: 256 FREE PPs: 1176 (301056 megabytes)
LVs: 10 USED PPs: 2020 (517120 megabytes)
OPEN LVs: 0 QUORUM: 3 (Enabled)
TOTAL PVs: 4 VG DESCRIPTORS: 4
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 4 AUTO ON: no
Concurrent: Enhanced-Capable Auto-Concurrent: Disabled
VG Mode: Concurrent
Node ID: 2 Active Nodes: 1
MAX PPs per VG: 32512
MAX PPs per PV: 1016 MAX PVs: 32
LTG size (Dynamic): 256 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
0. Stop the cluster software on each of the two nodes
/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs -f
/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs -f
1. Vary off the VG
varyoffvg datavg
2. Vary on the VG
varyonvg datavg
3. Change the permission of the 10 LVs in datavg
# lsvg -l datavg
datavg:
LV NAME TYPE LPs PPs PVs LV STATE MOUNT POINT
ora_raw1_100gb raw 400 400 1 closed/syncd N/A
ora_raw2_100gb raw 400 400 1 closed/syncd N/A
ora_raw3_100gb raw 400 400 1 closed/syncd N/A
ora_raw4_100gb raw 400 400 1 closed/syncd N/A
ora_raw5_100gb raw 400 400 2 closed/syncd N/A
ora_raw6_1gb raw 4 4 1 closed/syncd N/A
ora_raw7_1gb raw 4 4 1 closed/syncd N/A
ora_raw8_1gb raw 4 4 1 closed/syncd N/A
ora_raw9_1gb raw 4 4 1 closed/syncd N/A
ora_raw10_1gb raw 4 4 1 closed/syncd N/A
chlv -p w ora_raw1_100gb
chlv -p w ora_raw2_100gb
chlv -p w ora_raw3_100gb
chlv -p w ora_raw4_100gb
chlv -p w ora_raw5_100gb
chlv -p w ora_raw6_1gb
chlv -p w ora_raw7_1gb
chlv -p w ora_raw8_1gb
chlv -p w ora_raw9_1gb
chlv -p w ora_raw10_1gb
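Rather than typing the ten chlv commands by hand, they could be generated from the lsvg -l listing. The sketch below is an illustration only: gen_chlv_cmds is a hypothetical helper, it assumes every LV to change has type raw as in the listing above, and it only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: read `lsvg -l <vg>` output on stdin and print one
# `chlv -p w <lv>` command for every logical volume whose TYPE column is raw.
# The "datavg:" line and the column header never have "raw" as field 2,
# so they are skipped by the same test.
gen_chlv_cmds() {
    awk '$2 == "raw" { print "chlv -p w " $1 }'
}

# Only runs where the AIX lsvg command exists; prints the commands, does not execute them.
if command -v lsvg >/dev/null 2>&1; then
    lsvg -l datavg | gen_chlv_cmds
fi
```

Once the printed list looks right, it could be piped to sh as root to apply the changes.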
4. Vary off the VG again
varyoffvg datavg
5. Vary on the VG again, this time in concurrent mode
varyonvg -c datavg
6. Check the permission: it has changed from passive-only to read/write
# lsvg datavg
VOLUME GROUP: datavg VG IDENTIFIER: 00f7639d00004c00000001482d5c96b7
VG STATE: active PP SIZE: 256 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 3196 (818176 megabytes)
MAX LVs: 256 FREE PPs: 1176 (301056 megabytes)
LVs: 10 USED PPs: 2020 (517120 megabytes)
OPEN LVs: 0 QUORUM: 3 (Enabled)
TOTAL PVs: 4 VG DESCRIPTORS: 4
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 4 AUTO ON: no
Concurrent: Enhanced-Capable Auto-Concurrent: Disabled
VG Mode: Concurrent
Node ID: 2 Active Nodes: 1
MAX PPs per VG: 32512
MAX PPs per PV: 1016 MAX PVs: 32
LTG size (Dynamic): 256 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
7. Start the cluster on each of the two hosts
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
# /u01/app/oracle/product/10.2.0/crs_1/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
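The health check in step 7 could also be turned into a pass/fail test for use in a script. This is a sketch under assumptions: crs_all_healthy is a hypothetical helper, and it treats the stack as healthy only when all three "appears healthy" lines shown above are present.

```shell
#!/bin/sh
# Hypothetical helper: read `crsctl check crs` output on stdin and return 0
# only if CSS, CRS and EVM are all reported as "appears healthy".
crs_all_healthy() {
    awk '
        /^(CSS|CRS|EVM) appears healthy/ { ok[$1] = 1 }
        END { exit !(("CSS" in ok) && ("CRS" in ok) && ("EVM" in ok)) }
    '
}

# Path taken from this post; the block is skipped on hosts without crsctl.
CRSCTL=/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl
if [ -x "$CRSCTL" ]; then
    if "$CRSCTL" check crs | crs_all_healthy; then
        echo "CRS stack healthy"
    else
        echo "CRS stack NOT healthy"
    fi
fi
```

Because crsctl start crs returns before the stack is actually up, a check like this would normally be repeated after a short wait on each node.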
8. Start the database
SQL> select open_mode from gv$database;
OPEN_MODE
----------
READ WRITE
READ WRITE
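Resolving this incident cost me a lot of time. The lessons learned:
Troubleshoot Oracle precisely. Once Oracle is shown not to be at fault, do not hesitate to hand the problem over and have the server and storage teams investigate.
Build up your own server and storage knowledge so that you can run more checks without depending on the server and storage engineers.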
Source: "ITPUB Blog", link: http://blog.itpub.net/29047826/viewspace-1265183/. If you repost this article, please credit the source; otherwise legal action may be pursued.