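Resolving a Fault in an ASM Environment
The RAC test environment was running out of space, and a fault occurred while I was adding new disk space to ASM.
The steps were roughly as follows. I started dbca on node 1 to manage the ASM devices. Some of the configured raw devices were not visible in the ASM graphical interface, so on node 1 I used the root user to grant the oracle user access to those raw devices.
At that point the raw devices appeared among the candidate disks in the graphical interface, and I added them to the disk group through the GUI.
But the operation raised two errors: ORA-15032 and ORA-15075.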
ORA-15032: not all alterations performed
Cause: At least one ALTER DISKGROUP action failed.
Action: Check the other messages issued along with this summary error.
ORA-15075: disk(s) are not visible cluster-wide
Cause: An ALTER DISKGROUP ADD DISK command specified a disk that could not be discovered by one or more nodes in a RAC cluster configuration.
Action: Determine which disks are causing the problem from the GV$OSM_DISK fixed view. Check operating system permissions for the device and the storage sub-system configuration on each node in a RAC cluster that cannot identify the disk.
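The ORA-15075 message is actually clear enough. With some experience, or simply by analyzing this error, one could find the cause of the problem.
But another unexpected event changed the direction of the troubleshooting.
One odd thing was that, although I believed the operation had failed, the raw devices were already visible in dbca's ASM configuration.
While I was checking these two error messages, a colleague told me that the instance on node 2 could no longer be reached.
Checking with operating system commands showed that instance 2 had shut down, while the ASM instance on node 2 was still running. This seemed strange: the error came from an ASM operation and the ASM instance itself had not failed, so why had the database instance shut down?
I checked the alert file and tried a restart to see what errors appeared: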
$ tail -500 alert*
List of nodes:
.
.
.
Thu Mar 29 17:10:24 2007
SUCCESS: disk DISK_0012 (12.4042303515) added to diskgroup DISK
SUCCESS: disk DISK_0013 (13.4042303516) added to diskgroup DISK
SUCCESS: disk DISK_0014 (14.4042303517) added to diskgroup DISK
SUCCESS: disk DISK_0015 (15.4042303518) added to diskgroup DISK
SUCCESS: disk DISK_0016 (16.4042303519) added to diskgroup DISK
Thu Mar 29 17:25:36 2007
SUCCESS: disk DISK_0017 (17.4042303525) added to diskgroup DISK
SUCCESS: disk DISK_0018 (18.4042303520) added to diskgroup DISK
SUCCESS: disk DISK_0019 (19.4042303521) added to diskgroup DISK
SUCCESS: disk DISK_0020 (20.4042303522) added to diskgroup DISK
SUCCESS: disk DISK_0021 (21.4042303523) added to diskgroup DISK
SUCCESS: disk DISK_0022 (22.4042303524) added to diskgroup DISK
Thu Mar 29 17:29:45 2007
SUCCESS: diskgroup DISK was dismounted
SUCCESS: diskgroup DISK was dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00204: error in reading (block 35, # blocks 1) of control file
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
LMON: terminating instance due to error 204
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_pmon_2754.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:46 2007
System state dump is made for local instance
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms1_2797.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms0_2793.trc:
ORA-00204: error in reading (block , # blocks ) of control file
System State dumped to trace file /data/oracle/admin/testrac/bdump/testrac2_diag_2756.trc
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmd0_2791.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_psp0_2778.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:52 2007
Instance terminated by LMON, pid = 2789
$ sqlplus "/ as sysdba"
SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:36:07 2007
Copyright (c) 1982, 2005, Oracle. All Rights Reserved.
Connected to an idle instance.
SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DISK/testrac/spfiletestrac.ora'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL> shutdown
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory
In fact, the alert file already clearly contained the cause of the error:
SUCCESS: diskgroup DISK was dismounted
SUCCESS: diskgroup DISK was dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00204: error in reading (block 35, # blocks 1) of control file
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
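The ASM disk group had already been dismounted. But because I was unfamiliar with ASM, I paid little attention to the ASM messages and only noticed the later ones: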
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:52 2007
Instance terminated by LMON, pid = 2789
I took these to be the cause of the problem.
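When an alert log ends in a burst of errors, the earliest entries of the burst are usually the most informative ones. As a minimal sketch (the function name `first_ora_errors` and its default count are hypothetical, not part of any Oracle tooling), the following shell function prints the first few ORA- lines of a log together with their line numbers, so that the first error is read before the cascade that follows it:

```shell
# Hypothetical helper: print the first N lines that contain an ORA- error
# code, prefixed with their line numbers, from the given log file.
# N defaults to 3 when the second argument is omitted.
first_ora_errors() {
    log_file=$1
    count=${2:-3}
    grep -n 'ORA-[0-9][0-9]*' "$log_file" | head -n "$count"
}
```

It could be called as `first_ora_errors alert_testrac2.log`, where the file name is an assumption. Applied only to the excerpt quoted in this post, the first three matches would be the ORA-00202, ORA-15078 and ORA-00204 lines written by LMON, which puts the forced dismount of the disk group ahead of the later control file read errors.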
The startup messages that followed also point to the problem:
ORA-15077: could not locate ASM instance serving a required diskgroup
ORA-15077: could not locate ASM instance serving a required diskgroup
Cause: The instance failed to perform the specified operation because it could not locate a required ASM instance.
Action: Start an ASM instance and mount the required diskgroup.
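But I had recently run into a bug whose key error message happened to be the same ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora, so once again I overlooked the key information. A detailed description of the bug is at: http://yangtingkun.itpub.net/post/468/272289
My thinking naturally turned to that bug, and I assumed this problem might be related to the earlier one. I tried starting the database with a local pfile: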
SQL> startup pfile=/export/home/oracle/inittestrac2.ora
ORACLE instance started.
Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 503317800 bytes
Database Buffers 1627389952 bytes
Redo Buffers 14745600 bytes
ORA-00205: ?????????, ??????, ???????
Misled once again, I went to look up the ORA-00205 error.
ORA-00205: error in identifying control file, check alert log for more info
Cause: The system could not find a control file of the specified name and size.
Action: Check that ALL control files are online and that they are the same files that the system created at cold start time.
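Only when I noticed that the control file itself was fine, since instance 1 had been running normally the whole time, did I realize I had taken the wrong path.
I carefully reviewed all the error messages, as well as the operation that had triggered them, namely adding disks to the disk group, and finally found the real problem.
When granting permissions, I had done so for the raw devices only on node 1 and not on node 2. As a result, the ASM instance on node 1, driven by dbca, could add the raw devices to the disk group successfully, but the same operation on node 2 failed for lack of permissions. That caused the disk group to be dismounted on node 2, which indirectly shut down the database instance.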
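A mismatch like this can be caught before the disks are added by checking device ownership on every node. The following is a minimal sketch under stated assumptions: the function name `not_owned_by` is hypothetical, it checks only the owner and not the group or mode bits, and it has to be run separately on each node with the same device list.

```shell
# Hypothetical helper: list the given device paths that are not owned by
# the expected user. It parses `ls -ld` output so it does not depend on a
# `stat` command, and uses -L to follow symbolic links, because entries
# under /dev/rdsk on Solaris are links to the real device nodes.
not_owned_by() {
    expected_user=$1
    shift
    for dev in "$@"; do
        # The third field of `ls -l` output is the owner name.
        owner=$(ls -ldL "$dev" 2>/dev/null | awk '{print $3}')
        if [ "$owner" != "$expected_user" ]; then
            echo "$dev owner=${owner:-unknown}"
        fi
    done
}
```

For example, `not_owned_by oracle /dev/rdsk/c2t500601603022E66Ad6s1 /dev/rdsk/c2t500601603022E66Ad7s1` prints one line for each device whose owner differs from oracle, or that cannot be listed at all. Empty output on one node and non-empty output on another would point to the kind of inconsistency described here.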
So I granted the oracle user access to the raw devices on node 2 and restarted the ASM instance, which solved the problem.
$ su -
Password:
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s1
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s3
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s4
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s5
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s6
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s7
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s1
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s3
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s4
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s5
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s6
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s7
$ sqlplus "/ as sysdba"
SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:52:38 2007
Copyright (c) 1982, 2005, Oracle. All Rights Reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
SQL> shutdown
ORA-01507: database not mounted
ORACLE instance shut down.
SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DISK/testrac/spfiletestrac.ora'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
$ ps -ef|grep ASM
oracle 1993 1 0 Mar 28 ? 0:00 asm_mman_+ASM2
oracle 1979 1 0 Mar 28 ? 0:00 asm_pmon_+ASM2
oracle 1987 1 0 Mar 28 ? 0:18 asm_lmd0_+ASM2
oracle 2658 1 0 Mar 28 ? 0:00 asm_o000_+ASM2
oracle 1983 1 0 Mar 28 ? 0:00 asm_psp0_+ASM2
oracle 2332 1 0 Mar 28 ? 0:01 /data/oracle/product/10.2/database/bin/racgimon daemon ora.racnode2.ASM2.asm
oracle 1981 1 0 Mar 28 ? 0:00 asm_diag_+ASM2
oracle 1985 1 0 Mar 28 ? 0:01 asm_lmon_+ASM2
oracle 1989 1 0 Mar 28 ? 0:01 asm_lms0_+ASM2
oracle 2028 1 0 Mar 28 ? 0:04 asm_ckpt_+ASM2
oracle 2026 1 0 Mar 28 ? 0:00 asm_lgwr_+ASM2
oracle 2008 1 0 Mar 28 ? 0:01 asm_dbw0_+ASM2
oracle 2030 1 0 Mar 28 ? 0:00 asm_smon_+ASM2
oracle 2032 1 0 Mar 28 ? 0:00 asm_rbal_+ASM2
oracle 2034 1 0 Mar 28 ? 0:00 asm_gmon_+ASM2
oracle 2065 1 0 Mar 28 ? 0:01 asm_lck0_+ASM2
oracle 23532 20734 0 17:54:05 pts/1 0:00 grep ASM
oracle 15238 1 0 17:29:43 ? 0:00 asm_b000_+ASM2
$ srvctl stop asm -n racnode2
$ srvctl start asm -n racnode2
$ sqlplus "/ as sysdba"
SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:55:17 2007
Copyright (c) 1982, 2005, Oracle. All Rights Reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 469763368 bytes
Database Buffers 1660944384 bytes
Redo Buffers 14745600 bytes
Database mounted.
Database opened.
SQL>
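That resolved the problem. The cause was actually simple, but when a problem occurs it takes calm analysis and judgment. Otherwise it is easy to be distracted by unrelated information and waste time on detours.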
From the ITPUB blog, link: http://blog.itpub.net/4227/viewspace-69224/. If you repost this article, please credit the source; otherwise legal liability will be pursued.