Resolving a Failure in an ASM Environment

Posted by yangtingkun on 2007-03-30

The RAC test environment was running out of space, and a failure occurred while I was adding new disk space to ASM.


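The steps were roughly as follows. I started dbca on node 1 to manage the ASM devices. Some of the configured raw devices were not visible in the ASM graphical interface, so on node 1 I used the root user to grant the oracle user access to those raw devices.

After that, the raw devices appeared among the candidate disks in the graphical interface, and I added them to the disk group through the GUI.

However, the operation raised two errors: ORA-15032 and ORA-15075.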

ORA-15032: not all alterations performed

Cause: At least one ALTER DISKGROUP action failed.

Action: Check the other messages issued along with this summary error.

ORA-15075: disk(s) are not visible cluster-wide

Cause: An ALTER DISKGROUP ADD DISK command specified a disk that could not be discovered by one or more nodes in a RAC cluster configuration.

Action: Determine which disks are causing the problem from the GV$OSM_DISK fixed view. Check operating system permissions for the device and the storage sub-system configuration on each node in a RAC cluster that cannot identify the disk.

In fact, the ORA-15075 message is clear enough on its own. With some experience, or simply by analyzing this error, the cause could have been found.
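The Action text for ORA-15075 suggests checking operating system permissions for the device on every node. A minimal shell sketch of such a check follows. The function names `device_owner` and `check_devices` are hypothetical helpers, not Oracle tools, and the expected owner `oracle:oinstall` used in the usage note is taken from the session later in this post. The idea is to run the same check on each cluster node and compare the results.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any Oracle tool): report devices whose
# owner:group differs from what the ASM instance is assumed to need.

# Print "owner:group" of a path, following symlinks (-L), because device
# entries are often symlinks to the real device node.
device_owner() {
    ls -ldL "$1" 2>/dev/null | awk '{ print $3 ":" $4 }'
}

# Usage: check_devices EXPECTED_OWNER:GROUP PATH...
# Prints one line per mismatching path; returns non-zero if any path
# is missing, unreadable, or owned by someone else.
check_devices() {
    expected=$1
    shift
    status=0
    for dev in "$@"; do
        actual=$(device_owner "$dev")
        if [ "$actual" != "$expected" ]; then
            echo "MISMATCH $dev: ${actual:-unreadable} (expected $expected)"
            status=1
        fi
    done
    return $status
}
```

For example, `check_devices oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s1` could be run on node 1 and again on node 2; a node that prints a MISMATCH line is a likely candidate for the one that cannot discover the disk.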

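But other unexpected events changed the direction of the troubleshooting.

One odd thing was that, although I believed the operation had failed, the raw devices were already visible in dbca's ASM configuration.

While I was looking into the two errors, a colleague told me that the instance on node 2 could no longer be reached.

Operating system commands showed that instance 2 was down, although the ASM instance on node 2 was still running. This seemed strange: the error came from an ASM operation, yet the ASM instance itself was fine while the database instance had shut down.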

I checked the alert file and tried to restart the database to see the error messages:

$ tail -500 alert*
List of nodes:
.
.
.
Thu Mar 29 17:10:24 2007
SUCCESS: disk DISK_0012 (12.4042303515) added to diskgroup DISK
SUCCESS: disk DISK_0013 (13.4042303516) added to diskgroup DISK
SUCCESS: disk DISK_0014 (14.4042303517) added to diskgroup DISK
SUCCESS: disk DISK_0015 (15.4042303518) added to diskgroup DISK
SUCCESS: disk DISK_0016 (16.4042303519) added to diskgroup DISK
Thu Mar 29 17:25:36 2007
SUCCESS: disk DISK_0017 (17.4042303525) added to diskgroup DISK
SUCCESS: disk DISK_0018 (18.4042303520) added to diskgroup DISK
SUCCESS: disk DISK_0019 (19.4042303521) added to diskgroup DISK
SUCCESS: disk DISK_0020 (20.4042303522) added to diskgroup DISK
SUCCESS: disk DISK_0021 (21.4042303523) added to diskgroup DISK
SUCCESS: disk DISK_0022 (22.4042303524) added to diskgroup DISK
Thu Mar 29 17:29:45 2007
SUCCESS: diskgroup DISK was dismounted
SUCCESS: diskgroup DISK was dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00204: error in reading (block 35, # blocks 1) of control file
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
LMON: terminating instance due to error 204
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_pmon_2754.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:46 2007
System state dump is made for local instance
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms1_2797.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms0_2793.trc:
ORA-00204: error in reading (block , # blocks ) of control file
System State dumped to trace file /data/oracle/admin/testrac/bdump/testrac2_diag_2756.trc
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmd0_2791.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_psp0_2778.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:52 2007
Instance terminated by LMON, pid = 2789
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:36:07 2007

Copyright (c) 1982, 2005, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DISK/testrac/spfiletestrac.ora'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL> shutdown
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory

In fact, the alert file already clearly contains the cause of the error:

SUCCESS: diskgroup DISK was dismounted
SUCCESS: diskgroup DISK was dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00204: error in reading (block 35, # blocks 1) of control file
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted

The ASM disk group had been dismounted. Because I was unfamiliar with ASM, I paid little attention to the ASM messages and noticed only the later ones:

Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:52 2007
Instance terminated by LMON, pid = 2789

I took these later messages to be the cause of the problem.
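Scanning the alert log for the earliest storage-related event, rather than focusing on the last errors, would have surfaced the dismount first. A rough shell sketch of that idea: `first_asm_events` is a hypothetical helper name, and treating ORA-15xxx codes plus the word "dismounted" as the patterns of interest is an assumption made for illustration, not an exhaustive list.

```shell
#!/bin/sh
# Hypothetical helper: show the earliest ASM-related lines in an alert log,
# prefixed with their line numbers, so the first cause is seen before the
# follow-on errors. Assumes a grep that supports -E.
# Usage: first_asm_events ALERT_LOG [MAX_LINES]
first_asm_events() {
    log=$1
    max=${2:-5}
    # ORA-15000..ORA-15999 are treated here as ASM-related (assumption),
    # together with any "dismounted" message for a disk group.
    grep -n -E 'ORA-15[0-9]{3}|dismounted' "$log" | head -n "$max"
}
```

Run against the alert log of instance 2, the first lines reported would be the "diskgroup DISK was dismounted" and ORA-15078 entries, which precede every ORA-00204 message.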

The startup output that followed also points to the problem:

ORA-15077: could not locate ASM instance serving a required diskgroup

ORA-15077: could not locate ASM instance serving a required diskgroup

Cause: The instance failed to perform the specified operation because it could not locate a required ASM instance.

Action: Start an ASM instance and mount the required diskgroup.

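However, I had recently run into a bug whose key error message also happened to be ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora, so for the moment I overlooked the key information again. A detailed description of the bug is at: http://yangtingkun.itpub.net/post/468/272289

My thinking naturally turned to that bug, and I assumed this problem might be related to the earlier one. I tried starting the database with a local pfile: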

SQL> startup pfile=/export/home/oracle/inittestrac2.ora
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 503317800 bytes
Database Buffers 1627389952 bytes
Redo Buffers 14745600 bytes
ORA-00205: ?????????, ??????, ???????

Misled once again, I went to look up the ORA-00205 error.

ORA-00205: error in identifying control file, check alert log for more info

Cause: The system could not find a control file of the specified name and size.

Action: Check that ALL control files are online and that they are the same files that the system created at cold start time.

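Only when I realized that the control file itself was fine, since instance 1 had been running normally the whole time, did I see that I had taken the wrong path.

I carefully reviewed all the error messages, along with the operation that had triggered them, namely adding disks to the disk group, and finally found the real problem.

When granting permissions, I had done so for the raw devices only on node 1, not on node 2. So although the ASM instance configured through dbca on node 1 could add the raw devices to the disk group successfully, the same operation on node 2 failed for lack of permissions, which caused the disk group to be dismounted and, indirectly, the database instance to shut down.

So I granted the permissions on the raw devices on node 2 and restarted the ASM instance, which solved the problem.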

$ su -
Password:
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s1
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s3
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s4
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s5
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s6
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s7
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s1
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s3
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s4
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s5
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s6
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s7
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:52:38 2007

Copyright (c) 1982, 2005, Oracle. All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> shutdown
ORA-01507: database not mounted


ORACLE instance shut down.
SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DISK/testrac/spfiletestrac.ora'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

$ ps -ef|grep ASM
oracle 1993 1 0 Mar 28 ? 0:00 asm_mman_+ASM2
oracle 1979 1 0 Mar 28 ? 0:00 asm_pmon_+ASM2
oracle 1987 1 0 Mar 28 ? 0:18 asm_lmd0_+ASM2
oracle 2658 1 0 Mar 28 ? 0:00 asm_o000_+ASM2
oracle 1983 1 0 Mar 28 ? 0:00 asm_psp0_+ASM2
oracle 2332 1 0 Mar 28 ? 0:01 /data/oracle/product/10.2/database/bin/racgimon daemon ora.racnode2.ASM2.asm
oracle 1981 1 0 Mar 28 ? 0:00 asm_diag_+ASM2
oracle 1985 1 0 Mar 28 ? 0:01 asm_lmon_+ASM2
oracle 1989 1 0 Mar 28 ? 0:01 asm_lms0_+ASM2
oracle 2028 1 0 Mar 28 ? 0:04 asm_ckpt_+ASM2
oracle 2026 1 0 Mar 28 ? 0:00 asm_lgwr_+ASM2
oracle 2008 1 0 Mar 28 ? 0:01 asm_dbw0_+ASM2
oracle 2030 1 0 Mar 28 ? 0:00 asm_smon_+ASM2
oracle 2032 1 0 Mar 28 ? 0:00 asm_rbal_+ASM2
oracle 2034 1 0 Mar 28 ? 0:00 asm_gmon_+ASM2
oracle 2065 1 0 Mar 28 ? 0:01 asm_lck0_+ASM2
oracle 23532 20734 0 17:54:05 pts/1 0:00 grep ASM
oracle 15238 1 0 17:29:43 ? 0:00 asm_b000_+ASM2
$ srvctl stop asm -n racnode2
$ srvctl start asm -n racnode2
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:55:17 2007

Copyright (c) 1982, 2005, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 469763368 bytes
Database Buffers 1660944384 bytes
Redo Buffers 14745600 bytes
Database mounted.
Database opened.
SQL>
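
The twelve chown commands in the session above follow a regular pattern (disks d6 and d7, slices 1 and 3 through 7), so they could be generated by a loop. The sketch below only prints the commands unless it is explicitly asked to run them. `fix_raw_devices` and its `run` argument are hypothetical names; the device prefix, owner, group, and slice list are copied from the transcript.

```shell
#!/bin/sh
# Hypothetical helper: emit (or execute) the chown commands for the raw
# devices that were added to the disk group. When executing, it has to be
# run as root, and on every node of the cluster.
# Usage: fix_raw_devices [run]
fix_raw_devices() {
    mode=${1:-print}
    prefix=/dev/rdsk/c2t500601603022E66A   # device prefix from the transcript
    for disk in d6 d7; do
        for slice in 1 3 4 5 6 7; do
            dev="${prefix}${disk}s${slice}"
            if [ "$mode" = "run" ]; then
                chown oracle:oinstall "$dev"
            else
                # Default: dry run, print the command for review.
                echo "chown oracle:oinstall $dev"
            fi
        done
    done
}
```

Printing first and running second makes it easy to review the device list, and reusing the same function on both nodes avoids the mismatch that caused this incident.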

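With that, the problem was solved. The cause was actually simple, but when a problem arises it calls for calm analysis and judgment; otherwise it is easy to be distracted by unrelated information, go astray, and take many detours.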

From the ITPUB blog, link: http://blog.itpub.net/4227/viewspace-69224/. If you repost this article, please credit the source; otherwise legal action will be pursued.
