ASM還真是讓人不是很踏實，今天又碰到一個和ASM有關的問題。

今天RAC的測試資料庫在匯入資料時長時間沒有響應。透過V$SESSION_WAIT檢查發現，系統在等待歸檔：

SQL> SELECT SID, EVENT FROM V$SESSION_WAIT WHERE SID = 132;

SID EVENT
---------- ----------------------------------------------------------------
132 log file switch (archiving needed)

資料庫採用ASM，由於是測試環境，沒有給ASM太大的空間，而且為了方便，將歸檔日誌也放在ASM中了。

產生問題的原因很明顯了，由於歸檔日誌不斷產生，導致ASM空間用完，因此新的歸檔無法產生，導致資料庫處於等待狀態。

本來認為是個小問題，於是透過RMAN將所有的歸檔日誌刪除。奇怪的是，成功的刪除所有的歸檔日誌後，這個問題仍然沒有解決。

SQL> SELECT SID, EVENT FROM V$SESSION_WAIT WHERE SID = 132;

SID EVENT
---------- ----------------------------------------------------------------
132 log file switch (archiving needed)

再次檢查等待會話，發現仍然在等待歸檔，懷疑Oracle處於兩次重試之間的間歇期，於是手工切換當前例項的日誌去啟用歸檔操作：

SQL> alter system switch logfile;

結果這個操作也一直處於等待中。

難道是ASM中的空間並沒有釋放？於是在當前節點（節點1：racnode1）啟動dbca，檢視ASM的配置，結果居然出現下面的資訊：

DBCA could not startup the ASM instance configured on this node. To proceed with ASM diskgroup management you need the ASM instance to be up and running. Do you want to recreate the ASM instance on the node?

奇怪RAC環境一直是在ASM上執行的，怎麼會出現這個問題呢？難道ASM例項沒有啟動：

SQL> host
$ ps -ef|grep ASM
oracle 8205 1 0 12:31:18 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 9734 1 0 12:32:17 ? 0:01 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11007 1 0 12:33:11 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4157 1 0 Apr 13 ? 0:02 asm_lck0_+ASM1
oracle 4077 1 0 Apr 13 ? 0:02 asm_lmon_+ASM1
oracle 4087 1 0 Apr 13 ? 0:03 asm_dbw0_+ASM1
oracle 4079 1 0 Apr 13 ? 0:08 asm_lmd0_+ASM1
oracle 4124 1 0 Apr 13 ? 0:00 asm_rbal_+ASM1
oracle 11098 1 0 12:33:14 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 20253 1 0 11:59:39 ? 2:50 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4578 1 0 Apr 13 ? 0:00 asm_o000_+ASM1
oracle 11131 1 0 12:33:16 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 12954 1 0 12:14:52 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11164 1 0 12:33:17 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4073 1 0 Apr 13 ? 0:01 asm_diag_+ASM1
oracle 7028 1 0 12:10:57 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 9349 1 0 12:32:02 ? 0:09 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11067 1 0 12:33:13 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4085 1 0 Apr 13 ? 0:00 asm_mman_+ASM1
oracle 12754 1 0 Apr 13 ? 0:01 /data/oracle/product/10.2/database/bin/racgimon daemon ora.racnode1.ASM1.asm
oracle 10915 10806 0 15:58:37 pts/4 0:00 grep ASM
oracle 4081 1 0 Apr 13 ? 0:02 asm_lms0_+ASM1
oracle 10773 1 0 12:33:00 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4126 1 0 Apr 13 ? 0:24 asm_gmon_+ASM1
oracle 23205 1 0 15:01:11 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 16362 1 0 12:17:02 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 1075 1 0 12:26:38 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4101 1 0 Apr 13 ? 0:00 asm_lgwr_+ASM1
oracle 4122 1 0 Apr 13 ? 0:00 asm_smon_+ASM1
oracle 11195 1 0 12:33:18 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4120 1 0 Apr 13 ? 0:00 asm_ckpt_+ASM1
oracle 23017 1 0 12:21:22 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 22951 1 0 12:01:32 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4071 1 0 Apr 13 ? 0:00 asm_pmon_+ASM1
oracle 20259 1 2 11:59:40 ? 21:03 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4917 1 1 Apr 13 ? 14:01 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11162 1 0 12:33:17 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 19873 1 0 12:37:51 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11129 1 0 12:33:15 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4075 1 0 Apr 13 ? 0:00 asm_psp0_+ASM1

顯然ASM例項已經啟動，那麼怎麼會出現這個問題呢，於是嘗試直接連線到ASM例項上：

SQL> exit從 Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options 斷開
$ bash
bash-2.03$ export ORACLE_SID=+ASM1
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 4月 16 16:02:42 2007

已連線。
SQL> SELECT * FROM V$ASM_DISKGROUP;
SELECT * FROM V$ASM_DISKGROUP
*第 1 行出現錯誤:
ORA-01012: not logged on

SQL>

看來是ASM出現了問題。

在節點2上嘗試用dbca管理ASM，結果發現ASM尚有40G的剩餘空間，這是歸檔刪除後的剩餘空間。

而且節點2上直接登陸ASM例項：

$ bash
bash-2.03$ export ORACLE_SID=+ASM2
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 4月 16 16:08:33 2007

連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> SET PAGES 100 LINES 120
SQL> COL NAME FORMAT A30
SQL> SELECT GROUP_NUMBER, NAME, TOTAL_MB, FREE_MB FROM V$ASM_DISKGROUP;

GROUP_NUMBER NAME TOTAL_MB FREE_MB
------------ ------------------------------ ---------- ----------
1 DISK 110934 40797

SQL>

看來是節點1上的ASM出現了故障，導致Oracle的例項1無法正常工作。

嘗試在節點2上執行日誌切換：

SQL> EXIT從 Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options 斷開
bash-2.03$ exit
exit
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 4月 16 16:11:45 2007

連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> SELECT INSTANCE_NAME FROM V$INSTANCE;

INSTANCE_NAME
----------------
testrac2

SQL> ALTER SYSTEM SWITCH LOGFILE;

系統已更改。

果然可以，看來問題完全定位到節點1的ASM例項上了。懷疑是空間用完導致ASM例項出現了問題。

將節點1上的ASM例項重啟就應該可以解決問題，但是我不想損失已經進行了一半的工作，於是嘗試透過其他的方法解決無法歸檔的問題，待工作完成後再重啟例項。

登陸節點1上的例項：

$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 4月 16 16:15:11 2007

連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> SET PAGES 100 LINES 120
SQL> SELECT INSTANCE_NAME FROM V$INSTANCE;

INSTANCE_NAME
----------------
testrac1

SQL> SHOW PARAMETER LOG_ARCHIVE

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_archive_config string
log_archive_dest string
log_archive_dest_1 string LOCATION=+DISK/testrac/
log_archive_dest_10 string
log_archive_dest_2 string
log_archive_dest_3 string
log_archive_dest_4 string
log_archive_dest_5 string
log_archive_dest_6 string
log_archive_dest_7 string
log_archive_dest_8 string
log_archive_dest_9 string
log_archive_dest_state_1 string enable
log_archive_dest_state_10 string enable
log_archive_dest_state_2 string enable
log_archive_dest_state_3 string enable
log_archive_dest_state_4 string enable
log_archive_dest_state_5 string enable
log_archive_dest_state_6 string enable
log_archive_dest_state_7 string enable
log_archive_dest_state_8 string enable
log_archive_dest_state_9 string enable
log_archive_duplex_dest string
log_archive_format string %t_%s_%r.dbf
log_archive_local_first boolean TRUE
log_archive_max_processes integer 2
log_archive_min_succeed_dest integer 1
log_archive_start boolean FALSE
log_archive_trace integer 0

觀察歸檔的相關資訊，雖然無法不重啟資料庫的情況下改變歸檔模式，不過由於LOG_ARCHIVE_MIN_SUCCEED_DEST引數設定為1，因此可以透過新增第二歸檔LOCATION的方式來解決位置1無法使用的問題。

由於例項2沒有歸檔的問題，可以僅修改當前例項的引數：

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 = 'LOCATION=/data/oracle/oradata/testrac' SCOPE = MEMORY SID = 'testrac1';

系統已更改。

修改後，發現剛才一直處於等待狀態的ALTER SYSTEM SWITCH LOGFILE命令已經執行完畢，且一直等待歸檔的會話也改變了等待事件：

SQL> alter system switch logfile;

系統已更改。

SQL> SELECT SID, EVENT FROM V$SESSION_WAIT WHERE SID = 132;

SID EVENT
---------- ----------------------------------------------------------------
132 SQL*Net more data from client

至此，問題基本上解決了，只需要等到匯入工作完成後，手工重啟節點1的資料庫和ASM例項就可以了。

問題出現的時候只是透過V$SESSION_WAIT和資料庫的錯誤資訊進行了分析，事後檢查ASM和例項1的alert檔案，再次確定了ASM的問題：

Mon Apr 16 12:01:33 2007
WARNING: allocation failure on disk DISK_0019 for file 392 xnum 2147483648
Mon Apr 16 12:10:58 2007
WARNING: allocation failure on disk DISK_0006 for file 396 xnum 163
Mon Apr 16 12:14:52 2007
WARNING: allocation failure on disk DISK_0004 for file 396 xnum 2147483648
Mon Apr 16 12:17:02 2007
WARNING: allocation failure on disk DISK_0004 for file 397 xnum 2147483648
Mon Apr 16 12:21:22 2007
WARNING: allocation failure on disk DISK_0002 for file 399 xnum 2147483648
Mon Apr 16 12:26:39 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:31:18 2007
WARNING: allocation failure on disk DISK_0009 for file 400 xnum 2147483648
Mon Apr 16 12:33:00 2007
WARNING: allocation failure on disk DISK_0015 for file 400 xnum 2147483648
Mon Apr 16 12:33:11 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:33:14 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:33:15 2007
WARNING: allocation failure on disk DISK_0003 for file 400 xnum 2147483648
Mon Apr 16 12:33:16 2007
WARNING: allocation failure on disk DISK_0017 for file 400 xnum 2147483648
Mon Apr 16 12:33:17 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:33:17 2007
WARNING: allocation failure on disk DISK_0004 for file 400 xnum 2147483648
Mon Apr 16 12:33:18 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:33:19 2007
WARNING: allocation failure on disk DISK_0038 for file 400 xnum 445
Mon Apr 16 12:37:51 2007
WARNING: allocation failure on disk DISK_0019 for file 400 xnum 2147483648
Mon Apr 16 15:01:12 2007
WARNING: allocation failure on disk DISK_0006 for file 400 xnum 2147483648
Mon Apr 16 15:23:23 2007
Starting ORACLE instance (normal)
Mon Apr 16 15:26:05 2007
Starting ORACLE instance (normal)
Mon Apr 16 15:30:53 2007
Process PZ99 died, see its trace file
Process PZ99 died, see its trace file
Mon Apr 16 15:31:11 2007
Process PZ99 died, see its trace file
Process PZ99 died, see its trace file
Mon Apr 16 15:58:34 2007
Starting ORACLE instance (normal)

上面是ASM的錯誤資訊，指出了ASM分配空間失敗，而下面的資料庫例項錯誤資訊：

ORACLE Instance testrac1 - Archival Error
Mon Apr 16 12:33:24 2007
ORA-16014: log 1 sequence# 85 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Mon Apr 16 12:33:24 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc1_5191.trc:
ORA-16014: log 1 sequence# 85 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'

則說明ASM的問題導致了歸檔的錯誤。

幸好出現問題的是測試庫，重啟對於測試資料庫來說不算問題，但是重啟對於產品庫來說就不是小問題了，即使這個產品庫是RAC環境。

從這個問題上看，感覺ASM的bug還是多了一點，這個問題目前還找不到不重啟ASM例項就可以解決問題的方法，而ASM給人的感覺太不透明瞭。

BTW：由於再節點1和節點2上多次嘗試啟動dbca來管理asm，還碰到了ORA-20錯誤，對於這個問題metalink有專門的文件描述：Doc ID: Note:308376.1：Subject: Dbca Fails To Create Asm Disk Group, then upon restart of ASM instance errors。

連線ASM例項後出現ORA-1012錯誤

相關文章