連線ASM例項後出現ORA-1012錯誤(三)

yangtingkun發表於2007-06-12

繼續討論這個問題的產生原因。

連線ASM例項後出現ORA-1012錯誤:http://yangtingkun.itpub.net/post/468/280923

連線ASM例項後出現ORA-1012錯誤(二):http://yangtingkun.itpub.net/post/468/294068


上面一篇文章分析了導致ORA-1012錯誤的原因,是由於連線ASM例項的程式數超過了ASM初始化引數的設定,從而導致了問題的產生。

這篇文章進一步分析,是什麼導致了ASM產生了大量的連線。

首先透過作業系統命令檢查:

$ ps -ef|grep ASM|grep -v grep
oracle 17061 1 0 May 14 ? 0:00 asm_psp0_+ASM1
oracle 17059 1 0 May 14 ? 0:05 asm_diag_+ASM1
oracle 17108 1 0 May 14 ? 0:00 asm_smon_+ASM1
oracle 2556 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17057 1 0 May 14 ? 0:03 asm_pmon_+ASM1
oracle 17063 1 0 May 14 ? 0:11 asm_lmon_+ASM1
oracle 17110 1 0 May 14 ? 0:00 asm_rbal_+ASM1
oracle 17067 1 0 May 14 ? 0:08 asm_lms0_+ASM1
oracle 17079 1 0 May 14 ? 0:00 asm_mman_+ASM1
oracle 17104 1 0 May 14 ? 0:01 asm_lgwr_+ASM1
oracle 17420 1 0 May 14 ? 0:12 /data/oracle/product/10.2/database/bin/racgimon daemon ora.racnode1.ASM1.asm
oracle 28629 1 0 Jun 08 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 19622 1 0 Jun 01 ? 0:05 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17065 1 0 May 14 ? 0:19 asm_lmd0_+ASM1
oracle 17112 1 0 May 14 ? 2:10 asm_gmon_+ASM1
oracle 17143 1 0 May 14 ? 0:20 asm_lck0_+ASM1
oracle 1204 1 0 May 31 ? 0:00 asm_o001_+ASM1
oracle 8710 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11907 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 29561 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 23342 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17098 1 0 May 14 ? 0:01 asm_dbw0_+ASM1
oracle 23469 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17106 1 0 May 14 ? 0:01 asm_ckpt_+ASM1
oracle 12134 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 24374 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 28631 1 0 Jun 08 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 22777 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 12888 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 15334 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4977 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 26859 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 9633 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 28634 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 23136 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 736 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 6678 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11080 11079 0 17:54:43 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

$ ps -ef|grep ASM|grep -v grep|wc -l
39

僅僅是一個測試庫,且只有我一個人在測試RMAN的備份,怎麼會有這麼多的會話呢,登陸ASM例項進行檢查:

bash-2.03$ export ORACLE_SID=+ASM1
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 6 11 18:21:23 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> select count(*) from v$session;

COUNT(*)
----------
38

SQL> select count(*) from v$process;

COUNT(*)
----------
39

SQL> select distinct process from v$session where username is not null;

PROCESS
------------------------
11234
19620
19969
19995
23709
23768

已選擇6行。

SQL> select process, count(*) from v$session where username is not null group by process;

PROCESS COUNT(*)
------------------------ ----------
11234 1
19620 1
19969 11
19995 8
23709 1
23768 1

已選擇6行。

根據ASM例項上V$SESSION上的PROCESS列,找到連線到ASM例項的作業系統程式。上面的查詢列出了所有非ASM例項後臺程式對應的作業系統程式。可以看到,絕大部分程式是由於1996919995兩個程式造成的。

下面依次分析這些程式:

$ ps -ef|grep 11234|grep -v grep
oracle 11246 11234 0 18:21:23 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11234 8887 0 18:21:23 pts/1 0:00 sqlplus / as sysdba
$ ps -ef|grep 19620|grep -v grep
oracle 19620 1 0 Jun 01 ? 0:06 ora_asmb_testrac1

這兩個程式一個是透過SQLPLUS連線到ASM例項的程式,也就是當前查詢的後臺程式,另一個是節點testrac1連線到ASM例項的程式。

$ ps -ef|grep 23709|grep -v grep
oracle 23709 1 0 Jun 07 ? 0:56 oracletestrac1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ ps -ef|grep 23768|grep -v grep
oracle 23768 1 0 Jun 07 ? 0:39 oracletestrac1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 6 11 18:29:41 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> set pages 100 lines 120
SQL> col program format a25
SQL> col module format a25
SQL> col event format a30
SQL> col client_info format a20
SQL> select s.program, module, client_info, event from v$session s, v$process p where paddr = p.addr and spid in (23709, 23768);

PROGRAM MODULE CLIENT_INFO EVENT
------------------------- --------------------- ----------------- ----------------------
rman@racnode1 (TNS V1-V3) backup full datafile rman channel=c1 control file parallel write
rman@racnode1 (TNS V1-V3) backup full datafile rman channel=c2 control file parallel write

這兩個會話是當時備份出錯後,很久無法結束,透過CTRL+強制結束的,導致會話異常結束,Oracle這裡並沒有將資源完全釋放。

不過這兩個會話應該不是問題的關鍵,程式1996919995兩個程式在ASM例項上佔用了18個會話,下面看看這兩個例項在做什麼:

$ ps -ef|grep 19969|grep -v grep
oracle 19969 1 0 Jun 01 ? 0:48 ora_arc0_testrac1
$ ps -ef|grep 19995|grep -v grep
oracle 19995 1 0 Jun 01 ? 0:18 ora_arc1_testrac1

居然是Oracle的歸檔程式,回想最開始錯誤產生的時候,就是歸檔程式出現的問題,那麼基本上確定,問題是由於歸檔無法建立新的檔案導致問題的產生。

檢查例項1上的alert檔案,發現問題產生的時候出現大量下面的錯誤:

ORA-19504: failed to create file "+DISK/testrac/1_307_618591128.dbf"
ORA-17502: ksfdcre:4 Failed to create file +DISK/testrac/1_307_618591128.dbf
ORA-15041: diskgroup space exhausted
Sat Jun 9 10:08:04 2007
ARC1: Error 19504 Creating archive log file to '+DISK/testrac/1_307_618591128.dbf'
ARCH: Archival stopped, error occurred. Will continue retrying
Sat Jun 9 10:08:04 2007
ORACLE Instance testrac1 - Archival Error
Sat Jun 9 10:08:04 2007
ORA-16038: log 1 sequence# 307 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:04 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc1_19995.trc:
ORA-16038: log 1 sequence# 307 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:28 2007
ARCH: Archival stopped, error occurred. Will continue retrying
Sat Jun 9 10:08:28 2007
ORACLE Instance testrac1 - Archival Error
Sat Jun 9 10:08:28 2007
ORA-16014: log 1 sequence# 307 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:28 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-16014: log 1 sequence# 307 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:13:28 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc1_19995.trc:
ORA-19504: failed to create file "+DISK/testrac/1_307_618591128.dbf"
ORA-17502: ksfdcre:4 Failed to create file +DISK/testrac/1_307_618591128.dbf
ORA-15041: diskgroup space exhausted

顯然錯誤是從ASM磁碟組沒有空間,導致歸檔失敗開始的。檢查失敗歸檔的程式號,就是上面出現問題的1996919995。顯然,Oracle在不斷重試歸檔的過程中,出現了一些錯誤,導致歸檔程式連線到ASM例項的會話沒有釋放,進而導致了下面的錯誤:

Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '+DISK/testrac/onlinelog/group_2.260.618591151'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_2.260.618591151
ORA-00020: maximum number of processes () exceeded

Tue Jun 12 09:29:04 2007
Unexpected communication failure with ASM instance:
ORA-00020: maximum number of processes () exceeded
Unexpected communication failure with ASM instance:
ORA-00020: maximum number of processes () exceeded
Tue Jun 12 09:29:05 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_1.259.618591145
ORA-00020: maximum number of processes () exceeded
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_1.258.618591139
ORA-00020: maximum number of processes () exceeded

解決問題的最好方法是重啟資料庫例項和ASM,如果暫時無法重啟,可以使用第一篇文章給出的方法,另外建立一個歸檔目的地。

或者採用下面的方法清楚異常的會話和程式:

SQL> select count(*) from v$session;

COUNT(*)
----------
38

SQL> select 'ALTER SYSTEM KILL SESSION ''' || SID || ', ' || SERIAL# || ''';'
2 from v$session where process in (19995, 19969);

'ALTERSYSTEMKILLSESSION'''||SID||','||SERIAL#||''';'
--------------------------------------------------------------------------------
ALTER SYSTEM KILL SESSION '12, 4';
ALTER SYSTEM KILL SESSION '34, 475';
ALTER SYSTEM KILL SESSION '14, 1';
ALTER SYSTEM KILL SESSION '15, 1';
ALTER SYSTEM KILL SESSION '16, 1';
ALTER SYSTEM KILL SESSION '17, 1';
ALTER SYSTEM KILL SESSION '18, 3';
ALTER SYSTEM KILL SESSION '19, 1';
ALTER SYSTEM KILL SESSION '20, 1';
ALTER SYSTEM KILL SESSION '21, 1';
ALTER SYSTEM KILL SESSION '22, 1';
ALTER SYSTEM KILL SESSION '23, 1';
ALTER SYSTEM KILL SESSION '24, 1';
ALTER SYSTEM KILL SESSION '25, 2';
ALTER SYSTEM KILL SESSION '26, 1';
ALTER SYSTEM KILL SESSION '27, 4';
ALTER SYSTEM KILL SESSION '28, 148';
ALTER SYSTEM KILL SESSION '33, 7';
ALTER SYSTEM KILL SESSION '13, 1';

已選擇19行。

SQL> ALTER SYSTEM KILL SESSION '12, 4';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '34, 475';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '14, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '15, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '16, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '17, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '18, 3';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '19, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '20, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '21, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '22, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '23, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '24, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '25, 2';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '26, 1';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '27, 4';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '28, 148';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '33, 7';

系統已更改。

SQL> ALTER SYSTEM KILL SESSION '13, 1';

系統已更改。

並透過作業系統命令清楚剛才RMAN失敗造成的兩個後臺程式:

bash-2.03$ kill -9 23709
bash-2.03$ kill -9 28631

無論是清除Oracle例項的後臺程式23709,還是清除23768程式對應的ASM例項後臺程式28631都可以達到相同的目的。

等待一段時間,Oracle自動清除所有的後臺程式,資料庫恢復正常。

bash-2.03$ ps -ef|grep ASM|grep -v grep|wc -l
18

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/4227/viewspace-69308/,如需轉載,請註明出處,否則將追究法律責任。

相關文章