Analyzing a database hang during a data import

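I. Case description
A customer recently upgraded their database. The database itself is not large and there was a sufficient downtime window, so we took the safest approach: export with expdp, then import with impdp. During the import, however, the tablespace free space showed no movement at all each time we checked it.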
SQL> select tablespace_name,sum(bytes/1024/1024/1024) gbytes from dba_free_space group by tablespace_name;

TABLESPACE_NAME                    GBYTES
------------------------------ ----------
TEST_INDEX1                    20.1948853
SYSAUX                         1.39544678
UNDOTBS1                       15.4883423
USERS                          .003601074
SYSTEM                         2.24859619
TEST_DATA1                     24.4277954

6 rows selected.

SQL> /

TABLESPACE_NAME                    GBYTES
------------------------------ ----------
TEST_INDEX1                    20.1948853
SYSAUX                         1.39544678
UNDOTBS1                       15.4883423
USERS                          .003601074
SYSTEM                         2.24859619
TEST_DATA1                     24.4277954

6 rows selected.

The tablespace figures never changed, which seemed odd, so we checked the system load once more:

[root@testdb ~]# top

top - 08:09:36 up 15:44, 3 users, load average: 0.23, 0.22, 0.20
Tasks: 662 total,   1 running, 661 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49428444k total, 25010496k used, 24417948k free,   133088k buffers
Swap: 41943032k total,        0k used, 41943032k free,   907668k cached

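There is no load, and the I/O that was present earlier has stopped, yet our background import job has not finished. Suspecting a problem inside the database, we analyzed it with hanganalyze.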
SQL> oradebug setmypid;
Statement processed.
SQL> oradebug hanganalyze 3;
Hang Analysis in /u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ora_11425.trc

[oracle@testdb ~]$ more /u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ora_11425.trc
Trace file /u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ora_11425.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db1
System name:    Linux
Node name:      testdb
Release:        2.6.32-358.el6.x86_64
Version:        #1 SMP Fri Feb 22 00:31:26 UTC 2013
Machine:        x86_64
VM name:        VMWare Version: 6
Instance name: testdb
Redo thread mounted by this instance: 1
Oracle process number: 44
Unix process pid: 11425, image: oracle@testdb (TNS V1-V3)

*** 2013-12-05 09:35:21.247
*** SESSION ID:(1283.3) 2013-12-05 09:35:21.247
*** CLIENT ID:() 2013-12-05 09:35:21.247
*** SERVICE NAME:(SYS$USERS) 2013-12-05 09:35:21.247
*** MODULE NAME:(sqlplus@testdb (TNS V1-V3)) 2013-12-05 09:35:21.247
*** ACTION NAME:() 2013-12-05 09:35:21.247

Processing Oradebug command 'setmypid'

*** 2013-12-05 09:35:21.256
Oradebug command 'setmypid' console output:

*** 2013-12-05 09:35:31.208
Processing Oradebug command 'hanganalyze 3'

*** 2013-12-05 09:35:31.722
===============================================================================
HANG ANALYSIS:
instances (db_name.oracle_sid): testdb.testdb
oradebug_node_dump_level: 3
analysis initiated by oradebug
os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 09:35:30 ]
      NOTE: scheduling delay has not been sampled for 0.724575 secs
    0.000000 secs from [ 09:35:26 - 09:35:32 ], 5 sec avg
    0.000000 secs from [ 09:34:32 - 09:35:32 ], 1 min avg
    0.000000 secs from [ 09:30:32 - 09:35:32 ], 5 min avg
vktm time drift history
===============================================================================

Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'control file sequential read'<='log file switch (archiving needed)'<='buffer busy waits'
     Chain 1 Signature Hash: 0x2ad21449
[b] Chain 2 Signature: 'control file sequential read'<='log file switch (archiving needed)'<='buffer busy waits'
     Chain 2 Signature Hash: 0x2ad21449
[c] Chain 3 Signature: 'control file sequential read'<='log file switch (archiving needed)'
     Chain 3 Signature Hash: 0x7df7ca49

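The summary at the end of this header describes Chain 1: the buffer busy waits sit behind a log file switch that cannot complete because archiving is needed. The system has no I/O problem and no other load, and the later part of the trace shows that 'log file switch (archiving needed)' had been waiting for 15 minutes. It is therefore almost certain that the redo logs could not be archived, so the log switch stalled and every session that needed to write was stuck behind it.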
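To make the reading of a chain signature concrete, here is a minimal Python sketch that splits a signature line into its wait events. It assumes the usual reading in which the leftmost event belongs to the session at the head of the chain (the final blocker) and each `<=` steps to a session waiting behind it. The function name `parse_chain_signature` is made up for this illustration and is not part of any Oracle tool.

```python
import re

def parse_chain_signature(line):
    """Return the wait events of a 'Chain N Signature' line, blocker first."""
    # Everything after 'Signature:' is a list of quoted events joined by '<='.
    _, _, signature = line.partition("Signature:")
    return re.findall(r"'([^']*)'", signature)

# Chain 1 from the trace above, copied verbatim.
line = ("[a] Chain 1 Signature: 'control file sequential read'"
        "<='log file switch (archiving needed)'<='buffer busy waits'")

events = parse_chain_signature(line)
for depth, event in enumerate(events):
    # Indentation grows with distance from the head of the chain.
    print("  " * depth + event)
```

Read this way, the sessions in 'buffer busy waits' are at the tail of the chain, and the log switch that needs archiving sits between them and the head.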
Check the archive log destination:

SQL> select name,total_mb,free_mb from v$asm_diskgroup;

NAME TOTAL_MB FREE_MB
------------------------------ ---------- ----------
ARCHDG 51199 938

Very little free space is left for archived logs (938 MB), while our redo logs are sized as follows:
SQL> select group#,bytes/1024/1024 mbytes from v$log;

    GROUP#     MBYTES
---------- ----------
         1       1024
         2       1024
         3       1024
         4       1024
         5       1024
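Each redo log is 1 GB, which is more than the 938 MB still free in ARCHDG, so the hang was caused by insufficient archive space.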
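The comparison behind that conclusion is simple arithmetic, sketched below in Python with the two numbers taken from the query output above. It assumes the worst case, in which an archived log is as large as the full online redo log, and it treats FREE_MB as fully usable, which may not hold for a disk group with redundancy. The function names are made up for this illustration.

```python
def can_archive_next_log(free_mb, log_mb):
    """True if the archive destination can hold one full-size archived log."""
    return free_mb >= log_mb

def shortfall_mb(free_mb, log_mb):
    """How many MB are missing for one full-size archived log (0 if none)."""
    return max(0, log_mb - free_mb)

free_mb = 938   # FREE_MB of ARCHDG from v$asm_diskgroup
log_mb = 1024   # size of each redo log group from v$log

print(can_archive_next_log(free_mb, log_mb))  # False
print(shortfall_mb(free_mb, log_mb))          # 86
```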

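Solution: start a backup that backs up the archived logs and deletes them from the archive destination. Once the space was freed, the import job continued.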

II. Notes on hanganalyze

1. hanganalyze can currently be invoked at three scopes:

1.1 Session level:

SQL> alter session set events 'immediate trace name hanganalyze level 3';

Session altered.

SQL> oradebug tracefile_name;
/u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ora_17310.trc

1.2 Instance level:

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug hanganalyze 3;

1.3 Cluster-wide:

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug setinst all
Statement processed.
SQL> oradebug -g def hanganalyze 3;
Hang Analysis in /u01/app/oracle/admin/ncerp/bdump/ncerp1_diag_20154.trc

2. hanganalyze column descriptions

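sid is the session ID
sess_srno is the serial#
proc_ptr is the process pointer
ospid is the OS process ID
cnode is the node ID (available from Oracle9i on)
nodenum is a number that hanganalyze assigns on its own to keep track of these sessions, counting from 0
state is the state of the node
adjlist lists the adjacent nodes (usually a blocker node)
predecessor is the predecessor node, usually a waiter node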

For example:

([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[780]/ 0/ 781/ 13388/0x7645ac60/15488/ IGN/ 1/ 2/ /none
[781]/ 0/ 782/ 38575/0x75460770/15494/NLEAF/ 3/ 4/ [832]/184

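In the output above, NLEAF generally marks a session that is blocked. Node 781 is therefore itself waiting: its adjlist names node 832 as its blocker, and its predecessor, node 184, is in turn waiting on 781.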
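The column layout can be applied mechanically. The following Python sketch splits one node line on `/` and labels the pieces with the names from the header line shown above. It assumes the eleven-field layout of this particular listing (other Oracle versions may print different columns), and `parse_node` is a name invented for this illustration.

```python
import re

# Column names in the order given by the header line of the listing.
FIELDS = ["nodenum", "cnode", "sid", "sess_srno", "session", "ospid",
          "state", "start", "finish", "adjlist", "predecessor"]

def parse_node(line):
    """Turn one hanganalyze node line into a dict keyed by column name."""
    values = [v.strip() for v in line.split("/")]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of fields: %d" % len(values))
    node = dict(zip(FIELDS, values))
    # '[781]' -> 781, and '[832]' -> [832] (an empty adjlist becomes []).
    node["nodenum"] = int(node["nodenum"].strip("[]"))
    node["adjlist"] = [int(n) for n in re.findall(r"\d+", node["adjlist"])]
    return node

node = parse_node("[781]/ 0/ 782/ 38575/0x75460770/15494/NLEAF/ 3/ 4/ [832]/184")
print(node["nodenum"], node["state"], node["adjlist"], node["predecessor"])
# 781 NLEAF [832] 184
```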

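Meaning of the state values:
2.1 IN_HANG: the node is deadlocked; usually other nodes (blockers) are in this state as well.
2.2 LEAF/LEAF_NW: the node is usually a blocker. The "predecessor" column of its entry shows whether it really is one. LEAF means the node is not waiting for any other resource, while LEAF_NW means it is either not waiting for any other resource or is running on CPU.
2.3 NLEAF: these sessions can usually be regarded as blocked. This state generally points to a database performance problem rather than a database hang.
2.4 IGN/IGN_DMP: these sessions are usually treated as idle unless their adjlist column contains a node. If such a session is not idle, the node in its adjlist is waiting for another node to release a resource.
2.5 SINGLE_NODE/SINGLE_NODE_NW: close to an idle session.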
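As a rough illustration of rule 2.2, the Python sketch below picks out nodes that are in a LEAF state and have a predecessor, which are the candidates for being a blocker. The input is a list of plain dicts; the entry for node 832 is hypothetical, since its line does not appear in the listing above, and `candidate_blockers` is a name invented for this example.

```python
BLOCKER_STATES = {"LEAF", "LEAF_NW"}

def candidate_blockers(nodes):
    """Nodes in a LEAF state that at least one other node is waiting on."""
    return [n for n in nodes
            if n["state"] in BLOCKER_STATES and n["predecessor"] != "none"]

nodes = [
    {"nodenum": 780, "state": "IGN", "predecessor": "none"},
    {"nodenum": 781, "state": "NLEAF", "predecessor": "184"},
    # Hypothetical entry: node 832 is not shown in the sample listing.
    {"nodenum": 832, "state": "LEAF", "predecessor": "781"},
]

for n in candidate_blockers(nodes):
    print(n["nodenum"])  # 832
```

This is only a first filter: a LEAF_NW node may simply be busy on CPU, so the trace details for each candidate still need to be read.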

Source: ITPUB blog, link: http://blog.itpub.net/21752515/viewspace-1062853/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
