一次系統卡住不動的故障排查

今天，我在oracle 的scott使用者下，執行了 create table emp1 as select * from emp 這條語句後，資料庫就卡住不動了。然後我再開個新的會話，在scott使用者下執行select * from tab後,也是卡住了。

於是我用v$sql和v$session 聯合查詢，定位出哪個會話在執行哪些操作,消耗多長時間：

SQL> selectb.sid,a.sql_text,b.status,b.last_call_et,b.event from v$sql a,v$session b wherea.sql_id=b.sql_id;

SID SQL_TEXT STATUS LAST_CALL_ET EVENT

---------- ------------------------------------------------------------------------------ ----------------------------------------------------------------------------

129 select b.sid,a.sql_text,b.status,b.last_call_et,b.event from v$sqla,v ACTIVE 0 SQL*Net messageto client

$session b where a.sql_id=b.sql_id

131 select * from tab ACTIVE 3 buffer busywaits

152 create table emp1 as select * from emp ACTIVE 9101 flashback buf free by RVWR

透過上面可以看到，status 列為active說明當前的sql的語句正在執行，會話處於活躍狀態，last_cakk_et=9101說明這條SQL語句已經執行了9101秒，當前會話正在處於等待狀態，等待的事件為flashback buf free by RVWR。

（注：可以透過下面這條命令定位更加詳細的資訊，如客戶端ip等。

select b.sid,b.machine,b.program,b.module,a.sql_text,b.status,b.last_call_et,b.eventfrom v$sql a,v$session b where a.sql_id=b.sql_id;）

透過百度，得知flashback buf free by RVWR等待事件可能是由於flashbackbuffer空間緊張。

下面檢視資料庫top 5事件：

SQL> selectrownum,sql_id,sql_text,optimizer_cost from (select sql_id,sql_text,optimizer_costfrom v$sqlarea order by 3 desc) whererownum<5 ;

ROWNUM SQL_ID SQL_TEXT OPTIMIZER_COST

---------- -------------------------------------------------------------------------------------------------

1 gfjvxb25b773h select o.owner#,o.obj#,decode(o.linkname,null,decode(u.name,null,'SYS 2427

',u.name),o.remoteowner),o.name,o.linkname,o.namespace,o.subname from

user$ u, obj$ o whereu.user#(+)=o.owner# and o.type#=:1 and not exis

ts (select p_obj# fromdependency$ where p_obj# = o.obj#) order by o.o

bj# for update

2 63fyqfhnd7u5k select OBJOID, CLSOID, RUNTIME, PRI, JOBTYPE, SCHLIM, WT, INST, RU 288

NNOW, ENQ_SCHLIM from( select a.obj# OBJOID, a.class_oid CLSOID, d

ecode(bitand(a.flags, 16384), 0,a.next_run_date, a.last_enabled_time)

RUNTIME, (2*a.priority + decode(bitand(a.job_status, 4), 0, 0,

decode(a.running_instance, :1, -1, 1))) PRI, 1 JOBTYPE,

ROWNUM SQL_ID SQL_TEXT OPTIMIZER_COST

---------- -------------------------------------------------------------------------------------------------

a.schedule_limitSCHLIM, a.job_weight WT, decode(a.running_instanc

e, NULL, 0,a.running_instance) INST, decode(bitand(a.flags, 16384)

, 0, 0, 1)RUNNOW, decode(bitand(a.job_status, 8388608),0, 0, 1) E

NQ_SCHLIM from sys.scheduler$_job a where bitand(a.job_status, 515)

= 1 and ((bitand(a.flags, 134217728 +268435456) = 0) or (b

itand(a.job_status,1024) <> 0)) and bitand(a.flags,4096) = 0 a

nd (a.next_run_date<= :2 or bitand(a.flags, 16384)<> 0) and

(a.class_oid isnull or (a.class_oid is notnull and a.clas

s_oid in (selectb.obj# from sys.scheduler$_class b

where b.affin

ROWNUM SQL_ID SQL_TEXT OPTIMIZER_COST

---------- -------------------------------------------------------------------------------------------------

3 cfz686a6qp0kg select o.obj#, u.name, o.name, t.spare1, DECODE(bitand(t.flags 207

, 268435456),268435456, t.initrans, t.pctfree$) from sys.obj$ o, s

ys.user$ u, sys.tab$t where (bitand(t.trigflag, 1048576) = 104857

6) and o.obj#=t.obj# and o.owner# = u.user#

4 b7jn4mf49n569 select o.name, u.name from obj$ o, type$ t, user$ u where o.oid$ = t. 185

tvoid andu.user#=o.owner# and bitand(t.properties,8388608) = 8388608

and (sysdate-o.ctime)> 0.0007

5 767pug2dbpqpc select * from tab 184

6 8ypwcums111db select distinct (case when cd.dbid = wr.dbid and 152

透過top 5事件沒看出什麼問題。

再看一下v$lock，檢查資料庫是否有鎖表的問題：

SQL> select sid,type,id1,id2,lmode,request,blockfrom v$lock;

SID TY ID1 ID2 LMODE REQUEST BLOCK

---------- -- ---------- -------------------- ---------- ----------

165 XR 4 0 1 0 0

165 CF 0 0 2 0 0

165 RS 25 1 2 0 0

161 FD 1 0 6 0 0

166 RT 1 0 6 0 0

167 PW 1 0 3 0 0

161 FD 0 0 6 0 0

164 TS 3 1 3 0 0

164 US 1 0 6 0 0

167 MR 1 0 4 0 0

167 MR 2 0 4 0 0

SID TY ID1 ID2 LMODE REQUEST BLOCK

---------- -- ---------- -------------------- ---------- ----------

167 MR 3 0 4 0 0

167 MR 4 0 4 0 0

167 MR 5 0 4 0 0

167 MR 201 0 4 0 0

164 TT 1 16 4 0 0

164 HW 1 8388617 6 0 0

164 CI 1 5 6 0 0

165 CI 1 1 4 0 0

160 TT 1 0 4 0 0

142 JQ 0 1 6 0 0

162 JS 0 0 4 0 0

SID TY ID1 ID2 LMODE REQUEST BLOCK

---------- -- ---------- -------------------- ---------- ----------

158 JS 0 1 6 0 1

162 JS 0 1 0 6 0

159 UL 7567243 0 6 0 0

159 JS 0 0 4 0 0

159 JS 0 1 0 6 0

142 TM 50288 0 2 0 0

142 TM 50291 0 3 0 0

158 TM 5123 0 3 0 0

透過上述可以看到，sid為158的會話block=1，說明它在阻塞其他的會話；sid為162的會話的request 為6，說明他正在請求lmode=6的會話，即sid=158的會話。

我的解決辦法：把那個會話kill掉。

關於一次系統卡住不動的故障排查

相關文章