曾經在的第一次活動中提到----AWR報告裡的每一項內容都值得我們認真研究，每一項內容都值得我們仔細關注。

深以為然，但有時候在實際的解決問題的過程中，我們並不需要去關注AWR報告裡的每一項內容，往往AWR報告裡的一到兩項內容就能夠幫助我們迅速定位問題。

我們來看一個僅僅利用AWR報告裡的兩項內容來定位並解決paging space被撐爆的例子。

整個事情的背景是這樣的：

我們的一個庫連續三天，基本上是在相同的時間，alert log裡都報了這樣的錯：

10月11日晚上9點26分：

Mon Oct 11 21:26:48 2010

Process startup failed, error stack:

Mon Oct 11 21:26:48 2010

Errors in file /caipratrc/app/oracle/admin/ipraca/bdump/ipraca_psp0_74076.trc:

ORA-27300: OS system dependent operation:fork failed with status: 12

ORA-27301: OS failure message: Not enough space

ORA-27302: failure occurred at: skgpspawn3

10月12日晚上9點26分：

Tue Oct 12 21:26:41 2010

Process startup failed, error stack:

Tue Oct 12 21:26:41 2010

Errors in file /caipratrc/app/oracle/admin/ipraca/bdump/ipraca_psp0_74076.trc:

ORA-27300: OS system dependent operation:fork failed with status: 12

ORA-27301: OS failure message: Not enough space

ORA-27302: failure occurred at: skgpspawn3

10月13日晚上9點37分：

Wed Oct 13 21:37:39 2010

Process startup failed, error stack:

Wed Oct 13 21:37:39 2010

Errors in file /caipratrc/app/oracle/admin/ipraca/bdump/ipraca_psp0_74076.trc:

ORA-27300: OS system dependent operation:fork failed with status: 12

ORA-27301: OS failure message: Not enough space

ORA-27302: failure occurred at: skgpspawn3

剛好13號晚上9點37分的時候我在現場，當時我觀察到的症狀就是上述庫所在的database server突然失去了響應，telnet已經連不上了。大概5分鐘後又重新可以telnet上，且此時變得一切正常，只是alert log裡多了上述的錯誤記錄。

好了，背景已經交待清楚，我們現在來開始定位問題和解決問題的過程：

MOS上文件Troubleshooting ORA-27300 ORA-27301 ORA-27302 errors [ID 579365.1]明確指出，當如下內容同時出現的時候，

ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3

即表明swap space已經耗盡了，其原話是這麼說的：

** All related to swap space being depleted. Check the OS system logs. **

於是我們現在來檢查OS system log：

10月11日晚上9點28分：

LABEL: PGSP_KILL

IDENTIFIER: C5C09FFA

Date/Time: Mon Oct 11 21:28:32 GMT+08:00 2010

Sequence Number: 245

Machine Id: 0001DA17D600

Node Id: P570_04_LE

Class: S

Type: PERM

WPAR: Global

Resource Name: SYSVMM

Description

SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes

SYSTEM RUNNING OUT OF PAGING SPACE

Failure Causes

INSUFFICIENT PAGING SPACE DEFINED FOR THE SYSTEM

PROGRAM USING EXCESSIVE AMOUNT OF PAGING SPACE

Recommended Actions

DEFINE ADDITIONAL PAGING SPACE

REDUCE PAGING SPACE REQUIREMENTS OF PROGRAM(S)

Detail Data

PROGRAM

oracle

USER'S PROCESS ID:

291178

PROGRAM'S PAGING SPACE USE IN 1KB BLOCKS

9873556

10月12日和10月13日的OS system log與上述內容類似，在此不再贅述。

從上面內容中我們可以看出如下幾點：

1、確實是oracle的某個程式（程式號為291178）佔用了大量的paging space，這裡顯示291178佔用了9873556K的paging space，即接近10個G（上述庫所在的database server的paging space只有12個G）。

2、這個程式最後異常退出了----SOFTWARE PROGRAM ABNORMALLY TERMINATED。

看到這裡我們已經可以解釋為什麼上述庫所在的database server在出現問題的時間段會短暫的失去響應，隨後會恢復正常----因為oracle的某個process幾乎耗盡了所有的paging space，而隨著這個process的異常退出，其所佔用的paging space得到了釋放。

根據上述內容，我判斷：

1、就這個庫而言，SGA大量佔用paging space的可能性不大，因為buffer cache、shared pool都會有LRU演算法來實現age out。

2、最有可能大量佔用paging space的就是PGA，上述庫的pga_aggregate_target為1個G：

SQL> show parameter pga;

NAME TYPE VALUE

------------------------------------ ----------- ------------------------------

pga_aggregate_target big integer 1000M

這樣，一旦某個server process的session cursor佔用的記憶體超過1個G，就會用到paging space了。

現在的問題是：

1、這個process到底在做什麼？為什麼其會佔用接近10個G的paging space？

2、 為什麼每次出現paging space耗盡錯誤的時間點基本上都相同？

這就需要藉助AWR報告了。

基於上述判斷，我認為只需要分別採集10月11日、12日、13日三天的21點至22點的AWR報告就能印證我的上述判斷並且能找到這個process當時到底在做什麼。

因為上述process最後異常退出了，所以我們只需關注AWR報告裡所有executions為0的過程；驗證是否是server process的session cursor佔用了大量的paging space只需關注AWR報告裡PGA的使用趨勢。

在10月11日21點至22點的AWR報告中我們看到：

Elapsed Time (s)	CPU Time (s)	Executions	Elap per Exec (s)	% Total DB Time	SQL Text
1,315	183	0		12.78	Begin P_MCH_DAILY_JOB(:v0, :v...
1,238	170	5	247.62	12.03	delete from mchtkc where rowid...
612	41	1	612.20	5.95	Begin P_IWB_SALNOARRIVE(:v0, ...
499	14	1	498.56	4.85	select ipdprf, ipdfrm, ipdtk...
172	3	1	172.20	1.67	Begin P_SAL_SMARTRIX_SLT(:v0, ...
124	13	1	124.39	1.21	Begin P_COMMON_WORK_COUNT(:v0,...
87	1	1	87.26	0.85	call ORACLE_OCM.MGMT_CONFIG.co...
76	0	1	76.14	0.74	Begin P_IWB_AUTOASSIGN_COR(:v0...

	PGA Aggr Target(M)	Auto PGA Target(M)	PGA Mem Alloc(M)	W/A PGA Used(M)	%PGA W/A Mem	%Auto W/A Mem	%Man W/A Mem	Global Mem Bound(K)
B	1,000	63	10,977.24	0.00	0.00	0.00	0.00	102,400
E	1,000	842	170.18	0.00	0.00	0.00	0.00	102,400

上述結果顯示在10月11日21點至22點唯一沒有執行完的就是P_MCH_DAILY_JOB，在21點，上述庫的PGA佔用量是10977.24M（即10個G），但到了22點，上述庫的PGA佔用量就變成了170.18M。

在10月12日、10月13日21點至22點的AWR報告中我們都看到了相類似的結果，在此不再贅述。

上述問題時間段的AWR報告有力的驗證了我們的判斷，而且，現在所有的矛頭都指向了儲存過程P_MCH_DAILY_JOB。

我們現在來看看P_MCH_DAILY_JOB到底做了什麼事情。

實際上，P_MCH_DAILY_JOB的邏輯是非常複雜的，但是非常幸運的是，出現問題的程式碼在這個儲存過程的最開頭。

首先我們注意到P_MCH_DAILY_JOB的定義部分定義了一個urowid的陣列rids和一個數值型的變數n_index：

--------------------------Begin MCHTKT 定義------------------------------

type typ_mtkprf is table of mchtkt.mtkprf%type index by binary_integer;

mtkprfs typ_mtkprf;

type typ_mtkfrm is table of mchtkt.mtkfrm%type index by binary_integer;

mtkfrms typ_mtkfrm;

type typ_mtktkt is table of mchtkt.mtktkt%type index by binary_integer;

mtktkts typ_mtktkt;

type typ_mchtkt_rid is table of urowid index by binary_integer;

mchtkt_rids typ_mchtkt_rid;

cur_mchtkt sys_refcursor;

vc_mchtkt_sql varchar2(4000) := 'select mtkprf,mtkfrm,mtktkt,rowid

from mchtkt

where mtkmts = :1';

--------------------------End MCHTKT 定義------------------------------

--------------------------Begin MCHTKC 定義------------------------------

type typ_mchtkc_rid is table of urowid index by binary_integer;

mchtkc_rids typ_mchtkc_rid;

type typ_rid is table of urowid index by binary_integer;

rids typ_rid;

n_index number;

cur_mchtkc sys_refcursor;

vc_mchtkc_sql varchar2(4000) := 'select rowid from mchtkc where mtcprf = :1 and mtcfrm = :2 and mtctkt = :3';

--------------------------End MCHTKC 定義------------------------------

接著，在P_MCH_DAILY_JOB實際程式碼的最開始，出現了這樣的程式碼：

----------Begin 刪除mchtkt中Ticket配比狀態為'TF'的所有ticket記錄及其相關聯的mchtkc中的所有coupon記錄-----------

open cur_mchtkt for vc_mchtkt_sql using 'TF';

loop

fetch cur_mchtkt bulk collect into mtkprfs,mtkfrms,mtktkts,mchtkt_rids limit CN_BATCH_SIZE;

for i in 1 .. mchtkt_rids.count loop

open cur_mchtkc for vc_mchtkc_sql using mtkprfs(i),mtkfrms(i),mtktkts(i);

fetch cur_mchtkc bulk collect into mchtkc_rids;

close cur_mchtkc;

for j in 1 .. mchtkc_rids.count loop

n_index := n_index + 1;

rids(n_index) := mchtkc_rids(j);

end loop;

forall i in 1 .. rids.count

execute immediate 'delete from mchtkc where rowid = :1 ' using rids(i);

forall i in 1 .. mchtkt_rids.count

execute immediate 'delete from mchtkt where rowid = :1 ' using mchtkt_rids(i);

exit when mchtkt_rids.count < CN_BATCH_SIZE;

commit;

end loop;

close cur_mchtkt;

----------End 刪除mchtkt中Ticket配比狀態為'TF'的所有ticket記錄及其相關聯的mchtkc中的所有coupon記錄-----------

注意到上述程式碼犯了兩個錯誤，其中的第一個錯誤是致命的：

1、沒有在內層迴圈執行完畢後重置n_index的值，所以n_index的值會越來越大，同時陣列rids也會越來越大，rids會大到什麼程度呢？我們只需將其對應cursor的sql執行一下就知道結果了：

SQL> select count(*) from mchtkt where mtkmts = 'TF';

COUNT(*)

----------

4365367

所以極限情況下rids會是一個包含436萬item的陣列----這就解釋了為什麼會佔用接近10個G的paging space。

2、上述程式碼會接著執行刪除mchtkt和mchtkc的對應記錄的操作，其初衷是好的，採用了分步commit的方式----標準的處理海量資料的方法。但是其commit語句擺的地方不對，上述commit語句在exit when mchtkt_rids.count < CN_BATCH_SIZE和end loop之間，這樣這個commit只會在rids的item到了4365367後才會執行，而還沒等到執行這一步，paging space就被耗盡，執行上述P_MCH_DAILY_JOB的server process就會異常退出，於是oracle會rollback之前做的delete操作，所以mchtkt的滿足上述cursor條件的記錄的數量在10月11日、10月12日、10月13日始終都會是4365367，就好像從來沒有對mchtkt執行過delete操作一樣，我們來驗證一下：

SQL> select count(*) from mchtkt as of timestamp to_timestamp('2010-10-13 18:00:00','YYYY-MM-DD HH24:MI:SS') where mtkmts = 'TF';

COUNT(*)

----------

4365367

SQL> select count(*) from mchtkt as of timestamp to_timestamp('2010-10-12 18:00:00','YYYY-MM-DD HH24:MI:SS') where mtkmts = 'TF';

COUNT(*)

----------

4365367

SQL> select count(*) from mchtkt as of timestamp to_timestamp('2010-10-11 18:00:00','YYYY-MM-DD HH24:MI:SS') where mtkmts = 'TF';

COUNT(*)

----------

4365367

同時這也就解釋了為什麼每次出現paging space耗盡錯誤的時間點基本上都相同，因為P_MCH_DAILY_JOB是在一個大的每天都會執行的定時作業裡，這幾天P_MCH_DAILY_JOB的開始執行時間基本上都相同，又因為mchtkt的資料量始終都是4365367且上述有問題的程式碼是在儲存過程P_MCH_DAILY_JOB實際執行程式碼的最開始。

分析清楚了原因，解決問題就變得非常簡單了，只需把上述有問題的程式碼改成如下所示就可以了：

open cur_mchtkt for vc_mchtkt_sql using 'TF';

loop

fetch cur_mchtkt bulk collect into mtkprfs,mtkfrms,mtktkts,mchtkt_rids limit CN_BATCH_SIZE;

for i in 1 .. mchtkt_rids.count loop

open cur_mchtkc for vc_mchtkc_sql using mtkprfs(i),mtkfrms(i),mtktkts(i);

fetch cur_mchtkc bulk collect into mchtkc_rids;

close cur_mchtkc;

for j in 1 .. mchtkc_rids.count loop

n_index := n_index + 1;

rids(n_index) := mchtkc_rids(j);

end loop;

forall i in 1 .. n_index

execute immediate 'delete from mchtkc where rowid = :1 ' using rids(i);

forall i in 1 .. mchtkt_rids.count

execute immediate 'delete from mchtkt where rowid = :1 ' using mchtkt_rids(i);

commit;

n_index := 0;

exit when mchtkt_rids.count < CN_BATCH_SIZE;

end loop;

close cur_mchtkt;

我在10月14日修改完上述程式碼，今天早上我來看上述庫的alert log，上述錯誤不再重現，即這個問題已經被我成功解決。

轉帖eygle：利用AWR報告解決paging space被撐爆的例子

相關文章