環境：

Oracle 10g
AIX 6.1
HACMP 6

現象：

1.資料庫Down。
2.alert日誌中頻繁報ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool",...)錯誤。

3.資料庫每晚的邏輯匯出報錯：

TESTDB:/expbackup#>tail -200 backehr140322.log

EXP-00056: ORACLE error 1034 encountered

ORA-01034: ORACLE not available

ORA-27101: shared memory realm does not exist

IBM AIX RISC System/6000 Error: 2: No such file or directory

EXP-00005: all allowable logon attempts failed

EXP-00000: Export terminated unsuccessfully

即使是手動發起備份指令碼，也是報同樣的錯誤，備份無法進行，但是每晚的RMAN定時備份能正常進行，每天的備份集狀態正常。

分析：

Oracle的解釋：
[oracle@TESTDB ~]$ oerr ORA 4031
04031, 00000, "unable to allocate %s bytes of shared memory (\"%s\",\"%s\",\"%s\",\"%s\")"
// *Cause: More shared memory is needed than was allocated in the shared
// pool or Streams pool.
// *Action: If the shared pool is out of memory, either use the
// DBMS_SHARED_POOL package to pin large packages,
// reduce your use of shared memory, or increase the amount of
// available shared memory by increasing the value of the
// initialization parameters SHARED_POOL_RESERVED_SIZE and
// SHARED_POOL_SIZE.
// If the large pool is out of memory, increase the initialization
// parameter LARGE_POOL_SIZE.
// If the error is issued from an Oracle Streams or XStream process,
// increase the initialization parameter STREAMS_POOL_SIZE or increase
// the capture or apply parameter MAX_SGA_SIZE.

查閱了一些Oracle的資料，發現ORA-04031錯誤一般可能由於兩個原因引起的：
1.記憶體中存在大量碎片，導致在分配記憶體的時候，沒有連續的記憶體可存放，此問題一般是需要在開發的角度上入手，比如增加繫結變數，減少應解析來改善和避免；
2.記憶體容量不足，需要擴大記憶體。

Alert Log在資料庫Down之前的報了很多ORA-04031錯誤：
Sat Mar 22 19:00:09 2014
Errors in file /oracle/admin/oratest/bdump/oratest_j001_233886.trc:
ORA-12012: error on auto execute of job 8950
ORA-04031: unable to allocate ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select name,password,datats#...","Typecheck","kgghteInit")
ORA-06512: at "SYS.DBMS_SPACE", line 3289
ORA-06512: at line 1
ORA-00604: error occurred at recursive SQL level 5
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select name,password,datats#...","Typecheck","kgghteInit")
ORA-06512: at "SYS.PRVT_ADVISOR", line 1624
ORA-06512: at "SYS.DBMS_ADVISOR", line 186
ORA-06512: at "SYS.DBMS_SPACE", line 1500
ORA-06512: at "SYS.DBMS_SPACE", line 1566
bytes of shared memory ("","","","")
Sat Mar 22 20:28:28 2014
Errors in file /oracle/admin/oratest/bdump/orahtest_j000_278630.trc:
ORA-12012: error on auto execute of job 42781
ORA-04031: unable to allocate ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select x.inst_id,x.indx+1,ks...","Typecheck","kgghtInit")
ORA-06512: at "SYS.DBMS_ISCHED", line 3047
ORA-06512: at "SYS.DBMS_SCHEDULER", line 1926
ORA-06512: at "EXFSYS.DBMS_RLMGR_DR", line 3150
ORA-01403: no data found
ORA-06512: at line 1
bytes of shared memory ("","","","")
Sat Mar 22 20:55:46 2014
Errors in file /oracle/admin/oratest/bdump/oratest_reco_319580.trc:
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select host,userid,password,...","Typecheck","kgghteInit")
Sat Mar 22 20:55:46 2014
RECO: terminating instance due to error 4031
Instance terminated by RECO, pid = 319580

其中重點檢視導致Instance Down掉的這條日誌，是由於程式號為319580的RECO程式導致的，提示是無法在shared pool中分配4120bytes的記憶體塊，具體的資訊需要分析一下資料庫的跟蹤檔案：orahr_reco_319580.trc，在這裡，不要用tkprof之類的工具進行處理，直接用文字編輯器開啟分析，其中比較值得注意的地方有：
==============================
Memory Utilization of Subpool 1
================================
Allocation Name Size
_________________________ __________
"free memory " 4306672
"miscellaneous " 304
"Undo Meta Data " 144
"dpslut_kfdsg " 512
"File Space Usage " 11336
"trace buffer " 950272
"trace_knlasg " 504

由此看到，雖然free memory並不大，但是其確實大於報錯中提到的4120bytes，由此看來，有可能引起該報錯的原因有兩個，就是上文中提到的：

1.確實有碎片；2.記憶體可能不足。

在取資料分析的時候，正好是上班時間，因此只能透過一些臨時手段進行該問題的處理，比較有效的方法是：

刷共享池：alter system flush shared_pool;

作此操作一定要注意！千萬不要在業務繁忙的時候做，很有可能引起當機！可以在中午或者是下班時刻執行，執行之前，如果有條件，最好做一次全備，不論是邏輯匯出還是RMAN備份。

記憶體不足這個原因需要做進一步的分析，用資料說話，不能單憑感覺和經驗。

跟客戶和相關負責人瞭解了一下情況，得知客戶在不久前剛進行過一次資料庫的遷移，我感覺可能是由於在遷移過程中有地方疏忽造成的隱患。

由於之前幫客戶處理資料庫問題的時候，抓過資料庫的各種資訊，所以手頭上有當時資料庫的詳細狀態記錄；經檢視parameter後，發現，有很多引數跟遷移前不同，其中sga_target和sga_max_size這兩個值比以前小几十倍。

因此初步斷定，當前的記憶體確實可能較之前比非常不足，跟客戶反映後，客戶同意無業務時間進行引數修改和重啟資料庫的工作。

於是在下班後，將資料庫引數修改，命令如下：

alter system set sga_max_size=5000M scope=spfile;

alter system set sga_target=5000M scope=spfile;

重啟資料庫後，手動發起邏輯備份指令碼，發現執行成功！

待做完全備以後，將資料庫切換到HACMP的另一節點，修改引數，重啟，匯出均成功後，確定操作有效，判斷正確。

至此為止基本的處理已經告一段落，但是問題的解決要除根！具體是因為哪個環節導致的這個隱患的存在一定要找出來，否則每一次遷移都要承受資料庫當機之苦。

經調查，發現當初遷移的時候，是將引數檔案透過create pfile from spfile生成的，然後複製到另一臺機器上進行的restore pfile。

後來開啟這個pfile一看，裡面只有短短的10幾行，但是在資料庫遷移前後我發現有20多個引數是有區別的，因此判斷是在恢復操作中很多沒有明文宣告的引數可能是按照預設的值設定的，但是這些引數在遷移之前可能是有人為了適應應用環境手動調整過的。

因此，恢復引數檔案的方法是存在問題的，單純的移動到目的地，然後透過start pfile='***'看來是會導致很多引數無法復原到之前的狀態。

在此，我建議引數檔案用Oracle RMAN去恢復，因此在備份的時候，一定要注意備份相關的控制檔案和引數檔案，備份相關的操作演示請參閱：RMAN備份、恢復實驗室之備份篇【備份實驗】

恢復演示如下：

1.啟動例項：

RMAN> startup nomount;

startup failed: ORA-01078: failure in processing system parameters
LRM-00109: could not open parameter file '/oracle/app/oracle/product/11.2.0/db_1/dbs/initorcl.ora'

starting Oracle instance without parameter file for retrieval of spfile
Oracle instance started

Total System Global Area 158662656 bytes

Fixed Size 2211448 bytes
Variable Size 92275080 bytes
Database Buffers 58720256 bytes
Redo Buffers 5455872 bytes

2.設定DBID:

RMAN> set dbid=1347976258

executing command: SET DBID

3.恢復控制檔案：
RMAN> restore controlfile from '/oracle/app/oracle/product/11.2.0/db_1/dbs/c-1347976258-20140318-01';

Starting restore at 18-MAR-14
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=125 device type=DISK

channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:08
output file name=/oracle/app/oradata/ORCL/controlfile/o1_mf_8xm4g27m_.ctl
Finished restore at 18-MAR-14

4.將資料庫置於MOUNT狀態：

RMAN> alter database mount;

database mounted
released channel: ORA_DISK_1

5.檢視備份集資訊：
由於控制檔案中有備份資訊，因此恢復控制檔案後便可看到具體的備份資訊：
List of Backups
===============
Key TY LV S Device Type Completion Time #Pieces #Copies Compressed Tag
------- -- -- - ----------- --------------- ------- ------- ---------- ---
37 B 0 A DISK 02-AUG-13 1 1 NO TAG20130802T054625
38 B 0 A DISK 02-AUG-13 1 1 NO TAG20130802T054625
39 B 1 A DISK 02-AUG-13 1 1 NO TAG20130802T055026
40 B 1 A DISK 02-AUG-13 1 1 NO TAG20130802T055026
41 B 1 A DISK 18-OCT-13 1 1 NO TAG20131018T005620
42 B 1 A DISK 18-OCT-13 1 1 NO TAG20131018T005620
43 B A A DISK 19-NOV-13 1 1 NO TAG20131119T013353
44 B 0 A DISK 19-NOV-13 1 1 NO TAG20131119T013521
45 B 0 A DISK 19-NOV-13 1 1 NO TAG20131119T013521
46 B A A DISK 19-NOV-13 1 1 NO TAG20131119T013801
47 B A A DISK 07-FEB-14 1 1 NO TAG20140207T002846
48 B 1 A DISK 07-FEB-14 1 1 NO TAG20140207T003003
49 B A A DISK 07-FEB-14 1 1 NO TAG20140207T003109
50 B F A DISK 07-FEB-14 1 1 NO TAG20140207T003110
51 B A A DISK 07-FEB-14 1 1 NO TAG20140207T011201
52 B 1 A DISK 07-FEB-14 1 1 NO TAG20140207T011203
53 B A A DISK 07-FEB-14 1 1 NO TAG20140207T011249
54 B F A DISK 07-FEB-14 1 1 NO TAG20140207T011250
55 B A A DISK 07-FEB-14 1 1 NO TAG20140207T013906
56 B 0 A DISK 07-FEB-14 1 1 NO TAG20140207T013908
57 B A A DISK 07-FEB-14 1 1 NO TAG20140207T014045
58 B F A DISK 07-FEB-14 1 1 NO TAG20140207T014046
59 B 1 A DISK 10-MAR-14 1 1 NO ZHANGZICHAO_STUDY
60 B F A DISK 10-MAR-14 1 1 NO TAG20140310T035229
61 B A A DISK 18-MAR-14 1 1 NO TAG20140318T032915
62 B 1 A DISK 18-MAR-14 1 1 NO TAG20140318T033619
63 B A A DISK 18-MAR-14 1 1 NO TAG20140318T034112
64 B F A DISK 18-MAR-14 1 1 NO TAG20140318T034114
65 B F A DISK 18-MAR-14 1 1 NO TAG20140318T034213

具體檢視備份集的方法有很多，請參閱： RMAN備份、恢復實驗室之備份篇【檢視備份情況】

確認備份集都在之後，進行資料庫恢復即可！

6. Restore Database:

RMAN> restore database;

Starting restore at 18-MAR-14
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=133 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to /oracle/app/oradata/ORCL/datafile/o1_mf_system_9h7jv8g1_.dbf
channel ORA_DISK_1: restoring datafile 00002 to /oracle/app/oradata/ORCL/datafile/o1_mf_sysaux_8xm483f7_.dbf
channel ORA_DISK_1: restoring datafile 00003 to /oracle/app/oradata/ORCL/datafile/o1_mf_undotbs1_8xm483gd_.dbf
channel ORA_DISK_1: restoring datafile 00004 to /oracle/app/oradata/ORCL/datafile/o1_mf_users_9ksobds9_.dbf
channel ORA_DISK_1: restoring datafile 00005 to /oracle/app/oradata/ORCL/datafile/o1_mf_example_9h7hct57_.dbf
channel ORA_DISK_1: restoring datafile 00006 to /oracle/app/oradata/catalog_rman/rmantbs01.dbf
channel ORA_DISK_1: reading from backup piece /oracle/app/oracle/product/11.2.0/db_1/dbs/1rp002ls_1_1
channel ORA_DISK_1: piece handle=/oracle/app/oracle/product/11.2.0/db_1/dbs/1rp002ls_1_1 tag=TAG20140207T013908
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:06:39
Finished restore at 18-MAR-14

7. Recoer Database:

RMAN> recover database;

Starting recover at 18-MAR-14
using channel ORA_DISK_1
channel ORA_DISK_1: starting incremental datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
destination for restore of datafile 00002: /oracle/app/oradata/ORCL/datafile/o1_mf_sysaux_9lgs3p0k_.dbf
destination for restore of datafile 00003: /oracle/app/oradata/ORCL/datafile/o1_mf_undotbs1_9lgs3pmp_.dbf
destination for restore of datafile 00004: /oracle/app/oradata/ORCL/datafile/o1_mf_users_9lgs3q3d_.dbf
destination for restore of datafile 00005: /oracle/app/oradata/ORCL/datafile/o1_mf_example_9lgs3p6t_.dbf
destination for restore of datafile 00006: /oracle/app/oradata/catalog_rman/rmantbs01.dbf
channel ORA_DISK_1: reading from backup piece /oracle/app/oracle/product/11.2.0/db_1/dbs/1up2pv13_1_1
channel ORA_DISK_1: piece handle=/oracle/app/oracle/product/11.2.0/db_1/dbs/1up2pv13_1_1 tag=ZHANGZICHAO_STUDY
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:11
channel ORA_DISK_1: starting incremental datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
destination for restore of datafile 00001: /oracle/app/oradata/ORCL/datafile/o1_mf_system_9lgs3ov6_.dbf
destination for restore of datafile 00002: /oracle/app/oradata/ORCL/datafile/o1_mf_sysaux_9lgs3p0k_.dbf
destination for restore of datafile 00003: /oracle/app/oradata/ORCL/datafile/o1_mf_undotbs1_9lgs3pmp_.dbf
destination for restore of datafile 00004: /oracle/app/oradata/ORCL/datafile/o1_mf_users_9lgs3q3d_.dbf
destination for restore of datafile 00005: /oracle/app/oradata/ORCL/datafile/o1_mf_example_9lgs3p6t_.dbf
destination for restore of datafile 00006: /oracle/app/oradata/catalog_rman/rmantbs01.dbf
channel ORA_DISK_1: reading from backup piece /oracle/app/oracle/product/11.2.0/db_1/dbs/21p3f19n_1_1
channel ORA_DISK_1: piece handle=/oracle/app/oracle/product/11.2.0/db_1/dbs/21p3f19n_1_1 tag=TAG20140318T033619
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07

starting media recovery

archived log for thread 1 with sequence 134 is already on disk as file /oracle/app/oradata/ORCL/onlinelog/o1_mf_2_8xm4g6s5_.log
archived log for thread 1 with sequence 135 is already on disk as file /oracle/app/oradata/ORCL/onlinelog/o1_mf_3_8xm4g8gt_.log
archived log file name=/oracle/app/oradata/ORCL/onlinelog/o1_mf_2_8xm4g6s5_.log thread=1 sequence=134
archived log file name=/oracle/app/oradata/ORCL/onlinelog/o1_mf_3_8xm4g8gt_.log thread=1 sequence=135
media recovery complete, elapsed time: 00:00:03
Finished recover at 18-MAR-14

8.開啟資料庫：
RMAN> alter database open resetlogs;

database opened

ORA-04031 故障解決案例詳細分析一則

環境：

現象：

分析：

恢復演示如下：

操作完成！

相關文章