Windows下RAC歸檔滿和ORA-04031問題

(V 1.0)

2012年8月

版本資訊

日期	版本	描述	作者/修改人	備註
2012-8-31	1.0	建立	Jusin Hao

目錄

本文件用於記錄******安裝操作過程。

本文件包含****的安裝操作等內容。

：歸檔滿現象及環境資訊：

# localhost name resolution is handled within DNS itself.

# 127.0.0.1 localhost

# ::1 localhost

10.139.5.13 db01

10.139.5.14 db02

10.139.5.15 db01-vip

10.139.5.16 db02-vip

10.139.5.17 scan

10.10.10.1 db01-pri

10.10.10.2 db02-pri

10.139.5.20 dbbackup

10.139.5.21 appbackup

SQL> select inst_id,instance_name from gv$instance;

INST_ID INSTANCE_NAME

---------- ----------------

1 jchr1

2 jchr2

SQL> show parameter db_recovery

NAME TYPE VALUE

------------------------------------ ----------- -----------------

db_recovery_file_dest string

db_recovery_file_dest_size big integer 20G

SQL> archive log list

資料庫日誌模式存檔模式

自動存檔啟用

存檔終點 +DATA

最早的聯機日誌序列 31075

下一個存檔日誌序列 31076

當前日誌序列 31076

SQL>

SQL> select flashback_on from v$database;

FLASHBACK_ON

------------------

NO

指令碼delete.bat：

rman target / nocatalog CMDFILE 'D:\d-arch\rmanbackup.txt' LOG 'D:\d-arch\rman_backup_%DATE:~0,4%%DATE:~5,2%%DATE:~8,2%".log'

指令碼D:\d-arch\rmanbackup.txt

delete noprompt archivelog ALL COMPLETED BEFORE "sysdate - 6";

檢視指令碼日誌發現指令碼是執行成功的;

但是客戶經常報歸檔空間滿，由上面引數，我們知道其歸檔日誌是存放在+DATA卷組中的（ASM卷組）

：分析：

1、常規的：我們可以手動備份後清除日誌：

1）backup format '/install_source/rman_bak/arch_%d_%U' archivelog all delete input; --刪除已經備份過的歸檔

2）delete noprompt archivelog ALL COMPLETED BEFORE "sysdate - 6";

執行上面這兩個命令會發現發現

3）delete noprompt archivelog all; --這會清除所有歸檔日誌，包括剛新生成的（慎用）

2、檢視ASM卷組狀態

C:\>asmcmd

ASMCMD>

ASMCMD> ls

DATA/

OCR/

ASMCMD> cd data

ASMCMD> ls

JCHR/

ASMCMD> cd jchr

ASMCMD> ls

ARCHIVELOG/

CONTROLFILE/

DATAFILE/

ONLINELOG/

PARAMETERFILE/

STANDBYLOG/

TEMPFILE/

spfilejchr.ora

spfilejchr2.ora

ASMCMD> lsdg

State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files N

ame

MOUNTED EXTERN N 512 4096 1048576 819197 48329 0 48329 0 N D

ATA/

MOUNTED EXTERN N 512 4096 1048576 10237 9841 0 9841 0 N O

CR/

透過上面的資訊，我們可以發現+DATA\jchr\archivelog\下存放的是資料庫的歸檔檔案，但是存在很多2012年和2013年前幾個月的的歸檔日誌檔案，說明以前的歸檔檔案並沒有從ASM卷組上物理刪除；同時，有這些歸檔日誌長期佔用著物理空間，整個ASM卷組的磁碟空間只剩下40多個G（48329M），因此後續的歸檔日誌很容易超出整個空間，導致無法歸檔。為什麼會這樣呢？
Oracle控制檔案以及Oracle RMAN的的備份恢復的原理，Oracle 控制檔案裡邊記錄了資料庫的名字，id,建立的時間戳，以及歸檔記錄和備份資訊。
Oracle RMAN的備份恢復的所有資訊都依賴於：要麼是控制檔案，要麼是恢復目錄(catalog)，因為所有的備份與恢復資訊都會依據備份是的方式儲存到這兩個位置。
理所當然的是，對這兩個東東里的備份集、映象副本、歸檔日誌等等所有能備份的物件的任意操作，首先會參這些物件的記錄的資訊，其次是當被記錄的物件發生變化時做相應的更新。

--檢視已經歸檔日誌的統計：

select name,status,count(*) from gv$archived_log group by name,status order by 1 desc;

從上面的查詢可知，當前的兩個節點其歸檔日誌只有2個，其餘的17630個其NAME都是NULL值。
看看關於檢視v$archived_log中NAME列的解釋：
Archived log file name. If set to NULL, either the log file was cleared before it was archived or an RMAN backup command with the "delete input" option was executed to back up archivelog all (RMAN> backup archivelog all delete input;).
上面的這段話表明當前的這些日誌檔案要麼被手動清除，要麼被rman的delete input選項清除。
其次status列的D欄位也表明了這些個名字為空的歸檔日誌已經被Deleted.也就是說有17630個歸檔日誌已經被刪除了。

--如下檢視查詢，我們可以看到歸檔日誌狀態為D且name為null的，兩個例項上都是8815個：

select inst_id,name,count(*) from gv$archived_log group by inst_id,name order by 1 desc;

--檢視控制檔案已歸檔記錄的相關資訊：

SQL> select * from gv$controlfile_record_section where type='ARCHIVED LOG';

INST_ID TYPE RECORD_SIZE RECORDS_TOTAL RECORDS_USED FIRST_INDEX LAST_INDEX LAST_RECID

----------- ---------- ----------------------------------------------------------------------------

2 ARCHIVED LOG 584 10952 10710 8822 8579 52415

1 ARCHIVED LOG 584 10952 10710 8822 8579 52415

SQL> select count (*) from v$archived_log;

COUNT(*)

----------

10710

RECORDS_TOTAL：Number of records allocated for the section

列RECORDS_TOTAL表明為當前TYPE分配的可儲存的總數，在兩個instance上都為10952條
從最近一次切換日誌的查詢結果可知，被刪除的有8815條，如果下次日誌切換再增加一條往哪裡放呢？

那些已經超出預設保留期的歸檔日誌被覆蓋（在控制檔案中的記錄），即被重用，使用者在控制檔案中儲存ARCHIVED LOG部分的保留時間由誰來決定呢，引數control_file_record_keep_time，預設為7天 ，這意味著7天前的歸檔日誌和備份資訊可能在控制檔案中已經不存在了

SQL> show parameter control

NAME TYPE VALUE

------------------------------------ ----------- ------------------------------

control_file_record_keep_time integer 7

control_files string +DATA/jchr/controlfile/current

.261.793900303, +DATA/jchr/con

trolfile/current.260.793900303

control_management_pack_access string DIAGNOSTIC+TUNING

# 下面的查詢正好表明為什麼2012_10_23和之前的日誌為什麼沒有被刪除
# 因為20130505 12:00:04之後的歸檔日誌記錄在控制檔案中已經被覆蓋了，所以使用delete archivelog all時是根本無法清除之前的日誌的，即對於rman下的delete archivelog all方式不會刪除控制檔案中對應的歸檔日誌資訊（在控制檔案中設定delete狀態的歸檔日誌）

SQL> alter session set nls_date_format='yyyymmdd hh24:mi:ss';

會話已更改。

SQL> select min (FIRST_TIME), min (COMPLETION_TIME), max (FIRST_TIME), max (COMP

LETION_TIME) from v$archived_log;

MIN(FIRST_TIME) MIN(COMPLETION_TI MAX(FIRST_TIME) MAX(COMPLETION_TI

----------------- ----------------- ----------------- -----------------

20130505 10:06:16 20130519 22:18:57 20130519 22:24:57

SQL> select min (FIRST_TIME), min (COMPLETION_TIME), max (FIRST_TIME), max (COMP

LETION_TIME) from gv$archived_log;

MIN(FIRST_TIME) MIN(COMPLETION_TI MAX(FIRST_TIME) MAX(COMPLETION_TI

----------------- ----------------- ----------------- -----------------

20130505 10:06:16 20130505 12:00:04 20130519 22:18:57 20130519 22:24:57

：方案：

如下手工依次刪除之前廢棄的歸檔日誌即可：

ASMCMD> rm -r 2012*

You may delete multiple files and/or directories.

Are you sure? (y/n) y

：ASM INSTANCE 操作報ORA-04031現象

在命令體是否下執行select、crsctl、crs_stat等命令會報錯，如下:

1） ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","unknown object",) (DBD ERROR: error possibly near indicator at char 44 in '/* ASMCMD */ select name, parent_index from v$asm_alias where reference_index=33582893')

2） ("shared pool","unknown object","sga heap(1,0)","kglsim heap") (DBD ERROR: OCIStmtExecute)

3） SQL> show parameter share

ORA-00604: error occurred at recursive SQL level 2

ORA-04031: unable to allocate 4064 bytes of shared memory ("sharedpool","select x.inst_id,x.indx+1,ks...","sga heap(1,0)","kglsim heap")

：分析

--無法檢視ASM的引數檔案路徑：

SQL> show parameter spfile

ORA-00604: error occurred at recursive SQL level 2

ORA-04031: unable to allocate 4064 bytes of shared memory ("shared

pool","select x.inst_id,x.indx+1,ks...","sga heap(1,0)","kglsim heap")

--備份引數檔案

SQL> create pfile='d:\initasm1.ora' from spfile;

File created.

--引數檔案內容：

*._library_cache_advice=FALSE

+asm1.asm_diskgroups='DATA'#Manual Mount

+asm2.asm_diskgroups='DATA'#Manual Mount

*.asm_power_limit=1

*.diagnostic_dest='D:\app\Administrator'

*.instance_type='asm'

*.large_pool_size=12M

*.remote_login_passwordfile='EXCLUSIVE'

--嘗試手動停止個元件：

D:\app\11.2.0\grid\BIN>srvctl.bat status asm

PRCR-1070 : 無法檢查資源 ora.asm 是否已註冊

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>srvctl.bat status asm

PRCR-1070 : 無法檢查資源 ora.asm 是否已註冊

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>crs_stat.exe -t

Errors in file :

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin

dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")

D:\app\11.2.0\grid\BIN>crs_stat.exe -t -n db01

^C

D:\app\11.2.0\grid\BIN>srvctl.bat status asm -n db01

PRCR-1070 : 無法檢查資源 ora.asm 是否已註冊

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>srvctl.bat status nodeapps -n db01

-n 選項已過時。

PRCR-1035 : 無法查詢 null 的 CRS 資源 ora.cluster_vip.type

PRCR-1068 : 無法查詢資源

Cannot communicate with crsdPRCR-1070 : 無法檢查資源 ora.net1.network 是否已註冊

Cannot communicate with crsdPRCR-1070 : 無法檢查資源 ora.gsd 是否已註冊

Cannot communicate with crsdPRCR-1070 : 無法檢查資源 ora.ons 是否已註冊

Cannot communicate with crsdPRCR-1070 : 無法檢查資源 ora.eons 是否已註冊

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>

D:\app\11.2.0\grid\BIN>srvctl.bat status nodeapps

PRCR-1035 : 無法查詢 1 的 CRS 資源 ora.cluster_vip.type

PRCR-1068 : 無法查詢資源

Cannot communicate with crsdPRCR-1070 : 無法檢查資源 ora.net1.network 是否已註冊

Cannot communicate with crsdPRCR-1070 : 無法檢查資源 ora.gsd 是否已註冊

Cannot communicate with crsdPRCR-1070 : 無法檢查資源 ora.ons 是否已註冊

Cannot communicate with crsdPRCR-1070 : 無法檢查資源 ora.eons 是否已註冊

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>srvctl.bat status instance

PRKO-2082 : 缺少必需的選項 -d

D:\app\11.2.0\grid\BIN>srvctl.bat status instance -d jchr

PRKO-3132 : 檢查資料庫 jchr 的例項的狀態需要 -i 選項或 -n 選項

D:\app\11.2.0\grid\BIN>srvctl.bat status instance -d jchr -i jchr1

PRCD-1027 : 無法檢索資料庫 jchr

PRCR-1070 : 無法檢查資源 ora.jchr.db 是否已註冊

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>srvctl.bat stop instance -d jchr -i jchr1

PRCD-1027 : 無法檢索資料庫 jchr

PRCR-1070 : 無法檢查資源 ora.jchr.db 是否已註冊

Cannot communicate with crsd

D:\app\11.2.0\grid\BIN>crsctl check crs

CRS-4638: Oracle High Availability Services 聯機

CRS-4535: 無法與叢集就緒服務通訊

CRS-4529: 叢集同步服務聯機

CRS-4533: 事件管理器聯機

D:\app\11.2.0\grid\BIN>

D:\app\11.2.0\grid\BIN>crs_stat

Errors in file :

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin

dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")

D:\app\11.2.0\grid\BIN>set ORACLE_SID=jchr1

D:\app\11.2.0\grid\BIN>crs_stat

Errors in file :

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin

dbms_diskgroup.read(:h...","sga heap(1,0)","kglsim object batch")

D:\app\11.2.0\grid\BIN>

D:\app\11.2.0\grid\BIN>crs_stat -t

Errors in file :

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","begin

dbms_diskgroup.read(:h...","sga heap(1,0)","")

--檢視asm的alert日誌

D:\app\Administrator\diag\asm\+asm\+asm1\trace\ alert_+asm1.log

Thu Sep 13 15:17:26 2012

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Interface type 1 Private 10.0.0.0 configured from GPnP Profile for use as a cluster interconnect

Interface type 1 Public 10.139.5.0 configured from GPnP Profile for use as a public interface

Shared memory segment for instance monitoring created

Picked latch-free SCN scheme 3

Using LOG_ARCHIVE_DEST_1 parameter default value as D:\app\11.2.0\grid\RDBMS

Autotune of undo retention is turned on.

LICENSE_MAX_USERS = 0

SYS auditing is disabled

Starting up:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Real Application Clusters and Automatic Storage Management options.

Using parameter settings in client-side pfile D:\APP\11.2.0\GRID\DATABASE\INIT+ASM1.ORA on machine DB01

System parameters with non-default values:

large_pool_size = 12M

instance_type = "asm"

remote_login_passwordfile= "EXCLUSIVE"

asm_power_limit = 1

diagnostic_dest = "D:\APP\ADMINISTRATOR"

Cluster communication is configured to use the following interface(s) for this instance

10.10.10.1

cluster interconnect IPC version:Oracle 11 Winsock2 TCP/IP IPC

IPC Vendor 1 proto 1

Version 1.0

Thu Sep 13 15:17:26 2012

PMON started with pid=2, OS id=916

Thu Sep 13 15:17:26 2012

VKTM started with pid=3, OS id=5972 at elevated priority

VKTM running at (10)millisec precision with DBRM quantum (100)ms

--檢視對應的trace ：+asm1_ora_3912.trc

Memory Utilization of Subpool 1

================================

Allocation Name Size

___________________________ ____________

"free memory " 87682648

"miscellaneous " 0

"KJC dest ctx " 1456

"kfcgx heap " 48376

"kfr group ctx " 736

"ksv work msg " 2272

---同樣例項2上面也無法檢視：

D:\app\11.2.0\grid\BIN>set ORACLE_SID=+ASM2

D:\app\11.2.0\grid\BIN>sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on 星期一 5月 20 20:11:38 2013

連線到:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Real Application Clusters and Automatic Storage Management options

SQL> select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shar

d pool' order by bytes desc;

select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shared po

l' order by bytes desc

*

第 1 行出現錯誤:

ORA-00604: error occurred at recursive SQL level 2

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared

pool","select inst_id,'',ksmssnam,k...","sga heap(1,0)","kglsim object batch")

：方案

1、手動修改ASM例項的shared pool，為其指定SGA和shared_pool_size大小；

2、方法：介於Oracle11G+ ASM（ocr在asm上）時ASM例項無法單獨重啟，因此參考RAC_ASM_Practic(ASM例項引數修改).txt 裡的方法修改引數；

參考1：ASM Instance generates ORA-04031 ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim heap" [ID 1450745.1]

Cause

This was researched in Bug 12566932

1. The over allocation occurred in "kglsim allocations" as you can see in these errors

ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim object batch")

ORA-04031: unable to allocate 4064 bytes of shared memory ("shared pool","select name_kfgrp, number_kf...","sga heap(1,0)","kglsim heap")

2. You can also see which compoinent is using the memory by issuing this select statement:

set lines 100
set pages 9999
col mb format 999,999,999
select name, round((bytes/1024/1024),0) MB
from v$sgastat where pool='shared pool'
order by bytes

3. There was memory available at the time of the error which indicates the shared pool is fragmented

Solution

In the Bug the following workaround resolved the errors:

1. Set the following parameter _library_cache_advice=false

alter system set "_library_cache_advice"=false scope=spfile;

2. Bounce the database

如下資料庫的shared pool各元件及空閒大小，由於asm例項執行即報錯，根本無法返回結果；

SET ORACLE_SID=jchr1

SQL> select name, round((bytes/1024/1024),0) MB from v$sgastat where pool='shared pool' order by bytes

NAME MB

-------------------------- ----------

KQR L PO 14

dbwriter coalesce buffer 16

ges resource 18

KCL name table 18

SQLA 19

dbktb: trace buffer 23

ksunfy : SSO free list 26

event statistics per sess 28

gcs shadows 30

ges big msg buffers 30

FileOpenBlock 30

NAME MB

-------------------------- ----------

ges enqueues 31

gcs resources 41

ASH buffers 50

free memory 2914:

已選擇961行。

參考2：How to Start (or stop) 10gR2 or 11gR1 Oracle Clusterware Services Manually in Windows [ID 729512.1]
RAC and Oracle Clusterware Best Practices and Starter Kit (Platform Independent) [ID 810394.1]

1. Try starting crs with "crsctl start crs" command from $CRS_HOME\bin directory

OR

2. Start the following services from Windows 'Services' or by using the Windows command 'net start at the
        command prompt:
            a. If OPMD is present, then start it: Oracle Process Manager Service. In addition to OPMD, start the
                OracleClusterVolumeService if you are using Oracle Cluster File System (OCFS) in your environment.
             b. If OPMD is not present then start the following services in the order provided:
                    OracleObjectService
                    OracleClusterVolumeService (if using OCFS)
                    OracleCSService
                    OracleEVMService
                    OracleCRService

http://cywxzyh.blog.163.com/blog/static/163857248201144112236781/

參考3：ASM & Shared Pool (ORA-4031) [ID 437924.1]

參考4：其他參考：

http://zhang41082.itpub.net/post/7167/463093

查詢了metalink文件，發現果然是Oracle的bug，metalink文件Bug No. 4431215描述了這個問題。如果在ASM上建立的目錄名稱和SID字首相同，則這個目錄下的所有目錄一旦為空，就會被ASM例項自動刪除，不管這個目錄是ASM自動建立，還是使用者手工建立的。

解決這個問題的辦法倒是很多，比如將歸檔目錄設定到日誌檔案的目錄中，或者改變主目錄的名稱使其和ORACLE_SID有所區別。還有更簡單的辦法，在歸檔目錄下複製一個小檔案，使得歸檔被刪除後，歸檔目錄不為空，從而避免路徑被清除。

Oracle在10.2.0.4和11.1.0.6中fixed了這個bug。因此升級或打PATCH也是解決這個問題的一個方法。

http://blog.chinaunix.net/uid-22948773-id-2600822.html

Windows下RAC歸檔滿和ORA-04031問題

：歸檔滿現象及環境資訊：

：分析：

：方案：

：ASM INSTANCE 操作報ORA-04031現象

：分析

：方案

相關文章