[Oracle] RAC ASM log reports ORA-15078 and CRSD shuts down automatically
System: Red Hat 6.4
Problem description: the CRS disk group was dismounted, so CRS could not start.
CRS check:
$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services <----------
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
$ crsctl stat res -t
CRS-0184: Cannot communicate with the CRS daemon.
Logging in to node1 of this cluster shows that the CRS daemon has already shut down:
$ crsctl check cluster -all
**************************************************************
node1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
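To pick the affected node out of longer `crsctl check cluster -all` output, a small filter can help. This is a minimal sketch and not part of the original procedure: `nodes_without_crsd` is a hypothetical helper name, and it assumes the layout shown above (a `nodeN:` header line followed by that node's CRS-nnnn messages).

```shell
#!/bin/sh
# Read `crsctl check cluster -all` output on stdin and print the name of
# every node that reports CRS-4535 (cannot communicate with CRSD).
# nodes_without_crsd is a hypothetical helper, not an Oracle tool.
nodes_without_crsd() {
    awk '
        # A node header is a single word followed by a colon, e.g. "node1:".
        /^[^ ]+:$/   { node = $0; sub(/:$/, "", node) }
        # CRS-4535 under a header means CRSD is unreachable on that node.
        /^CRS-4535:/ { print node }
    '
}
```

Typical use would be `crsctl check cluster -all | nodes_without_crsd`; empty output means no node reported CRS-4535.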
Trying to start CRS fails:
[root@node1 ~]# /app/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
[root@node1 ~]# /app/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services <<----CRS did not come up
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
As the grid user, check the crsd log:
$ cd $ORACLE_HOME/log/node1/crsd
$ vim crsd.log
-------------------
2013-12-10 15:47:19.902: [ OCRASM][33715952]ASM Error Stack :
2013-12-10 15:47:19.934: [ OCRASM][33715952]proprasmo: kgfoCheckMount returned [6]
2013-12-10 15:47:19.934: [ OCRASM][33715952]proprasmo: The ASM disk group crs is not found or not mounted <-------
2013-12-10 15:47:19.934: [ OCRRAW][33715952]proprioo: Failed to open [+crs]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2013-12-10 15:47:19.934: [ OCRRAW][33715952]proprioo: No OCR/OLR devices are usable
2013-12-10 15:47:19.934: [ OCRASM][33715952]proprasmcl: asmhandle is NULL
2013-12-10 15:47:19.935: [ GIPC][33715952] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5343]
2013-12-10 15:47:19.935: [ default][33715952]clsvactversion:4: Retrieving Active Version from local storage.
2013-12-10 15:47:19.937: [ OCRRAW][33715952]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2013-12-10 15:47:19.938: [ OCRRAW][33715952]proprinit: Could not open raw device
2013-12-10 15:47:19.938: [ OCRASM][33715952]proprasmcl: asmhandle is NULL
2013-12-10 15:47:19.939: [ OCRAPI][33715952]a_init:16!: Backend init unsuccessful : [26]
2013-12-10 15:47:19.939: [ CRSOCR][33715952] OCR context init failure. Error: PROC-26: Error while accessing the physical storage <-------
2013-12-10 15:47:19.939: [ CRSD][33715952] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2013-12-10 15:47:19.939: [ CRSD][33715952][PANIC] CRSD exiting: Could not init OCR, code: 26
2013-12-10 15:47:19.939: [ CRSD][33715952] Done.
---------------------
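The log shows that the problem lies with the CRS disk group: crsd cannot open the OCR in +crs and exits with PROC-26.
Sure enough, the CRS disk group is not mounted: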
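The relevant line is easy to miss in a large crsd.log. As a rough sketch (the helper name `unmounted_ocr_groups` is made up for illustration, and the pattern is taken only from the proprasmo message shown in the excerpt above), the disk group name can be extracted like this:

```shell
#!/bin/sh
# Read crsd.log text on stdin and print each distinct disk group that
# proprasmo reported as "not found or not mounted".
# unmounted_ocr_groups is a hypothetical helper name.
unmounted_ocr_groups() {
    sed -n 's/.*The ASM disk group \([^ ]*\) is not found or not mounted.*/\1/p' |
        sort -u
}
```

For example: `unmounted_ocr_groups < $ORACLE_HOME/log/node1/crsd/crsd.log`, using the log location from this article.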
su - grid
SQL> set linesize 200
SQL> select GROUP_NUMBER,NAME,TYPE,ALLOCATION_UNIT_SIZE,STATE from v$asm_diskgroup;
GROUP_NUMBER NAME TYPE ALLOCATION_UNIT_SIZE STATE
------------ ------------------------------ ------ -------------------- -----------
0 CRS 0 DISMOUNTED<--------
2 DATA1 EXTERN 4194304 MOUNTED
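When there are many disk groups, the query output can be filtered for anything that is not MOUNTED. The sketch below assumes the exact column order of the query above (GROUP_NUMBER first, NAME second, STATE last); `not_mounted_groups` is a hypothetical helper name.

```shell
#!/bin/sh
# Read the output of the v$asm_diskgroup query above on stdin and print
# "NAME STATE" for every disk group whose STATE is not MOUNTED.
# Assumes GROUP_NUMBER is the first column and STATE is the last one.
not_mounted_groups() {
    awk '$1 ~ /^[0-9]+$/ && $NF != "MOUNTED" { print $2, $NF }'
}
```

Header and separator lines are skipped because their first field is not a number.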
The ASM instance alert log shows that the CRS disk group was forcibly dismounted:
SQL> show parameter dump
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
background_core_dump string partial
background_dump_dest string /app/gridbase/diag/asm/+asm/+A
SM1/trace
cd /app/gridbase/diag/asm/+asm/+ASM1/trace
$ vim alert_+ASM1.log
-------------------------------------------
Tue Dec 10 11:13:57 2013
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
Tue Dec 10 11:13:57 2013
NOTE: process _b000_+asm1 (15822) initiating offline of disk 0.3916226472 (CRS_0000) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (15822) initiating offline of disk 1.3916226471 (CRS_0001) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (15822) initiating offline of disk 2.3916226470 (CRS_0002) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 12 for pid 37, osid 15822
ERROR: no read quorum in group: required 2, found 0 disks
NOTE: checking PST for grp 1 done.
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96cdfa8, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 1/0xe96cdfa7, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 2/0xe96cdfa6, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 13 for pid 37, osid 15822
ERROR: no read quorum in group: required 2, found 0 disks
Tue Dec 10 11:13:57 2013
NOTE: cache dismounting (not clean) group 1/0x165C2F6D (CRS)
WARNING: Offline for disk CRS_0000 in mode 0x7f failed.
WARNING: Offline for disk CRS_0001 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 15824, image: oracle@node1 (B001)
WARNING: Offline for disk CRS_0002 in mode 0x7f failed.
Tue Dec 10 11:13:57 2013
NOTE: halting all I/Os to diskgroup 1 (CRS)
Tue Dec 10 11:13:57 2013
NOTE: LGWR doing non-clean dismount of group 1 (CRS)
NOTE: LGWR sync ABA=3.42 last written ABA 3.42
Tue Dec 10 11:13:57 2013
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
Tue Dec 10 11:13:57 2013
List of instances:
1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
Tue Dec 10 11:13:57 2013
NOTE: No asm libraries found in the system
520 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Tue Dec 10 11:13:57 2013
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0x165C2F6D (CRS)
SQL> alter diskgroup CRS dismount force /* ASM SERVER:375140205 */ <------------CRS was forcibly dismounted---
Tue Dec 10 11:13:57 2013
NOTE: cache deleting context for group CRS 1/0x165c2f6d
GMON dismounting group 1 at 14 for pid 41, osid 15824
NOTE: Disk CRS_0000 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0001 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0002 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 1
Tue Dec 10 11:14:27 2013
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 1
Tue Dec 10 11:14:29 2013
ASM Health Checker found 1 new failures
Tue Dec 10 11:14:57 2013
SUCCESS: diskgroup CRS was dismounted
SUCCESS: alter diskgroup CRS dismount force /* ASM SERVER:375140205 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group CRS
--------------------------------------
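The dismount is preceded by a burst of "Waited 15 secs for write IO to PST" warnings. To see how many such warnings each disk group accumulated, the alert log can be summarized as below. This is only a sketch keyed to the message format in the excerpt above; `pst_write_waits` is a hypothetical helper name.

```shell
#!/bin/sh
# Read an ASM alert log on stdin and count, per disk group number, the
# "Waited N secs for write IO to PST disk ..." warnings.
# pst_write_waits is a hypothetical helper name.
pst_write_waits() {
    awk '
        /^WARNING: Waited [0-9]+ secs for write IO to PST disk/ {
            grp = $NF               # last field is the group number, e.g. "1."
            sub(/\.$/, "", grp)     # strip the trailing period
            count[grp]++
        }
        END {
            for (g in count)
                print "group " g ": " count[g] " PST write-wait warnings"
        }
    ' | sort
}
```

For example: `pst_write_waits < alert_+ASM1.log`. The group number can then be matched against GROUP_NUMBER in v$asm_diskgroup while the group is still mounted.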
Mount the CRS disk group:
su - grid
sqlplus / as sysasm --!!! must connect as sysasm
SQL> alter diskgroup crs mount;
SQL> select GROUP_NUMBER,NAME,TYPE,ALLOCATION_UNIT_SIZE,STATE from v$asm_diskgroup;
GROUP_NUMBER NAME TYPE ALLOCATION_UNIT_SIZE STATE
------------ ------------------------------ ------ -------------------- -----------
1 CRS NORMAL 4194304 MOUNTED
2 DATA1 EXTERN 4194304 MOUNTED
Start CRS
The usual start crs command fails, because Oracle High Availability Services is already running:
# /app/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
Starting the ora.crsd resource directly succeeds:
[root@node1 ~]# /app/grid/bin/crsctl start res ora.crsd -init
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded
# /app/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online <------------------
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
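The recovery steps above can be collected into one script. The following is a sketch under several assumptions, not a tested runbook: it is run as root, the grid user's login profile sets the ASM environment for sqlplus, the Grid home is /app/grid as in this article, and the helper names `run` and `recover_crsd` are made up. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the recovery sequence from this article. Hypothetical helpers;
# GRID_HOME matches the path used above and may differ on other systems.
GRID_HOME=${GRID_HOME:-/app/grid}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"      # printed for reading; quoting is not preserved
    else
        "$@"
    fi
}

recover_crsd() {
    # 1. Mount the CRS disk group as the grid user, connected as sysasm.
    run su - grid -c "echo 'alter diskgroup crs mount;' | sqlplus -S / as sysasm"
    # 2. OHAS is already up, so start only the ora.crsd init resource.
    run "$GRID_HOME/bin/crsctl" start res ora.crsd -init
    # 3. Confirm that Cluster Ready Services is online again.
    run "$GRID_HOME/bin/crsctl" check crs
}
```

Calling `recover_crsd` prints the three commands; setting `DRY_RUN=0` first would execute them.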
Resolution path:
crsd log --> ASM instance alert log --> mount CRS disk group --> start crsd (crsctl start res ora.crsd -init)
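The solution above is adapted from a post by 十字螺絲釘 on chinaunix; please let me know of any infringement.
The ASM log shows that the disks were dismounted because the CRS disk group could not be found, but the system and storage engineers found no related problem. I will add an update here if the root cause is identified later.
Supplementary material:
ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1)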
APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.3 to 12.1.0.2 [Release 11.2 to 12.1]
Oracle Database - Enterprise Edition - Version 10.2.0.4 to 10.2.0.4 [Release 10.2]
Information in this document applies to any platform.
SYMPTOMS
Normal or high redundancy diskgroup is dismounted with these WARNING messages.
//ASM alert.log
Mon Jul 01 09:10:47 2013
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
....
GMON dismounting group 6 at 72 for pid 44, osid 8782162
CAUSE
The ASM disk could go into unresponsiveness, normally in the following scenarios:
+ During path 'failover' in a multipath set up
+ Server load, or any sort of storage/multipath/OS maintenance
Doc ID 10109915.8 describes Bug 10109915, whose fix introduced the underscore parameter _asm_hbeatiowait. The underlying issue is that there is no OS/storage tunable timeout mechanism in the case of a hung NFS server/filer; _asm_hbeatiowait allows that timeout to be set.
SOLUTION
1] Check with OS and Storage admin that there is disk unresponsiveness.
2] Possibly keep the disk responsiveness to below 15 seconds.
This will depend on various factors like
+ Operating System
+ Presence of Multipath ( and Multipath Type )
+ Any kernel parameter
So you need to find out, what is the 'maximum' possible disk unresponsiveness for your set up.
For example, on AIX rw_timeout setting affects this and defaults to 30 seconds.
Another example is Linux with native multipathing. In such set up, number of physical paths and polling_interval value in multipath.conf file, will dictate this maximum disk unresponsiveness.
So for your set up ( combination of OS / multipath / storage ), you need to find out this.
3] If you can not keep the disk unresponsiveness to below 15 seconds, then the below parameter can be set in the ASM instance ( on all the Nodes of RAC ):
_asm_hbeatiowait
As per internal bug 17274537 , based on internal testing the value should be increased to 120 secs, which is fixed in 12.1.0.2
Run below in asm instance to set desired value for _asm_hbeatiowait
alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';
And then restart asm instance / crs, to take new parameter value in effect.
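Because the parameter name needs double quotes and the sid needs single quotes, the statement is easy to mistype on a command line. As an illustration only, the hypothetical helper below prints the statement from step 3 for a given number of seconds and rejects non-numeric input. The value to use is a site decision; 120 is simply the figure mentioned in the note above.

```shell
#!/bin/sh
# Print the ALTER SYSTEM statement from step 3 for a given timeout.
# hbeatiowait_sql is a hypothetical helper; it only builds text and does
# not connect to ASM.
hbeatiowait_sql() {
    case $1 in
        ''|*[!0-9]*)
            echo "usage: hbeatiowait_sql <seconds>" >&2
            return 1
            ;;
    esac
    printf "alter system set \"_asm_hbeatiowait\"=%s scope=spfile sid='*';\n" "$1"
}
```

The printed statement can be pasted into a `sqlplus / as sysasm` session. As the note says, the new value takes effect only after ASM/CRS is restarted.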
Source: ITPUB blog, http://blog.itpub.net/30327022/viewspace-2132540/. Please credit the source when reposting; otherwise legal action may be pursued.