CRS-2765: Resource 'ora.crsd' has failed on server
The crsd process could not start: the voting disks were inaccessible, and after the 15-second timeout the diskgroup was dismounted.
ocssd.log
>>>>>>>>>>>>>>>>>>>>>
2016-01-15 10:33:11.620: [ CSSD][1042200320]clssnmSendingThread: sent 5 status msgs to all nodes
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvWorkerThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvWorkerThread not scheduled for 15540 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs
ASM alert.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Fri Jan 15 10:33:20 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Fri Jan 15 10:33:20 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
Fri Jan 15 10:33:21 2016
NOTE: process _b000_+asm1 (11776) initiating offline of disk 0.3915952267 (OCR_0000) with mask 0x7e in group 2
NOTE: process _b000_+asm1 (11776) initiating offline of disk 1.3915952269 (OCR_0001) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
GMON checking disk modes for group 2 at 9 for pid 34, osid 11776
ERROR: no read quorum in group: required 2, found 1 disks
NOTE: checking PST for grp 2 done.
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968b08b, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 2, dsk = 1/0xe968b08d, mask = 0x6a, op = clear
GMON updating disk modes for group 2 at 10 for pid 34, osid 11776
ERROR: no read quorum in group: required 2, found 1 disks
Fri Jan 15 10:33:21 2016
NOTE: cache dismounting (not clean) group 2/0x22584077 (OCR)
NOTE: messaging CKPT to quiesce pins Unix process pid: 11785, image: oracle@cntl119 (B001)
Fri Jan 15 10:33:21 2016
NOTE: halting all I/Os to diskgroup 2 (OCR)
Fri Jan 15 10:33:21 2016
NOTE: LGWR doing non-clean dismount of group 2 (OCR)
NOTE: LGWR sync ABA=4.55 last written ABA 4.55
WARNING: Offline for disk OCR_0000 in mode 0x7f failed.
WARNING: Offline for disk OCR_0001 in mode 0x7f failed.
Fri Jan 15 10:33:21 2016
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
Fri Jan 15 10:33:21 2016
ERROR: ORA-15130 in COD recovery for diskgroup 2/0x22584077 (OCR)
ERROR: ORA-15130 thrown in RBAL for group number 2
Errors in file /grid/base/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_4292.trc:
ORA-15130: diskgroup "OCR" is being dismounted
Fri Jan 15 10:33:21 2016
NOTE: No asm libraries found in the system
ASM Health Checker found 1 new failures
Fri Jan 15 10:33:21 2016
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE
 130 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Fri Jan 15 10:33:21 2016
WARNING: dirty detached from domain 2
NOTE: cache dismounted group 2/0x22584077 (OCR)
NOTE: cache deleting context for group OCR 2/0x22584077
SQL> alter diskgroup OCR dismount force /* ASM SERVER:576209015 */ >>>>>>>>>>>>>>>>>>>>>>>> CRS force-dismounts the diskgroup
GMON dismounting group 2 at 11 for pid 36, osid 11785
NOTE: Disk OCR_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_0002 in mode 0x7f marked for de-assignment
APPLIES TO:
Oracle Database - Standard Edition - Version 11.2.0.1 to 11.2.0.4 [Release 11.2]
Managed Cloud Services Problem Resolution - Version N/A to N/A
Information in this document applies to any platform.
SYMPTOMS
The CRSD process (crsd.bin) aborted due to the OCR location being inaccessible (here +OCR_VOTE).
- <GI_HOME>/log/<node>/alert<node>.log shows the following messages:
[crsd(14307)]CRS-1006:The OCR location +OCR_VOTE is inaccessible. Details in /app/11.2.0/grid_1/log/rac1/crsd/crsd.log.
2013-11-07 06:51:08.465:
[ohasd(12810)]CRS-2765:Resource 'ora.crsd' has failed on server 'rac1'.
2013-11-07 06:51:09.907:
[crsd(31845)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage]. Details at (:CRSD00111:) in /app/11.2.0/grid_1/log/rac1/crsd/crsd.log.
- The log of the grid user agent (<GI_HOME>/log/<node>/agent/crsd/oraagent_grid/oraagent_grid.log) shows that it received a message to stop the ora.OCR_VOTE.dg resource:
2013-11-07 06:37:28.744: [ AGFW][1112471872]{1:58982:16178} Preparing STOP command for: ora.OCR_VOTE.dg rac1 1
2013-11-07 06:37:28.745: [ AGFW][1112471872]{1:58982:16178} ora.OCR_VOTE.dg rac1 1 state changed from: ONLINE to: STOPPING
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] (:CLSN00108:) clsn_agent::stop {
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stop: enter {
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] getResAttrib: attrib name USR_ORA_OPI value true len 4
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] Agent::flagUsrOraOpiIsSet(true) reason not dependency
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stop: tha exit }
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stopSingle status:2 }
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] (:CLSN00108:) clsn_agent::stop }
2013-11-07 06:37:28.747: [ AGFW][1101465920]{1:58982:16178} Command: stop for resource: ora.OCR_VOTE.dg rac1 1 completed with status: SUCCESS
- The CSS daemon log (<GI_HOME>/log/<node>/cssd/ocssd.log) shows that several CSS threads had not been scheduled for several seconds:
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7760 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvWorkerThread not scheduled for 7830 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7830 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 8040 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvWorkerThread not scheduled for 8040 msecs
..
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 60030 msecs
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 59620 msecs
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 60240 msecs
- The ASM alert log shows that the OCR diskgroup was dismounted after the I/O issue:
Thu Nov 07 06:36:28 2013
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
Thu Nov 07 06:36:28 2013
NOTE: process _b000_+asm1 (21285) initiating offline of disk 0.3915951409 (OCR_VOTE01) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (21285) initiating offline of disk 1.3915951410 (OCR_VOTE02) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (21285) initiating offline of disk 2.3915951411 (OCR_VOTE03) with mask 0x7e in group 1
Thu Nov 07 06:36:28 2013
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0xDE385DDC (OCR_VOTE)
SQL> alter diskgroup OCR_VOTE dismount force /* ASM SERVER:3728235996 */
Thu Nov 07 06:37:28 2013
SUCCESS: diskgroup OCR_VOTE was dismounted
SUCCESS: alter diskgroup OCR_VOTE dismount force /* ASM SERVER:3728235996 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group OCR_VOTE
- Running crsctl shows that the CRSD process (crsd.bin) is not running:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
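For reference, the note does not show the exact command that was run; a basic status check such as the following returns these errors while crsd.bin is down:
# crsctl check crs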
CAUSE
The issue was caused by a storage problem: the disks stopped responding for approximately one minute, so the PST disk update exceeded the 15-second limit, the OCR diskgroup was dismounted, and consequently the CRS daemon (crsd.bin) aborted.
- The OS messages log shows that there was a disk issue (the exact message varies depending on the storage used).
- OSW vmstat shows 10 processes waiting on I/O (column b = 10):
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 10 0 50944832 776328 10809588 0 0 1 3 1 1 0 0 100 0 0
0 10 0 50936532 776328 10809628 0 0 0 244 1044 9054 1 1 97 1 0
0 10 0 50939424 776328 10809872 0 0 0 0 1024 6426 0 0 98 2 0
- In this particular case the ASM disks for the +OCR_VOTE diskgroup were mapped to 3 partitions (/dev/sda1 - /dev/sda3) on a single disk device:
# ls -ltr /dev/oracleasm/disks/*
brw-rw---- 1 grid asmadmin 8, 1 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE01 --- major is 8, minor is 1
brw-rw---- 1 grid asmadmin 8, 2 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE02 --- major is 8, minor is 2
brw-rw---- 1 grid asmadmin 8, 3 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE03 --- major is 8, minor is 3
brw-rw---- 1 grid asmadmin 8, 5 Nov 18 05:04 /dev/oracleasm/disks/ASM_DATA
brw-rw---- 1 grid asmadmin 8, 6 Nov 18 05:04 /dev/oracleasm/disks/ASM_BACKUP
# ls -ltr /dev/sda*
brw-r----- 1 root disk 8, 0 Nov 18 05:03 /dev/sda
brw-r----- 1 root disk 8, 1 Nov 18 05:04 /dev/sda1 --- major is 8, minor is 1, used for OCR_VOTE01
brw-r----- 1 root disk 8, 2 Nov 18 05:04 /dev/sda2 --- major is 8, minor is 2, used for OCR_VOTE02
brw-r----- 1 root disk 8, 3 Nov 18 05:04 /dev/sda3 --- major is 8, minor is 3, used for OCR_VOTE03
brw-r----- 1 root disk 8, 4 Nov 18 05:04 /dev/sda4
brw-r----- 1 root disk 8, 5 Nov 18 05:04 /dev/sda5
brw-r----- 1 root disk 8, 6 Nov 18 05:04 /dev/sda6
brw-r----- 1 root disk 8, 7 Nov 18 05:04 /dev/sda7
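To confirm such a mapping, the major:minor numbers of the ASM device files can be compared with those of the raw partitions; a minimal sketch, assuming ASMLib devices under /dev/oracleasm/disks and GNU stat:
# Print name and major:minor (in hex) for each ASM disk and each sda partition;
# matching numbers mean the ASM disks live on partitions of one physical device.
for d in /dev/oracleasm/disks/* /dev/sda*; do
    stat -c '%n major:minor=%t:%T' "$d"
done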
OSW iostat shows that devices /dev/sda1, /dev/sda2 and /dev/sda3 were abnormal for approximately one minute.
Starting at 06:36:18 many I/O requests were stuck in the request queue (avgqu-sz = 16) and %util was 100%,
yet disk throughput was zero (rkB/s = 0 and wkB/s = 0); this indicates that device saturation occurred.
Disk I/O recovered at 06:37:19.
zzz ***Thu Nov 7 06:35:17 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 9.00 10.00 39.00 53.50 9.74 0.03 1.32 1.32 2.50
sda1 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.00 0.00 0.00 0.00
sda2 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.01 2.00 2.00 0.80
sda3 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.00 0.00 0.00 0.00
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 2.00 2.00 32.00 20.00 26.00 0.02 4.00 4.00 1.60
sda6 0.00 0.00 1.00 2.00 4.00 20.00 16.00 0.00 0.33 0.33 0.10
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zzz ***Thu Nov 7 06:36:18 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16.02 0.00 0.00 100.10
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 0.00 0.00 100.10
sda6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 100.10
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zzz ***Thu Nov 7 06:37:19 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 7.00 7.00 19.00 41.50 8.64 0.03 1.86 1.86 2.60
sda1 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.00 0.00 0.00 0.00
sda2 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.01 3.00 3.00 0.90
sda3 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.01 3.00 3.00 0.90
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 1.00 2.00 16.00 20.00 24.00 0.01 2.67 2.67 0.80
sda6 0.00 0.00 0.00 2.00 0.00 20.00 20.00 0.00 0.00 0.00 0.00
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
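To locate such a window quickly in the OSW archive, a small filter over the iostat files can be used; a sketch, assuming GNU awk and an archive path of /u01/oswatcher/archive/oswiostat (adjust both to your installation):
# Print the timestamp of every sample where sda was 100% busy but moved no data
awk '/^zzz/ { ts = $0 }
     $1 == "sda" && $12 >= 100 && $6 == 0 && $7 == 0 { print ts }' \
    /u01/oswatcher/archive/oswiostat/*.dat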
SOLUTION
Contact the system/storage administrator, and possibly the storage vendor, to determine why the disk (shared storage) was not responding for an extended amount of time.
In addition:
- Avoid using multiple partitions of the same physical disk as ASM disks for the voting disks. Ideally each ASM disk should be on a different physical device, or even a different SAN if possible. This may not be achievable with LUNs, as they typically do not map to a single physical disk.
- Make sure the OCR is mirrored onto a second diskgroup, so that the CRSD process can survive the dismount of one of the diskgroups (see the example below).
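A minimal sketch of adding a second OCR location, assuming a spare diskgroup named +OCR2 already exists (run as root):
# ocrconfig -add +OCR2 --- register +OCR2 as an additional OCR location
# ocrcheck --- verify that both OCR locations are reported as intact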
Other ISSUE
Generally, messages of this kind appear in the ASM alert log in the following situation: the ASM PST heartbeat to disks in a normal or high redundancy diskgroup is delayed beyond the limit (15 seconds by default), and the ASM instance therefore dismounts the diskgroup.
Heartbeat delays are essentially ignored for an external redundancy diskgroup: the ASM instance stops issuing further PST heartbeats until PST revalidation succeeds, but the delays do not directly dismount an external redundancy diskgroup.
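Since only normal and high redundancy diskgroups are dismounted this way, it can help to confirm the redundancy type of each group from the ASM instance (TYPE shows EXTERN, NORMAL or HIGH):
SQL> select group_number, name, type from v$asm_diskgroup;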
An ASM disk can become unresponsive, typically in the following scenarios:
+ during path failover in a multipath setup
+ server load, or any sort of storage/multipath/OS maintenance
Document 10109915.8 describes Bug 10109915 (the fix for this bug introduced the underscore parameter). The underlying problem is the lack of any OS/storage tunable timeout mechanism in the case of a hung NFS server/filer; _asm_hbeatiowait provides a way to set the timeout.
SOLUTION
1] Check with the OS and storage administrators whether there is disk unresponsiveness.
2] If possible, keep the disk unresponsiveness below 15 seconds.
This will depend on various factors such as:
+ the operating system
+ the presence of multipathing (and the multipath type)
+ kernel parameters
So you need to find out the 'maximum' possible disk unresponsiveness for your setup.
For example, on AIX the rw_timeout setting affects this; it defaults to 30 seconds.
>>>>>>>>>>>>>>>>>>>>>>>> AIX rw_timeout defaults to 30 seconds; my environment is Linux.
Another example is Linux with native multipathing: there, the number of physical paths and the polling_interval value in the multipath.conf file dictate the maximum disk unresponsiveness.
So for your setup (the combination of OS / multipath / storage), you need to determine this value.
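As an illustration for Linux native multipathing, the worst case is roughly no_path_retry * polling_interval. With the example values in the /etc/multipath.conf excerpt below (illustrative assumptions, not recommendations), a failure of all paths could stall I/O for about 12 * 5 = 60 seconds, well beyond the 15-second PST limit:
defaults {
    polling_interval 5     # path checker interval, in seconds
    no_path_retry 12       # retries before queued I/O is failed
}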
3] If you cannot keep the disk unresponsiveness below 15 seconds, then the following parameter can be set in the ASM instance (on all nodes of the RAC):
_asm_hbeatiowait
Set it to 200.
>>>>>>>>>>>>>>>>>>>>>>>>>>> cf. polling_interval * no_path_retry
Run the following in the ASM instance to set the desired value for _asm_hbeatiowait:
alter system set "_asm_hbeatiowait"=200 scope=spfile sid='*';
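Because scope=spfile is used, the new value takes effect only after the ASM instances are restarted. One standard way to verify it afterwards is the usual hidden-parameter query (connected as SYSASM):
SQL> select x.ksppinm, y.ksppstvl
     from x$ksppi x, x$ksppcv y
     where x.indx = y.indx
     and x.ksppinm = '_asm_hbeatiowait';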
Since there is no way to fix the storage problem for the time being, we increased the I/O wait timeout as a temporary workaround. The root cause remains the storage link.
From the ITPUB blog: http://blog.itpub.net/17086096/viewspace-1978542/. Please credit the source when reposting.