透過DBCA在伺服器上建立一個額外的例項，結束後以外發現，原有的資料庫例項出現了異常。

簡單描述一下系統環境，Oracle 11.2.0.2 for Linux x86-64，隨後又安裝了GRID，建立了ASM磁碟組。原有的資料庫有部分表空間和REDO存放在ASM磁碟組中。

為了測試，透過DBCA建立了額外的例項，事實上，透過DBCA又先後建立了兩個資料庫例項，第二個資料庫例項建立後一切正常，將第二個例項關閉，透過DBCA建立第三個資料庫例項時，導致伺服器上的第一個例項出現了異常。

檢查例項1上的告警日誌，可以發現大量下面的錯誤資訊：

2011-08-30 10:20:54.502000 +08:00
Process W000 died, see its trace file
2011-08-30 10:20:58.518000 +08:00
Process W000 died, see its trace file
2011-08-30 10:21:02.524000 +08:00
Process W000 died, see its trace file
2011-08-30 10:21:06.527000 +08:00
Process W000 died, see its trace file
Errors in file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_j000_19030.trc:
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1000 (oinstall), current egid = 1200 (dba)
2011-08-30 10:21:07.536000 +08:00
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_cjq0_12308.trc:
Errors in file /u01/app/oracle/diag/rdbms/fhacdb1/fhacdb1/trace/fhacdb1_j000_19036.trc:
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1000 (oinstall), current egid = 1200 (dba)

資料庫並非簡單的報錯而已，實際上即使是SQLPLUS / AS SYSDBA登入資料庫，進行任何操作也會顯示沒有連線到資料庫，因此對於例項1來說，只能執行SHUTDOWN ABORT關閉例項並重新啟動。

查詢METALINK發現不少類似的情況，導致這個錯誤產生的原因就是ORA_CRS_HOME/bin下的oracle可執行檔案或者ORACLE_HOME/bin下的oracle可執行檔案發生修改所致。雖然從隨後的ORA-2730X資訊中無法判斷是GRID使用者下還是ORACLE使用者下的oracle檔案被修改，但是可以確定的是，oracle可執行檔案的組資訊被修改，原本應該是oinstall，而現在變成了dba。

分別檢查ORA_CRS_HOME/bin和ORACLE_HOME/bin下的oracle檔案，發現是ORACLE目錄下的ORACLE_HOME/bin/oracle發生了改變：

[oracle@dbserver1 ~]$ cd $ORACLE_HOME/bin
[oracle@dbserver1 bin]$ ls -l oracle
-rwsr-s--x 1 oracle dba 228881982 Jul 26 16:55 oracle
[oracle@dbserver1 bin]$ ls -l |grep dba
-rwsr-s--x 1 oracle dba      228881982 Jul 26 16:55 oracle
-rwxr-xr-x 1 oracle oinstall     32372 Jul 26 16:55 osdbagrp
-rwxr-xr-x 1 oracle oinstall     35677 Aug 20 2010 osdbagrp0

可以看到，只有oracle檔案變成了dba組，bin目錄下所有其他檔案仍然是oinstall組。為了對比，在另外一個伺服器上，有一臺同時搭建的相同環境的oracle，檢查oracle檔案的組資訊：

[oracle@dbserver2 ~]$ cd $ORACLE_HOME/bin
[oracle@dbserver2 bin]$ ls -l oracle
-rwsr-s--x. 1 oracle oinstall 217733745 Apr 27 14:02 oracle
[oracle@dbserver2 bin]$ ls -l |grep dba
-rwxr-xr-x. 1 oracle oinstall 32372 Apr 27 14:02 osdbagrp
-rwxr-xr-x. 1 oracle oinstall 35677 Aug 20 2010 osdbagrp0

在這個伺服器上，oracle檔案的組屬性仍然是oinstall，這說明節點1上的oracle組屬性確實被修改了。

前面提到了，利用DBCA新增第二個例項的時候，資料庫並未出現問題，而新增第三個例項的時候，才導致oracle組資訊變化，那麼同樣是dbca的操作，為什麼有所區別呢。對比第二個和第三個例項的區別，問題應該和是否使用ASM有關。第二個例項的所有檔案都儲存在本地硬碟，而第三個例項的所有檔案都儲存在ASM磁碟組中，甚至會和例項1共用一些磁碟組，應該就是這個差別導致了問題的產生。

檢查ASM告警日誌：

2011-08-30 10:21:08.853000 +08:00
NOTE: client al32utf8:al32utf8 registered, osid 19095, mbr 0x1
2011-08-30 10:53:27.195000 +08:00
NOTE: client al32utf8:al32utf8 deregistered
2011-08-30 10:53:31.248000 +08:00
Starting background process ASMB
ASMB started with pid=42, OS id=23251
NOTE: client +ASM:+ASM registered, osid 23253, mbr 0x1
2011-08-30 10:53:36.804000 +08:00
NOTE: ASMB process exiting due to lack of ASM file activity for 5 seconds
2011-08-30 10:53:51.372000 +08:00
NOTE: client al32utf8:al32utf8 registered, osid 23441, mbr 0x1

可以看到，就是第三個例項註冊到ASM的時刻，在例項1上出現了錯誤。應該就是這個時刻，oracle檔案的組資訊發生了變化。

檢查一下例項3上的告警日誌：

2011-08-30 10:20:53.722000 +08:00
Adjusting the default value of parameter parallel_max_servers
from 960 to 135 due to the value of parameter processes (150)
Starting ORACLE instance (normal)
2011-08-30 10:20:55.149000 +08:00
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Shared memory segment for instance monitoring created
Picked latch-free SCN scheme 3
2011-08-30 10:20:56.823000 +08:00
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/11.2.0/dbhome_1/dbs/arch
Autotune of undo retention is turned on.
IMODE=BR
ILAT =29
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
Using parameter settings in client-side pfile /u01/app/oracle/admin/al32utf8/pfile/init.ora on machine dbserver1
System parameters with non-default values:
processes = 150
memory_target = 6432M
db_block_size = 8192
compatible = "11.2.0.0.0"
db_create_file_dest = "+DATA"
undo_tablespace = "UNDOTBS1"
remote_login_passwordfile= "EXCLUSIVE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=al32utf8XDB)"
local_listener = "LISTENER_AL32UTF8"
audit_file_dest = "/u01/app/oracle/admin/al32utf8/adump"
audit_trail = "DB"
db_name = "al32utf8"
open_cursors = 300
diagnostic_dest = "/u01/app/oracle"
2011-08-30 10:21:06.647000 +08:00
PMON started with pid=2, OS id=19032
PSP0 started with pid=3, OS id=19034
2011-08-30 10:21:07.721000 +08:00
VKTM started with pid=4, OS id=19038 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
GEN0 started with pid=5, OS id=19042
DIAG started with pid=6, OS id=19044
DBRM started with pid=7, OS id=19046
DIA0 started with pid=8, OS id=19048
MMAN started with pid=9, OS id=19050
DBW0 started with pid=10, OS id=19052
DBW1 started with pid=11, OS id=19054
DBW2 started with pid=12, OS id=19056
LGWR started with pid=13, OS id=19058
CKPT started with pid=14, OS id=19060
SMON started with pid=15, OS id=19062
RECO started with pid=16, OS id=19064
MMON started with pid=17, OS id=19066
MMNL started with pid=18, OS id=19068
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
starting up 1 shared server(s) ...
ORACLE_BASE from environment = /u01/app/oracle
CREATE DATABASE "al32utf8"
MAXINSTANCES 8
MAXLOGHISTORY 1
MAXLOGFILES 16
MAXLOGMEMBERS 3
MAXDATAFILES 100
DATAFILE SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
EXTENT MANAGEMENT LOCAL
SYSAUX DATAFILE SIZE 2048M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
SMALLFILE DEFAULT TEMPORARY TABLESPACE TEMP TEMPFILE SIZE 8192M AUTOEXTEND ON NEXT 640K MAXSIZE UNLIMITED
SMALLFILE UNDO TABLESPACE "UNDOTBS1" DATAFILE SIZE 8192M AUTOEXTEND ON NEXT 5120K MAXSIZE UNLIMITED
CHARACTER SET AL32UTF8
NATIONAL CHARACTER SET AL16UTF16
LOGFILE GROUP 1 SIZE 1024M,
GROUP 2 SIZE 1024M,
GROUP 3 SIZE 1024M
USER SYS IDENTIFIED BY *USER SYSTEM IDENTIFIED BY
Starting background process ASMB
ASMB started with pid=22, OS id=19092
2011-08-30 10:21:08.901000 +08:00
Starting background process RBAL
RBAL started with pid=23, OS id=19105
NOTE: initiating MARK startup
Starting background process MARK
MARK started with pid=24, OS id=19108
NOTE: MARK has subscribed
2011-08-30 10:21:12.325000 +08:00
NOTE: Loaded library: System
SUCCESS: diskgroup DATA was mounted
Database mounted in Exclusive Mode
Lost write protection disabled
ERROR: failed to establish dependency between database al32utf8 and diskgroup resource ora.DATA.dg
2011-08-30 10:21:36.862000 +08:00
Successful mount of redo thread 1, with mount id 3103441300
Assigning activation ID 3103441300 (0xb8fac194)
Thread 1 opened at log sequence 1
Current log# 1 seq# 1 mem# 0: +DATA/al32utf8/onlinelog/group_1.265.760530073
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
processing ?/rdbms/admin/dcore.bsq
create tablespace SYSTEM datafile SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED

EXTENT MANAGEMENT LOCAL online
2011-08-30 10:21:44.431000 +08:00
Completed: create tablespace SYSTEM datafile SIZE 1024M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED

可以確認例項1出現問題的時刻，正是例項3啟動並在ASM磁碟組中建立表空間的時刻。

瞭解了問題的原委後，這個問題並不難解決，只需要重啟例項1，就可以解決了。但是這個問題對於包含ASM的產品環境而言，存在很大的隱患，使用DBCA建立新庫，會影響現有正在執行的資料庫，這是一個埋藏比較深，且有可能造成嚴重後果的bug。

這個錯誤應該只會影響11.2以上版本，11.1以前的ASM在ORACLE_HOME目錄下，應該不會導致這個問題的產生。

DBCA建庫導致已有資料庫出現ORA-27140錯誤

相關文章