mod_rele_dev_by_major: Unheld driver: major number
事後,主機工程師分析DUMP後認為是ORACLE訪問檔案出現了異常,導致主機當機
這也太神奇了,為了撇清責任,只好開始研究了下這個案例
1. 檢查alert日誌,發現資料庫沒有任何異常,直接是由於主機當機導致資料庫當機
Wed Jun 12 02:12:58 2013
Thread 1 advanced to log sequence 114384 (LGWR switch)
Current log# 1 seq# 114384 mem# 0: /dev/md/oracle_dg/rdsk/d315
Current log# 1 seq# 114384 mem# 1: /dev/md/oracle_dg/rdsk/d316
Wed Jun 12 02:19:43 2013
Thread 1 advanced to log sequence 114385 (LGWR switch)
Current log# 6 seq# 114385 mem# 0: /dev/md/oracle_dg/rdsk/d325
Current log# 6 seq# 114385 mem# 1: /dev/md/oracle_dg/rdsk/d326
Wed Jun 12 02:26:21 2013
Thread 1 advanced to log sequence 114386 (LGWR switch)
Current log# 2 seq# 114386 mem# 0: /dev/md/oracle_dg/rdsk/d317
Current log# 2 seq# 114386 mem# 1: /dev/md/oracle_dg/rdsk/d318
Wed Jun 12 09:50:58 2013
Starting ORACLE instance (normal)
Wed Jun 12 09:50:58 2013
Specified value of sga_max_size is too small, bumping to 17582522368
Wed Jun 12 09:51:29 2013
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned off.
2. 檢查後臺程式,使用者程式的 trc檔案,未發現任何程式有異常報告
3. 檢查系統日誌
a) 節點在02:27:30發生panic
b) 節點發生panic時,CPU2存在異常
c) 發生異常的函式為mod_rele_dev_by_major,異常的裝置類別ID為85
Jun 12 02:27:30 bimsnewdb1 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Jun 12 02:27:30 bimsnewdb1 unix: [ID 836849 kern.notice]
Jun 12 02:27:30 bimsnewdb1 ^Mpanic[cpu2]/thread=3008b8a41e0:
Jun 12 02:27:30 bimsnewdb1 unix: [ID 644039 kern.notice] mod_rele_dev_by_major: Unheld driver: major number <85>
Jun 12 02:27:30 bimsnewdb1 unix: [ID 100000 kern.notice]
4. ID為85的裝置類別為md,本系統就是使用軟RAID後的磁碟做成磁碟卷,然後劃分為邏輯盤,提供給資料庫用作裸裝置
oracle@bimsnewdb1 $ cat /etc/name_to_major |grep " 85$"
md 85
5. 分析主機工程師從dump中解析出的act.5檔案。故障發生時,CPU2上的程式為6278,其是一個Oracle資料庫的伺服器端程式,其是凌晨1點開始的RMAN備份的一個遞迴子程式
6. 檢視該RMAN的日誌,發現RMAN任務是直接終止,RMAN無任何報錯。也就是說,RMAN在指定過程中沒有遇到如何報錯,也不可能主動重啟節點
通道 c2: 正在啟動段 1 於 2013-06-12 02:18:53
通道 c2: 已完成段 1 於 2013-06-12 02:24:18
段控制程式碼=/dbbak/rmandbbak/backupset/lev1_db_NEWBIMS_20130612_t817870733_s65165_p1 標記=LEVEL_1 註釋=NONE
通道 c2: 備份集已完成, 經過時間:00:05:25
通道 c2: 啟動增量級別 1 資料檔案備份集
通道 c2: 正在指定備份集中的資料檔案
輸入資料檔案 fno=00035 name=/dev/md/oracle_dg/rdsk/d392
輸入資料檔案 fno=00064 name=/dev/md/oracle_dg/rdsk/d426
輸入資料檔案 fno=00093 name=/dev/md/oracle_dg/rdsk/d455
輸入資料檔案 fno=00122 name=/dev/md/oracle_dg/rdsk/d484
輸入資料檔案 fno=00151 name=/dev/md/oracle_dg/rdsk/d513
輸入資料檔案 fno=00180 name=/dev/md/oracle_dg/rdsk/d542
輸入資料檔案 fno=00209 name=/dev/md/oracle_dg/rdsk/d573
通道 c2: 正在啟動段 1 於 2013-06-12 02:24:20
通道 c1: 已完成段 1 於 2013-06-12 02:24:23
段控制程式碼=/dbbak/rmandbbak/backupset/lev1_db_NEWBIMS_20130612_t817870732_s65164_p1 標記=LEVEL_1 註釋=NONE
通道 c1: 備份集已完成, 經過時間:00:05:31
-----------------------------------------------------------------------------
++++++++++++++++++ Begin RMAN Level 1 DB Backup ++++++++++++++++++ Thu Jun 13 01:00:00 CST 2013
-----------------------------------------------------------------------------
7. 在act.5中分析該程式的堆疊
CPU 0x2 is RUNNING READY CONFIGURED ENABLED ### CPU=0x2 (2) ###
###
PANIC occurred on this CPU
a) 在CPU2上發生了PANIC
cpu addr 0x3000659e000
running thread addr 3008b8a41e0
pause thread addr 2a100c87ca0
dispatched thread addr 3008b8a41e0
interrupt stack is 2a100c8ff50
interrupt thread list starts at 2a100c97ca0
in_prom 0
PANIC occurred on this thread
b) 發生PANIC的thread地址為3008b8a41e0,接下來會dump該執行緒資訊
***
process id 6278 is
oraclenewbims (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
, parent process is 6227
uid is 0x64 0t100, gid is 0x65 0t101
thread addr 3008b8a41e0, proc addr 3005f5784c0, lwp addr 30041f048d8
Thread bound to cpu id 0x2
t_state is 0x4 - TS_ONPROC
Scheduling info:
t_pri is 0x3c, t_epri is 0, t_cid is 0x1
scheduling class is: TS
t_disp_time: is 0x11fa96bfb, 0t4826164219
last switched: 6 mins 44 secs ago on cpu 0x2
pc is 1071ef0, sp is 187bbb0, t_stk 2a10a745ae0
stack trace is:
unix: setjmp (?)
unix: panicsys+0x48 (0x3008b8a41e0,0x30060406e80,kmem_bigtsb_default_arena+0x138
,page_retire_kstat+0x210)
unix: vpanic+0xcc (0x127cc30,0x2a10a744eb0)
genunix: cmn_err+0x98 (3,
"mod_rele_dev_by_major: Unheld driver: major number ",0x55)
genunix: mod_rele_dev_by_major+0x88 (0x55)
genunix: dev_to_instance+0x7c (0x55)
genunix:e_ddi_hold_devi_by_dev +0x14 (0x550000212d,0)
specfs: spec_open+0x84 (0x2a10a745278,0x2001,0x30060406e80)
genunix: fop_open+0x78 (0x2a10a745278,0x2001,0x30060406e80)
pxfs:
__1cMio_repl_implEopen6MinH_A_out_4nHpxfs_v1Efobj_Cpn0C___rn0BJfobj_info_pnGsolo
bjEcred_rnFCORBALEnvironment__v_+0xd0 (0x3003781c540,0x2001,0x2a10a745340,
0x2a10a745610,0x3002ed71688,0x2a10a745680)
pxfs:
__1cNdevice_serverXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsolobjEcred_n
H_A_out_4n0BEfobj_Cpn0F___rn0BJfobj_info_rnFCORBALEnvironment__v_+0x1fc (
0x600380b28f0,0x2a10a7456e8,0x2001,0x3002ed71688,0x2a10a745410,0x2a10a745610,
0x2a10a745680)
pxfs:
__1cXdevice_server_repl_implXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsol
objEcred_nH_A_out_4n0BEfobj_Cpn0F___rn0BJfobj_info_rnFCORBALEnvironment__v_+0x28
(0x600380b28e8,0x2a10a7456e8,0x2001,?,0x2a10a7454e0,?,0x2a10a745680)
cl_dcs:
__1cDmdcSdevice_server_stubXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsolo
bjEcred_nH_A_out_4n0CEfobj_Cpn0G___rn0CJfobj_info_rnFCORBALEnvironment__v_+0xe0
(0x6004716a808,0x2a10a7456e8,0x2001,0x3002ed71688,0x2a10a745670,0x2a10a745610,
0x2a10a745680)
pxfs: __1cJpxspecialEopen6MppnFvnode_ipnEcred__i_+0x368 (0x3008374fcc8,
0x2a10a745930,0x2001,0x30060406e80)
genunix: fop_open+0x78 (0x2a10a745930,0x2001,0x30060406e80)
genunix: vn_openat+0x500 (0x8039830f0,UIO_USERSPACE,0x2001,0xc00,0x2a10a745a98,
CRCREAT,0x12,0)
genunix: copen+0x260 (0x12?,0x8039830f0,0x2001?,0x7ffffc00)
unix: syscall_trap+0xac ()
c) 這裡是最重要的程式發生問題時的堆疊資訊
Ø 發生錯誤的函式為mod_rele_dev_by_major
Ø 從pxfs部分來看,故障原因和vnode資源有關
Ø 從較高階的函式來看,copen,vn_openat,fop_open,這是在開啟檔案時發生的異常
Ø Oracle程式呼叫底層函式報錯,不可能是Oracle的問題,而是底層函式呼叫上存在問題
8. 查詢SUN的資料,發現:mod_rele_dev_by_major函式的作用是,根據引數major,在devnamesp中執行查詢。找到後在資源上執行操作。devnamesp對應的結構為dev_ops,該結構為:
struct dev_ops {
int devo_rev
int devo_refcnt
(function returning) int *devo_getinfo
(function returning) int *devo_identify
(function returning) int *devo_probe
(function returning) int *devo_attach
(function returning) int *devo_detach
(function returning) int *devo_reset
struct cb_ops *devo_cb_ops
struct bus_ops *devo_bus_ops
(function returning) int *devo_power
}
9. 在mod_rele_dev_by_major函式中,存在這樣一段程式碼:
#define DEV_OPS_HELD(opsp) ((opsp)->devo_refcnt > 0)
if (!DEV_OPS_HELD(ops)) {
cmn_err(CE_PANIC,
"mod_rele_dev_by_major: Unheld driver: major number ",
(uint_t)major);
}
該段程式碼的意思為:
a) dev_ops結構中的devo_refcnt為reference count,是一個計數值
b) 當devo_refcnt小於0是,將發起panic。
c) devo_refcnt最大值為0x7fffffff,一旦超過該值,將被解析為負
10. 在中心庫主機上用如下dtrace指令碼監控devo_refcnt值
$ cat devbymajor.d
#!/usr/sbin/dtrace -qs
fbt::mod_rele_dev_by_major:entry
{
@modcnt[arg0,"dec"] = count() ;
@ktest[arg0,"dec",execname,stack()] = count() ;
}
fbt::mod_hold_dev_by_major:entry
{
@modcnt[arg0,"inc"] = count() ;
@ktest[arg0,"inc",execname,stack()] = count() ;
}
dtrace:::END
{
printa("%d\t%s\t=\t%@d\n",@modcnt);
printf("-x-x-x- Stacks and counts -x-x-x-\n");
printa("%d\t%s\t%@d\t%s\t%k\n",@ktest);
}
# ./devbymajor.d > devbymajor.txt
11. 指令碼在newbims主機中執行15分鐘後,獲得結果發現(擷取部分,詳見附錄)
32 dec = 3
32 inc = 3
118 dec = 4
118 inc = 4
85 inc = 13664
85 inc 6832 oracle
genunix`ddi_hold_installed_driver+0x4
pxfs`__1cNdevice_serverLmake_specvp6FrknHpxfs_v1Gpvnode_pnEcred__pnFvnode__+0x4c
pxfs`__1cNdevice_serverXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsolobjEcred_nH_A_out_4n0BEfobj_Cpn0F___rn0BJfobj_info_rnFCORBALEnvironment__v_+0x10c
pxfs`__1cXdevice_server_repl_implXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsolobjEcred_nH_A_out_4n0BEfobj_Cpn0F___rn0BJfobj_info_rnFCORBALEnvironment__v_+0x28
cl_dcs`__1cDmdcSdevice_server_stubXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsolobjEcred_nH_A_out_4n0CEfobj_Cpn0G___rn0CJfobj_info_rnFCORBALEnvironment__v_+0xe0
pxfs`__1cJpxspecialEopen6MppnFvnode_ipnEcred__i_+0x368
genunix`fop_open+0x78
genunix`vn_openat+0x500
genunix`copen+0x260
unix`syscall_trap+0xac
85 inc 6832 oracle
genunix`ddi_hold_installed_driver+0x4
pxfs`__1cJpxspecialEopen6MppnFvnode_ipnEcred__i_+0x5c
genunix`fop_open+0x78
genunix`vn_openat+0x500
genunix`copen+0x260
unix`syscall_trap+0xac
a) 裝置種類85的devo_refcnt增加了13664,減少了0,資源存在洩露
b) 其他種類資源獲取釋放正常
c) 呼叫mod_rele_dev_by_major函式的函式有:
Ø e_ddi_hold_devi_by_dev:可增可減
Ø ddi_hold_installed_driver:只增
d) 系統呼叫了ddi_hold_installed_driver函式增加了devo_refcnt,那必須通過呼叫ddi_rele_driver來減少devo_refcnt
e) 從跟蹤來看,並未發現對ddi_rele_driver的呼叫
f) 根據SUN的說法,在開發文件中只說明瞭通過ddi_hold_installed_driver函式獲得資源,而沒有提到需要使用ddi_rele_driver釋放資源。那麼在第三方驅動開發時,可能導致問題。當前中心庫正好使用了sun cluster附帶的磁碟管理軟體
12. 同時,我們對成都AAA主機進行了監控,該主機沒有安裝SUN Cluster,所以,沒有觀察到Oracle對ddi_hold_installed_driver函式的呼叫
root@cdradius # ./devbymajor.d
^C
-x-x-x- Stacks and counts -x-x-x-
23 dec = 2
23 inc = 3
-x-x-x- Stacks and counts -x-x-x-
23 dec 1 sshd
genunix`e_ddi_hold_devi_by_dev+0xe0
specfs`spec_clone+0x148
specfs`spec_open+0x238
genunix`fop_open+0x78
genunix`vn_openat+0x500
genunix`copen+0x260
unix`syscall_trap32+0xcc
23 dec 1 sshd
genunix`dev_to_instance+0x7c
genunix`e_ddi_hold_devi_by_dev+0x14
specfs`spec_clone+0x148
specfs`spec_open+0x238
genunix`fop_open+0x78
genunix`vn_openat+0x500
genunix`copen+0x260
unix`syscall_trap32+0xcc
23 inc 1 sshd
genunix`e_ddi_hold_devi_by_dev+0x34
specfs`spec_clone+0x148
specfs`spec_open+0x238
genunix`fop_open+0x78
genunix`vn_openat+0x500
genunix`copen+0x260
unix`syscall_trap32+0xcc
23 inc 1 sshd
genunix`ddi_hold_installed_driver+0x4
clone`clnopen+0x54
genunix`qattach+0x128
genunix`stropen+0x4fc
specfs`spec_open+0x1bc
genunix`fop_open+0x78
genunix`vn_openat+0x500
genunix`copen+0x260
unix`syscall_trap32+0xcc
23 inc 1 sshd
genunix`dev_to_instance+0x2c
genunix`e_ddi_hold_devi_by_dev+0x14
specfs`spec_clone+0x148
specfs`spec_open+0x238
genunix`fop_open+0x78
genunix`vn_openat+0x500
genunix`copen+0x260
unix`syscall_trap32+0xcc
13. 查詢了SUN的BUG庫,懷疑命中該參考了SUN的Bug 15736437 : SUNBT7080718-3.3U2 PANIC: "MOD_RELE_DEV_BY_MAJOR: UNHELD DRIVER: MAJOR NUMBER
a) 報錯的堆疊相同
b) 當前系統的sun cluster版本3.2.0上,存在該BUG
c) 但bug描述不是太清晰,沒有說明是cluster自身核心的問題,還是程式呼叫導致的問題。但是從分析來看,cluster核心自身問題較大
14. 該補丁包含在144221-09中。7080718 panic: "mod_rele_dev_by_major: Unheld driver: major number ",當前3.2版本,包含了該補丁的最新補丁集為 144221-12
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/8242091/viewspace-764407/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- JDK版本 - Major VersionJDK
- DATA 2100 Major Python ProgrammingPython
- The last major melee archetype includes the sneakyAST
- Choosing a driver model for developing a USB client driverdevclient
- #HACKTHEBOX——Driver
- J2SE - Unsupported major.minor version 51.0
- React.memo vs. useMemo: Major differences and use casesReact
- JavaScript Number()JavaScript
- mongo-go-driverGo
- disable Nouveau kernel driver
- Spark Driver Program剖析Spark
- Kata:Hamming number
- JavaScript Number toLocaleString()JavaScript
- JavaScript Number toString()JavaScript
- Number.NaNNaN
- JavaScript Number 物件JavaScript物件
- Leetcode Number of islandsLeetCode
- DRIVER_POWER_STATE_FAILUREAI
- Driver.js 引導JS
- flutter driver 整合測試Flutter
- java.lang.IllegalArgumentException: Unsupported class file major version xx解決之道JavaException
- mysql8+seaweedfs-csi-driverMySql
- Perfect Number 完美數
- [LeetCode] Third Maximum NumberLeetCode
- [LeetCode] Find the Duplicate NumberLeetCode
- Leetcode 9 Palindrome NumberLeetCode
- Number.parseInt() 方法
- Number.parseFloat()方法
- Number.isSafeInteger()方法
- Number.isNaN()方法NaN
- Number.isFinite()方法
- Js中的NumberJS
- Number.POSITIVE_INFINITY
- Number.ATIVE_INFINITY
- Big Number hd 1212
- 7.104 ITERATION_NUMBER
- Leetcode 447 Number of BoomerangsLeetCodeOOM
- react input[type='number']React
- number(p,s)(轉)