The role of the lmd process in Oracle 10.2.0.1 RAC (part 1)

Posted by 不一樣的天空w on 2018-01-20

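Conclusions

1. The test environment is Oracle 10.2.0.1 RAC.
2. If the lmd process terminates abnormally, the RAC instance it belongs to is restarted, and a SYSTEMSTATE DUMP file is generated before the instance shuts down.
3. According to the official manual, lmon is the process that monitors lmd. In the tests below, however, killing either lmd or lmon made PMON terminate the whole instance, and both processes came back only when the instance itself restarted.
4. lmd is responsible for the Global Enqueue Service (GES). Put plainly, it manages resource requests across the RAC instances, which shows how important LMD is: if LMD fails, DML operations on the database hang,
   which in turn leads to IPC send timeouts between the RAC nodes.
5. An IPC send timeout produces a corresponding LMD trace file.
6. The parameters related to lmd are still being investigated; no conclusion yet.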

Tests



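--What lmd does
The lmd process provides the Global Enqueue Service (GES).
On each RAC instance it handles the resource requests that arrive from remote RAC nodes. It is a daemon process, which I take to mean that a monitoring process guards it and restarts it if it is missing.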




--If lmd is killed, the instance on that RAC node is forcibly shut down, and a systemstate dump is written before the instance closes, for later analysis
[oracle@jingfa1 ~]$ ps -ef|grep lmd
oracle    4774     1  0 Nov09 ?        00:00:31 asm_lmd0_+ASM1
oracle   11220     1  0 02:13 ?        00:00:15 ora_lmd0_jingfa1
oracle   30706 30376  0 05:19 pts/3    00:00:00 grep lmd
[oracle@jingfa1 ~]$ kill -9 11220


Tue Nov 10 05:20:03 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_pmon_11212.trc:
ORA-00482: LMD* process terminated with error
Tue Nov 10 05:20:03 2015
PMON: terminating instance due to error 482
Tue Nov 10 05:20:03 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lms0_11222.trc:
ORA-00482: LMD* process terminated with error
Tue Nov 10 05:20:03 2015
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/jingfa/bdump/jingfa1_diag_11214.trc
Tue Nov 10 05:20:03 2015
Trace dumping is performing id=[cdmp_20151110052003]
Tue Nov 10 05:20:08 2015
Instance terminated by PMON, pid = 11212
--The instance then restarts automatically
Tue Nov 10 05:21:05 2015
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0


Once the instance is back, lmd is running again, with a new PID
[oracle@jingfa1 ~]$ ps -ef|grep lmd
oracle    3474 30376  0 05:23 pts/3    00:00:00 grep lmd
oracle    4774     1  0 Nov09 ?        00:00:31 asm_lmd0_+ASM1
oracle   32703     1  0 05:21 ?        00:00:00 ora_lmd0_jingfa1
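
The `ps -ef|grep lmd` listings above also match the ASM instance's lmd and the grep command itself, so it is easy to pick the wrong PID before a `kill -9`. A minimal sketch of a filter that returns only the database instance's PID follows; the function name `bg_pid` is made up for illustration, and it assumes the `ora_<process>_<instance>` naming seen in the listings.

```shell
# Read `ps -ef` output on stdin and print the PID of one background process
# of one database instance. ASM processes and the grep line are ignored
# because their last field does not match ora_<process>_<instance>.
# bg_pid is a hypothetical helper name, not an Oracle tool.
bg_pid() {
  proc=$1   # e.g. lmd0 or lmon
  inst=$2   # e.g. jingfa1
  awk -v name="ora_${proc}_${inst}" '$NF == name { print $2 }'
}

# Usage on a live node (test systems only):
#   ps -ef | bg_pid lmd0 jingfa1
```

Matching the last field exactly, rather than grepping for a substring, is what keeps `asm_lmd0_+ASM1` out of the result.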


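As noted above, the health of lmd is supposed to be looked after by a monitoring process; according to the official manual this is lmon. LMON manages the cross-instance (global) enqueues and resources of each RAC instance and performs recovery of global enqueue locks.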


[oracle@jingfa1 bdump]$ ps -ef|grep lmon
oracle    4772     1  0 Nov09 ?        00:00:29 asm_lmon_+ASM1
oracle   19857 30376  0 05:34 pts/3    00:00:00 grep lmon
oracle   32701     1  0 05:21 ?        00:00:02 ora_lmon_jingfa1
[oracle@jingfa1 bdump]$ kill -9 32701
After LMON is killed, the LMD process of the same instance is gone as well
[oracle@jingfa1 bdump]$ ps -ef|grep lmd
oracle    4774     1  0 Nov09 ?        00:00:32 asm_lmd0_+ASM1
oracle   21171 30376  0 05:34 pts/3    00:00:00 grep lmd


Killing lmon forces a restart of the database instance
Tue Nov 10 05:34:18 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_pmon_32695.trc:
ORA-00481: LMON process terminated with error
Tue Nov 10 05:34:18 2015
PMON: terminating instance due to error 481
Tue Nov 10 05:34:18 2015
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/jingfa/bdump/jingfa1_diag_32697.trc
Tue Nov 10 05:34:18 2015
Trace dumping is performing id=[cdmp_20151110053418]
Tue Nov 10 05:34:23 2015
Instance terminated by PMON, pid = 32695
Tue Nov 10 05:35:19 2015
Starting ORACLE instance (normal)
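
Both alert-log excerpts follow the same sequence: an ORA-0048x error, PMON terminating the instance, and a fresh instance start about a minute later. A small sketch for pulling just those events, with their timestamps, out of an alert log is shown here. The function name `alert_events` is hypothetical, and the timestamp pattern assumes the 10g alert-log layout seen above, where each timestamp sits on its own line.

```shell
# Read a 10g-style alert log on stdin and print "timestamp | event" for
# the termination and startup events seen in the excerpts above.
# alert_events is a hypothetical helper name.
alert_events() {
  awk '
    # Timestamp lines look like: Tue Nov 10 05:20:03 2015
    /^[A-Z][a-z][a-z] [A-Z][a-z][a-z] +[0-9]+ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9][0-9][0-9] *$/ {
      ts = $0
      next
    }
    # Events of interest; each is printed with the most recent timestamp.
    /^ORA-0048[0-9]/ || /terminating instance/ || /^Instance terminated/ || /^Starting ORACLE instance/ {
      print ts " | " $0
    }
  '
}

# Usage (the file name is an assumption based on the bdump path in the logs):
#   alert_events < /u01/app/oracle/admin/jingfa/bdump/alert_jingfa1.log
```

Comparing the "Instance terminated" and "Starting ORACLE instance" timestamps gives a rough figure for how long the instance was down in each test.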


lmon and lmd are both running again after the instance restart
[oracle@jingfa1 bdump]$ ps -ef|grep lmon
oracle    4772     1  0 Nov09 ?        00:00:30 asm_lmon_+ASM1
oracle   21820     1  0 05:35 ?        00:00:01 ora_lmon_jingfa1
oracle   27926 30376  0 05:39 pts/3    00:00:00 grep lmon
[oracle@jingfa1 bdump]$ ps -ef|grep lmd
oracle    4774     1  0 Nov09 ?        00:00:33 asm_lmd0_+ASM1
oracle   21822     1  0 05:35 ?        00:00:00 ora_lmd0_jingfa1
oracle   28028 30376  0 05:39 pts/3    00:00:00 grep lmd




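Taking this further: there must be some mechanism at the operating-system level that brings lmon and lmd back after they are killed. What is that mechanism?
I went through the OS-level processes, mainly the scripts under /etc/init.d, and compared them. The finding is that lmon and its lmd belong to the Oracle (database) layer rather than the cluster layer, and no OS-level program controls them directly.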


Let's take a different approach: which parameters are related to lmd, and what do they mean?


NAME_1                                             VALUE_1                                            DESC1
-------------------------------------------------- -------------------------------------------------- --------------------------------------------------
_lm_lmd_waittime                                   8                                                  default wait time for lmd in centiseconds
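
The query that produced the listing above is not shown. Hidden (underscore) parameters are commonly read by joining the fixed tables x$ksppi and x$ksppcv as SYSDBA, so a sketch along those lines is given here. The function name `hidden_param_sql` and the column aliases (chosen to match the NAME_1 / VALUE_1 / DESC1 headings) are assumptions, not the author's script.

```shell
# Print a SQL*Plus script that lists hidden parameters whose name contains
# the given substring. hidden_param_sql is a hypothetical helper name; the
# exact query used in the original test is not shown in the post.
hidden_param_sql() {
  pattern=$1   # e.g. lmd
  cat <<EOF
set linesize 200
col name_1 format a50
col value_1 format a50
col desc1 format a50
select i.ksppinm name_1, v.ksppstvl value_1, i.ksppdesc desc1
  from x\$ksppi i, x\$ksppcv v
 where i.indx = v.indx
   and i.ksppinm like '%${pattern}%';
EOF
}

# Usage (x\$ tables are visible to SYS only; test systems only):
#   hidden_param_sql lmd | sqlplus -s "/ as sysdba"
```

Generating the script as text keeps the `$` in the x$ table names away from shell expansion and makes it easy to review what will be run before piping it into sqlplus.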




---node1
SQL> select addr,program,username,pid,spid from v$process where username='oracle' and spid=21822;


ADDR             PROGRAM                                          USERNAME               PID SPID
---------------- ------------------------------------------------ --------------- ---------- ------------
0000000083A585C8 oracle@jingfa1 (LMD0)                            oracle                   6 21822


--node2
SQL> select addr,program,username,pid,spid from v$process where username='oracle' and spid=668;


ADDR             PROGRAM                                          USERNAME               PID SPID
---------------- ------------------------------------------------ --------------- ---------- ------------
0000000083A585C8 oracle@jingfa2 (LMD0)                            oracle                   6 668


--node2
SQL> conn tbs_zxy/system
Connected.
SQL> update t_lock set a=11 where a=1;


1 row updated.


--node1
SQL> update t_lock set a=1111 where a=1;
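--hangs
node1 simply waits for node2's uncommitted update, so this test says nothing about the parameter above being directly involved in lock detection, although lmd itself is tied to global locks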


Another approach: what happens if we use oradebug to suspend lmd and then create a global lock conflict?


---node1
Suspend lmd
SQL> oradebug setospid 21822
Oracle pid: 6, Unix process pid: 21822, image: oracle@jingfa1 (LMD0)
SQL> oradebug suspend
Statement processed.


Tue Nov 10 06:03:44 2015
Unix process pid: 21822, image: oracle@jingfa1 (LMD0) flash frozen




---node2
Suspend lmd
SQL> oradebug setospid 668
Oracle pid: 6, Unix process pid: 668, image: oracle@jingfa2 (LMD0)
SQL> oradebug suspend
Statement processed.


Tue Nov 10 06:06:08 2015
Unix process pid: 668, image: oracle@jingfa2 (LMD0) flash frozen


---node2
SQL> update t_lock set a=11 where a=1;


1 row updated.


--node1
SQL> update t_lock set a=1111 where a=1;
--hangs


Now look at the alert logs of node 1 and node 2


--node2
Tue Nov 10 06:09:42 2015
IPC Send timeout detected.Sender: ospid 682  --the sender is the SMON process
Receiver: inst 1 binc 432326879 ospid 21822  --the receiver is node1's LMD process
Tue Nov 10 06:09:45 2015
IPC Send timeout to 0.0 inc 20 for msg type 12 from opid 12 --same event as above; the sender is SMON
Tue Nov 10 06:09:45 2015
Communications reconfiguration: instance_number 1
Tue Nov 10 06:09:45 2015
IPC Send timeout detected.Sender: ospid 696  --the sender is the MMON process
Receiver: inst 1 binc 432326879 ospid 21822   --the receiver is node1's lmd process
Tue Nov 10 06:09:48 2015
IPC Send timeout to 0.0 inc 20 for msg type 12 from opid 15  ---same event as above; the sender (opid 15) is MMON




--node1
Tue Nov 10 06:09:23 2015
IPC Send timeout detected. Receiver ospid 21822  --the receiver is the LMD process
Tue Nov 10 06:09:23 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lmd0_21822.trc: --an LMD trace file is produced
IPC Send timeout detected. Receiver ospid 21822 --same as above
Tue Nov 10 06:09:27 2015
Errors in file /u01/app/oracle/admin/jingfa/bdump/jingfa1_lmd0_21822.trc:  
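
Reading sender and receiver out of these messages by eye gets tedious on a busy system. A sketch that extracts the OS PIDs from both message forms seen above follows; the function name `ipc_timeouts` and the output format are invented for illustration, and the patterns are based only on the lines shown in these two excerpts.

```shell
# Read alert-log text on stdin and summarise "IPC Send timeout detected"
# messages. Handles the sender-side form (Sender line followed by a
# Receiver line) and the receiver-side form (single line).
# ipc_timeouts is a hypothetical helper name.
ipc_timeouts() {
  awk '
    # Return the field that follows the first occurrence of word.
    function after(word,    i) {
      for (i = 1; i < NF; i++) if ($i == word) return $(i + 1)
      return ""
    }
    /^IPC Send timeout detected\. *Sender: ospid/ {
      sender = after("ospid")
      next
    }
    /^Receiver: inst/ && sender != "" {
      print "sender_ospid=" sender " receiver_inst=" after("inst") " receiver_ospid=" after("ospid")
      sender = ""
      next
    }
    /^IPC Send timeout detected\. *Receiver ospid/ {
      print "local_receiver_ospid=" after("ospid")
    }
  '
}

# Usage (the file name is an assumption based on the bdump path in the logs):
#   ipc_timeouts < /u01/app/oracle/admin/jingfa/bdump/alert_jingfa2.log
```

The PIDs it prints can then be mapped to process names with `ps` or `v$process`, as done at the end of this post.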




This shows that lmd is indeed involved in acquiring global locks: if the LMD process fails, communication between the two RAC nodes breaks down
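
The test leaves both LMD0 processes suspended. `oradebug resume` is the counterpart of `oradebug suspend`, so a sketch that prints the commands for either action is given here. The function name `oradebug_script` is hypothetical, and whether an instance survives long enough to be resumed after the IPC timeouts is not shown in this test.

```shell
# Print the oradebug commands to suspend or resume the process with the
# given OS PID. oradebug_script is a hypothetical helper name.
oradebug_script() {
  action=$1   # suspend or resume
  ospid=$2    # OS PID, e.g. the SPID of LMD0 from v$process
  case $action in
    suspend|resume) ;;
    *) echo "usage: oradebug_script suspend|resume <ospid>" >&2; return 1 ;;
  esac
  printf 'oradebug setospid %s\noradebug %s\n' "$ospid" "$action"
}

# Usage (as SYSDBA, test systems only):
#   oradebug_script resume 21822 | sqlplus -s "/ as sysdba"
```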






[oracle@jingfa2 bdump]$ ps -ef|grep 682
oracle     682     1  0 02:14 ?        00:00:01 ora_smon_jingfa2
oracle    7157 13004  0 06:15 pts/1    00:00:00 grep 682




SQL> select spid,pid,program from v$process where spid=696;


SPID                PID PROGRAM
------------ ---------- ------------------------------------------------
696                  15 oracle@jingfa2 (MMON)

Source: "ITPUB blog", link: http://blog.itpub.net/31397003/viewspace-2150352/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
