Notes on Oracle RAC (Cluster) Reconfiguration (3)

Posted by aaqwsh on 2011-07-14

node2 alert.log:

Sat Jul 09 16:41:28 CST 2011

Reconfiguration started (old inc 2, new inc 4)

List of nodes:

 0 1

 Global Resource Directory frozen

 Communication channels reestablished

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sat Jul 09 16:41:29 CST 2011

 LMS 0: 0 GCS shadows cancelled, 0 closed

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

 All grantable enqueues granted

Sat Jul 09 16:41:30 CST 2011

 LMS 0: 5074 GCS shadows traversed, 2242 replayed

Sat Jul 09 16:41:30 CST 2011

 Submitted all GCS remote-cache requests

 Post SMON to start 1st pass IR

 Fix write in gcs resources

Reconfiguration complete

 

 

node1 alert.log (after a shutdown abort on node2):

Sat Jul 09 17:32:37 CST 2011

Reconfiguration started (old inc 4, new inc 6)

List of nodes:

 0

 Global Resource Directory frozen

 * dead instance detected - domain 0 invalid = TRUE

 Communication channels reestablished

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sat Jul 09 17:32:38 CST 2011

 LMS 0: 0 GCS shadows cancelled, 0 closed

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

 All grantable enqueues granted

 Post SMON to start 1st pass IR

Sat Jul 09 17:32:39 CST 2011

 LMS 0: 5947 GCS shadows traversed, 0 replayed

Sat Jul 09 17:32:39 CST 2011

 Submitted all GCS remote-cache requests

 Fix write in gcs resources

Reconfiguration complete

Sat Jul 09 17:32:40 CST 2011

Instance recovery: looking for dead threads

Sat Jul 09 17:32:40 CST 2011

Beginning instance recovery of 1 threads

Sat Jul 09 17:32:42 CST 2011

Started redo scan

Sat Jul 09 17:32:46 CST 2011

Completed redo scan

 3 redo blocks read, 5 data blocks need recovery

Sat Jul 09 17:32:46 CST 2011

Started redo application at

 Thread 2: logseq 5, block 1884

Sat Jul 09 17:32:47 CST 2011

Recovery of Online Redo Log: Thread 2 Group 3 Seq 5 Reading mem 0

  Mem# 0: +RAC_DISK/racdb/onlinelog/group_3.258.751759681

Sat Jul 09 17:32:47 CST 2011

Completed redo application

Sat Jul 09 17:32:47 CST 2011

Completed instance recovery at

 Thread 2: logseq 5, block 1887, scn 532837

 3 data blocks read, 5 data blocks written, 3 redo blocks read

Sat Jul 09 17:32:48 CST 2011

Thread 2 advanced to log sequence 6 (thread recovery)
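The two alert.log excerpts above follow the same pattern: a "Reconfiguration started" line carrying the old and new incarnation numbers, a "List of nodes:" entry, an optional dead-instance notice, and a "Reconfiguration complete" line. As a rough sketch only, the Python below pulls those fields out of such lines. The function name `parse_reconfigurations` and the dictionary keys are illustrative choices made here, and the patterns are derived only from the excerpts in this post, not from any Oracle specification.

```python
import re

# Pattern taken from the alert.log excerpts in this post.
RECONFIG_RE = re.compile(
    r"Reconfiguration started \(old inc (\d+), new inc (\d+)\)"
)


def parse_reconfigurations(lines):
    """Collect one summary dict per reconfiguration found in the given lines."""
    events = []
    current = None          # event currently being filled in
    expect_nodes = False    # True right after a "List of nodes:" line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = RECONFIG_RE.search(line)
        if match:
            current = {
                "old_inc": int(match.group(1)),
                "new_inc": int(match.group(2)),
                "nodes": [],
                "dead_instance": False,
                "complete": False,
            }
            events.append(current)
            expect_nodes = False
            continue
        if current is None:
            continue
        if line == "List of nodes:":
            expect_nodes = True
        elif expect_nodes:
            tokens = line.split()
            # Only accept the line as a node list if it is purely numeric.
            if all(tok.isdigit() for tok in tokens):
                current["nodes"] = [int(tok) for tok in tokens]
            expect_nodes = False
        elif "dead instance detected" in line:
            current["dead_instance"] = True
        elif line == "Reconfiguration complete":
            current["complete"] = True
            current = None
    return events
```

Fed the node1 excerpt, this would report a move from incarnation 4 to 6 with only node 0 remaining and the dead-instance flag set, which matches the shutdown abort on node2.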

 

An important service is involved here, the Cluster Group Service (CGS):

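LMON: the LMON processes of the instances communicate with each other periodically to check the health of every node in the cluster, and when a node fails, LMON is responsible for cluster reconfiguration. The service it provides is called the Cluster Group Service (CGS). Oracle Clusterware has its own approach to split-brain, which uses the Process Monitor Daemon. However, if the instance on a node hangs abnormally, the problem may go undetected when viewed only from the network, OS, and Clusterware layers, so the database needs a self-monitoring mechanism of its own. The LMON process provides this Node Monitor function, which records the health of each node at the application layer. Node health is recorded in a bitmap in the GRD, one bit per node: 0 means the node is down and 1 means it is running normally. The LMON processes on the nodes communicate with each other to confirm that this bitmap is consistent.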
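The one-bit-per-node bitmap described above can be illustrated with a small Python sketch. This is not Oracle's internal data structure. The function names and the idea of comparing each node's view as an integer are assumptions made purely for illustration.

```python
def bitmap_from_members(members, node_count):
    """Build a membership bitmap: bit i is 1 if node i is running, else 0."""
    bitmap = 0
    for node in members:
        if not 0 <= node < node_count:
            raise ValueError(f"node {node} is outside 0..{node_count - 1}")
        bitmap |= 1 << node
    return bitmap


def members_from_bitmap(bitmap):
    """Return the node numbers whose bit is set, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def views_agree(views):
    """views maps node number -> the bitmap that node currently holds.

    The cluster view is consistent only if every node holds the same bitmap.
    """
    return len(set(views.values())) <= 1
```

In this sketch, a two-node cluster with both instances up corresponds to the node list "0 1" in the logs, and after node 1 is aborted only bit 0 remains set. A disagreement between the views held by different nodes is the kind of condition that would call for a reconfiguration.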

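    LMON can cooperate with the underlying Clusterware or work on its own. When LMON detects a split-brain at the instance level, it first expects Clusterware to resolve it, but RAC does not assume that Clusterware will definitely succeed. LMON therefore does not wait indefinitely for the Clusterware layer: once the wait times out, the LMON process automatically triggers IMR (Instance Membership Recovery). IMR can be regarded as the split-brain and I/O fencing mechanism that Oracle provides at the database layer.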
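The "wait for the lower layer, then escalate on timeout" behaviour described above can be sketched generically in Python. This only models the control flow. The function name `wait_then_escalate`, the callback `clusterware_done`, and the returned labels are hypothetical, and the timeout passed in is an arbitrary illustrative value rather than any specific Oracle timeout.

```python
import time


def wait_then_escalate(clusterware_done, timeout_sec, poll_sec=0.05,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until the lower layer reports success or the timeout expires.

    Returns "clusterware" if the (hypothetical) callback reports that the
    lower layer resolved the problem in time, otherwise "imr" to signal that
    the database layer should start its own membership recovery.
    """
    deadline = clock() + timeout_sec
    while clock() < deadline:
        if clusterware_done():
            return "clusterware"
        sleep(poll_sec)
    return "imr"
```

The point of the design is that the database layer never blocks forever on a layer it does not control: the fallback path is always reachable once the deadline passes.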

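    LMON relies mainly on two kinds of heartbeat for health monitoring:

    1. The network heartbeat between nodes.

    2. The disk heartbeat through the control file. Each instance's CKPT process updates the Checkpoint Progress Record block in the control file every 3 seconds. Because the control file is shared, each instance can check whether the others are updating it on time and judge their state from that.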
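The disk-heartbeat idea in item 2 can be sketched as a simple staleness check in Python. The 3-second interval comes from the text above. The `max_missed` tolerance, the function name, and the use of plain timestamps in a dictionary are assumptions for illustration and do not reflect how Oracle actually reads the control file.

```python
HEARTBEAT_INTERVAL_SEC = 3  # from the text: CKPT updates every 3 seconds


def stale_instances(last_update, now, max_missed=3):
    """Return instances whose last heartbeat is older than the tolerance.

    last_update maps instance number -> timestamp (seconds) of its most
    recent heartbeat write. max_missed is a hypothetical tolerance: how many
    consecutive intervals may be missed before an instance is suspected.
    """
    threshold = HEARTBEAT_INTERVAL_SEC * max_missed
    return sorted(
        inst for inst, ts in last_update.items() if now - ts > threshold
    )
```

Because every instance can read the same shared record, each one can run this kind of check against its peers without relying on the network heartbeat.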

 

The corresponding LMON trace log:

*** 2011-07-09 16:41:25.412

kjxgmrcfg: Reconfiguration started, reason 1

kjxgmcs: Setting state to 2 0.

*** 2011-07-09 16:41:25.570

     Name Service frozen

kjxgmcs: Setting state to 2 1.

kjxgrssvote: reconfig bitmap chksum 0xccd0ae50 cnt 2 master 0 ret 0

kjxggpoll: change poll time to 50 ms

*** 2011-07-09 16:41:25.665

Obtained RR update lock for sequence 3, RR seq 2

*** 2011-07-09 16:41:25.752

Voting results, upd 0, seq 4, bitmap: 0 1

CGS/IMR TIMEOUTS:

  CSS recovery timeout = 71 sec

  IMR Reconfig timeout = 300 sec

  CGS rcfg timeout = 300 sec

kjxgmps: proposing substate 2

kjxgmcs: Setting state to 4 2.

 kjfmuin: bitmap 0 1

 kjfmmhi: received msg from 0 (inc 2)

 kjfmmhi: received msg from 1 (inc 4)

     Performed the unique instance identification check

kjxgmps: proposing substate 3

kjxgmcs: Setting state to 4 3.

     Name Service recovery started

     Deleted all dead-instance name entries

kjxgmps: proposing substate 4

kjxgmcs: Setting state to 4 4.

     Multicasted all local name entries for publish

     Replayed all pending requests

kjxgmps: proposing substate 5

kjxgmcs: Setting state to 4 5.

     Name Service normal

     Name Service recovery done

*** 2011-07-09 16:41:27.200

kjxgmps: proposing substate 6

kjxgmcs: Setting state to 4 6.

kjxggpoll: change poll time to 600 ms

*** 2011-07-09 16:41:28.279

kjfcrfg: DRM window size = 128->128 (min lognb = 10)

*** 2011-07-09 16:41:28.279

Reconfiguration started (old inc 2, new inc 4)

Synchronization timeout interval: 900 sec

List of nodes:

 0 1

Undo tsn affinity 1

*** 2011-07-09 16:41:28.311

*** 2011-07-09 16:41:28.311

kjfcrfg: query of NESTED_RECONFIGURATION for node 1 failed with 7

 Global Resource Directory frozen

node 0

node 1

release 10 2 0 5

 asby init, 0/0/x2

 asby returns, 0/0/x2/false

* Domain maps before reconfiguration:

*   DOMAIN 0 (valid 1): 0

* End of domain mappings

* Domain maps after recomputation:

*   DOMAIN 0 (valid 1): 0 1

* End of domain mappings

 Dead  inst

 Join  inst 1

 Exist inst 0

 Active Sendback Threshold = 50 %

 Communication channels reestablished

 sent syncr inc 4 lvl 1 to 0 (4,5/0/0)

 sent synca inc 4 lvl 1 (4,5/0/0)

 received all domreplay (4.6)

 sent master 0 (4.6)

*** 2011-07-09 16:41:29.535

KJBDOMHVMAP: BEGINS

*** 2011-07-09 16:41:29.560

KJBDOMHVMAP: ENDS

 sent dom info (4.6)

 sent hv info (4.6)

 sent syncr inc 4 lvl 2 to 0 (4,7/0/0)

 sent synca inc 4 lvl 2 (4,7/0/0)

 Master broadcasted resource hash value bitmaps

* kjfcrfg: domain 0 valid, valid_ver = 4

 Non-local Process blocks cleaned out

 Set master node info

 sent syncr inc 4 lvl 3 to 0 (4,13/0/0)

 sent synca inc 4 lvl 3 (4,13/0/0)

 Submitted all remote-enqueue requests

kjfcrfg: Number of mesgs sent to node 1 = 774

 sent syncr inc 4 lvl 4 to 0 (4,15/0/0)

 sent synca inc 4 lvl 4 (4,15/0/0)

 Dwn-cvts replayed, VALBLKs dubious

 sent syncr inc 4 lvl 5 to 0 (4,18/0/0)

 sent synca inc 4 lvl 5 (4,18/0/0)

 All grantable enqueues granted

 sent syncr inc 4 lvl 6 to 0 (4,20/0/0)

 sent synca inc 4 lvl 6 (4,20/0/0)

 Submitted all GCS cache requests

 sent syncr inc 4 lvl 7 to 0 (4,22/0/0)

 sent synca inc 4 lvl 7 (4,22/0/0)

 Post SMON to start 1st pass IR

 Fix write in gcs resources

 sent syncr inc 4 lvl 8 to 0 (4,24/0/0)

 sent synca inc 4 lvl 8 (4,24/0/0)

*** 2011-07-09 16:41:31.006

Reconfiguration complete

 

 

*** 2011-07-09 17:32:33.682

kjxgmpoll reconfig bitmap: 0

*** 2011-07-09 17:32:33.745

kjxgmrcfg: Reconfiguration started, reason 1

kjxgmcs: Setting state to 4 0.

*** 2011-07-09 17:32:34.157

     Name Service frozen

kjxgmcs: Setting state to 4 1.

kjxgrssvote: reconfig bitmap chksum 0x6668604e cnt 1 master 0 ret 0

kjxggpoll: change poll time to 50 ms

*** 2011-07-09 17:32:34.464

Obtained RR update lock for sequence 5, RR seq 4

*** 2011-07-09 17:32:37.539

Voting results, upd 0, seq 6, bitmap: 0

CGS/IMR TIMEOUTS:

  CSS recovery timeout = 71 sec

  IMR Reconfig timeout = 300 sec

  CGS rcfg timeout = 300 sec

kjxgmps: proposing substate 2

kjxgmcs: Setting state to 6 2.

kjfmSendAbortInstMsg: send an abort message to node 1

kjfmSendAbortInstMsg: unique id 0x0 reason 0x1

 kjfmuin: bitmap 0

 kjfmmhi: received msg from 0 (inc 2)

     Performed the unique instance identification check

kjxgmps: proposing substate 3

kjxgmcs: Setting state to 6 3.

     Name Service recovery started

     Deleted all dead-instance name entries

kjxgmps: proposing substate 4

kjxgmcs: Setting state to 6 4.

     Multicasted all local name entries for publish

     Replayed all pending requests

kjxgmps: proposing substate 5

kjxgmcs: Setting state to 6 5.

     Name Service normal

     Name Service recovery done

*** 2011-07-09 17:32:37.598

kjxgmps: proposing substate 6

kjxgmcs: Setting state to 6 6.

kjxggpoll: change poll time to 600 ms

kjfmact: call ksimdic on instance (1)

*** 2011-07-09 17:32:37.843

kjfcrfg: DRM window size = 128->128 (min lognb = 10)

*** 2011-07-09 17:32:37.845

Reconfiguration started (old inc 4, new inc 6)

Synchronization timeout interval: 900 sec

List of nodes:

 0

Undo tsn affinity 1

*** 2011-07-09 17:32:37.906

 Global Resource Directory frozen

node 0

 asby init, 0/0/x2

 asby returns, 0/0/x2/false

* Domain maps before reconfiguration:

*   DOMAIN 0 (valid 1): 0 1

* End of domain mappings

* kjbdomrcfg2: domain 0 invalid = TRUE

* Domain maps after recomputation:

*   DOMAIN 0 (valid 0): 0

* End of domain mappings

 Active Sendback Threshold = 50 %

 Communication channels reestablished

 sent syncr inc 6 lvl 1 to 0 (6,5/0/0)

 sent syncr inc 6 lvl 2 to 0 (6,7/0/0)

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

 Set master node info

 sent syncr inc 6 lvl 3 to 0 (6,13/0/0)

 Submitted all remote-enqueue requests

 sent syncr inc 6 lvl 4 to 0 (6,15/0/0)

 Dwn-cvts replayed, VALBLKs dubious

 sent syncr inc 6 lvl 5 to 0 (6,18/0/0)

 All grantable enqueues granted

 sent syncr inc 6 lvl 6 to 0 (6,20/0/0)

*** 2011-07-09 17:32:39.351

 Post SMON to start 1st pass IR

 Submitted all GCS cache requests

 sent syncr inc 6 lvl 7 to 0 (6,22/0/0)

 Fix write in gcs resources

 sent syncr inc 6 lvl 8 to 0 (6,24/0/0)

*** 2011-07-09 17:32:39.673

Reconfiguration complete

*   domain 0 valid?: 0

kjxgfipccb: msg 0x0xb7db2a6c, mbo 0x0xb7db2a68, type 19, ack 0, ref 0, stat 34

Source: ITPUB Blog, link: http://blog.itpub.net/758322/viewspace-702235/. If you repost this article, please credit the source; otherwise legal action may be taken.
