Oracle RAC: Member voting

Posted by tolywang on 2011-03-01


The CGS is responsible for checking whether members are valid. To determine periodically whether all members are alive, a voting mechanism is used to check the validity of each member. All members in the database group vote by providing details of what they presume the instance membership bitmap looks like. As mentioned, the bitmap is stored in the GRD. A predetermined master member tallies the vote (status) flags and communicates to the respective processes that the voting is done; it then waits for registration by all members that have received the reconfigured bitmap.
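To make the tallying idea concrete, here is a minimal Python sketch (not Oracle code; the function and variable names are invented) of members proposing the membership bitmap they each believe is current and a master reducing those proposals to an agreed bitmap:

# Illustrative sketch only, not Oracle internals. Each member proposes the
# membership bitmap it believes is current; a predetermined master tallies
# the proposals into the agreed bitmap (here, their intersection).
def tally_votes(proposals):
    """proposals maps member id -> set of member ids that member can see."""
    agreed = None
    for bitmap in proposals.values():
        agreed = set(bitmap) if agreed is None else agreed & set(bitmap)
    return agreed or set()

# Example: member 2 has lost its interconnect, so members 0, 1, and 3 no
# longer see it, while member 2 still believes everyone is alive.
proposals = {
    0: {0, 1, 3},
    1: {0, 1, 3},
    2: {0, 1, 2, 3},
    3: {0, 1, 3},
}
print(tally_votes(proposals))   # {0, 1, 3} -> member 2 will be evicted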

How Voting Happens

The CKPT process updates the control file every 3 seconds in an operation known as the heartbeat. CKPT writes into a single block that is unique for each instance, so intra-instance coordination is not required. This block is called the checkpoint progress record.

All members attempt to obtain a lock on a control file record (the result record) in order to update it. The instance that obtains the lock tallies the votes from all members. The group membership must conform to the decided (voted) membership before the GCS/GES reconfiguration is allowed to proceed. The control file vote result record is stored in the same block as the heartbeat, that is, in the control file checkpoint progress record.
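As a rough analogy only (this is not Oracle's implementation, and every name below is invented), the heartbeat and vote-result record can be pictured like this: each instance updates only its own slot, and whichever instance wins the lock on the shared result record tallies the votes and publishes the outcome:

import threading
import time

# Invented analogy of the heartbeat and vote-result record, not Oracle code.
heartbeats = {}                  # instance id -> last heartbeat timestamp
result_lock = threading.Lock()   # stands in for the lock on the result record
vote_result = {}                 # the published voting outcome

def heartbeat(inst_id):
    # Like CKPT updating the checkpoint progress record: each instance
    # writes only its own slot, so no cross-instance coordination is needed.
    heartbeats[inst_id] = time.time()

def tally(inst_id, proposals):
    # Only the instance that obtains the result-record lock tallies the votes.
    if result_lock.acquire(blocking=False):
        try:
            vote_result["bitmap"] = set.intersection(*map(set, proposals.values()))
            vote_result["tallied_by"] = inst_id
        finally:
            result_lock.release()

heartbeat(0)
tally(0, {0: {0, 1, 3}, 1: {0, 1, 3}, 3: {0, 1, 3}})
print(vote_result)               # {'bitmap': {0, 1, 3}, 'tallied_by': 0}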

In this scenario of dead instances and member evictions, a potentially disastrous situation could arise if pending I/O from a dead instance were flushed to the I/O subsystem, because that could corrupt data. I/O from an abnormally departing instance cannot be flushed to disk if database integrity is to be maintained. To "shield" the database from data corruption in this situation, a technique called I/O fencing is used. The I/O fencing implementation is a function of the CM and depends on the clusterware vendor.
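Vendor implementations of fencing differ, but the common idea can be sketched as a registration check at the shared-storage layer: writes from a node whose registration has been revoked are rejected. The following Python sketch is purely conceptual; the class and its behavior are invented for illustration:

# Conceptual fencing sketch; real implementations are vendor-specific.
class FencedStorage:
    def __init__(self, registered_nodes):
        self.registered = set(registered_nodes)

    def fence(self, node_id):
        # Revoke the node's registration so its pending I/O can no longer land.
        self.registered.discard(node_id)

    def write(self, node_id, block, data):
        if node_id not in self.registered:
            raise PermissionError("node %d is fenced; write rejected" % node_id)
        # ... perform the write to shared storage ...

storage = FencedStorage({0, 1, 2, 3})
storage.fence(2)                   # instance 2 has been evicted
# storage.write(2, 100, b"...")    # would raise PermissionError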

I/O fencing is designed to guarantee data integrity when faulty cluster communications cause a split-brain condition. A split-brain occurs when cluster nodes hang or node interconnects fail and, as a result, the nodes lose the communication link between themselves and the rest of the cluster. Split-brain is a problem in any clustered environment; it is a property of clustering solutions in general, not of RAC specifically. Split-brain conditions can cause database corruption when nodes become uncoordinated in their access to the shared data files.

For a two-node cluster, split-brain occurs when nodes in a cluster cannot talk to each other (the internode links fail) and each node assumes it is the only surviving member of the cluster. If the nodes in the cluster have uncoordinated access to the shared storage area, they would end up overwriting each other's data, causing data corruption, because each node assumes ownership of the shared data. To prevent data corruption, one node must be asked to leave the cluster or should be forced out immediately. This is where IMR comes in, as explained earlier in the chapter.
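The decision of which side survives can be illustrated with a simplified tie-break rule (an illustration only; the actual policy belongs to IMR and the clusterware): the larger subcluster survives, and on a tie the subcluster containing the lowest-numbered node wins:

def surviving_subcluster(subclusters):
    # Simplified illustration of a split-brain tie-break: the largest
    # subcluster wins; on a tie, the one containing the lowest node number.
    # The losing partition must be evicted before it touches shared data.
    return max(subclusters, key=lambda s: (len(s), -min(s)))

# Two-node cluster whose interconnect has failed: each node sees only itself.
print(surviving_subcluster([{0}, {1}]))   # {0}, so node 1 must leave the cluster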

Many internal (hidden) parameters control IMR and determine when it should start. If a vendor clusterware is used, split-brain resolution is left to it, and Oracle has to wait for the clusterware to provide a consistent view of the cluster and resolve the split-brain issue. This can potentially cause a delay (and a hang in the whole cluster) because each node can potentially think it is the master and try to own all the shared resources. Still, Oracle relies on the clusterware to resolve these challenging issues.

Note that Oracle does not wait indefinitely for the clusterware to resolve a split-brain issue; instead, a timer is used to trigger an IMR-based node eviction. These internal timers are also controlled by hidden parameters. The default values of these hidden parameters should not be changed, as doing so can cause severe performance or operational issues with the cluster.
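Conceptually the timer behaves like a watchdog: IMR waits a bounded time for the clusterware to deliver a consistent membership view, and if that time expires it initiates the eviction itself. The sketch below shows only that control flow; the function names and the timeout value are invented, and the real timers are governed by the hidden parameters mentioned above:

import time

# Watchdog-style control flow only; the real IMR timers are internal.
def wait_for_clusterware(get_cluster_view, timeout_s=300.0):
    # Returns the clusterware's membership view if it arrives in time,
    # otherwise None so the caller (IMR) can trigger its own eviction.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        view = get_cluster_view()
        if view is not None:       # clusterware resolved the split-brain
            return view
        time.sleep(1.0)
    return None                    # timer expired: fall back to IMR eviction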

An obvious question that pops up in your mind might be, why does a split-brain condition take place? Why does Oracle say a link is down when my communication engineer says it's fine? This is easier asked than answered, because the underlying hardware and software layers that support RAC are complex, and it can be a nightmare trying to figure out why a cluster broke in two when everything seems normal. Usually, misconfiguration of the communication links or bugs in the clusterware cause these issues.

As mentioned time and again, Oracle completely relies on the cluster software to provide cluster services, and if something is awry, Oracle, in its overzealous quest to protect data integrity, evicts nodes or aborts an instance and assumes that something is wrong with the cluster.

Cluster reconfiguration is initiated when NM indicates a change in the database group or when IMR detects a problem. Reconfiguration is initially managed by the CGS; after this is completed, IDLM (GES/GCS) reconfiguration starts.

Cluster Reconfiguration Steps

The cluster reconfiguration process triggers IMR, and a seven-step process ensures complete reconfiguration (the steps are restated as a short sequence in the sketch after this list):

1. Name service is frozen. The CGS contains an internal database of all the members/instances in the cluster with all their configuration and servicing details. The name service provides a mechanism to address this configuration data in a structured and synchronized manner.
2. Lock database (IDLM) is frozen. The lock database is frozen to prevent processes from obtaining locks on resources that were mastered by the departing/dead instance.
3. Determination of membership, validation, and IMR.
4. Bitmap rebuild takes place, with instance name and uniqueness verification. CGS must synchronize the cluster to be sure that all members get the reconfiguration event and that they all see the same bitmap.
5. Delete all dead-instance entries and republish all newly configured names.
6. Unfreeze and release the name service for use.
7. Hand over reconfiguration to GES/GCS.
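As a reading aid, the ordering of these steps can be restated as a plain sequence; this is only a restatement of the list above, not Oracle's internal state machine (although the kjxgmcs "Setting state to ..." lines in the traces below hint at similar substates):

# Restatement of the seven reconfiguration steps as an ordered sequence.
RECONFIG_STEPS = (
    "freeze name service",
    "freeze lock database (IDLM)",
    "determine membership, validate, run IMR",
    "rebuild bitmap, verify instance name uniqueness, synchronize members",
    "delete dead-instance entries, republish configured names",
    "unfreeze and release name service",
    "hand over reconfiguration to GES/GCS",
)

def run_reconfiguration(do_step):
    # Apply a callback to each step strictly in order.
    for number, step in enumerate(RECONFIG_STEPS, start=1):
        do_step(number, step)

run_reconfiguration(lambda n, s: print("step %d: %s" % (n, s)))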
Now that you know when IMR starts and node evictions take place, let's look at the corresponding messages in the alert log and LMON trace files to get a better picture. (The logs have been edited for brevity. Note that the lines in boldface define the most important steps in IMR and the handoff to other recovery steps in CGS.)

Problem with a Node

Assume a four-node cluster (instances A, B, C, and D), in which instance C has a problem communicating with the other nodes because its private link is down. All other services on this node are assumed to be working normally.

 

 

Alert log on instance C:

ORA-29740: evicted by member 2, group incarnation 6
Thu Jun 30 09:15:59 2005
LMON: terminating instance due to error 29740
Instance terminated by LMON, pid = 692304
… … …

Alert log on instance A:

Thu Jun 30 09:15:59 2005
Communications reconfiguration: instance 2
Evicting instance 3 from cluster
Thu Jun 30 09:16:29 2005
Trace dumping is performing id=[50630091559]
Thu Jun 30 09:16:31 2005
Waiting for instances to leave: 3
Thu Jun 30 09:16:51 2005
Waiting for instances to leave: 3
Thu Jun 30 09:17:04 2005
Reconfiguration started
List of nodes: 0,1,3,
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Thu Jun 30 09:17:04 2005
Reconfiguration started

LMON trace file on instance A:

*** 2005-06-30 09:15:58.262
kjxgrgetresults: Detect reconfig from 1, seq 12, reason 3
kjxgfipccb: msg 0x1113dcfa8, mbo 0x1113dcfa0, type 22, ack 0, ref 0, stat 3
kjxgfipccb: Send timed out, stat 3 inst 2, type 22, tkt (10496,1496)
*** 2005-06-30 09:15:59.070
kjxgrcomerr: Communications reconfig: instance 2 (12,4)
Submitting asynchronized dump request [2]
kjxgfipccb: msg 0x1113d9498, mbo 0x1113d9490, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 2, type 22, tkt (10168,1496)
kjxgfipccb: msg 0x1113e54a8, mbo 0x1113e54a0, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 2, type 22, tkt (9840,1496)

Note that "Send timed out, stat 3 inst 2" is LMON trying to send message(s) to the broken instance.

kjxgrrcfgchk: Initiating reconfig, reason 3    /* IMR Initiated */
*** 2005-06-30 09:16:03.305
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 12 0.
*** 2005-06-30 09:16:03.449
Name Service frozen
kjxgmcs: Setting state to 12 1.
*** 2005-06-30 09:16:11.570
Voting results, upd 1, seq 13, bitmap: 0 1 3

Note that instance A has not tallied the vote; hence it has received only the voting results. Here is an extract from the LMON trace file on instance B, which managed to tally the vote:

Obtained RR update lock for sequence 13, RR seq 13
*** 2005-06-30 09:16:11.570
Voting results, upd 0, seq 13, bitmap: 0 1 3
… …

Here's the LMON trace file on instance A:

Evicting mem 2, stat 0x0007 err 0x0002
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 13 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 13 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 13 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 13 5.
Name Service normal
Name Service recovery done
*** 2005-06-30 09:17:04.369
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 13 0.
*** 2005-06-30 09:17:04.371
Name Service frozen
kjxgmcs: Setting state to 13 1.

GES/GCS recovery starts here:

Global Resource Directory frozen
node 0
node 1
node 3
res_master_weight for node 0 is 632960
res_master_weight for node 1 is 632960
res_master_weight for node 3 is 632960
… … …

Death of a Member

For the same four-node cluster (A, B, C, and D), instance C has died unexpectedly:

kjxgrnbrisalive: (3, 4) not beating, HB: 561027672, 561027672
*** 2005-06-19 00:30:52.018
kjxgrnbrdead: Detected death of 3, initiating reconfig
kjxgrrcfgchk: Initiating reconfig, reason 2
*** 2005-06-19 00:30:57.035
kjxgmrcfg: Reconfiguration started, reason 2
kjxgmcs: Setting state to 6 0.
*** 2005-06-19 00:30:57.037
Name Service frozen
kjxgmcs: Setting state to 6 1.
*** 2005-06-19 00:30:57.239
Obtained RR update lock for sequence 6, RR seq 6
*** 2005-06-19 00:33:27.261
Voting results, upd 0, seq 7, bitmap: 0 2
Evicting mem 3, stat 0x0007 err 0x0001
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 7 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 7 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 7 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 7 5.
Name Service normal
Name Service recovery done
*** 2005-06-19 00:33:27.266
kjxgmps: proposing substate 6
… … …
kjxgmps: proposing substate 2

GES/GCS recovery starts here:

Global Resource Directory frozen
node 0
node 2
res_master_weight for node 0 is 632960
res_master_weight for node 2 is 632960
Total master weight = 1265920
Dead inst 3
Join inst
Exist inst 0 2
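If you read these traces often, it helps to pull out just the IMR markers. The small Python sketch below (an invented helper, not an Oracle-supplied tool) scans an LMON trace file for the reconfiguration reason codes, voting results, and eviction lines of the kind shown above (in these excerpts, reason 2 accompanies a detected instance death and reason 3 a communications failure):

import re

# Invented helper to extract IMR-related markers from an LMON trace file.
PATTERNS = {
    "reconfig": re.compile(r"kjxg(?:rrcfgchk|mrcfg):.*reason (\d+)"),
    "voting":   re.compile(r"Voting results, upd \d+, seq (\d+), bitmap: ([\d ]+)"),
    "evict":    re.compile(r"Evicting (?:mem|instance) (\d+)"),
}

def scan_lmon_trace(path):
    events = []
    with open(path) as trace:
        for line in trace:
            for kind, pattern in PATTERNS.items():
                match = pattern.search(line)
                if match:
                    events.append((kind, match.groups(), line.strip()))
    return events

# Example (the trace file name here is hypothetical):
# for kind, groups, line in scan_lmon_trace("lmon_ora_12345.trc"):
#     print(kind, groups)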

From the ITPUB blog. Link: http://blog.itpub.net/35489/viewspace-688201/. If reproducing, please credit the source; otherwise legal liability may be pursued.
