Best Practices for failover during server failures [ID 1323472.1]

Posted by msdnchina on 2011-11-15

Applies to:

Oracle GoldenGate - Version: 10.0.0.0 to 11.1.1.0.0 - Release: 10.0.0 to 11.1.1
Information in this document applies to any platform.

Goal


This document attempts to explain the following high availability scenario and provides best practice suggestions:

Simplified scenario:

Machine A is primary, B is standby and C is replicated via GoldenGate to A.

- The Extract process runs on C
- The Datapump process runs on C
- The Replicat process runs on A

The datapump process on C always has a trail file on A.
Therefore, when a failover from A to B happens, the trail file that the Datapump on C writes to is no longer there, and the Datapump hangs.

Even if GoldenGate is restarted on B, the original trail from A is no longer available.

How to address this situation?
Can we simply recopy the trail file from A to B and then start GoldenGate?
What if machine A is no longer accessible, so the trail file is lost?

Solution


Can we simply recopy the trail file from A to B and then start GoldenGate? Yes.

In fact, some customers who do all their processing at night write all their trails on the local machine, then after hours they zip them and ftp them to the target. They unzip on the target and start the replicat at the appropriate spot.

Writing the trail locally first is always best. The trail on A can be lost if A itself is lost, but if you have used minkeep on C then the local trail is still there and can be moved anywhere else.

We also suggest that clients write their first trails locally and have enough space for a minkeep of 4 days. That way, if you lose the connection or the target node over a 3-day weekend, there is still one day left to move the files elsewhere, and nothing is lost.
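
As a minimal sketch of the four-day retention described above, the Manager parameter file on C could carry a purge rule like the one below. This assumes that "minkeep" refers to the MINKEEPDAYS option of PURGEOLDEXTRACTS. The port number and the ./dirdat/* trail location are placeholders, not values taken from this note.

```
-- mgr.prm on C (hypothetical port and trail path)
PORT 7809
-- Purge local trail files only after every reader has checkpointed past them,
-- and never before they are at least 4 days old.
PURGEOLDEXTRACTS ./dirdat/*, USECHECKPOINTS, MINKEEPDAYS 4
```

Disk space on C then has to be sized for at least four days of trail data.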

For more information on the minkeep parameter, see the following note: (Doc ID 1272645.1). Take a look at Maintaining the OGG Marker table.

Here is a more detailed step-by-step procedure:

Shut down A (assuming A is no longer accessible).

1- On C, run ggsci. Find which trail the datapump on C was writing to A:

ggsci> info exttrail *, detail

2- Zip the trail, ftp it to B, and unzip it, possibly replacing an existing file of the same name.

(The copied file should be more complete, and it is not at risk of having been damaged by whatever caused the abort.)

3- Add HANDLECOLLISIONS to the replicat parameter file.

4- Alter the replicat to read from RBA 0 of the trail you just copied over, then let the replicat process that trail to the end.

5- Then, on C, stop the datapump extract. Add a new rmttrail to the datapump extract parameters, to be written to B. Comment out the old trail.

6- Change the rmthost to point to B.

7- Run an add rmttrail command in ggsci to associate the datapump extract on C with the trail to be written on B.

8- Start the datapump extract on C. It is now writing the new trail name, extseqno 0, extrba 0, to B. On B, stop the replicat.

9- On B, alter the replicat to read from the newly written trail name at extseqno 0, extrba 0, and start it.

10- After a few minutes of running, stop the replicat.

11- Remove HANDLECOLLISIONS and restart the replicat.
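
As a rough illustration only, the GoldenGate part of the steps above might look like the following in ggsci. The group names (pmpc for the datapump on C, repb for the replicat on B), the trail names (./dirdat/ra for the copied trail, ./dirdat/rb for the new one), the host name hostB, the port 7809 and the sequence number 12 are all hypothetical placeholders, not values from this note. Lines starting with -- are annotations, not commands. Check each command against the reference guide for your release before relying on it.

```
-- On B: add HANDLECOLLISIONS to repb.prm (before the MAP statements), then
-- position the replicat at the start of the trail file copied over from C.
ggsci> alter replicat repb, extseqno 12, extrba 0
ggsci> start replicat repb
-- Repeat until repb has reached the end of the copied trail.
ggsci> info replicat repb

-- On C: stop the datapump, then edit pmpc.prm so that it contains
--   RMTHOST hostB, MGRPORT 7809
--   RMTTRAIL ./dirdat/rb
-- with the old RMTHOST and RMTTRAIL lines commented out.
ggsci> stop extract pmpc
ggsci> add rmttrail ./dirdat/rb, extract pmpc
ggsci> start extract pmpc

-- On B: switch the replicat to the new trail, starting at its beginning.
ggsci> stop replicat repb
ggsci> alter replicat repb, exttrail ./dirdat/rb
ggsci> alter replicat repb, extseqno 0, extrba 0
ggsci> start replicat repb

-- After a few minutes, stop the replicat, remove HANDLECOLLISIONS from
-- repb.prm, and start it again.
ggsci> stop replicat repb
ggsci> start replicat repb
```

The sketch assumes that a replicat group already exists on B, and that alter replicat accepts the exttrail option in the same way add replicat does.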

This presumes that once A is down, it's down. If you can still run ggsci on A, things are simpler because you can check the replicat's checkpoints.
This would allow you to be sure about where to restart.
If B holds replicat checkpoints that were current on A when the abend happened, you can just replace the existing trail file on B with the zipped one from C. The one on C might have more data that would otherwise be missed.

This also assumes that replicat runs current or near current. If there is a long lag, you may need another trail file that was not processed.
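
If ggsci is still reachable on the node that holds the replicat checkpoints, the checkpoint position and the lag can be inspected along the lines below. This is a sketch only, and the group name repb is a hypothetical placeholder.

```
-- Show the replicat's checkpoints, including the trail sequence number and
-- RBA it last read from.
ggsci> info replicat repb, showch

-- Show how far the replicat is running behind.
ggsci> lag replicat repb
```

The sequence number and RBA reported there indicate which trail files still need to be processed, and therefore which files have to be copied from C.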

This procedure will work, but it depends on how much you know about the state of the replicat lag and checkpoint. If you understand how these things work, you can fail over safely. Otherwise, in a production scenario, get help from OGG support so they can make sure that no data is missed and that as little data as possible is reprocessed.


Source: ITPUB Blog, link: http://blog.itpub.net/161195/viewspace-1056336/. If you repost this article, please credit the source; otherwise legal action may be pursued.
