OGG EXTRACT / REPLICAT CHECKPOINT RBA IS LARGER THAN LOCAL TRAIL SIZE

Posted by xiexingzhi on 2012-09-03
Applies to:

Oracle GoldenGate - Version: 10.0.0.0 and later   [Release: 10.0.0 and later ]
Information in this document applies to any platform.
Symptoms

After a system outage, a pump Extract or Replicat may get stuck on a trail file even though more trail files are available on the machine for the reader to process. This issue occurs when the pump Extract / Replicat checkpoint RBA is larger than the local trail file size.
Cause

In general, Datapump Extracts and Replicats read the current trail file data from the disk cache instead of from the physical file. There is therefore a chance that the Datapump or Replicat will checkpoint an RBA that is still only in cache. If there is a disk or system outage, the data in the cache may be lost because the system does not have a chance to flush it to disk.

On restart of the writer process (the archive/redo log reading Extract or a data pump), the writer finds that the trail file size is smaller than the checkpoint RBA of the reader (Datapump or Replicat). The Datapump or Replicat hangs, waiting for more data to arrive. The writer then goes through the Full Audit Recovery (FAR) process: it recovers the current trail file, closes the existing trail file, and creates a new trail file with the next sequence number. The writer then continues writing to the new trail file, and puts into it the "old data" that had already been processed or committed by the reader. This old data was originally in the disk cache and was not written to the disk file before the system outage and restart. However, part of it had already been processed by the reader (Datapump or Replicat) before it died.

Solution

In the scenario described above, records already processed by the reader (Datapump or Replicat) may be written to the new trail file. Manual intervention is required to avoid duplicate processing.

The manual recovery process requires finding the records in seqno X+1 (assuming the current seqno is X) that will be duplicated from the reader's point of view. Their total length is (reader's too-big checkpoint) - (actual size of seqno X). The reader should be altered to a point in the trail file following the one with the too-large checkpoint reference, namely to (RBA of the record just after the RestartAbend record) + (total length of the duplicated records).

The calculated RBA to which the reader is altered should be the address of a record that starts a transaction. The start of a transaction is indicated by a TransInd value of x00 (first record in the transaction) or x03 (only record in the transaction).
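The arithmetic behind the manual recovery can be sketched in a few lines of Python. This is only an illustration of the formula described above, not part of OGG: the function names `restart_rba` and `starts_transaction` are hypothetical, and the inputs are assumed to have been read by hand from GGSCI, the file system, and Logdump.

```python
# TransInd values that this note lists as marking the start of a transaction.
TRANSACTION_START = {0x00, 0x03}


def restart_rba(checkpoint_rba, trail_size, first_record_rba):
    """Return the RBA in seqno X+1 to which the reader should be altered.

    checkpoint_rba   -- the reader's too-big checkpoint RBA in seqno X
    trail_size       -- the actual size in bytes of seqno X on disk
    first_record_rba -- RBA of the record just after RestartAbend in seqno X+1
    """
    if checkpoint_rba <= trail_size:
        raise ValueError("checkpoint is not beyond the end of the trail file")
    duplicated_bytes = checkpoint_rba - trail_size
    return first_record_rba + duplicated_bytes


def starts_transaction(trans_ind):
    """True if a record's TransInd value marks the start of a transaction."""
    return trans_ind in TRANSACTION_START
```

With a checkpoint RBA of 53966725, a trail size of 53966568, and a first record at RBA 823, `restart_rba` returns 980. The result is only usable if the record at that RBA passes the `starts_transaction` check.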

 
A live example that illustrates this scenario:


GGSCI (ORACLEREP) 6> info repdb 

REPLICAT REPDB Last Started 2010-06-17 18:25 Status RUNNING 
Checkpoint Lag 00:00:00 (updated 00:00:04 ago) 
Log Read Checkpoint File ./dirdat/rp000012 
First Record RBA 53966725 



2. sh>ls -tlr dirdat 

total 3822488 

-rw-rw-rw- 1 ggs dba 299999891 Jun 16 22:09 rp000011 
-rw-rw-rw- 1 ggs dba 53966568 Jun 17 18:15 rp000012   << where the checkpoint is pointing
-rw-rw-rw- 1 ggs dba 256319510 Jun 17 18:51 rp000013 << the next available trail file

------------------------------------------------------------------------------------------------------------------------------------------------ 

The actual trail size is 53966568, whereas the Replicat RBA is at 53966725.
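The comparison above (checkpoint RBA versus the size of the file on disk) can also be done with a small script. The following is a minimal sketch, assuming the checkpoint RBA has been copied by hand from the GGSCI `info` output; the helper name `checkpoint_overshoot` is hypothetical and is not an OGG utility.

```python
import os


def checkpoint_overshoot(trail_path, checkpoint_rba):
    """Return how many bytes the checkpoint RBA lies beyond the end of the
    trail file, or 0 if the checkpoint is within the file."""
    size = os.path.getsize(trail_path)
    return max(0, checkpoint_rba - size)
```

A non-zero result means the reader's checkpoint points past the data that actually reached the disk, which is the condition described in this note.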

Current LogTrail is /home/pjacob/rp000013

Logdump 101 >n 


2010/06/17 11:15:13.810.439 FileHeader Len 753 RBA 0 

Name: *FileHeader* 

3000 0199 3000 0008 4747 0d0a 544c 0a0d 3100 0002 | 0...0...GG..TL..1... 
0002 3200 0004 ffff ffff 3300 0008 02f1 af71 469a | ..2.......3......qF. 
5c07 3400 001d 001b 7572 693a 4f52 4143 4c45 2d30 | \.4.....uri:ORACLE-0 
312d 5241 433a 3a68 6f6d 653a 6767 7336 0000 1300 | 1-RAC::home:ggs6.... 
112e 2f64 6972 6461 742f 7270 3030 3030 3133 3700 | ../dirdat/rp0000137. 
0001 0138 0000 0400 0000 0d39 0000 0800 0000 0011 | ...8.......9........ 
e19d 9c3a 0000 8109 3438 3535 3038 3736 3600 0000 | ...:....485508766... 

Logdump 102 >n 

___________________________________________________________________ 

Hdr-Ind : E (x45) Partition : . (x00) 
UndoFlag : . (x00) BeforeAfter: A (x41) 
RecLength : 0 (x0000) IO Time : 2010/06/17 11:15:13.761.084 
IOType : 150 (x96) OrigNode : 0 (x00) 
TransInd : . (x03) FormatType : R (x52) 
SyskeyLen : 0 (x00) Incomplete : . (x00) 
AuditRBA : 0 AuditPos : 0 
Continued : N (x00) RecCount : 0 (x00) 

2010/06/17 11:15:13.761.084 RestartAbend Len 0 RBA 761 

Name: 
After Image: Partition 0 G s 

Logdump 103 >n 
___________________________________________________________________ 

Hdr-Ind : E (x45) Partition : . (x04) 
UndoFlag : . (x00) BeforeAfter: A (x41) 
RecLength : 28 (x001c) IO Time : 2010/06/17 03:00:26.000.150 
IOType : 15 (x0f) OrigNode : 255 (xff) 
TransInd : . (x03) FormatType : R (x52) 
SyskeyLen : 0 (x00) Incomplete : . (x00) 
AuditRBA : 685 AuditPos : 181997072 
Continued : N (x00) RecCount : 1 (x01) 


2010/06/17 03:00:26.000.150 FieldComp Len 28 RBA 823 

Name: SCHEMA.XXXX 

After Image: Partition 4 G s 

0000 000a 0000 0006 4152 532d 3034 0002 000a 0000 | ........ARS-04...... 
0000 0000 0000 4ab0 | ......J. 
Column 0 (x0000), Len 10 (x000a) 
Column 2 (x0002), Len 10 (x000a) 

Logdump 112 >n 

___________________________________________________________________ 

Hdr-Ind : E (x45) Partition : . (x04) 
UndoFlag : . (x00) BeforeAfter: A (x41) 
RecLength : 28 (x001c) IO Time : 2010/06/17 03:00:35.000.150 
IOType : 15 (x0f) OrigNode : 255 (xff) 
TransInd : . (x03) FormatType : R (x52) 
SyskeyLen : 0 (x00) Incomplete : . (x00) 
AuditRBA : 654 AuditPos : 225996876 
Continued : N (x00) RecCount : 1 (x01) 

2010/06/17 03:00:35.000.150 FieldComp Len 28 RBA 980 

Name: SCHEMA.XXXYY


After Image: Partition 4 G s 
0000 000a 0000 0006 4152 532d 3032 0002 000a 0000 | ........ARS-02...... 
0000 0000 0000 4ab9 | ......J. 

-----------------------------------------------------------------------------------------------------------

The actual trail size is 53966568 while the replicat checkpoint rba is at 53966725 

(reader's too-big checkpoint) - (actual size of seqno X)

The extra byte count is 53966725 - 53966568 = 157 

The calculated RBA is the RBA of the record just after the RestartAbend record (823) + the total length of the duplicated records (157) = 980

The Replicat should be altered to trail file sequence number 13 and RBA 980 by using the following command:

GGSCI> alter rep <rep name>, extseqno 13, extrba 980


Note: If the X+1 trail file does not contain any actual data, then you need to do the same for X+2 trail file and so on.

 
Note: If you just do an alter to the next sequence number (without following the above procedure), you can create data integrity issues.

Note: If you are using an OGG build greater than or equal to 10.4.0.81, the Datapump Extract / Replicat will abend if the read checkpoint is beyond the current EOF. You can then use the above procedure to get the Datapump or Replicat running.

The issue is tracked via bug 9669344, and development is working on a solution.

NOTE: If, after the calculation, you get an RBA in the new trail file that does not point to the start of a record, please contact Support for further help.
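One way to check whether a calculated RBA falls on the start of a record is to compare it against the record summary lines that Logdump prints (for example "FieldComp Len 28 RBA 823"). The sketch below assumes the Logdump output has been saved as text and that the summary lines have the same layout as in the example in this note; the function names and the regular expression are illustrative only.

```python
import re

# Matches Logdump record summary lines such as
# "2010/06/17 03:00:26.000.150 FieldComp Len 28 RBA 823"
SUMMARY = re.compile(r"^\S+\s+\S+\s+(\w+)\s+Len\s+(\d+)\s+RBA\s+(\d+)\s*$")


def record_starts(logdump_text):
    """Return (record type, RBA) pairs in the order they appear."""
    found = []
    for line in logdump_text.splitlines():
        match = SUMMARY.match(line.strip())
        if match:
            found.append((match.group(1), int(match.group(3))))
    return found


def is_record_start(logdump_text, rba):
    """True if some record in the saved output begins at the given RBA."""
    return any(start == rba for _, start in record_starts(logdump_text))


# Summary lines copied from the example in this note.
sample = """\
2010/06/17 11:15:13.810.439 FileHeader Len 753 RBA 0
2010/06/17 11:15:13.761.084 RestartAbend Len 0 RBA 761
2010/06/17 03:00:26.000.150 FieldComp Len 28 RBA 823
2010/06/17 03:00:35.000.150 FieldComp Len 28 RBA 980
"""
```

For the sample, `is_record_start(sample, 980)` is true, so the calculated RBA is usable. An RBA that is not found among the record starts is the case in which Support should be contacted.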

Special Case Reported

*********************

A case has been reported in which a source server crash results in the target Replicat hitting the issue described above. In this case, the source pump Extract's write RBA is also larger than the remote trail file size. This results in there being no next trail file available.
The steps to get the pump started are:

1. Get the last record in the remote trail to which the pump extract is writing.
2. Find the corresponding record in the local trail, and get the rba of the next record
3. Make a backup of checkpoint files (./dirchk/pump-name.cpe) of all pumps. 
4. Position the pump to the RBA obtained from step 2, and do an ETROLLOVER.
For example:
alter <pump ext name>, extseqno <the sequence number to which the pump is currently pointing>, extrba <RBA obtained from step 2>
alter <pump ext name>, etrollover
Keep in mind that the respective Replicat processes must be altered manually because of the pump ETROLLOVER.
5. Start the pump.
6. The steps for the Replicat are the same as previously described in this note.
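The backup in step 3 can be scripted. The following is a minimal sketch, assuming the checkpoint files live in ./dirchk with a .cpe extension as stated in step 3; the function name `backup_checkpoints` and the choice of backup directory are hypothetical. Note that the `*.cpe` pattern copies every Extract checkpoint file in the directory, not only those of the pumps.

```python
import shutil
from pathlib import Path


def backup_checkpoints(dirchk, backup_dir):
    """Copy every *.cpe checkpoint file from dirchk into backup_dir.

    Returns the sorted list of file names that were copied.
    """
    src = Path(dirchk)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for cpe in sorted(src.glob("*.cpe")):
        # copy2 also preserves timestamps, which helps when comparing later
        shutil.copy2(cpe, dst / cpe.name)
        copied.append(cpe.name)
    return copied
```

Running it before altering any pump, with the processes stopped, keeps a copy of the checkpoints to fall back on if the repositioning has to be redone.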

References

BUG:9673276 - REPLICAT WAITING FOR MORE DATA
BUG:9857982 - DATAPUMP EXTRACT SHOULD ABEND IF THE READ CHECKPOINT IS BEYOND CURRENT EOF

Source: ITPUB blog, link: http://blog.itpub.net/22531473/viewspace-742548/. If you reproduce this article, please credit the source; otherwise legal liability will be pursued.