最近客戶1套OGG系統，抽取程式經常報OGG-01028錯誤，該錯誤一般是由於OGG Bug 10356426引起的，日誌如下：

ERROR OGG-01028 Oracle GoldenGate Capture for Oracle, extoa.prm: encountered commit SCN 2561.1344492794 (11000755737850) that is not greater than the highest SCN already processed 2561.1344493093 (11000755738149) Redo Thread 1 (1) xid 78.31.31562 (0x004e.001f.00007b4a), starting seq.rba 12835.396542992, scn 2561.1344492791 (11000755737847), commit seq.rba 12835.396544448 commit timestamp 2013-03-28 22:26:20.000000.
2013-03-28 22:26:22 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, extoa.prm: PROCESS ABENDING.

2013-03-28 22:27:22 WARNING OGG-01475 Oracle GoldenGate Manager for Oracle, mgr.prm: Cannot automatically restart EXTRACT EXTOA, which abended due to an out of order transaction. Issue ETROLLOVER to advance the output trail sequence past the current trail sequence and restart. Then, use ALTER EXTSEQNO on the subsequent pump EXTRACT, or REPLICAT, process group to start reading from the new trail file created by ALTER ETROLLOVER; the downstream process will not automatically switch to the new trail file.
2013-03-28 22:28:22 WARNING OGG-01475 Oracle GoldenGate Manager for Oracle, mgr.prm: Cannot automatically restart EXTRACT EXTOA, which abended due to an out of order transaction. Issue ETROLLOVER to advance the output trail sequence past the current trail sequence and restart. Then, use ALTER EXTSEQNO on the subsequent pump EXTRACT, or REPLICAT, process group to start reading from the new trail file created by ALTER ETROLLOVER; the downstream process will not automatically switch to the new trail file.
2013-03-28 22:29:13 INFO OGG-01021 Oracle GoldenGate Capture for Oracle, dpeoa.prm: Command received from GGSCI: STATS daily totalsonly *.
參考 MOS ID 957112.1文章

處理步驟如下
1.檢查2個節點的SCN 情況
select thread#,LAST_REDO_CHANGE# from v$thread;

2.檢查2個節點的OS TIME 是否一致
SQL> alter session set nls_date_format='YYYY-MM-DD hh24:mi:ss';

Session altered.

SQL> select sysdate,dbtimezone from dual;
node1$ date
node2$ date

3.檢查2個節點的redo genration 是否有很大差距
select inst_id,round(sum(blocks*block_size/(1024*1024*1024)))
from gv$archived_log where completion_time> sysdate-7
group by inst_id;

4.透過2個節點的AWR報告或SQL檢視7天內的log file parallel write及log file sync延遲時間
--在node1 執行
select 1 inst_id,avg(avg_ms)
from (
select
      round((time_ms_end-time_ms_beg)/nullif(count_end-count_beg,0),1) avg_ms
from (
select
       s.dbid,
       to_char(s.BEGIN_INTERVAL_TIME,'DD-MON-YY HH24:MI') btime,
       total_waits count_end,
       time_waited_micro/1000 time_ms_end,
       Lag (e.time_waited_micro/1000)
              OVER( PARTITION BY e.event_name ORDER BY s.snap_id) time_ms_beg,
       Lag (e.total_waits)
              OVER( PARTITION BY e.event_name ORDER BY s.snap_id) count_beg
from
       DBA_HIST_SYSTEM_EVENT e,
       DBA_HIST_SNAPSHOT s
where
       s.BEGIN_INTERVAL_TIME    and    s.snap_id=e.snap_id
   and e.event_name in (
                  'log file sync',
                  'log file parallel write'
                )
   and s.dbid=e.dbid
)
order by btime
)
/
--在node2 再執行一次
select 2 inst_id,avg(avg_ms)
();

5.修改extract 引數檔案

THREADOPTIONS MAXCOMMITPROPAGATIONDELAY 90000 IOLATENCY 180000
--MAXCOMMITPROPAGATIONDELAY範圍為0-90000 預設為3000ms (3s)
--IOLATENCY最大為180000 預設為1500ms (1.5s)

透過以上5個步驟，確認排除故障隱患。

6.具體操作Solution Details:
6.1. Do an ETROLLOVER on Extract, and take note of the new sequence number of the trail file.
操作之前在源端檢視最新的trail file號
ls -lt /goldengate/dirdat/oaxxxx，那麼這裡需要記錄的應該是 xxxx+1，即ALTER EXTRACT extoa, ETROLLOVER後生成的。

ALTER EXTRACT extoa, ETROLLOVER
info extoa detail
info dpeoa detail

6.2. Start extract

START EXTRACT extoa

6.3. Send PUMP, LOGEND, to see if it's at the end of the previous trail.

SEND EXTRACT DPEOA, LOGEND

6.4. Once it is at the end of the trail file, You must stop the pump, and do an ETROLLOVER for it too. Take note of the new trail file sequence number that is created from this step.
操作之前在再次在源端檢視最新的trail file號/goldengate/dirdat/oaxxxx，並在操作以下步驟後記錄最新的sequence number
STOP EXTRACT DPEOA
ALTER EXTRACT DPEOA, ETROLLOVER

6.5. Alter the pump to SEQNO to the new trail file created from step #1.
這裡的#####取6.1 記錄的xxxx+1

ALTER EXTRACT DPEOA, EXTSEQNO ##### EXTRBA 0

6.6. Restart pump

START EXTRACT DPEOA

6.7. Send Replicat, LOGEND to make sure it has processed all the remaining data, and stop Replicat.
SEND REPLICAT REPOA, LOGEND ----提示yes時，執行下一步
STOP REPLICAT REPOA

6.8. If replicat is not at end of trail, generate a report and forcestop replicat
SEND REPLICAT REPOA, REPORT
STOP REPLICAT REPOA!

6.9. Add the following parameters to replicat parameter file to allow replicat to process each trail record as a single transaction, or set them to 1 if you have any of these parameters:
GROUPTRANSOPS 1
MAXTRANSOPS 1

6.10. Restart replicat

START REPLICAT REPOA

6.11. Once replicat has completely processed the trail, stop the replicat
STOP REPLICAT REPOA

6.12. Edit the replicat parameter file:
- Add parameter HANDLECOLLISIONS to Replicat parameter file
- Remove or comment out GROUPTRANSOPS and MAXTRANSOPS or revert them back to their original values.

Note: There are pre-conditions for using HANDLECOLLISIONS . Ther emust be either

a. no pkupdates

or
b. extract has "FETCHOPTIONS FETCHPKUPDATECOLS"

Also all the tables should have Primary key or unique index on the table to avoid data integrity issues when using handlecollisions.
要確保所有的同步的表均有主鍵或者唯一索引才能避免資料完整性

6.13. ALTER REPLICAT, SEQNO to the new trail file created in step #4.

ALTER REPLICAT REPOA, EXTSEQNO ###### EXTRBA 0

6.14. Start Replicat

START REPLICAT REPOA

6.15. Once Replicat has processed the out of order SCN operations, disable HANDLECOLLISIONS. You could also look for the CSN and wait for Replicat to checkpoint past it.
一但replicat處理完無序的SCN操作後就可以取消掉這個引數了。
SEND REPLICAT REPOA, NOHANDLECOLLISIONS.

6.16.Edit the replicat parameter and comment out the HANDLECOLLISIONS parameter. You do not need to stop/restart replicat. This will ensure that on any subsequent replicat restarts the parameter is disabled.

Note:
If the out of order transactions happen to the SAME record on the SAME table, then the Replicat will probably abend in step 7 or step 10. If that is the case, look at the discard file find out that particular record (the primary key value). Then add handlecollisions, continue with the steps below. Later, once the Replicat caught up, that particular record needs to be manually synced.
However the chance of an out of order transactions happening to the SAME record on SAME table is rare. Given an example, this will mean that the user application insert record A from node 1, then immediately it updated this record A on node 2. Out of order transactions usually happens to different tables or different records on same table.
In other words, if you do see collisions been handled in this recovery procedure, that means you have to re-sync ONLY those particular records that have collisions.

ERROR OGG-01028

相關文章