Oracle案例06—— OGG-01098 Could not flush "./dirdat/e1000004383" (error 28, No space left on device)

Rangle發表於2018-05-03

 一、前言

  自從換了新環境,各種問題層出不窮,如果不是之前積累的經驗豐富,估計都歇菜了,看來作為資料庫全棧工程師(oracle/mysql/sqlserver/sap hana/pg/mongodb/redis)還是有好處的(新環境需要完善的地方很多很多啊...),O(∩_∩)O哈哈~。今天同事找我說有個報表庫4.5號資料就沒有了,問我是不是ogg資料同步有問題。我一臉懵逼,首先4.5號出現的問題,現在都快1個月了才發現說明監控機制不完善,其次業務部門這反映也太滯後了也,出問題就先解決問題唄。

二、問題排查

登入Ogg源庫檢視相關程式:

[oracle@dg dirprm]$ ../ggsci

GGSCI (dg) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPRPT01     00:00:00      666:57:30   
EXTRACT     ABENDED     EXTRPT01    00:00:00      666:57:38   
GGSCI (dg) 2> info EXTRPT01

EXTRACT    EXTRPT01  Last Started 2017-10-27 18:12   Status ABENDED
Checkpoint Lag       00:00:00 (updated 666:58:56 ago)
Log Read Checkpoint  Oracle Redo Logs
                     2018-04-05 23:00:24  Seqno 282927, RBA 832138240
                     SCN 12.2143954677 (53683562229)

通過上面檢查,發現源端的資料抽取程式已經掛起27天左右了,也就是發生在4.5號的23:00,那麼具體是什麼原因導致的這個問題呢?需要通過檢查ogg錯誤日誌

[oracle@dg ogg]$ cd dirrpt/

[oracle@dg dirrpt]$ vi DPRPT010.rpt

2018-04-05 23:00:36  ERROR   OGG-01098  Could not flush "./dirdat/e1000004383" (error 28, No space left on device).
Failed to save data to 'dirdmp/gglog-EXTRPT01.dmp', error 28 - No space left on device

發現具體導致程式掛起的原因是因為磁碟空間不足,導致資料抽取無法寫入trail檔案,檢查磁碟空間發現磁碟目前是充足的,那麼嘗試重啟ext程式

GGSCI (dg) 1>start EXTRPT01
Sending START request to MANAGER ...
EXTRACT EXTRPT01 starting

GGSCI (dg) 3> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPRPT01     00:00:00      667:04:13   
EXTRACT     RUNNING     EXTRPT01    00:00:00      667:04:21   

隔1分鐘後

GGSCI (dg) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPRPT01     00:00:00      667:05:10   
EXTRACT     ABENDED     EXTRPT01    00:00:00      667:05:17   

啟動失敗,再次檢視錯誤日誌DPRPT01.rpt,錯誤資訊如下:

2018-05-03 11:36:52  ERROR   OGG-00446  Could not find archived log for sequence 282927 thread 1 under default destinations SQL <SELECT  name    FROM v$archived_log   WHERE sequence# = :ora_seq_no AND         thread# = :ora_thread AND         resetlogs_id = :ora_resetlog_id AND         archived = 'YES' AND         deleted = 'NO' >, error retrieving redo file name for sequence 282927, archived = 1, use_alternate = 0Not able to establish initial position for sequence 282927, rba 790067728.

2018-05-03 11:36:52  ERROR   OGG-01668  PROCESS ABENDING.

這裡可以看到抽取程式讀取源庫的歸檔日誌的時候無法找到相應歸檔日誌(估計已被清理)

col name for a55;
set line 200;
set pagesize 20000;

select sequence#,name,COMPLETION_TIME,STATUS from v$archived_log  where sequence#>=282926  and rownum<=30;

經過確認,發現確實自282927以後到5.2日前的歸檔日誌已經被刪除,至此可以確認問題導致的原因和現狀情況。

錯誤原因是:磁碟空間不足導致抽取程式掛起,然後隔了好久沒有恢復OGG資料同步,資料來源的歸檔日誌又被清理掉,所以無法通過啟動抽取程式完成恢復OGG資料同步。

三、解決方案

1、可以通過從備份恢復歸檔日誌,完成ogg資料同步直接。(鑑於每天歸檔日誌大約80G左右,一個月的歸檔資料恢復較大量以及資料同步依舊需要較大資料量,所以不採用此方式)

2、通過重新部署OGG主從同步程式,完成OGG資料同步,經過檢查發現需要同步的表有11張,資料量最大的也就6千萬左右資料,同步速度比較快。

select count(1) from   testuser.t_t1; --      1163
select count(1) from   testuser.t_t2;  --   3794574
select count(1) from   testuser.t_t3; --14461070
select count(1) from   testuser.t_t4; --    135962
select count(1) from   testuser.t_t5; --3331344
select count(1) from   testuser.t_t6;  --   5961455
select count(1) from   testuser.t_t7;  --131280
select count(1) from   testuser.t_t8; --7459898
select count(1) from   testuser.t_t9  --8698
select count(1) from   testuser.t_t10;  --62504749
select count(1) from   testuser.t_t11; --11581710

3、完成資料同步恢復後,需要完善資料庫狀態監控,包含不僅限於DG主備裝填、OGG程式狀態、例項狀態等

4、重新部署過程略。

1、停止源庫和目標庫ogg程式
1)源庫

GGSCI (source_userdg) 7> stop extrpt01
EXTRACT EXTRPT01 is already stopped.


GGSCI (source_userdg) 9> stop DPRPT01 

Sending STOP request to EXTRACT DPRPT01 ...
STOP request pending end-of-transaction (4 records so far)..

GGSCI (source_userdg) 24> kill extract DPRPT01 

Sending KILL request to MANAGER ...
Killed process (19665) for EXTRACT DPRPT01

2)目標庫

GGSCI (source_user_report) 2> stop REPRPT01
REPLICAT REPRPT01 is already stopped.

2、刪除原配置檔案
1)源庫
GGSCI (source_userdg) 26> delete EXTRPT01

2018-05-07 10:16:39  WARNING OGG-01753  Cannot unregister EXTRACT EXTRPT01 from database because no database login was provided. You can manually unregister this group later with the UNREGISTER EXTRACT command with LOGRETENTION. Issue DBLOGIN first.
Deleted EXTRACT EXTRPT01.


GGSCI (source_userdg) 27> delete DPRPT01
Deleted EXTRACT DPRPT01.

2)目標庫
GGSCI (source_user_report) 5> delete REPRPT01
ERROR: Could not delete DB checkpoint for REPLICAT REPRPT01 (Database login required to delete database checkpoint)

GGSCI (source_user_report) 6> dblogin useridalias gguser 
Successfully logged into database.

GGSCI (source_user_report as gguser@source_dbname) 7> delete REPRPT01
Deleted REPLICAT REPRPT01.

3、清理日誌和歷史檔案
1)源庫
[oracle@source_userdg ogg]$ rm -rf dirdat/*
[oracle@source_userdg ogg]$ rm -rf dirrpt/EXT*
[oracle@source_userdg ogg]$ rm -rf dirrpt/PUMP*
[oracle@source_userdg ogg]$ rm -rf dirrpt/ext01*
[oracle@source_userdg ogg]$ rm -rf dirrpt/DPR*  
[oracle@source_userdg ogg]$ rm -rf dirrpt/extrpt01.dsc 
[oracle@source_userdg ogg]$ rm -rf dirrpt/MGR{0,1,2}*
2)目標庫
[oracle@source_user_report ogg]$ rm -rf dirdat/r1000004383
[oracle@source_user_report ogg]$ rm -rf  dirrpt/REPRPT01*
[oracle@source_user_report ogg]$ rm -rf dirrpt/rep*
[oracle@source_user_report ogg]$ rm -rf dirrpt/MGR0.rpt 
3)目標庫表資料清理

alter session set current_schema=gguser;

truncate table xxx; 

4、重新配置

1)源庫
extract程式配置
GGSCI (source_userdg as gguser@source_user) 4> add extract extrpt01,tranlog,begin now
EXTRACT added.

GGSCI (source_userdg as gguser@source_user) 5> add exttrail ./dirdat/ex,extract extrpt01 MEGABYTES 100
EXTTRAIL added.

pump程式配置
GGSCI (source_userdg as gguser@source_user) 6> add extract DPRPT01,exttrailsource ./dirdat/ex
EXTRACT added.

GGSCI (source_userdg as gguser@source_user) 7> add rmttrail  ./dirdat/rt,extract DPRPT01
RMTTRAIL added.

2)目標庫

GGSCI (source_user_report) 1> dblogin useridalias gguser
Successfully logged into database.
GGSCI (source_user_report as gguser@source_dbname) 3> add checkpointtable gguser.checkpointtab

Successfully created checkpoint table gguser.checkpointtab.

GGSCI (source_user_report as gguser@source_dbname) 4> add replicat REPRPT01,exttrail   ./dirdat/rt,checkpointtable gguser.checkpointtab
REPLICAT added.

3)啟動源庫程式
GGSCI (source_userdg) 6> start EXTRPT01 

Sending START request to MANAGER ...
EXTRACT EXTRPT01 starting


GGSCI (source_userdg) 7> start DPRPT01

Sending START request to MANAGER ...
EXTRACT DPRPT01 starting

GGSCI (source_userdg) 8> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPRPT01     00:00:00      00:00:47    
EXTRACT     RUNNING     EXTRPT01    00:00:53      00:00:03   

4)源庫資料匯出

select * from dba_directories; 
DATA_PUMP_DIR
/U01/oracle/admin/source_user/dpdump/

col  CURRENT_SCN  format 999999999999999999999999;
set line 200;
set pagesize 20000;

select current_scn from v$database;
55011413592
##select dbms_flashback.get_system_change_number current_scn from dual;
##select timestamp_to_scn(to_timestamp('2018-05-06 08:05:14')) from dual;
##select to_char(scn_to_timestamp(55011413592),'yyyy-mm-dd hh24:mi:ss') from dual;

expdp \'/ as sysdba\' directory=DATA_PUMP_DIR file=20180507.dmp tables=source_user.t1,source_user.t2,source_user.t3 grants=n statistics=none triggers=n compress=n content=data_only FLASHBACK_SCN=55011413592 log=expdp.log

5)目標庫資料匯入
select * from dba_directories; 
DATA_PUMP_DIR    
/U01/oracle/admin/source_dbname/dpdump/
impdp \'/ as sysdba\' DIRECTORY=DATA_PUMP_DIR  DUMPFILE=20180507.dmp    REMAP_SCHEMA=source_user:gguser content=data_only LOGFILE=impdp.log               

6)啟動目標庫程式
start REPRPT01,aftercsn 55011413592

GGSCI (source_user_report) 17> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     REPRPT01    02:43:08      00:00:04 

5、資料校驗

 

相關文章