刪除重復資料

背景：
某定時應用程式（每天02：00啟動）會讀取指定目錄下的所有txt扁平資料檔案，並將資料儲存到ORACLE資料庫。本來應用程式將txt檔案中資料入庫後，會將目錄中的txt資料檔案備份到另外的目錄中，但是應用程式有BUG導致備份失敗。應用程式初始執行時需要對1億全量的資料入庫，接連執行8天，因應有程式有BUG導致資料重複入庫8次，結果資料庫表中有7億重複資料。之所以把應用程式做成定時，是因為每天有100萬增量資料需要入庫。

表T_TEST結構：
col_id1           NUMBER(11)
col_id2           NUMBER(5)
col_3             VARCHAR2(32)
col_4             NUMBER(10)
col_5             VARCHAR2(256)
updatetime        timestamp

注：通過col_id1，col_id2欄位可以判斷記錄是否重複，updatetime為記錄更新時間

問題：
    最後更新的記錄保留，將其它重複的記錄刪除

解決方案：
1.通過create table ... as select將不重複的記錄重建成表T_TEST_1
create table T_TEST_1 nologging tablespace &tablespace_name as
       select col_id1, col_id2, col_3, col_4, col_5
              from (select col_id1,
                           col_id2,
                           col_3,
                           col_4,
                          col_5,
                           updatetime,
                           row_number() over(partition by col_id1, col_id2 order by updatetime desc) rn
                        from T_TEST)
      where rn = 1;
2.對新表重建索引，原表有多少索引，在新表上也重建多少索引
create index IND_T_TEST_1 on T_TEST_1(col_id1, col_id2)
nologging tablespace &ind_tablespace_name;
3.收集新表統計資訊，確保SELECT查詢採用正確高效率的執行計劃
declare
BEGIN
dbms_stats.gather_table_stats(ownname          => '&user',
                                tabname          => 'T_TEST_1',
                                estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
                                cascade          => true,
                                method_opt       => 'FOR ALL COLUMNS SIZE 1',
                                granularity      => 'all');
END;
/
4.將新表和新索引更改為日誌方式
alter table T_TEST_1 logging;
alter index IND_T_TEST_1 logging;
5.備份舊錶，將新表切換上線
alter table T_TEST rename to T_TEST_BAK0902;
alter table T_TEST_1 rename to T_TEST;

不建議方案：
   不建議直接在原表T_TEST上做DELETE操作

來自 “ ITPUB部落格 ” ，連結：http://blog.itpub.net/24867586/viewspace-713711/，如需轉載，請註明出處，否則將追究法律責任。

刪除重復資料

相關文章