PostgreSQL資料去重大法

德哥發表於2017-06-03

SQL

背景

去重的需求比較常見，去重也可以衍生出很多變種。例如

1. 單列去重，很好理解，就是按某列去除重複記錄。保留規則（例如保留最新的，保留最舊的，或者保留某個其他欄位最大的）。

2. 多列去重，按多列，去除重複記錄。保留規則（例如保留最新的，保留最舊的，或者保留某個其他欄位最大的）。

3. 行去重，按行，去除重複記錄。保留規則（例如保留最新的，保留最舊的，或者保留某個其他欄位最大的）。

4. 多列混合去重，按多列混合模式(ROW1: col1=1 , col2=2；ROW2: col1=2, col2=1；這種混合重複的去重)，去除重複記錄。保留規則（例如保留最新的，保留最舊的，或者保留某個其他欄位最大的）。

下面依次舉例，使用不同的方法去重，使用者可以自由選擇效率最佳的。

單列去重

測試資料

create table test1(id int primary key, c1 int, c2 timestamp);  
insert into test1 select generate_series(1,1000000), random()*1000, clock_timestamp();  
  
create index idx_test1 on test1(c1,id);    
-- 這個索引可以起到加速效果。如果沒有這個索引，以下三種方法，第二種效率最高，其次是第三種。

需求：去除c1重複的行，保留id最大的。

方法1，使用聚合，not in

postgres=# explain delete from test1 where id not in (select max(id) from test1 group by c1);  
                                                    QUERY PLAN                                                      
------------------------------------------------------------------------------------------------------------------  
 Delete on test1  (cost=35115.63..53023.01 rows=500055 width=6)  
   ->  Seq Scan on test1  (cost=35115.63..53023.01 rows=500055 width=6)  
         Filter: (NOT (hashed SubPlan 1))  
         SubPlan 1  
           ->  GroupAggregate  (cost=0.42..35113.13 rows=1001 width=8)  
                 Group Key: test1_1.c1  
                 ->  Index Only Scan using idx_test1 on test1 test1_1  (cost=0.42..30102.57 rows=1000110 width=8)  
(7 rows)  
Time: 0.564 ms  
  
postgres=# delete from test1 where id not in (select max(id) from test1 group by c1);  
DELETE 998999  
Time: 1126.504 ms (00:01.127)

方法2，使用視窗查詢，IN

postgres=# explain select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1;  
                                            QUERY PLAN                                              
--------------------------------------------------------------------------------------------------  
 Subquery Scan on t  (cost=0.42..60075.54 rows=995109 width=4)  
   Filter: (t.rn <> 1)  
   ->  WindowAgg  (cost=0.42..47574.17 rows=1000110 width=16)  
         ->  Index Only Scan using idx_test1 on test1  (cost=0.42..30072.24 rows=1000110 width=8)  
(4 rows)  
Time: 0.512 ms  
  
postgres=# delete from test1 where id in (select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1);  
DELETE 998999  
Time: 2430.276 ms (00:02.430)

方法3，使用PLPGSQL，內部使用排序+遊標。

每條記錄判斷一次的方法去重，只有一次排序+每條記錄比對的開銷。

do language plpgsql $$       
declare  
  v_rec record;  
  v_c1 int;  
  cur1 cursor for select c1,id from test1 order by c1,id for update;  
begin  
  for v_rec in cur1 loop  
    if v_rec.c1 = v_c1 then  
      delete from test1 where current of cur1;  
    end if;  
    v_c1 := v_rec.c1;  
  end loop;  
end;  
$$;  
  
DO  
Time: 7345.773 ms (00:07.346)  
  
postgres=# select count(*) from test1;  
 count   
-------  
  1001  
(1 row)  
  
Time: 61.672 ms  
postgres=# select * from test1 limit 10;  
 id | c1  |             c2               
----+-----+----------------------------  
  1 | 582 | 2017-06-02 10:21:10.60918  
  2 | 278 | 2017-06-02 10:21:10.609331  
  3 | 659 | 2017-06-02 10:21:10.609338  
  4 | 372 | 2017-06-02 10:21:10.609341  
  5 | 184 | 2017-06-02 10:21:10.609343  
  6 | 121 | 2017-06-02 10:21:10.609345  
  7 | 132 | 2017-06-02 10:21:10.609347  
  8 | 290 | 2017-06-02 10:21:10.609348  
  9 | 980 | 2017-06-02 10:21:10.60935  
 10 | 305 | 2017-06-02 10:21:10.609352  
(10 rows)

PostgreSQL 10黑科技

即使只有部分驅動列，也能使用索引排序。

例如index(c1)，可以用於order by c1,id;

《PostgreSQL 10.0 preview 優化器改進 – 不完整索引支援複合排序》

多列去重

測試資料

create table test1(id int primary key, c1 int, c2 int, c3 timestamp);  
insert into test1 select generate_series(1,1000000), random()*1000, random()*1000, clock_timestamp();  
  
create index idx_test1 on test1(c1,c2,id);    
-- 這個索引可以起到加速效果。

需求：去除c1,c2重複的行，保留id最大的。

方法1，

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id not in (select max(id) from test1 group by c1,c2);  
                                                                                 QUERY PLAN                                                                                   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=40820.36..59690.36 rows=500000 width=6) (actual time=1634.960..1634.960 rows=0 loops=1)  
   Buffers: shared hit=1378788  
   ->  Seq Scan on public.test1  (cost=40820.36..59690.36 rows=500000 width=6) (actual time=1090.956..1446.374 rows=367618 loops=1)  
         Output: test1.ctid  
         Filter: (NOT (hashed SubPlan 1))  
         Rows Removed by Filter: 632382  
         Buffers: shared hit=1011170  
         SubPlan 1  
           ->  GroupAggregate  (cost=0.42..40570.36 rows=100000 width=12) (actual time=0.035..842.497 rows=632382 loops=1)  
                 Output: max(test1_1.id), test1_1.c1, test1_1.c2  
                 Group Key: test1_1.c1, test1_1.c2  
                 Buffers: shared hit=1004800  
                 ->  Index Only Scan using idx_test1 on public.test1 test1_1  (cost=0.42..32070.36 rows=1000000 width=12) (actual time=0.027..587.506 rows=1000000 loops=1)  
                       Output: test1_1.c1, test1_1.c2, test1_1.id  
                       Heap Fetches: 1000000  
                       Buffers: shared hit=1004800  
 Planning time: 0.211 ms  
 Execution time: 1641.679 ms  
(18 rows)

方法2，

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id in (select id from (select row_number() over(partition by c1,c2 order by id) as rn, id from test1) t where t.rn<>1);  
                                                                                      QUERY PLAN                                                                                        
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=83752.89..130385.27 rows=995000 width=34) (actual time=2199.376..2199.376 rows=0 loops=1)  
   Buffers: shared hit=1378741, temp read=6482 written=6420  
   ->  Hash Semi Join  (cost=83752.89..130385.27 rows=995000 width=34) (actual time=1381.636..1929.284 rows=367584 loops=1)  
         Output: test1.ctid, t.*  
         Hash Cond: (test1.id = t.id)  
         Buffers: shared hit=1011157, temp read=6482 written=6420  
         ->  Seq Scan on public.test1  (cost=0.00..16370.00 rows=1000000 width=10) (actual time=0.013..140.130 rows=1000000 loops=1)  
               Output: test1.ctid, test1.id  
               Buffers: shared hit=6370  
         ->  Hash  (cost=64513.39..64513.39 rows=995000 width=32) (actual time=1377.349..1377.349 rows=367584 loops=1)  
               Output: t.*, t.id  
               Buckets: 65536  Batches: 32  Memory Usage: 1326kB  
               Buffers: shared hit=1004787, temp written=2591  
               ->  Subquery Scan on t  (cost=0.42..64513.39 rows=995000 width=32) (actual time=0.074..1269.919 rows=367584 loops=1)  
                     Output: t.*, t.id  
                     Filter: (t.rn <> 1)  
                     Rows Removed by Filter: 632416  
                     Buffers: shared hit=1004787  
                     ->  WindowAgg  (cost=0.42..52013.39 rows=1000000 width=20) (actual time=0.054..1117.668 rows=1000000 loops=1)  
                           Output: row_number() OVER (?), test1_1.id, test1_1.c1, test1_1.c2  
                           Buffers: shared hit=1004787  
                           ->  Index Only Scan using idx_test1 on public.test1 test1_1  (cost=0.42..32013.39 rows=1000000 width=12) (actual time=0.035..627.329 rows=1000000 loops=1)  
                                 Output: test1_1.c1, test1_1.c2, test1_1.id  
                                 Heap Fetches: 1000000  
                                 Buffers: shared hit=1004787  
 Planning time: 0.565 ms  
 Execution time: 2199.450 ms  
(27 rows)

方法3，

postgres=# do language plpgsql $$       
declare  
  v_rec record;  
  v_c1 int;   
  v_c2 int;    
  cur1 cursor for select c1,c2 from test1 order by c1,c2,id for update;     
begin  
  for v_rec in cur1 loop  
    if v_rec.c1 = v_c1 and v_rec.c2=v_c2 then  
      delete from test1 where current of cur1;  
    end if;  
    v_c1 := v_rec.c1;   
    v_c2 := v_rec.c2;  
  end loop;  
end;  
$$;  
DO  
  
Time: 4637.183 ms (00:04.637)

行去重

測試資料

create table test1(c1 int, c2 int);  
insert into test1 select random()*1000, random()*1000 from generate_series(1,1000000);  
  
-- 行號ctid 系統列無法建立索引

需求：去除重複的行，保留任意一條。

因為沒有PK了，可以通過行號來進行保留。

方法1，ctid的not in會導致loop，驗證影響效能，不建議使用not in的方法

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid not in (select max(ctid) from test1 group by c1,c2);  
^CCancel request sent  
ERROR:  57014: canceling statement due to user request  
LOCATION:  ProcessInterrupts, postgres.c:2984  
Time: 426433.450 ms (07:06.433)  

postgres=# truncate test1;
TRUNCATE TABLE
postgres=# insert into test1 select random()*1000, random()*1000 from generate_series(1,10000); 
INSERT 0 10000
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test1 where ctid not in (select max(ctid) from test1 group by c1,c2);  
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test1  (cost=222.97..90895.39 rows=5085 width=8) (actual time=54.451..7741.146 rows=51 loops=1)
   Output: test1.c1, test1.c2
   Filter: (NOT (SubPlan 1))
   Rows Removed by Filter: 9949
   Buffers: shared hit=90
   SubPlan 1

     發生了LOOP

     ->  Materialize  (cost=222.97..238.23 rows=1017 width=14) (actual time=0.001..0.297 rows=5000 loops=10000)
           Output: (max(test1_1.ctid)), test1_1.c1, test1_1.c2
           Buffers: shared hit=45
           ->  HashAggregate  (cost=222.97..233.14 rows=1017 width=14) (actual time=4.757..6.655 rows=9949 loops=1)
                 Output: max(test1_1.ctid), test1_1.c1, test1_1.c2
                 Group Key: test1_1.c1, test1_1.c2
                 Buffers: shared hit=45
                 ->  Seq Scan on public.test1 test1_1  (cost=0.00..146.70 rows=10170 width=14) (actual time=0.005..1.588 rows=10000 loops=1)
                       Output: test1_1.c1, test1_1.c2, test1_1.ctid
                       Buffers: shared hit=45
 Planning time: 0.121 ms
 Execution time: 7741.277 ms
(18 rows)
  
使用使用者定義的列不會有這個問題,已反饋給社群  

drop table test1;
create table test1(id int, c1 int, c2 int);  
insert into test1 select id, random()*1000, random()*1000 from generate_series(1,10000) t(id); 

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test1 where id not in (select max(id) from test1 group by c1,c2);
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test1  (cost=1048.18..1243.43 rows=5610 width=12) (actual time=11.762..13.627 rows=48 loops=1)
   Output: test1.id, test1.c1, test1.c2
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 9952
   Buffers: shared hit=110
   SubPlan 1
     ->  GroupAggregate  (cost=921.96..1045.38 rows=1122 width=12) (actual time=5.355..9.162 rows=9952 loops=1)
           Output: max(test1_1.id), test1_1.c1, test1_1.c2
           Group Key: test1_1.c1, test1_1.c2
           Buffers: shared hit=55
           ->  Sort  (cost=921.96..950.01 rows=11220 width=12) (actual time=5.350..6.101 rows=10000 loops=1)
                 Output: test1_1.c1, test1_1.c2, test1_1.id
                 Sort Key: test1_1.c1, test1_1.c2
                 Sort Method: quicksort  Memory: 853kB
                 Buffers: shared hit=55
                 ->  Seq Scan on public.test1 test1_1  (cost=0.00..167.20 rows=11220 width=12) (actual time=0.004..1.528 rows=10000 loops=1)
                       Output: test1_1.c1, test1_1.c2, test1_1.id
                       Buffers: shared hit=55
 Planning time: 58.784 ms
 Execution time: 13.685 ms
(20 rows)

方法2，

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid in (select ctid from (select row_number() over(partition by c1,c2 order by ctid) as rn, ctid from test1) t where t.rn<>1);  
                                                                                   QUERY PLAN                                                                                      
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=422032.41..427035.41 rows=500000 width=36) (actual time=3525.690..3525.690 rows=0 loops=1)  
   Buffers: shared hit=376073 dirtied=4398, temp read=10658 written=10683  
   ->  Merge Join  (cost=422032.41..427035.41 rows=500000 width=36) (actual time=3003.047..3352.172 rows=367223 loops=1)  
         Output: test1.ctid, t.*  
         Inner Unique: true  
         Merge Cond: (test1.ctid = t.ctid)  
         Buffers: shared hit=8850, temp read=10658 written=10683  
         ->  Sort  (cost=122873.59..125373.59 rows=1000000 width=6) (actual time=786.814..883.721 rows=1000000 loops=1)  
               Output: test1.ctid  
               Sort Key: test1.ctid  
               Sort Method: external sort  Disk: 15656kB  
               Buffers: shared hit=4425, temp read=1957 written=1957  
               ->  Seq Scan on public.test1  (cost=0.00..14425.00 rows=1000000 width=6) (actual time=0.021..112.431 rows=1000000 loops=1)  
                     Output: test1.ctid  
                     Buffers: shared hit=4425  
         ->  Sort  (cost=299158.81..299159.31 rows=200 width=36) (actual time=2216.021..2268.235 rows=367223 loops=1)  
               Output: t.*, t.ctid  
               Sort Key: t.ctid  
               Sort Method: external sort  Disk: 18688kB  
               Buffers: shared hit=4425, temp read=8701 written=8726  
               ->  Unique  (cost=294176.17..299151.17 rows=200 width=36) (actual time=1790.180..1949.522 rows=367223 loops=1)  
                     Output: t.*, t.ctid  
                     Buffers: shared hit=4425, temp read=6365 written=6390  
                     ->  Sort  (cost=294176.17..296663.67 rows=995000 width=36) (actual time=1790.179..1874.394 rows=367223 loops=1)  
                           Output: t.*, t.ctid  
                           Sort Key: t.ctid  
                           Sort Method: external merge  Disk: 18744kB  
                           Buffers: shared hit=4425, temp read=6365 written=6390  
                           ->  Subquery Scan on t  (cost=125069.59..160069.59 rows=995000 width=36) (actual time=692.878..1542.122 rows=367223 loops=1)  
                                 Output: t.*, t.ctid  
                                 Filter: (t.rn <> 1)  
                                 Rows Removed by Filter: 632777  
                                 Buffers: shared hit=4425, temp read=4022 written=4039  
                                 ->  WindowAgg  (cost=125069.59..147569.59 rows=1000000 width=22) (actual time=692.858..1401.210 rows=1000000 loops=1)  
                                       Output: row_number() OVER (?), test1_1.ctid, test1_1.c1, test1_1.c2  
                                       Buffers: shared hit=4425, temp read=4022 written=4039  
                                       ->  Sort  (cost=125069.59..127569.59 rows=1000000 width=14) (actual time=692.850..947.055 rows=1000000 loops=1)  
                                             Output: test1_1.ctid, test1_1.c1, test1_1.c2  
                                             Sort Key: test1_1.c1, test1_1.c2, test1_1.ctid  
                                             Sort Method: external merge  Disk: 25496kB  
                                             Buffers: shared hit=4425, temp read=4022 written=4039  
                                             ->  Seq Scan on public.test1 test1_1  (cost=0.00..14425.00 rows=1000000 width=14) (actual time=0.010..131.128 rows=1000000 loops=1)  
                                                   Output: test1_1.ctid, test1_1.c1, test1_1.c2  
                                                   Buffers: shared hit=4425  
 Planning time: 0.247 ms  
 Execution time: 3547.727 ms  
(46 rows)

方法3，

postgres=# do language plpgsql $$       
declare  
  v_rec record;  
  v_c1 int;   
  v_c2 int;    
  cur1 cursor for select c1,c2 from test1 order by c1,c2,ctid for update;     
begin  
  for v_rec in cur1 loop  
    if v_rec.c1 = v_c1 and v_rec.c2=v_c2 then  
      delete from test1 where current of cur1;  
    end if;  
    v_c1 := v_rec.c1;   
    v_c2 := v_rec.c2;  
  end loop;  
end;  
$$;  
DO  
  
Time: 5395.774 ms (00:05.396)

多列混合去重

多列去重，可以使用陣列的方法，如下

但是需要注意陣列內元素的順序不一致時，兩個陣列元素是不相等的。

postgres=# select array[1,2] = array[2,1];  
 ?column?   
----------  
 f  
(1 row)  
  
postgres=# select array[1,2] @> array[2,1] and array[2,1] @> array[1,1,2];  
 ?column?   
----------  
 t  
(1 row)  
  
postgres=# select array[1,2] @> array[2,2,1] and array[2,1] @> array[1,1,2];  
 ?column?   
----------  
 t  
(1 row)

所以需要對元素排序存放，將需要參與去重的列，作為陣列元素即可。

建立一個支援任意列的排序函式，輸出排序後的陣列

postgres=# create or replace function sort_vals(variadic v_arr text[]) returns text[] as $$  
  select array_agg(arr order by arr) from unnest(v_arr) t(arr);  
$$ language sql strict;  
  
postgres=# select sort_vals(`a`,`a`,`b`,`a`,`c`);  
  sort_vals    
-------------  
 {a,a,a,b,c}  
(1 row)

測試資料

create table test1(c1 int, c2 int);  
insert into test1 select random()*1000, random()*1000 from generate_series(1,1000000);

需求：去除c1,c2交叉重複的記錄(1,2; 2,1 認為是重複的)，保留任意一條。

方法2-1，使用sort_vals排序重組陣列（簡單，好理解）

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid in (select ctid from (select row_number() over(partition by sort_vals(c1::text,c2::text) order by ctid) as rn, ctid from test1) t where t.rn<>1);  
                                                                                    QUERY PLAN                                                                                       
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=963704.16..968707.16 rows=500000 width=36) (actual time=16466.913..16466.913 rows=0 loops=1)  
   Buffers: shared hit=576071, temp read=18863 written=18901  
   ->  Merge Join  (cost=963704.16..968707.16 rows=500000 width=36) (actual time=15766.506..16202.766 rows=567213 loops=1)  
         Output: test1.ctid, t.*  
         Inner Unique: true  
         Merge Cond: (test1.ctid = t.ctid)  
         Buffers: shared hit=8858, temp read=18863 written=18901  
         ->  Sort  (cost=122873.59..125373.59 rows=1000000 width=6) (actual time=782.867..880.729 rows=1000000 loops=1)  
               Output: test1.ctid  
               Sort Key: test1.ctid  
               Sort Method: external sort  Disk: 15656kB  
               Buffers: shared hit=4425, temp read=1957 written=1957  
               ->  Seq Scan on public.test1  (cost=0.00..14425.00 rows=1000000 width=6) (actual time=0.009..110.757 rows=1000000 loops=1)  
                     Output: test1.ctid  
                     Buffers: shared hit=4425  
         ->  Sort  (cost=840830.56..840831.06 rows=200 width=36) (actual time=14983.595..15066.186 rows=567213 loops=1)  
               Output: t.*, t.ctid  
               Sort Key: t.ctid  
               Sort Method: external sort  Disk: 28864kB  
               Buffers: shared hit=4433, temp read=16906 written=16944  
               ->  Unique  (cost=835847.92..840822.92 rows=200 width=36) (actual time=14316.637..14568.357 rows=567213 loops=1)  
                     Output: t.*, t.ctid  
                     Buffers: shared hit=4433, temp read=13298 written=13336  
                     ->  Sort  (cost=835847.92..838335.42 rows=995000 width=36) (actual time=14316.636..14456.355 rows=567213 loops=1)  
                           Output: t.*, t.ctid  
                           Sort Key: t.ctid  
                           Sort Method: external merge  Disk: 28952kB  
                           Buffers: shared hit=4433, temp read=13298 written=13336  
                           ->  Subquery Scan on t  (cost=409241.34..701741.34 rows=995000 width=36) (actual time=12177.370..13945.667 rows=567213 loops=1)  
                                 Output: t.*, t.ctid  
                                 Filter: (t.rn <> 1)  
                                 Rows Removed by Filter: 432787  
                                 Buffers: shared hit=4433, temp read=9679 written=9704  
                                 ->  WindowAgg  (cost=409241.34..689241.34 rows=1000000 width=46) (actual time=12177.303..13765.873 rows=1000000 loops=1)  
                                       Output: row_number() OVER (?), test1_1.ctid, (sort_vals(VARIADIC ARRAY[(test1_1.c1)::text, (test1_1.c2)::text]))  
                                       Buffers: shared hit=4433, temp read=9679 written=9704  
                                       ->  Sort  (cost=409241.34..411741.34 rows=1000000 width=38) (actual time=12177.293..13163.065 rows=1000000 loops=1)  
                                             Output: test1_1.ctid, (sort_vals(VARIADIC ARRAY[(test1_1.c1)::text, (test1_1.c2)::text]))  
                                             Sort Key: (sort_vals(VARIADIC ARRAY[(test1_1.c1)::text, (test1_1.c2)::text])), test1_1.ctid  
                                             Sort Method: external merge  Disk: 51904kB  
                                             Buffers: shared hit=4430, temp read=9679 written=9704  
                                             ->  Seq Scan on public.test1 test1_1  (cost=0.00..274425.00 rows=1000000 width=38) (actual time=0.202..8735.620 rows=1000000 loops=1)  
                                                   Output: test1_1.ctid, sort_vals(VARIADIC ARRAY[(test1_1.c1)::text, (test1_1.c2)::text])  
                                                   Buffers: shared hit=4425  
 Planning time: 0.292 ms  
 Execution time: 16500.934 ms  
(46 rows)

方法2-2，使用SUBQUERY排序重組陣列，（效率高，燒腦）

explain (analyze,verbose,timing,costs,buffers)   
delete from test1 where ctid in   
(  
select rid from  
  (  
    select row_number() over(partition by val order by rid) as rn, rid from   
      (  
        select rid, array_agg(arr order by arr) val from   
          (select ctid rid, unnest(array[c1,c2]) arr from test1) t  
        group by rid  
      ) t  
  ) t  
where t.rn<>1  
);  
  
                                                                                      QUERY PLAN                                                                                         
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Delete on public.test1  (cost=18979422.39..18984425.62 rows=199 width=36) (actual time=10186.459..10186.459 rows=0 loops=1)  
   Buffers: shared hit=575991, temp read=19174 written=19212  
   ->  Merge Semi Join  (cost=18979422.39..18984425.62 rows=199 width=36) (actual time=9421.861..9923.844 rows=567141 loops=1)  
         Output: test1.ctid, t.*  
         Merge Cond: (test1.ctid = t.rid)  
         Buffers: shared hit=8850, temp read=19174 written=19212  
         ->  Sort  (cost=122879.44..125379.56 rows=1000050 width=6) (actual time=796.023..894.723 rows=1000000 loops=1)  
               Output: test1.ctid  
               Sort Key: test1.ctid  
               Sort Method: external sort  Disk: 15656kB  
               Buffers: shared hit=4425, temp read=1957 written=1957  
               ->  Seq Scan on public.test1  (cost=0.00..14425.50 rows=1000050 width=6) (actual time=0.016..122.654 rows=1000000 loops=1)  
                     Output: test1.ctid  
                     Buffers: shared hit=4425  
         ->  Sort  (cost=18856542.95..18856543.45 rows=199 width=36) (actual time=8625.739..8775.583 rows=567141 loops=1)  
               Output: t.*, t.rid  
               Sort Key: t.rid  
               Sort Method: external merge  Disk: 28952kB  
               Buffers: shared hit=4425, temp read=17217 written=17255  
               ->  Subquery Scan on t  (cost=18856528.85..18856535.35 rows=199 width=36) (actual time=6749.159..8251.185 rows=567141 loops=1)  
                     Output: t.*, t.rid  
                     Filter: (t.rn <> 1)  
                     Rows Removed by Filter: 432859  
                     Buffers: shared hit=4425, temp read=13598 written=13623  
                     ->  WindowAgg  (cost=18856528.85..18856532.85 rows=200 width=46) (actual time=6749.138..8073.103 rows=1000000 loops=1)  
                           Output: row_number() OVER (?), test1_1.ctid, (array_agg((unnest(ARRAY[test1_1.c1, test1_1.c2])) ORDER BY (unnest(ARRAY[test1_1.c1, test1_1.c2]))))  
                           Buffers: shared hit=4425, temp read=13598 written=13623  
                           ->  Sort  (cost=18856528.85..18856529.35 rows=200 width=38) (actual time=6749.128..7507.854 rows=1000000 loops=1)  
                                 Output: test1_1.ctid, (array_agg((unnest(ARRAY[test1_1.c1, test1_1.c2])) ORDER BY (unnest(ARRAY[test1_1.c1, test1_1.c2]))))  
                                 Sort Key: (array_agg((unnest(ARRAY[test1_1.c1, test1_1.c2])) ORDER BY (unnest(ARRAY[test1_1.c1, test1_1.c2])))), test1_1.ctid  
                                 Sort Method: external merge  Disk: 44040kB  
                                 Buffers: shared hit=4425, temp read=13598 written=13623  
                                 ->  GroupAggregate  (cost=18106479.21..18856519.21 rows=200 width=38) (actual time=2315.955..4053.484 rows=1000000 loops=1)  
                                       Output: test1_1.ctid, array_agg((unnest(ARRAY[test1_1.c1, test1_1.c2])) ORDER BY (unnest(ARRAY[test1_1.c1, test1_1.c2])))  
                                       Group Key: test1_1.ctid  
                                       Buffers: shared hit=4425, temp read=5382 written=5382  
                                       ->  Sort  (cost=18106479.21..18356491.71 rows=100005000 width=10) (actual time=2315.934..2530.362 rows=2000000 loops=1)  
                                             Output: test1_1.ctid, (unnest(ARRAY[test1_1.c1, test1_1.c2]))  
                                             Sort Key: test1_1.ctid  
                                             Sort Method: external sort  Disk: 43056kB  
                                             Buffers: shared hit=4425, temp read=5382 written=5382  
                                             ->  ProjectSet  (cost=0.00..521950.88 rows=100005000 width=10) (actual time=0.019..836.774 rows=2000000 loops=1)  
                                                   Output: test1_1.ctid, unnest(ARRAY[test1_1.c1, test1_1.c2])  
                                                   Buffers: shared hit=4425  
                                                   ->  Seq Scan on public.test1 test1_1  (cost=0.00..14425.50 rows=1000050 width=14) (actual time=0.010..137.319 rows=1000000 loops=1)  
                                                         Output: test1_1.c1, test1_1.c2, test1_1.ctid  
                                                         Buffers: shared hit=4425  
 Planning time: 0.241 ms  
 Execution time: 10228.121 ms  
(49 rows)

方法3，

postgres=# do language plpgsql $$       
declare  
  v_rec record;  
  v_arr text[];   
  cur1 cursor for select sort_vals(c1::text,c2::text) as arr from test1 order by sort_vals(c1::text,c2::text),ctid for update;     
begin  
  for v_rec in cur1 loop  
    if v_rec.arr = v_arr then  
      delete from test1 where current of cur1;  
    end if;  
    v_arr := v_rec.arr;   
  end loop;  
end;  
$$;  
DO  
  
Time: 18542.457 ms (00:18.542)

小結

有索引加速和沒有索引加速時，三種方法在效能相差比較大。

但是最穩定的還是視窗函式的方法，所以建議使用視窗函式的方法，最靠譜。

陣列內元素順序不一致時，並不相等，可以自定義重排函式。

postgresql表如何去重
2021-09-11
SQL
MySQL去重資料
2020-05-31
MySql
C#資料去重
2024-04-27
C#
MySQL資料庫行去重複和列去重複
2015-01-23
MySql資料庫
mysql資料去重和排序
2015-10-19
MySql排序
C# datatable中重複資料去重
2015-09-23
C#
大資料去重（data deduplication）方案
2021-02-09
大資料
map/reduce實現資料去重
2014-07-07
Oracle:重複資料去重，只取最新的一條資料
2024-09-27
Oracle
PostgreSQL刪除表中重複資料
2022-01-19
SQL
解析postgresql 刪除重複資料案例
2022-01-12
SQL
列組合資料去重複值
2014-09-25
【資料庫】PostgreSQL中使用`SELECT DISTINCT`和`SUBSTRING`函式實現去重查詢
2024-03-28
資料庫SQL函式
資料校驗---記一次讀取json配置資料，資料去重，去空
2019-03-25
JSON
PostgreSQL分割槽表、繼承表記錄去重方法
2018-10-23
SQL繼承
ClickHouse 實時資料去重final+group by
2020-11-25
KaiwuDB 時序引擎資料去重功能詳解
2023-12-25
AI
MapReduce應用案例--簡單的資料去重
2016-04-11
MapReduce程式設計例項之資料去重
2015-11-24
程式設計
6個備份資料的去重加密工具
2015-07-26
加密
PostgreSQL 恢復大法 - 恢復部分資料庫、跳過壞塊、修復無法啟動的資料庫
2018-04-18
SQL資料庫
第七篇：經典案例：資料去重
2017-05-20
10W資料匯入該如何與庫中資料去重？
2024-08-16
切片去重(string,int型別去重)
2020-10-10
型別
Cannot set property ‘dataIndex‘ of undefined 大資料關係圖報錯，賦予的資料有重複，去重
2020-10-06
AIIndexUndefined大資料
sql 去重
2015-12-14
SQL
列表去重
2024-08-17
JS陣列去重 – JSON陣列去重
2019-09-05
陣列JSON
js 中基礎資料結構陣列去重問題
2019-02-27
JS資料結構陣列
Spark千億級資料去重，效能最佳化踩坑之路
2022-10-27
Spark
【Flink】基於 Flink 的流式資料實時去重
2024-10-11
陣列合並且去重&向一個陣列新增一條資料（重複的就不新增）&陣列物件去重處理
2019-09-03
陣列物件
陣列去重
2019-02-28
陣列
字串切片去重
2022-02-08
字串
Redis去重方法
2021-11-06
Redis
MySQL 去重SQL
2017-12-05
MySql
datagridView 去重
2016-04-10
View
結果去重
2024-10-30

PostgreSQL資料去重大法

標籤

背景

單列去重

方法1，使用聚合，not in

方法2，使用視窗查詢，IN

方法3，使用PLPGSQL，內部使用排序+遊標。

PostgreSQL 10黑科技

多列去重

行去重

多列混合去重

小結

相關文章