用PostgreSQL找回618秒逝去的青春-遞迴收斂優化

德哥發表於2016-12-21

SQL遞迴優化

背景

有一個這樣的場景，一張小表A，裡面儲存了一些ID，大約幾百個。

（比如說巡邏車輛ID，環衛車輛的ID，公交車，微公交的ID）。

另外有一張日誌表B，每條記錄中的ID是來自前面那張小表的，但不是每個ID都出現在這張日誌表中，比如說一天可能只有幾十個ID會出現在這個日誌表的當天的資料中。

（比如車輛的行車軌跡資料，每秒上報軌跡，資料量就非常龐大）。

那麼我怎麼快速的找出今天沒有出現的ID呢。

（哪些巡邏車輛沒有出現在這個片區，是不是偷懶了？哪些環衛車輛沒有出行，哪些公交或微公交沒有出行）？

select id from A where id not in (select id from B where time between ? and ?);

這個QUERY會很慢，有什麼優化方法呢。
當然，你還可以讓車輛簽到的方式來解決這個問題，但是總有未簽到的，或者沒有這種設計的時候，那麼怎麼解決呢？

優化方法

其實方法也很精妙，和我之前做的兩個CASE很相似。

《時序資料合併場景加速分析和實現 – 複合索引，視窗分組查詢加速，變態遞迴加速》

《distinct xx和count(distinct xx)的變態遞迴優化方法 – 索引收斂(skip scan)掃描》

在B表中，其實ID的值是很稀疏的，只是由於是流水，所以總量大。

優化的手段就是對B的取值區間，做遞迴的收斂查詢，然後再做NOT IN就很快了。

例子

建表

create table a(id int primary key, info text);

create table b(id int primary key, aid int, crt_time timestamp);
create index b_aid on b(aid);

插入測試資料

-- a表插入1000條
insert into a select generate_series(1,1000), md5(random()::text);

-- b表插入500萬條，只包含aid的500個id。
insert into b select generate_series(1,5000000), generate_series(1,500), clock_timestamp();

優化前的效能

	iming

explain (analyze,verbose,timing,costs,buffers) select * from a where id not in (select aid from b); 


                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.a  (cost=0.00..67030021.50 rows=500 width=37) (actual time=2932.080..618776.881 rows=500 loops=1)
   Output: a.id, a.info
   Filter: (NOT (SubPlan 1))
   Rows Removed by Filter: 500
   Buffers: shared hit=27037, temp read=4264454 written=8545
   SubPlan 1
     ->  Materialize  (cost=0.00..121560.00 rows=5000000 width=4) (actual time=0.002..298.049 rows=2500125 loops=1000)
           Output: b.aid
           Buffers: shared hit=27028, temp read=4264454 written=8545
           ->  Seq Scan on public.b  (cost=0.00..77028.00 rows=5000000 width=4) (actual time=0.009..888.427 rows=5000000 loops=1)
                 Output: b.aid
                 Buffers: shared hit=27028
 Planning time: 0.969 ms
 Execution time: 618794.299 ms
(14 rows)

另外你有一種選擇是使用outer join, b表同樣需要全掃一遍，有很大的改進，不過還可以更好，繼續往後看。

postgres=# explain (analyze,verbose,timing,costs,buffers) select a.id from a left join b on (a.id=b.aid) where b.* is null;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=31.50..145809.50 rows=25000 width=4) (actual time=2376.777..2376.862 rows=500 loops=1)
   Output: a.id
   Hash Cond: (b.aid = a.id)
   Filter: (b.* IS NULL)
   Rows Removed by Filter: 5000000
   Buffers: shared hit=27037
   ->  Seq Scan on public.b  (cost=0.00..77028.00 rows=5000000 width=44) (actual time=0.012..1087.997 rows=5000000 loops=1)
         Output: b.aid, b.*
         Buffers: shared hit=27028
   ->  Hash  (cost=19.00..19.00 rows=1000 width=4) (actual time=0.355..0.355 rows=1000 loops=1)
         Output: a.id
         Buckets: 1024  Batches: 1  Memory Usage: 44kB
         Buffers: shared hit=9
         ->  Seq Scan on public.a  (cost=0.00..19.00 rows=1000 width=4) (actual time=0.010..0.183 rows=1000 loops=1)
               Output: a.id
               Buffers: shared hit=9
 Planning time: 0.302 ms
 Execution time: 2376.934 ms
(18 rows)

遞迴收斂優化後的效能

explain (analyze,verbose,timing,costs,buffers) 
select * from a where id not in 
(
with recursive skip as (  
  (  
    select min(aid) aid from b where aid is not null  
  )  
  union all  
  (  
    select (select min(aid) aid from b where b.aid > s.aid and b.aid is not null)   
      from skip s where s.aid is not null  
  )  -- 這裡的where s.aid is not null 一定要加,否則就死迴圈了.  
)   
select aid from skip where aid is not null
);


                                                                                 QUERY PLAN                    
------------------------------------------------------------------------------------------------------
 Seq Scan on public.a  (cost=54.98..76.48 rows=500 width=37) (actual time=10.837..10.957 rows=500 loops=1)
   Output: a.id, a.info
   Filter: (NOT (hashed SubPlan 5))
   Rows Removed by Filter: 500
   Buffers: shared hit=2012
   SubPlan 5
     ->  CTE Scan on skip  (cost=52.71..54.73 rows=100 width=4) (actual time=0.042..10.386 rows=500 loops=1)
           Output: skip.aid
           Filter: (skip.aid IS NOT NULL)
           Rows Removed by Filter: 1
           Buffers: shared hit=2003
           CTE skip
             ->  Recursive Union  (cost=0.46..52.71 rows=101 width=4) (actual time=0.037..10.104 rows=501 loops=1)
                   Buffers: shared hit=2003
                   ->  Result  (cost=0.46..0.47 rows=1 width=4) (actual time=0.036..0.036 rows=1 loops=1)
                         Output: $1
                         Buffers: shared hit=4
                         InitPlan 3 (returns $1)
                           ->  Limit  (cost=0.43..0.46 rows=1 width=4) (actual time=0.031..0.032 rows=1 loops=1)
                                 Output: b_1.aid
                                 Buffers: shared hit=4
                                 ->  Index Only Scan using b_aid on public.b b_1  (cost=0.43..131903.43 rows=5000000 width=4) (actual time=0.030..0.030 rows=1 loops=1)
                                       Output: b_1.aid
                                       Index Cond: (b_1.aid IS NOT NULL)
                                       Heap Fetches: 1
                                       Buffers: shared hit=4
                   ->  WorkTable Scan on skip s  (cost=0.00..5.02 rows=10 width=4) (actual time=0.019..0.019 rows=1 loops=501)
                         Output: (SubPlan 2)
                         Filter: (s.aid IS NOT NULL)
                         Rows Removed by Filter: 0
                         Buffers: shared hit=1999
                         SubPlan 2
                           ->  Result  (cost=0.47..0.48 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=500)
                                 Output: $3
                                 Buffers: shared hit=1999
                                 InitPlan 1 (returns $3)
                                   ->  Limit  (cost=0.43..0.47 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=500)
                                         Output: b.aid
                                         Buffers: shared hit=1999
                                         ->  Index Only Scan using b_aid on public.b  (cost=0.43..66153.48 rows=1666667 width=4) (actual time=0.017..0.017 rows=1 loops=500)
                                               Output: b.aid
                                               Index Cond: ((b.aid > s.aid) AND (b.aid IS NOT NULL))
                                               Heap Fetches: 499
                                               Buffers: shared hit=1999
 Planning time: 0.323 ms
 Execution time: 11.082 ms
(46 rows)

採用收斂查詢優化後，耗時從最初的 618794毫秒降低到了 11毫秒，感覺一下子節約了好多青春。

遞迴尾呼叫優化
2019-04-16
遞迴優化
尾遞迴以及優化
2018-08-21
遞迴優化
TypeScript：又一個終將逝去的青春
2019-10-09
TypeScript
「GAN優化」如何選好正則項讓你的GAN收斂
2019-09-26
優化
?30 秒瞭解尾遞迴和尾遞迴優化
2019-03-18
遞迴優化
PostgreSQL 遞迴查詢
2020-03-16
SQL遞迴
Javascript中的尾遞迴及其優化
2019-03-02
JavaScript遞迴優化
遞迴優化：尾呼叫和Memoization
2019-09-28
遞迴優化
面試官：用“尾遞迴”優化斐波那契函式
2021-11-04
面試遞迴優化函式
一道Postgresql遞迴樹題
2020-09-26
SQL遞迴
聊聊缺陷收斂率
2024-06-21
IS-IS網路收斂
2020-10-25
【PostgreSQL 】PostgreSQL 15對distinct的優化
2022-07-17
SQL優化
JS尾遞迴優化斐波拉契數列
2019-01-04
JS遞迴優化
用 JavaScript 的方式理解遞迴
2018-10-25
JavaScript遞迴
當前訓練神經網路最快的方式：AdamW優化演算法+超級收斂
2018-07-04
神經網路優化演算法
15.2 一致收斂
2024-06-15
專訪中國移動首席科學家馮俊蘭：AI業務應用需要收斂再收斂
2018-12-03
AI
Oracle優化案例-定位start with connect by遞迴死迴圈資料（二十二）
2019-09-01
Oracle優化遞迴
kingbase SQL最佳化案例 ( union遞迴改 cte遞迴 )
2023-12-07
SQL遞迴
遞迴神經網路教程請簽收！
2018-11-20
遞迴神經網路
遞迴和尾遞迴
2024-03-18
遞迴
快排的優化（非遞迴（感覺沒變化遞迴和非遞迴）+ 三個隨機數選取準標準值（和相對的最後的位置交換） + 分割區間法）
2020-11-06
優化遞迴隨機
遞迴的應用場景和呼叫機制、遞迴需要遵守的重要規則
2020-11-17
遞迴
遞迴的列印和階乘運用
2024-08-11
遞迴
用遞迴方法求10的階乘
2021-09-09
遞迴
§1. 一致收斂性
2024-05-08
生成樹埠狀態與收斂
2022-03-19
什麼是遞迴?遞迴和迴圈的異同
2020-12-25
遞迴
快速排序【遞迴】【非遞迴】
2018-06-04
排序遞迴
尤拉計劃722:慢收斂系列
2020-06-28
遞迴
2024-08-25
遞迴
程式分析與優化 - 6 迴圈優化
2022-06-12
優化
徹底理解遞迴，從遞迴的本質說起！
2019-06-26
遞迴
函式的遞迴
2020-06-17
函式遞迴
JavaScript中的遞迴
2019-03-04
JavaScript遞迴
演算法測試，怎樣才能收斂？
2020-12-26
演算法
用 Promise + 遞迴實現灌酒動畫
2018-07-23
Promise遞迴動畫
Vue3.0的遞迴監聽和非遞迴監聽
2020-10-28
Vue遞迴

用PostgreSQL找回618秒逝去的青春-遞迴收斂優化

標籤

背景

優化方法

相關文章