Sharing a SQL question from a company promotion exam: an interesting case (PostgreSQL, scalar subquery, LIKE predicate filter)

Posted by 小至尖尖 on 2023-11-10

A colleague sat the senior-engineer promotion exam today and sent me one of the company's questions, asking me to help optimize it. The case in scenario two is particularly interesting.

The question is as follows:

Original SQL:
scott=> explain analyze 
scott-> select 
scott-> a.id,
scott-> a.col2,
scott-> (select sum(b.id) from table02 b where a.col2 like b.col2||'%' )
scott-> from table01 a;


                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on table01 a  (cost=0.00..3905341.00 rows=100000 width=45) (actual time=0.579..50568.090 rows=100000 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=39.02..39.03 rows=1 width=8) (actual time=0.504..0.504 rows=1 loops=100000)
           ->  Seq Scan on table02 b  (cost=0.00..39.00 rows=10 width=4) (actual time=0.063..0.499 rows=8 loops=100000)
                 Filter: (a.col2 ~~ (col2 || '%'::text))
                 Rows Removed by Filter: 1992
 Planning Time: 0.097 ms
 Execution Time: 50590.882 ms
(8 rows)

Time: 50591.756 ms (00:50.592)

Neither table01 nor table02 has any index, so with full table scans plus the scalar subquery this SQL needs about 50 s to return a result, which is extremely slow.

The exam asks you to optimize this SQL. Any means is fair game: tuning PostgreSQL parameters, adding indexes, or rewriting the SQL into an equivalent form; whatever makes it run faster.
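The post does not show the exam's DDL or test data. For readers who want to reproduce the timings, here is a hypothetical reconstruction, sized from the plans above (table01: 100,000 rows; table02: 2,000 rows collapsing to 255 distinct col2 values, with each a.col2 matching roughly 8 rows of table02). Column types and the data generator are my assumptions, not the exam's:

```sql
-- Hypothetical test data (my reconstruction; the real DDL is not shown).
create table table01 (id int, col2 text);
create table table02 (id int, col2 text);

-- 100,000 rows whose col2 cycles through 255 distinct md5 strings,
-- so every a.col2 starts with a prefix that exists in table02.
insert into table01
select g, md5((g % 255)::text)
from generate_series(1, 100000) g;

-- 2,000 rows: 255 distinct 5-character prefixes, each repeated ~8 times,
-- matching the "rows=8" seen per subquery execution in the original plan.
insert into table02
select g, substr(md5((g % 255)::text), 1, 5)
from generate_series(1, 2000) g;

analyze table01;
analyze table02;
```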

 

Since I was busy with other things at the time, I took a quick look and offered the rewrite below (I did not add any indexes; I felt they would not help much here):

Rewrite 1:

scott=> explain analyze select 
scott-> a.id,
scott-> a.col2,
scott-> b.sum_b_id
scott-> from table01 a 
scott-> left join (select sum(b.id) sum_b_id,b.col2 from table02 b group by b.col2) b 
scott-> ON a.col2 like b.col2||'%'
scott-> ;

                                                    QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=39.00..448135.74 rows=127500 width=45) (actual time=1.283..8674.517 rows=100000 loops=1)
   Join Filter: (a.col2 ~~ (b.col2 || '%'::text))
   Rows Removed by Join Filter: 25400387
   ->  Seq Scan on table01 a  (cost=0.00..1841.00 rows=100000 width=37) (actual time=0.018..19.620 rows=100000 loops=1)
   ->  Materialize  (cost=39.00..45.37 rows=255 width=11) (actual time=0.000..0.024 rows=255 loops=100000)
         ->  HashAggregate  (cost=39.00..41.55 rows=255 width=11) (actual time=1.241..1.316 rows=255 loops=1)
               Group Key: b.col2
               ->  Seq Scan on table02 b  (cost=0.00..29.00 rows=2000 width=7) (actual time=0.007..0.342 rows=2000 loops=1)
 Planning Time: 0.181 ms
 Execution Time: 8682.974 ms
(10 rows)

Time: 8684.338 ms (00:08.684)

As you can see, after the equivalent rewrite the SQL drops from about 50 s to roughly 8.7 s, which is a sizeable improvement.
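A note on why rewrite 1 still runs as a nested loop (my explanation, not part of the exam): PostgreSQL can only hash- or merge-join on an equality operator, and `a.col2 LIKE b.col2 || '%'` contains no equality, so a nested loop is the planner's only choice. One way to confirm this, assuming the tables above, is to forbid nested loops and observe that the planner still has nothing else to offer:

```sql
-- Hash and merge joins require an equality join clause; a LIKE against a
-- per-row pattern is neither hashable nor mergeable. enable_nestloop=off
-- does not truly disable nested loops, it only penalizes them heavily,
-- so the planner keeps one when no alternative exists.
set enable_nestloop = off;

explain
select a.id, a.col2, b.sum_b_id
from table01 a
left join (select sum(id) as sum_b_id, col2
           from table02
           group by col2) b
  on a.col2 like b.col2 || '%';
-- Expect the plan to still contain a Nested Loop Left Join (now carrying
-- a huge disable cost): there is no hashable condition to switch to.

reset enable_nestloop;
```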

I sent the answer to my colleague and went back to my other work.

 

In the evening, with nothing else to do, I took a closer look at the execution plan of rewrite 1 and felt it was probably not the optimal plan.

I kept feeling that a hash join was the right plan here. If this SQL ran on Oracle, the CBO would very likely choose a hash join, but PG insists on a nested loop (dumb optimizer).

After dinner I kept trying rewrites. It took a long time, but I finally got a hash-join version of the SQL working; I was nearly in tears.

Rewrite 2:

explain analyze
with a as (select t1.id, t1.col2, substr(t1.col2, 1, x.rn) rn1
           from table01 t1,
                (select min(length(col2)) rn from table02) x)
   , b as (select t2.col2, t2.sum_b_id, substr(t2.col2, 1, x.rn) rn2
           from (SELECT col2, SUM(id) as sum_b_id FROM table02 GROUP BY col2) t2,
                (select min(length(col2)) rn from table02) x)
SELECT a.id,
       a.col2,
       b.sum_b_id
FROM a LEFT JOIN b ON a.rn1 = b.rn2 and a.col2 LIKE b.col2 || '%';
         
         
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=127.86..14881.38 rows=100000 width=45) (actual time=2.578..219.098 rows=100000 loops=1)
   Hash Cond: (substr(t1.col2, 1, (min(length(table02.col2)))) = substr(table02_2.col2, 1, (min(length(table02_1.col2)))))
   Join Filter: (t1.col2 ~~ (table02_2.col2 || '%'::text))
   ->  Nested Loop  (cost=39.00..2880.02 rows=100000 width=41) (actual time=0.692..36.984 rows=100000 loops=1)
         ->  Aggregate  (cost=39.00..39.01 rows=1 width=4) (actual time=0.682..0.685 rows=1 loops=1)
               ->  Seq Scan on table02  (cost=0.00..29.00 rows=2000 width=3) (actual time=0.020..0.297 rows=2000 loops=1)
         ->  Seq Scan on table01 t1  (cost=0.00..1841.00 rows=100000 width=37) (actual time=0.006..15.019 rows=100000 loops=1)
   ->  Hash  (cost=85.67..85.67 rows=255 width=15) (actual time=1.871..1.876 rows=255 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 20kB
         ->  Nested Loop  (cost=78.00..85.67 rows=255 width=15) (actual time=1.641..1.749 rows=255 loops=1)
               ->  Aggregate  (cost=39.00..39.01 rows=1 width=4) (actual time=0.630..0.631 rows=1 loops=1)
                     ->  Seq Scan on table02 table02_1  (cost=0.00..29.00 rows=2000 width=3) (actual time=0.005..0.266 rows=2000 loops=1)
               ->  HashAggregate  (cost=39.00..41.55 rows=255 width=11) (actual time=1.010..1.069 rows=255 loops=1)
                     Group Key: table02_2.col2
                     ->  Seq Scan on table02 table02_2  (cost=0.00..29.00 rows=2000 width=7) (actual time=0.005..0.271 rows=2000 loops=1)
 Planning Time: 0.531 ms
 Execution Time: 227.639 ms
(17 rows)

This case went from the original 50 s, to 8 s, and finally to 227 ms. I spent quite a bit of time studying the different rewrites.
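Why rewrite 2 is safe deserves a word (my reasoning; the exam does not spell it out). With rn = min(length(col2)) over table02, every b.col2 is at least rn characters long, so a.col2 LIKE b.col2 || '%' implies that the first rn characters of both sides are identical. The extra predicate a.rn1 = b.rn2 therefore never drops a matching row; it only hands the planner an equality it can hash on, while the original LIKE survives as a join filter to discard rows that merely share the prefix. A standalone sanity check of both directions of the implication:

```sql
-- If the LIKE matches, the fixed-length prefix comparison must also
-- match (here rn = 3, pattern 'abc'):
select 'abcdef' like 'abc' || '%' as like_matches,                     -- t
       substr('abcdef', 1, 3) = substr('abc', 1, 3) as prefix_matches; -- t

-- The converse does not hold, which is why the LIKE must remain as a
-- join filter. Suppose rn = 3 (some other table02 row has a 3-character
-- col2) but this b.col2 is 'abcd': the prefixes agree yet LIKE fails.
select 'abcxx' like 'abcd' || '%' as like_matches,                     -- f
       substr('abcxx', 1, 3) = substr('abcd', 1, 3) as prefix_matches; -- t
```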

I can only say PG's optimizer is really underwhelming here: you have to hand-craft an equality predicate (via that CTE rewrite) before it will use a hash join, which wastes developers a lot of time. An Oracle database would not pick such a braindead plan.

I will probably blog much less from now on; I am considering a career change to selling fried noodles. The current job market really defies description.
