PostgreSQLin與=any的SQL語法異同與效能優化

德哥發表於2018-02-02

背景

資料庫SQL也算一門比較神奇的語言了，比如很多需求可以有不同的SQL來實現：

我之前有輸出過一個IN的測試，這裡面實際上也涉及到多個語法，實現同一個功能點。測試CASE是1億 in 100萬的多種寫法的效能差異。

《HTAP資料庫 PostgreSQL 場景與效能測試之 25 – (OLTP) IN , EXISTS 查詢》

例如下面三個QUERY的語義就是一樣的

select * from tbl where id in (select id from t);  
  
select * from tbl where exists (select 1 from t where t.id=tbl.id);  
  
select * from tbl where id = any (array( select id from t ));

但是不同的SQL，資料庫可能會選擇不一樣的執行計劃，並且執行效率可能千差萬別。

幾個例子

1、建立測試表，模擬1萬 IN 100萬的操作。

postgres=# create table t(id int);  
CREATE TABLE  
postgres=# insert into t select generate_series(1,100*10000);  
INSERT 0 1000000

2、我們看一看不同寫法的執行計劃如何：

postgres=# explain select n = any(array(select id from t)) from generate_series(1,10000) as n;  
                                   QUERY PLAN                                      
---------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=14425.00..14447.50 rows=1000 width=1)  
   InitPlan 1 (returns $0)  
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)  
(3 rows)  
  
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;  
                                   QUERY PLAN                                      
---------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=16925.00..16937.50 rows=1000 width=1)  
   SubPlan 1  
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)  
(3 rows)

3、你會發現兩個語法用了不同的執行計劃，一個是InitPlan, 一個是SubPlan.

對於IN的寫法，work_mem引數會直接影響效能，work_mem的大小決定了subquery是否要裝載到hash table。

postgres=# set work_mem =`1MB`;  
SET  
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;  
                                   QUERY PLAN                                     
--------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=0.00..12916012.50 rows=1000 width=1)  
   SubPlan 1  
     ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4)  
           ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)  
(4 rows)  
  
postgres=# set work_mem =`100MB`;  
SET  
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;  
                                   QUERY PLAN                                      
---------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=16925.00..16937.50 rows=1000 width=1)  
   SubPlan 1  
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)  
(3 rows)

                if (subquery)  
                {  
                        /* Generate Paths for the ANY subquery; we`ll need all rows */  
                        subroot = subquery_planner(root->glob, subquery,  
                                                                           root,  
                                                                           false, 0.0);  
  
                        /* Isolate the params needed by this specific subplan */  
                        plan_params = root->plan_params;  
                        root->plan_params = NIL;  
  
                        /* Select best Path and turn it into a Plan */  
                        final_rel = fetch_upper_rel(subroot, UPPERREL_FINAL, NULL);  
                        best_path = final_rel->cheapest_total_path;  
  
                        plan = create_plan(subroot, best_path);  
  
                        /* Now we can check if it`ll fit in work_mem */  
                        /* XXX can we check this at the Path stage? */  
                        if (subplan_is_hashable(plan))  
                        {  
                                SubPlan    *hashplan;  
                                AlternativeSubPlan *asplan;  
  
                                /* OK, convert to SubPlan format. */  
                                hashplan = castNode(SubPlan,  
                                                                        build_subplan(root, plan, subroot,  
                                                                                                  plan_params,  
                                                                                                  ANY_SUBLINK, 0,  
                                                                                                  newtestexpr,  
                                                                                                  false, true));  
                                /* Check we got what we expected */  
                                Assert(hashplan->parParam == NIL);  
                                Assert(hashplan->useHashTable);  
                                /* build_subplan won`t have filled in paramIds */  
                                hashplan->paramIds = paramIds;  
  
                                /* Leave it to the executor to decide which plan to use */  
                                asplan = makeNode(AlternativeSubPlan);  
                                asplan->subplans = list_make2(result, hashplan);  
                                result = (Node *) asplan;  
                        }  
                }

/*  
 * subplan_is_hashable: can we implement an ANY subplan by hashing?  
 */  
static bool  
subplan_is_hashable(Plan *plan)  
{  
        double          subquery_size;  
  
        /*  
         * The estimated size of the subquery result must fit in work_mem. (Note:  
         * we use heap tuple overhead here even though the tuples will actually be  
         * stored as MinimalTuples; this provides some fudge factor for hashtable  
         * overhead.)  
         */  
        subquery_size = plan->plan_rows *  
                (MAXALIGN(plan->plan_width) + MAXALIGN(SizeofHeapTupleHeader));  
        if (subquery_size > work_mem * 1024L)  
                return false;  
  
        return true;  
}

程式碼裡面註釋中，針對in, exists, any的subplan優化器實現也有一些介紹，涉及到效能相關：

實際上exists這裡有提到，匹配到第一條就結束，所以評估是否使用雜湊表時可能需要的容量很小。

        /*  
         * For an EXISTS subplan, tell lower-level planner to expect that only the  
         * first tuple will be retrieved.  For ALL and ANY subplans, we will be  
         * able to stop evaluating if the test condition fails or matches, so very  
         * often not all the tuples will be retrieved; for lack of a better idea,  
         * specify 50% retrieval.  For EXPR, MULTIEXPR, and ROWCOMPARE subplans,  
         * use default behavior (we`re only expecting one row out, anyway).  
         *  
         * NOTE: if you change these numbers, also change cost_subplan() in  
         * path/costsize.c.  
         *  
         * XXX If an ANY subplan is uncorrelated, build_subplan may decide to hash  
         * its output.  In that case it would`ve been better to specify full  
         * retrieval.  At present, however, we can only check hashability after  
         * we`ve made the subplan :-(.  (Determining whether it`ll fit in work_mem  
         * is the really hard part.)  Therefore, we don`t want to be too  
         * optimistic about the percentage of tuples retrieved, for fear of  
         * selecting a plan that`s bad for the materialization case.  
         */

in vs = any vs exists效能對比

1、in, work_mem裝不下subquery

postgres=# set work_mem =`64kB`;  
  
postgres=# explain analyze select n in (select id from t) from generate_series(1,10000) as n;  
                                                            QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=0.00..12916012.50 rows=1000 width=1) (actual time=1.321..11484.646 rows=10000 loops=1)  
   SubPlan 1  
     ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4) (actual time=0.003..0.619 rows=5000 loops=10000)  
           ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..1.800 rows=10000 loops=1)  
 Planning time: 0.091 ms  
 Execution time: 11485.905 ms  
(6 rows)

2、in, work_mem裝下了subquery

postgres=# set work_mem =`64MB`;  
SET  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select n in (select id from t) from generate_series(1,10000) as n;  
                                                                  QUERY PLAN                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=16925.00..16937.50 rows=1000 width=1) (actual time=497.142..500.701 rows=10000 loops=1)  
   Output: (hashed SubPlan 1)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=4425  
   SubPlan 1  
     ->  Seq Scan on public.t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.024..124.703 rows=1000000 loops=1)  
           Output: t.id  
           Buffers: shared hit=4425  
 Planning time: 0.085 ms  
 Execution time: 507.427 ms  
(10 rows)

3、= any, work_mem很小無所謂，因為不涉及hashtable

postgres=# set work_mem =`64kB`;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select n = any(array(select id from t)) from generate_series(1,10000) as n;  
                                                                  QUERY PLAN                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=14425.00..14447.50 rows=1000 width=1) (actual time=233.871..446.120 rows=10000 loops=1)  
   Output: (n.n = ANY ($0))  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=4425, temp read=19 written=18  
   InitPlan 1 (returns $0)  
     ->  Seq Scan on public.t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..119.976 rows=1000000 loops=1)  
           Output: t.id  
           Buffers: shared hit=4425  
 Planning time: 0.085 ms  
 Execution time: 447.666 ms  
(10 rows)

4、exists, work_mem需求量較少(exists由於優化器在匹配到1條後即刻返回，所以會選擇使用索引，效能就非常好。)

postgres=# set work_mem =`64kB`;  
SET  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;  
                                                              QUERY PLAN                                                                
--------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=0.00..2852.50 rows=1000 width=1) (actual time=1.172..18.893 rows=10000 loops=1)  
   Output: (SubPlan 1)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=40027, temp read=19 written=18  
   SubPlan 1  
     ->  Index Only Scan using idx_t_1 on public.t  (cost=0.42..2.84 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=10000)  
           Index Cond: (t.id = n.n)  
           Heap Fetches: 10000  
           Buffers: shared hit=40027  
 Planning time: 0.118 ms  
 Execution time: 19.902 ms  
(11 rows)  
  
  
  
postgres=# set work_mem =`64MB`;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;  
                                                              QUERY PLAN                                                                
--------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=0.00..2852.50 rows=1000 width=1) (actual time=0.642..17.635 rows=10000 loops=1)  
   Output: (alternatives: SubPlan 1 or hashed SubPlan 2)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=40027  
   SubPlan 1  
     ->  Index Only Scan using idx_t_1 on public.t  (cost=0.42..2.84 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=10000)  
           Index Cond: (t.id = n.n)  
           Heap Fetches: 10000  
           Buffers: shared hit=40027  
   SubPlan 2  
     ->  Seq Scan on public.t t_1  (cost=0.00..14425.00 rows=1000000 width=4) (never executed)  
           Output: t_1.id  
 Planning time: 0.129 ms  
 Execution time: 18.612 ms  
(14 rows)

5、如果把索引幹掉，exists效能就會下降了，同時效能也和是否使用雜湊表有關。

postgres=# drop index idx_t_1;  
  
  
postgres=# set work_mem =`64kB`;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;  
                                                                 QUERY PLAN                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=0.00..16925010.00 rows=1000 width=1) (actual time=1.072..3036.590 rows=10000 loops=1)  
   Output: (SubPlan 1)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=226260, temp read=19 written=18  
   SubPlan 1  
     ->  Seq Scan on public.t  (cost=0.00..16925.00 rows=1 width=0) (actual time=0.303..0.303 rows=1 loops=10000)  
           Filter: (t.id = n.n)  
           Rows Removed by Filter: 5000  
           Buffers: shared hit=226260  
 Planning time: 0.087 ms  
 Execution time: 3037.904 ms  
(11 rows)  
  
postgres=# set work_mem =`64MB`;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;  
                                                                 QUERY PLAN                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=0.00..16925010.00 rows=1000 width=1) (actual time=517.150..521.142 rows=10000 loops=1)  
   Output: (alternatives: SubPlan 1 or hashed SubPlan 2)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=4425  
   SubPlan 1  
     ->  Seq Scan on public.t  (cost=0.00..16925.00 rows=1 width=0) (never executed)  
           Filter: (t.id = n.n)  
   SubPlan 2  
     ->  Seq Scan on public.t t_1  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.027..127.111 rows=1000000 loops=1)  
           Output: t_1.id  
           Buffers: shared hit=4425  
 Planning time: 0.098 ms  
 Execution time: 527.986 ms  
(13 rows)

小結

1、使用= any的寫法，不會走subplan，因此不涉及hash table的問題。和work_mem設定大小無關。效能比較暴力，特別是當它不是在subquery裡面時，效能賊好。

很多場景都可以使用，例如update limit, delete limit(閱後即焚），又或者就是簡單的IN查詢需求。

delete from tbl where ctid = any(array( select ctid from tbl where xxx limit xxx ));  
  
update tbl set xxx=xxx where ctid = any(array( select ctid from tbl where xxx limit xxx ));  
  
select * from tbl where id = any (array( query.... ));

推薦指數，五星。

2、exists，由於優化器會預設它只需要搜尋到1條命中目標就不搜了，所以優化器評估是否使用hash table時，需要的記憶體相對較少，即使較小的work_mem也可能使用hashtable。

推薦指數，四星。

3、in ()，當出現在subquery中時，優化器評估這個subquery是否要構建雜湊TABLE，直接和subquery的大小相關，所以需要較大的work_mem才會選擇使用hashtable。

推薦指數，三星。

最後，由於這些SQL語義都相同，在核心優化時，可以考慮做一些QUERY REWRITE，來優化這樣的SQL。

這樣的話，使用者可以不改SQL，就達到提高效率的目的。

感謝為此付出努力的所有PostgreSQL核心開發的小夥伴們。

React 同構與極致的效能優化
2017-12-25
React優化
sql語句執行順序與效能優化（1）
2018-10-14
SQL優化
sql語句效能優化
2021-01-03
SQL優化
資料庫效能優化-索引與sql相關優化
2018-08-01
資料庫優化索引SQL
SQL語句優化的原則與方法QO
2022-03-21
SQL優化
SQL Azure與SQL Server的異同點總結和歸納
2017-09-29
SQLServer
session與cookie的異同
2014-05-22
SessionCookie
資料庫效能優化之SQL語句優化
2016-01-05
資料庫優化SQL
Oracle sql with 語句語法與例子
2011-10-26
OracleSQL
Golang效能分析與優化
2020-11-30
Golang優化
前端工程與效能優化
2015-09-05
前端優化
Oracle 高效能SQL引擎剖析--SQL優化與調優機制詳解
2018-06-24
OracleSQL優化
oracle效能問題：sql語句優化
2013-06-20
OracleSQL優化
效能分析優化的道與術
2022-04-17
優化
PHP 7 的革新與效能優化
2015-09-20
PHP優化
Nginx安全優化與效能調優
2020-07-25
Nginx優化
ORACLE UPDATE 語句語法與效能分析
2009-05-08
Oracle
JSF與Struts的異同
2007-08-14
JS
sql效能優化
2017-04-24
SQL優化
sql 效能優化
2011-10-08
SQL優化
效能優化案例-SQL優化
2016-12-05
優化SQL
PHP 7革新與效能優化
2015-09-20
PHP優化
Python 與 JavaScript 語法差異點
2019-03-02
PythonJavaScript
SQLServer效能優化之改寫SQL語句
2010-07-28
SQLServer優化
oracle與infomix異同點
2018-12-22
Oracle
[譯] 同中有異的 Webpack 與 Rollup
2017-04-12
Web
MySQL 效能優化之SQL優化
2015-04-08
MySql優化
MySQL 52個SQL效能優化策略SQL語句彙總
2022-01-17
MySql優化
效能優化指南：效能優化的一般性原則與方法
2018-05-25
優化
UIImageView UIView圓角與效能之間的研究與優化
2019-03-01
UIView優化
SQL效能優化技巧
2022-02-14
SQL優化
Sql效能優化梳理
2017-09-07
SQL優化
Oracle SQL效能優化
2013-11-01
OracleSQL優化
FlutterWeb效能優化探索與實踐
2021-12-20
FlutterWeb優化
前端效能優化原理與實踐
2018-10-24
前端優化
babel-polyfill使用與效能優化
2018-06-01
Babel優化
SQL語句的優化
2017-11-14
SQL優化
SQL Server優化之SQL語句優化
2016-01-12
SQLServer優化

PostgreSQLin與=any的SQL語法異同與效能優化

標籤

背景

幾個例子

in vs = any vs exists效能對比

小結

相關文章