PostgreSQLin與=any的SQL語法異同與效能優化
標籤
PostgreSQL , in , = any (array()) , hash table , subplan , initplan
背景
資料庫SQL也算一門比較神奇的語言了,比如很多需求可以有不同的SQL來實現:
我之前有輸出過一個IN的測試,這裡面實際上也涉及到多個語法,實現同一個功能點。測試CASE是1億 in 100萬的多種寫法的效能差異。
《HTAP資料庫 PostgreSQL 場景與效能測試之 25 – (OLTP) IN , EXISTS 查詢》
例如下面三個QUERY的語義就是一樣的
select * from tbl where id in (select id from t);
select * from tbl where exists (select 1 from t where t.id=tbl.id);
select * from tbl where id = any (array( select id from t ));
但是不同的SQL,資料庫可能會選擇不一樣的執行計劃,並且執行效率可能千差萬別。
幾個例子
1、建立測試表,模擬1萬 IN 100萬的操作。
postgres=# create table t(id int);
CREATE TABLE
postgres=# insert into t select generate_series(1,100*10000);
INSERT 0 1000000
2、我們看一看不同寫法的執行計劃如何:
postgres=# explain select n = any(array(select id from t)) from generate_series(1,10000) as n;
QUERY PLAN
---------------------------------------------------------------------------------
Function Scan on generate_series n (cost=14425.00..14447.50 rows=1000 width=1)
InitPlan 1 (returns $0)
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
(3 rows)
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;
QUERY PLAN
---------------------------------------------------------------------------------
Function Scan on generate_series n (cost=16925.00..16937.50 rows=1000 width=1)
SubPlan 1
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
(3 rows)
3、你會發現兩個語法用了不同的執行計劃,一個是InitPlan, 一個是SubPlan.
對於IN的寫法,work_mem引數會直接影響效能,work_mem的大小決定了subquery是否要裝載到hash table。
postgres=# set work_mem =`1MB`;
SET
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;
QUERY PLAN
--------------------------------------------------------------------------------
Function Scan on generate_series n (cost=0.00..12916012.50 rows=1000 width=1)
SubPlan 1
-> Materialize (cost=0.00..23332.00 rows=1000000 width=4)
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
(4 rows)
postgres=# set work_mem =`100MB`;
SET
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;
QUERY PLAN
---------------------------------------------------------------------------------
Function Scan on generate_series n (cost=16925.00..16937.50 rows=1000 width=1)
SubPlan 1
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
(3 rows)
if (subquery)
{
/* Generate Paths for the ANY subquery; we`ll need all rows */
subroot = subquery_planner(root->glob, subquery,
root,
false, 0.0);
/* Isolate the params needed by this specific subplan */
plan_params = root->plan_params;
root->plan_params = NIL;
/* Select best Path and turn it into a Plan */
final_rel = fetch_upper_rel(subroot, UPPERREL_FINAL, NULL);
best_path = final_rel->cheapest_total_path;
plan = create_plan(subroot, best_path);
/* Now we can check if it`ll fit in work_mem */
/* XXX can we check this at the Path stage? */
if (subplan_is_hashable(plan))
{
SubPlan *hashplan;
AlternativeSubPlan *asplan;
/* OK, convert to SubPlan format. */
hashplan = castNode(SubPlan,
build_subplan(root, plan, subroot,
plan_params,
ANY_SUBLINK, 0,
newtestexpr,
false, true));
/* Check we got what we expected */
Assert(hashplan->parParam == NIL);
Assert(hashplan->useHashTable);
/* build_subplan won`t have filled in paramIds */
hashplan->paramIds = paramIds;
/* Leave it to the executor to decide which plan to use */
asplan = makeNode(AlternativeSubPlan);
asplan->subplans = list_make2(result, hashplan);
result = (Node *) asplan;
}
}
/*
* subplan_is_hashable: can we implement an ANY subplan by hashing?
*/
static bool
subplan_is_hashable(Plan *plan)
{
double subquery_size;
/*
* The estimated size of the subquery result must fit in work_mem. (Note:
* we use heap tuple overhead here even though the tuples will actually be
* stored as MinimalTuples; this provides some fudge factor for hashtable
* overhead.)
*/
subquery_size = plan->plan_rows *
(MAXALIGN(plan->plan_width) + MAXALIGN(SizeofHeapTupleHeader));
if (subquery_size > work_mem * 1024L)
return false;
return true;
}
程式碼裡面註釋中,針對in, exists, any的subplan優化器實現也有一些介紹,涉及到效能相關:
實際上exists這裡有提到,匹配到第一條就結束,所以評估是否使用雜湊表時可能需要的容量很小。
/*
* For an EXISTS subplan, tell lower-level planner to expect that only the
* first tuple will be retrieved. For ALL and ANY subplans, we will be
* able to stop evaluating if the test condition fails or matches, so very
* often not all the tuples will be retrieved; for lack of a better idea,
* specify 50% retrieval. For EXPR, MULTIEXPR, and ROWCOMPARE subplans,
* use default behavior (we`re only expecting one row out, anyway).
*
* NOTE: if you change these numbers, also change cost_subplan() in
* path/costsize.c.
*
* XXX If an ANY subplan is uncorrelated, build_subplan may decide to hash
* its output. In that case it would`ve been better to specify full
* retrieval. At present, however, we can only check hashability after
* we`ve made the subplan :-(. (Determining whether it`ll fit in work_mem
* is the really hard part.) Therefore, we don`t want to be too
* optimistic about the percentage of tuples retrieved, for fear of
* selecting a plan that`s bad for the materialization case.
*/
in vs = any vs exists效能對比
1、in, work_mem裝不下subquery
postgres=# set work_mem =`64kB`;
postgres=# explain analyze select n in (select id from t) from generate_series(1,10000) as n;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series n (cost=0.00..12916012.50 rows=1000 width=1) (actual time=1.321..11484.646 rows=10000 loops=1)
SubPlan 1
-> Materialize (cost=0.00..23332.00 rows=1000000 width=4) (actual time=0.003..0.619 rows=5000 loops=10000)
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..1.800 rows=10000 loops=1)
Planning time: 0.091 ms
Execution time: 11485.905 ms
(6 rows)
2、in, work_mem裝下了subquery
postgres=# set work_mem =`64MB`;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select n in (select id from t) from generate_series(1,10000) as n;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series n (cost=16925.00..16937.50 rows=1000 width=1) (actual time=497.142..500.701 rows=10000 loops=1)
Output: (hashed SubPlan 1)
Function Call: generate_series(1, 10000)
Buffers: shared hit=4425
SubPlan 1
-> Seq Scan on public.t (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.024..124.703 rows=1000000 loops=1)
Output: t.id
Buffers: shared hit=4425
Planning time: 0.085 ms
Execution time: 507.427 ms
(10 rows)
3、= any, work_mem很小無所謂,因為不涉及hashtable
postgres=# set work_mem =`64kB`;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select n = any(array(select id from t)) from generate_series(1,10000) as n;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series n (cost=14425.00..14447.50 rows=1000 width=1) (actual time=233.871..446.120 rows=10000 loops=1)
Output: (n.n = ANY ($0))
Function Call: generate_series(1, 10000)
Buffers: shared hit=4425, temp read=19 written=18
InitPlan 1 (returns $0)
-> Seq Scan on public.t (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..119.976 rows=1000000 loops=1)
Output: t.id
Buffers: shared hit=4425
Planning time: 0.085 ms
Execution time: 447.666 ms
(10 rows)
4、exists, work_mem需求量較少(exists由於優化器在匹配到1條後即刻返回,所以會選擇使用索引,效能就非常好。)
postgres=# set work_mem =`64kB`;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series n (cost=0.00..2852.50 rows=1000 width=1) (actual time=1.172..18.893 rows=10000 loops=1)
Output: (SubPlan 1)
Function Call: generate_series(1, 10000)
Buffers: shared hit=40027, temp read=19 written=18
SubPlan 1
-> Index Only Scan using idx_t_1 on public.t (cost=0.42..2.84 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=10000)
Index Cond: (t.id = n.n)
Heap Fetches: 10000
Buffers: shared hit=40027
Planning time: 0.118 ms
Execution time: 19.902 ms
(11 rows)
postgres=# set work_mem =`64MB`;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series n (cost=0.00..2852.50 rows=1000 width=1) (actual time=0.642..17.635 rows=10000 loops=1)
Output: (alternatives: SubPlan 1 or hashed SubPlan 2)
Function Call: generate_series(1, 10000)
Buffers: shared hit=40027
SubPlan 1
-> Index Only Scan using idx_t_1 on public.t (cost=0.42..2.84 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=10000)
Index Cond: (t.id = n.n)
Heap Fetches: 10000
Buffers: shared hit=40027
SubPlan 2
-> Seq Scan on public.t t_1 (cost=0.00..14425.00 rows=1000000 width=4) (never executed)
Output: t_1.id
Planning time: 0.129 ms
Execution time: 18.612 ms
(14 rows)
5、如果把索引幹掉,exists效能就會下降了,同時效能也和是否使用雜湊表有關。
postgres=# drop index idx_t_1;
postgres=# set work_mem =`64kB`;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series n (cost=0.00..16925010.00 rows=1000 width=1) (actual time=1.072..3036.590 rows=10000 loops=1)
Output: (SubPlan 1)
Function Call: generate_series(1, 10000)
Buffers: shared hit=226260, temp read=19 written=18
SubPlan 1
-> Seq Scan on public.t (cost=0.00..16925.00 rows=1 width=0) (actual time=0.303..0.303 rows=1 loops=10000)
Filter: (t.id = n.n)
Rows Removed by Filter: 5000
Buffers: shared hit=226260
Planning time: 0.087 ms
Execution time: 3037.904 ms
(11 rows)
postgres=# set work_mem =`64MB`;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series n (cost=0.00..16925010.00 rows=1000 width=1) (actual time=517.150..521.142 rows=10000 loops=1)
Output: (alternatives: SubPlan 1 or hashed SubPlan 2)
Function Call: generate_series(1, 10000)
Buffers: shared hit=4425
SubPlan 1
-> Seq Scan on public.t (cost=0.00..16925.00 rows=1 width=0) (never executed)
Filter: (t.id = n.n)
SubPlan 2
-> Seq Scan on public.t t_1 (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.027..127.111 rows=1000000 loops=1)
Output: t_1.id
Buffers: shared hit=4425
Planning time: 0.098 ms
Execution time: 527.986 ms
(13 rows)
小結
1、使用= any的寫法,不會走subplan,因此不涉及hash table的問題。和work_mem設定大小無關。效能比較暴力,特別是當它不是在subquery裡面時,效能賊好。
很多場景都可以使用,例如update limit, delete limit(閱後即焚),又或者就是簡單的IN查詢需求。
delete from tbl where ctid = any(array( select ctid from tbl where xxx limit xxx ));
update tbl set xxx=xxx where ctid = any(array( select ctid from tbl where xxx limit xxx ));
select * from tbl where id = any (array( query.... ));
推薦指數,五星。
2、exists,由於優化器會預設它只需要搜尋到1條命中目標就不搜了,所以優化器評估是否使用hash table時,需要的記憶體相對較少,即使較小的work_mem也可能使用hashtable。
推薦指數,四星。
3、in (),當出現在subquery中時,優化器評估這個subquery是否要構建雜湊TABLE,直接和subquery的大小相關,所以需要較大的work_mem才會選擇使用hashtable。
推薦指數,三星。
最後,由於這些SQL語義都相同,在核心優化時,可以考慮做一些QUERY REWRITE,來優化這樣的SQL。
這樣的話,使用者可以不改SQL,就達到提高效率的目的。
感謝為此付出努力的所有PostgreSQL核心開發的小夥伴們。
相關文章
- React 同構與極致的效能優化React優化
- sql語句執行順序與效能優化(1)SQL優化
- sql語句效能優化SQL優化
- 資料庫效能優化-索引與sql相關優化資料庫優化索引SQL
- SQL語句優化的原則與方法QOSQL優化
- SQL Azure與SQL Server的異同點總結和歸納SQLServer
- session與cookie的異同SessionCookie
- 資料庫效能優化之SQL語句優化資料庫優化SQL
- Oracle sql with 語句語法與例子OracleSQL
- Golang效能分析與優化Golang優化
- 前端工程與效能優化前端優化
- Oracle 高效能SQL引擎剖析--SQL優化與調優機制詳解OracleSQL優化
- oracle效能問題:sql語句優化OracleSQL優化
- 效能分析優化的道與術優化
- PHP 7 的革新與效能優化PHP優化
- Nginx安全優化與效能調優Nginx優化
- ORACLE UPDATE 語句語法與效能分析Oracle
- JSF與Struts的異同JS
- sql效能優化SQL優化
- sql 效能優化SQL優化
- 效能優化案例-SQL優化優化SQL
- PHP 7革新與效能優化PHP優化
- Python 與 JavaScript 語法差異點PythonJavaScript
- SQLServer效能優化之改寫SQL語句SQLServer優化
- oracle與infomix異同點Oracle
- [譯] 同中有異的 Webpack 與 RollupWeb
- MySQL 效能優化之SQL優化MySql優化
- MySQL 52個SQL效能優化策略SQL語句彙總MySql優化
- 效能優化指南:效能優化的一般性原則與方法優化
- UIImageView UIView圓角與效能之間的研究與優化UIView優化
- SQL效能優化技巧SQL優化
- Sql效能優化梳理SQL優化
- Oracle SQL效能優化OracleSQL優化
- FlutterWeb效能優化探索與實踐FlutterWeb優化
- 前端效能優化原理與實踐前端優化
- babel-polyfill使用與效能優化Babel優化
- SQL語句的優化SQL優化
- SQL Server優化之SQL語句優化SQLServer優化