greenplum 簡單sql優化案例
資料倉儲剛遷到GP上,發現下面sql語句執行起來並沒有想象中的快,三個表總資料量150萬,結果集70萬條記錄,卻執行了7秒多,感覺有蹊蹺。這裡記錄了整個優化過程。
語句如下:
語句如下:
select
a.pay_date,
c.user_id,
a.order_id,
COALESCE(a.amount,0.00),
d.product_type,
d.base_rate,
d.add_rate,
a.product_period,
c.name,
c.sex,
substr(c.birthday,1,7),
c.phone,
COALESCE(c.province,'未知'),
c.salesman
from CDM.cdm_invest_itemized a,CDM.cdm_user c,CDM.cdm_product d
where a.user_id=c.user_id and a.product_name=d.product_name and a.pay_day= d.startdate and a.pay_day<=d.enddate
執行計劃如下
Gather Motion 4:1 (slice3; segments: 4) (cost=1065.82..1066.07 rows=102 width=334)
Merge Key: "?column15?"
-> Sort (cost=1065.82..1066.07 rows=26 width=334)
Sort Key: a.pay_day
-> Hash Join (cost=656.01..1062.44 rows=26 width=334)
Hash Cond: d.product_name::text = a.product_name::text
Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
-> Seq Scan on cdm_product d (cost=0.00..322.00 rows=5550 width=56)
-> Hash (cost=610.34..610.34 rows=914 width=324)
-> Broadcast Motion 4:4 (slice2; segments: 4) (cost=190.00..610.34 rows=914 width=324)
-> Hash Join (cost=190.00..564.67 rows=229 width=324)
Hash Cond: a.user_id::text = c.user_id::text
-> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.00..353.50 rows=975 width=133)
Hash Key: a.user_id::text
-> Seq Scan on cdm_invest_itemized a (cost=0.00..275.50 rows=975 width=133)
Filter: pay_day::numeric < 20161111::numeric
-> Hash (cost=140.00..140.00 rows=1000 width=239)
-> Seq Scan on cdm_user c (cost=0.00..140.00 rows=1000 width=239)
執行7.5s秒
1、看下統計資訊是否有問題
edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');
relname | relpages | reltuples
---------------------+----------+-----------
cdm_user | 0 | 0
cdm_product | 0 | 0
cdm_invest_itemized | 0 | 0
果然,剛遷移過來的資料,統計資訊還是空的
二話不說,先收集統計資訊
edw=# VACUUM ANALYSE CDM.cdm_product;
edw=# VACUUM ANALYSE CDM.cdm_user;
edw=# VACUUM ANALYSE CDM.cdm_invest_itemized;
edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');
relname | relpages | reltuples
---------------------+----------+-----------
cdm_user | 6760 | 571022
cdm_product | 104 | 21280
cdm_invest_itemized | 6620 | 734028
統計資訊這下正常了,執行時間也降低到了3.8秒
Gather Motion 4:1 (slice3; segments: 4) (cost=52471.97..68149.70 rows=101713 width=191)
-> Hash Join (cost=52471.97..68149.70 rows=25429 width=191)
Hash Cond: c.user_id::text = a.user_id::text
-> Seq Scan on cdm_user c (cost=0.00..12470.22 rows=142756 width=98)
-> Hash (cost=51200.56..51200.56 rows=25429 width=140)
-> Redistribute Motion 4:4 (slice2; segments: 4) (cost=2444.80..51200.56 rows=25429 width=140)
Hash Key: a.user_id::text
-> Hash Join (cost=2444.80..49166.32 rows=25429 width=140)
Hash Cond: a.product_name::text = d.product_name::text
Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
-> Seq Scan on cdm_invest_itemized a (cost=0.00..17630.42 rows=61169 width=131)
Filter: pay_day::numeric < 20161111::numeric
-> Hash (cost=1380.80..1380.80 rows=21280 width=57)
-> Broadcast Motion 4:4 (slice1; segments: 4) (cost=0.00..1380.80 rows=21280 width=57)
-> Seq Scan on cdm_product d (cost=0.00..316.80 rows=5320 width=57)
2、看下執行計劃cdm_product表發生廣播,cdm_invest_itemized發生重分佈,說明三個表的分佈鍵都不相同,看下三個張表的分佈鍵,發現cdm_invest_itemized表是隨機分佈的
edw=# select gp_segment_id,"count"(*) from cdm.cdm_user GROUP BY 1;
gp_segment_id | count
---------------+--------
3 | 143999
0 | 143818
1 | 143931
2 | 143462
(4 rows)
edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;
gp_segment_id | count
---------------+-------
2 | 5322
0 | 5322
3 | 5322
1 | 5322
(4 rows)
edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;
gp_segment_id | count
---------------+--------
2 | 181572
3 | 181580
0 | 181580
1 | 181576
(4 rows)
edw=# \d cdm.cdm_invest_itemized
....
Distributed randomly
edw=# \d cdm.cdm_user
....
Distributed by: (user_id)
edw=# \d cdm.cdm_product
....
Distributed by: (id)
修改分佈鍵
edw=# alter table CDM.cdm_invest_itemized set distributed by(user_id);
edw=# alter table CDM.cdm_product set distributed by(product_name);
再看下資料分佈
edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;
gp_segment_id | count
---------------+--------
3 | 184454
1 | 174896
2 | 186732
0 | 180226
edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;
gp_segment_id | count
---------------+-------
1 | 5288
3 | 5329
0 | 5370
2 | 5301
沒發生嚴重的資料傾斜
再看下執行計劃
Gather Motion 4:1 (slice2; segments: 4) (cost=50272.97..65941.86 rows=101207 width=191)
-> Hash Join (cost=50272.97..65941.86 rows=25302 width=191)
Hash Cond: c.user_id::text = a.user_id::text
-> Seq Scan on cdm_user c (cost=0.00..12470.22 rows=142756 width=98)
-> Hash (cost=49007.88..49007.88 rows=25302 width=140)
-> Hash Join (cost=2445.47..49007.88 rows=25302 width=140)
Hash Cond: a.product_name::text = d.product_name::text
Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
-> Seq Scan on cdm_invest_itemized a (cost=0.00..17613.69 rows=61088 width=131)
Filter: pay_day::numeric < 20161111::numeric
-> Hash (cost=1381.62..1381.62 rows=21277 width=57)
-> Broadcast Motion 4:4 (slice1; segments: 4) (cost=0.00..1381.62 rows=21277 width=57)
-> Seq Scan on cdm_product d (cost=0.00..317.77 rows=5320 width=57)
表cdm_product和表cdm_invest_itemized關聯,分佈鍵不一致,cdm_product 表資料量(21280 )*節點數(4)<cdm_invest_itemized資料量(734028 ),表cdm_product廣播沒毛病;表cdm_invest_itemized和表cdm_user關聯分佈鍵不發生廣播和重分佈,走hash="" join沒毛病最終執行時間優化到2.65秒
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29989552/viewspace-2128307/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- SQL優化案例-單表分頁語句的優化(八)SQL優化
- Oracle效能優化-SQL優化(案例一)Oracle優化SQL
- Oracle效能優化-SQL優化(案例二)Oracle優化SQL
- Oracle效能優化-SQL優化(案例三)Oracle優化SQL
- Oracle效能優化-SQL優化(案例四)Oracle優化SQL
- MySQL SQL優化案例(一)MySql優化
- MySQL效能優化之簡單sql改寫MySql優化
- SQL優化案例-使用with as優化Subquery Unnesting(七)SQL優化
- SQL優化案例-union代替or(九)SQL優化
- sql改寫優化:簡單規則重組實現SQL優化
- Oracle某行系統SQL優化案例(二)OracleSQL優化
- Oracle 某行系統SQL優化案例(一)OracleSQL優化
- Oracle某行系統SQL優化(案例五)OracleSQL優化
- Oracle某行系統SQL優化案例(三)OracleSQL優化
- 對一條基於分割槽的簡單SQL的優化SQL優化
- SQL優化案例-自定義函式索引(五)SQL優化函式索引
- SQL優化案例-正確的使用索引(二)SQL優化索引
- Greenplum 效能優化之路 --(二)儲存格式優化
- HybridDBforPostgreSQL,Greenplum寫入效能優化實踐SQL優化
- SQL優化案例-從執行計劃定位SQL問題(三)SQL優化
- SQL優化之多表關聯查詢-案例一SQL優化
- sql優化 面試必問【簡答】SQL優化面試
- 一個left join SQL 簡單優化分析SQL優化
- Oracle優化案例-單表分頁語句的優化(八)Oracle優化
- 簡單優化容器服務優化
- nginx部署及簡單優化Nginx優化
- SQL優化案例-定位系統中大量的rollback(十八)SQL優化
- Oracle優化案例-復現SQL ordered by Parse Calls(三十二)Oracle優化SQL
- SQL優化案例-分割槽索引之無字首索引(六)SQL優化索引
- SQL Server一次SQL調優案例SQLServer
- storm簡單案例ORM
- 畫江湖之SQL優化 -10大經典案例場景SQL優化
- Oracle優化案例-教你線上搞定top cpu的sql(十二)Oracle優化SQL
- 資料庫優化 - SQL優化資料庫優化SQL
- sql優化之邏輯優化SQL優化
- SQL SERVER優化SQLServer優化
- SQL優化指南SQL優化
- Oracle優化案例-關閉auto space advisor和sql tuning advisor(十九)Oracle優化SQL
- Oracle優化案例-從執行計劃定位SQL問題(三)Oracle優化SQL