greenplum 簡單sql優化案例

orclwujian發表於2016-11-11

SQL優化

資料倉儲剛遷到GP上，發現下面sql語句執行起來並沒有想象中的快，三個表總資料量150萬，結果集70萬條記錄，卻執行了7秒多，感覺有蹊蹺。這裡記錄了整個優化過程。
語句如下：

select

a.pay_date,

c.user_id,

a.order_id,

COALESCE(a.amount,0.00),

d.product_type,

d.base_rate,

d.add_rate,

a.product_period,

c.name,

c.sex,

substr(c.birthday,1,7),

c.phone,

COALESCE(c.province,'未知'),

c.salesman

from CDM.cdm_invest_itemized a,CDM.cdm_user c,CDM.cdm_product d

where a.user_id=c.user_id and a.product_name=d.product_name and a.pay_day= d.startdate and a.pay_day<=d.enddate

執行計劃如下

Gather Motion 4:1 (slice3; segments: 4) (cost=1065.82..1066.07 rows=102 width=334)

Merge Key: "?column15?"

-> Sort (cost=1065.82..1066.07 rows=26 width=334)

Sort Key: a.pay_day

-> Hash Join (cost=656.01..1062.44 rows=26 width=334)

Hash Cond: d.product_name::text = a.product_name::text

Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate

-> Seq Scan on cdm_product d (cost=0.00..322.00 rows=5550 width=56)

-> Hash (cost=610.34..610.34 rows=914 width=324)

-> Broadcast Motion 4:4 (slice2; segments: 4) (cost=190.00..610.34 rows=914 width=324)

-> Hash Join (cost=190.00..564.67 rows=229 width=324)

Hash Cond: a.user_id::text = c.user_id::text

-> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.00..353.50 rows=975 width=133)

Hash Key: a.user_id::text

-> Seq Scan on cdm_invest_itemized a (cost=0.00..275.50 rows=975 width=133)

Filter: pay_day::numeric < 20161111::numeric

-> Hash (cost=140.00..140.00 rows=1000 width=239)

-> Seq Scan on cdm_user c (cost=0.00..140.00 rows=1000 width=239)

執行7.5s秒

1、看下統計資訊是否有問題

edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');

relname | relpages | reltuples

---------------------+----------+-----------

cdm_user | 0 | 0

cdm_product | 0 | 0

cdm_invest_itemized | 0 | 0

果然，剛遷移過來的資料，統計資訊還是空的

二話不說，先收集統計資訊

edw=# VACUUM ANALYSE CDM.cdm_product;

edw=# VACUUM ANALYSE CDM.cdm_user;

edw=# VACUUM ANALYSE CDM.cdm_invest_itemized;

edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');

relname | relpages | reltuples

---------------------+----------+-----------

cdm_user | 6760 | 571022

cdm_product | 104 | 21280

cdm_invest_itemized | 6620 | 734028

統計資訊這下正常了，執行時間也降低到了3.8秒

Gather Motion 4:1 (slice3; segments: 4) (cost=52471.97..68149.70 rows=101713 width=191)

-> Hash Join (cost=52471.97..68149.70 rows=25429 width=191)

Hash Cond: c.user_id::text = a.user_id::text

-> Seq Scan on cdm_user c (cost=0.00..12470.22 rows=142756 width=98)

-> Hash (cost=51200.56..51200.56 rows=25429 width=140)

-> Redistribute Motion 4:4 (slice2; segments: 4) (cost=2444.80..51200.56 rows=25429 width=140)

Hash Key: a.user_id::text

-> Hash Join (cost=2444.80..49166.32 rows=25429 width=140)

Hash Cond: a.product_name::text = d.product_name::text

Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate

-> Seq Scan on cdm_invest_itemized a (cost=0.00..17630.42 rows=61169 width=131)

Filter: pay_day::numeric < 20161111::numeric

-> Hash (cost=1380.80..1380.80 rows=21280 width=57)

-> Broadcast Motion 4:4 (slice1; segments: 4) (cost=0.00..1380.80 rows=21280 width=57)

-> Seq Scan on cdm_product d (cost=0.00..316.80 rows=5320 width=57)

2、看下執行計劃cdm_product表發生廣播，cdm_invest_itemized發生重分佈，說明三個表的分佈鍵都不相同，看下三個張表的分佈鍵，發現cdm_invest_itemized表是隨機分佈的

edw=# select gp_segment_id,"count"(*) from cdm.cdm_user GROUP BY 1;

gp_segment_id | count

---------------+--------

3 | 143999

0 | 143818

1 | 143931

2 | 143462

(4 rows)

edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;

gp_segment_id | count

---------------+-------

2 | 5322

0 | 5322

3 | 5322

1 | 5322

(4 rows)

edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;

gp_segment_id | count

---------------+--------

2 | 181572

3 | 181580

0 | 181580

1 | 181576

(4 rows)

edw=# \d cdm.cdm_invest_itemized

....

Distributed randomly

edw=# \d cdm.cdm_user

....

Distributed by: (user_id)

edw=# \d cdm.cdm_product

....

Distributed by: (id)

修改分佈鍵

edw=# alter table CDM.cdm_invest_itemized set distributed by(user_id);

edw=# alter table CDM.cdm_product set distributed by(product_name);

再看下資料分佈

edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;

gp_segment_id | count

---------------+--------

3 | 184454

1 | 174896

2 | 186732

0 | 180226

edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;

gp_segment_id | count

---------------+-------

1 | 5288

3 | 5329

0 | 5370

2 | 5301

沒發生嚴重的資料傾斜

再看下執行計劃

Gather Motion 4:1 (slice2; segments: 4) (cost=50272.97..65941.86 rows=101207 width=191)

-> Hash Join (cost=50272.97..65941.86 rows=25302 width=191)

Hash Cond: c.user_id::text = a.user_id::text

-> Seq Scan on cdm_user c (cost=0.00..12470.22 rows=142756 width=98)

-> Hash (cost=49007.88..49007.88 rows=25302 width=140)

-> Hash Join (cost=2445.47..49007.88 rows=25302 width=140)

Hash Cond: a.product_name::text = d.product_name::text

Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate

-> Seq Scan on cdm_invest_itemized a (cost=0.00..17613.69 rows=61088 width=131)

Filter: pay_day::numeric < 20161111::numeric

-> Hash (cost=1381.62..1381.62 rows=21277 width=57)

-> Broadcast Motion 4:4 (slice1; segments: 4) (cost=0.00..1381.62 rows=21277 width=57)

-> Seq Scan on cdm_product d (cost=0.00..317.77 rows=5320 width=57)

表cdm_product和表cdm_invest_itemized關聯，分佈鍵不一致，cdm_product 表資料量（21280 ）*節點數（4）<cdm_invest_itemized資料量（734028 ），表cdm_product廣播沒毛病；表cdm_invest_itemized和表cdm_user關聯分佈鍵不發生廣播和重分佈，走hash="" join沒毛病最終執行時間優化到2.65秒

來自 “ ITPUB部落格 ” ，連結：http://blog.itpub.net/29989552/viewspace-2128307/，如需轉載，請註明出處，否則將追究法律責任。

SQL優化案例-單表分頁語句的優化（八）
2018-11-28
SQL優化
Oracle效能優化-SQL優化(案例一)
2020-02-17
Oracle優化SQL
Oracle效能優化-SQL優化(案例二)
2020-02-17
Oracle優化SQL
Oracle效能優化-SQL優化(案例三)
2020-02-18
Oracle優化SQL
Oracle效能優化-SQL優化(案例四)
2020-02-18
Oracle優化SQL
MySQL SQL優化案例(一)
2021-11-30
MySql優化
MySQL效能優化之簡單sql改寫
2019-11-12
MySql優化
SQL優化案例-使用with as優化Subquery Unnesting（七）
2018-11-28
SQL優化
SQL優化案例-union代替or（九）
2018-11-28
SQL優化
sql改寫優化：簡單規則重組實現
2021-12-26
SQL優化
Oracle某行系統SQL優化案例(二)
2021-06-07
OracleSQL優化
Oracle 某行系統SQL優化案例(一)
2021-03-31
OracleSQL優化
Oracle某行系統SQL優化(案例五)
2021-08-22
OracleSQL優化
Oracle某行系統SQL優化案例(三)
2021-08-13
OracleSQL優化
對一條基於分割槽的簡單SQL的優化
2019-03-01
SQL優化
SQL優化案例-自定義函式索引（五）
2018-11-28
SQL優化函式索引
SQL優化案例-正確的使用索引（二）
2018-06-15
SQL優化索引
Greenplum 效能優化之路 --（二）儲存格式
2020-09-29
優化
HybridDBforPostgreSQL,Greenplum寫入效能優化實踐
2018-10-05
SQL優化
SQL優化案例-從執行計劃定位SQL問題（三）
2018-11-28
SQL優化
SQL優化之多表關聯查詢-案例一
2019-08-19
SQL優化
sql優化面試必問【簡答】
2020-11-26
SQL優化面試
一個left join SQL 簡單優化分析
2018-12-04
SQL優化
Oracle優化案例-單表分頁語句的優化（八）
2018-11-21
Oracle優化
簡單優化容器服務
2019-04-27
優化
nginx部署及簡單優化
2018-08-30
Nginx優化
SQL優化案例-定位系統中大量的rollback（十八）
2019-07-01
SQL優化
Oracle優化案例-復現SQL ordered by Parse Calls（三十二）
2021-06-07
Oracle優化SQL
SQL優化案例-分割槽索引之無字首索引（六）
2018-08-21
SQL優化索引
SQL Server一次SQL調優案例
2019-10-10
SQLServer
storm簡單案例
2020-11-27
ORM
畫江湖之SQL優化 -10大經典案例場景
2021-09-16
SQL優化
Oracle優化案例-教你線上搞定top cpu的sql（十二）
2018-12-18
Oracle優化SQL
資料庫優化 - SQL優化
2020-01-11
資料庫優化SQL
sql優化之邏輯優化
2018-12-12
SQL優化
SQL SERVER優化
2018-10-30
SQLServer優化
SQL優化指南
2018-12-01
SQL優化
Oracle優化案例-關閉auto space advisor和sql tuning advisor（十九）
2019-08-09
Oracle優化SQL
Oracle優化案例-從執行計劃定位SQL問題（三）
2018-07-01
Oracle優化SQL

greenplum 簡單sql優化案例

相關文章