greenplum 簡單sql優化案例

orclwujian發表於2016-11-11
資料倉儲剛遷到GP上,發現下面sql語句執行起來並沒有想象中的快,三個表總資料量150萬,結果集70萬條記錄,卻執行了7秒多,感覺有蹊蹺。這裡記錄了整個優化過程。
語句如下:
select
a.pay_date,
c.user_id,
a.order_id,
COALESCE(a.amount,0.00),
d.product_type,
d.base_rate,
d.add_rate,
a.product_period,
c.name,
c.sex,
substr(c.birthday,1,7),
c.phone,
COALESCE(c.province,'未知'),
c.salesman
from CDM.cdm_invest_itemized a,CDM.cdm_user c,CDM.cdm_product d
where  a.user_id=c.user_id and a.product_name=d.product_name and a.pay_day= d.startdate and a.pay_day<=d.enddate

執行計劃如下
Gather Motion 4:1  (slice3; segments: 4)  (cost=1065.82..1066.07 rows=102 width=334)
  Merge Key: "?column15?"
  ->  Sort  (cost=1065.82..1066.07 rows=26 width=334)
        Sort Key: a.pay_day
        ->  Hash Join  (cost=656.01..1062.44 rows=26 width=334)
              Hash Cond: d.product_name::text = a.product_name::text
              Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
              ->  Seq Scan on cdm_product d  (cost=0.00..322.00 rows=5550 width=56)
              ->  Hash  (cost=610.34..610.34 rows=914 width=324)
                    ->  Broadcast Motion 4:4  (slice2; segments: 4)  (cost=190.00..610.34 rows=914 width=324)
                          ->  Hash Join  (cost=190.00..564.67 rows=229 width=324)
                                Hash Cond: a.user_id::text = c.user_id::text
                                ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.00..353.50 rows=975 width=133)
                                      Hash Key: a.user_id::text
                                      ->  Seq Scan on cdm_invest_itemized a  (cost=0.00..275.50 rows=975 width=133)
                                            Filter: pay_day::numeric < 20161111::numeric
                                ->  Hash  (cost=140.00..140.00 rows=1000 width=239)
                                      ->  Seq Scan on cdm_user c  (cost=0.00..140.00 rows=1000 width=239)

執行7.5s秒
1、看下統計資訊是否有問題
edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');
       relname       | relpages | reltuples
---------------------+----------+-----------
 cdm_user            |        0 |        0
 cdm_product         |        0 |        0
 cdm_invest_itemized |        0 |        0
果然,剛遷移過來的資料,統計資訊還是空的
二話不說,先收集統計資訊
edw=# VACUUM ANALYSE CDM.cdm_product;
edw=# VACUUM ANALYSE CDM.cdm_user;
edw=# VACUUM ANALYSE CDM.cdm_invest_itemized;
edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');
       relname       | relpages | reltuples
---------------------+----------+-----------
 cdm_user            |     6760 |    571022
 cdm_product         |      104 |     21280
 cdm_invest_itemized |     6620 |    734028
統計資訊這下正常了,執行時間也降低到了3.8秒
Gather Motion 4:1  (slice3; segments: 4)  (cost=52471.97..68149.70 rows=101713 width=191)
  ->  Hash Join  (cost=52471.97..68149.70 rows=25429 width=191)
        Hash Cond: c.user_id::text = a.user_id::text
        ->  Seq Scan on cdm_user c  (cost=0.00..12470.22 rows=142756 width=98)
        ->  Hash  (cost=51200.56..51200.56 rows=25429 width=140)
              ->  Redistribute Motion 4:4  (slice2; segments: 4)  (cost=2444.80..51200.56 rows=25429 width=140)
                    Hash Key: a.user_id::text
                    ->  Hash Join  (cost=2444.80..49166.32 rows=25429 width=140)
                          Hash Cond: a.product_name::text = d.product_name::text
                          Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
                          ->  Seq Scan on cdm_invest_itemized a  (cost=0.00..17630.42 rows=61169 width=131)
                                Filter: pay_day::numeric < 20161111::numeric
                          ->  Hash  (cost=1380.80..1380.80 rows=21280 width=57)
                                ->  Broadcast Motion 4:4  (slice1; segments: 4)  (cost=0.00..1380.80 rows=21280 width=57)
                                      ->  Seq Scan on cdm_product d  (cost=0.00..316.80 rows=5320 width=57)

2、看下執行計劃cdm_product表發生廣播,cdm_invest_itemized發生重分佈,說明三個表的分佈鍵都不相同,看下三個張表的分佈鍵,發現cdm_invest_itemized表是隨機分佈的

edw=# select gp_segment_id,"count"(*) from cdm.cdm_user GROUP BY 1;
 gp_segment_id | count 
---------------+--------
             3 | 143999
             0 | 143818
             1 | 143931
             2 | 143462
(4 rows)

edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;
 gp_segment_id | count
---------------+-------
             2 |  5322
             0 |  5322
             3 |  5322
             1 |  5322
(4 rows)

edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;
 gp_segment_id | count 
---------------+--------
             2 | 181572
             3 | 181580
             0 | 181580
             1 | 181576
(4 rows)

edw=# \d cdm.cdm_invest_itemized
....
Distributed randomly

edw=# \d cdm.cdm_user
....
Distributed by: (user_id)

edw=# \d cdm.cdm_product
....
Distributed by: (id)

修改分佈鍵
edw=# alter table CDM.cdm_invest_itemized set distributed by(user_id);
edw=# alter table CDM.cdm_product set distributed by(product_name);  
再看下資料分佈
edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;
 gp_segment_id | count 
---------------+--------
             3 | 184454
             1 | 174896
             2 | 186732
             0 | 180226
edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;
 gp_segment_id | count
---------------+-------
             1 |  5288
             3 |  5329
             0 |  5370
             2 |  5301

沒發生嚴重的資料傾斜

再看下執行計劃
Gather Motion 4:1  (slice2; segments: 4)  (cost=50272.97..65941.86 rows=101207 width=191)
  ->  Hash Join  (cost=50272.97..65941.86 rows=25302 width=191)
        Hash Cond: c.user_id::text = a.user_id::text
        ->  Seq Scan on cdm_user c  (cost=0.00..12470.22 rows=142756 width=98)
        ->  Hash  (cost=49007.88..49007.88 rows=25302 width=140)
              ->  Hash Join  (cost=2445.47..49007.88 rows=25302 width=140)
                    Hash Cond: a.product_name::text = d.product_name::text
                    Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
                    ->  Seq Scan on cdm_invest_itemized a  (cost=0.00..17613.69 rows=61088 width=131)
                          Filter: pay_day::numeric < 20161111::numeric
                    ->  Hash  (cost=1381.62..1381.62 rows=21277 width=57)
                          ->  Broadcast Motion 4:4  (slice1; segments: 4)  (cost=0.00..1381.62 rows=21277 width=57)
                                ->  Seq Scan on cdm_product d  (cost=0.00..317.77 rows=5320 width=57)
表cdm_product和表cdm_invest_itemized關聯,分佈鍵不一致,cdm_product 表資料量(21280 )*節點數(4)<cdm_invest_itemized資料量(734028 ),表cdm_product廣播沒毛病;表cdm_invest_itemized和表cdm_user關聯分佈鍵不發生廣播和重分佈,走hash="" join沒毛病最終執行時間優化到2.65秒

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29989552/viewspace-2128307/,如需轉載,請註明出處,否則將追究法律責任。

相關文章