對含distinct操作的SQL的優化

space6212發表於2019-02-15
今天幫同事優化了一條含distinct操作的SQL,因為這個SQL在資料倉儲環境中非常經典,因此有比較拿出來說一說。

[@more@]原SQL:

sys@dwrac1> select count(distinct a.ip)
2 from (select t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | | 1032K (4)| 04:00:50 | | |
| 1 | SORT GROUP BY | | 1 | 40 | | | | | |
|* 2 | HASH JOIN | | 3533K| 134M| 107M| 1032K (4)| 04:00:50 | | |
| 3 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 4 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 5 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
---------------------------------------------------------------------------------------------------------------------------------

從執行計劃可以看到,Oracle先掃描兩個表的部分分割槽,在用hash連線,最後做distinct操作。
在這個案例中,這樣是有很大問題的,因為兩個結果集的IP重複度都比較高,根據a.ip=b.ip關聯後結果集劇烈膨脹,導致需要distinct的記錄集太大。
知道原因,那麼就可以想辦法避免了。我們的目標很簡單,就是讓關聯後的記錄集減少。
這個例子很明顯就可以先分別對兩個結果集做排重後再關聯,於是得到第一個優化後SQL:

sys@dwrac1> select count(distinct a.ip)
2 from (select distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | | 1032K (4)| 04:00:50 | | |
| 1 | SORT GROUP BY | | 1 | 40 | | | | | |
|* 2 | HASH JOIN | | 3533K| 134M| 107M| 1032K (4)| 04:00:50 | | |
| 3 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 4 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 5 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
---------------------------------------------------------------------------------------------------------------------------------

從執行計劃看到,雖然我們改寫了SQL,但是執行計劃沒有改變,這是因為CBO“聰明”地把兩個子查詢merge進去了。
我們可以用no_merge禁止子查詢的merge操作,於是得到第二次優化的SQL:


sys@dwrac1> select count(distinct a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1087K (4)| 04:13:49 | | |
| 1 | SORT GROUP BY | | 1 | 26 | | | | | |
| 2 | MERGE JOIN | | 3533K| 87M| | 1087K (4)| 04:13:49 | | |
| 3 | SORT JOIN | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 4 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 5 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 6 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 7 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 8 | SORT JOIN | | 3533K| 43M| 135M| 989K (4)| 03:50:48 | | |
| 9 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 10 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 11 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 12 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
------------------------------------------------------------------------------------------------------------------------------------

從執行計劃看出,no_merge已經生效了,但是目前還不是最優的,因為對一個子查詢都做了太多的工作,既要hash unique又要sort,最後兩個記錄集關聯還是用merge join。
實際上,這種關聯用hash join更合適,可以強制它們走hash join,於是得到第三個優化SQL:

sys@dwrac1> select /*+ use_hash(a,b) */count(distinct a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;


-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1095K (4)| 04:15:35 | | |
| 1 | SORT GROUP BY | | 1 | 26 | | | | | |
|* 2 | HASH JOIN | | 3533K| 87M| 84M| 1095K (4)| 04:15:35 | | |
| 3 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 4 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 5 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 7 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 8 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 9 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 10 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------

從執行計劃看,已經基本達到我們的要求了,但是還不夠完美,因為我們對兩個記錄集做完排重後,關聯出來的記錄肯定是沒有重複的了,因此可以把最外層的distinct去掉,於是得到第四個優化的SQL:

sys@dwrac1> select /*+ use_hash(a,b) */count(a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;


-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1095K (4)| 04:15:35 | | |
| 1 | SORT AGGREGATE | | 1 | 26 | | | | | |
|* 2 | HASH JOIN | | 3533K| 87M| 84M| 1095K (4)| 04:15:35 | | |
| 3 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 4 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 5 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 7 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 8 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 9 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 10 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------

原SQL執行超過幾小時都出不來,最終優化過的SQL 只需要不到1分鐘就得到結果,優化效果非常顯著。

以上步驟是我做優化的思路,從優化的步驟看,要優化,不但需要對技術瞭解,更需要對資料分佈瞭解!
在資料倉儲中,我們可能經常會碰到類似的SQL,對此,我們的處理原則是:先排重,再關聯,儘可能避免中間結果集膨脹。

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/231499/viewspace-1047939/,如需轉載,請註明出處,否則將追究法律責任。

相關文章