對含distinct操作的SQL的優化

今天幫同事優化了一條含distinct操作的SQL，因為這個SQL在資料倉儲環境中非常經典，因此有比較拿出來說一說。

[@more@]原SQL：

sys@dwrac1> select count(distinct a.ip)
2 from (select t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | | 1032K (4)| 04:00:50 | | |
| 1 | SORT GROUP BY | | 1 | 40 | | | | | |
|* 2 | HASH JOIN | | 3533K| 134M| 107M| 1032K (4)| 04:00:50 | | |
| 3 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 4 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 5 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
---------------------------------------------------------------------------------------------------------------------------------

從執行計劃可以看到，Oracle先掃描兩個表的部分分割槽，在用hash連線，最後做distinct操作。
在這個案例中，這樣是有很大問題的，因為兩個結果集的IP重複度都比較高，根據a.ip=b.ip關聯後結果集劇烈膨脹，導致需要distinct的記錄集太大。
知道原因，那麼就可以想辦法避免了。我們的目標很簡單，就是讓關聯後的記錄集減少。
這個例子很明顯就可以先分別對兩個結果集做排重後再關聯，於是得到第一個優化後SQL：

sys@dwrac1> select count(distinct a.ip)
2 from (select distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | | 1032K (4)| 04:00:50 | | |
| 1 | SORT GROUP BY | | 1 | 40 | | | | | |
|* 2 | HASH JOIN | | 3533K| 134M| 107M| 1032K (4)| 04:00:50 | | |
| 3 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 4 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 5 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
---------------------------------------------------------------------------------------------------------------------------------

從執行計劃看到，雖然我們改寫了SQL，但是執行計劃沒有改變，這是因為CBO“聰明”地把兩個子查詢merge進去了。
我們可以用no_merge禁止子查詢的merge操作，於是得到第二次優化的SQL：

sys@dwrac1> select count(distinct a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1087K (4)| 04:13:49 | | |
| 1 | SORT GROUP BY | | 1 | 26 | | | | | |
| 2 | MERGE JOIN | | 3533K| 87M| | 1087K (4)| 04:13:49 | | |
| 3 | SORT JOIN | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 4 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 5 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 6 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 7 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 8 | SORT JOIN | | 3533K| 43M| 135M| 989K (4)| 03:50:48 | | |
| 9 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 10 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 11 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 12 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
------------------------------------------------------------------------------------------------------------------------------------

從執行計劃看出，no_merge已經生效了，但是目前還不是最優的，因為對一個子查詢都做了太多的工作，既要hash unique又要sort，最後兩個記錄集關聯還是用merge join。
實際上，這種關聯用hash join更合適，可以強制它們走hash join，於是得到第三個優化SQL：

sys@dwrac1> select /*+ use_hash(a,b) */count(distinct a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1095K (4)| 04:15:35 | | |
| 1 | SORT GROUP BY | | 1 | 26 | | | | | |
|* 2 | HASH JOIN | | 3533K| 87M| 84M| 1095K (4)| 04:15:35 | | |
| 3 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 4 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 5 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 7 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 8 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 9 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 10 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------

從執行計劃看，已經基本達到我們的要求了，但是還不夠完美，因為我們對兩個記錄集做完排重後，關聯出來的記錄肯定是沒有重複的了，因此可以把最外層的distinct去掉，於是得到第四個優化的SQL：

sys@dwrac1> select /*+ use_hash(a,b) */count(a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1095K (4)| 04:15:35 | | |
| 1 | SORT AGGREGATE | | 1 | 26 | | | | | |
|* 2 | HASH JOIN | | 3533K| 87M| 84M| 1095K (4)| 04:15:35 | | |
| 3 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 4 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 5 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 7 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 8 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 9 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 10 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------

原SQL執行超過幾小時都出不來，最終優化過的SQL 只需要不到1分鐘就得到結果，優化效果非常顯著。

以上步驟是我做優化的思路，從優化的步驟看，要優化，不但需要對技術瞭解，更需要對資料分佈瞭解！
在資料倉儲中，我們可能經常會碰到類似的SQL，對此，我們的處理原則是：先排重，再關聯，儘可能避免中間結果集膨脹。

來自 “ ITPUB部落格 ” ，連結：http://blog.itpub.net/231499/viewspace-1047939/，如需轉載，請註明出處，否則將追究法律責任。

對含distinct操作的SQL的優化

相關文章