使用exists(Semi-Join)優化distinct語句

chncaesar發表於2013-09-09
在Oracle 官方文件,semi-join是這麼解釋的:

A semijoin returns rows that match an EXISTS subquery without duplicating rows
from the left side of the predicate when multiple rows on the right side satisfy the
criteria of the subquery.

Semijoin and antijoin transformation cannot be done if the subquery is on an OR
branch of the WHERE clause.

由上可見,semi-join特別適用於如下場景:
表A 和B有1:N關係,需要根據B表某個條件,通過semi-join查詢A表上某個列(屬性)。因為semi-join已經包含一個去重過程,不需要在A表上加distinct。比內連線然後去重方式效能更好。

假設有兩個表Customer , orders。當然,每個客戶可以有多個訂單。現在需要找出訂單金額超過10000的客戶數量。本文例子在12cR1上實現,11g上應該是同樣結果。

show rel;
release 1201000100

desc customer
Name        Null     Type          
----------- -------- ------------- 
CUSTOMER_ID NOT NULL NUMBER(12)    
NAME                 VARCHAR2(100) 
STATUS               VARCHAR2(10)  

desc orders
Name        Null Type         
----------- ---- ------------ 
ORDER_ID         NUMBER(12)   
CUSTOMER_ID      NUMBER(12)   
CREATE_DATE      DATE         
ORDER_PRICE      NUMBER(38,6) 

select count(*) from  orders;
count(*)
--------------
1000002

select count(*) from customer;
count(*)
--------------
100000

orders表有100萬行,customer表有10萬行。

下面來比較下inner join和semi-join兩種SQL寫法的效能。
Select /*HE7*/ count(distinct cus.name)
from customer cus, orders ord
where cus.customer_id=ord.customer_id
and ord.order_price > 10000;

由於做了內連線(inner join),最終結果集裡將會出現所有匹配行,包含一部分重複customer(使用者)。著名Oracle佈道者Tom Kyte在2006年一篇文章裡就寫到:

“In general, you should phrase the queries in the manner that says it best. If one set of joins were particularly efficient to use in all cases, Oracle would not have implemented the rest of them!

In general, you use a join when you need data from more than one table in the ultimate SELECT list.”

文章出處:

http://www.oracle.com/technetwork/issue-archive/2006/06-may/o36asktom-099107.html

在這裡,我們只需要customer表的資訊,orders表是用來做檢索(filtering)的。

只將customer表放在from處,更加自然。用exists(semi-join)改寫後:


Select /*HE8*/ count(name)
from customer
where exists(
  select 1 from orders
  where customer.customer_id=orders.customer_id
  and orders.order_price > 10000
);

由於semi-join,這裡我們不需要distinct,從customer能直接得到去重後的結果。檢視下執行計劃和最終的cost
SQL_ID  3kfvh06bqrn5h, child number 0
-------------------------------------
select /*HE7*/ count(distinct cus.name) from customer cus, orders ord 
where cus.customer_id=ord.customer_id and ord.order_price > 10000
 
Plan hash value: 3082118893
 
------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |       |       |  3212 (100)|          |
|   1 |  SORT AGGREGATE       |          |     1 |    52 |       |            |          |
|   2 |   VIEW                | VW_DAG_0 | 52570 |  2669K|       |  3212   (1)| 00:00:01 |
|   3 |    HASH GROUP BY      |          | 52570 |  2104K|  2688K|  3212   (1)| 00:00:01 |
|*  4 |     HASH JOIN SEMI    |          | 52570 |  2104K|  2640K|  2651   (1)| 00:00:01 |
|   5 |      TABLE ACCESS FULL| CUSTOMER |   100K|  1464K|       |   102   (0)| 00:00:01 |
|*  6 |      TABLE ACCESS FULL| ORDERS   |   753K|    18M|       |  1063   (1)| 00:00:01 |
------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("CUS"."CUSTOMER_ID"="ORD"."CUSTOMER_ID")
   6 - filter("ORD"."ORDER_PRICE">10000)
 
Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

SQL_ID  ggs8p2c27nzmu, child number 0
-------------------------------------
select /*HE8*/ count(name) from customer where exists(   select 1 from 
orders   where customer.customer_id=orders.customer_id   and 
orders.order_price > 10000 )
 
Plan hash value: 2841769433
 
----------------------------------------------------------------------------------------
| Id  | Operation           | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |          |       |       |       |  2651 (100)|          |
|   1 |  SORT AGGREGATE     |          |     1 |    41 |       |            |          |
|*  2 |   HASH JOIN SEMI    |          | 52570 |  2104K|  2640K|  2651   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| CUSTOMER |   100K|  1464K|       |   102   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| ORDERS   |   753K|    18M|       |  1063   (1)| 00:00:01 |
----------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("CUSTOMER"."CUSTOMER_ID"="ORDERS"."CUSTOMER_ID")
   4 - filter("ORDERS"."ORDER_PRICE">10000)
 
Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

很明顯,第二個執行計劃優於第一個。


來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/638844/viewspace-772370/,如需轉載,請註明出處,否則將追究法律責任。

相關文章