使用exists(Semi-Join)優化distinct語句

在Oracle 官方文件，semi-join是這麼解釋的：

A semijoin returns rows that match an EXISTS subquery without duplicating rows

from the left side of the predicate when multiple rows on the right side satisfy the

criteria of the subquery.

Semijoin and antijoin transformation cannot be done if the subquery is on an OR

branch of the WHERE clause.

由上可見，semi-join特別適用於如下場景：

表A 和B有1：N關係，需要根據B表某個條件，通過semi-join查詢A表上某個列（屬性）。因為semi-join已經包含一個去重過程，不需要在A表上加distinct。比內連線然後去重方式效能更好。

假設有兩個表Customer ， orders。當然，每個客戶可以有多個訂單。現在需要找出訂單金額超過10000的客戶數量。本文例子在12cR1上實現，11g上應該是同樣結果。

show rel;

release 1201000100

desc customer

Name Null Type

----------- -------- -------------

CUSTOMER_ID NOT NULL NUMBER(12)

NAME VARCHAR2(100)

STATUS VARCHAR2(10)

desc orders

Name Null Type

----------- ---- ------------

ORDER_ID NUMBER(12)

CUSTOMER_ID NUMBER(12)

CREATE_DATE DATE

ORDER_PRICE NUMBER(38,6)

select count(*) from orders;

count(*)

--------------

1000002

select count(*) from customer;

count(*)

--------------

100000

orders表有100萬行，customer表有10萬行。

下面來比較下inner join和semi-join兩種SQL寫法的效能。

Select /*HE7*/ count(distinct cus.name)

from customer cus, orders ord

where cus.customer_id=ord.customer_id

and ord.order_price > 10000;

由於做了內連線(inner join)，最終結果集裡將會出現所有匹配行，包含一部分重複customer(使用者）。著名Oracle佈道者Tom Kyte在2006年一篇文章裡就寫到：

“In general, you should phrase the queries in the manner that says it best. If one set of joins were particularly efficient to use in all cases, Oracle would not have implemented the rest of them!

In general, you use a join when you need data from more than one table in the ultimate SELECT list.”

文章出處：

http://www.oracle.com/technetwork/issue-archive/2006/06-may/o36asktom-099107.html

在這裡，我們只需要customer表的資訊，orders表是用來做檢索（filtering）的。

只將customer表放在from處，更加自然。用exists(semi-join)改寫後：

Select /*HE8*/ count(name)

from customer

where exists(

select 1 from orders

where customer.customer_id=orders.customer_id

and orders.order_price > 10000

);

由於semi-join，這裡我們不需要distinct，從customer能直接得到去重後的結果。檢視下執行計劃和最終的cost

SQL_ID 3kfvh06bqrn5h, child number 0

-------------------------------------

select /*HE7*/ count(distinct cus.name) from customer cus, orders ord

where cus.customer_id=ord.customer_id and ord.order_price > 10000

Plan hash value: 3082118893

------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | | 3212 (100)| |

| 1 | SORT AGGREGATE | | 1 | 52 | | | |

| 2 | VIEW | VW_DAG_0 | 52570 | 2669K| | 3212 (1)| 00:00:01 |

| 3 | HASH GROUP BY | | 52570 | 2104K| 2688K| 3212 (1)| 00:00:01 |

|* 4 | HASH JOIN SEMI | | 52570 | 2104K| 2640K| 2651 (1)| 00:00:01 |

| 5 | TABLE ACCESS FULL| CUSTOMER | 100K| 1464K| | 102 (0)| 00:00:01 |

|* 6 | TABLE ACCESS FULL| ORDERS | 753K| 18M| | 1063 (1)| 00:00:01 |

------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

4 - access("CUS"."CUSTOMER_ID"="ORD"."CUSTOMER_ID")

6 - filter("ORD"."ORDER_PRICE">10000)

Note

-----

- dynamic statistics used: dynamic sampling (level=2)

SQL_ID ggs8p2c27nzmu, child number 0

-------------------------------------

select /*HE8*/ count(name) from customer where exists( select 1 from

orders where customer.customer_id=orders.customer_id and

orders.order_price > 10000 )

Plan hash value: 2841769433

----------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | | 2651 (100)| |

| 1 | SORT AGGREGATE | | 1 | 41 | | | |

|* 2 | HASH JOIN SEMI | | 52570 | 2104K| 2640K| 2651 (1)| 00:00:01 |

| 3 | TABLE ACCESS FULL| CUSTOMER | 100K| 1464K| | 102 (0)| 00:00:01 |

|* 4 | TABLE ACCESS FULL| ORDERS | 753K| 18M| | 1063 (1)| 00:00:01 |

----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("CUSTOMER"."CUSTOMER_ID"="ORDERS"."CUSTOMER_ID")

4 - filter("ORDERS"."ORDER_PRICE">10000)

Note

-----

- dynamic statistics used: dynamic sampling (level=2)

很明顯，第二個執行計劃優於第一個。

來自 “ ITPUB部落格 ” ，連結：http://blog.itpub.net/638844/viewspace-772370/，如需轉載，請註明出處，否則將追究法律責任。

使用exists(Semi-Join)優化distinct語句

相關文章