表的三種連線方式官方解釋及個人理解

lusklusklusk發表於2016-10-26

巢狀迴圈Nested Loop Joins
Nested loop joins are useful when the following conditions are true:
The database joins small subsets(子集) of data.(資料庫連線小部分資料)
The join condition is an efficient method of accessing the second table.(連線條件是訪問第二個表的有效方法)

It is important to ensure that the inner table is driven from (dependent on) the outer table. If the inner table's access path is independent of the outer table, then the same rows are retrieved(檢索) for every iteration(迴圈) of the outer loop, degrading performance considerably. In such cases, hash joins joining the two independent row sources perform better.
如果內表的訪問路徑獨立於外部表,則對於外部迴圈的每次迴圈都會檢索相同的行,從而大大降低效能

A nested loop join involves(涉及) the following steps:
The optimizer determines the driving table and designates(指定) it as the outer table.
The other table is designated as the inner table.
For every row in the outer table, Oracle Database accesses all the rows in the inner table. The outer loop is for every row in the outer table and the inner loop is for every row in the inner table. The outer loop appears before the inner loop in the execution plan, as follows:NESTED LOOPS、outer_loop、inner_loop 

The optimizer uses nested loop joins when joining small number of rows, with a good driving condition between the two tables. You drive from the outer loop to the inner loop, so the order of tables in the execution plan is important

The outer loop is the driving row source. It produces a set of rows for driving the join condition. The row source can be a table accessed using an index scan or a full table scan. Also, the rows can be produced from any other operation. For example, the output from a nested loop join can serve as a row source for another nested loop join.

The inner loop is iterated(迴圈) for every row returned from the outer loop, ideally by an index scan. If the access path for the inner loop is not dependent on the outer loop, then you can end up with a Cartesian(笛卡爾積) product; for every iteration of the outer loop, the inner loop produces the same set of rows. Therefore, you should use other join methods when two independent row sources are joined together.
兩表查詢沒有任何關聯條件如select * from t1,t2就是典型的笛卡爾積

Nested loop join的個人理解和總結

1、工作原理:
select ...from t1, t2 where t1.id=t2.id;
for i in 1..t1.rowcount loop
   s_t1_id := t1[i].id;
   for j in 1..t2.rowcount loop
      s_t2_id := t2[j].id;
      if s_t1_id = s_t2_id then
         .....
      end if;
   end loop;
end loop;

Driving Table(驅動表): 該表又稱為外層表(OUTER TABLE)。通俗的理解就是執行計劃中NESTED LOOPS先執行的那個表
Probed Table(匹配表): 該表又稱為內層表(INNER TABLE)

2、通過以上工作原理我們知道,T1表(即驅動表)每返回一條記錄,都要去T2表去輪詢一次,得到與其匹配的資料,推送到結果集中。所以在巢狀迴圈的使用中,必須設定返回記錄少的表作為驅動表,可以通過以下兩點來提高巢狀迴圈的速度。

   第一:儘量提高T1表取記錄的速度(T1表的連線建立索引)
  
第二:儘量提高T2表記錄匹配速度(T2表的連線列上建立索引)

3、優缺點及適用場景
優點:
- 找出第一條匹配記錄的速度最快。
- 適用於純OLTP系統,並且是小表關聯大表的情況。
缺點:
- 邏輯讀太高。

Nested loop join選擇不同表作為驅動表消耗資源的例子
兩表連線每次的where條件關聯只返回一條記錄,存在2張表,一個10條記錄,一個1000萬條記錄,若2表都存在連線欄位索引

小表為驅動表的代價:10*(通過索引在大表查詢一條記錄的代價),大概10*5

大表為驅動表的代價:1000*(通過索引在小表查詢一條記錄的代價),大概1000*3

通過索引獲取一條記錄(取一條記錄那在表裡肯定就是一個塊了,除非那行大於塊的大小8k),10行的表代價通常在3  blocks,索引2塊,表1塊。1000萬的表代價通常5 blocks,索引可能達到4塊,表1塊





雜湊連線 Hash Joins
The database uses hash joins to join large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing(探測) the hash table to find the joined rows.This method is best when the smaller table fits in available memory. The cost is then limited to a single read pass over the data for the two tables.(對兩個表的資料進行單次讀取)
The optimizer uses a hash join to join two tables if they are joined using an equijoin(等值連線) and if either of(任意一個) the following conditions are true:
A large amount of data must be joined.
A large fraction of a small table must be joined.

hash join的個人理解和總結
- 小表關聯大表
- 全表掃描小表,並把關聯欄位的值取出,在pga裡構建hash table。
- 遍歷大表,根據掃描的結果按照關聯欄位執行相同的hash函式,得到hash value,去pga的hash table裡找匹配記錄。
優點
- 邏輯讀很少,因為兩表讀只讀一次。
缺點:
- 如果小表的hash table不能完全放在PGA裡的話,則效能會極具下降。

摘錄引用一個關於Hash join原理的例子
假如兩個資料集:
S={1,1,1,3,3,4,4,4,4,5,8,8,8,8,10}
B={0,0,1,1,1,1,2,2,2,2,2,2,3,8,9,9,9,10,10,11}
Hash Join的第一步就是判定小表(即build input)是否能完全存放在hash area記憶體中
如果能完全存放在記憶體中,則在記憶體中建立hash table,這是最簡單的hash join
如果不能全部存放在記憶體中,則build input必須分割槽。Oracle採用內部一個hash函式作用於連線鍵上,將S和B分割成多個分割槽,這樣產生十個分割槽,如下表:

經過這樣的分割槽之後,只需要相應的分割槽之間做join即可
    如果有一個分割槽為NULL的話,則相應的分割槽join即可忽略,在將S表讀入記憶體分割槽時,oracle即記錄連線鍵的唯一值,構建成所謂的點陣圖向量,它需要佔hash area記憶體的5%左右。在這裡即為{1,3,4,5,8,10}
   當對B表進行分割槽時,將每一個連線鍵上的值與點陣圖向量相比較,如果不在其中,則將其記錄丟棄
   在我們這個例子中,B表中以下資料將被丟棄{0,0,2,2,2,2,2,2,9,9,9,9,11},這個過程就是點陣圖向量過濾,當S1,B1做完連線後,接著對Si,Bi進行連線

Oracle官方文件對hash的一些解釋
hash table
An in-memory data structure that associates(關聯) join keys with rows in a hash join. For example, in a join of the employees and departments tables, the join key might be the department ID. A hash function uses the join key to generate a hash value. This hash value is an index in an array, which is the hash table.

hash cluster
A type of table cluster that is similar to an indexed cluster, except the index key is replaced with a hash function. No separate cluster index exists. In a hash cluster, the data is the index.

hash key value
In a hash cluster, an actual or possible value inserted into the cluster key column. For example, if the cluster key is department_id, then hash key values could be 10, 20, 30, and so on.

hash value
In a hash cluster, a unique numeric ID that identifies a bucket. Oracle Database uses a hash function that accepts an infinite(無窮) number of hash key values as input and sorts them into a finite number of buckets. Each hash value maps to the database block address for the block that stores the rows corresponding to the hash key value (department 10, 20, 30, and so on).

hashing
A mathematical technique in which an infinite set of input values is mapped to a finite set of output values, called hash values. Hashing is useful for rapid lookups of data in a hash table.

以下為個人對hash的一些理解
bucket就類似下面a[1][0]、a[1][1]、a[1][2]、..a[1][9]、..a[10][0]
1、2、3、4、..、100
1、2、3、4、..、1000
1、2、3、4、..、10000
如何快速找出28
a[10][10]
a[100][10]
a[1000][10]

mod(10,10)
a[1][0]=1
a[1][1]=11
..
a[1][9]=91
a[2][0]=2
a[2][1]=12
..
a[2][9]=92
...
a[8][?]
[8][2]->28
以上a[10][0]=10、a[10][1]=20(因為十位為1,就是10.加上前面個位的10,就是20)..a[10][9]=100
1、11、21、31、...91
2、12、22、32、...92
3、13、23、33、...93
...
8、18、28、38、...98
..

10、20、30、40、...100

mod(100,10)
a[1][0]=1
a[1][1]=11
..
a[1][9]=91
a[2][0]=2
a[2][1]=12
..
a[2][9]=92
...
a[28][?]
[28][0]->28

mod(1000,10)
a[1][0]=1
a[1][1]=11
..
a[1][9]=91
a[2][0]=2
a[2][1]=12
..
a[2][9]=92
...
a[28][?]
[28][0]->28




排序合併Sort Merge Joins
Sort merge joins can join rows from two independent sources. Hash joins generally perform better than sort merge joins. However, sort merge joins can perform better than hash joins if both of the following conditions exist:
The row sources are sorted already.
A sort operation does not have to be done.

However, if a sort merge join involves choosing a slower access method (an index scan as opposed to a full table scan), then the benefit of using a sort merge might be lost.

Sort merge joins are useful when the join condition between two tables is an inequality condition such as <, <=, >, or >=. Sort merge joins perform better than nested loop joins for large data sets. You cannot use hash joins unless there is an equality condition.

In a merge join, there is no concept of a driving table. The join consists of two steps:
Sort join operation: Both the inputs are sorted on the join key.
Merge join operation: The sorted lists are merged together.

If the input is sorted by the join column, then a sort join operation is not performed for that row source. However, a sort merge join always creates a positionable sort buffer for the right side of the join so that it can seek back to the last match in the case where duplicate join key values come out of the left side of the join.

The optimizer can choose a sort merge join over a hash join for joining large amounts of data if any of the following conditions are true:
The join condition between two tables is not an equijoin.
Because of sorts required by other operations, the optimizer finds it is cheaper to use a sort merge than a hash join.


Sort Merge Joins的個人理解和總結
1、工作原理:假如有AB兩張表進行排序合併連線,ORACLE會首先將A表進行排序,形成一張臨時的”C,然後將B進行排序,形成一張臨時的”D,然後將CD進行合併操作,返回結果集。就是兩個行源分別按照關聯欄位排序以後,做關聯。
2、由於兩個表都需要排序,所有PGA中排序區需要有足夠的空間,避免在磁碟上排序。(缺點就是消耗PGA)






總結三種連線方式差異如下

訪問次數|巢狀迴圈|驅動表返回幾條,被驅動表訪問多少次
              |—————————————————————————————————————— 
              |雜湊連線|都最多訪問一次
              |--------------------------------------------------------------------------
              |排序合併|表都最多訪問一次
---------------------------------------------------------------------------
適應場景|巢狀迴圈|小量資料的連線或小表大表連線
              |—————————————————————————————————————— 
              |雜湊連線|大資料的連線或小表大表連線
              |--------------------------------------------------------------------------
              |排序合併|大資料的連線且有排序操作
---------------------------------------------------------------------------

排序差異|巢狀迴圈|無排序
              |-------------------------------------------------------------
              |雜湊連線|無排序,但會消耗記憶體建立Hash表
              |--------------------------------------------------------------
              |排序合併|有排序
----------------------------------------------------------------
限制場景|巢狀迴圈|無任何限制,<>,>,<,like都可以用
              | ——————————————————————————
              |雜湊連線|只能使用=
              | -----------------------------------------------------
              |排序合併|無任何限制,<>,>,<,like都可以用
----------------------------------------------------
索引使用|巢狀迴圈|驅動表的限制條件和被驅動表的連線條件上建立索引
              | ------------------------------------------------------------
              |雜湊連線|索引列無要求,與單表情況無異
              | -----------------------------------------------------------
              |排序合併|索引可以消除排序
              ---------------------------------------------------------

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/30126024/viewspace-2127136/,如需轉載,請註明出處,否則將追究法律責任。

相關文章