Mysql join 的本質,Nested Loop Join 和 Hash Join

眉頭發表於2020-09-25

Mysql 在不同版本對join的實現有不同的實現方式,mysql5.7之前,是通過Nested Loop join方式實現的,在mysql 8以後對這種嵌入迴圈查詢的方式採用hash join的演算法進行了優化。
注:如下引用內容均摘抄與Mysql官網

Nested Loop Join

MySQL executes joins between tables using a nested-loop algorithm or variations on it.

MySQL使用巢狀迴圈演算法或它的變體來執行表之間的連線。
有如下兩種演算法:

  • Nested-Loop Join Algorithm

    A simple nested-loop join (NLJ) algorithm reads rows from the first table in a loop one at a time, passing each row to a nested loop that processes the next table in the join. This process is repeated as many times as there remain tables to be joined.
    一個簡單的巢狀迴圈聯接(NLJ)演算法一次從迴圈中的第一個表讀取一行,將每一行傳遞到一個巢狀迴圈,該迴圈處理聯接中的下一個表。只要還有需要連線的表,這個過程就會重複多次。

    如果使用簡單的NLJ演算法,連線處理如下:

    for each row in t1 matching range {
    	  for each row in t2 matching reference key {
    	    for each row in t3 {
    	      if row satisfies join conditions, send to client
    	    }
      }
    }
    

    因為NLJ演算法一次只將一行從外部迴圈傳遞到內部迴圈,所以它通常讀取在內部迴圈中處理的表多次。
    分析 t1 join t2 :在這裡插入圖片描述
    t1分別取出每一個記錄去從t2 中匹配t2中的每一個列,然後再合併資料,這樣,如果資料量過大(尤其是t1),對資料庫的開銷會很大,所以最好是小表join大表
    t1 join t2 的時候,mysql會內部進行一個優化(內連線,因為外連線已經指定了驅動表),在讀取資料的時候,不一定是先讀取t1,在讀取t2。在這種情況下如果要指定先讀哪一個表中的資料,可以使用關鍵字 STRAIGHT_JOIN,這樣就強制了左表join右表的順序。即:select * frm t1 STRAIGHT_JOIN t2 on t1.id = t2.id

  • Block Nested-Loop Join Algorithm

    A Block Nested-Loop (BNL) join algorithm uses buffering of rows read in outer loops to reduce the number of times that tables in inner loops must be read. For example, if 10 rows are read into a buffer and the buffer is passed to the next inner loop, each row read in the inner loop can be compared against all 10 rows in the buffer. This reduces by an order of magnitude the number of times the inner table must be read.
    塊巢狀迴圈(BNL)連線演算法使用緩衝在外部迴圈中讀取的行,以減少必須讀取內部迴圈中的表的次數。例如,如果將10行讀入一個緩衝區並將該緩衝區傳遞給下一個內部迴圈,則可以將內部迴圈中讀取的每一行與緩衝區中的所有10行進行比較。這可以將必須讀取內部表的次數減少一個數量級。

    如果使用簡單的B-NLJ演算法,連線處理如下:

    for each row in t1 matching range {
      for each row in t2 matching reference key {
        store used columns from t1, t2 in join buffer
        if buffer is full {
          for each row in t3 {
            for each t1, t2 combination in join buffer {
              if row satisfies join conditions, send to client
            }
          }
          empty join buffer
        }
      }
    }
    
    if buffer is not empty {
      for each row in t3 {
        for each t1, t2 combination in join buffer {
          if row satisfies join conditions, send to client
        }
      }
    }
    

    分析 t1 join t2 :
    在這裡插入圖片描述
    如果join列沒有索引,就會採用BNLJ,可以看到中間有一個join buffer 緩衝區,是將t1 表的所有join 相關的列都快取到join buffer中,然後批量的和t2表驚醒匹配,將NLJ中匹配多次的方式合併為一次,降低了t2表中的訪問頻率。
    預設情況下join_buffer_size=256K,在查詢的時候Mysql會將所有的需要的列快取到join buffer中,包括select的列。在一個有N個join關聯的sql中,在執行的時候會分配N-1個 join buffer

Hash join

Beginning with MySQL 8.0.18, MySQL employs a hash join for any query for which each join has an equi-join condition and uses no indexes
從MySQL 8.0.18開始,MySQL對任何查詢都具有相等連線條件且不使用索引的查詢使用雜湊連線

如:SELECT * FROM t1 JOIN t2 ON t1.c1=t2.c1;
EXPLAIN 查詢執行計劃後可以到Extra中看到 Extra: Using where; Using join buffer (hash join)

A hash join is usually faster than and is intended to be used in such cases instead of the block nested loop algorithm employed in previous versions of MySQL. Beginning with MySQL 8.0.20, support for block nested loop is removed, and the server employs a hash join wherever a block nested loop would have been used previously.

雜湊連線通常比以前MySQL版本中使用的塊巢狀迴圈演算法更快,也更適合使用雜湊連線。從MySQL 8.0.20開始,刪除了對塊巢狀迴圈的支援,伺服器在以前使用塊巢狀迴圈的地方使用雜湊連線。

By default, MySQL 8.0.18 and later employs hash joins whenever possible. It is possible to control whether hash joins are employed using one of the BNL and NO_BNL optimizer hints.
(MySQL 8.0.18 supported hash_join=on or hash_join=off as part of the setting for the optimizer_switch server system variable as well as the optimizer hints HASH_JOIN or NO_HASH_JOIN. In MySQL 8.0.19 and later, these no longer have any effect.)
Memory usage by hash joins can be controlled using the join_buffer_size system variable; a hash join cannot use more memory than this amount. When the memory required for a hash join exceeds the amount available, MySQL handles this by using files on disk. If this happens, you should be aware that the join may not succeed if a hash join cannot fit into memory and it creates more files than set for open_files_limit. To avoid such problems, make either of the following changes:
Increase join_buffer_size so that the hash join does not spill over to disk.
Increase open_files_limit.
Beginning with MySQL 8.0.18, join buffers for hash joins are allocated incrementally; thus, you can set join_buffer_size higher without small queries allocating very large amounts of RAM, but outer joins allocate the entire buffer. In MySQL 8.0.20 and later, hash joins are used for outer joins as well, so this is no longer an issue.

預設情況下,MySQL 8.0.18及以後版本儘可能使用雜湊連線。可以使用一個BNL和NO_BNL優化器提示來控制是否使用雜湊連線。
(MySQL 8.0.18支援hash_join=on或hash_join=off作為optimizer_switch伺服器系統變數以及優化器提示hash_join或NO_HASH_JOIN的設定的一部分。在MySQL 8.0.19及以後版本中,這些不再有任何效果。)
雜湊連線的記憶體使用可以使用join_buffer_size系統變數來控制;雜湊連線不能使用超過此數量的記憶體。當雜湊連線所需的記憶體超過可用記憶體時,MySQL通過使用磁碟上的檔案來處理。如果發生這種情況,您應該知道,如果雜湊連線無法裝入記憶體,並且建立的檔案超過open_files_limit設定的檔案,則連線可能不會成功。為避免此類問題,請作出以下任何一項更改:
增加join_buffer_size,這樣雜湊連線就不會溢位到磁碟。
增加open_files_limit。
從MySQL 8.0.18開始,雜湊連線的連線緩衝區是遞增分配的;因此,您可以將join_buffer_size設定得更高,而無需使用分配大量RAM的小查詢,但是外部連線會分配整個緩衝區。在MySQL 8.0.20及以後版本中,雜湊連線也用於外連線,因此這不再是一個問題。
分析 t1 join t2 :
在這裡插入圖片描述
select * from t1 join t2 on t1.id = t2.id
hash join 在進行一次查詢的時候,會先選擇一個表中的join欄位進行hash 運算——hash(t1.id),將該欄位全量資料進行hash後,會存放到hash table中,但是如果表資料太多,記憶體是放不下全量資料的,所以mysql對這種情況做了額外的處理,會將資料寫入檔案塊中。記憶體中資料匹配完成之後,會將檔案塊中的資料載入到記憶體,然後重複以上的過程。

相關文章