How Spark SQL Chooses a Join Strategy

Published by 大資料學習與分享 on 2021-01-29

Preface

As is well known, the Catalyst Optimizer is the core of Spark SQL. It is responsible for turning a SQL statement into the final physical execution plan and, to a large extent, determines how well the SQL performs.

When Catalyst generates a Physical Plan from the Optimized Logical Plan, the JoinSelection strategy defined in

abstract class SparkStrategies extends QueryPlanner[SparkPlan]

pattern-matches a series of rules, in order, to determine the final join execution strategy. The rules are arranged so that more efficient strategies take priority.

Before diving into join strategy selection, let's first look at a few prerequisites:

1. Choosing the build table

The first step of a hash join is to build a hash table from the smaller of the two tables. This smaller table is called the build table, and the larger one the probe table, because its rows are checked ("probed") against the hash table built from the small table. The source code is as follows:

/* Conditions for the left table to serve as the build table; the join type must be:
   1. InnerLike: currently covers inner join and cross join
   2. RightOuter: right outer join
*/
private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
  case _: InnerLike | RightOuter => true
  case _ => false
}

/* Conditions for the right table to serve as the build table; the join type must be
   (case 1 covers most of the SQL written in day-to-day business development):
   1. InnerLike, LeftOuter (left outer join), LeftSemi (left semi join), LeftAnti (left anti join)
   2. ExistenceJoin: only used in the end of optimizer and physical plans, we will not generate SQL for this join type
*/
private def canBuildRight(joinType: JoinType): Boolean = joinType match {
  case _: InnerLike | LeftOuter | LeftSemi | LeftAnti | _: ExistenceJoin => true
  case _ => false
}
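
To make the build/probe terminology concrete, here is a minimal sketch of the core hash-join idea in plain Scala. It is an illustration only, not Spark's actual implementation:

// Simplified illustration of build/probe, not Spark's real code.
// Build phase: hash the smaller (build) table by join key.
val buildTable = Seq((1, "a"), (2, "b"))            // small side
val probeTable = Seq((1, "x"), (2, "y"), (3, "z"))  // large side
val hashed: Map[Int, String] = buildTable.toMap

// Probe phase: stream the larger (probe) table and look up each key.
val joined = probeTable.flatMap { case (k, v) =>
  hashed.get(k).map(b => (k, b, v))
}
// joined == Seq((1, "a", "x"), (2, "b", "y"))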

2. Conditions for a table to be broadcast

If a table's size is less than or equal to the value of the parameter spark.sql.autoBroadcastJoinThreshold (default 10 MB), the table can be broadcast. The source code is as follows:

private def canBroadcastBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
  : Boolean = {
  val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
  val buildRight = canBuildRight(joinType) && canBroadcast(right)
  buildLeft || buildRight
}

private def canBroadcast(plan: LogicalPlan): Boolean = {
  plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold
}

private def broadcastSideBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
  : BuildSide = {
  val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
  val buildRight = canBuildRight(joinType) && canBroadcast(right)
  
  // ultimately delegates to broadcastSide
  broadcastSide(buildLeft, buildRight, left, right)
}
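
The threshold is an ordinary SQL configuration, so it can be tuned per session. A minimal sketch, assuming an existing SparkSession named spark:

// Raise the auto-broadcast threshold to 50 MB; setting it to -1
// disables size-based broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)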

Besides satisfying the size condition above, we can also explicitly name the table to broadcast with a hint in Spark SQL (/*+ BROADCAST(small_table) */). The source code is as follows:

private def canBroadcastByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
  : Boolean = {
  val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
  val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
  buildLeft || buildRight
}

private def broadcastSideByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
  : BuildSide = {
  val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
  val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
  
  // ultimately delegates to broadcastSide
  broadcastSide(buildLeft, buildRight, left, right)
}

Whether broadcasting is triggered by table size or by a hint, broadcastSide is ultimately called to decide which table to broadcast:

private def broadcastSide(
    canBuildLeft: Boolean,
    canBuildRight: Boolean,
    left: LogicalPlan,
    right: LogicalPlan): BuildSide = {

  def smallerSide =
    if (right.stats.sizeInBytes <= left.stats.sizeInBytes) BuildRight else BuildLeft

  if (canBuildRight && canBuildLeft) {
    // If both tables qualify as the build table, pick the one with the smaller
    // physical size according to table statistics (even if both carry a hint)
    smallerSide
  } else if (canBuildRight) {
    // Otherwise the right table is checked first: broadcast it if it qualifies,
    // and only then consider the left table
    BuildRight
  } else if (canBuildLeft) {
    BuildLeft
  } else {
    // If neither table qualifies as the build table, pick the one with the smaller
    // physical size based on statistics. Currently used mainly for broadcast nested loop join
    smallerSide
  }
}

As the source code above shows, even when the user supplies a broadcast hint, the hinted table is not necessarily the one that gets broadcast at execution time.
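
For example, here is a minimal sketch of hinting a broadcast through the DataFrame API, which attaches the same hint as /*+ BROADCAST(...) */ in SQL; the data is made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("broadcast-hint").getOrCreate()
import spark.implicits._

val largeDf = (1 to 100000).map(i => (i, s"v$i")).toDF("id", "value")
val smallDf = Seq((1, "a"), (2, "b")).toDF("id", "name")

// The hint marks smallDf for broadcast; per broadcastSide above, if both
// sides were hinted, the smaller side by statistics would still win.
largeDf.join(broadcast(smallDf), Seq("id")).explain()
// The physical plan should contain BroadcastHashJoin.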

3. Whether a local HashMap can be built

This check applies to Shuffle Hash Join. The source code is as follows:

// A single partition of the logical plan must be small enough to build a hash table.
// Note: this assumes the number of partitions is fixed; if it were dynamic,
// additional conditions would have to hold.
private def canBuildLocalHashMap(plan: LogicalPlan): Boolean = {
  // A local HashMap can be built when the plan's physical size is smaller than
  // spark.sql.autoBroadcastJoinThreshold * spark.sql.shuffle.partitions (default 200)
  plan.stats.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions
}
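
With the default settings this works out to 10 MB × 200 = 2,000 MB: a plan passes the check when its total size is under roughly 2 GB, so that each of the 200 shuffle partitions holds about 10 MB on average and can fit an in-memory hash table.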

As we know, Spark SQL currently implements three main join strategies: Broadcast Hash Join, Shuffle Hash Join, and Sort Merge Join. So what rules does Catalyst follow when it picks a join strategy for a SQL statement?

1. Broadcast Hash Join

Whether this strategy applies is decided from hints and table sizes; the hint-based rule is matched first, then the size-based rule.

// broadcast hints were specified
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if canBroadcastByHints(joinType, left, right) =>
  val buildSide = broadcastSideByHints(joinType, left, right)
  Seq(joins.BroadcastHashJoinExec(
    leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

// broadcast hints were not specified, so need to infer it from size and configuration.
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if canBroadcastBySizes(joinType, left, right) =>
  val buildSide = broadcastSideBySizes(joinType, left, right)
  Seq(joins.BroadcastHashJoinExec(
    leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

2. Shuffle Hash Join

Shuffle Hash Join is chosen only when all of the following conditions hold:

  1. spark.sql.join.preferSortMergeJoin is false, i.e. Shuffle Hash Join takes precedence over Sort Merge Join

  2. The right (or left) table can serve as the build table

  3. A local HashMap can be built for that table

  4. Taking the right table as an example, its logical plan size must be much smaller than the left table's (at most a third of it, by default)

The conditions are checked for the right table first.

case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
      && muchSmaller(right, left) ||
      !RowOrdering.isOrderable(leftKeys) =>
  Seq(joins.ShuffledHashJoinExec(
    leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))

case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if !conf.preferSortMergeJoin && canBuildLeft(joinType) && canBuildLocalHashMap(left)
      && muchSmaller(left, right) ||
      !RowOrdering.isOrderable(leftKeys) =>
  Seq(joins.ShuffledHashJoinExec(
    leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))

// a is "much smaller" than b when a is at most a third of b's size, e.g. a
// 100 MB table is much smaller than a 350 MB one, but not than a 250 MB one
private def muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean = {
  a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes
}
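
Condition 1 can be flipped per session. A minimal sketch, assuming an existing SparkSession named spark; the remaining size conditions above must still hold for Shuffle Hash Join to win:

// Tell the planner not to prefer Sort Merge Join (condition 1 above).
spark.conf.set("spark.sql.join.preferSortMergeJoin", false)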

Even when the conditions above are not met, Shuffle Hash Join is still chosen if the join keys cannot be sorted, because Sort Merge Join is then impossible:

!RowOrdering.isOrderable(leftKeys)

def isOrderable(exprs: Seq[Expression]): Boolean = exprs.forall(e => isOrderable(e.dataType))
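
For instance, atomic types, and arrays or structs of orderable types, are orderable, while MapType is not, so a join keyed on a map column can never use Sort Merge Join.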

3. Sort Merge Join

If neither of the two strategies above (Broadcast Hash Join and Shuffle Hash Join) applies, and the join keys are sortable, Sort Merge Join is chosen.

case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
   if RowOrdering.isOrderable(leftKeys) =>
   joins.SortMergeJoinExec(
     leftKeys, rightKeys, joinType, condition, planLater(left), planLater(right)) :: Nil
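
This default is easy to observe once broadcasting is ruled out. A sketch, assuming an existing SparkSession named spark:

// Disable size-based broadcasting so the equi-join cannot become a BHJ.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val a = spark.range(1000000).toDF("id")
val b = spark.range(1000000).toDF("id")
a.join(b, "id").explain()  // expect SortMergeJoin in the physical plan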

4. Without joining keys

Broadcast Hash Join, Shuffle Hash Join, and Sort Merge Join all fall under the classic ExtractEquiJoinKeys pattern, i.e. equi-join conditions.

For joins that do not match ExtractEquiJoinKeys, Spark first checks whether either table can be broadcast (by hint or by size). If so, BroadcastNestedLoopJoin (BNLJ for short) is used. Anyone familiar with Nested Loop Join will find BNLJ easy to understand; the main difference is that BNLJ broadcasts one side of the join.

The source code is as follows:

// Pick BroadcastNestedLoopJoin if one side could be broadcast
case j @ logical.Join(left, right, joinType, condition)
    if canBroadcastByHints(joinType, left, right) =>
  val buildSide = broadcastSideByHints(joinType, left, right)
  joins.BroadcastNestedLoopJoinExec(
    planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

case j @ logical.Join(left, right, joinType, condition)
    if canBroadcastBySizes(joinType, left, right) =>
  val buildSide = broadcastSideBySizes(joinType, left, right)
  joins.BroadcastNestedLoopJoinExec(
    planLater(left), planLater(right), buildSide, joinType, condition) :: Nil
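
A sketch of a non-equi join that takes this path; the DataFrames are made up for illustration, and a SparkSession named spark is assumed to be in scope:

import spark.implicits._

val events  = Seq((1, 10), (2, 25)).toDF("id", "ts")
val windows = Seq((5, 20), (20, 30)).toDF("start", "end")

// No equality predicate, so ExtractEquiJoinKeys does not match; with a
// broadcastable side, the planner picks BroadcastNestedLoopJoin.
events.join(windows, $"ts" >= $"start" && $"ts" < $"end").explain()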

If neither table can be broadcast, there are two further cases:

  1. If the join type is InnerLike (introduced above), a Cartesian product of the two tables is computed directly

  2. Otherwise, as a last resort, the table with the smaller physical size is broadcast, and the join strategy is still BNLJ

The source code is as follows:

// Pick CartesianProduct for InnerJoin
case logical.Join(left, right, _: InnerLike, condition) =>
  joins.CartesianProductExec(planLater(left), planLater(right), condition) :: Nil

case logical.Join(left, right, joinType, condition) =>
  val buildSide = broadcastSide(
    left.stats.hints.broadcast, right.stats.hints.broadcast, left, right)
  // This join could be very slow or OOM
  joins.BroadcastNestedLoopJoinExec(
    planLater(left), planLater(right), buildSide, joinType, condition) :: Nil
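
For instance, an inner join with no join condition at all matches the InnerLike case above. A sketch, assuming an existing SparkSession named spark:

val left  = spark.range(3).toDF("a")
val right = spark.range(3).toDF("b")
left.crossJoin(right).explain()  // expect CartesianProduct in the plan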

Clearly, whether the final join strategy ends up being a Cartesian product or BNLJ, both are very inefficient, and in real workloads they should be avoided whenever possible.
