Preface
As is well known, the Catalyst optimizer is the core of Spark SQL: it is responsible for turning a SQL statement into the final physical execution plan, and thus largely determines how efficiently the SQL runs.
When Catalyst generates the Physical Plan from the Optimized Logical Plan, the JoinSelection object in

```scala
abstract class SparkStrategies extends QueryPlanner[SparkPlan]
```

pattern-matches a set of rules, in order, to determine the final join execution strategy; the candidate strategies are tried in descending order of execution efficiency.
Before looking at how a join strategy is chosen, consider a few prerequisites:
1. Choosing the build table
The first step of a Hash Join is to build a hash table from the smaller of the two tables; this small table is called the build table, while the large table is called the probe table, because the hash table built from the small table is used to "probe" it. Source:
```scala
/**
 * Conditions for the left table to be the build table; the join type must be:
 *   1. InnerLike: currently inner join and cross join
 *   2. RightOuter: right outer join
 */
private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
  case _: InnerLike | RightOuter => true
  case _ => false
}

/**
 * Conditions for the right table to be the build table; the join type must be
 * (case 1 covers most of the SQL written in everyday business development):
 *   1. InnerLike, LeftOuter (left outer join), LeftSemi (left semi join),
 *      LeftAnti (left anti join)
 *   2. ExistenceJoin: only used in the end of optimizer and physical plans,
 *      we will not generate SQL for this join type
 */
private def canBuildRight(joinType: JoinType): Boolean = joinType match {
  case _: InnerLike | LeftOuter | LeftSemi | LeftAnti | _: ExistenceJoin => true
  case _ => false
}
```
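The two rules above can be exercised outside Spark with a simplified stand-in for the JoinType hierarchy. The sketch below is illustrative only: the type names mirror Spark's, but this is standalone code, not the Spark implementation (ExistenceJoin is omitted for brevity).

```scala
// Simplified stand-ins for Spark's JoinType hierarchy (illustrative only).
sealed trait JoinType
case object Inner      extends JoinType
case object Cross      extends JoinType
case object LeftOuter  extends JoinType
case object RightOuter extends JoinType
case object LeftSemi   extends JoinType
case object LeftAnti   extends JoinType

// The left table may serve as the build table only for inner/cross
// and right outer joins.
def canBuildLeft(joinType: JoinType): Boolean = joinType match {
  case Inner | Cross | RightOuter => true
  case _ => false
}

// The right table may serve as the build table for inner/cross joins
// and all of the left-* join types.
def canBuildRight(joinType: JoinType): Boolean = joinType match {
  case Inner | Cross | LeftOuter | LeftSemi | LeftAnti => true
  case _ => false
}
```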
2. Which tables qualify for broadcast?
A table can be broadcast if its size is less than or equal to the value of the parameter spark.sql.autoBroadcastJoinThreshold (default 10 MB). Source:
```scala
private def canBroadcastBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): Boolean = {
  val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
  val buildRight = canBuildRight(joinType) && canBroadcast(right)
  buildLeft || buildRight
}

private def canBroadcast(plan: LogicalPlan): Boolean = {
  plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold
}

private def broadcastSideBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): BuildSide = {
  val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
  val buildRight = canBuildRight(joinType) && canBroadcast(right)
  // Ultimately delegates to broadcastSide
  broadcastSide(buildLeft, buildRight, left, right)
}
```
Besides meeting the size condition above, we can also explicitly mark a table for broadcast with a hint directly in the Spark SQL statement (/*+ BROADCAST(small_table) */). Source:
```scala
private def canBroadcastByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): Boolean = {
  val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
  val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
  buildLeft || buildRight
}

private def broadcastSideByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): BuildSide = {
  val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
  val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
  // Ultimately delegates to broadcastSide
  broadcastSide(buildLeft, buildRight, left, right)
}
```
Whether broadcasting is triggered by table size or by a hint, broadcastSide is called in the end to decide which table to broadcast:
```scala
private def broadcastSide(
    canBuildLeft: Boolean,
    canBuildRight: Boolean,
    left: LogicalPlan,
    right: LogicalPlan): BuildSide = {

  def smallerSide =
    if (right.stats.sizeInBytes <= left.stats.sizeInBytes) BuildRight else BuildLeft

  if (canBuildRight && canBuildLeft) {
    // If both the left and the right table can serve as the build table,
    // pick the one with the smaller physical size according to table
    // statistics (even if both tables carry a broadcast hint)
    smallerSide
  } else if (canBuildRight) {
    // Otherwise, check the right table first; if it qualifies, broadcast it.
    // Only then is the left table considered
    BuildRight
  } else if (canBuildLeft) {
    BuildLeft
  } else {
    // If neither table qualifies as the build table, fall back to the side
    // with the smaller physical size according to table statistics.
    // Currently this is mainly used for broadcast nested loop join
    smallerSide
  }
}
```
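The decision above can be replayed as a standalone sketch. The `Plan` case class below is a hypothetical stand-in for a LogicalPlan carrying only its size statistic; everything else mirrors the branching just shown:

```scala
// Standalone sketch of the broadcastSide decision (illustrative, not Spark code).
sealed trait BuildSide
case object BuildLeft  extends BuildSide
case object BuildRight extends BuildSide

// Hypothetical stand-in for a LogicalPlan: only the size statistic matters here.
case class Plan(sizeInBytes: BigInt)

def broadcastSide(canBuildLeft: Boolean, canBuildRight: Boolean,
                  left: Plan, right: Plan): BuildSide = {
  def smallerSide =
    if (right.sizeInBytes <= left.sizeInBytes) BuildRight else BuildLeft

  if (canBuildRight && canBuildLeft) smallerSide // both qualify: pick the smaller side
  else if (canBuildRight) BuildRight             // only the right side qualifies
  else if (canBuildLeft) BuildLeft               // only the left side qualifies
  else smallerSide                               // neither qualifies: smaller side wins
}
```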
As the source shows, even when the user specifies a broadcast hint, the hinted table is not necessarily the one that actually gets broadcast at execution time.
3. Can a local HashMap be built?
This check applies to Shuffle Hash Join. Source:
```scala
// A single partition of the logical plan must be small enough to build a hash table.
// Note: this assumes a fixed number of partitions; if the number of partitions is
// dynamic, additional conditions must also hold
private def canBuildLocalHashMap(plan: LogicalPlan): Boolean = {
  // A local HashMap can be built when the plan's physical size is smaller than
  // spark.sql.autoBroadcastJoinThreshold * spark.sql.shuffle.partitions (default 200)
  plan.stats.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions
}
```
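With the default configuration (10 MB threshold, 200 shuffle partitions) the cutoff works out to 10 MB × 200 = 2000 MB, roughly 2 GB for the whole plan. A minimal sketch with the defaults plugged in:

```scala
// Sketch of the local-HashMap size check using the default configuration values.
val autoBroadcastJoinThreshold: BigInt = BigInt(10L * 1024 * 1024) // 10 MB default
val numShufflePartitions: Int = 200                                // default

def canBuildLocalHashMap(planSizeInBytes: BigInt): Boolean =
  planSizeInBytes < autoBroadcastJoinThreshold * numShufflePartitions

// With the defaults, the cutoff is 10 MB * 200 = 2000 MB (~2 GB).
```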
We know that Spark SQL currently implements three main join strategies: Broadcast Hash Join, Shuffle Hash Join, and Sort Merge Join. So, by what rules does Catalyst choose among them when it processes a SQL statement?
1. Broadcast Hash Join
Whether Broadcast Hash Join applies is decided mainly from broadcast hints and table sizes.
```scala
// broadcast hints were specified
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
  if canBroadcastByHints(joinType, left, right) =>
  val buildSide = broadcastSideByHints(joinType, left, right)
  Seq(joins.BroadcastHashJoinExec(
    leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

// broadcast hints were not specified, so need to infer it from size and configuration.
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
  if canBroadcastBySizes(joinType, left, right) =>
  val buildSide = broadcastSideBySizes(joinType, left, right)
  Seq(joins.BroadcastHashJoinExec(
    leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))
```
2. Shuffle Hash Join
Shuffle Hash Join is chosen only when all of the following conditions hold:
- spark.sql.join.preferSortMergeJoin is false, i.e. Shuffle Hash Join is preferred over Sort Merge Join
- the right table (or the left table) can serve as the build table
- a local HashMap can be built for it
- taking the right table as an example, its logical plan must be much smaller than the left table's (by default, at most one third of its size)
These conditions are checked against the right table first.
```scala
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
  if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
    && muchSmaller(right, left) ||
    !RowOrdering.isOrderable(leftKeys) =>
  Seq(joins.ShuffledHashJoinExec(
    leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))

case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
  if !conf.preferSortMergeJoin && canBuildLeft(joinType) && canBuildLocalHashMap(left)
    && muchSmaller(left, right) ||
    !RowOrdering.isOrderable(leftKeys) =>
  Seq(joins.ShuffledHashJoinExec(
    leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))

private def muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean = {
  a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes
}
```
Even if the conditions above are not met, Shuffle Hash Join is still chosen when the keys involved in the join cannot be sorted, since Sort Merge Join is then impossible.
```scala
!RowOrdering.isOrderable(leftKeys)

def isOrderable(exprs: Seq[Expression]): Boolean =
  exprs.forall(e => isOrderable(e.dataType))
```
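The right-side branch of the matching above can be condensed into a standalone predicate. This is an illustrative sketch, not Spark code: the boolean parameters stand in for the configuration, build-side, and local-HashMap checks, and `Plan` is a hypothetical holder for the size statistic:

```scala
// Standalone sketch of the right-side Shuffle Hash Join condition (illustrative).
case class Plan(sizeInBytes: BigInt)

// One side counts as "much smaller" when it is at most a third of the other side.
def muchSmaller(a: Plan, b: Plan): Boolean =
  a.sizeInBytes * 3 <= b.sizeInBytes

def chooseShuffleHashJoin(preferSortMergeJoin: Boolean,
                          canBuildRight: Boolean,
                          canBuildLocalHashMap: Boolean,
                          right: Plan, left: Plan,
                          keysOrderable: Boolean): Boolean =
  (!preferSortMergeJoin && canBuildRight && canBuildLocalHashMap &&
    muchSmaller(right, left)) || !keysOrderable
```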
3. Sort Merge Join
If neither of the two strategies above (Broadcast Hash Join and Shuffle Hash Join) qualifies, and the join keys are sortable, Sort Merge Join is chosen.
```scala
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
  if RowOrdering.isOrderable(leftKeys) =>
  joins.SortMergeJoinExec(
    leftKeys, rightKeys, joinType, condition, planLater(left), planLater(right)) :: Nil
```
4. Without joining keys
Broadcast Hash Join, Shuffle Hash Join, and Sort Merge Join all match the classic ExtractEquiJoinKeys pattern (equi-join conditions).
For joins that do not match ExtractEquiJoinKeys, Catalyst first checks whether either table can be broadcast (by hint or by size). If so, it uses BroadcastNestedLoopJoin (BNLJ for short). Anyone familiar with Nested Loop Join will find BNLJ easy to understand; the main difference is that BNLJ additionally broadcasts one of the tables.
Source:
```scala
// Pick BroadcastNestedLoopJoin if one side could be broadcast
case j @ logical.Join(left, right, joinType, condition)
  if canBroadcastByHints(joinType, left, right) =>
  val buildSide = broadcastSideByHints(joinType, left, right)
  joins.BroadcastNestedLoopJoinExec(
    planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

case j @ logical.Join(left, right, joinType, condition)
  if canBroadcastBySizes(joinType, left, right) =>
  val buildSide = broadcastSideBySizes(joinType, left, right)
  joins.BroadcastNestedLoopJoinExec(
    planLater(left), planLater(right), buildSide, joinType, condition) :: Nil
```
If no table can be broadcast, there are two further cases:
- if the join type is InnerLike (introduced above), a Cartesian product is computed over the two tables directly
- otherwise, as a last resort, the table with the smaller physical size is broadcast anyway, and the join strategy remains BNLJ
Source:
```scala
// Pick CartesianProduct for InnerJoin
case logical.Join(left, right, _: InnerLike, condition) =>
  joins.CartesianProductExec(planLater(left), planLater(right), condition) :: Nil

case logical.Join(left, right, joinType, condition) =>
  val buildSide = broadcastSide(
    left.stats.hints.broadcast, right.stats.hints.broadcast, left, right)
  // This join could be very slow or OOM
  joins.BroadcastNestedLoopJoinExec(
    planLater(left), planLater(right), buildSide, joinType, condition) :: Nil
```
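To see why this fallback is expensive, consider a naive nested-loop Cartesian product over two in-memory sequences. This standalone sketch (not Spark code) enumerates all |left| × |right| pairs and only then applies the join condition:

```scala
// Naive nested-loop Cartesian join with an optional condition:
// O(|left| * |right|) comparisons, which is why Catalyst treats it as a last resort.
def cartesianJoin[A, B](left: Seq[A], right: Seq[B],
                        condition: (A, B) => Boolean): Seq[(A, B)] =
  for {
    l <- left   // outer loop over every left row
    r <- right  // inner loop over every right row
    if condition(l, r)
  } yield (l, r)
```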
Clearly, whether the final strategy ends up being a Cartesian product or BNLJ, both are very inefficient, and in practice they should be avoided whenever possible.
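Putting the equi-join branch together, the priority order walked through above can be condensed into one illustrative function. The boolean parameters are stand-ins for the hint, size, and Shuffle-Hash-Join eligibility checks described earlier; this is a summary sketch, not Spark's actual code path:

```scala
// Sketch of the equi-join strategy priority (illustrative only; the predicates
// stand in for the hint/size/config checks covered in this article).
def selectEquiJoinStrategy(canBroadcastByHints: Boolean,
                           canBroadcastBySizes: Boolean,
                           shuffleHashJoinEligible: Boolean,
                           keysOrderable: Boolean): String =
  if (canBroadcastByHints)      "BroadcastHashJoin (hint)"
  else if (canBroadcastBySizes) "BroadcastHashJoin (size)"
  else if (shuffleHashJoinEligible || !keysOrderable) "ShuffledHashJoin"
  else                          "SortMergeJoin"
```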