[Preface: If you regularly use Spark SQL for data processing and analysis, you are no doubt familiar with the damage a Cartesian product can do, such as monopolizing cluster resources so that other jobs cannot run normally, or even bringing nodes down. So which situations produce a Cartesian product, and how can you "predict" ahead of time that a SQL statement will produce one, so you can avoid it? (The discussion below excludes cases where the business genuinely requires a Cartesian product.)]
Typical scenarios that produce a Cartesian product in Spark SQL
First, let's look at several typical SQL statements that produce a Cartesian product in Spark SQL:
1. A join without an on condition
select * from test_partition1 join test_partition2;
2. A join with a non-equality condition
select * from test_partition1 t1 inner join test_partition2 t2 on t1.name <> t2.name;
3. A join whose on clause combines conditions with or
select * from test_partition1 t1 join test_partition2 t2 on t1.id = t2.id or t1.name = t2.name;
4. A join whose on clause combines conditions with ||
select * from test_partition1 t1 join test_partition2 t2 on t1.id = t2.id || t1.name = t2.name;
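To see why these patterns are dangerous: the result of a Cartesian product is every left row paired with every right row, i.e. |left| × |right| rows. A minimal illustration in plain Python (the toy lists stand in for test_partition1 and test_partition2; this is not Spark code):

```python
from itertools import product

# Toy stand-ins for the two tables in the examples above
test_partition1 = [("1", "a"), ("2", "b"), ("3", "c")]
test_partition2 = [("1", "x"), ("2", "y")]

# A join with no on condition pairs every left row with every right row
cartesian = list(product(test_partition1, test_partition2))
print(len(cartesian))  # 3 * 2 = 6 rows
```

At realistic table sizes the blow-up is catastrophic: joining two 10-million-row tables this way would yield 10^14 result rows.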
Beyond the typical examples above, Cartesian products arise for many different reasons in real business development.
Note also that a SQL statement matching one of the four patterns above does not necessarily produce a Cartesian product. For example, the following SQL specifies a non-equality join condition yet does not produce one:
-- During Spark SQL's internal join-strategy selection, this is ultimately handled by SortMergeJoin.
select * from test_partition1 t1 join test_partition2 t2 on t1.id = t2.id and t1.name <> t2.name;
Likewise, writing cross join explicitly in SQL does not necessarily produce a Cartesian product. For example:
-- Spark SQL's internal optimization chooses SortMergeJoin here
select * from test_partition1 t1 cross join test_partition2 t2 on t1.id = t2.id;
However, a cross join without an on condition does still produce a Cartesian product. So how do you determine whether a given SQL statement has produced one?
Did a Spark SQL query produce a Cartesian product?
Take as an example the Cartesian-product case of a join without an on condition:
-- test_partition1 and test_partition2 are Hive partitioned tables
select * from test_partition1 join test_partition2;
View the execution graph for this SQL under the SQL tab of the Spark UI.
There you can see that, because the join specifies no on condition, a CartesianProduct, i.e. a Cartesian product, is produced.
Now look at the logical and physical plans for this join:
== Parsed Logical Plan ==
'GlobalLimit 1000
+- 'LocalLimit 1000
   +- 'Project [*]
      +- 'UnresolvedRelation `t`

== Analyzed Logical Plan ==
id: string, name: string, dt: string, id: string, name: string, dt: string
GlobalLimit 1000
+- LocalLimit 1000
   +- Project [id#84, name#85, dt#86, id#87, name#88, dt#89]
      +- SubqueryAlias `t`
         +- Project [id#84, name#85, dt#86, id#87, name#88, dt#89]
            +- Join Inner
               :- SubqueryAlias `default`.`test_partition1`
               :  +- HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#84, name#85], [dt#86]
               +- SubqueryAlias `default`.`test_partition2`
                  +- HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#87, name#88], [dt#89]

== Optimized Logical Plan ==
GlobalLimit 1000
+- LocalLimit 1000
   +- Join Inner
      :- HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#84, name#85], [dt#86]
      +- HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#87, name#88], [dt#89]

== Physical Plan ==
CollectLimit 1000
+- CartesianProduct
   :- Scan hive default.test_partition1 [id#84, name#85, dt#86], HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#84, name#85], [dt#86]
   +- Scan hive default.test_partition2 [id#87, name#88, dt#89], HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#87, name#88], [dt#89]
Following the logical plans through to the physical plan, where CartesianProduct is ultimately chosen, we can confirm that this SQL does indeed produce a Cartesian product.
Strategies for handling Cartesian products in Spark SQL
As covered in the earlier article "How Spark SQL chooses a join strategy", Spark SQL's join strategies fall into two groups: ExtractEquiJoinKeys (Broadcast Hash Join, Shuffle Hash Join, Sort Merge Join, the three Spark SQL joins we know best) and "without joining keys" strategies (CartesianProduct, BroadcastNestedLoopJoin).
With that, the question of how to judge whether a SQL statement produces a Cartesian product is readily answered.
1. When running a SQL job with Spark SQL, inspect the SQL's execution graph to see whether a Cartesian product is produced. If it is, kill the job and optimize it to avoid the Cartesian product. [Not recommended. The user has to open the Spark UI, find the execution graph, and understand the UI's features, which demands a certain level of expertise. (Note: the reason for saying this is that Spark SQL is a compute engine serving users in different roles; a user may know SQL well without understanding Spark itself in depth. Anyone who builds data platforms will know this feeling well.)]
2. Analyze Spark SQL's logical and physical plans, parsing the plans programmatically to infer whether the SQL ultimately chose the Cartesian-product execution strategy. If it did, flag the risk promptly. For reference, here is the Spark SQL join-strategy selection source:
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

  // --- BroadcastHashJoin --------------------------------------------------------------------

  // broadcast hints were specified
  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if canBroadcastByHints(joinType, left, right) =>
    val buildSide = broadcastSideByHints(joinType, left, right)
    Seq(joins.BroadcastHashJoinExec(
      leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

  // broadcast hints were not specified, so need to infer it from size and configuration.
  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if canBroadcastBySizes(joinType, left, right) =>
    val buildSide = broadcastSideBySizes(joinType, left, right)
    Seq(joins.BroadcastHashJoinExec(
      leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

  // --- ShuffledHashJoin ---------------------------------------------------------------------

  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
      && muchSmaller(right, left) || !RowOrdering.isOrderable(leftKeys) =>
    Seq(joins.ShuffledHashJoinExec(
      leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))

  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if !conf.preferSortMergeJoin && canBuildLeft(joinType) && canBuildLocalHashMap(left)
      && muchSmaller(left, right) || !RowOrdering.isOrderable(leftKeys) =>
    Seq(joins.ShuffledHashJoinExec(
      leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))

  // --- SortMergeJoin ------------------------------------------------------------

  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if RowOrdering.isOrderable(leftKeys) =>
    joins.SortMergeJoinExec(
      leftKeys, rightKeys, joinType, condition, planLater(left), planLater(right)) :: Nil

  // --- Without joining keys ------------------------------------------------------------

  // Pick BroadcastNestedLoopJoin if one side could be broadcast
  case j @ logical.Join(left, right, joinType, condition)
    if canBroadcastByHints(joinType, left, right) =>
    val buildSide = broadcastSideByHints(joinType, left, right)
    joins.BroadcastNestedLoopJoinExec(
      planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

  case j @ logical.Join(left, right, joinType, condition)
    if canBroadcastBySizes(joinType, left, right) =>
    val buildSide = broadcastSideBySizes(joinType, left, right)
    joins.BroadcastNestedLoopJoinExec(
      planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

  // Pick CartesianProduct for InnerJoin
  case logical.Join(left, right, _: InnerLike, condition) =>
    joins.CartesianProductExec(planLater(left), planLater(right), condition) :: Nil

  case logical.Join(left, right, joinType, condition) =>
    val buildSide = broadcastSide(
      left.stats.hints.broadcast, right.stats.hints.broadcast, left, right)
    // This join could be very slow or OOM
    joins.BroadcastNestedLoopJoinExec(
      planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

  // --- Cases where this strategy does not apply ---------------------------------------------

  case _ => Nil
}
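The programmatic check suggested in point 2 does not have to touch Spark internals: the physical-plan text (as printed by EXPLAIN, or obtained from the query execution in PySpark) can simply be scanned for the risky operators. A minimal sketch; the `warn_on_risky_join` helper and the sample plan string are illustrative, not a Spark API:

```python
# Operators whose presence in a physical plan signals a risky join
RISKY_OPERATORS = ("CartesianProduct", "BroadcastNestedLoopJoin")

def warn_on_risky_join(physical_plan: str) -> list:
    """Return the risky join operators found in a physical-plan string."""
    return [op for op in RISKY_OPERATORS if op in physical_plan]

# Sample physical-plan text, mirroring the plan shown earlier
plan = """CollectLimit 1000
+- CartesianProduct
   :- Scan hive default.test_partition1 [id#84, name#85, dt#86]
   +- Scan hive default.test_partition2 [id#87, name#88, dt#89]"""

print(warn_on_risky_join(plan))  # ['CartesianProduct']
```

A platform can run such a check against the plan before submitting the job and surface a warning to the SQL author, rather than expecting them to read the Spark UI.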
In addition, keep summarizing the situations that produce Cartesian products as they come up in business development, and collect them into a knowledge document so that similar cases can be avoided in later work.
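Spark itself also provides a guard worth recording in such a document. In Spark 2.x, with the default configuration, a query whose plan would require an implicit Cartesian product fails at analysis time with "Detected implicit cartesian product", unless the author either writes an explicit cross join or opts in via the spark.sql.crossJoin.enabled setting (in Spark 3.x this setting defaults to enabled, so the check no longer fires by default). A sketch assuming Spark 2.x defaults:

```sql
-- With the default spark.sql.crossJoin.enabled=false (Spark 2.x), this fails with
-- "Detected implicit cartesian product for INNER join between logical plans":
select * from test_partition1 join test_partition2;

-- Only after explicitly opting in does it run (and produce the Cartesian product):
set spark.sql.crossJoin.enabled=true;
select * from test_partition1 join test_partition2;
```

Leaving this setting at its restrictive value is the cheapest form of the "early warning" discussed above.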
Besides the inefficiency of Cartesian products, BroadcastNestedLoopJoin is also relatively inefficient; in particular, with large data volumes it can easily cause an OOM on the driver side, and this situation likewise must be strenuously avoided.
Recommended articles:
Why Not in Subquery is inefficient in Spark SQL and how to work around it
Spark SQL and Hive metastore Parquet conversion
How Spark SQL resolves partition columns and filter conditions when querying Parquet-format Hive tables