Typical scenarios that produce Cartesian products in Spark SQL, and strategies for handling them

Published by 大資料學習與分享 (Big Data Learning & Sharing) on 2021-03-16

[Preface: If you regularly use Spark SQL for data processing and analysis, you are no doubt familiar with how harmful Cartesian products can be, for example hogging cluster resources so that other jobs cannot run normally, or even crashing nodes. So which situations produce Cartesian products, and how can you "predict" in advance that a SQL statement will produce one and avoid it? (The discussion below sets aside cases where the business genuinely requires a Cartesian product.)]


Typical scenarios in which Spark SQL produces Cartesian products

First, let's look at several typical SQL statements that produce Cartesian products in Spark SQL:

1. A join with no ON condition

select * from test_partition1 join test_partition2;
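To see why this is dangerous: a join with no ON condition pairs every row of one table with every row of the other, so the output has |T1| × |T2| rows. A minimal illustration in plain Python (the toy tables and their contents are invented for the example):

```python
from itertools import product

# Two toy "tables"; a Cartesian join pairs every row with every row.
t1 = [("1", "a"), ("2", "b"), ("3", "c")]              # 3 rows
t2 = [("1", "x"), ("2", "y"), ("3", "z"), ("4", "w")]  # 4 rows

rows = list(product(t1, t2))
print(len(rows))  # 3 * 4 = 12 rows
```

The output size is the product of the input sizes, so two million-row tables would yield 10^12 rows, which is exactly why an unintended Cartesian product can starve a cluster.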

 

2. A join with a non-equi join condition

select * from test_partition1 t1 inner join test_partition2 t2 on t1.name <> t2.name;

3. A join whose ON clause uses OR to combine conditions

select * from test_partition1 t1 join test_partition2 t2 on t1.id = t2.id or t1.name = t2.name;

 

4. A join whose ON clause uses || to combine conditions

select * from test_partition1 t1 join test_partition2 t2 on t1.id = t2.id || t1.name = t2.name;

 

Beyond the typical examples above, the causes of Cartesian products in real business development are varied.
Note, too, that a SQL statement matching one of the four patterns above does not necessarily produce a Cartesian product. For example, the following SQL specifies a non-equi condition in its join yet does not produce one:

-- During Spark SQL's internal join strategy selection, this query ends up being executed with SortMergeJoin.
select * from test_partition1 t1 join test_partition2 t2 on t1.id = t2.id and t1.name <> t2.name;
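The reason is that Spark's ExtractEquiJoinKeys pattern splits a conjunctive (AND-ed) join condition into equality keys and leftover predicates: as long as at least one equality between the two sides can be pulled out, an equi-join such as SortMergeJoin is possible, and the inequality is evaluated as an extra filter on the matched rows. A simplified conceptual sketch (the tuple-based condition encoding is invented for illustration, not Spark's actual representation):

```python
# Simplified sketch of what ExtractEquiJoinKeys does conceptually: split a
# conjunctive join condition into equi-join keys and leftover predicates.
def split_condition(conjuncts):
    """conjuncts: list of ("eq" | "neq", left_col, right_col) AND-ed together."""
    equi_keys = [(l, r) for op, l, r in conjuncts if op == "eq"]
    others = [(op, l, r) for op, l, r in conjuncts if op != "eq"]
    return equi_keys, others

# ON t1.id = t2.id AND t1.name <> t2.name -> one equi key is extracted, so
# SortMergeJoin applies; the inequality becomes a post-match filter.
keys, rest = split_condition([("eq", "t1.id", "t2.id"),
                              ("neq", "t1.name", "t2.name")])
print(keys)  # [('t1.id', 't2.id')]
print(rest)  # [('neq', 't1.name', 't2.name')]
```

By contrast, `ON t1.id = t2.id OR t1.name = t2.name` is a single disjunction, not a list of AND-ed conjuncts, so no equi key can be extracted and Spark falls back to a strategy without joining keys.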

 

Similarly, writing cross join explicitly in SQL does not necessarily produce a Cartesian product either. For example:

-- Spark SQL's internal optimization chooses SortMergeJoin for this query
select * from test_partition1 t1 cross join test_partition2 t2 on t1.id = t2.id;

 

However, a cross join without an ON condition will likewise produce a Cartesian product. So how do you tell whether a SQL statement has produced one?
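As a related guardrail (an addition to the original article, stated to the best of my knowledge of Spark's configuration): Spark 2.x refuses implicit Cartesian products by default, failing the query with an AnalysisException that suggests using explicit CROSS JOIN syntax, unless the following flag is enabled; in Spark 3.0 its default changed to true, so the error no longer fires out of the box.

```sql
-- Spark 2.x default is false: implicit Cartesian products fail at analysis
-- time unless this is set to true (Spark 3.0 changed the default to true).
SET spark.sql.crossJoin.enabled=true;
```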

Did a Spark SQL query produce a Cartesian product?

Take the join-without-ON-condition SQL as an example:

-- test_partition1 and test_partition2 are Hive partitioned tables
select * from test_partition1 join test_partition2;

Viewing the execution graph for this SQL in the SQL tab of the Spark UI (screenshot not reproduced here), you can see that, because the join statement specifies no ON condition, the plan contains a CartesianProduct node, i.e. a Cartesian product.

Now look at the logical and physical plans for this join:

 

== Parsed Logical Plan ==
'GlobalLimit 1000
+- 'LocalLimit 1000
   +- 'Project [*]
      +- 'UnresolvedRelation `t`

== Analyzed Logical Plan ==
id: string, name: string, dt: string, id: string, name: string, dt: string
GlobalLimit 1000
+- LocalLimit 1000
   +- Project [id#84, name#85, dt#86, id#87, name#88, dt#89]
      +- SubqueryAlias `t`
         +- Project [id#84, name#85, dt#86, id#87, name#88, dt#89]
            +- Join Inner
               :- SubqueryAlias `default`.`test_partition1`
               :  +- HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#84, name#85], [dt#86]
               +- SubqueryAlias `default`.`test_partition2`
                  +- HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#87, name#88], [dt#89]

== Optimized Logical Plan ==
GlobalLimit 1000
+- LocalLimit 1000
   +- Join Inner
      :- HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#84, name#85], [dt#86]
      +- HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#87, name#88], [dt#89]

== Physical Plan ==
CollectLimit 1000
+- CartesianProduct
   :- Scan hive default.test_partition1 [id#84, name#85, dt#86], HiveTableRelation `default`.`test_partition1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#84, name#85], [dt#86]
   +- Scan hive default.test_partition2 [id#87, name#88, dt#89], HiveTableRelation `default`.`test_partition2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [id#87, name#88], [dt#89]

 

Following the progression from logical plan to physical plan, and seeing that the final physical plan chose CartesianProduct, we can conclude that this SQL did indeed produce a Cartesian product.

Strategies for handling Cartesian products in Spark SQL

As introduced in the earlier article "How Spark SQL chooses a join strategy" (《Spark SQL如何選擇join策略》), Spark SQL's join strategies fall into ExtractEquiJoinKeys (Broadcast Hash Join, Shuffle Hash Join, and Sort Merge Join, the three Spark SQL joins we know best) and joins without joining keys (CartesianProduct and BroadcastNestedLoopJoin).

With that, determining whether a SQL statement produces a Cartesian product becomes straightforward.

1. When running a SQL job with Spark SQL, inspect the SQL's execution graph to see whether a Cartesian product occurs. If it does, kill the job and optimize the query to avoid the Cartesian product. [Not recommended: the user has to go to the Spark UI to inspect the execution graph and must understand the UI's features, which demands a certain amount of expertise. (Note: the reason for saying this is that Spark SQL is a compute engine serving users in different roles; a user may know SQL well without understanding Spark itself thoroughly. Those who build data platforms will surely relate.)]

2. Analyze the SQL's logical and physical plans, and programmatically parse the plans to infer whether the query ultimately chose a Cartesian product execution strategy; if so, surface the risk promptly. For reference, here is the Spark SQL join-strategy-selection source code:

 

def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

  // --- BroadcastHashJoin --------------------------------------------------------------------

  // broadcast hints were specified
  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if canBroadcastByHints(joinType, left, right) =>
    val buildSide = broadcastSideByHints(joinType, left, right)
    Seq(joins.BroadcastHashJoinExec(
      leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

  // broadcast hints were not specified, so need to infer it from size and configuration.
  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if canBroadcastBySizes(joinType, left, right) =>
    val buildSide = broadcastSideBySizes(joinType, left, right)
    Seq(joins.BroadcastHashJoinExec(
      leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

  // --- ShuffledHashJoin ---------------------------------------------------------------------
  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
      && muchSmaller(right, left) ||
      !RowOrdering.isOrderable(leftKeys) =>
    Seq(joins.ShuffledHashJoinExec(
      leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))

  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if !conf.preferSortMergeJoin && canBuildLeft(joinType) && canBuildLocalHashMap(left)
      && muchSmaller(left, right) ||
      !RowOrdering.isOrderable(leftKeys) =>
    Seq(joins.ShuffledHashJoinExec(
      leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))

  // --- SortMergeJoin ------------------------------------------------------------
  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
    if RowOrdering.isOrderable(leftKeys) =>
    joins.SortMergeJoinExec(
      leftKeys, rightKeys, joinType, condition, planLater(left), planLater(right)) :: Nil

  // --- Without joining keys ------------------------------------------------------------

  // Pick BroadcastNestedLoopJoin if one side could be broadcast
  case j @ logical.Join(left, right, joinType, condition)
    if canBroadcastByHints(joinType, left, right) =>
    val buildSide = broadcastSideByHints(joinType, left, right)
    joins.BroadcastNestedLoopJoinExec(
      planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

  case j @ logical.Join(left, right, joinType, condition)
    if canBroadcastBySizes(joinType, left, right) =>
    val buildSide = broadcastSideBySizes(joinType, left, right)
    joins.BroadcastNestedLoopJoinExec(
      planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

  // Pick CartesianProduct for InnerJoin
  case logical.Join(left, right, _: InnerLike, condition) =>
    joins.CartesianProductExec(planLater(left), planLater(right), condition) :: Nil

  case logical.Join(left, right, joinType, condition) =>
    val buildSide = broadcastSide(
      left.stats.hints.broadcast, right.stats.hints.broadcast, left, right)
    // This join could be very slow or OOM
    joins.BroadcastNestedLoopJoinExec(
      planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

  // --- Cases where this strategy does not apply ---------------------------------------------
  case _ => Nil
}
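Short of parsing the plan objects themselves, a cruder but practical check is to scan the formatted physical plan text (for example, the output of an EXPLAIN statement) for the risky operator names. A minimal sketch, assuming you already have the plan text as a string; the helper name `risky_join_operators` is made up for this example:

```python
# Operator names as Spark prints them for the two join strategies without
# joining keys; both signal a potentially expensive plan.
RISKY_OPERATORS = ("CartesianProduct", "BroadcastNestedLoopJoin")

def risky_join_operators(physical_plan: str):
    """Return the risky join operators that appear in a physical plan string."""
    return [op for op in RISKY_OPERATORS if op in physical_plan]

plan = """== Physical Plan ==
CollectLimit 1000
+- CartesianProduct
   :- Scan hive default.test_partition1 [id#84, name#85, dt#86]
   +- Scan hive default.test_partition2 [id#87, name#88, dt#89]"""

print(risky_join_operators(plan))  # ['CartesianProduct'] -> warn before running
```

A platform could run this check against EXPLAIN output before submitting the job, surfacing the risk to users who know SQL but not the Spark UI.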

 

Additionally, in day-to-day business development, keep summarizing the situations that produce Cartesian products and turn them into a knowledge document, so that similar situations can be avoided in later work.

Besides the inefficiency of Cartesian products, BroadcastNestedLoopJoin is also relatively inefficient; in particular, with large data volumes it can easily cause an OOM on the driver side, so it too should be avoided wherever possible.

Recommended articles:

Why NOT IN subqueries are inefficient in Spark SQL and how to work around them

How Spark SQL chooses a join strategy

Parquet conversion between Spark SQL and the Hive metastore

Handling the small-files problem in Spark SQL

How Spark SQL resolves partition columns and query predicates when querying Parquet-format Hive tables

Spark SQL | currently one of the most active components in the Spark community


Follow the WeChat official account 大資料學習與分享 for more technical content.

 
