Spark Aggregate Pushdown: Approach and a Demo
Spark was originally expected to ship aggregate pushdown in version 2.3. For whatever reason it never made it into that release, but my work required optimizing Spark SQL starting from the aggregate functions, so I thought the problem through and implemented it myself.
A useful reference article
Someone very capable tried to implement aggregate pushdown on version 2.2 and submitted the code to Spark, but the pull request was rejected: the Spark developers replied that 2.3 would bring a new set of DataSource APIs (DataSource API v2) and asked him to stop pressing the patch, which left me rather speechless. Here is his post:
His pushdown was implemented at the physical-plan level, which is quite limiting. I borrowed his ideas but optimized at both the logical-plan and physical-plan levels; this article covers only the logical-plan pushdown. A pushdown ultimately has to reach the data source layer, yet Spark offers no DataSource interface for aggregating sources, so I built on the AggregatedFilteredScan interface implemented in the post just mentioned.
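For orientation, the sketch below shows a plausible shape for such an interface, modeled on Spark's built-in PrunedFilteredScan. The trait name comes from the referenced post, but the exact signature here is my assumption, not that post's verbatim code.

Scala sketch:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Hypothetical aggregate-aware scan contract, by analogy with
// org.apache.spark.sql.sources.PrunedFilteredScan.
trait AggregatedFilteredScan {
  // groupingColumns: the GROUP BY columns pushed into the source
  // aggregates: (function, column) pairs the source should evaluate, e.g. ("avg", "salary")
  // filters: ordinary pushed-down predicates, exactly as in PrunedFilteredScan
  def buildScan(groupingColumns: Array[String],
                aggregates: Array[(String, String)],
                filters: Array[Filter]): RDD[Row]
}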
The point of pushdown
Whether it is classic predicate pushdown or aggregate pushdown, the goal is the same: push work down into the data source layer so that the source returns far less data. Disk I/O and network traffic both shrink, and performance improves.
An approach to the hard part
The hardest case for aggregate pushdown, in my view, is a join in which the two ON columns belong to neither the GROUP BY columns nor the aggregated columns. At first I thought this case could not be pushed down at all, because it forces a join column that was never part of the GROUP BY into the grouping, which seemed bound to change the aggregation result, and it adds an extra aggregation. But after a hint from an expert I realized the pushdown is still possible, for two reasons:
1. Even with the extra aggregate node, the SQL still returns the correct answer; seen from the final result, the aggregation is unaffected.
2. In practice the two columns joined in the ON clause almost never have identical row counts; if they did, sound schema design says the two tables should have been merged into one. So the extra aggregate node still cuts down the data shipped out of the source.
Both points are visible in the example below; the sketch right after this list also demonstrates the first one.
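To make point 1 concrete, here is a minimal self-contained check. It is my own sketch (names like ExtraAggregateIsSafe are made up), and it uses SUM because SUM decomposes trivially; AVG needs the source to return partial sums and counts, as the demo output later in this post shows. Pre-aggregating one side by the join key before the join leaves the final answer unchanged.

Scala sketch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ExtraAggregateIsSafe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("extra-agg-check").getOrCreate()
    import spark.implicits._

    val emp  = Seq((10, 5000.0), (20, 800.0), (20, 3000.0), (30, 1600.0)).toDF("deptNo", "salary")
    val dept = Seq((10, "accounting"), (20, "research"), (30, "sales")).toDF("deptNo", "deptName")

    // Plain plan: join first, aggregate once.
    emp.join(dept, "deptNo").groupBy("deptName").agg(sum("salary")).show()

    // "Pushed-down" plan: an extra aggregate by the join key below the join,
    // then the original aggregate on top -- same totals, fewer rows joined.
    val pre = emp.groupBy("deptNo").agg(sum("salary").as("salary"))
    pre.join(dept, "deptNo").groupBy("deptName").agg(sum("salary")).show()

    spark.stop()
  }
}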
A simple way to push down through a join, and the results
Preparing the data source
The classic Scott sample data (the emp and dept tables) serves as the source.
Initial SQL and logical plan
SQL:
SELECT AVG(salary), deptName FROM emp JOIN dept ON emp.deptNo = dept.deptNo GROUP BY deptName;
LogicalPlan:
Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
+- Project [salary#6, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Pushing down step by step
Step 1: the cursor is at the top level. Push the Aggregate node below its child Project, and replace salary#6 in the Project's list with avg(salary)#19:
Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
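Step 1 can be expressed as a small Catalyst rule. The sketch below is my own illustration, not the actual rule from my implementation, and it handles only the simple case where the Project merely prunes columns:

Scala sketch:

import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

object PushAggregateThroughProject extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    // An Aggregate sitting directly on a column-pruning Project: drop the
    // Project, aggregate over its child, and re-expose the Aggregate's output
    // on top -- salary#6 in the old list becomes avg(salary)#19 in the new one.
    case agg @ Aggregate(_, _, Project(projectList, grandChild))
        if projectList.forall(_.isInstanceOf[Attribute]) =>
      Project(agg.output, agg.copy(child = grandChild))
  }
}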
Step 2: the Aggregate's child is now the Join, so generate a new id for the exprId (19) of avg(salary)#19; assume it becomes avg(salary)#20:
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
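The fresh id itself is free in Catalyst: every newly constructed Alias allocates a new, globally unique exprId, which is all this step relies on. A tiny illustration with a stand-in attribute (not the real plan's):

Scala sketch:

import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference}
import org.apache.spark.sql.types.DoubleType

object FreshExprId {
  def main(args: Array[String]): Unit = {
    // Stand-in for the avg(salary) output attribute in the plan above.
    val avgSalary = AttributeReference("avg(salary)", DoubleType)()
    // Re-aliasing mints a fresh exprId -- the mechanism behind
    // avg(salary)#19 turning into avg(salary)#20.
    val renumbered = Alias(avgSalary, "avg(salary)")()
    println(s"${avgSalary.exprId} -> ${renumbered.exprId}")
  }
}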
Step 3: search the Join's left and right child Projects for salary#6, and insert above the child that contains salary#6 a new Aggregate node whose GROUP BY is the join column; assume its newly generated exprId is 21:
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#21 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
      :  +- Project [deptNo#5, salary#6]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Step 4: as in Step 1, push the new Aggregate below the Project and replace salary#6 in the Project's list with salary#21:
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#21 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#21]
      :  +- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
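For completeness, the node inserted in Steps 3 and 4 can be built by hand with Catalyst constructors. This is a sketch against the Spark 2.x catalyst API; the object name is mine and the attributes are stand-ins for the ones resolved in the real plan:

Scala sketch:

import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, Cast}
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Average, Complete}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LocalRelation}
import org.apache.spark.sql.types.{DoubleType, FloatType, IntegerType}

object BuildPartialAggregate {
  // Stand-ins for deptNo#5 and salary#6 on the emp side of the join.
  val deptNo = AttributeReference("deptNo", IntegerType)()
  val salary = AttributeReference("salary", FloatType)()

  // avg(cast(salary as double)) AS salary -- the Alias mints a fresh exprId,
  // playing the role of salary#21 in the plans above.
  val avgSalary = Alias(
    AggregateExpression(Average(Cast(salary, DoubleType)), Complete, isDistinct = false),
    "salary")()

  // Aggregate [deptNo], [avg(...) AS salary, deptNo] over the emp-side subtree
  // (a placeholder LocalRelation stands in for the Filter/Relation children).
  val partialAgg = Aggregate(Seq(deptNo), Seq(avgSalary, deptNo), LocalRelation(deptNo, salary))
}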
At this point the aggregate-filter-relation combination matches the AggregatedFilteredScan interface mentioned above, and the data source's buildScan() method is invoked.
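On the source side, a toy implementation of that interface could look like the following. Again this is a sketch, not the post's actual TestAggregatePushdownPlan2Scan: it ignores the pushed filters and the requested functions for brevity and always returns one (deptNo, sum, count) row per group, matching the emp-side demo output shown below. Returning the count alongside the sum is what lets the engine compute AVG correctly instead of averaging averages.

Scala sketch:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.sources.Filter

// Toy emp-side source implementing the AggregatedFilteredScan sketched earlier.
class ToyEmpScan(spark: SparkSession) extends AggregatedFilteredScan {
  // Raw emp rows as (deptNo, salary).
  private val raw = Seq(
    (20, 800.0), (20, 3000.0), (20, 2975.0),
    (30, 1600.0), (30, 1250.0), (30, 2950.0),
    (10, 5000.0))

  override def buildScan(groupingColumns: Array[String],
                         aggregates: Array[(String, String)],
                         filters: Array[Filter]): RDD[Row] = {
    // One pre-aggregated row per deptNo: (deptNo, sum(salary), count).
    val rows = raw.groupBy(_._1).map { case (dept, rs) =>
      Row(dept, rs.map(_._2).sum, rs.size.toLong)
    }.toSeq
    spark.sparkContext.parallelize(rows)
  }
}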
Checking the result
This pushdown is equivalent to the following SQL and execution plan:
SQL:
SELECT AVG(avgsalary), deptName FROM (SELECT AVG(salary) AS avgsalary, deptNo FROM emp GROUP BY deptNo) a JOIN dept ON a.deptNo = dept.deptNo GROUP BY deptName
LogicalPlan:
Project [avg(avgsalary)#21, deptName#9]
+- Aggregate [deptName#9], [avg(avgsalary#2) AS avg(avgsalary)#21, deptName#9]
   +- Join Inner, (deptNo#6 = deptNo#8)
      :- Project [deptNo#6, avgsalary#2]
      :  +- Aggregate [deptNo#6], [avg(cast(salary#7 as double)) AS avgsalary#2, deptNo#6]
      :     +- Filter isnotnull(deptNo#6)
      :        +- Relation[empNo#3,empName#4,mgr#5,deptNo#6,salary#7] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#8, deptName#9]
         +- Filter isnotnull(deptNo#8)
            +- Relation[deptNo#8,deptName#9,loc#10] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Without aggregate pushdown, the data returned from the two sides of the join is:
data without aggregate function: List(ListBuffer(20, 800.0), ListBuffer(20, 3000.0), ListBuffer(20, 2975.0), ListBuffer(30, 1600.0), ListBuffer(30, 1250.0), ListBuffer(30, 2950.0), ListBuffer(10, 5000.0))
data without aggregate function: List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))
With the pushdown, the two sides return:
data with aggregate function: ListBuffer(List(30, 5800.0, 3), List(20, 6775.0, 3), List(10, 5000.0, 1))
data without aggregate function: List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))
Comparing the two, the rows fetched from one side of the source are clearly reduced, and that is with fewer than twenty rows in total; on large data sets the aggregate pushdown pays off far more.
Both queries return the same result:
+------------------+----------+
|avg(salary)       |deptName  |
+------------------+----------+
|1933.3333333333333|sales     |
|5000.0            |accounting|
|2258.3333333333335|research  |
+------------------+----------+
Author: orisonchan. Source: ITPUB blog, http://blog.itpub.net/1795/viewspace-2811958/