Spark Aggregate Pushdown: Approach and Demo
Spark originally planned to ship aggregate pushdown in version 2.3. For whatever reason it never landed there, but since my work required optimizing Spark SQL around aggregate functions, I thought the problem through and implemented it myself.
A useful reference article
Someone sharp had already tried to implement aggregate pushdown on version 2.2 and submitted the code to Spark, only to have the pull request rejected: the Spark maintainers said they would ship a new DataSource API (DataSource API v2) in 2.3 and told him not to be so attached to getting this code merged, which left me speechless... Here is his blog post:
His pushdown was implemented on the physical plan, which is rather limiting. So I borrowed his approach but applied optimizations on both the logical plan and the physical plan; this post covers only the logical-plan pushdown. Pushing down ultimately means pushing work all the way to the data source layer, and Spark offers no DataSource interface for aggregated scans, so I followed the AggregatedFilteredScan interface implemented in the post just shared, and based my own work on the same idea.
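For reference, here is a minimal sketch of what such an interface could look like, patterned after Spark's built-in org.apache.spark.sql.sources.PrunedFilteredScan. The trait name comes from the referenced post, but the parameter names and the (function, column) encoding below are my own assumptions, not the original author's exact signature:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Hypothetical aggregate-aware scan interface, modeled on PrunedFilteredScan.
// Parameter names are illustrative assumptions.
trait AggregatedFilteredScan {
  // groupingColumns: the GROUP BY columns pushed down to the source
  // aggregates: (function, column) pairs, e.g. ("avg", "salary")
  // filters: the usual pushed-down predicates
  def buildScan(groupingColumns: Array[String],
                aggregates: Array[(String, String)],
                filters: Array[Filter]): RDD[Row]
}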
Why push down
Whether it is classic predicate pushdown or aggregate pushdown, the point is the same: push some operations down into the data source layer so that far less data comes back from the source. Disk I/O and network traffic both drop, and performance improves.
The hard part and how to approach it
The biggest difficulty in aggregate pushdown, in my view, is the join: what do you do when the two columns in the join's ON clause belong to neither the GROUP BY columns nor the aggregate columns? At first I thought pushdown was simply impossible in this case, because it forces you to add a join column that does not belong to the GROUP BY into the grouping, which would seemingly distort the aggregation and cost an extra aggregation step. But after a pointer from an expert, it turns out this case can be pushed down too, for two reasons:
1. Even with one extra aggregate node, the query still executes correctly; judging from the final output, the extra aggregation does not affect the result.
2. In practice the two columns in a join's ON clause will rarely match row for row; if they did, proper database design says the two tables should be merged into one. So the extra aggregate node, grouped by the join key, still reduces the amount of data shipped from the data source.
Both points show up in the example later on; the quick numeric check right after this list illustrates the first one.
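As a sanity check of reason 1, here is the arithmetic on the demo data used below (dept 30, i.e. sales, has salaries 1600, 1250 and 2950). Note that the equality holds here because each deptName maps to exactly one deptNo; for AVG in general the pushed-down source also needs to return row counts, which the demo source does, as the scan output later shows:

// Reason 1, checked numerically on the dept-30 salaries from the demo data.
val sales = Seq(1600.0, 1250.0, 2950.0)
val direct   = sales.sum / sales.size        // AVG without pushdown: 1933.33...
val partials = Seq(sales.sum / sales.size)   // one partial AVG per deptNo
val twoStep  = partials.sum / partials.size  // AVG over the partials: 1933.33...
assert(math.abs(direct - twoStep) < 1e-9)    // identical result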
A simple approach to pushing down through a join, and the results
Preparing the data source
The demo uses the classic Scott schema as its data source.
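The demo wires emp and dept through a custom relation (TestAggregatePushdownPlan2Scan, visible in the plans below). As a rough stand-in so the SQL in this post can be reproduced, the same tables can be registered as plain temp views; the deptNo/salary values and the dept rows are taken from the scan output shown later, while empNo/empName/mgr/loc are filler:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("agg-pushdown-demo").getOrCreate()
import spark.implicits._

// emp(empNo, empName, mgr, deptNo, salary): deptNo/salary match the demo
// output further down; the first three columns are filler values.
Seq((1, "e1", 0, 20, 800.0f), (2, "e2", 0, 20, 3000.0f), (3, "e3", 0, 20, 2975.0f),
    (4, "e4", 0, 30, 1600.0f), (5, "e5", 0, 30, 1250.0f), (6, "e6", 0, 30, 2950.0f),
    (7, "e7", 0, 10, 5000.0f))
  .toDF("empNo", "empName", "mgr", "deptNo", "salary").createOrReplaceTempView("emp")

// dept(deptNo, deptName, loc): loc is filler.
Seq((10, "accounting", "l1"), (20, "research", "l2"),
    (30, "sales", "l3"), (40, "operations", "l4"))
  .toDF("deptNo", "deptName", "loc").createOrReplaceTempView("dept")

Running the query below against these views reproduces the result table at the end of the post.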
Initial SQL and logical plan
SQL:
SELECT AVG(salary), deptName FROM emp JOIN dept ON emp.deptNo = dept.deptNo GROUP BY deptName;
LogicalPlan:
Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
+- Project [salary#6, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Pushing down step by step
Step 1: the pointer starts at the top. Push the Aggregate node below its child Project, replacing salary#6 in that Project with avg(salary)#19:
Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Step 2: the pointer is now at the Aggregate that was just pushed down one level, and its child is the Join. Regenerate the exprId (19) of avg(salary)#19; assume it becomes avg(salary)#20:
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Step 3: search the Join's left and right child Projects for salary#6, and above the child that contains salary#6 insert a new Aggregate node grouped by the join column; assume the newly generated exprId is 21:
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#21 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
      :  +- Project [deptNo#5, salary#6]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Step 4: as in Step 1, push this new Aggregate below the Project, replacing salary#6 in the Project with salary#21 (the parent Aggregate reads salary#21 accordingly):
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#21 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#21]
      :  +- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
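To make the walkthrough concrete, here is a deliberately narrow sketch of Steps 2-4 as a Catalyst optimizer rule, written against the Spark 2.x API. It is my own reconstruction, not the actual code behind this post, and it handles only the exact shape above: an Aggregate sitting directly on an inner equi-join, with every aggregate input coming from the left child and every grouping column from the right. Pushing through intermediate Projects, the mirrored right-side case, multi-key joins, and the parent exprId regeneration from Step 2 are all elided:

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object PushAggregateThroughJoin extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // The !left.isInstanceOf[Aggregate] guard keeps the rule from firing
    // again on its own output in a fixed-point optimizer batch.
    case agg @ Aggregate(groupExprs, aggExprs,
        join @ Join(left, right, Inner,
          Some(EqualTo(lk: AttributeReference, _: AttributeReference))))
        if !left.isInstanceOf[Aggregate] =>
      val aggCalls  = aggExprs.collect { case a @ Alias(_: AggregateExpression, _) => a }
      val aggInputs = AttributeSet(aggCalls.flatMap(_.references))
      val pushable  = aggCalls.size == 1 &&
        aggInputs.subsetOf(left.outputSet) &&   // aggregated column comes from the left
        left.outputSet.contains(lk) &&          // so does the left join key
        groupExprs.forall(_.references.subsetOf(right.outputSet)) // grouping from the right
      if (pushable) {
        val call = aggCalls.head
        // Step 3: partial aggregate over the left child, grouped by the join
        // key. Alias(...)() mints a fresh exprId ("assume the new id is 21").
        val partialAlias = Alias(call.child, call.name)()
        val partial      = Aggregate(Seq(lk), Seq(partialAlias, lk), left)
        // Step 4: the parent aggregate now reads the pre-aggregated column
        // (salary#6 is replaced by the salary#21-style attribute).
        val rewritten = aggExprs.map(_.transformUp {
          case a: AttributeReference if aggInputs.contains(a) => partialAlias.toAttribute
        }.asInstanceOf[NamedExpression])
        Aggregate(groupExprs, rewritten, join.copy(left = partial))
      } else agg
  }
}

A rule like this can be tried out without rebuilding Spark via spark.experimental.extraOptimizations = Seq(PushAggregateThroughJoin).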
At this point the aggregate-filter-relation combination matches the AggregatedFilteredScan interface mentioned above, and the data source's buildScan() method gets called.
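The post does not show the internals of TestAggregatePushdownPlan2Scan, but judging from the output below, the source answers an AVG pushdown with one row per group carrying (key, sum, count). A toy version of that grouping might look like this (pure illustration, not the original code):

// Pre-aggregates (deptNo, salary) rows into (deptNo, sum, count) tuples,
// mirroring the "data with aggregate function" output shown below.
def preAggregate(rows: Seq[(Int, Float)]): Seq[(Int, Double, Long)] =
  rows.groupBy(_._1).toSeq.map { case (deptNo, group) =>
    (deptNo, group.map(_._2.toDouble).sum, group.size.toLong)
  }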
Checking the results
This pushdown is effectively equivalent to the following SQL and execution plan:
SQL:
SELECT AVG(avgsalary), deptName FROM (SELECT AVG(salary) AS avgsalary, deptNo FROM emp GROUP BY deptNo) a JOIN dept ON a.deptNo = dept.deptNo GROUP BY deptName
LogicalPlan:
Project [avg(avgsalary)#21, deptName#9]
+- Aggregate [deptName#9], [avg(avgsalary#2) AS avg(avgsalary)#21, deptName#9]
   +- Join Inner, (deptNo#6 = deptNo#8)
      :- Project [deptNo#6, avgsalary#2]
      :  +- Aggregate [deptNo#6], [avg(cast(salary#7 as double)) AS avgsalary#2, deptNo#6]
      :     +- Filter isnotnull(deptNo#6)
      :        +- Relation[empNo#3,empName#4,mgr#5,deptNo#6,salary#7] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#8, deptName#9]
         +- Filter isnotnull(deptNo#8)
            +- Relation[deptNo#8,deptName#9,loc#10] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Without aggregate pushdown, the data returned from the two sides of the join is:
data without aggregate function: List(ListBuffer(20, 800.0), ListBuffer(20, 3000.0), ListBuffer(20, 2975.0), ListBuffer(30, 1600.0), ListBuffer(30, 1250.0), ListBuffer(30, 2950.0), ListBuffer(10, 5000.0))
data without aggregate function: List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))
With pushdown, the data returned from the two sides is:
data with aggregate function: ListBuffer(List(30, 5800.0, 3), List(20, 6775.0, 3), List(10, 5000.0, 1))
data without aggregate function: List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))
As you can see, the pushed-down scan returns noticeably less data on one side, and that is with fewer than 20 rows in total; on large data volumes the effect of aggregate pushdown is far more pronounced. Note the extra columns on the aggregated side: each row evidently carries the per-group sum and row count (e.g. 5800.0 and 3 for deptNo 30), so the upper AVG can still be computed exactly.
Both queries return the same result:
+------------------+----------+
|avg(salary)       |deptName  |
+------------------+----------+
|1933.3333333333333|sales     |
|5000.0            |accounting|
|2258.3333333333335|research  |
+------------------+----------+
Author: orisonchan (source: http://blog.itpub.net/1795/viewspace-2811958/)