Apache Spark SQL的高階Join連線技術
讓我們有以下輸入樣本資料:
people.txt -> saved at HDFS under /tests/people.txt
1,Jerry,man,USA 2,Cathy,female,GBR 3,Teresa,female,GBR 4,Rut,female,USA 5,Roasie,female,AUS 6,Tomas,man,CZE |
peoples orders -> saved at HDFS under /tests/orders.txt
1,pizza,2500 1,beer,50 1,food,3000 2,pizza,3000 2,drink,24000 2,vine,40000 3,pizza,50 3,white-wine,4000 3,bread,20 4,pizza,4000 4,food,3000 4,vodka,30 5,cola,4000 5,bread,50000 5,sangria,300 |
讓我們訓練一下Apache Spark SQL
1.從組中取最大行。
任務:為people.txt中的每一行獲取最昂貴的訂單。結果格式:name, country, price.
case class people(id: Int, name: String, gender: String, country: String); case class order(id: Int, item: String, price: Long) val peopleDF = sc.textFile("/tests/people.txt").map( line=>new people(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2), line.split(",")(3))).toDF(); val orderDF = sc.textFile("/tests/orders.txt").map( line=>new order(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2).toLong) ).toDF(); peopleDF.registerTempTable("people"); orderDF.registerTempTable("order"); val resultDF = sqlContext.sql(" select p.name, p.country, gr.max_p FROM people p INNER JOIN "+ " (select o.id, max(o.price) as max_p FROM order o GROUP BY o.id) gr ON gr.id = p.id " ); resultDF.show(); |
Zeppelin output:
defined class people defined class order peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gender: string, country: string] orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint] resultDF: org.apache.spark.sql.DataFrame = [name: string, country: string, max_p: bigint] +------+-------+-----+ | name|country|max_p| +------+-------+-----+ | Jerry| USA| 3000| | Cathy| GBR|40000| |Teresa| GBR| 4000| | Rut| USA| 4000| |Roasie| AUS|50000| +------+-------+-----+ |
這很簡單,因為我們不需要任何額外的資料來獲得最大價格訂單。
2. LEFT JOIN測試
任務:從people.txt獲取所有沒有訂購的行。
case class people(id: Int, name: String, gendre: String, country: String); case class order(id: Int, item: String, price: Long) val peopleDF = sc.textFile("/tests/people.txt").map( line=>new people(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2), line.split(",")(3))).toDF(); val orderDF = sc.textFile("/tests/orders.txt").map( line=>new order(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2).toLong) ).toDF(); peopleDF.registerTempTable("people"); orderDF.registerTempTable("order"); val resultDF = sqlContext.sql(" select p.name, p.country from people p LEFT JOIN order o ON p.id = o.id WHERE o.id is null "); resultDF.show(); |
Zeppelin output:
defined class people defined class order peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gendre: string, country: string] orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint] resultDF: org.apache.spark.sql.DataFrame = [name: string, country: string] +-----+-------+ | name|country| +-----+-------+ |Tomas| CZE| +-----+-------+ |
經典LEFT JOIN的東西,從左邊的表得到一切,無需引用右表。
3. 從GROUP中獲取前N行
現在好玩的開始了!
任務:
在people.txt給我前兩個最昂貴的訂單每一行。
import org.apache.spark.sql.expressions.Window; import org.apache.spark.sql.functions._; case class people(id: Int, name: String, gendre: String, country: String); case class order(id: Int, item: String, price: Long) val peopleDF = sc.textFile("/tests/people.txt").map( line=>new people(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2), line.split(",")(3))).toDF(); val orderDF = sc.textFile("/tests/orders.txt").map( line=>new order(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2).toLong) ).toDF(); peopleDF.registerTempTable("people"); orderDF.registerTempTable("order"); val window = Window.partitionBy("id").orderBy(col("price").desc); val indexedGroupDF = orderDF.withColumn("r", row_number().over(window)).where(col("r") <= 2); indexedGroupDF.registerTempTable("grouped"); val resultDF = sqlContext.sql("select p.name, r.item, r.price from people p INNER JOIN grouped r ON p.id = r.id"); resultDF.show(); |
Zeppelin output:
import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions._ defined class people defined class order peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gendre: string, country: string] orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint] window: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@5d181d1e indexedGroupDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint, r: int] resultDF: org.apache.spark.sql.DataFrame = [name: string, item: string, price: bigint] +------+----------+-----+ | name| item|price| +------+----------+-----+ | Jerry| food| 3000| | Jerry| pizza| 2500| | Cathy| vine|40000| | Cathy| drink|24000| |Teresa|white-wine| 4000| |Teresa| pizza| 50| | Rut| pizza| 4000| | Rut| food| 3000| |Roasie| bread|50000| |Roasie| cola| 4000| +------+----------+-----+ |
- 視窗函式按人員ID分割槽訂單,從而導致建立組。我們按價格下降了這些組。
- 透過row_number()函式,我們索引了組中的每一行。
- 在where函式中,我們僅限制組中前兩行的資料幀。
- 最後的結果DF只是以請求的格式列印。
點選標題見原文!
相關文章
- 【技術乾貨】程式碼示例:使用 Apache Spark 連線 TDengineApacheSpark
- SQL語句中不同的連線JOIN及SQL中join的各種用法SQL
- Apache Spark Dataframe Join語法教程ApacheSpark
- 【SQL】13 SQL 別名、SQL 連線(JOIN)、SQL INNER JOIN 關鍵字、SQL LEFT JOIN 關鍵字、SQL RIGHT JOIN 關鍵字、SQL FULL OUTER JSQL
- Spark SQL如何選擇join策略SparkSQL
- mysql INNER JOIN、LEFT JOIN、RIGHT JOIN;內連線(等值連線)、左連線、右連線MySql
- sql 連線查詢例項(left join)三表連線查詢SQL
- [資料庫][SQL]圖解各種連線join資料庫SQL圖解
- LEFT JOIN 和JOIN 多表連線
- 雲原生技術學習路線圖 初階+中階+高階
- 【技術乾貨】程式碼示例:使用 Apache Flink 連線 TDengineApache
- Spark SQL中出現 CROSS JOIN 問題解決SparkSQLROS
- Apache Kyuubi 助力 CDH 解鎖 Spark SQLApacheSparkSQL
- 微課sql最佳化(15)、表的連線方法(4)-關於Hash Join(雜湊連線)SQL
- 微課sql最佳化(16)、表的連線方法(5)-關於Merge Join(排序合連線)SQL排序
- SQL SERVER——高可用技術概述SQLServer
- 七、Spring MVC高階技術SpringMVC
- 物聯網6類技術無線連線技術的分析
- 教程:Apache Spark SQL入門及實踐指南!ApacheSparkSQL
- PostgreSQL:高階 SQLSQL
- 提升編碼技能的 幾 種高階技術
- MACD技術的高階應用--MACD與波浪Mac
- 多表連線的三種方式詳解 hash join、merge join、 nested loopOOP
- 深入淺出Spark JoinSpark
- sql的left join 、right join 、inner join之間的區別SQL
- Spark SQL:4.對Spark SQL的理解SparkSQL
- Flume+Spark+Hive+Spark SQL離線分析系統SparkHiveSQL
- SQL高階查詢SQL
- Spark之HiveSupport連線(spark-shell和IDEA)SparkHiveIdea
- spark效能調優指南高階篇Spark
- Spark SQL三種join和資料傾斜的產生和解決辦法SparkSQL
- SQL的連線型別SQL型別
- T-SQL——關於Join on的的連線條件和where的篩選條件的區分SQL
- WebGL程式設計指南(8)高階技術Web程式設計
- 研發高階能力之「技術規劃」
- sql 內連線和外連線SQL
- sql:left join和join區別SQL
- Apache Flink核心技術Apache