Advanced Join Techniques in Apache Spark SQL
Let's start with the following sample input data:
people.txt -> saved in HDFS under /tests/people.txt
1,Jerry,man,USA
2,Cathy,female,GBR
3,Teresa,female,GBR
4,Rut,female,USA
5,Roasie,female,AUS
6,Tomas,man,CZE
orders.txt (people's orders) -> saved in HDFS under /tests/orders.txt
1,pizza,2500
1,beer,50
1,food,3000
2,pizza,3000
2,drink,24000
2,vine,40000
3,pizza,50
3,white-wine,4000
3,bread,20
4,pizza,4000
4,food,3000
4,vodka,30
5,cola,4000
5,bread,50000
5,sangria,300
Now let's exercise some Apache Spark SQL.
1. Get the maximum row from a group
Task: for each row in people.txt, get its most expensive order. Result format: name, country, price.
case class people(id: Int, name: String, gender: String, country: String)
case class order(id: Int, item: String, price: Long)

val peopleDF = sc.textFile("/tests/people.txt")
  .map(line => new people(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2), line.split(",")(3)))
  .toDF()
val orderDF = sc.textFile("/tests/orders.txt")
  .map(line => new order(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2).toLong))
  .toDF()

peopleDF.registerTempTable("people")
orderDF.registerTempTable("order")

val resultDF = sqlContext.sql(
  "select p.name, p.country, gr.max_p FROM people p INNER JOIN " +
  "(select o.id, max(o.price) as max_p FROM order o GROUP BY o.id) gr ON gr.id = p.id")
resultDF.show()
Zeppelin output:
defined class people
defined class order
peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gender: string, country: string]
orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint]
resultDF: org.apache.spark.sql.DataFrame = [name: string, country: string, max_p: bigint]
+------+-------+-----+
|  name|country|max_p|
+------+-------+-----+
| Jerry|    USA| 3000|
| Cathy|    GBR|40000|
|Teresa|    GBR| 4000|
|   Rut|    USA| 4000|
|Roasie|    AUS|50000|
+------+-------+-----+
This one was easy, because we didn't need any additional data to find the maximum-price order.
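By the way, the same result can be produced without registering temp tables at all. A minimal sketch using the DataFrame API instead of SQL, assuming the peopleDF and orderDF defined above (the max_p alias is my own naming):

import org.apache.spark.sql.functions._

// Aggregate the maximum price per person id, then join back to people.
val maxPerIdDF = orderDF.groupBy("id").agg(max("price").as("max_p"))
val resultDF = peopleDF.join(maxPerIdDF, "id").select("name", "country", "max_p")
resultDF.show()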
2. A LEFT JOIN test
Task: get all rows from people.txt that have no orders.
case class people(id: Int, name: String, gendre: String, country: String)
case class order(id: Int, item: String, price: Long)

val peopleDF = sc.textFile("/tests/people.txt")
  .map(line => new people(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2), line.split(",")(3)))
  .toDF()
val orderDF = sc.textFile("/tests/orders.txt")
  .map(line => new order(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2).toLong))
  .toDF()

peopleDF.registerTempTable("people")
orderDF.registerTempTable("order")

val resultDF = sqlContext.sql(
  "select p.name, p.country from people p LEFT JOIN order o ON p.id = o.id WHERE o.id is null")
resultDF.show()
Zeppelin output:
defined class people
defined class order
peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gendre: string, country: string]
orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint]
resultDF: org.apache.spark.sql.DataFrame = [name: string, country: string]
+-----+-------+
| name|country|
+-----+-------+
|Tomas|    CZE|
+-----+-------+
Classic LEFT JOIN stuff: get everything from the left table that has no matching reference in the right table.
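As a side note, Spark 2.0+ can express this "keep rows with no match" pattern directly as a left anti join, which avoids the explicit null check. A minimal sketch, assuming the same peopleDF and orderDF as above:

// A left anti join keeps only the people rows whose id has no matching order.
val noOrdersDF = peopleDF.join(orderDF, Seq("id"), "left_anti").select("name", "country")
noOrdersDF.show()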
3. Get the top N rows from a group
Now the fun begins!
Task: for each row in people.txt, give me its two most expensive orders.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

case class people(id: Int, name: String, gendre: String, country: String)
case class order(id: Int, item: String, price: Long)

val peopleDF = sc.textFile("/tests/people.txt")
  .map(line => new people(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2), line.split(",")(3)))
  .toDF()
val orderDF = sc.textFile("/tests/orders.txt")
  .map(line => new order(line.split(",")(0).toInt, line.split(",")(1), line.split(",")(2).toLong))
  .toDF()

peopleDF.registerTempTable("people")
orderDF.registerTempTable("order")

val window = Window.partitionBy("id").orderBy(col("price").desc)
val indexedGroupDF = orderDF.withColumn("r", row_number().over(window)).where(col("r") <= 2)
indexedGroupDF.registerTempTable("grouped")

val resultDF = sqlContext.sql("select p.name, r.item, r.price from people p INNER JOIN grouped r ON p.id = r.id")
resultDF.show()
Zeppelin output:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
defined class people
defined class order
peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gendre: string, country: string]
orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint]
window: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@5d181d1e
indexedGroupDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint, r: int]
resultDF: org.apache.spark.sql.DataFrame = [name: string, item: string, price: bigint]
+------+----------+-----+
|  name|      item|price|
+------+----------+-----+
| Jerry|      food| 3000|
| Jerry|     pizza| 2500|
| Cathy|      vine|40000|
| Cathy|     drink|24000|
|Teresa|white-wine| 4000|
|Teresa|     pizza|   50|
|   Rut|     pizza| 4000|
|   Rut|      food| 3000|
|Roasie|     bread|50000|
|Roasie|      cola| 4000|
+------+----------+-----+
- The window function partitions the orders by person id, which creates the groups. We then order each group by price, descending.
- With the row_number() function we index every row within its group.
- In the where function we restrict the DataFrame to the first two rows of each group (see the SQL sketch after this list).
- The final result DF is then simply printed in the requested format.
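The same top-N-per-group query can also be written as a single SQL statement with row_number() as a window function. A sketch of an equivalent query, assuming the same people and order temp tables as above (in Spark 1.x, window functions in SQL require a HiveContext-backed sqlContext):

// Rank each person's orders by price descending and keep the top two.
val topTwoDF = sqlContext.sql(
  "SELECT t.name, t.item, t.price FROM (" +
  " SELECT p.name, o.item, o.price," +
  " row_number() OVER (PARTITION BY o.id ORDER BY o.price DESC) AS r" +
  " FROM people p INNER JOIN order o ON p.id = o.id) t" +
  " WHERE t.r <= 2")
topTwoDF.show()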