Advanced Join Techniques in Apache Spark SQL

Posted by banq on 2019-01-08

Let's use the following sample input data:
people.txt -> saved in HDFS under /tests/people.txt

1,Jerry,man,USA
2,Cathy,female,GBR
3,Teresa,female,GBR
4,Rut,female,USA
5,Roasie,female,AUS
6,Tomas,man,CZE


people's orders -> saved in HDFS under /tests/orders.txt

1,pizza,2500
1,beer,50
1,food,3000
2,pizza,3000
2,drink,24000
2,vine,40000
3,pizza,50
3,white-wine,4000
3,bread,20
4,pizza,4000
4,food,3000
4,vodka,30
5,cola,4000
5,bread,50000
5,sangria,300


Let's practice some Apache Spark SQL.

1. Get the maximum row from each group.
Task: for each row in people.txt, get that person's most expensive order. Result format: name, country, price.

case class people(id: Int, name: String, gender: String, country: String);

case class order(id: Int, item: String, price: Long)

// Parse each line of people.txt (id,name,gender,country) into a people row
val peopleDF = sc.textFile("/tests/people.txt").map(
    line=>new people(line.split(",")(0).toInt, 
                     line.split(",")(1),
                     line.split(",")(2),
                     line.split(",")(3))).toDF();

// Parse each line of orders.txt (id,item,price) into an order row
val orderDF = sc.textFile("/tests/orders.txt").map(
        line=>new order(line.split(",")(0).toInt,
                    line.split(",")(1),
                    line.split(",")(2).toLong)
            ).toDF();

// Register both DataFrames as temp tables so they can be queried with SQL
// (in Spark 2.0+ the equivalent call is createOrReplaceTempView)
peopleDF.registerTempTable("people");
orderDF.registerTempTable("order");

// Join each person against a subquery that computes the maximum order price per id
val resultDF = sqlContext.sql(" select p.name, p.country, gr.max_p FROM people p INNER JOIN "+
                          " (select o.id, max(o.price) as max_p FROM order o GROUP BY o.id) gr ON gr.id = p.id "
);

resultDF.show();


Zeppelin output:

defined class people
defined class order
peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gender: string, country: string]
orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint]
resultDF: org.apache.spark.sql.DataFrame = [name: string, country: string, max_p: bigint]
+------+-------+-----+
|  name|country|max_p|
+------+-------+-----+
| Jerry|    USA| 3000|
| Cathy|    GBR|40000|
|Teresa|    GBR| 4000|
|   Rut|    USA| 4000|
|Roasie|    AUS|50000|
+------+-------+-----+


This one is simple, because we don't need any additional columns from the order row to get the maximum price per person, so a plain GROUP BY subquery is enough.
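
For comparison, the same result can be produced with the DataFrame API instead of SQL. A minimal sketch reusing the peopleDF and orderDF defined above; the maxOrderDF and resultDF2 names are my own:

import org.apache.spark.sql.functions._

// Compute the maximum order price per customer id...
val maxOrderDF = orderDF.groupBy("id").agg(max("price").as("max_p"))

// ...then join it back to people to pick up name and country
val resultDF2 = peopleDF.join(maxOrderDF, "id").select("name", "country", "max_p")

resultDF2.show()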

2. LEFT JOIN test
Task: get all rows from people.txt that have no orders.

case class people(id: Int, name: String, gender: String, country: String);

case class order(id: Int, item: String, price: Long)

val peopleDF = sc.textFile("/tests/people.txt").map(
    line=>new people(line.split(",")(0).toInt, 
                     line.split(",")(1),
                     line.split(",")(2),
                     line.split(",")(3))).toDF();

val orderDF = sc.textFile("/tests/orders.txt").map(
    line=>new order(line.split(",")(0).toInt,
                    line.split(",")(1),
                    line.split(",")(2).toLong)
            ).toDF();

peopleDF.registerTempTable("people");
orderDF.registerTempTable("order");

// Keep only the people rows with no matching order: for them the LEFT JOIN
// fills the right side with nulls, which the WHERE clause selects
val resultDF = sqlContext.sql(" select p.name, p.country from people p LEFT JOIN order o ON p.id = o.id WHERE o.id is null ");

resultDF.show();


Zeppelin output:

defined class people
defined class order
peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gender: string, country: string]
orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint]
resultDF: org.apache.spark.sql.DataFrame = [name: string, country: string]
+-----+-------+
| name|country|
+-----+-------+
|Tomas|    CZE|
+-----+-------+

Classic LEFT JOIN stuff: take everything from the left table, even the rows with no match in the right table, and then keep exactly those unmatched rows with the o.id IS NULL filter.
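
In Spark 2.0+ the same anti-join can be expressed directly with the DataFrame API's "left_anti" join type, which makes the IS NULL filter implicit. A minimal sketch reusing the DataFrames defined above; the noOrdersDF name is my own:

// A left anti join keeps only the people rows that have no matching order id
val noOrdersDF = peopleDF.join(orderDF, peopleDF("id") === orderDF("id"), "left_anti")
                         .select("name", "country")

noOrdersDF.show()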

3. Get the top N rows from each group

Now the fun begins!
Task: for each row in people.txt, give me that person's two most expensive orders.

import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.functions._;

case class people(id: Int, name: String, gender: String, country: String);

case class order(id: Int, item: String, price: Long)

val peopleDF = sc.textFile("/tests/people.txt").map(
    line=>new people(line.split(",")(0).toInt, 
                     line.split(",")(1),
                     line.split(",")(2),
                     line.split(",")(3))).toDF();

val orderDF = sc.textFile("/tests/orders.txt").map(
    line=>new order(line.split(",")(0).toInt,
                    line.split(",")(1),
                    line.split(",")(2).toLong)
            ).toDF();

peopleDF.registerTempTable("people");
orderDF.registerTempTable("order");

// Define a window that partitions the orders by customer id and sorts each partition by price, descending
val window = Window.partitionBy("id").orderBy(col("price").desc);

// Number the rows inside each partition and keep only the two most expensive orders per person
val indexedGroupDF = orderDF.withColumn("r", row_number().over(window)).where(col("r") <= 2);

indexedGroupDF.registerTempTable("grouped");

// Join the top-2 orders back to the people to attach the names
val resultDF = sqlContext.sql("select p.name, r.item, r.price from people p INNER JOIN grouped r ON p.id = r.id");

resultDF.show();


Zeppelin output:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
defined class people
defined class order
peopleDF: org.apache.spark.sql.DataFrame = [id: int, name: string, gender: string, country: string]
orderDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint]
window: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@5d181d1e
indexedGroupDF: org.apache.spark.sql.DataFrame = [id: int, item: string, price: bigint, r: int]
resultDF: org.apache.spark.sql.DataFrame = [name: string, item: string, price: bigint]
+------+----------+-----+
|  name|      item|price|
+------+----------+-----+
| Jerry|      food| 3000|
| Jerry|     pizza| 2500|
| Cathy|      vine|40000|
| Cathy|     drink|24000|
|Teresa|white-wine| 4000|
|Teresa|     pizza|   50|
|   Rut|     pizza| 4000|
|   Rut|      food| 3000|
|Roasie|     bread|50000|
|Roasie|      cola| 4000|
+------+----------+-----+

  • The window function partitions the orders by person id, which creates the groups. We then order each group by price, descending.
  • With the row_number() function we index each row within its group.
  • In the where function we restrict the DataFrame to the first two rows of each group.
  • The final result DF just prints the data in the requested format. The whole pipeline can also be written as a single SQL statement; see the sketch below.
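
For completeness, the same top-N-per-group logic can be pushed entirely into SQL, since Spark SQL also accepts row_number() as a window function inside a query (natively since Spark 2.0). A minimal sketch against the temp tables registered above; the t and r aliases are my own:

// Rank each person's orders by price inside the subquery,
// then keep only the two most expensive ones per person
val resultSqlDF = sqlContext.sql(
  """select p.name, t.item, t.price FROM people p INNER JOIN
    |  (select o.id, o.item, o.price,
    |          row_number() OVER (PARTITION BY o.id ORDER BY o.price DESC) as r
    |   FROM order o) t ON p.id = t.id
    |WHERE t.r <= 2""".stripMargin)

resultSqlDF.show()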
