Apache Spark Dataframe Join語法教程

banq發表於2019-01-08

始終考慮將RDD轉換為Dataframe資料幀，執行請求的操作並儲存結果......這是節省時間最多的方法。那麼讓我們深入研究Dataframe資料幀測試

資料workers：

1,Jerry,man,USA
2,Cathy,female,GBR
3,Teresa,female,GBR
4,Rut,female,USA
5,Roasie,female,AUS
6,Garry,man,GBR
7,Adam,man,GER
8,John,man,GBR
9,Jerremy,man,AUS
10,Angela,female,AUS
11,Ivanka,female,USA
12,Melania,female,USA

每個工人workers都有工作場所workplace ：

1,developer
2,qa
3,product_manager
4,qa
5,qa
6,developer
7,team_leader
8,developer
9,team_leader
10,qa
11,developer
讓我們準備

dataframes：

import org.apache.spark.sql.functions._;

case class worker(id: Int, name: String, sex: String, country: String);

case class workplace (id: Int, department: String);

val workers = sc.textFile("/tests/workers.txt");

val workplaces = sc.textFile("/tests/workplaces.txt");

val workerRDD = workers.map(line=>new worker(line.split(",")(0).toInt,
                                  line.split(",")(1),
                                  line.split(",")(2),
                                  line.split(",")(3)));
val workerDF = workerRDD.toDF();

val workplacesRDD = workplaces.map(line=>new workplace(line.split(",")(0).toInt,
                                                       line.split(",")(1)));

val workplacesDF = workplacesRDD.toDF();

任務1.內聯接表示式
第一種可能的解決方案是使用join 表示式的join 函式：

val allDF = workerDF.join（workplacesDF，workerDF（“id”）=== workplacesDF（“id”））; 
allDF.show（）;

輸出將是：

+---+-------+------+-------+---+---------------+
| id|   name|   sex|country| id|     department|
+---+-------+------+-------+---+---------------+
|  1|  Jerry|   man|    USA|  1|      developer|
|  2|  Cathy|female|    GBR|  2|             qa|
|  3| Teresa|female|    GBR|  3|product_manager|
|  4|    Rut|female|    USA|  4|             qa|
|  5| Roasie|female|    AUS|  5|             qa|
|  6|  Garry|   man|    GBR|  6|      developer|
|  7|   Adam|   man|    GER|  7|    team_leader|
|  8|   John|   man|    GBR|  8|      developer|
|  9|Jerremy|   man|    AUS|  9|    team_leader|
| 10| Angela|female|    AUS| 10|             qa|
| 11| Ivanka|female|    USA| 11|      developer|
+---+-------+------+-------+---+---------------+

那麼輸出是完全正確的...... 但是有兩次列出的join 列 ...... 讓我們看看如何解決這個問題....

任務2.僅列出一次join 列的內部聯接
以前的輸出是正確的，但如何刪除重複連線列？只是...

val resultDF = workerDF.join(workplacesDF, Seq("id"));
resultDF.show();

輸出將是：

+---+-------+------+-------+---------------+
| id|   name|   sex|country|     department|
+---+-------+------+-------+---------------+
|  1|  Jerry|   man|    USA|      developer|
|  2|  Cathy|female|    GBR|             qa|
|  3| Teresa|female|    GBR|product_manager|
|  4|    Rut|female|    USA|             qa|
|  5| Roasie|female|    AUS|             qa|
|  6|  Garry|   man|    GBR|      developer|
|  7|   Adam|   man|    GER|    team_leader|
|  8|   John|   man|    GBR|      developer|
|  9|Jerremy|   man|    AUS|    team_leader|
| 10| Angela|female|    AUS|             qa|
| 11| Ivanka|female|    USA|      developer|
+---+-------+------+-------+---------------+

任務3.使用select projection的內部連線
另一種解決方案是使用連線表示式並使用select projection確定輸出列...

val resultDF = workerDF.join(workplacesDF, workerDF("id")===workplacesDF("id"), "inner").
    select(workerDF("id"), workerDF("name"), workerDF("country"), workplacesDF("department"));

resultDF.show();

輸出應該是：

+---+-------+-------+---------------+
| id|   name|country|     department|
+---+-------+-------+---------------+
|  1|  Jerry|    USA|      developer|
|  2|  Cathy|    GBR|             qa|
|  3| Teresa|    GBR|product_manager|
|  4|    Rut|    USA|             qa|
|  5| Roasie|    AUS|             qa|
|  6|  Garry|    GBR|      developer|
|  7|   Adam|    GER|    team_leader|
|  8|   John|    GBR|      developer|
|  9|Jerremy|    AUS|    team_leader|
| 10| Angela|    AUS|             qa|
| 11| Ivanka|    USA|      developer|
+---+-------+-------+---------------+

任務4.使用SQL查詢的內連線
如果使用函式不合適，可以使用SQL查詢來獲取資料：

workerDF.registerTempTable("worker");
workplacesDF.registerTempTable("workplaces");

val resultDF = sqlContext.sql("select w.sex, w.id, w.name, w.country, wp.department from worker w LEFT JOIN workplaces wp ON w.id = wp.id");

輸出：

+------+---+-------+-------+---------------+
|   sex| id|   name|country|     department|
+------+---+-------+-------+---------------+
|   man|  1|  Jerry|    USA|      developer|
|female|  2|  Cathy|    GBR|             qa|
|female|  3| Teresa|    GBR|product_manager|
|female|  4|    Rut|    USA|             qa|
|female|  5| Roasie|    AUS|             qa|
|   man|  6|  Garry|    GBR|      developer|
|   man|  7|   Adam|    GER|    team_leader|
|   man|  8|   John|    GBR|      developer|
|   man|  9|Jerremy|    AUS|    team_leader|
|female| 10| Angela|    AUS|             qa|
|female| 11| Ivanka|    USA|      developer|
|female| 12|Melania|    USA|           null|
+------+---+-------+-------+---------------+

正如你所看到的，我們真的執行了LEFT JOIN，因為Melania沒有部門。

任務5.內部聯接和過濾功能（Scala表示式）
讓我們聯絡工人和他們的部門，並列出只有“QA”的人......

val resultDF = workerDF.join（workplacesDF，Seq（“id”））。filter（workplacesDF（“department”）===“qa”）; 
resultDF.show（）;

輸出：

+---+------+------+-------+----------+
| id|  name|   sex|country|department|
+---+------+------+-------+----------+
|  2| Cathy|female|    GBR|        qa|
|  4|   Rut|female|    USA|        qa|
|  5|Roasie|female|    AUS|        qa|
| 10|Angela|female|    AUS|        qa|
+---+------+------+-------+----------+

任務6.內部聯接和過濾功能（SQL表示式）
同樣，如果您對Scala語法的過濾函式不滿意，可以使用類似SQL的語法：

val resultDF = workerDF.join(workplacesDF, Seq("id")).filter("department = 'qa'");
resultDF.show();

不知何故，我更喜歡scala語法，但輸出再次相同：

+---+------+------+-------+----------+
| id|  name|   sex|country|department|
+---+------+------+-------+----------+
|  2| Cathy|female|    GBR|        qa|
|  4|   Rut|female|    USA|        qa|
|  5|Roasie|female|    AUS|        qa|
| 10|Angela|female|    AUS|        qa|
+---+------+------+-------+----------+

任務7.顯示沒有部門的工人

為了獲得沒有部門的工人，我們需要執行LEFT JOIN並僅過濾具有NULL部門的行：

val resultDF = workerDF.join(workplacesDF, Seq("id"), "left").filter(workplacesDF("department").isNull);
resultDF.show();

輸出：

+---+-------+------+-------+----------+
| id|   name|   sex|country|department|
+---+-------+------+-------+----------+
| 12|Melania|female|    USA|      null|
+---+-------+------+-------+----------+

任務8.如果worker是“qa”，新增名為“isQA”的列為“1”
要解決這個任務，我們將使用帶有when子句的withColumn函式：

val resultDF = workerDF.join(workplacesDF, Seq("id")).withColumn("isQA", 
when(workplacesDF("department")==="qa", lit("1")).otherwise(lit("0")));

resultDF.show();

輸出：

+---+-------+------+-------+---------------+----+
| id|   name|   sex|country|     department|isQA|
+---+-------+------+-------+---------------+----+
|  1|  Jerry|   man|    USA|      developer|   0|
|  2|  Cathy|female|    GBR|             qa|   1|
|  3| Teresa|female|    GBR|product_manager|   0|
|  4|    Rut|female|    USA|             qa|   1|
|  5| Roasie|female|    AUS|             qa|   1|
|  6|  Garry|   man|    GBR|      developer|   0|
|  7|   Adam|   man|    GER|    team_leader|   0|
|  8|   John|   man|    GBR|      developer|   0|
|  9|Jerremy|   man|    AUS|    team_leader|   0|
| 10| Angela|female|    AUS|             qa|   1|
| 11| Ivanka|female|    USA|      developer|   0|
+---+-------+------+-------+---------------+----+

任務9.哪個部門的工作人員最多（scala語法）
這是經典的簡單SQL任務，Spark解決方案看起來很相似。我們需要按“部門”對行進行分組，然後對該組執行計數，然後進行排序。

val resultDF = workerDF.join(workplacesDF, Seq("id")).
groupBy("department").agg(count("*").as("dep_workers")).orderBy(desc("dep_workers"));

resultDF.limit(1).show();

現在輸出應該是：

+----------+-----------+
|department|dep_workers|
+----------+-----------+
| developer|          4|
+----------+-----------+

任務10.哪個部門擁有最多的工作人員（SQL語法）
如果你是SQL熟練的開發人員，那麼就沒有必要評論這個解決方案......

val resultDF = sqlContext.sql("select wp.department, count(*) as dep_workers from 
worker w INNER JOIN workplaces wp ON w.id = wp.id GROUP BY wp.department");

resultDF.limit(1).show();

輸出：

+----------+-----------+
|department|dep_workers|
+----------+-----------+
| developer|       4|
+----------+-----------+

MySQL Join語法
2018-07-11
MySql
Apache Spark SQL的高階Join連線技術
2019-01-08
ApacheSparkSQL
Spark建立空的DataFrame
2021-09-09
Spark
Spark流教程：使用 Apache Spark 的Twitter情緒分析
2021-12-29
SparkApache
Spark API 全集(1):Spark SQL Dataset & DataFrame API
2018-12-09
SparkAPISQL
數倉工具—Hive語法之map join、reduce join、smb join(8)
2021-01-03
Hive
Spark DataFrame的groupBy vs groupByKey
2018-11-04
Spark
SparkSQL /DataFrame /Spark RDD誰快？
2020-08-15
SparkSQL
mysql update join,insert select 語法
2018-03-16
MySql
Spark SQL學習——DataFrame和DataSet
2019-04-04
SparkSQL
深入淺出Spark Join
2020-11-28
Spark
教程：Apache Spark SQL入門及實踐指南！
2018-09-12
ApacheSparkSQL
15、Spark Sql（一），生成DataFrame的方式
2018-03-04
SparkSQL
tomaztk/Spark-for-data-engineers：面向資料工程師的Apache Spark學習教程
2021-12-22
Spark工程師Apache
Spark SQL，如何將 DataFrame 轉為 json 格式
2018-12-06
SparkSQLJSON
Spark SQL中的RDD與DataFrame轉換
2019-08-12
SparkSQL
scala基礎語法-----Spark基礎
2020-09-29
Spark
Spark SQL如何選擇join策略
2021-01-29
SparkSQL
PySpark DataFrame教程與演示
2024-03-08
Spark
less 語法教程
2018-03-06
Apache Spark Day3
2020-10-30
ApacheSpark
MarkDown語法教程(轉)
2024-08-10
Inner join 寫法
2019-07-25
Spark SQL三種join和資料傾斜的產生和解決辦法
2022-02-28
SparkSQL
Apache Spark有哪些侷限性
2020-06-11
ApacheSpark
Spark SQL中出現 CROSS JOIN 問題解決
2019-10-13
SparkSQLROS
大資料開發-Spark Join原理詳解
2021-02-09
大資料Spark
Apache Spark和Hive有用的功能
2019-01-08
ApacheSparkHive
使用Apache Spark和Apache Hudi構建分析資料湖
2020-06-15
ApacheSpark
實用的 Markdown 語法教程
2019-03-16
aardio教程二) 進階語法
2024-03-17
MySQL-基礎語法教程
2024-05-07
MySql
Lua語法基礎教程（上篇）
2024-10-28
從 Spark 的 DataFrame 中取出具體某一行詳解
2021-08-26
Spark
Apache Spark 2一些使用案例
2019-01-08
ApacheSpark
快取Apache Spark RDD - 效能調優
2019-01-08
快取ApacheSpark
帶有Apache Spark的Lambda架構
2018-05-29
ApacheSpark架構
【譯】Using .NET for Apache Spark to Analyze Log Data
2020-05-11
ApacheSpark

Apache Spark Dataframe Join語法教程

相關文章