Apache Spark DataFrame Join Syntax Tutorial
Always consider converting your RDDs to DataFrames, performing the requested operations there, and saving the results... it is by far the biggest time-saver. So let's dive into some DataFrame join testing.
The worker data (workers.txt):

1,Jerry,man,USA
2,Cathy,female,GBR
3,Teresa,female,GBR
4,Rut,female,USA
5,Roasie,female,AUS
6,Garry,man,GBR
7,Adam,man,GER
8,John,man,GBR
9,Jerremy,man,AUS
10,Angela,female,AUS
11,Ivanka,female,USA
12,Melania,female,USA
Each worker has a workplace (workplaces.txt); note that worker 12 has no entry here:

1,developer
2,qa
3,product_manager
4,qa
5,qa
6,developer
7,team_leader
8,developer
9,team_leader
10,qa
11,developer

Let's prepare the DataFrames:
import org.apache.spark.sql.functions._
// Needed for toDF() in a standalone application;
// the spark-shell imports these implicits automatically.
import sqlContext.implicits._

case class worker(id: Int, name: String, sex: String, country: String)
case class workplace(id: Int, department: String)

// Load the raw text files and map each CSV line onto a case class
val workers = sc.textFile("/tests/workers.txt")
val workplaces = sc.textFile("/tests/workplaces.txt")

val workerRDD = workers.map { line =>
  val cols = line.split(",")
  worker(cols(0).toInt, cols(1), cols(2), cols(3))
}
val workerDF = workerRDD.toDF()

val workplacesRDD = workplaces.map { line =>
  val cols = line.split(",")
  workplace(cols(0).toInt, cols(1))
}
val workplacesDF = workplacesRDD.toDF()
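Before joining, a quick sanity check that the case classes were picked up as expected can save some head-scratching; a minimal check, assuming the same session as above:

// Inspect the inferred schemas before joining
workerDF.printSchema()      // id, name, sex, country
workplacesDF.printSchema()  // id, department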
Task 1. Inner join with a join expression
The first possible solution is the join function with a join expression:
val allDF = workerDF.join(workplacesDF, workerDF("id") === workplacesDF("id"))
allDF.show()
The output will be:
+---+-------+------+-------+---+---------------+
| id|   name|   sex|country| id|     department|
+---+-------+------+-------+---+---------------+
|  1|  Jerry|   man|    USA|  1|      developer|
|  2|  Cathy|female|    GBR|  2|             qa|
|  3| Teresa|female|    GBR|  3|product_manager|
|  4|    Rut|female|    USA|  4|             qa|
|  5| Roasie|female|    AUS|  5|             qa|
|  6|  Garry|   man|    GBR|  6|      developer|
|  7|   Adam|   man|    GER|  7|    team_leader|
|  8|   John|   man|    GBR|  8|      developer|
|  9|Jerremy|   man|    AUS|  9|    team_leader|
| 10| Angela|female|    AUS| 10|             qa|
| 11| Ivanka|female|    USA| 11|      developer|
+---+-------+------+-------+---+---------------+
So the output is perfectly correct... but the join column is listed twice... let's see how to solve that...
Task 2. Inner join with the join column listed only once
The previous output was correct, but how do we remove the duplicated join column? Simply...
val resultDF = workerDF.join(workplacesDF, Seq("id"))
resultDF.show()
The output will be:
+---+-------+------+-------+---------------+
| id|   name|   sex|country|     department|
+---+-------+------+-------+---------------+
|  1|  Jerry|   man|    USA|      developer|
|  2|  Cathy|female|    GBR|             qa|
|  3| Teresa|female|    GBR|product_manager|
|  4|    Rut|female|    USA|             qa|
|  5| Roasie|female|    AUS|             qa|
|  6|  Garry|   man|    GBR|      developer|
|  7|   Adam|   man|    GER|    team_leader|
|  8|   John|   man|    GBR|      developer|
|  9|Jerremy|   man|    AUS|    team_leader|
| 10| Angela|female|    AUS|             qa|
| 11| Ivanka|female|    USA|      developer|
+---+-------+------+-------+---------------+
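Alternatively, you can keep the join expression from Task 1 and drop the duplicate column afterwards. A sketch, assuming a Spark version where DataFrame.drop(col: Column) is available (1.4.1+):

// Drop the duplicated join column coming from the right-hand DataFrame
val dedupedDF = workerDF.join(workplacesDF, workerDF("id") === workplacesDF("id")).
  drop(workplacesDF("id"))
dedupedDF.show()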
Task 3. Inner join with a select projection
Another solution is to use a join expression and define the output columns with a select projection...
val resultDF = workerDF.join(workplacesDF, workerDF("id") === workplacesDF("id"), "inner").
  select(workerDF("id"), workerDF("name"), workerDF("country"), workplacesDF("department"))
resultDF.show()
The output should be:
+---+-------+-------+---------------+
| id|   name|country|     department|
+---+-------+-------+---------------+
|  1|  Jerry|    USA|      developer|
|  2|  Cathy|    GBR|             qa|
|  3| Teresa|    GBR|product_manager|
|  4|    Rut|    USA|             qa|
|  5| Roasie|    AUS|             qa|
|  6|  Garry|    GBR|      developer|
|  7|   Adam|    GER|    team_leader|
|  8|   John|    GBR|      developer|
|  9|Jerremy|    AUS|    team_leader|
| 10| Angela|    AUS|             qa|
| 11| Ivanka|    USA|      developer|
+---+-------+-------+---------------+
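The projection is also a convenient place to rename columns via Column.as. A small variation of the query above (the names worker_id and dept are just illustrative):

// Rename columns while projecting
val renamedDF = workerDF.join(workplacesDF, workerDF("id") === workplacesDF("id"), "inner").
  select(workerDF("id").as("worker_id"), workerDF("name"), workplacesDF("department").as("dept"))
renamedDF.show()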
Task 4. Inner join using a SQL query
If using the functions doesn't suit you, you can use a SQL query to fetch the data:
workerDF.registerTempTable("worker")
workplacesDF.registerTempTable("workplaces")
val resultDF = sqlContext.sql("select w.sex, w.id, w.name, w.country, wp.department " +
  "from worker w LEFT JOIN workplaces wp ON w.id = wp.id")
resultDF.show()
Output:
+------+---+-------+-------+---------------+
|   sex| id|   name|country|     department|
+------+---+-------+-------+---------------+
|   man|  1|  Jerry|    USA|      developer|
|female|  2|  Cathy|    GBR|             qa|
|female|  3| Teresa|    GBR|product_manager|
|female|  4|    Rut|    USA|             qa|
|female|  5| Roasie|    AUS|             qa|
|   man|  6|  Garry|    GBR|      developer|
|   man|  7|   Adam|    GER|    team_leader|
|   man|  8|   John|    GBR|      developer|
|   man|  9|Jerremy|    AUS|    team_leader|
|female| 10| Angela|    AUS|             qa|
|female| 11| Ivanka|    USA|      developer|
|female| 12|Melania|    USA|           null|
+------+---+-------+-------+---------------+
As you can see, we really performed a LEFT JOIN, because Melania, who has no department, is listed with a null.
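Note that registerTempTable and sqlContext belong to the Spark 1.x API. If you are on Spark 2.x or later, the equivalent calls look like this (a sketch, assuming the usual SparkSession named spark):

// Spark 2.x+: temp views and SparkSession replace registerTempTable/sqlContext
workerDF.createOrReplaceTempView("worker")
workplacesDF.createOrReplaceTempView("workplaces")
val resultDF = spark.sql("select w.sex, w.id, w.name, w.country, wp.department " +
  "from worker w LEFT JOIN workplaces wp ON w.id = wp.id")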
Task 5. Inner join with a filter (Scala expression)
Let's join the workers with their departments and list only the "qa" people...
val resultDF = workerDF.join(workplacesDF, Seq("id")).filter(workplacesDF("department") === "qa")
resultDF.show()
Output:
+---+------+------+-------+----------+
| id|  name|   sex|country|department|
+---+------+------+-------+----------+
|  2| Cathy|female|    GBR|        qa|
|  4|   Rut|female|    USA|        qa|
|  5|Roasie|female|    AUS|        qa|
| 10|Angela|female|    AUS|        qa|
+---+------+------+-------+----------+
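The same filter can also be folded directly into the join condition, so only matching rows are joined in the first place; a sketch (note that with a join expression the id column shows up twice again, as in Task 1):

// Push the department filter into the join condition itself
val qaJoinDF = workerDF.join(workplacesDF,
  workerDF("id") === workplacesDF("id") && workplacesDF("department") === "qa")
qaJoinDF.show()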
Task 6. Inner join with a filter (SQL expression)
Again, if you're not happy with the Scala-style filter function, you can use SQL-like syntax:
val resultDF = workerDF.join(workplacesDF, Seq("id")).filter("department = 'qa'")
resultDF.show()
Somehow I prefer the Scala syntax, but the output is the same once again:
+---+------+------+-------+----------+
| id|  name|   sex|country|department|
+---+------+------+-------+----------+
|  2| Cathy|female|    GBR|        qa|
|  4|   Rut|female|    USA|        qa|
|  5|Roasie|female|    AUS|        qa|
| 10|Angela|female|    AUS|        qa|
+---+------+------+-------+----------+
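For the record, where is just an alias of filter, so the following is equivalent; pick whichever reads better to you:

// where() accepts the same SQL-like condition string as filter()
val whereDF = workerDF.join(workplacesDF, Seq("id")).where("department = 'qa'")
whereDF.show()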
Task 7. Show the workers without a department
To get the workers without a department, we need to perform a LEFT JOIN and keep only the rows with a NULL department:
val resultDF = workerDF.join(workplacesDF, Seq("id"), "left").filter(workplacesDF("department").isNull)
resultDF.show()
Output:
+---+-------+------+-------+----------+
| id|   name|   sex|country|department|
+---+-------+------+-------+----------+
| 12|Melania|female|    USA|      null|
+---+-------+------+-------+----------+
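If you are on Spark 2.0 or newer, the same result can be expressed with the dedicated left_anti join type, which keeps only the rows on the left that have no match on the right (a sketch; note it returns only the columns of the left DataFrame):

// left_anti: rows from workerDF with no matching id in workplacesDF
val noDeptDF = workerDF.join(workplacesDF, Seq("id"), "left_anti")
noDeptDF.show()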
Task 8. Add a column named "isQA" with value "1" when the worker is in "qa"
To solve this task we will use the withColumn function with a when clause:
val resultDF = workerDF.join(workplacesDF, Seq("id")).
  withColumn("isQA", when(workplacesDF("department") === "qa", lit("1")).otherwise(lit("0")))
resultDF.show()
Output:
+---+-------+------+-------+---------------+----+
| id|   name|   sex|country|     department|isQA|
+---+-------+------+-------+---------------+----+
|  1|  Jerry|   man|    USA|      developer|   0|
|  2|  Cathy|female|    GBR|             qa|   1|
|  3| Teresa|female|    GBR|product_manager|   0|
|  4|    Rut|female|    USA|             qa|   1|
|  5| Roasie|female|    AUS|             qa|   1|
|  6|  Garry|   man|    GBR|      developer|   0|
|  7|   Adam|   man|    GER|    team_leader|   0|
|  8|   John|   man|    GBR|      developer|   0|
|  9|Jerremy|   man|    AUS|    team_leader|   0|
| 10| Angela|female|    AUS|             qa|   1|
| 11| Ivanka|female|    USA|      developer|   0|
+---+-------+------+-------+---------------+----+
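The same flag can be produced in the SQL flavour with a CASE WHEN expression; a sketch using the temp tables registered in Task 4:

// CASE WHEN is the SQL counterpart of when(...).otherwise(...)
val resultSqlDF = sqlContext.sql(
  "select w.*, wp.department, " +
  "case when wp.department = 'qa' then '1' else '0' end as isQA " +
  "from worker w INNER JOIN workplaces wp ON w.id = wp.id")
resultSqlDF.show()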
Task 9. Which department has the most workers (Scala syntax)
This is the classic simple SQL task, and the Spark solution looks similar. We need to group the rows by "department", count each group, and then sort by that count.
val resultDF = workerDF.join(workplacesDF, Seq("id")).
  groupBy("department").agg(count("*").as("dep_workers")).orderBy(desc("dep_workers"))
resultDF.limit(1).show()
Now the output should be:
+----------+-----------+
|department|dep_workers|
+----------+-----------+
| developer|          4|
+----------+-----------+
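A slightly shorter equivalent uses the built-in count() on the grouped data, which produces a column literally named "count" (a sketch):

// groupBy().count() spares the explicit agg(...) call
val topDeptDF = workerDF.join(workplacesDF, Seq("id")).
  groupBy("department").count().orderBy(desc("count"))
topDeptDF.limit(1).show()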
Task 10. Which department has the most workers (SQL syntax)
If you are a SQL-savvy developer, the solution hardly needs a comment; just note the ORDER BY, without which limit(1) would pick an arbitrary group instead of the largest one...
val resultDF = sqlContext.sql("select wp.department, count(*) as dep_workers " +
  "from worker w INNER JOIN workplaces wp ON w.id = wp.id " +
  "GROUP BY wp.department ORDER BY dep_workers DESC")
resultDF.limit(1).show()
Output:
+----------+-----------+
|department|dep_workers|
+----------+-----------+
| developer|          4|
+----------+-----------+
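The limit can also live in the SQL statement itself, removing the need for limit(1) on the DataFrame (Spark SQL supports LIMIT):

// ORDER BY ... LIMIT 1 entirely inside the SQL
val topDF = sqlContext.sql("select wp.department, count(*) as dep_workers " +
  "from worker w INNER JOIN workplaces wp ON w.id = wp.id " +
  "GROUP BY wp.department ORDER BY dep_workers DESC LIMIT 1")
topDF.show()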