Spark2 Dataset行列操作和執行計劃

智慧先行者發表於2016-11-25

　　Dataset是一個強型別的特定領域的物件，這種物件可以函式式或者關係操作並行地轉換。每個Dataset也有一個被稱為一個DataFrame的型別化檢視，這種DataFrame是Row型別的Dataset，即Dataset[Row]
　　Dataset是“懶惰”的，只在執行行動操作時觸發計算。本質上，資料集表示一個邏輯計劃，該計劃描述了產生資料所需的計算。當執行行動操作時，Spark的查詢優化程式優化邏輯計劃，並生成一個高效的並行和分散式物理計劃。

示例資料欄位解釋

// affairs：一年來婚外情的頻率

// gender：性別

// age：年齡

// yearsmarried：婚齡

// children：是否有小孩

// religiousness：宗教信仰程度（5分制，1分表示反對，5分表示非常信仰）

// education：學歷

// occupation：職業（逆向編號的戈登7種分類）

// rating：對婚姻的自我評分（5分制，1表示非常不幸福，5表示非常幸福）

1.匯入常用的包

import scala.math._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
import org.apache.spark.sql.DataFrameReader
import org.apache.spark.sql.functions._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.DataFrameStatFunctions

2.建立SparkSession，並匯入示例資料

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

val dataList: List[(Double, String, Double, Double, String, Double, Double, Double, Double)] = List(
      (0, "male", 37, 10, "no", 3, 18, 7, 4),
      (0, "female", 27, 4, "no", 4, 14, 6, 4),
      (0, "female", 32, 15, "yes", 1, 12, 1, 4),
      (0, "male", 57, 15, "yes", 5, 18, 6, 5),
      (0, "male", 22, 0.75, "no", 2, 17, 6, 3),
      (0, "female", 32, 1.5, "no", 2, 17, 5, 5),
      (0, "female", 22, 0.75, "no", 2, 12, 1, 3),
      (0, "male", 57, 15, "yes", 2, 14, 4, 4),
      (0, "female", 32, 15, "yes", 4, 16, 1, 2),
      (0, "male", 22, 1.5, "no", 4, 14, 4, 5))

val data = dataList.toDF("affairs", "gender", "age", "yearsmarried", "children", "religiousness", "education", "occupation", "rating")

data.printSchema()
root
 |-- affairs: double (nullable = false)
 |-- gender: string (nullable = true)
 |-- age: double (nullable = false)
 |-- yearsmarried: double (nullable = false)
 |-- children: string (nullable = true)
 |-- religiousness: double (nullable = false)
 |-- education: double (nullable = false)
 |-- occupation: double (nullable = false)
 |-- rating: double (nullable = false)

3.操作指定的列和行

// 在Spark-shell中展示，前n條記錄
data.show(7)
+-------+------+----+------------+--------+-------------+---------+----------+------+
|affairs|gender| age|yearsmarried|children|religiousness|education|occupation|rating|
+-------+------+----+------------+--------+-------------+---------+----------+------+
|    0.0|  male|37.0|        10.0|      no|          3.0|     18.0|       7.0|   4.0|
|    0.0|female|27.0|         4.0|      no|          4.0|     14.0|       6.0|   4.0|
|    0.0|female|32.0|        15.0|     yes|          1.0|     12.0|       1.0|   4.0|
|    0.0|  male|57.0|        15.0|     yes|          5.0|     18.0|       6.0|   5.0|
|    0.0|  male|22.0|        0.75|      no|          2.0|     17.0|       6.0|   3.0|
|    0.0|female|32.0|         1.5|      no|          2.0|     17.0|       5.0|   5.0|
|    0.0|female|22.0|        0.75|      no|          2.0|     12.0|       1.0|   3.0|
+-------+------+----+------------+--------+-------------+---------+----------+------+
only showing top 7 rows


// 取前n條記錄
val data3=data.limit(5)


// 過濾
data.filter("age>50 and gender=='male' ").show
+-------+------+----+------------+--------+-------------+---------+----------+------+
|affairs|gender| age|yearsmarried|children|religiousness|education|occupation|rating|
+-------+------+----+------------+--------+-------------+---------+----------+------+
|    0.0|  male|57.0|        15.0|     yes|          5.0|     18.0|       6.0|   5.0|
|    0.0|  male|57.0|        15.0|     yes|          2.0|     14.0|       4.0|   4.0|
+-------+------+----+------------+--------+-------------+---------+----------+------+

// 資料框的所有列

val columnArray=data.columns
columnArray: Array[String] = Array(affairs, gender, age, yearsmarried, children, religiousness, education, occupation, rating)

// 查詢某些列的資料
data.select("gender", "age", "yearsmarried", "children").show(3)
+------+----+------------+--------+
|gender| age|yearsmarried|children|
+------+----+------------+--------+
|  male|37.0|        10.0|      no|
|female|27.0|         4.0|      no|
|female|32.0|        15.0|     yes|
+------+----+------------+--------+
only showing top 3 rows


val colArray=Array("gender", "age", "yearsmarried", "children")
colArray: Array[String] = Array(gender, age, yearsmarried, children)

data.selectExpr(colArray:_*).show(3)
+------+----+------------+--------+
|gender| age|yearsmarried|children|
+------+----+------------+--------+
|  male|37.0|        10.0|      no|
|female|27.0|         4.0|      no|
|female|32.0|        15.0|     yes|
+------+----+------------+--------+
only showing top 3 rows


// 操作指定的列,並排序
// data.selectExpr("gender", "age+1","cast(age as bigint)").orderBy($"gender".desc, $"age".asc).show
data.selectExpr("gender", "age+1 as age1","cast(age as bigint) as age2").sort($"gender".desc, $"age".asc).show
+------+----+----+
|gender|age1|age2|
+------+----+----+
|  male|23.0|  22|
|  male|23.0|  22|
|  male|38.0|  37|
|  male|58.0|  57|
|  male|58.0|  57|
|female|23.0|  22|
|female|28.0|  27|
|female|33.0|  32|
|female|33.0|  32|
|female|33.0|  32|
+------+----+----+

4.檢視SparkSQL邏輯和物理執行計劃

val data4=data.selectExpr("gender", "age+1 as age1","cast(age as bigint) as age2").sort($"gender".desc, $"age".asc)
data4: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [gender: string, age1: double ... 1 more field]

// 檢視物理執行計劃
data4.explain()
== Physical Plan ==
*Project [gender#20, age1#135, age2#136L]
+- *Sort [gender#20 DESC, age#21 ASC], true, 0
   +- Exchange rangepartitioning(gender#20 DESC, age#21 ASC, 200)
      +- LocalTableScan [gender#20, age1#135, age2#136L, age#21]

// 檢視邏輯和物理執行計劃	  
data4.explain(extended=true)
== Parsed Logical Plan ==
'Sort ['gender DESC, 'age ASC], true
+- Project [gender#20, (age#21 + cast(1 as double)) AS age1#135, cast(age#21 as bigint) AS age2#136L]
   +- Project [_1#9 AS affairs#19, _2#10 AS gender#20, _3#11 AS age#21, _4#12 AS yearsmarried#22, _5#13 AS children#23, _6#14 AS religiousness#24, _7#15 AS education#25, _8#16 AS occupation#2
6, _9#17 AS rating#27]      +- LocalRelation [_1#9, _2#10, _3#11, _4#12, _5#13, _6#14, _7#15, _8#16, _9#17]

== Analyzed Logical Plan ==
gender: string, age1: double, age2: bigint
Project [gender#20, age1#135, age2#136L]
+- Sort [gender#20 DESC, age#21 ASC], true
   +- Project [gender#20, (age#21 + cast(1 as double)) AS age1#135, cast(age#21 as bigint) AS age2#136L, age#21]
      +- Project [_1#9 AS affairs#19, _2#10 AS gender#20, _3#11 AS age#21, _4#12 AS yearsmarried#22, _5#13 AS children#23, _6#14 AS religiousness#24, _7#15 AS education#25, _8#16 AS occupatio
n#26, _9#17 AS rating#27]         +- LocalRelation [_1#9, _2#10, _3#11, _4#12, _5#13, _6#14, _7#15, _8#16, _9#17]

== Optimized Logical Plan ==
Project [gender#20, age1#135, age2#136L]
+- Sort [gender#20 DESC, age#21 ASC], true
   +- LocalRelation [gender#20, age1#135, age2#136L, age#21]

== Physical Plan ==
*Project [gender#20, age1#135, age2#136L]
+- *Sort [gender#20 DESC, age#21 ASC], true, 0
   +- Exchange rangepartitioning(gender#20 DESC, age#21 ASC, 200)
      +- LocalTableScan [gender#20, age1#135, age2#136L, age#21]

Spark2 DataSet 建立新行之flatMap
2016-11-28
Spark
Spark2 Dataset聚合操作
2016-11-25
Spark
Spark2 Dataset多維度統計cube與rollup
2016-11-25
Spark
Spark2 Dataset去重、差集、交集
2016-11-25
Spark
Spark2 Dataset之檢視與SQL
2016-11-25
SparkSQL
執行計劃-1：獲取執行計劃
2019-10-07
執行計劃
2014-04-03
Spark2 Dataset持久化儲存級別StorageLevel
2016-11-25
Spark持久化
分析執行計劃優化SQLORACLE的執行計劃(轉)
2019-04-04
優化SQLOracle
oracle sqlprofile 固定執行計劃，並遷移執行計劃
2017-08-09
OracleSQL
【sql調優之執行計劃】獲取執行計劃
2010-08-16
SQL
SYBASE執行計劃
2019-06-28
MySQL 執行計劃
2020-12-24
MySql
MySQL執行計劃
2018-05-29
MySql
sql 執行計劃
2008-08-20
SQL
ORACLE執行計劃
2011-05-02
Oracle
【執行計劃】Oracle獲取執行計劃的幾種方法
2021-08-09
Oracle
【Oracle】-【索引-HINT，執行計劃】-帶HINT的索引執行計劃
2013-07-26
Oracle索引
分析執行計劃最佳化SQLORACLE的執行計劃(轉)
2009-03-04
SQLOracle
Spark2 Dataset之collect_set與collect_list
2016-11-25
Spark
【PG執行計劃】Postgresql資料庫執行計劃統計資訊簡述
2019-01-10
SQL資料庫
oracle 固定執行計劃
2019-05-01
Oracle
Oracle sql執行計劃
2020-12-14
OracleSQL
explain執行計劃分析
2020-11-14
AI
mysql執行計劃explain
2020-10-23
MySqlAI
mysql explain 執行計劃
2023-12-04
MySqlAI
MySQL執行計劃解析
2018-06-19
MySql
分割槽執行計劃
2016-09-19
執行計劃繫結
2017-08-29
SQL Server執行計劃
2016-01-11
SQLServer
執行計劃詳解
2015-05-07
啟動執行計劃
2008-05-14
SQL的執行計劃
2008-06-25
SQL
FAST DUAL執行計劃？
2006-04-06
AST
執行計劃的理解.
2016-05-24
min ? max ? 執行計劃？
2007-04-28
檢視執行計劃
2012-02-03
SQL執行計劃分析
2009-04-27
SQL

Spark2 Dataset行列操作和執行計劃

相關文章