The Complete Spark API (1): Spark SQL Dataset & DataFrame API

Posted by Liam8 on 2018-12-09

Original URL: https://juejin.im/post/5c0d3d55e51d4529a90311cc

Introduction

org.apache.spark.sql.Dataset is the core class of Spark SQL. It is defined as follows:

class Dataset[T] extends Serializable

DataFrame is a type alias for Dataset[Row].

This article is based on Spark 2.3.0.

The class methods are summarized below.

Class methods

Actions

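collect(): Array[T]
Returns an array containing all rows of the Dataset.
Note: all of the data is loaded into the driver program's memory.

collectAsList(): List[T]
Same as above, but returns a Java list.

count(): Long
Returns the number of rows.

describe(cols: String*): DataFrame
Computes basic statistics for the specified columns: count, mean, stddev, min, and max.

head(): T
Returns the first row.

head(n: Int): Array[T]
Returns the first n rows.

first(): T
Returns the first row; an alias for head().

foreach(f: (T) ⇒ Unit): Unit
Applies f to every element.

foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit
Applies f to each partition, passing it an iterator over that partition's elements.

reduce(func: (T, T) ⇒ T): T
Reduces the elements of the Dataset with the binary function func and returns the result.
Note: func should be commutative and associative; otherwise the result is non-deterministic.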

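show(numRows: Int, truncate: Int, vertical: Boolean): Unit
Prints the data in tabular form. numRows: number of rows to show; truncate: truncates string values to the given length; vertical: prints each row vertically.

show(numRows: Int, truncate: Int): Unit
show(numRows: Int, truncate: Boolean): Unit
show(truncate: Boolean): Unit
Defaults: numRows=20, truncate=20. When truncate is a Boolean, true truncates strings longer than 20 characters.

show(numRows: Int): Unit
truncate=20

show(): Unit
numRows=20, truncate=20

summary(statistics: String*): DataFrame
Computes the statistics named in statistics for the Dataset. Available statistics: count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
If none are specified, all of them are computed.

take(n: Int): Array[T]
Returns the first n rows.

takeAsList(n: Int): List[T]
Returns the first n rows as a Java list.

toLocalIterator(): Iterator[T]
Returns an iterator over all rows.
The iterator will consume as much memory as the largest partition in this Dataset.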
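
A minimal sketch of a few of the actions listed above, run on a tiny in-memory Dataset. The object name ActionsSketch, the local[1] master and the sample numbers are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object ActionsSketch {
  // Runs a few actions on a tiny Dataset[Int] and returns what they produce.
  def run(): (Long, Int, Seq[Int], Int) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("actions-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(1, 2, 3, 4, 5).toDS()

    val rows     = ds.count()                  // number of rows
    val first    = ds.head()                   // same as ds.first()
    val firstTwo = ds.take(2).toSeq            // the rows are returned to the driver
    val total    = ds.reduce((a, b) => a + b)  // + is commutative and associative

    ds.show(3, truncate = false)               // prints the first 3 rows as a table

    spark.stop()
    (rows, first, firstTwo, total)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Every call here brings data back to the driver, so on a large Dataset take or head is usually a safer first look than collect.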



Basic Dataset functions

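as[U](implicit arg0: Encoder[U]): Dataset[U]
Maps each record to the specified type U and returns a new Dataset.

persist(newLevel: StorageLevel): Dataset.this.type
Persists the Dataset at the given storage level.

persist(): Dataset.this.type
Same as cache().

cache(): Dataset.this.type
Persists the Dataset with the default storage level, MEMORY_AND_DISK.
Note: RDD.cache() defaults to MEMORY_ONLY.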
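
The sketch below checks the storage level before caching, after cache() and after unpersist(). The object name CacheSketch and the sample data are hypothetical; the storage levels it reports are the ones described above for Spark 2.3.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Returns the storage level before caching, after cache(), and after unpersist().
  def run(): (StorageLevel, StorageLevel, StorageLevel) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("cache-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq("a", "b", "c").toDS()

    val before = ds.storageLevel   // not persisted yet
    ds.cache()                     // registers the Dataset for caching
    val cached = ds.storageLevel
    ds.count()                     // the first action materializes the cache
    ds.unpersist(blocking = true)  // wait until all cached blocks are removed
    val after = ds.storageLevel

    spark.stop()
    (before, cached, after)
  }

  def main(args: Array[String]): Unit = println(run())
}
```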

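checkpoint(eager: Boolean): Dataset[T]
Returns a checkpointed version of the Dataset; the Dataset's logical plan is truncated.
The checkpoint directory must be set beforehand with SparkContext#setCheckpointDir.

checkpoint(): Dataset[T]
Same as above, with eager=true.

columns: Array[String]
Returns all column names as an array.

dtypes: Array[(String, String)]
Returns all column names and their data types as an array.

createGlobalTempView(viewName: String): Unit
Creates a global temporary view whose lifetime is tied to the Spark application.
It can be accessed across sessions through the global_temp database, e.g. SELECT * FROM global_temp.view1.

createOrReplaceGlobalTempView(viewName: String): Unit
Same as above, but replaces the view if it already exists.

createTempView(viewName: String): Unit
Creates a local temporary view, accessible only from the SparkSession that created it.
Note: a local temporary view is not tied to any database, so it cannot be referenced as db1.view1.

createOrReplaceTempView(viewName: String): Unit
Same as above, but replaces the view if it already exists.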
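
To illustrate the difference in visibility between local and global temporary views, here is a small sketch. The view names people and people_global, the object name TempViewSketch and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  // Returns (rows matched through the local view,
  //          rows read through the global view from a second session,
  //          whether the local view is visible in that second session).
  def run(): (Long, Long, Boolean) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("view-sketch")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(("Ann", 30), ("Bob", 12)).toDF("name", "age")

    people.createOrReplaceTempView("people")               // session-scoped
    people.createOrReplaceGlobalTempView("people_global")  // application-scoped

    val localCount = spark.sql("SELECT * FROM people WHERE age > 15").count()

    // A second session in the same application.
    val other = spark.newSession()
    val globalCount = other.sql("SELECT * FROM global_temp.people_global").count()
    val localVisibleElsewhere = other.catalog.tableExists("people")

    spark.stop()
    (localCount, globalCount, localVisibleElsewhere)
  }

  def main(args: Array[String]): Unit = println(run())
}
```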

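explain(): Unit
Prints the physical plan.
See also the queryExecution member, which exposes the full set of plans.

explain(extended: Boolean): Unit
Prints the logical plans as well as the physical plan when extended is true.

hint(name: String, parameters: Any*): Dataset[T]
Specifies a hint on the current Dataset. The broadcast hint, for example, marks the Dataset to be broadcast when it is joined.
e.g. df1.join(df2.hint("broadcast"))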
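
One way to see the effect of the broadcast hint is to compare physical plans with automatic broadcasting switched off. This is a sketch under assumptions: the object name HintSketch and the sample tables are hypothetical, and the operator name BroadcastHashJoin is what Spark 2.3 prints in its plans, which may differ in other versions.

```scala
import org.apache.spark.sql.SparkSession

object HintSketch {
  // Returns whether the plan uses a broadcast hash join (without hint, with hint).
  def run(): (Boolean, Boolean) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("hint-sketch")
      .getOrCreate()
    import spark.implicits._

    // Disable size-based automatic broadcasting so only the hint can trigger it.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val orders = Seq((1, "book"), (2, "pen")).toDF("user_id", "item")
    val users  = Seq((1, "Ann"), (2, "Bob")).toDF("user_id", "name")

    val plain  = orders.join(users, "user_id")
    val hinted = orders.join(users.hint("broadcast"), "user_id")

    hinted.explain()               // prints the physical plan

    val plainPlan  = plain.queryExecution.executedPlan.toString
    val hintedPlan = hinted.queryExecution.executedPlan.toString

    spark.stop()
    (plainPlan.contains("BroadcastHashJoin"), hintedPlan.contains("BroadcastHashJoin"))
  }

  def main(args: Array[String]): Unit = println(run())
}
```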

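inputFiles: Array[String]
Returns a best-effort snapshot of the files that compose this Dataset.

isLocal: Boolean
Returns true if collect and take can be run locally, without any executors.

localCheckpoint(eager: Boolean): Dataset[T]
Performs a local checkpoint and returns the new Dataset. Local checkpoints are written to executor storage rather than to the checkpoint directory.

localCheckpoint(): Dataset[T]
Same as above, with eager=true.

printSchema(): Unit
Prints the schema as a tree.

rdd: RDD[T]
The content of the Dataset as an RDD of T.

schema: StructType
Returns the schema of the Dataset.

storageLevel: StorageLevel
Returns the current storage level, or StorageLevel.NONE if the Dataset has not been persisted.

toDF(): DataFrame
toDF(colNames: String*): DataFrame
Converts the Dataset to a DataFrame, optionally renaming the columns. With spark.implicits._ imported, an RDD can be converted to a DataFrame through toDF as well.

unpersist(): Dataset.this.type
unpersist(blocking: Boolean): Dataset.this.type
Removes the Dataset from the cache. blocking controls whether the call blocks until all blocks have been deleted.

write: DataFrameWriter[T]
Returns a DataFrameWriter, the interface for writing non-streaming data.

writeStream: DataStreamWriter[T]
Returns a DataStreamWriter, the interface for writing streaming data.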


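Streaming

isStreaming: Boolean
Returns true if the Dataset contains one or more streaming sources.

withWatermark(eventTime: String, delayThreshold: String): Dataset[T]
Defines an event time watermark for this Dataset.
eventTime is the name of the column that holds the event time; delayThreshold is the minimum delay to wait for late data, given as an interval string such as "1 minute" or "5 hours".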
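
A minimal sketch of declaring a watermark, assuming the built-in "rate" test source as the input. The object name WatermarkSketch and the 10-minute and 5-minute intervals are arbitrary choices for illustration; the query is only defined here, never started.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkSketch {
  // Builds (but does not start) a windowed streaming aggregation with a watermark.
  def run(): (Boolean, Seq[String]) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("watermark-sketch")
      .getOrCreate()
    import spark.implicits._

    // The "rate" source emits rows with the columns (timestamp, value).
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1")
      .load()

    val counts = events
      .withWatermark("timestamp", "10 minutes")     // tolerate 10 minutes of lateness
      .groupBy(window($"timestamp", "5 minutes"))   // 5-minute event-time windows
      .count()

    val result = (counts.isStreaming, counts.columns.toSeq)
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

To actually run such a query it would have to be started through writeStream, which is outside the scope of this sketch.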



Typed transformations

alias(alias: Symbol): Dataset[T]
alias(alias: String): Dataset[T]
as(alias: Symbol): Dataset[T]
as(alias: String): Dataset[T]
Returns a new Dataset with an alias set.

coalesce(numPartitions: Int): Dataset[T]
Merges partitions; it can only reduce the number of partitions.

distinct(): Dataset[T]
Returns only the unique rows; an alias for dropDuplicates().

dropDuplicates(col1: String, cols: String*): Dataset[T]
dropDuplicates(colNames: Array[String]): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(): Dataset[T]
Removes duplicate rows, comparing only the specified columns; with no arguments, all columns are compared.

except(other: Dataset[T]): Dataset[T]
Returns the rows of this Dataset that do not appear in other. Equivalent to EXCEPT DISTINCT in SQL.
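
The deduplication and except entries above can be sketched as follows. The object name DedupSketch and the sample values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DedupSketch {
  // Returns (distinct values of a, values only in a, row count after dedup by name).
  def run(): (Seq[Int], Seq[Int], Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("dedup-sketch")
      .getOrCreate()
    import spark.implicits._

    val a = Seq(1, 1, 2, 3).toDS()
    val b = Seq(3, 4).toDS()

    val unique  = a.distinct().collect().sorted.toSeq   // duplicates of 1 collapse
    val onlyInA = a.except(b).collect().sorted.toSeq    // EXCEPT DISTINCT: 3 is removed

    val people = Seq(("Ann", 30), ("Ann", 31), ("Bob", 30)).toDF("name", "age")
    val byName = people.dropDuplicates("name").count()  // one row kept per name

    spark.stop()
    (unique, onlyInA, byName)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Which of the two "Ann" rows survives dropDuplicates("name") is not specified, so the sketch only counts the rows.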

filter(func: (T) ⇒ Boolean): Dataset[T]
filter(conditionExpr: String): Dataset[T]
filter(condition: Column): Dataset[T]
Keeps only the rows that satisfy the condition.
e.g.
peopleDs.filter("age > 15")
peopleDs.filter($"age" > 15)

flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]
Applies func to each element, as map does, and then flattens all of the results into a single Dataset.

groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
First computes a key for each element with func, then groups the elements by that key.
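
A word count is a compact way to see flatMap and groupByKey together. This is a sketch with a hypothetical object name (WordCountSketch) and made-up input lines.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Splits lines into words, groups identical words, and counts each group.
  def run(): Map[String, Long] = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("wordcount-sketch")
      .getOrCreate()
    import spark.implicits._

    val lines = Seq("a b", "b c b").toDS()

    val words  = lines.flatMap(line => line.split(" "))  // Dataset[String], one word per row
    val counts = words.groupByKey(identity).count()      // Dataset[(String, Long)]

    val result = counts.collect().toMap
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```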

intersect(other: Dataset[T]): Dataset[T]
Returns the intersection of the two Datasets. Equivalent to INTERSECT in SQL.

joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
Joins the two Datasets with an inner equi-join and returns a pair (T, U) for every match.

joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]
joinType must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer.
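
Unlike join, joinWith keeps both sides as typed objects. The following sketch assumes two hypothetical case classes, User and Order, and a hypothetical object name JoinWithSketch.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object JoinWithSketch {
  // Hypothetical record types used only for this illustration.
  case class User(id: Int, name: String)
  case class Order(userId: Int, item: String)

  // Returns (matched pairs from the inner join, row count of the left outer join).
  def run(): (Seq[(User, Order)], Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("joinwith-sketch")
      .getOrCreate()
    import spark.implicits._

    val users  = Seq(User(1, "Ann"), User(2, "Bob")).toDS()
    val orders = Seq(Order(1, "book")).toDS()

    val cond = users("id") === orders("userId")

    // Each result row is a typed pair instead of a flattened Row.
    val pairs: Dataset[(User, Order)] = users.joinWith(orders, cond)

    // With left_outer, users without an order are kept as well.
    val leftCount = users.joinWith(orders, cond, "left_outer").count()

    val result = (pairs.collect().toSeq, leftCount)
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```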

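limit(n: Int): Dataset[T]
Returns a new Dataset containing the first n rows. Unlike head, which is an action and returns the result array immediately, limit is a transformation.

map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
Applies func to each element and returns a Dataset containing the results.

mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]
Applies func to each partition and returns a Dataset containing the results.

orderBy(sortExprs: Column*): Dataset[T]
orderBy(sortCol: String, sortCols: String*): Dataset[T]
An alias for sort.

sort(sortExprs: Column*): Dataset[T]
sort(sortCol: String, sortCols: String*): Dataset[T]
Sorts by the specified columns, ascending by default.
e.g. ds.sort($"col1", $"col2".desc)

sortWithinPartitions(sortExprs: Column*): Dataset[T]
sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]
Sorts the rows within each partition. Same as "SORT BY" in SQL (Hive QL).

randomSplit(weights: Array[Double]): Array[Dataset[T]]
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
Randomly splits the Dataset according to the given weights; weights that do not sum to 1 are normalized.

repartition(partitionExprs: Column*): Dataset[T]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
repartition(numPartitions: Int): Dataset[T]
Repartitions the Dataset into the given number of partitions. When partition expressions are given, rows are hash-partitioned by them, the same as "DISTRIBUTE BY" in SQL.
When numPartitions is omitted, spark.sql.shuffle.partitions is used.

repartitionByRange(partitionExprs: Column*): Dataset[T]
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Repartitions the Dataset by the given expressions into the given number of partitions using range partitioning, i.e. each partition covers a range of keys.
The default sort order for the ranges is ascending nulls first; rows within a partition are not sorted.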
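
The partition counts produced by repartition, repartitionByRange and coalesce can be inspected through the underlying RDD. This is a sketch with a hypothetical object name (PartitionSketch) and arbitrary partition numbers; the range-partitioned count is reported rather than asserted to be exactly 3, because range bounds are derived from a sample of the data.

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  // Returns the partition counts after hash repartitioning,
  // range repartitioning and coalescing.
  def run(): (Int, Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")          // assumption: local mode, for illustration
      .appName("partition-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = spark.range(0, 100)   // a single column named "id"

    val byHash  = ds.repartition(4, $"id")         // hash partitioning on id
    val byRange = ds.repartitionByRange(3, $"id")  // each partition holds a range of ids
    val merged  = byHash.coalesce(2)               // only reduces the partition count

    val result = (
      byHash.rdd.getNumPartitions,
      byRange.rdd.getNumPartitions,
      merged.rdd.getNumPartitions
    )
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```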

sample(withReplacement: Boolean, fraction: Double): Dataset[T]
sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
sample(fraction: Double): Dataset[T]
sample(fraction: Double, seed: Long): Dataset[T]
Returns a random sample of the rows.
withReplacement: whether to sample with replacement.
fraction: fraction of rows to generate, range [0.0, 1.0].
seed: seed for sampling.

select[U1](c1: TypedColumn[T, U1]): Dataset[U1]
Returns a new Dataset by computing the given typed column or expression for each element.

transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U]
Applies the function t to the Dataset and returns the result, which makes custom transformations easy to chain.
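
A short sketch of the typed select and of transform. The object name SelectTransformSketch, the helper adultsOnly and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object SelectTransformSketch {
  // Returns (all names, names of the rows kept by the adultsOnly transformation).
  def run(): (Seq[String], Seq[String]) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("select-sketch")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(("Ann", 30), ("Bob", 12)).toDF("name", "age")

    // A TypedColumn makes select return Dataset[String] instead of a DataFrame.
    val names: Dataset[String] = people.select($"name".as[String])

    // A reusable transformation that can be chained with transform.
    val adultsOnly: DataFrame => DataFrame = _.filter($"age" > 15)
    val adultNames = people.transform(adultsOnly).select($"name".as[String])

    val result = (names.collect().toSeq, adultNames.collect().toSeq)
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```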

union(other: Dataset[T]): Dataset[T]
Equivalent to UNION ALL in SQL.
Note that columns are matched by position, not by name:
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show

// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   4|   5|   6|
// +----+----+----+

unionByName(other: Dataset[T]): Dataset[T]
Same as union, but columns are matched by name:
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+

where(conditionExpr: String): Dataset[T]
where(condition: Column): Dataset[T]
An alias for filter.




Untyped transformations

These methods return a DataFrame (or a Column or a grouped-data object) rather than a Dataset[T].

agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
Aggregates over the entire Dataset, without groups.
ds.agg(...) is shorthand for ds.groupBy().agg(...).
e.g.
ds.agg(max($"age"), avg($"salary"))
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.agg("age" -> "max", "salary" -> "avg")
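
The agg variants above can be tried on a small table as follows. This is a sketch; the object name AggSketch and the staff data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

object AggSketch {
  // Returns (max age, average salary) computed with two of the agg variants.
  def run(): ((Int, Double), (Int, Double)) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("agg-sketch")
      .getOrCreate()
    import spark.implicits._

    val staff = Seq(("Ann", 30, 100.0), ("Bob", 40, 200.0)).toDF("name", "age", "salary")

    // Column-based variant.
    val r1 = staff.agg(max($"age"), avg($"salary")).head()
    // (column name -> aggregate function name) variant.
    val r2 = staff.agg("age" -> "max", "salary" -> "avg").head()

    val result = ((r1.getInt(0), r1.getDouble(1)), (r2.getInt(0), r2.getDouble(1)))
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```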


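apply(colName: String): Column
col(colName: String): Column
colRegex(colName: String): Column
Returns the specified column. colRegex selects the column by a name given as a regular expression.

crossJoin(right: Dataset[_]): DataFrame
Performs an explicit cross join (Cartesian product) with right.

cube(col1: String, cols: String*): RelationalGroupedDataset
cube(cols: Column*): RelationalGroupedDataset
Creates a multi-dimensional cube using the specified columns, so that aggregations can be run on every combination of them.

drop(col: Column): DataFrame
drop(colNames: String*): DataFrame
drop(colName: String): DataFrame
Drops the specified columns.

groupBy(col1: String, cols: String*): RelationalGroupedDataset
groupBy(cols: Column*): RelationalGroupedDataset
Groups the Dataset by the specified columns.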

join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_]): DataFrame
Joins with another DataFrame.
joinExprs: a join expression, e.g. $"df1Key" === $"df2Key"
usingColumns: names of the columns to join on, e.g. Seq("user_id", "user_name")
joinType: defaults to inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
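
The sketch below exercises three of the join variants: a join on a shared column name, a left outer join, and a left anti join with an explicit join expression. The object name JoinSketch and the two sample tables are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch {
  // Returns (columns of the inner join, row count of the left outer join,
  //          names of users without any order).
  def run(): (Seq[String], Long, Seq[String]) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("join-sketch")
      .getOrCreate()
    import spark.implicits._

    val users  = Seq((1, "Ann"), (2, "Bob"), (3, "Cid")).toDF("user_id", "name")
    val orders = Seq((1, "book"), (1, "pen"), (2, "ink")).toDF("user_id", "item")

    // usingColumns: the join column appears only once in the result.
    val inner = users.join(orders, Seq("user_id"))

    // left_outer keeps users that have no matching order.
    val left = users.join(orders, Seq("user_id"), "left_outer")

    // left_anti keeps only the left rows that have no match at all.
    val noOrders = users.join(orders, users("user_id") === orders("user_id"), "left_anti")

    val result = (
      inner.columns.toSeq,
      left.count(),
      noOrders.select("name").as[String].collect().toSeq
    )
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```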


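na: DataFrameNaFunctions
Returns a DataFrameNaFunctions for working with missing data; see DataFrameNaFunctions.

stat: DataFrameStatFunctions
Returns a DataFrameStatFunctions for statistics functions; see DataFrameStatFunctions.

rollup(col1: String, cols: String*): RelationalGroupedDataset
rollup(cols: Column*): RelationalGroupedDataset
Creates a multi-dimensional rollup using the specified columns, so that aggregations can be run on each hierarchical prefix of them.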
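
To make the difference between groupBy, rollup and cube concrete, the following sketch counts the result rows each one produces for the same input. The object name GroupingSketch, the column names k1, k2, v and the data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object GroupingSketch {
  // Returns (rows from groupBy, rows from rollup, rows from cube, grand total).
  def run(): (Long, Long, Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("grouping-sketch")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("k1", "k2", "v")

    // groupBy: one row per distinct (k1, k2) pair.
    val grouped = sales.groupBy("k1", "k2").agg(sum("v").as("total"))
    // rollup: (k1, k2), then subtotals per k1, then the grand total.
    val rolled = sales.rollup("k1", "k2").agg(sum("v").as("total"))
    // cube: every combination, i.e. rollup plus subtotals per k2.
    val cubed = sales.cube("k1", "k2").agg(sum("v").as("total"))

    // In the subtotal rows the rolled-up columns are null.
    val grandTotal = rolled
      .filter($"k1".isNull && $"k2".isNull)
      .head()
      .getAs[Long]("total")

    val result = (grouped.count(), rolled.count(), cubed.count(), grandTotal)
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

For this input, groupBy yields 3 rows, rollup adds 2 per-k1 subtotals and 1 grand total (6 rows), and cube adds 2 more per-k2 subtotals (8 rows).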

select(col: String, cols: String*): DataFrame
select(cols: Column*): DataFrame
selectExpr(exprs: String*): DataFrame
Selects the specified columns or SQL expressions.

withColumn(colName: String, col: Column): DataFrame
Adds a column, or replaces an existing column that has the same name.

withColumnRenamed(existingName: String, newName: String): DataFrame
Renames the specified column.
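
A small sketch of the column-level operations above, showing only the resulting column names. The object name ColumnsSketch and the column names age_next, user_name and double_age are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ColumnsSketch {
  // Returns the column names after a chain of column operations and after selectExpr.
  def run(): (Seq[String], Seq[String]) = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode, for illustration
      .appName("columns-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Ann", 30), ("Bob", 12)).toDF("name", "age")

    val reshaped = df
      .withColumn("age_next", $"age" + 1)      // add a derived column
      .withColumnRenamed("name", "user_name")  // rename an existing column
      .drop("age")                             // remove the original column

    // selectExpr accepts SQL expressions, including aliases.
    val doubled = df.selectExpr("name", "age * 2 AS double_age")

    val result = (reshaped.columns.toSeq, doubled.columns.toSeq)
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```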


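Ungrouped


queryExecution: QueryExecution
The query execution, which holds the logical and physical plans.

sparkSession: SparkSession
The SparkSession that created this Dataset.

sqlContext: SQLContext
The SQLContext of this Dataset.

toJSON: Dataset[String]
Converts each row to a JSON string.

toString(): String
Returns a string representation of the Dataset (the toString inherited from Any).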


References

Spark API
