Spark SQL | Spark, from Beginner to Expert

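Published by the Meitu Data Technology Team on 2019-01-21

Original article: http://www.jiqizhixin.com/articles/2019-01-22-11

Welcome to the Meitu Data Technology Team's "Spark, from Beginner to Expert" series. The series introduces Spark step by step, from getting started with the framework to how its underlying architecture is implemented, so there should be something here for every reader.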

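/ Origins /

Anyone familiar with Spark SQL knows that it grew out of Shark. To stay compatible with Hive, Shark reused Hive's HQL parsing, logical plan translation, and plan optimization logic. Roughly speaking, it only replaced the physical execution plan, swapping MapReduce jobs for Spark jobs (plus optimizations that have little to do with Hive, such as in-memory columnar storage). It also depended on the Hive Metastore and Hive SerDe (for compatibility with the existing Hive storage formats).

For Hive compatibility, Spark SQL depends only on the HQL parser, the Hive Metastore, and Hive SerDe. In other words, once HQL has been parsed into an abstract syntax tree (AST), Spark SQL takes over completely. Plan generation and optimization are both handled by Catalyst. Thanks to functional-language features of Scala such as pattern matching, writing plan optimization rules with Catalyst is far more concise than in Hive.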


Spark SQL

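Spark SQL provides several interfaces:

plain SQL text;
the Dataset/DataFrame API.

Correspondingly, there are several kinds of clients:

for SQL text, thriftserver or spark-sql;
for code, DataFrame/Dataset/SQL.

/ Introduction to the DataFrame/Dataset API /

A DataFrame/Dataset is also a distributed dataset, but unlike an RDD it carries schema information, much like a table.

The figure below compares Dataset/DataFrame with RDD in detail: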

[Figure: comparison of Dataset/DataFrame and RDD]

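Dataset was introduced in Spark 1.6 to offer RDD-style strong typing and powerful lambda functions while still using Spark SQL's optimized execution engine. From Spark 2.0 on, DataFrame became a Dataset whose element type is Row, that is: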

type DataFrame = Dataset[Row]


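As a result, a lot of code ported from Spark 1.6 and earlier to Spark 2+ fails with an error that the DataFrame class cannot be found: DataFrame is now only a Scala type alias, so Java code in particular has to use Dataset<Row> instead.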
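Why an alias can lead to a "class not found" error is easier to see in isolation. The sketch below is plain Scala with no Spark dependency; Row and Dataset here are tiny stand-ins invented for the illustration (only the alias line mirrors the definition quoted above), and AliasDemo and rowCount are hypothetical names.

```scala
// Stand-in types that mimic the shape of the Spark definitions; this is not Spark itself.
object AliasDemo {
  final case class Row(values: Seq[Any])
  final class Dataset[T](val rows: Seq[T]) {
    def count: Long = rows.length.toLong
  }

  // Same shape as the alias quoted above.
  type DataFrame = Dataset[Row]

  // A function written against the alias accepts any Dataset[Row].
  def rowCount(df: DataFrame): Long = df.count

  def main(args: Array[String]): Unit = {
    val ds: Dataset[Row] = new Dataset(Seq(Row(Seq("Andy", 30)), Row(Seq("Justin", 19))))
    println(rowCount(ds))
    // The alias has no class of its own: both sides refer to the Dataset class.
    println(classOf[DataFrame] == classOf[Dataset[_]])
  }
}
```

Because the alias exists only at the Scala type level, Scala sources that mention DataFrame keep compiling, while anything that looks for a separate DataFrame class finds none.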

Basic operations

val df = spark.read.json("file:///opt/meitu/bigdata/src/main/data/people.json")
// Print the first rows
df.show()
// Enables the $"column" syntax used below
import spark.implicits._
df.printSchema()
df.select("name").show()
df.select($"name", $"age" + 1).show()
df.filter($"age" > 21).show()
df.groupBy("age").count().show()
spark.stop()

Partitioning, bucketing, and sorting

Bucket, sort, and save as a Hive table:
df.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
Partition and write as Parquet to a given directory:
df.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")
Partition, bucket, and save as a Hive table:
df.write.partitionBy("favorite_color").bucketBy(42, "name").saveAsTable("users_partitioned_bucketed")

cube, rollup, pivot

import org.apache.spark.sql.functions.sum

cube:
sales.cube("city", "year").agg(sum("amount") as "amount").show()
rollup:
sales.rollup("city", "year").agg(sum("amount") as "amount").show()
pivot (can only follow groupBy):
sales.groupBy("year").pivot("city", Seq("Warsaw", "Boston", "Toronto")).agg(sum("amount") as "amount").show()
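To make the difference between cube and rollup concrete, the following plain-Scala sketch (no Spark required) lists the combinations of grouping columns that each one aggregates over. The names GroupingSets, cubeSets, and rollupSets are made up for this illustration and are not part of the Spark API; Spark works out these combinations internally.

```scala
// Illustrative only: enumerates the grouping sets that cube and rollup aggregate over.
object GroupingSets {
  // cube: every subset of the grouping columns, keeping their original order.
  def cubeSets(cols: Seq[String]): Seq[Seq[String]] =
    cols.foldRight(Seq(Seq.empty[String])) { (col, acc) =>
      acc.map(col +: _) ++ acc
    }

  // rollup: every prefix of the grouping columns, from the full list down to the grand total.
  def rollupSets(cols: Seq[String]): Seq[Seq[String]] =
    (cols.length to 0 by -1).map(n => cols.take(n))

  def main(args: Array[String]): Unit = {
    // (city, year), (city), (year), ()
    println(cubeSets(Seq("city", "year")))
    // (city, year), (city), ()
    println(rollupSets(Seq("city", "year")))
  }
}
```

In the Spark output, a row that belongs to a smaller grouping set shows null in the columns that were left out, and the empty set corresponds to the grand total.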

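/ SQL programming /

Spark SQL lets users submit SQL text, which can be written in three ways:

1. Spark code

2. The spark-sql shell

3. thriftserver

Spark SQL's own syntax is supported, and HQL (Hive SQL) is supported as well.

1. In code

First create a SQLContext or SparkSession, which is the entry point for Spark SQL code. Early versions used SQLContext or HiveContext; from Spark 2 on, SparkSession is recommended.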

SQLContext

new SQLContext(sc) // sc is an existing SparkContext

HiveContext

new HiveContext(sc)

SparkSession

Without Hive metadata:

val spark = SparkSession.builder()
 .config(sparkConf)
 .getOrCreate()

With Hive metadata:

val spark = SparkSession.builder()
 .config(sparkConf)
 .enableHiveSupport()
 .getOrCreate()

Usage

val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show()

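2. The spark-sql script

Like spark-submit, spark-sql accepts settings such as the deploy mode and resources at startup. To list the configuration parameters, run:

bin/spark-sql --help

Put hive-site.xml in the ${SPARK_HOME}/conf/ directory; you can then test with: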

show tables;

select count(*) from student;

3. thriftserver

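The thriftserver's JDBC/ODBC implementation is similar to HiveServer2 in Hive 1.2.1. You can test the JDBC server with the beeline command that ships with Spark.

Installation and deployment

/1 Start the Hive metastore

bin/hive --service metastore

/2 Copy the configuration files into the spark/conf/ directory

/3 Start the thriftserver

sbin/start-thriftserver.sh --master yarn --deploy-mode client

On YARN, only client mode is supported.

/4 Start bin/beeline

/5 Connect to the thriftserver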

!connect jdbc:hive2://localhost:10001

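/ User-defined functions /

1. UDF

Defining a UDF is simple. For example, a custom UDF that returns the length of a string:

import org.apache.spark.sql.functions.udf

// Wrap a Scala function as a UDF and register it under the SQL name "len"
val len = udf { (str: String) => str.length }
spark.udf.register("len", len)
val ds = spark.read.json("file:///opt/meitu/bigdata/src/main/data/employees.json")
ds.createOrReplaceTempView("employees")
ds.show()
spark.sql("select len(name) from employees").show()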

2. UserDefinedAggregateFunction

Define a UDAF

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._

object MyAverageUDAF extends UserDefinedAggregateFunction {
 // Data types of input arguments of this aggregate function
 def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
 // Data types of values in the aggregation buffer
 def bufferSchema: StructType = {
   StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
 }
 // The data type of the returned value
 def dataType: DataType = DoubleType
 // Whether this function always returns the same output on the identical input
 def deterministic: Boolean = true
 // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
 // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
 // the opportunity to update its values. Note that arrays and maps inside the buffer are still
 // immutable.
 def initialize(buffer: MutableAggregationBuffer): Unit = {
   buffer(0) = 0L
   buffer(1) = 0L
 }
 // Updates the given aggregation buffer `buffer` with new input data from `input`
 def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
   if (!input.isNullAt(0)) {
     buffer(0) = buffer.getLong(0) + input.getLong(0)
     buffer(1) = buffer.getLong(1) + 1
   }
 }
 // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
 def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
   buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
   buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
 }
 // Calculates the final result
 def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

Use the UDAF

val ds = spark.read.json("file:///opt/meitu/bigdata/src/main/data/employees.json")
ds.createOrReplaceTempView("employees")
ds.show()
spark.udf.register("myAverage", MyAverageUDAF)
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()

3. Aggregator

Define an Aggregator

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverageAggregator extends Aggregator[Employee, Average, Double] {

 // A zero value for this aggregation. Should satisfy the property that any b + zero = b
 def zero: Average = Average(0L, 0L)
 // Combine two values to produce a new value. For performance, the function may modify `buffer`
 // and return it instead of constructing a new object
 def reduce(buffer: Average, employee: Employee): Average = {
   buffer.sum += employee.salary
   buffer.count += 1
   buffer
 }
 // Merge two intermediate values
 def merge(b1: Average, b2: Average): Average = {
   b1.sum += b2.sum
   b1.count += b2.count
   b1
 }
 // Transform the output of the reduction
 def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
 // Specifies the Encoder for the intermediate value type
 def bufferEncoder: Encoder[Average] = Encoders.product
 // Specifies the Encoder for the final output value type
 def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

Usage

// A typed Aggregator is applied to a Dataset through toColumn (below);
// spark.udf.register does not accept an Aggregator object directly.
import spark.implicits._
val ds = spark.read.json("file:///opt/meitu/bigdata/src/main/data/employees.json").as[Employee]
ds.show()
val averageSalary = MyAverageAggregator.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
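How the four Aggregator methods cooperate can be traced without a cluster. The sketch below is plain Scala with no Spark dependency. It declares immutable stand-ins named Emp and Avg (so they do not clash with Employee and Average above) and folds two hand-made "partitions" the way a distributed engine conceptually would: reduce within each partition starting from zero, merge the partial results, then finish. The object name, the partition contents, and the salaries are sample values chosen for the illustration, and real scheduling in Spark is more involved.

```scala
// Plain-Scala walk-through of the zero / reduce / merge / finish contract.
// Emp and Avg are illustrative stand-ins, not Spark classes.
object AverageContract {
  final case class Emp(name: String, salary: Long)
  final case class Avg(sum: Long, count: Long)

  val zero: Avg = Avg(0L, 0L)
  def reduce(b: Avg, e: Emp): Avg = Avg(b.sum + e.salary, b.count + 1)
  def merge(b1: Avg, b2: Avg): Avg = Avg(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: Avg): Double = b.sum.toDouble / b.count

  // Each inner Seq plays the role of one partition.
  def run(partitions: Seq[Seq[Emp]]): Double = {
    val partials = partitions.map(_.foldLeft(zero)(reduce))
    finish(partials.foldLeft(zero)(merge))
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(Emp("Michael", 3000L), Emp("Andy", 4500L)),
      Seq(Emp("Justin", 3500L), Emp("Berta", 4000L))
    )
    println(run(partitions)) // 15000 / 4 = 3750.0
  }
}
```

The result does not depend on how the rows are split into partitions, which is the property the zero and merge definitions are meant to guarantee.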

/ Data sources /

1. Generic load/save functions
Several data formats are supported: json, parquet, jdbc, orc, libsvm, csv, text

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

The default format is parquet; it can be changed through the spark.sql.sources.default setting.

2. Parquet files

// Write first, then read the same path back
peopleDF.write.parquet("people.parquet")
val parquetFileDF = spark.read.parquet("people.parquet")

3. ORC files

val ds = spark.read.json("file:///opt/meitu/bigdata/src/main/data/employees.json")
ds.write.mode("append").orc("/opt/outputorc/")
spark.read.orc("/opt/outputorc/*").show(1)

4. JSON

ds.write.mode("overwrite").json("/opt/outputjson/")
spark.read.json("/opt/outputjson/*").show()

5. Hive tables

In Spark 1.6 and earlier, working with Hive tables requires a HiveContext. From Spark 2 on, you only need to add enableHiveSupport() when creating the SparkSession.

val spark = SparkSession
.builder()
.config(sparkConf)
.enableHiveSupport()
.getOrCreate()

spark.sql("select count(*) from student").show()

6. JDBC

Write to MySQL

import java.util.Properties

wcdf.repartition(1).write.mode("append").option("user", "root")
 .option("password", "mdh2018@#").jdbc("jdbc:mysql://localhost:3306/test", "alluxio", new Properties())

Read from MySQL

val fromMysql = spark.read.option("user", "root")
 .option("password", "mdh2018@#").jdbc("jdbc:mysql://localhost:3306/test", "alluxio", new Properties())

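7. Custom data sources

Writing a custom source is fairly simple. Start from how a source is loaded: Spark resolves the name passed to format(...) by looking for a class named DefaultSource in that package, so it is enough to define DefaultSource there and implement the custom source in it.

// The interfaces below belong to the Spark 2.3 DataSourceV2 API; later releases renamed or replaced them
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}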

class DefaultSource  extends DataSourceV2 with ReadSupport {

 def createReader(options: DataSourceOptions) = new SimpleDataSourceReader()
}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class SimpleDataSourceReader extends DataSourceReader {

 def readSchema() = StructType(Array(StructField("value", StringType)))

 def createDataReaderFactories = {
   val factoryList = new java.util.ArrayList[DataReaderFactory[Row]]
   factoryList.add(new SimpleDataSourceReaderFactory())
   factoryList
 }
}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

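class SimpleDataSourceReaderFactory extends
 DataReaderFactory[Row] with DataReader[Row] {
 // Each call hands out a fresh reader over the same fixed values
 def createDataReader = new SimpleDataSourceReaderFactory()
 val values = Array("1", "2", "3", "4", "5")

 var index = 0

 // True while there are still values left to emit
 def next = index < values.length

 // Return the current value as a Row and advance
 def get = {
   val row = Row(values(index))
   index = index + 1
   row
 }

 def close(): Unit = ()
}

Usage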

val simpleDf = spark.read
 .format("bigdata.spark.SparkSQL.DataSources")
 .load()

simpleDf.show()

/ Optimizer and execution plans /

1. Overview of the flow


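The overall flow is as follows: starting from one of the input APIs (SQL, Dataset, DataFrame), a query passes through an unresolved logical plan, a resolved logical plan, an optimized logical plan, and physical plans; cost-based optimization then selects one physical plan to execute.

This can be simplified into four parts:

/1 analysis

Since Spark 2.0 the syntax tree is generated with ANTLR4; earlier versions used Scala parser combinators.

/2 logical optimization

Constant folding, predicate pushdown, column pruning, Boolean expression simplification, and other rules.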
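To give a feel for what rules such as constant folding and Boolean simplification do, here is a toy expression tree in plain Scala with a bottom-up rewrite driven by pattern matching. It is not Catalyst code: ToyOptimizer, Expr, Num, Col, and the other names are invented for this sketch, and Catalyst's real expression classes and rule batches are far richer.

```scala
// Toy illustration of rule-based simplification via pattern matching (not Catalyst itself).
object ToyOptimizer {
  sealed trait Expr
  final case class Num(value: Long) extends Expr
  final case class Col(name: String) extends Expr
  final case class Bool(value: Boolean) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Rewrite the children first, then try the rules on the rebuilt node.
  def optimize(e: Expr): Expr = e match {
    case Add(l, r) => rule(Add(optimize(l), optimize(r)))
    case Mul(l, r) => rule(Mul(optimize(l), optimize(r)))
    case And(l, r) => rule(And(optimize(l), optimize(r)))
    case leaf      => leaf
  }

  private def rule(e: Expr): Expr = e match {
    case Add(Num(a), Num(b)) => Num(a + b)   // constant folding
    case Mul(Num(a), Num(b)) => Num(a * b)   // constant folding
    case Mul(x, Num(1L))     => x            // x * 1 => x
    case And(Bool(true), x)  => x            // true AND x => x
    case And(x, Bool(true))  => x            // x AND true => x
    case And(Bool(false), _) => Bool(false)  // false AND x => false
    case And(_, Bool(false)) => Bool(false)  // x AND false => false
    case other               => other
  }

  def main(args: Array[String]): Unit = {
    // amountPaid * (0 + 1) folds to amountPaid * 1, which then simplifies to amountPaid
    println(optimize(Mul(Col("amountPaid"), Add(Num(0L), Num(1L)))))
  }
}
```

Each rule is a single case clause, which is the conciseness that pattern matching brings to writing optimizer rules.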

/3 physical planning

e.g. SortExec.

/4 Codegen

Codegen uses Scala's string interpolation to generate source code, which Janino then compiles into Java bytecode, e.g. SortExec.

2. Custom optimizer rules

/1 Implementation

Extend Rule[LogicalPlan]:

import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MultiplyOptimizationRule extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
     // Rewrite `x * 1.0` to `x`; matching on the value's type avoids a failed cast on non-Double literals
     case Multiply(left, Literal(value: Double, _)) if value == 1.0 =>
       println("=========> optimization of one applied")
       left
   }
}

/2 Registration

spark.experimental.extraOptimizations = Seq(MultiplyOptimizationRule)

/3 Usage

val multipliedDFWithOptimization = df.selectExpr("amountPaid * 1")
println("after optimization")


3. Custom execution plans

/1 Physical plan

Extend SparkPlan and implement the doExecute method.

/2 Logical plan

Extend SparkStrategy and implement apply.

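import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LocalRelation, LogicalPlan, Project, Union}
import org.apache.spark.sql.execution.{LocalTableScanExec, SparkPlan, UnionExec}

case class FastOperator(output: Seq[Attribute], child: SparkPlan) extends SparkPlan {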

 override def children: Seq[SparkPlan] = Nil

 override protected def doExecute(): RDD[InternalRow] = {
   val row = org.apache.spark.sql.Row("hi",12L)
   val unsafeRow = toUnsafeRow(row, Array(org.apache.spark.sql.types.StringType,org.apache.spark.sql.types.LongType))
   sparkContext.parallelize(Seq(unsafeRow),1)
 }

 def toUnsafeRow(row: org.apache.spark.sql.Row, schema: Array[org.apache.spark.sql.types.DataType]): org.apache.spark.sql.catalyst.expressions.UnsafeRow = {
   val converter = unsafeRowConverter(schema)
   converter(row)
 }

 def unsafeRowConverter(schema: Array[org.apache.spark.sql.types.DataType]): org.apache.spark.sql.Row => org.apache.spark.sql.catalyst.expressions.UnsafeRow = {
   val converter = org.apache.spark.sql.catalyst.expressions.UnsafeProjection.create(schema)
   (row: org.apache.spark.sql.Row) => {
     converter(org.apache.spark.sql.catalyst.CatalystTypeConverters.convertToCatalyst(row).asInstanceOf[org.apache.spark.sql.catalyst.InternalRow])
   }
 }
}
case object NeverPlanned extends LeafNode {
 override def output: Seq[Attribute] = Nil
}

object TestStrategy extends Strategy {
 def apply(plan: LogicalPlan): Seq[SparkPlan] =
   plan match {
     case Project(pblist, child) =>
       println("mt fastOperator ------------>")
       FastOperator(pblist.map(_.toAttribute),planLater(child)) :: Nil
     case Union(children) =>
       println("mt union ========>")
       UnionExec(children.map(planLater)) :: Nil
     case LocalRelation(output, data, _) =>
       LocalTableScanExec(output, data):: Nil
     case _ => Nil
 }
}

/3 Register it with Spark's execution strategies

spark.experimental.extraStrategies = Seq(TestStrategy)

/4 Usage

spark.sql("select count(*) from test")
