Spark SQL | One of the Most Active Components in the Spark Community

Published by 大資料學習與分享 (Big Data Learning and Sharing) on 2020-11-24

Original URL: https://www.cnblogs.com/bigdatalearnshare/p/14014931.html

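Spark SQL is the Spark component for processing structured data. Its predecessor was Shark, but Shark relied too heavily on Hive (for example, it reused Hive's syntax parser and query optimizer), which limited integration among Spark's components. Spark SQL was created to address this.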

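Spark SQL kept many of Shark's strengths, such as in-memory columnar storage and Hive compatibility, but was rebuilt so that it no longer depends on Hive while remaining compatible with it. Beyond in-memory columnar storage, it introduced bytecode generation and uses cost-based optimization (CBO) and rule-based optimization (RBO) to evaluate queries and arrive at the best logical and physical plans. With these optimizations, Spark SQL performs noticeably better than earlier SQL-on-Hadoop technologies.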

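Spark SQL also supports many data sources, such as JDBC, HDFS, and HBase. Its internal components, such as the SQL parser and analyzer, can be redefined and extended to better fit different business scenarios. It integrates seamlessly with Spark Core, provides DataSet/DataFrame as a programmable data abstraction, and can be regarded as a distributed SQL query engine.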

DataSet/DataFrame

DataSet and DataFrame are both distributed datasets provided by Spark SQL. Compared with an RDD, they record the table's schema in addition to the data itself.

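DataSet is a distributed dataset available since Spark 1.6. It has RDD traits such as strong typing and support for powerful lambda expressions, and it runs on Spark SQL's optimized execution engine. The DataSet API supports Scala and Java but not Python. Because Python is dynamic, however, it still gets many of the DataSet API's benefits (for example, you can read a field from a Row by column name, row.columnName). The same holds for R.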

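A DataFrame is a DataSet organized into named columns, similar to a table in an RDBMS or a data frame in R or Python. The DataFrame API supports Scala, Java, Python, and R. In the Scala API, DataFrame is simply a type alias for a Dataset of Row:

type DataFrame = Dataset[Row]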

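A DataFrame does not check field types at compile time; they are checked at run time. A DataSet is the opposite because it is strongly typed. Both use Catalyst to parse and optimize SQL. For convenience, the rest of this article uses DataSet to refer to both.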
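The compile-time versus run-time difference can be sketched as follows. This is a minimal illustration, not code from the original article: the Person case class, the sample rows, and the misspelled column name agee are all made up, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for this illustration
case class Person(id: Int, name: String, age: Int)

object TypedVsUntyped {
  // Typed API: p.age is checked by the Scala compiler; p.agee would not compile.
  def adultNames(ds: Dataset[Person]): Seq[String] =
    ds.filter(p => p.age > 18).collect().map(_.name).toSeq

  // Untyped API: a column name is only resolved when Spark analyzes the plan,
  // so a misspelled column compiles fine and fails at run time.
  def misspelledColumnFails(df: DataFrame): Boolean =
    try {
      df.select("agee")
      false
    } catch {
      case _: AnalysisException => true
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-vs-untyped")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person(1, "Ann", 30), Person(2, "Bob", 17)).toDS()
    val df: DataFrame = ds.toDF() // a DataFrame is a Dataset[Row]

    println(adultNames(ds))
    println(misspelledColumnFails(df))
    spark.stop()
  }
}
```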

Creating a DataSet

A DataSet is usually created by loading external data or by converting an RDD.

1. Loading external data

Using JSON and MySQL as examples:

val ds = sparkSession.read.json("/path/to/people.json")

// The JDBC password is passed with the "password" option key
val jdbcDs = sparkSession.read.format("jdbc")
  .options(Map("url" -> "jdbc:mysql://ip:port/db",
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "tableName", "user" -> "root", "password" -> "123")).load()

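2. Converting an RDD to a DataSet

The key to creating a DataSet from an RDD is giving the RDD a schema. There are usually two ways to do this (pseudocode):

1. Define a case class and let reflection infer the schema

1) Load a file from HDFS as a plain RDD
val lineRDD = sparkContext.textFile("hdfs://ip:port/person.txt").map(_.split(" "))

2) Define a case class (equivalent to the table's schema)
case class Person(id:Int, name:String, age:Int)

3) Map the RDD to the case class
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

4) Convert the RDD to a DataFrame
import sparkSession.implicits._ // required for toDF
val ds = personRDD.toDF()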
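The four steps above can be put together into a small runnable program. This is only a sketch under stated assumptions: it replaces the HDFS file with made-up in-memory lines so it can run locally, and the object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Same shape as the Person case class in the steps above
case class Person(id: Int, name: String, age: Int)

object RddToDataFrameByReflection {
  // Parses space-separated "id name age" lines into a DataFrame.
  def toPeopleDF(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._ // brings toDF into scope for RDDs of case classes
    val lineRDD = spark.sparkContext.parallelize(lines).map(_.split(" "))
    val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
    personRDD.toDF() // column names and types come from the case class fields
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-to-df-reflection")
      .master("local[*]")
      .getOrCreate()
    // Made-up sample rows standing in for the contents of person.txt
    val df = toPeopleDF(spark, Seq("1 Ann 30", "2 Bob 17"))
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```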

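2. Define a StructType schema manually and apply it directly to the RDD

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schemaString = "name age"

val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// peopleRdd is assumed to be an RDD[Array[String]], e.g. from textFile(...).map(_.split(" "))
val rowRdd = peopleRdd.map(p => Row(p(0), p(1)))

val ds = sparkSession.createDataFrame(rowRdd, schema)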
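A runnable variant of the manual-schema approach might look like the sketch below. It differs from the snippet above in one deliberate way: age is declared as IntegerType rather than StringType, to show that the values placed in each Row must match the declared types. The sample lines and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RddToDataFrameBySchema {
  // Explicit schema: name is a string, age is an integer
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  // Parses space-separated "name age" lines into a DataFrame with the schema above.
  def toPeopleDF(spark: SparkSession, lines: Seq[String]): DataFrame = {
    val rowRdd = spark.sparkContext.parallelize(lines)
      .map(_.split(" "))
      .map(p => Row(p(0), p(1).toInt)) // values must match the schema's types
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-to-df-schema")
      .master("local[*]")
      .getOrCreate()
    val df = toPeopleDF(spark, Seq("Ann 30", "Bob 17")) // made-up sample rows
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```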

Two Syntax Styles for Working with a DataSet

DSL syntax

1. Select some of the DataSet's columns

personDS.select(col("name"))

personDS.select(col("name"), col("age"))

2. Select name, age, and salary for all rows, adding 1000 to salary

personDS.select(col("name"), col("age"), col("salary") + 1000)

personDS.select(personDS("name"), personDS("age"), personDS("salary") + 1000)

3. Keep only the rows where age is greater than 18

personDS.filter(col("age") > 18)

4. Group by age and count the people at each age

personDS.groupBy("age").count()

Note: using the col method directly requires import org.apache.spark.sql.functions._
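The DSL operations above can be tried end to end with a small in-memory dataset. The following is a sketch under assumptions: the Employee case class, its sample rows, and the helper names are invented for illustration and are not part of the original article.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type with the three columns used in the DSL examples
case class Employee(name: String, age: Int, salary: Long)

object DslExample {
  // Select name and salary, adding 1000 to salary
  def raisedSalaries(df: DataFrame): DataFrame =
    df.select(col("name"), (col("salary") + 1000).as("salary"))

  // Keep rows where age is greater than 18
  def adults(df: DataFrame): DataFrame = df.filter(col("age") > 18)

  // Count the rows at each age
  def countByAge(df: DataFrame): DataFrame = df.groupBy("age").count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dsl-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data
    val personDS = Seq(
      Employee("Ann", 30, 5000L),
      Employee("Bob", 17, 1000L),
      Employee("Cid", 30, 4000L)
    ).toDF()

    raisedSalaries(personDS).show()
    adults(personDS).show()
    countByAge(personDS).show()
    spark.stop()
  }
}
```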

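SQL syntax

To use SQL-style syntax, first register the DataSet as a temporary view (registerTempTable, seen in older code, has been deprecated since Spark 2.0):

personDS.createOrReplaceTempView("person")

// Query the two oldest people
val result = sparkSession.sql("select * from person order by age desc limit 2")

// Save the result as JSON. Note: if no format is specified, the default is parquet
result.write.format("json").save("hdfs://ip:port/res2")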
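The SQL-style flow can also be run locally. The sketch below makes several assumptions that are not in the original: the sample rows are invented, the output goes to a temporary local directory instead of HDFS, and the object and method names are hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlExample {
  // Registers the input as the temporary view "person" and returns the two oldest rows.
  def oldestTwo(spark: SparkSession, people: DataFrame): DataFrame = {
    people.createOrReplaceTempView("person")
    spark.sql("select * from person order by age desc limit 2")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data
    val personDS = Seq(("Ann", 30), ("Bob", 17), ("Cid", 45)).toDF("name", "age")
    val result = oldestTwo(spark, personDS)
    result.show()

    // Write to a fresh path under a temporary directory (save fails if the path exists)
    val outPath = Files.createTempDirectory("spark-sql-example").resolve("res").toString
    result.write.format("json").save(outPath)

    // Read the JSON back to confirm what was written
    spark.read.json(outPath).show()
    spark.stop()
  }
}
```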

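Ways to Use Spark SQL

1. Interactive queries in the spark-sql shell

This means running SQL from the shell command line that Spark provides.

2. Programmatic use

First obtain the entry point for Spark SQL programming: SparkSession (in earlier versions you may be more familiar with SQLContext, or HiveContext when working with Hive). Here we read parquet as an example: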

val sparkSession = SparkSession.builder()
  .appName("example").master("local[*]").getOrCreate()

val df = sparkSession.read.format("parquet").load("/path/to/parquet-file")

You can then apply your business logic to df.

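3. Connecting through the Thrift server with the beeline client

Start the Spark SQL Thrift service with sbin/start-thriftserver.sh, setting the Spark cluster resources, addresses, and similar options in the start script. Then connect to the Thrift service with beeline to process data.

Accessing the Spark SQL Thrift service with the hive-jdbc driver

Add the driver dependency to the project's pom file, just as you would for a JDBC data source such as MySQL. Example: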

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://ip:port", "root", "123")
try {
  val stat = conn.createStatement()
  val res = stat.executeQuery("select * from people limit 1")
  while (res.next()) {
    println(res.getString("name"))
  }
} catch {
  case e: Exception => e.printStackTrace()
} finally{
  if(conn!=null) conn.close()
}

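Reading Hive Data with Spark SQL

The key to reading Hive data with Spark SQL is exposing Hive's metadata to Spark as a service. Besides connecting to Hive through the Thrift server over JDBC as shown above, you can also use the following approach.
First, edit $HIVE_HOME/conf/hive-site.xml and add the following: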

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://ip:port</value>
</property>

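Then start the Hive metastore.

Finally, copy or symlink hive-site.xml into $SPARK_HOME/conf/. If Hive's metadata is stored in MySQL, also put the MySQL JDBC driver jar (for example mysql-connector-java-5.1.12.jar) under $SPARK_HOME/lib/ (on Spark 2.x and later the directory is $SPARK_HOME/jars/). You can then start spark-sql and work with the databases and tables in Hive. In this setup, obtain a SparkSession that uses the Hive metadata like this: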

val spark = SparkSession.builder()
  .config(sparkConf).enableHiveSupport().getOrCreate()

UDF, UDAF, and Aggregator

UDF

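A UDF is the most basic kind of user-defined function. As an example, define a UDF that returns the length of a string:

import org.apache.spark.sql.functions.udf

val udf_str_length = udf{(str:String) => str.length}
sparkSession.udf.register("str_length", udf_str_length)
val ds = sparkSession.read.json("/path/to/people.json")
ds.createOrReplaceTempView("people")
sparkSession.sql("select str_length(address) from people")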
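One thing the snippet above does not handle is a null address, for which str.length would throw a NullPointerException. A null-safe variant could be sketched as follows. It is an illustration under assumptions: the sample rows, the name column, and the object and method names are invented, and returning an Option from the UDF is used here so that a null input yields SQL NULL.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

object UdfExample {
  // Null-safe string length: a null input produces None, which Spark maps to SQL NULL
  val strLength = udf((s: String) => Option(s).map(_.length))

  // Registers the UDF and the view, then applies the UDF through SQL.
  def addressLengths(spark: SparkSession, people: DataFrame): DataFrame = {
    spark.udf.register("str_length", strLength)
    people.createOrReplaceTempView("people")
    spark.sql("select name, str_length(address) as len from people order by name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; the second row has a null address
    val people = Seq(("Ann", "12 Main St"), ("Bob", null)).toDF("name", "address")
    addressLengths(spark, people).show()
    spark.stop()
  }
}
```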

UDAF

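To define a UDAF, extend the abstract class UserDefinedAggregateFunction. It is weakly typed (untyped), whereas the Aggregator shown later is strongly typed. Note that UserDefinedAggregateFunction has been deprecated since Spark 3.0 in favor of Aggregator. Taking an average as the example: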

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._

object MyAverage extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

// Register the function to access it
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()

Aggregator

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate values
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the output of the reduction
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Specifies the Encoder for the intermediate value type
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Specifies the Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

import spark.implicits._ // provides the Encoder needed by .as[Employee]
val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()

Spark SQL Compared with Hive
