前世今生

Hive&Shark

隨著大資料時代的來臨，Hadoop風靡一時。為了使熟悉RDBMS但又不理解MapReduce的技術人員快速進行大資料開發，Hive應運而生。Hive是當時唯一執行在Hadoop上的SQL-on-Hadoop工具。

但是MapReduce計算過程中大量的中間磁碟落地過程消耗了大量的I/O，降低的執行效率。為了提高SQL-on-Hadoop的效率，大量的SQL-on-Hadoop工具開始產生，其中表現較為突出的是：

MapR的Drill
Cloudera的Impala
Shark

Shark是伯克利實驗室Spark生態的元件之一，它修改了Hive Driver的記憶體管理、物理計劃、執行三個模組，使之能執行在Spark引擎上，從而使得SQL查詢的速度得到10-100倍的提升。

Hive&Shark.jpg

Shark&Spark SQL

Shark對於Hive的太多依賴（如採用Hive的語法解析器、查詢最佳化器等等），制約了Spark的One Stack Rule Them All的既定方針，制約了Spark各個元件的相互整合，所以提出了SparkSQL專案。

SparkSQL拋棄原有Shark的程式碼，汲取了Shark的一些優點，如記憶體列儲存（In-Memory Columnar Storage）、Hive相容性等，重新開發了SparkSQL程式碼。由於擺脫了對Hive的依賴性，SparkSQL無論在資料相容、效能最佳化、元件擴充套件方面都得到了極大地提升。

資料相容方面

不但相容Hive，還可以從RDD、parquet檔案、JSON檔案中獲取資料，也支援獲取RDBMS資料以及cassandra等NOSQL資料。

效能最佳化方面

除了採取In-Memory Columnar Storage、byte-code generation等最佳化技術外,引進Cost Model對查詢進行動態評估、獲取最佳物理計劃等。

元件擴充套件方面

無論是SQL的語法解析器、分析器還是最佳化器都可以重新定義，進行擴充套件。

2014年Shark停止開發，團隊將所有資源放SparkSQL專案上，至此，Shark的發展畫上了句號，但也因此發展出兩條線：SparkSQL和Hive on Spark。

Shark&SparkSQL.jpg

其中SparkSQL作為Spark生態的一員繼續發展，而不再受限於Hive，只是相容Hive；而Hive on Spark是一個Hive的發展計劃，該計劃將Spark作為Hive的底層引擎之一，也就是說，Hive將不再受限於一個引擎，可以採用Map-Reduce、Tez、Spark等引擎。

簡介

Spark SQL是一個用於結構化資料處理的模組。Spark SQL賦予待處理資料一些結構化資訊，可以使用SQL語句或DataSet API介面與Spark SQL進行互動。

SQL

Spark SQL可以使用sql讀寫Hive中的資料；也可以在程式語言中使用sql，返回Dataset/DataFrame結果集。

DataSets&DataFrames

Dataset是一個分散式資料集，它結合了RDD與SparkSQL執行引擎的優點。Dataset可以透過JVM物件構造，然後使用運算元操作進行處理。Java和Scala都有Dataset API；Python和R本身支援Dataset特性。

DataFrame是一個二維結構的DataSet，相當於RDBMS中的表。DataFrame可以有多種方式構造，比如結構化資料檔案、hive表、外部資料庫、RDD等。在Scala、Java、Python及R中都有DataFrame API。

DataFrame與DataSet

DataFrame建立及操作

scala

import org.apache.spark.sql.SparkSession// 構造SparkSessionval spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()// 建立DataFrameval df = spark.read.json("examples/src/main/resources/people.json")// Displays the content of the DataFrame to stdoutdf.show()// +----+-------+// | age|   name|// +----+-------+// |null|Michael|// |  30|   Andy|// |  19| Justin|// +----+-------+// DataFrame操作// This import is needed to use the $-notationimport spark.implicits._// Print the schema in a tree formatdf.printSchema()// root// |-- age: long (nullable = true)// |-- name: string (nullable = true)// Select only the "name" columndf.select("name").show()// +-------+// |   name|// +-------+// |Michael|// |   Andy|// | Justin|// +-------+// Select everybody, but increment the age by 1df.select($"name", $"age" + 1).show()// +-------+---------+// |   name|(age + 1)|// +-------+---------+// |Michael|     null|// |   Andy|       31|// | Justin|       20|// +-------+---------+// Select people older than 21df.filter($"age" > 21).show()// +---+----+// |age|name|// +---+----+// | 30|Andy|// +---+----+// Count people by agedf.groupBy("age").count().show()// +----+-----+// | age|count|// +----+-----+// |  19|    1|// |null|    1|// |  30|    1|// +----+-----+

java

import org.apache.spark.sql.SparkSession;//構造SparkSessionSparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();  
//建立DataFrameimport org.apache.spark.sql.Dataset;import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");// Displays the content of the DataFrame to stdoutdf.show();// +----+-------+// | age|   name|// +----+-------+// |null|Michael|// |  30|   Andy|// |  19| Justin|// +----+-------+//DataFrame操作// col("...") is preferable to df.col("...")import static org.apache.spark.sql.functions.col;// Print the schema in a tree formatdf.printSchema();// root// |-- age: long (nullable = true)// |-- name: string (nullable = true)// Select only the "name" columndf.select("name").show();// +-------+// |   name|// +-------+// |Michael|// |   Andy|// | Justin|// +-------+// Select everybody, but increment the age by 1df.select(col("name"), col("age").plus(1)).show();// +-------+---------+// |   name|(age + 1)|// +-------+---------+// |Michael|     null|// |   Andy|       31|// | Justin|       20|// +-------+---------+// Select people older than 21df.filter(col("age").gt(21)).show();// +---+----+// |age|name|// +---+----+// | 30|Andy|// +---+----+// Count people by agedf.groupBy("age").count().show();// +----+-----+// | age|count|// +----+-----+// |  19|    1|// |null|    1|// |  30|    1|// +----+-----+

python

from pyspark.sql import SparkSession# 構造SparkSessionspark = SparkSession 
    .builder 
    .appName("Python Spark SQL basic example") 
    .config("spark.some.config.option", "some-value") 
    .getOrCreate()    
# 建立DataFrame# spark is an existing SparkSessiondf = spark.read.json("examples/src/main/resources/people.json")# Displays the content of the DataFrame to stdoutdf.show()# +----+-------+# | age|   name|# +----+-------+# |null|Michael|# |  30|   Andy|# |  19| Justin|# +----+-------+# DataFrame操作# spark, df are from the previous example# Print the schema in a tree formatdf.printSchema()# root# |-- age: long (nullable = true)# |-- name: string (nullable = true)# Select only the "name" columndf.select("name").show()# +-------+# |   name|# +-------+# |Michael|# |   Andy|# | Justin|# +-------+# Select everybody, but increment the age by 1df.select(df['name'], df['age'] + 1).show()# +-------+---------+# |   name|(age + 1)|# +-------+---------+# |Michael|     null|# |   Andy|       31|# | Justin|       20|# +-------+---------+# Select people older than 21df.filter(df['age'] > 21).show()# +---+----+# |age|name|# +---+----+# | 30|Andy|# +---+----+# Count people by agedf.groupBy("age").count().show()# +----+-----+# | age|count|# +----+-----+# |  19|    1|# |null|    1|# |  30|    1|# +----+-----+

DataSet建立及操作

Datasets和RDD類似，但使用專門的Encoder編碼器來序列化需要經過網路傳輸的資料物件，而不用RDD使用的Java序列化或Kryo庫。Encoder編碼器是動態生成的程式碼，允許直接執行各種運算元操作，而不用反序列化。

scala

// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,// you can use custom classes that implement the Product interfacecase class Person(name: String, age: Long)// Encoders are created for case classesval caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()// +----+---+// |name|age|// +----+---+// |Andy| 32|// +----+---+// Encoders for most common types are automatically provided by importing spark.implicits._val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by nameval path = "examples/src/main/resources/people.json"val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()// +----+-------+// | age|   name|// +----+-------+// |null|Michael|// |  30|   Andy|// |  19| Justin|// +----+-------+

java

import java.util.Arrays;import java.util.Collections;import java.io.Serializable;import org.apache.spark.api.java.function.MapFunction;import org.apache.spark.sql.Dataset;import org.apache.spark.sql.Row;import org.apache.spark.sql.Encoder;import org.apache.spark.sql.Encoders;public static class Person implements Serializable {  private String name;  private int age;  public String getName() {    return name;
  }  public void setName(String name) {    this.name = name;
  }  public int getAge() {    return age;
  }  public void setAge(int age) {    this.age = age;
  }
}// Create an instance of a Bean classPerson person = new Person();
person.setName("Andy");
person.setAge(32);// Encoders are created for Java beansEncoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
  Collections.singletonList(person),
  personEncoder
);
javaBeanDS.show();// +---+----+// |age|name|// +---+----+// | 32|Andy|// +---+----+// Encoders for most common types are provided in class EncodersEncoder<Integer> integerEncoder = Encoders.INT();
Dataset<Integer> primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder);
Dataset<Integer> transformedDS = primitiveDS.map(
    (MapFunction<Integer, Integer>) value -> value + 1,
    integerEncoder);
transformedDS.collect(); // Returns [2, 3, 4]// DataFrames can be converted to a Dataset by providing a class. Mapping based on nameString path = "examples/src/main/resources/people.json";
Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);
peopleDS.show();// +----+-------+// | age|   name|// +----+-------+// |null|Michael|// |  30|   Andy|// |  19| Justin|// +----+-------+

SQL操作

scala

// Register the DataFrame as a SQL temporary viewdf.createOrReplaceTempView("people")//df.createGlobalTempView("people")val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()// +----+-------+// | age|   name|// +----+-------+// |null|Michael|// |  30|   Andy|// |  19| Justin|// +----+-------+

java

import org.apache.spark.sql.Dataset;import org.apache.spark.sql.Row;// Register the DataFrame as a SQL temporary viewdf.createOrReplaceTempView("people");//df.createGlobalTempView("people")Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();// +----+-------+// | age|   name|// +----+-------+// |null|Michael|// |  30|   Andy|// |  19| Justin|// +----+-------+

python

# Register the DataFrame as a SQL temporary viewdf.createOrReplaceTempView("people")# df.createGlobalTempView("people")sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()# +----+-------+# | age|   name|# +----+-------+# |null|Michael|# |  30|   Andy|# |  19| Justin|# +----+-------+

作者：java大資料程式設計
連結：

10.spark sql之快速入門