10.spark sql之快速入門
前世今生
Hive&Shark
隨著大資料時代的來臨,Hadoop風靡一時。為了使熟悉RDBMS但又不理解MapReduce的技術人員快速進行大資料開發,Hive應運而生。Hive是當時唯一執行在Hadoop上的SQL-on-Hadoop工具。
但是MapReduce計算過程中大量的中間磁碟落地過程消耗了大量的I/O,降低的執行效率。為了提高SQL-on-Hadoop的效率,大量的SQL-on-Hadoop工具開始產生,其中表現較為突出的是:
MapR的Drill
Cloudera的Impala
Shark
Shark是伯克利實驗室Spark生態的元件之一,它修改了Hive Driver的記憶體管理、物理計劃、執行三個模組,使之能執行在Spark引擎上,從而使得SQL查詢的速度得到10-100倍的提升。
Hive&Shark.jpg
Shark&Spark SQL
Shark對於Hive的太多依賴(如採用Hive的語法解析器、查詢最佳化器等等),制約了Spark的One Stack Rule Them All的既定方針,制約了Spark各個元件的相互整合,所以提出了SparkSQL專案。
SparkSQL拋棄原有Shark的程式碼,汲取了Shark的一些優點,如記憶體列儲存(In-Memory Columnar Storage)、Hive相容性等,重新開發了SparkSQL程式碼。由於擺脫了對Hive的依賴性,SparkSQL無論在資料相容、效能最佳化、元件擴充套件方面都得到了極大地提升。
資料相容方面
不但相容Hive,還可以從RDD、parquet檔案、JSON檔案中獲取資料,也支援獲取RDBMS資料以及cassandra等NOSQL資料。
效能最佳化方面
除了採取In-Memory Columnar Storage、byte-code generation等最佳化技術外,引進Cost Model對查詢進行動態評估、獲取最佳物理計劃等。
元件擴充套件方面
無論是SQL的語法解析器、分析器還是最佳化器都可以重新定義,進行擴充套件。
2014年Shark停止開發,團隊將所有資源放SparkSQL專案上,至此,Shark的發展畫上了句號,但也因此發展出兩條線:SparkSQL和Hive on Spark。
Shark&SparkSQL.jpg
其中SparkSQL作為Spark生態的一員繼續發展,而不再受限於Hive,只是相容Hive;而Hive on Spark是一個Hive的發展計劃,該計劃將Spark作為Hive的底層引擎之一,也就是說,Hive將不再受限於一個引擎,可以採用Map-Reduce、Tez、Spark等引擎。
簡介
Spark SQL是一個用於結構化資料處理的模組。Spark SQL賦予待處理資料一些結構化資訊,可以使用SQL語句或DataSet API介面與Spark SQL進行互動。
SQL
Spark SQL可以使用sql讀寫Hive中的資料;也可以在程式語言中使用sql,返回Dataset/DataFrame結果集。
DataSets&DataFrames
Dataset是一個分散式資料集,它結合了RDD與SparkSQL執行引擎的優點。Dataset可以透過JVM物件構造,然後使用運算元操作進行處理。Java和Scala都有Dataset API;Python和R本身支援Dataset特性。
DataFrame是一個二維結構的DataSet,相當於RDBMS中的表。DataFrame可以有多種方式構造,比如結構化資料檔案、hive表、外部資料庫、RDD等。在Scala、Java、Python及R中都有DataFrame API。
DataFrame與DataSet
DataFrame建立及操作
scala
import org.apache.spark.sql.SparkSession// 構造SparkSessionval spark = SparkSession .builder() .appName("Spark SQL basic example") .config("spark.some.config.option", "some-value") .getOrCreate()// 建立DataFrameval df = spark.read.json("examples/src/main/resources/people.json")// Displays the content of the DataFrame to stdoutdf.show()// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+// DataFrame操作// This import is needed to use the $-notationimport spark.implicits._// Print the schema in a tree formatdf.printSchema()// root// |-- age: long (nullable = true)// |-- name: string (nullable = true)// Select only the "name" columndf.select("name").show()// +-------+// | name|// +-------+// |Michael|// | Andy|// | Justin|// +-------+// Select everybody, but increment the age by 1df.select($"name", $"age" + 1).show()// +-------+---------+// | name|(age + 1)|// +-------+---------+// |Michael| null|// | Andy| 31|// | Justin| 20|// +-------+---------+// Select people older than 21df.filter($"age" > 21).show()// +---+----+// |age|name|// +---+----+// | 30|Andy|// +---+----+// Count people by agedf.groupBy("age").count().show()// +----+-----+// | age|count|// +----+-----+// | 19| 1|// |null| 1|// | 30| 1|// +----+-----+
java
import org.apache.spark.sql.SparkSession;//構造SparkSessionSparkSession spark = SparkSession .builder() .appName("Java Spark SQL basic example") .config("spark.some.config.option", "some-value") .getOrCreate(); //建立DataFrameimport org.apache.spark.sql.Dataset;import org.apache.spark.sql.Row; Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");// Displays the content of the DataFrame to stdoutdf.show();// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+//DataFrame操作// col("...") is preferable to df.col("...")import static org.apache.spark.sql.functions.col;// Print the schema in a tree formatdf.printSchema();// root// |-- age: long (nullable = true)// |-- name: string (nullable = true)// Select only the "name" columndf.select("name").show();// +-------+// | name|// +-------+// |Michael|// | Andy|// | Justin|// +-------+// Select everybody, but increment the age by 1df.select(col("name"), col("age").plus(1)).show();// +-------+---------+// | name|(age + 1)|// +-------+---------+// |Michael| null|// | Andy| 31|// | Justin| 20|// +-------+---------+// Select people older than 21df.filter(col("age").gt(21)).show();// +---+----+// |age|name|// +---+----+// | 30|Andy|// +---+----+// Count people by agedf.groupBy("age").count().show();// +----+-----+// | age|count|// +----+-----+// | 19| 1|// |null| 1|// | 30| 1|// +----+-----+
python
from pyspark.sql import SparkSession# 構造SparkSessionspark = SparkSession .builder .appName("Python Spark SQL basic example") .config("spark.some.config.option", "some-value") .getOrCreate() # 建立DataFrame# spark is an existing SparkSessiondf = spark.read.json("examples/src/main/resources/people.json")# Displays the content of the DataFrame to stdoutdf.show()# +----+-------+# | age| name|# +----+-------+# |null|Michael|# | 30| Andy|# | 19| Justin|# +----+-------+# DataFrame操作# spark, df are from the previous example# Print the schema in a tree formatdf.printSchema()# root# |-- age: long (nullable = true)# |-- name: string (nullable = true)# Select only the "name" columndf.select("name").show()# +-------+# | name|# +-------+# |Michael|# | Andy|# | Justin|# +-------+# Select everybody, but increment the age by 1df.select(df['name'], df['age'] + 1).show()# +-------+---------+# | name|(age + 1)|# +-------+---------+# |Michael| null|# | Andy| 31|# | Justin| 20|# +-------+---------+# Select people older than 21df.filter(df['age'] > 21).show()# +---+----+# |age|name|# +---+----+# | 30|Andy|# +---+----+# Count people by agedf.groupBy("age").count().show()# +----+-----+# | age|count|# +----+-----+# | 19| 1|# |null| 1|# | 30| 1|# +----+-----+
DataSet建立及操作
Datasets和RDD類似,但使用專門的Encoder編碼器來序列化需要經過網路傳輸的資料物件,而不用RDD使用的Java序列化或Kryo庫。Encoder編碼器是動態生成的程式碼,允許直接執行各種運算元操作,而不用反序列化。
scala
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,// you can use custom classes that implement the Product interfacecase class Person(name: String, age: Long)// Encoders are created for case classesval caseClassDS = Seq(Person("Andy", 32)).toDS() caseClassDS.show()// +----+---+// |name|age|// +----+---+// |Andy| 32|// +----+---+// Encoders for most common types are automatically provided by importing spark.implicits._val primitiveDS = Seq(1, 2, 3).toDS() primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by nameval path = "examples/src/main/resources/people.json"val peopleDS = spark.read.json(path).as[Person] peopleDS.show()// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+
java
import java.util.Arrays;import java.util.Collections;import java.io.Serializable;import org.apache.spark.api.java.function.MapFunction;import org.apache.spark.sql.Dataset;import org.apache.spark.sql.Row;import org.apache.spark.sql.Encoder;import org.apache.spark.sql.Encoders;public static class Person implements Serializable { private String name; private int age; public String getName() { return name; } public void setName(String name) { this.name = name; } public int getAge() { return age; } public void setAge(int age) { this.age = age; } }// Create an instance of a Bean classPerson person = new Person(); person.setName("Andy"); person.setAge(32);// Encoders are created for Java beansEncoder<Person> personEncoder = Encoders.bean(Person.class); Dataset<Person> javaBeanDS = spark.createDataset( Collections.singletonList(person), personEncoder ); javaBeanDS.show();// +---+----+// |age|name|// +---+----+// | 32|Andy|// +---+----+// Encoders for most common types are provided in class EncodersEncoder<Integer> integerEncoder = Encoders.INT(); Dataset<Integer> primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder); Dataset<Integer> transformedDS = primitiveDS.map( (MapFunction<Integer, Integer>) value -> value + 1, integerEncoder); transformedDS.collect(); // Returns [2, 3, 4]// DataFrames can be converted to a Dataset by providing a class. Mapping based on nameString path = "examples/src/main/resources/people.json"; Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder); peopleDS.show();// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+
SQL操作
scala
// Register the DataFrame as a SQL temporary viewdf.createOrReplaceTempView("people")//df.createGlobalTempView("people")val sqlDF = spark.sql("SELECT * FROM people") sqlDF.show()// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+
java
import org.apache.spark.sql.Dataset;import org.apache.spark.sql.Row;// Register the DataFrame as a SQL temporary viewdf.createOrReplaceTempView("people");//df.createGlobalTempView("people")Dataset<Row> sqlDF = spark.sql("SELECT * FROM people"); sqlDF.show();// +----+-------+// | age| name|// +----+-------+// |null|Michael|// | 30| Andy|// | 19| Justin|// +----+-------+
python
# Register the DataFrame as a SQL temporary viewdf.createOrReplaceTempView("people")# df.createGlobalTempView("people")sqlDF = spark.sql("SELECT * FROM people") sqlDF.show()# +----+-------+# | age| name|# +----+-------+# |null|Michael|# | 30| Andy|# | 19| Justin|# +----+-------+
作者:java大資料程式設計
連結:
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/758/viewspace-2815215/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- SQL快速入門 ( MySQL快速入門, MySQL參考, MySQL快速回顧 )MySql
- flink快速入門(部署+flink-sql)SQL
- Flask二之快速入門Flask
- FreeMarker 之快速入門Demo
- MyBatis 框架之快速入門程式MyBatis框架
- 資料分析師之SQL入門SQL
- [大資料之Spark]——快速入門大資料Spark
- SQL 入門SQL
- Web開發初探之JavaScript 快速入門WebJavaScript
- C#快速入門教程(27)—— SQL Server資料庫C#SQLServer資料庫
- 快速排序快速入門排序
- Python快速入門之基礎知識(一)Python
- JavaScript快速入門JavaScript
- vim快速入門
- Webpack快速入門Web
- Lumen快速入門
- TypeScript 快速入門TypeScript
- phpunit 快速入門PHP
- React快速入門React
- WebSocket 快速入門Web
- Pipenv 快速入門
- MQTT 快速入門MQQT
- Zookeeper快速入門
- DvaJS快速入門JS
- SnakeYaml快速入門YAML
- RabbitMQ快速入門MQ
- 快速入門reactReact
- pipenv快速入門
- Promise快速入門Promise
- PHP快速入門PHP
- GitHub 快速入門Github
- mongodb快速入門MongoDB
- mysqlsla快速入門MySql
- Express快速入門Express
- Python快速入門Python
- NuxtJS快速入門UXJS
- MySQL 快速入門MySql
- jackson快速入門