A Brief Introduction to Spark SQL (3) -- Loading and Saving Data

Posted by 瀛999 on 2018-08-03

Loading and Saving Data

In the simplest form, the default data source (Parquet, unless otherwise configured via spark.sql.sources.default) is used for all load and save operations:

val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

You can choose the data source explicitly when loading and saving data. Built-in data sources can be referred to by their short names, such as json, parquet, jdbc, orc, libsvm, csv, and text. The following code loads a JSON file and saves it as Parquet; loading in one format and saving in another is an easy way to convert between data formats.

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

A CSV file can be loaded as follows. option is used to configure the parser; for example, .option("sep", ";") sets the field separator to a semicolon.

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")

You can also run SQL queries directly on files:

val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

Save Modes

There are four save modes; a small usage sketch follows the list:

SaveMode.ErrorIfExists (default): throws an exception if data already exists at the destination.
SaveMode.Append: appends the new data to the existing data.
SaveMode.Overwrite: overwrites the existing data with the new data.
SaveMode.Ignore: silently ignores the new data if data already exists.
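
For example, a minimal sketch (reusing usersDF from the first example) that overwrites any existing output:

import org.apache.spark.sql.SaveMode

// Overwrite the target location if it already contains data
usersDF.select("name", "favorite_color")
  .write
  .mode(SaveMode.Overwrite)
  .save("namesAndFavColors.parquet")

The mode can also be passed as a string, e.g. .mode("overwrite").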

Saving to Persistent Tables

A DataFrame can be saved as a persistent Hive table with the saveAsTable command. When saved this way, the contents of the DataFrame are materialized, and the persistent table remains available even after Spark is restarted.

When saving data, you can also bucket (bucketBy), sort (sortBy), and partition (partitionBy) the output. Here are a few examples:

peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
usersDF.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")
usersDF
  .write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")
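
Once saved, these tables survive a Spark restart: in a later session (assuming the same metastore and warehouse location), they can be read back with spark.table. A minimal sketch using the table created above:

// Read the persisted bucketed table back as a DataFrame
val peopleBucketed = spark.table("people_bucketed")
peopleBucketed.printSchema()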

Parquet Files

Parquet is a columnar storage format. Parquet data can be read and written as follows:

// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("people.parquet")
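
Continuing the example, the loaded Parquet DataFrame can be registered as a temporary view and queried with SQL (a small sketch based on the people data above):

// Register the Parquet data as a temporary view and query it
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.show()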

Hive metastore Parquet table conversion

When reading from and writing to Hive metastore Parquet tables, Spark SQL uses its own Parquet support instead of the Hive SerDe, for better performance.

Hive and Parquet differ in two important ways:

1. Hive is case insensitive, while Parquet is not.

2. Hive considers all columns nullable, while nullability in Parquet is significant.
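
This automatic conversion is controlled by the spark.sql.hive.convertMetastoreParquet configuration; a minimal sketch of disabling it so that the Hive SerDe is used instead:

// Fall back to the Hive SerDe for metastore Parquet tables
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")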

JSON

Spark SQL can infer the schema of a JSON dataset and load it as a Dataset[Row] (a DataFrame).

// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
// Load the JSON file as a Dataset[Row] (a DataFrame)
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
// Print the inferred schema
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
// Convert a Dataset[String] holding JSON strings into a Dataset[Row]
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

Hive Tables

Spark SQL supports reading and writing data stored in Hive. Hive is configured by placing hive-site.xml, core-site.xml, and hdfs-site.xml in the conf/ directory.

import java.io.File

import org.apache.spark.sql.{Row, SaveMode, SparkSession}

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Create the table src
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
// Load data into the table
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// |    500 |
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DataFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
  case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// |               value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// |  2| val_2|  2| val_2|
// |  4| val_4|  4| val_4|
// |  5| val_5|  5| val_5|
// ...

// Create a Hive managed Parquet table, with HQL syntax instead of the Spark SQL native syntax
// `USING hive`
sql("CREATE TABLE hive_records(key int, value string) STORED AS PARQUET")
// Save DataFrame to the Hive managed table
val df = spark.table("src")
df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")
// After insertion, the Hive managed table has data now
sql("SELECT * FROM hive_records").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Prepare a Parquet data directory
val dataDir = "/tmp/parquet_data"
spark.range(10).write.parquet(dataDir)
// Create a Hive external Parquet table
sql(s"CREATE EXTERNAL TABLE hive_ints(key int) STORED AS PARQUET LOCATION '$dataDir'")
// The Hive external table should already have data
sql("SELECT * FROM hive_ints").show()
// +---+
// |key|
// +---+
// |  0|
// |  1|
// |  2|
// ...

// Turn on flag for Hive Dynamic Partitioning
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
// Create a Hive partitioned table using DataFrame API
df.write.partitionBy("key").format("hive").saveAsTable("hive_part_tbl")
// Partitioned column `key` will be moved to the end of the schema.
sql("SELECT * FROM hive_part_tbl").show()
// +-------+---+
// |  value|key|
// +-------+---+
// |val_238|238|
// | val_86| 86|
// |val_311|311|
// ...

spark.stop()

When creating a Hive table, you can use a statement like the following to define how data is read from and written to the file system; this one specifies Parquet as the file format.

CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')

The following options are available:

fileFormat: a bundle of storage format specifications, including the inputFormat, outputFormat, and serde. Currently supported values are 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
inputFormat, outputFormat: the input and output format class names; they must be specified as a pair, and cannot be used if fileFormat is already specified.
serde: the name of the serde class; it cannot be specified if the chosen fileFormat already implies a serde.
fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim: these options can only be used with the 'textfile' file format; they define how to read delimited files into rows.
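
For example, continuing from the Hive example above (where import spark.sql is in scope), a hypothetical text-format table with a custom field delimiter could be declared like this:

// Hive text table whose fields are separated by commas (table name is illustrative)
sql("""
  CREATE TABLE src_text(id INT, value STRING)
  USING hive
  OPTIONS(
    fileFormat 'textfile',
    fieldDelim ','
  )
""")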

JDBC To Other Databases

Spark SQL can also read data from other databases via JDBC. When starting the Spark shell, specify the JDBC driver on the classpath with a command like the following:

bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar

To load data from another database, you need to specify a number of JDBC connection properties, such as the user name and password.

// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods
// Loading data from a JDBC source
// Specify the URL, table name, user name, password, etc.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()

// Or load via the jdbc method with a Properties object
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")
val jdbcDF2 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
// Specifying the custom data types of the read schema
// Override the column data types in the read schema
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

// Saving data to a JDBC source
// Write the data to another database
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()

// Or write using the jdbc method
jdbcDF2.write
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

// Specifying create table column data types on write
// Specify column data types for the created table when writing
jdbcDF.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
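
Beyond the basic connection options shown above, the JDBC source also supports options for parallel, partitioned reads (partitionColumn, lowerBound, upperBound, numPartitions) and fetch tuning (fetchsize). A minimal sketch, assuming the table has a numeric id column to split on:

val jdbcParallelDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .option("partitionColumn", "id")  // assumed numeric column; not part of the original example
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "10")
  .load()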

 
