A Brief Introduction to Spark SQL (3) -- Loading and Saving Data

Posted by 瀛999 on 2018-08-03

Loading and Saving Data

In the simplest form, the default data source (Parquet, unless otherwise configured via spark.sql.sources.default) is used for all load and save operations:

val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

You can choose the data source explicitly when loading and saving data. Built-in data sources can be referred to by their short names, such as json, parquet, jdbc, orc, libsvm, csv, and text. The following code loads a JSON file and saves it as Parquet; loading in one format and saving in another is an easy way to convert between data formats.

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

A CSV file can be loaded as follows. option is used to configure the parser; for example, .option("sep", ";") sets the field separator to a semicolon.

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")

You can also run SQL queries directly on files:

val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

Save Modes

There are four save modes; a small usage sketch follows the list:

SaveMode.ErrorIfExists (default): throws an exception if data already exists at the destination.
SaveMode.Append: appends the new data to the existing data.
SaveMode.Overwrite: overwrites the existing data with the new data.
SaveMode.Ignore: silently ignores the new data if data already exists.
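
For example, a minimal sketch (reusing usersDF from the first example) that overwrites any existing output:

import org.apache.spark.sql.SaveMode

// Overwrite the target location if it already contains data
usersDF.select("name", "favorite_color")
  .write
  .mode(SaveMode.Overwrite)
  .save("namesAndFavColors.parquet")

The mode can also be passed as a string, e.g. .mode("overwrite").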

Saving to Persistent Tables

A DataFrame can be saved as a persistent Hive table with the saveAsTable command. When saved this way, the contents of the DataFrame are materialized, and the persistent table remains available even after Spark is restarted.

When saving data, you can also bucket (bucketBy), sort (sortBy), and partition (partitionBy) the output. Here are a few examples:

peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
usersDF.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")
usersDF
  .write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")
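
Once saved, these tables survive a Spark restart: in a later session (assuming the same metastore and warehouse location), they can be read back with spark.table. A minimal sketch using the table created above:

// Read the persisted bucketed table back as a DataFrame
val peopleBucketed = spark.table("people_bucketed")
peopleBucketed.printSchema()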

Parquet Files

Parquet is a columnar storage format. Parquet data can be read and written as follows:

// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("people.parquet")
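
Continuing the example, the loaded Parquet DataFrame can be registered as a temporary view and queried with SQL (a small sketch based on the people data above):

// Register the Parquet data as a temporary view and query it
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.show()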

Hive metastore Parquet table conversion

When reading from and writing to Hive metastore Parquet tables, Spark SQL uses its own Parquet support instead of the Hive SerDe, for better performance.

Hive and Parquet differ in two important ways:

1. Hive is case insensitive, while Parquet is not.

2. Hive considers all columns nullable, while nullability in Parquet is significant.
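
This automatic conversion is controlled by the spark.sql.hive.convertMetastoreParquet configuration; a minimal sketch of disabling it so that the Hive SerDe is used instead:

// Fall back to the Hive SerDe for metastore Parquet tables
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")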

JSON

Spark SQL can infer the schema of a JSON dataset and load it as a Dataset[Row] (a DataFrame).

// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
// Load the JSON file as a Dataset[Row] (a DataFrame)
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
// Print the inferred schema
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
// Convert a Dataset[String] holding JSON strings into a Dataset[Row]
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

Hive Tables

Spark SQL supports reading and writing data stored in Hive. Hive is configured by placing hive-site.xml, core-site.xml, and hdfs-site.xml in the conf/ directory.

import java.io.File

import org.apache.spark.sql.{Row, SaveMode, SparkSession}

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Create the table src
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
// Load data into the table
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// |    500 |
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DataFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
  case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// |               value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// |  2| val_2|  2| val_2|
// |  4| val_4|  4| val_4|
// |  5| val_5|  5| val_5|
// ...

// Create a Hive managed Parquet table, with HQL syntax instead of the Spark SQL native syntax
// `USING hive`
sql("CREATE TABLE hive_records(key int, value string) STORED AS PARQUET")
// Save DataFrame to the Hive managed table
val df = spark.table("src")
df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")
// After insertion, the Hive managed table has data now
sql("SELECT * FROM hive_records").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Prepare a Parquet data directory
val dataDir = "/tmp/parquet_data"
spark.range(10).write.parquet(dataDir)
// Create a Hive external Parquet table
sql(s"CREATE EXTERNAL TABLE hive_ints(key int) STORED AS PARQUET LOCATION '$dataDir'")
// The Hive external table should already have data
sql("SELECT * FROM hive_ints").show()
// +---+
// |key|
// +---+
// |  0|
// |  1|
// |  2|
// ...

// Turn on flag for Hive Dynamic Partitioning
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
// Create a Hive partitioned table using DataFrame API
df.write.partitionBy("key").format("hive").saveAsTable("hive_part_tbl")
// Partitioned column `key` will be moved to the end of the schema.
sql("SELECT * FROM hive_part_tbl").show()
// +-------+---+
// |  value|key|
// +-------+---+
// |val_238|238|
// | val_86| 86|
// |val_311|311|
// ...

spark.stop()

When creating a Hive table, you can use a statement like the following to define how data is read from and written to the file system; this one specifies Parquet as the file format.

CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')

The following options are available:

fileFormat: a bundle of storage format specifications, including the inputFormat, outputFormat, and serde. Currently supported values are 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
inputFormat, outputFormat: the input and output format class names; they must be specified as a pair, and cannot be used if fileFormat is already specified.
serde: the name of the serde class; it cannot be specified if the chosen fileFormat already implies a serde.
fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim: these options can only be used with the 'textfile' file format; they define how to read delimited files into rows.
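
For example, continuing from the Hive example above (where import spark.sql is in scope), a hypothetical text-format table with a custom field delimiter could be declared like this:

// Hive text table whose fields are separated by commas (table name is illustrative)
sql("""
  CREATE TABLE src_text(id INT, value STRING)
  USING hive
  OPTIONS(
    fileFormat 'textfile',
    fieldDelim ','
  )
""")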

JDBC To Other Databases

Spark SQL can also read data from other databases via JDBC. When starting the Spark shell, specify the JDBC driver on the classpath with a command like the following:

bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar

To load data from another database, you need to specify a number of JDBC connection properties, such as the user name and password.

// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods
// Loading data from a JDBC source
// Specify the URL, table name, user name, password, etc.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()

// Or load via the jdbc method with a Properties object
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")
val jdbcDF2 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
// Specifying the custom data types of the read schema
// Override the column data types in the read schema
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

// Saving data to a JDBC source
// Write the data to another database
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()

// Or write using the jdbc method
jdbcDF2.write
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

// Specifying create table column data types on write
// Specify column data types for the created table when writing
jdbcDF.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
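
Beyond the basic connection options shown above, the JDBC source also supports options for parallel, partitioned reads (partitionColumn, lowerBound, upperBound, numPartitions) and fetch tuning (fetchsize). A minimal sketch, assuming the table has a numeric id column to split on:

val jdbcParallelDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .option("partitionColumn", "id")  // assumed numeric column; not part of the original example
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "10")
  .load()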

 
