Spark修煉之道(進階篇)——Spark入門到精通:第八節 Spark SQL與DataFrame(一)

五柳-先生發表於2015-11-14

本節主要內賓

  1. Spark SQL簡介
  2. DataFrame

1. Spark SQL簡介

Spark SQL是Spark的五大核心模組之一,用於在Spark平臺之上處理結構化資料,利用Spark SQL可以構建大資料平臺上的資料倉儲,它具有如下特點: 
(1)能夠無縫地將SQL語句整合到Spark應用程式當中 
這裡寫圖片描述 
(2)統一的資料訪問方式 
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources. 
這裡寫圖片描述 
(3) 相容Hive 
這裡寫圖片描述 
(4) 可採用JDBC or ODBC連線 
這裡寫圖片描述 
具體見:http://spark.apache.org/sql/

關於Spark SQL的執行原理可參見:http://blog.csdn.net/book_mmicky/article/details/39956809,文章寫得非常好 ,這裡不再贅述,在此向作者致敬

2. DataFrame

(1)DataFrame簡介

本文部分內容譯自https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

DataFrames在Spark-1.3.0中引入,主要解決使用Spark RDD API使用的門檻,使熟悉R語言等的資料分析師能夠快速上手Spark下的資料分析工作,極大地擴大了Spark使用者的數量,由於DataFrames脫胎自SchemaRDD,因此它天然適用於分散式大資料場景。相信在不久的將來,Spark將是大資料分析的終極歸宿。

在Spark中,DataFrame是一種以RDD為基礎的分散式資料集,與傳統RDBMS的表結構類似。與一般的RDD不同的是,DataFrame帶有schema元資訊,即DataFrame所表示的表資料集的每一列都帶有名稱和型別,它對於資料的內部結構具有很強的描述能力。因此Spark SQL可以對藏於DataFrame背後的資料來源以及作用於DataFrame之上的變換進行了針對性的優化,最終達到大幅提升執行時效率。

DataFrames具有如下特點:

(1)Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster(支援單機KB級到叢集PB級的資料處理) 
(2)Support for a wide array of data formats and storage systems(支援多種資料格式和儲存系統,如圖所示) 
這裡寫圖片描述 
(3)State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer(通過Spark SQL Catalyst優化器可以進行高效的程式碼生成和優化) 
(4)Seamless integration with all big data tooling and infrastructure via Spark(能夠無縫整合所有的大資料處理工具) 
(5)APIs for Python, Java, Scala, and R (in development via SparkR)(提供Python, Java, Scala, R語言API)

(2)DataFrame 實戰

本節部分內容來自:http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes

將people.json上傳到HDFS上,放置在/data目錄下,people.json檔案內容如下:

<code class="hljs ruby has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">root<span class="hljs-variable" style="color: rgb(102, 0, 102); box-sizing: border-box;">@sparkslave01</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:~</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># hdfs dfs -cat /data/people.json</span>
{<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Michael"</span></span>}
{<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Andy"</span></span>, <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"age"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:</span><span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">30</span>}
{<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Justin"</span></span>, <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"age"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:</span><span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">19</span>}</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li></ul>

由於json檔案中已經包括了列名稱的資訊,因此它可以直接建立DataFrame

<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val df = sqlContext.read.json("/data/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//顯示DataFrame完整資訊</span>
<span class="hljs-header" style="box-sizing: border-box;">scala> df.show()
+----+-------+</span>
<span class="hljs-header" style="box-sizing: border-box;">| age|   name|
+----+-------+</span>
|null|Michael|
|  30|   Andy|
<span class="hljs-header" style="box-sizing: border-box;">|  19| Justin|
+----+-------+</span>
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li></ul>

這裡寫圖片描述

<code class="hljs cs has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//檢視DataFrame後設資料資訊</span>
scala> df.printSchema()
root
 |-- age: <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">long</span> (nullable = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">true</span>)
 |-- name: <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">string</span> (nullable = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">true</span>)
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//返回DataFrame某列所有資料</span>
<span class="hljs-header" style="box-sizing: border-box;">scala> df.select("name").show()
+-------+</span>
<span class="hljs-header" style="box-sizing: border-box;">|   name|
+-------+</span>
|Michael|
|   Andy|
<span class="hljs-header" style="box-sizing: border-box;">| Justin|
+-------+</span>
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//DataFrame資料過濾</span>
<span class="hljs-header" style="box-sizing: border-box;">scala> df.filter(df("age") > 21).show()
+---+----+</span>
<span class="hljs-header" style="box-sizing: border-box;">|age|name|
+---+----+</span>
<span class="hljs-header" style="box-sizing: border-box;">| 30|Andy|
+---+----+</span>
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//按年齡分組</span>
<span class="hljs-header" style="box-sizing: border-box;">scala> df.groupBy("age").count().show()
+----+-----+</span>
<span class="hljs-header" style="box-sizing: border-box;">| age|count|
+----+-----+</span>
|null|    1|
|  19|    1|
<span class="hljs-header" style="box-sizing: border-box;">|  30|    1|
+----+-----+</span>
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li></ul>
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">//註冊成表
scala> df<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.registerTempTable</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"people"</span>)
//執行SparkSQL
scala> val teenagers = sqlContext<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.sql</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"SELECT name, age FROM people WHERE age >= 13 AND age <= 19"</span>)
<span class="hljs-label" style="box-sizing: border-box;">teenagers:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.sql</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.DataFrame</span> = [name: string, age: bigint]
//結果格式化輸出
scala> teenagers<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.map</span>(t => <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"Name: "</span> + t(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>))<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.collect</span>()<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.foreach</span>(println)
<span class="hljs-label" style="box-sizing: border-box;">Name:</span> Justin</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li></ul>

這裡寫圖片描述

轉載: http://blog.csdn.net/lovehuangjiaju/article/details/48661847

相關文章