Spark修煉之道(進階篇)——Spark入門到精通:第八節 Spark SQL與DataFrame(一)
本節主要內賓
- Spark SQL簡介
- DataFrame
1. Spark SQL簡介
Spark SQL是Spark的五大核心模組之一,用於在Spark平臺之上處理結構化資料,利用Spark SQL可以構建大資料平臺上的資料倉儲,它具有如下特點:
(1)能夠無縫地將SQL語句整合到Spark應用程式當中
(2)統一的資料訪問方式
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
(3) 相容Hive
(4) 可採用JDBC or ODBC連線
具體見:http://spark.apache.org/sql/
關於Spark SQL的執行原理可參見:http://blog.csdn.net/book_mmicky/article/details/39956809,文章寫得非常好 ,這裡不再贅述,在此向作者致敬
2. DataFrame
(1)DataFrame簡介
DataFrames在Spark-1.3.0中引入,主要解決使用Spark RDD API使用的門檻,使熟悉R語言等的資料分析師能夠快速上手Spark下的資料分析工作,極大地擴大了Spark使用者的數量,由於DataFrames脫胎自SchemaRDD,因此它天然適用於分散式大資料場景。相信在不久的將來,Spark將是大資料分析的終極歸宿。
在Spark中,DataFrame是一種以RDD為基礎的分散式資料集,與傳統RDBMS的表結構類似。與一般的RDD不同的是,DataFrame帶有schema元資訊,即DataFrame所表示的表資料集的每一列都帶有名稱和型別,它對於資料的內部結構具有很強的描述能力。因此Spark SQL可以對藏於DataFrame背後的資料來源以及作用於DataFrame之上的變換進行了針對性的優化,最終達到大幅提升執行時效率。
DataFrames具有如下特點:
(1)Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster(支援單機KB級到叢集PB級的資料處理)
(2)Support for a wide array of data formats and storage systems(支援多種資料格式和儲存系統,如圖所示)
(3)State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer(通過Spark SQL Catalyst優化器可以進行高效的程式碼生成和優化)
(4)Seamless integration with all big data tooling and infrastructure via Spark(能夠無縫整合所有的大資料處理工具)
(5)APIs for Python, Java, Scala, and R (in development via SparkR)(提供Python, Java, Scala, R語言API)
(2)DataFrame 實戰
本節部分內容來自:http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
將people.json上傳到HDFS上,放置在/data目錄下,people.json檔案內容如下:
<code class="hljs ruby has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">root<span class="hljs-variable" style="color: rgb(102, 0, 102); box-sizing: border-box;">@sparkslave01</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:~</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># hdfs dfs -cat /data/people.json</span> {<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Michael"</span></span>} {<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Andy"</span></span>, <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"age"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:</span><span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">30</span>} {<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Justin"</span></span>, <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"age"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:</span><span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">19</span>}</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li></ul>
由於json檔案中已經包括了列名稱的資訊,因此它可以直接建立DataFrame
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val df = sqlContext.read.json("/data/people.json") df: org.apache.spark.sql.DataFrame = [age: bigint, name: string] <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//顯示DataFrame完整資訊</span> <span class="hljs-header" style="box-sizing: border-box;">scala> df.show() +----+-------+</span> <span class="hljs-header" style="box-sizing: border-box;">| age| name| +----+-------+</span> |null|Michael| | 30| Andy| <span class="hljs-header" style="box-sizing: border-box;">| 19| Justin| +----+-------+</span> </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li></ul>
<code class="hljs cs has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//檢視DataFrame後設資料資訊</span> scala> df.printSchema() root |-- age: <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">long</span> (nullable = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">true</span>) |-- name: <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">string</span> (nullable = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">true</span>) </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//返回DataFrame某列所有資料</span> <span class="hljs-header" style="box-sizing: border-box;">scala> df.select("name").show() +-------+</span> <span class="hljs-header" style="box-sizing: border-box;">| name| +-------+</span> |Michael| | Andy| <span class="hljs-header" style="box-sizing: border-box;">| Justin| +-------+</span> </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//DataFrame資料過濾</span> <span class="hljs-header" style="box-sizing: border-box;">scala> df.filter(df("age") > 21).show() +---+----+</span> <span class="hljs-header" style="box-sizing: border-box;">|age|name| +---+----+</span> <span class="hljs-header" style="box-sizing: border-box;">| 30|Andy| +---+----+</span> </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//按年齡分組</span> <span class="hljs-header" style="box-sizing: border-box;">scala> df.groupBy("age").count().show() +----+-----+</span> <span class="hljs-header" style="box-sizing: border-box;">| age|count| +----+-----+</span> |null| 1| | 19| 1| <span class="hljs-header" style="box-sizing: border-box;">| 30| 1| +----+-----+</span> </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li></ul>
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">//註冊成表 scala> df<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.registerTempTable</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"people"</span>) //執行SparkSQL scala> val teenagers = sqlContext<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.sql</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"SELECT name, age FROM people WHERE age >= 13 AND age <= 19"</span>) <span class="hljs-label" style="box-sizing: border-box;">teenagers:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.sql</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.DataFrame</span> = [name: string, age: bigint] //結果格式化輸出 scala> teenagers<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.map</span>(t => <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"Name: "</span> + t(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>))<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.collect</span>()<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.foreach</span>(println) <span class="hljs-label" style="box-sizing: border-box;">Name:</span> Justin</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li></ul>
轉載: http://blog.csdn.net/lovehuangjiaju/article/details/48661847
相關文章
- Spark修煉之道(進階篇)——Spark入門到精通:第一節 Spark 1.5.0叢集搭建Spark
- Spark修煉之道(進階篇)——Spark入門到精通:第七節 Spark執行原理Spark
- Spark修煉之道(進階篇)——Spark入門到精通:第四節 Spark程式設計模型(一)Spark程式設計模型
- Spark修煉之道(進階篇)——Spark入門到精通:第五節 Spark程式設計模型(二)Spark程式設計模型
- Spark修煉之道(進階篇)——Spark入門到精通:第六節 Spark程式設計模型(三)Spark程式設計模型
- Spark修煉之道(進階篇)——Spark入門到精通:第二節 Hadoop、Spark生成圈簡介SparkHadoop
- Spark修煉之道(進階篇)——Spark入門到精通:第三節 Spark Intellij IDEA開發環境搭建SparkIntelliJIdea開發環境
- Spark SQL | Spark,從入門到精通SparkSQL
- Spark修煉之道(高階篇)——Spark原始碼閱讀:第一節 Spark應用程式提交流程Spark原始碼
- Hello Spark! | Spark,從入門到精通Spark
- Spark API 全集(1):Spark SQL Dataset & DataFrame APISparkAPISQL
- Spark從入門到放棄——初始Spark(一)Spark
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第八節:網路管理SparkLinux大資料
- Spark入門篇Spark
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第九節:Shell程式設計入門(一)SparkLinux大資料程式設計
- 15、Spark Sql(一),生成DataFrame的方式SparkSQL
- Spark下載與入門(Spark自學二)Spark
- Spark SQL中的RDD與DataFrame轉換SparkSQL
- Spark SQL學習——DataFrame和DataSetSparkSQL
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第十節:Shell程式設計入門(二)SparkLinux大資料程式設計
- Spark SQL 程式設計API入門系列之Spark SQL的作用與使用方式SparkSQL程式設計API
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第十二節:Shell程式設計入門(四)SparkLinux大資料程式設計
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第十三節:Shell程式設計入門(五)SparkLinux大資料程式設計
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第十四節:Shell程式設計入門(六)SparkLinux大資料程式設計
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第七節:程式管理SparkLinux大資料
- 【Spark篇】---Spark初始Spark
- 「Spark從精通到重新入門(一)」Spark 中不可不知的動態優化Spark優化
- Spark建立空的DataFrameSpark
- Spark系列 - (3) Spark SQLSparkSQL
- Spark 快速入門Spark
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第五節:vi、vim編輯器(一)SparkLinux大資料
- 10.spark sql之快速入門SparkSQL
- Spark從入門到放棄---RDDSpark
- Spark 從零到開發(五)初識Spark SQLSparkSQL
- Spark入門(五)--Spark的reduce和reduceByKeySpark
- Spark入門(四)--Spark的map、flatMap、mapToPairSparkAPTAI
- SparkSQL /DataFrame /Spark RDD誰快?SparkSQL
- spark dataframe 型別轉換Spark型別