Book of the Week: 《Spark與Hadoop大資料分析》 (Big Data Analytics with Spark and Hadoop)
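Big Data Analytics with Spark and Hadoop gives a fairly systematic account of how to do big data analytics with Hadoop, Spark, and the tools in their ecosystems. It covers the fundamentals of Apache Spark and Hadoop and goes deep into all of the Spark components (Spark Core, Spark SQL, DataFrame, Dataset, conventional streaming, Structured Streaming, MLlib, and GraphX) as well as the core Hadoop components (HDFS, MapReduce, and YARN), with detailed implementation examples throughout. It is a thorough reference for quickly getting to grips with big data analytics infrastructure and how to implement it.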
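The book has 10 chapters. Chapter 1 explains the concepts of big data analytics from a macro perspective and introduces the tools and techniques used on the Hadoop and Spark platforms, along with some common use cases. Chapter 2 covers the fundamentals of the Hadoop and Spark platforms. Chapter 3 takes a deep dive into Spark. Chapter 4 introduces the Data Sources API, the DataFrame API, and the new Dataset API. Chapter 5 explains how to do real-time analytics with Spark Streaming. Chapter 6 introduces the notebooks and dataflow tools that accompany Spark and Hadoop. Chapter 7 covers machine learning techniques on Spark and Hadoop. Chapter 8 shows how to build recommendation systems. Chapter 9 shows how to do graph analytics with GraphX. Chapter 10 shows how to use SparkR.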
Table of contents:
Chapter 1 Big Data Analytics from a Macro Perspective
1.1 Big data analytics and the role of Hadoop and Spark
1.1.1 Life cycle of a typical big data analytics project
1.1.2 The role of Hadoop and Spark
1.2 Big data science and the role of Hadoop and Spark
1.2.1 The fundamental shift from data analytics to data science
1.2.2 Life cycle of a typical data science project
1.2.3 The role of Hadoop and Spark
1.3 Tools and techniques
1.4 Real-world use cases
1.5 Summary
Chapter 2 Getting Started with Apache Hadoop and Apache Spark
2.1 Overview of Apache Hadoop
2.1.1 Hadoop Distributed File System
2.1.2 Features of HDFS
2.1.3 MapReduce
2.1.4 Features of MapReduce
2.1.5 MapReduce v1 versus MapReduce v2
2.1.6 YARN
2.1.7 Storage options on Hadoop
2.2 Overview of Apache Spark
2.2.1 History of Spark
2.2.2 What Apache Spark is
2.2.3 What Apache Spark is not
2.2.4 Problems with MapReduce
2.2.5 Spark's architecture
2.3 Why use Hadoop and Spark together
2.3.1 Features of Hadoop
2.3.2 Features of Spark
2.4 Installing Hadoop and Spark clusters
2.5 Summary
Chapter 3 Deep Dive into Apache Spark
3.1 Starting the Spark daemons
3.1.1 Using CDH
3.1.2 Using HDP, MapR, and Spark pre-built packages
3.2 Learning the core concepts of Spark
3.2.1 Ways to work with Spark
3.2.2 Resilient Distributed Datasets
3.2.3 Spark context
3.2.4 Transformations and actions
3.2.5 Parallelism in RDDs
3.2.6 Lazy evaluation
3.2.7 Lineage graph
3.2.8 Serialization
3.2.9 Using Hadoop file formats in Spark
3.2.10 Data locality
3.2.11 Shared variables
3.2.12 Key-value pair RDDs
3.3 Life cycle of a Spark program
3.3.1 Pipelining
3.3.2 Summary of Spark execution
3.4 Spark applications
3.4.1 Spark Shell and Spark applications
3.4.2 Creating a Spark context
3.4.3 SparkConf
3.4.4 SparkSubmit
3.4.5 Precedence of Spark configuration properties
3.4.6 Important application configurations
3.5 Persistence and caching
3.5.1 Storage levels
3.5.2 Which storage level to choose
3.6 Spark resource managers: Standalone, YARN, and Mesos
3.6.1 Local and cluster modes
3.6.2 Cluster resource managers
3.7 Summary
Chapter 4 Big Data Analytics with Spark SQL, DataFrames, and Datasets
4.1 History of Spark SQL
4.2 Architecture of Spark SQL
4.3 Introducing the four components of Spark SQL
4.4 Evolution of DataFrames and Datasets
4.4.1 What is wrong with RDDs
4.4.2 RDD transformations versus Dataset and DataFrame transformations
4.5 Why use Datasets and DataFrames
4.5.1 Optimization
4.5.2 Speed
4.5.3 Automatic schema discovery
4.5.4 Multiple data sources, multiple programming languages
4.5.5 Interoperability between RDDs and other APIs
4.5.6 Selecting and reading only the necessary data
4.6 When to use RDDs, Datasets, and DataFrames
4.7 Analytics with DataFrames
4.7.1 Creating a SparkSession
4.7.2 Creating DataFrames
4.7.3 Converting DataFrames to RDDs
4.7.4 Common Dataset and DataFrame operations
4.7.5 Caching data
4.7.6 Performance optimization
4.8 Analytics with the Dataset API
4.8.1 Creating Datasets
4.8.2 Converting a DataFrame to a Dataset
4.8.3 Accessing metadata through the catalog
4.9 Data Sources API
4.9.1 Read and write functions
4.9.2 Built-in data sources
4.9.3 External data sources
4.10 Spark SQL as a distributed SQL engine
4.10.1 Spark SQL's Thrift server for JDBC/ODBC access
4.10.2 Querying data with the beeline client
4.10.3 Querying data from Hive with the spark-sql CLI
4.10.4 Integration with BI tools
4.11 Hive on Spark
4.12 Summary
Chapter 5 Real-Time Analytics with Spark Streaming and Structured Streaming
5.1 Overview of real-time processing
5.1.1 Pros and cons of Spark Streaming
5.1.2 History of Spark Streaming
5.2 Architecture of Spark Streaming
5.2.1 Spark Streaming application flow
5.2.2 Stateless and stateful stream processing
5.3 Spark Streaming transformations and actions
5.3.1 union
5.3.2 join
5.3.3 transform operation
5.3.4 updateStateByKey
5.3.5 mapWithState
5.3.6 Window operations
5.3.7 Output operations
5.4 Input sources and output stores
5.4.1 Basic sources
5.4.2 Advanced sources
5.4.3 Custom sources
5.4.4 Receiver reliability
5.4.5 Output stores
5.5 Spark Streaming with Kafka and HBase
5.5.1 Receiver-based approach
5.5.2 Direct approach (no receivers)
5.5.3 Integration with HBase
5.6 Advanced concepts of Spark Streaming
5.6.1 Using DataFrames
5.6.2 MLlib operations
5.6.3 Caching and persistence
5.6.4 Fault tolerance in Spark Streaming
5.6.5 Performance tuning of Spark Streaming applications
5.7 Monitoring applications
5.8 Overview of Structured Streaming
5.8.1 Workflow of a Structured Streaming application
5.8.2 Streaming Datasets and streaming DataFrames
5.8.3 Operations on streaming Datasets and streaming DataFrames
5.9 Summary
Chapter 6 Notebooks and Dataflows with Spark and Hadoop
6.1 Overview of web-based notebooks
6.2 Overview of Jupyter
6.2.1 Installing Jupyter
6.2.2 Analytics with Jupyter
6.3 Overview of Apache Zeppelin
6.3.1 Jupyter versus Zeppelin
6.3.2 Installing Apache Zeppelin
6.3.3 Analytics with Zeppelin
6.4 The Livy REST job server and Hue notebooks
6.4.1 Installing and configuring the Livy server and Hue
6.4.2 Using the Livy server
6.4.3 Using Livy with Hue notebooks
6.4.4 Using Livy with Zeppelin
6.5 Overview of Apache NiFi for dataflows
6.5.1 Installing Apache NiFi
6.5.2 Dataflows and analytics with NiFi
6.6 Summary
Chapter 7 Machine Learning with Spark and Hadoop
7.1 Overview of machine learning
7.2 Machine learning on Spark and Hadoop
7.3 Machine learning algorithms
7.3.1 Supervised learning
7.3.2 Unsupervised learning
7.3.3 Recommender systems
7.3.4 Feature extraction and transformation
7.3.5 Optimization
7.3.6 Spark MLlib data types
7.4 An example of machine learning algorithms
7.5 Building machine learning pipelines
7.5.1 An example of a pipeline workflow
7.5.2 Building an ML pipeline
7.5.3 Saving and loading models
7.6 Machine learning with H2O and Spark
7.6.1 Why use Sparkling Water
7.6.2 An application flow on YARN
7.6.3 Getting started with Sparkling Water
7.7 Overview of Hivemall
7.8 Overview of Hivemall for Spark
7.9 Summary
Chapter 8 Building Recommendation Systems with Spark and Mahout
8.1 Building recommendation systems
8.1.1 Content-based filtering
8.1.2 Collaborative filtering
8.2 Limitations of recommendation systems
8.3 Implementing a recommendation system with MLlib
8.3.1 Preparing the environment
8.3.2 Creating RDDs
8.3.3 Exploring the data with DataFrames
8.3.4 Creating training and test datasets
8.3.5 Creating a model
8.3.6 Making predictions
8.3.7 Evaluating the model with test data
8.3.8 Checking the accuracy of the model
8.3.9 Explicit and implicit feedback
8.4 Mahout and Spark integration
8.4.1 Installing Mahout
8.4.2 Exploring the Mahout shell
8.4.3 Building a universal recommendation system with Mahout and a search tool
8.5 Summary
Chapter 9 Graph Analytics with GraphX
9.1 Overview of graph processing
9.1.1 What a graph is
9.1.2 Graph databases and graph processing systems
9.1.3 Overview of GraphX
9.1.4 Graph algorithms
9.2 Getting started with GraphX
9.2.1 Basic GraphX operations
9.2.2 Transforming graphs
9.2.3 GraphX algorithms
9.3 Analyzing flight data with GraphX
9.4 Overview of GraphFrames
9.4.1 Motif finding
9.4.2 Loading and saving GraphFrames
9.5 Summary
Chapter 10 Interactive Analytics with SparkR
10.1 Overview of the R language and SparkR
10.1.1 What the R language is
10.1.2 Overview of SparkR
10.1.3 SparkR architecture
10.2 Getting started with SparkR
10.2.1 Installing and configuring R
10.2.2 Using the SparkR shell
10.2.3 Using SparkR scripts
10.3 Using DataFrames in SparkR
10.4 Using SparkR in RStudio
10.5 Machine learning with SparkR
10.5.1 Using the Naive Bayes model
10.5.2 Using the K-means model
10.6 Using SparkR in Zeppelin
10.7 Summary
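To get the download link, visit the official website of the Training Center of the Institute of Computing Technology, Chinese Academy of Sciences (http://www.tcict.cn/) and add the WeChat customer-service account listed there to request it.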