Book of the Week: *Big Data Analytics with Spark and Hadoop*!

Posted by shenmanli on 2019-03-15

*Big Data Analytics with Spark and Hadoop* systematically explains how to perform big data analytics with Hadoop, Spark, and the tools in their surrounding ecosystems. It covers the fundamentals of Apache Spark and Hadoop, and then digs into all the Spark components — Spark Core, Spark SQL, DataFrames, Datasets, Spark Streaming, Structured Streaming, MLlib, and GraphX — as well as Hadoop's core components (HDFS, MapReduce, and YARN), with detailed worked examples throughout. It is a thorough reference for quickly mastering the infrastructure of big data analytics and how to put it into practice.

The book has ten chapters. Chapter 1 looks at big data analytics from a high level and introduces the tools and techniques used on the Hadoop and Spark platforms, along with some common use cases. Chapter 2 covers the fundamentals of the Hadoop and Spark platforms. Chapter 3 is a deep dive into Spark. Chapter 4 focuses on the Data Sources API, the DataFrame API, and the new Dataset API. Chapter 5 explains real-time analytics with Spark Streaming. Chapter 6 introduces notebooks and dataflows that work with Spark and Hadoop. Chapter 7 covers machine learning techniques on Spark and Hadoop. Chapter 8 shows how to build recommendation systems. Chapter 9 covers graph analytics with GraphX. Chapter 10 introduces SparkR.

Table of Contents:

Chapter 1 A Macro View of Big Data Analytics ···· 1

1.1 Big data analytics and the role of Hadoop and Spark ···· 3

1.1.1 The life cycle of a typical big data analytics project ···· 4

1.1.2 The role of Hadoop and Spark ···· 6

1.2 Big data science and the role of Hadoop and Spark ···· 6

1.2.1 The fundamental shift from data analytics to data science ···· 6

1.2.2 The life cycle of a typical data science project ···· 8

1.2.3 The role of Hadoop and Spark ···· 9

1.3 Tools and techniques ···· 9

1.4 Real-world use cases ···· 11

1.5 Summary ···· 12

Chapter 2 Getting Started with Apache Hadoop and Apache Spark ···· 13

2.1 Overview of Apache Hadoop ···· 13

2.1.1 The Hadoop Distributed File System ···· 14

2.1.2 Features of HDFS ···· 15

2.1.3 MapReduce ···· 16

2.1.4 Features of MapReduce ···· 17

2.1.5 MapReduce v1 versus MapReduce v2 ···· 17

2.1.6 YARN ···· 18

2.1.7 Storage options on Hadoop ···· 20

2.2 Overview of Apache Spark ···· 24

2.2.1 The history of Spark ···· 24

2.2.2 What is Apache Spark? ···· 25

2.2.3 What Apache Spark is not ···· 26

2.2.4 Problems with MapReduce ···· 27

2.2.5 Spark's architecture ···· 28

2.3 Why use Hadoop and Spark together? ···· 31

2.3.1 Features of Hadoop ···· 31

2.3.2 Features of Spark ···· 31

2.4 Installing Hadoop and Spark clusters ···· 33

2.5 Summary ···· 36

Chapter 3 Deep Dive into Apache Spark ···· 37

3.1 Starting the Spark daemons ···· 37

3.1.1 Working with CDH ···· 38

3.1.2 Working with HDP, MapR, and Spark pre-built packages ···· 38

3.2 Learning Spark's core concepts ···· 39

3.2.1 Ways to work with Spark ···· 39

3.2.2 Resilient Distributed Datasets ···· 41

3.2.3 The Spark context ···· 43

3.2.4 Transformations and actions ···· 44

3.2.5 Parallelism in RDDs ···· 46

3.2.6 Lazy evaluation ···· 49

3.2.7 The lineage graph ···· 50

3.2.8 Serialization ···· 51

3.2.9 Leveraging Hadoop file formats in Spark ···· 52

3.2.10 Data locality ···· 53

3.2.11 Shared variables ···· 54

3.2.12 Key-value pair RDDs ···· 55

3.3 The lifecycle of a Spark program ···· 55

3.3.1 Pipelining ···· 57

3.3.2 A summary of Spark execution ···· 58

3.4 Spark applications ···· 59

3.4.1 The Spark shell versus Spark applications ···· 59

3.4.2 Creating a Spark context ···· 59

3.4.3 SparkConf ···· 59

3.4.4 SparkSubmit ···· 60

3.4.5 Precedence of Spark configuration options ···· 61

3.4.6 Important application configurations ···· 61

3.5 Persistence and caching ···· 62

3.5.1 Storage levels ···· 62

3.5.2 Which storage level to choose ···· 63

3.6 Spark resource managers: Standalone, YARN, and Mesos ···· 63

3.6.1 Local versus cluster mode ···· 63

3.6.2 Cluster resource managers ···· 64

3.7 Summary ···· 67

Chapter 4 Big Data Analytics with Spark SQL, DataFrames, and Datasets ···· 69

4.1 The history of Spark SQL ···· 70

4.2 The architecture of Spark SQL ···· 71

4.3 Introducing the four components of Spark SQL ···· 72

4.4 The evolution of DataFrames and Datasets ···· 74

4.4.1 What's wrong with RDDs? ···· 74

4.4.2 RDD transformations versus Dataset and DataFrame transformations ···· 75

4.5 Why use Datasets and DataFrames? ···· 75

4.5.1 Optimization ···· 76

4.5.2 Speed ···· 76

4.5.3 Automatic schema discovery ···· 77

4.5.4 Multiple data sources, multiple programming languages ···· 77

4.5.5 Interoperability between RDDs and the other APIs ···· 77

4.5.6 Selecting and reading only the data you need ···· 78

4.6 When to use RDDs, Datasets, and DataFrames ···· 78

4.7 Analytics with DataFrames ···· 78

4.7.1 Creating a SparkSession ···· 79

4.7.2 Creating DataFrames ···· 79

4.7.3 Converting a DataFrame to an RDD ···· 82

4.7.4 Common Dataset/DataFrame operations ···· 83

4.7.5 Caching data ···· 84

4.7.6 Performance optimization ···· 84

4.8 Analytics with the Dataset API ···· 85

4.8.1 Creating Datasets ···· 85

4.8.2 Converting a DataFrame to a Dataset ···· 86

4.8.3 Accessing metadata through the catalog ···· 87

4.9 The Data Sources API ···· 87

4.9.1 Read and write functions ···· 88

4.9.2 Built-in data sources ···· 88

4.9.3 External data sources ···· 93

4.10 Spark SQL as a distributed SQL engine ···· 97

4.10.1 The Spark SQL Thrift server for JDBC/ODBC access ···· 97

4.10.2 Querying data with the beeline client ···· 98

4.10.3 Querying data from Hive with the spark-sql CLI ···· 99

4.10.4 Integration with BI tools ···· 100

4.11 Hive on Spark ···· 100

4.12 Summary ···· 100

Chapter 5 Real-Time Analytics with Spark Streaming and Structured Streaming ···· 102

5.1 Overview of real-time processing ···· 103

5.1.1 Pros and cons of Spark Streaming ···· 104

5.1.2 The history of Spark Streaming ···· 104

5.2 The architecture of Spark Streaming ···· 104

5.2.1 The flow of a Spark Streaming application ···· 106

5.2.2 Stateless and stateful stream processing ···· 107

5.3 Spark Streaming transformations and actions ···· 109

5.3.1 union ···· 109

5.3.2 join ···· 109

5.3.3 The transform operation ···· 109

5.3.4 updateStateByKey ···· 109

5.3.5 mapWithState ···· 110

5.3.6 Window operations ···· 110

5.3.7 Output operations ···· 111

5.4 Input sources and output stores ···· 111

5.4.1 Basic sources ···· 112

5.4.2 Advanced sources ···· 112

5.4.3 Custom sources ···· 112

5.4.4 Receiver reliability ···· 112

5.4.5 Output stores ···· 113

5.5 Spark Streaming with Kafka and HBase ···· 113

5.5.1 The receiver-based approach ···· 114

5.5.2 The direct approach (no receivers) ···· 116

5.5.3 Integration with HBase ···· 117

5.6 Advanced concepts of Spark Streaming ···· 118

5.6.1 Using DataFrames ···· 118

5.6.2 MLlib operations ···· 119

5.6.3 Caching/persistence ···· 119

5.6.4 Fault tolerance in Spark Streaming ···· 119

5.6.5 Performance tuning of Spark Streaming applications ···· 121

5.7 Monitoring applications ···· 122

5.8 Overview of Structured Streaming ···· 123

5.8.1 The workflow of a Structured Streaming application ···· 123

5.8.2 Streaming Datasets and streaming DataFrames ···· 125

5.8.3 Operations on streaming Datasets and streaming DataFrames ···· 126

5.9 Summary ···· 129

Chapter 6 Notebooks and Dataflows with Spark and Hadoop ···· 130

6.1 Overview of web-based notebooks ···· 130

6.2 Overview of Jupyter ···· 131

6.2.1 Installing Jupyter ···· 132

6.2.2 Analytics with Jupyter ···· 134

6.3 Overview of Apache Zeppelin ···· 135

6.3.1 Jupyter versus Zeppelin ···· 136

6.3.2 Installing Apache Zeppelin ···· 137

6.3.3 Analytics with Zeppelin ···· 139

6.4 The Livy REST job server and Hue notebooks ···· 140

6.4.1 Installing and configuring the Livy server and Hue ···· 141

6.4.2 Using the Livy server ···· 142

6.4.3 Using Livy with Hue notebooks ···· 145

6.4.4 Using Livy with Zeppelin ···· 148

6.5 Overview of Apache NiFi for dataflows ···· 148

6.5.1 Installing Apache NiFi ···· 148

6.5.2 Using NiFi for dataflows and analytics ···· 149

6.6 Summary ···· 152

Chapter 7 Machine Learning with Spark and Hadoop ···· 153

7.1 Overview of machine learning ···· 153

7.2 Machine learning on Spark and Hadoop ···· 154

7.3 Machine learning algorithms ···· 155

7.3.1 Supervised learning ···· 156

7.3.2 Unsupervised learning ···· 156

7.3.3 Recommender systems ···· 157

7.3.4 Feature extraction and transformation ···· 157

7.3.5 Optimization ···· 158

7.3.6 Spark MLlib data types ···· 158

7.4 Examples of machine learning algorithms ···· 160

7.5 Building machine learning pipelines ···· 163

7.5.1 An example pipeline workflow ···· 163

7.5.2 Building an ML pipeline ···· 164

7.5.3 Saving and loading models ···· 166

7.6 Machine learning with H2O and Spark ···· 167

7.6.1 Why Sparkling Water? ···· 167

7.6.2 An application flow on YARN ···· 167

7.6.3 Getting started with Sparkling Water ···· 168

7.7 Overview of Hivemall ···· 169

7.8 Overview of Hivemall for Spark ···· 170

7.9 Summary ···· 170

Chapter 8 Building Recommendation Systems with Spark and Mahout ···· 171

8.1 Building recommendation systems ···· 171

8.1.1 Content-based filtering ···· 172

8.1.2 Collaborative filtering ···· 172

8.2 Limitations of recommendation systems ···· 173

8.3 Implementing a recommendation system with MLlib ···· 173

8.3.1 Preparing the environment ···· 174

8.3.2 Creating RDDs ···· 175

8.3.3 Exploring the data with DataFrames ···· 176

8.3.4 Creating the training and testing datasets ···· 178

8.3.5 Creating a model ···· 178

8.3.6 Making predictions ···· 179

8.3.7 Evaluating the model with the testing data ···· 179

8.3.8 Checking the accuracy of the model ···· 180

8.3.9 Explicit and implicit feedback ···· 181

8.4 The Mahout and Spark integration ···· 181

8.4.1 Installing Mahout ···· 181

8.4.2 Exploring the Mahout shell ···· 182

8.4.3 Building a universal recommendation system with Mahout and a search tool ···· 186

8.5 Summary ···· 189

Chapter 9 Graph Analytics with GraphX ···· 190

9.1 Overview of graph processing ···· 190

9.1.1 What is a graph? ···· 191

9.1.2 Graph databases and graph processing systems ···· 191

9.1.3 Overview of GraphX ···· 192

9.1.4 Graph algorithms ···· 192

9.2 Getting started with GraphX ···· 193

9.2.1 Basic GraphX operations ···· 193

9.2.2 Transforming graphs ···· 198

9.2.3 GraphX algorithms ···· 202

9.3 Analyzing flight data with GraphX ···· 205

9.4 Overview of GraphFrames ···· 209

9.4.1 Motif (pattern) finding ···· 211

9.4.2 Loading and saving GraphFrames ···· 212

9.5 Summary ···· 212

Chapter 10 Interactive Analytics with SparkR ···· 213

10.1 Overview of R and SparkR ···· 213

10.1.1 What is R? ···· 214

10.1.2 Overview of SparkR ···· 214

10.1.3 The SparkR architecture ···· 216

10.2 Getting started with SparkR ···· 216

10.2.1 Installing and configuring R ···· 216

10.2.2 Using the SparkR shell ···· 218

10.2.3 Using SparkR scripts ···· 222

10.3 Using DataFrames with SparkR ···· 223

10.4 Using SparkR with RStudio ···· 228

10.5 Machine learning with SparkR ···· 230

10.5.1 Using the Naive Bayes model ···· 230

10.5.2 Using the k-means model ···· 232

10.6 Using SparkR with Zeppelin ···· 233

10.7 Summary ···· 234

If you would like the download link, visit the official website of the Training Center of the Institute of Computing Technology, Chinese Academy of Sciences at http://www.tcict.cn/ and request it from the WeChat customer-service account listed there!