Spark: The Spark Shell

Posted by 平生若夢為歡幾何 on 2018-09-13

Original URL: https://flycode.co/archives/247913

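Preface: If you want to learn Spark programming, it is worth starting with spark-shell, whose interactive style deepens your understanding of how Spark programs are written. spark-shell offers a simple way to learn the API and is also a powerful tool for interactive data analysis. It can be used with Scala (which runs on the Java virtual machine and can therefore use existing Java libraries) or with Python (through the pyspark shell).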

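1. Starting spark-shell

Under the hood, spark-shell launches the application by calling the spark-submit script. Inside spark-shell, a SparkContext object named sc has already been created. To run spark-shell locally on 4 CPU cores, use: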

spark-shell --master local[4]

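To add JAR files, pass their paths:

spark-shell --master local[4] --jars xxx.jar,yyy.jar

--master sets the master that the context connects to and obtains resources from. Its value is the address of a Spark standalone cluster (spark://host:port), the URL of a Mesos cluster (mesos://host:port), yarn for a YARN cluster, or a local setting such as local[4].

--jars adds the JAR files the application needs; separate multiple JARs with commas.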
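The same master setting can also be given in code when a program builds its own context. The following is a minimal sketch, assuming Spark is on the classpath and the snippet runs as its own application; the application name is a made-up placeholder. Inside spark-shell, getOrCreate would hand back the existing sc and ignore these settings.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Programmatic counterpart of `spark-shell --master local[4]`.
// "master-demo" is a hypothetical application name.
val conf = new SparkConf()
  .setAppName("master-demo")
  .setMaster("local[4]") // run locally with 4 worker threads

// Creates a new context, or returns the one that already exists.
val sc = SparkContext.getOrCreate(conf)

println(sc.master)             // the master URL in effect
println(sc.defaultParallelism) // default number of partitions for parallelize
```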

2. Loading a text file

Once sc exists, you can load a local file to create an RDD. The test here loads README.md, the local file that ships with Spark, and gets back a MapPartitionsRDD.

scala> val textFile = sc.textFile("file:///opt/cloud/spark-2.1.1-bin-hadoop2.7/README.md");
textFile: org.apache.spark.rdd.RDD[String] = file:///opt/cloud/spark-2.1.1-bin-hadoop2.7/README.md MapPartitionsRDD[9] at textFile at <console>:24

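HDFS files and local files are both loaded with textFile; the only difference is the prefix (hdfs:// or file://) that identifies the file system. In both cases textFile first builds a HadoopRDD of the file's records and then maps it to a MapPartitionsRDD of strings, which is why the shell reports a MapPartitionsRDD. For more on mapPartitions, see the article "MapPartition和Map的區別" (the difference between mapPartitions and map).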
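One way to check which RDDs sit behind textFile is to print the lineage with toDebugString. The sketch below assumes Spark is on the classpath; it writes its own small temporary file (made-up content) so that it does not depend on any install path, and the application name is a placeholder.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lineage-demo").setMaster("local[2]"))

// Write a small throwaway file with two lines of sample text.
val path = Files.createTempFile("readme-demo", ".md")
Files.write(path, "# Apache Spark\nSpark is fast\n".getBytes(StandardCharsets.UTF_8))

// path.toUri yields a file:// URI, the same prefix style as in the text.
val textFile = sc.textFile(path.toUri.toString)

// toDebugString lists the RDD together with the parent RDDs it derives from.
val lineage = textFile.toDebugString
println(lineage)
```

The printed lineage should list a MapPartitionsRDD with a HadoopRDD as its parent, even though the file is local.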

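3. Simple RDD operations

An RDD supports transformations, which return a new RDD, and actions, which return a result. first returns the first line of the file, and count returns the total number of lines.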

scala> textFile.first();
res6: String = # Apache Spark

scala> textFile.count();
res7: Long = 104
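Transformations are lazy: Spark only records them, and nothing runs until an action is called. A small way to see this, sketched under the assumption that Spark is on the classpath, is to point textFile at a path that does not exist (the path below is deliberately hypothetical). Defining the RDD and filtering it succeed; the error only appears when count runs.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lazy-demo").setMaster("local[2]"))

// Hypothetical path that is assumed not to exist on this machine.
val missing = sc.textFile("file:///no/such/dir/README.md")

// Transformation: only describes a new RDD, so the missing file is not noticed yet.
val filtered = missing.filter(_.contains("Spark"))

// Action: Spark now tries to read the input, and the job fails.
val result = Try(filtered.count())
println(result.isFailure)
```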

Next comes a transformation: filter extracts a subset of the lines of README.md and returns a new RDD (a MapPartitionsRDD in this Spark version, as the output shows).

scala> val textFilter = textFile.filter(line=>line.contains("Spark"));
textFilter: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[16] at filter at <console>:26

Transformations and actions can be chained, for example to count the lines that contain the string "Spark".

scala> textFile.filter(line=>line.contains("Spark")).count();
res10: Long = 20

4. Simple RDD applications

(1) Find the word count of the line with the most words

scala> textFile.map(line =>line.split(" ").size).reduce((a,b) => if (a > b) a else b);
res11: Int = 22

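Each line is first split on spaces and its words are counted, which creates a new RDD of per-line word counts; reduce is then applied to that RDD to return the maximum.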
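The two steps can be tried on a few in-memory lines instead of a file. This is a sketch with made-up sample lines, assuming Spark is on the classpath; it uses math.max in place of the if/else expression, which is equivalent here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("max-words-demo").setMaster("local[2]"))

// Made-up sample lines standing in for README.md.
val lines = sc.parallelize(Seq(
  "# Apache Spark",
  "",
  "Spark is a fast and general cluster computing system"))

// Step 1: words per line. "".split(" ") yields Array(""), so a blank line counts as 1.
val wordsPerLine = lines.map(_.split(" ").length)

// Step 2: reduce the per-line counts to their maximum.
val maxWords = wordsPerLine.reduce((a, b) => math.max(a, b))
println(maxWords)
```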

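(2) Count words

WordCount is one of the most popular introductory programs in big data processing, and Spark makes it easy to implement.

// This returns a ShuffledRDD of (String, Int) key-value pairs (reduceByKey needs a shuffle, so the result is a ShuffledRDD); collect then gathers the counts on the driver.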
scala> val wordCount = textFile.flatMap(line =>line.split(" ")).map(x => (x,1)).reduceByKey((a,b) => a+b);
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[23] at reduceByKey at <console>:26
scala> wordCount.collect
res12: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1), (general,3), (have,1), (pre-built,1), (YARN,,1), ([http://spark.apache.org/developer-tools.html](the,1), (changed,1), (locally,2), (sc.parallelize(1,1), (only,1), (locally.,1), (several,1), (This,2), (basic,1), (Configuration,1), (learning,,1), (documentation,3), (first,1), (graph,1), (Hive,2), (info,1), (["Specifying,1), ("yarn",1), ([params]`.,1), ([project,1), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (["Parallel,1), (ar...

// The placeholder _ makes the expression more concise and is a characteristic feature of the Scala language; each _ stands for one parameter.
scala> val wordCount2 = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_);
wordCount2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[26] at reduceByKey at <console>:26
scala> wordCount2.collect
res14: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1), (general,3), (have,1), (pre-built,1), (YARN,,1), ([http://spark.apache.org/developer-tools.html](the,1), (changed,1), (locally,2), (sc.parallelize(1,1), (only,1), (locally.,1), (several,1), (This,2), (basic,1), (Configuration,1), (learning,,1), (documentation,3), (first,1), (graph,1), (Hive,2), (info,1), (["Specifying,1), ("yarn",1), ([params]`.,1), ([project,1), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (["Parallel,1), (ar...
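To make the placeholder syntax concrete, the sketch below writes each of the three functions from the word count twice, once with named parameters and once with _. It is plain Scala and needs no Spark; all value names are made up for illustration.

```scala
// Splitting a line: one parameter, one placeholder.
val splitExplicit: String => Array[String] = line => line.split(" ")
val splitShort: String => Array[String] = _.split(" ")

// Pairing a word with 1: the placeholder sits inside the tuple.
val pairExplicit: String => (String, Int) = x => (x, 1)
val pairShort: String => (String, Int) = (_, 1)

// Adding two counts: the first _ is the first parameter, the second _ the second.
val addExplicit: (Int, Int) => Int = (a, b) => a + b
val addShort: (Int, Int) => Int = _ + _

println(splitShort("to be").toList)
println(pairShort("Spark"))
println(addShort(2, 3))
```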

// Spark does not sort the result by default. To get sorted output, swap key and value, call sortByKey with true (ascending) or false (descending), then swap them back.
scala> val wordCount3 = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1));
wordCount3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[34] at map at <console>:26

scala> wordCount3.collect
res15: Array[(String, Int)] = Array(("",71), (the,24), (to,17), (Spark,16), (for,12), (##,9), (and,9), (a,8), (can,7), (run,7), (on,7), (is,6), (in,6), (using,5), (of,5), (build,4), (Please,4), (with,4), (also,4), (if,4), (including,4), (an,4), (You,4), (you,4), (general,3), (documentation,3), (example,3), (how,3), (one,3), (For,3), (use,3), (or,3), (see,3), (Hadoop,3), (Python,2), (locally,2), (This,2), (Hive,2), (SparkPi,2), (refer,2), (Interactive,2), (Scala,2), (detailed,2), (return,2), (Shell,2), (class,2), (Python,,2), (set,2), (building,2), (SQL,2), (guidance,2), (cluster,2), (shell:,2), (supports,2), (particular,2), (following,2), (which,2), (should,2), (To,2), (be,2), (do,2), (./bin/run-example,2), (It,2), (1000:,2), (tests,2), (examples,2), (at,2), (`examples`,2), (that,2), (H...
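The swap, sort, swap sequence can also be written with RDD.sortBy, which sorts by a key function directly. The sketch below is an alternative under stated assumptions, not the original author's code: it uses made-up input, assumes Spark is on the classpath, and also filters out the empty tokens that show up as ("",71) in the output above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sortby-demo").setMaster("local[2]"))

// Made-up input; the double space in the first line produces an empty token.
val lines = sc.parallelize(Seq("a b  a", "c a b"))

val sorted = lines
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)              // drop empty tokens
  .map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false) // sort by count, highest first

val result = sorted.collect().toList
println(result)
```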

5. RDD caching

Use the RDD's cache() method.
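A minimal sketch of cache() in use, with made-up sample lines and assuming Spark is on the classpath: cache() only marks the RDD for in-memory storage, the first action fills the cache, and later actions can reuse it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuses the shell's context if one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("cache-demo").setMaster("local[2]"))

// Made-up sample lines standing in for README.md.
val lines = sc.parallelize(Seq("# Apache Spark", "", "Spark is fast", "shell"))
val sparkLines = lines.filter(_.contains("Spark"))

// cache() marks the RDD for in-memory storage; nothing is computed yet.
sparkLines.cache()

val first = sparkLines.count()  // first action computes the RDD and stores its partitions
val second = sparkLines.count() // later actions can reuse the stored partitions

// cache() is shorthand for persisting with the MEMORY_ONLY storage level.
val level = sparkLines.getStorageLevel
println(level == StorageLevel.MEMORY_ONLY)

// Release the cached partitions once they are no longer needed.
sparkLines.unpersist()
```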
