Spark Simple Examples (Basic Operations)

Posted by Happy王子樂 on 2018-04-19

Original URL: https://blog.csdn.net/u012888052/article/details/80007402

Contents

1. Prepare the file 2. Load the file 3. Display one line 4. Using functions: (1) map (2) collect (3) filter (4) flatMap (5) union (6) join (7) lookup (8) groupByKey (9) sortByKey

1. Prepare the file

`wget http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data`

2. Load the file

`scala> val inFile = sc.textFile("/home/scipio/spam.data")`

Output

`14/06/28 12:15:34 INFO MemoryStore: ensureFreeSpace(32880) called with curMem=65736, maxMem=311387750`

`14/06/28 12:15:34 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 32.1 KB, free 296.9 MB)`

`inFile: org.apache.spark.rdd.RDD[String] = MappedRDD[7] at textFile at <console>:12`

3. Display one line

`scala> inFile.first()`

Output

`14/06/28 12:15:39 INFO FileInputFormat: Total input paths to process : 1`

`14/06/28 12:15:39 INFO SparkContext: Starting job: first at <console>:15`

`14/06/28 12:15:39 INFO DAGScheduler: Got job 0 (first at <console>:15) with 1 output partitions (allowLocal=true)`

`14/06/28 12:15:39 INFO DAGScheduler: Final stage: Stage 0(first at <console>:15)`

`14/06/28 12:15:39 INFO DAGScheduler: Parents of final stage: List()`

`14/06/28 12:15:39 INFO DAGScheduler: Missing parents: List()`

`14/06/28 12:15:39 INFO DAGScheduler: Computing the requested partition locally`

`14/06/28 12:15:39 INFO HadoopRDD: Input split: file:/home/scipio/spam.data:0+349170`

`14/06/28 12:15:39 INFO SparkContext: Job finished: first at <console>:15, took 0.532360118 s`

`res2: String = 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.778 0 0 3.756 61 278 1`

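This command shows that Spark reads a text file line by line, with each line becoming one string. The whole file is therefore represented as an RDD[String], a collection of strings that can be held in memory.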

4. Using functions

(1) map

`scala> val nums = inFile.map(x=>x.split(' ').map(_.toDouble))`

`nums: org.apache.spark.rdd.RDD[Array[Double]] = MappedRDD[8] at map at <console>:14`

`scala> nums.first()`

`14/06/28 12:19:07 INFO SparkContext: Starting job: first at <console>:17`

`14/06/28 12:19:07 INFO DAGScheduler: Got job 1 (first at <console>:17) with 1 output partitions (allowLocal=true)`

`14/06/28 12:19:07 INFO DAGScheduler: Final stage: Stage 1(first at <console>:17)`

`14/06/28 12:19:07 INFO DAGScheduler: Parents of final stage: List()`

`14/06/28 12:19:07 INFO DAGScheduler: Missing parents: List()`

`14/06/28 12:19:07 INFO DAGScheduler: Computing the requested partition locally`

`14/06/28 12:19:07 INFO HadoopRDD: Input split: file:/home/scipio/spam.data:0+349170`

`14/06/28 12:19:07 INFO SparkContext: Job finished: first at <console>:17, took 0.011412903 s`

`res3: Array[Double] = Array(0.0, 0.64, 0.64, 0.0, 0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.64, 0.0, 0.0, 0.0, 0.32, 0.0, 1.29, 1.93, 0.0, 0.96, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.778, 0.0, 0.0, 3.756, 61.0, 278.0, 1.0)`

The command above converts each line's string into an array of Doubles, so the entire dataset can be represented as a two-dimensional array of type RDD[Array[Double]].
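The one-liner `x.split(' ').map(_.toDouble)` assumes that every line is separated by exactly one space and that every token is numeric. As a minimal sketch of a more tolerant variant, the helper below splits on any run of whitespace and rejects lines that contain a non-numeric token. The name `parseLine` and the sample strings are hypothetical and not part of the original post; the code uses only the Scala standard library.

```scala
import scala.util.Try

// Hypothetical helper (not from the original post): parse one
// whitespace-separated line into doubles. Returns None if any
// token is not a valid number, so malformed lines can be dropped.
def parseLine(line: String): Option[Array[Double]] = {
  val tokens = line.trim.split("\\s+")
  val parsed = tokens.map(t => Try(t.toDouble).toOption)
  if (parsed.forall(_.isDefined)) Some(parsed.map(_.get)) else None
}

// Sample inputs, made up for illustration
val good = parseLine("0 0.64  0.32 1")   // note the double space
val bad  = parseLine("0 abc 1")          // contains a non-numeric token
```

In a Spark shell, something like `inFile.flatMap(line => parseLine(line))` should then keep only the lines that parse cleanly, though that call is untested here and is offered as an assumption rather than as the author's method.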

(2) collect

`scala> val rdd = sc.parallelize(List(1,2,3,4,5))`

`rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:12`

`scala> val mapRdd = rdd.map(2*_)`

`mapRdd: org.apache.spark.rdd.RDD[Int] = MappedRDD[10] at map at <console>:14`

`scala> mapRdd.collect`

`14/06/28 12:24:45 INFO SparkContext: Job finished: collect at <console>:17, took 1.789249751 s`

`res4: Array[Int] = Array(2, 4, 6, 8, 10)`

(3) filter

`scala> val filterRdd = sc.parallelize(List(1,2,3,4,5)).map(_*2).filter(_>5)`

`filterRdd: org.apache.spark.rdd.RDD[Int] = FilteredRDD[13] at filter at <console>:12`

`scala> filterRdd.collect`

`14/06/28 12:27:45 INFO SparkContext: Job finished: collect at <console>:15, took 0.056086178 s`

`res5: Array[Int] = Array(6, 8, 10)`

(4) flatMap

`scala> val rdd = sc.textFile("/home/scipio/README.md")`

`14/06/28 12:31:55 INFO MemoryStore: ensureFreeSpace(32880) called with curMem=98616, maxMem=311387750`

`14/06/28 12:31:55 INFO MemoryStore: Block broadcast_3 stored as values to memory (estimated size 32.1 KB, free 296.8 MB)`

`rdd: org.apache.spark.rdd.RDD[String] = MappedRDD[15] at textFile at <console>:12`

`scala> rdd.count`

`14/06/28 12:32:50 INFO SparkContext: Job finished: count at <console>:15, took 0.341167662 s`

`res6: Long = 127`

`scala> rdd.cache`

`res7: rdd.type = MappedRDD[15] at textFile at <console>:12`

`scala> rdd.count`

`14/06/28 12:33:00 INFO SparkContext: Job finished: count at <console>:15, took 0.32015745 s`

`res8: Long = 127`

`scala> val wordCount = rdd.flatMap(_.split(' ')).map(x=>(x,1)).reduceByKey(_+_)`

`wordCount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[20] at reduceByKey at <console>:14`

`scala> wordCount.collect`

`res9: Array[(String, Int)] = Array((means,1), (under,2), (this,4), (Because,1), (Python,2), (agree,1), (cluster.,1), (its,1), (YARN,,3), (have,2), (pre-built,1), (MRv1,,1), (locally.,1), (locally,2), (changed,1), (several,1), (only,1), (sc.parallelize(1,1), (This,2), (basic,1), (first,1), (requests,1), (documentation,1), (Configuration,1), (MapReduce,2), (without,1), (setting,1), ("yarn-client",1), ([params].,1), (any,2), (application,1), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (version,3), (file,1), (documentation,,1), (test,1), (MASTER,1), (entry,1), (example,3), (are,2), (systems.,1), (params,1), (scala>,1), (<artifactId>hadoop-client</artifactId>,1), (refer,1), (configure,1), (Interactive,2), (artifact,1), (can,7), (file's,1), (build,3), (when,2), (2.0.X,,1), (Apac...`

`scala> wordCount.saveAsTextFile("/home/scipio/wordCountResult.txt")`
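The word count above relies on `flatMap` rather than `map` to turn lines into words. The difference can be seen on an ordinary Scala list, which behaves analogously for this purpose. The snippet below is only a local-collection sketch with made-up sample lines; it does not use Spark.

```scala
// Two sample lines, made up for illustration
val lines = List("to be or", "not to be")

// map keeps one output element per input line:
// the result is a list of arrays, one array per line
val nested: List[Array[String]] = lines.map(_.split(' '))

// flatMap flattens the per-line arrays into a single list of words,
// which is the shape the word count needs
val words: List[String] = lines.flatMap(_.split(' '))

println(nested.map(_.toList)) // List(List(to, be, or), List(not, to, be))
println(words)                // List(to, be, or, not, to, be)
```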

(5) union

`scala> val rdd = sc.parallelize(List(('a',1),('a',2)))`

`rdd: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:12`

`scala> val rdd2 = sc.parallelize(List(('b',1),('b',2)))`

`rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:12`

`scala> rdd union rdd2`

`res3: org.apache.spark.rdd.RDD[(Char, Int)] = UnionRDD[12] at union at <console>:17`

`scala> res3.collect`

`res4: Array[(Char, Int)] = Array((a,1), (a,2), (b,1), (b,2))`

(6) join

`scala> val rdd1 = sc.parallelize(List(('a',1),('a',2),('b',3),('b',4)))`

`rdd1: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:12`

`scala> val rdd2 = sc.parallelize(List(('a',5),('a',6),('b',7),('b',8)))`

`rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:12`

`scala> rdd1 join rdd2`

`res1: org.apache.spark.rdd.RDD[(Char, (Int, Int))] = FlatMappedValuesRDD[14] at join at <console>:17`

`scala> res1.collect`

`res2: Array[(Char, (Int, Int))] = Array((b,(3,7)), (b,(3,8)), (b,(4,7)), (b,(4,8)), (a,(1,5)), (a,(1,6)), (a,(2,5)), (a,(2,6)))`
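The output above contains eight pairs because `join` matches every value on the left with every value on the right that has the same key. As a rough local-collection sketch of that behavior (plain Scala lists, not Spark, using the same sample data as the transcript):

```scala
val left  = List(('a', 1), ('a', 2), ('b', 3), ('b', 4))
val right = List(('a', 5), ('a', 6), ('b', 7), ('b', 8))

// Inner join on the key: each left value is paired with each
// right value that shares its key (2 x 2 pairs per key here).
val joined: List[(Char, (Int, Int))] =
  for {
    (k1, v1) <- left
    (k2, v2) <- right
    if k1 == k2
  } yield (k1, (v1, v2))

println(joined.size) // 8
```

The element order of this local version differs from the Spark output shown above; the order of a joined RDD should not be relied on.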

(7) lookup

`val rdd1 = sc.parallelize(List(('a',1),('a',2),('b',3),('b',4)))`

`rdd1.lookup('a')`

`res3: Seq[Int] = WrappedArray(1, 2)`

(8) groupByKey

`val wc = sc.textFile("/home/scipio/README.md").flatMap(_.split(' ')).map((_,1)).groupByKey`

`wc.collect`

`14/06/28 12:56:14 INFO SparkContext: Job finished: collect at <console>:15, took 2.933392093 s`

`res0: Array[(String, Iterable[Int])] = Array((means,ArrayBuffer(1)), (under,ArrayBuffer(1, 1)), (this,ArrayBuffer(1, 1, 1, 1)), (Because,ArrayBuffer(1)), (Python,ArrayBuffer(1, 1)), (agree,ArrayBuffer(1)), (cluster.,ArrayBuffer(1)), (its,ArrayBuffer(1)), (YARN,,ArrayBuffer(1, 1, 1)), (have,ArrayBuffer(1, 1)), (pre-built,ArrayBuffer(1)), (MRv1,,ArrayBuffer(1)), (locally.,ArrayBuffer(1)), (locally,ArrayBuffer(1, 1)), (changed,ArrayBuffer(1)), (sc.parallelize(1,ArrayBuffer(1)), (only,ArrayBuffer(1)), (several,ArrayBuffer(1)), (This,ArrayBuffer(1, 1)), (basic,ArrayBuffer(1)), (first,ArrayBuffer(1)), (documentation,ArrayBuffer(1)), (Configuration,ArrayBuffer(1)), (MapReduce,ArrayBuffer(1, 1)), (requests,ArrayBuffer(1)), (without,ArrayBuffer(1)), ("yarn-client",ArrayBuffer(1)), ([params].,Ar...`
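Unlike `reduceByKey`, `groupByKey` keeps every individual value for a key, which is why each word above is paired with a buffer of 1s rather than a count. Summing each buffer gives the same totals as `reduceByKey(_+_)`. The sketch below illustrates that relationship on a plain Scala list. It is a local analogy only, with made-up sample words, and it says nothing about how Spark distributes the work.

```scala
// Sample (word, 1) pairs, made up for illustration
val pairs = List("a", "b", "a", "c", "a").map((_, 1))

// groupByKey analogue: collect all values that share a key
val grouped: Map[String, List[Int]] =
  pairs.groupBy(_._1).map { case (k, ps) => (k, ps.map(_._2)) }

// Summing each group yields the per-key totals
val summed: Map[String, Int] =
  grouped.map { case (k, vs) => (k, vs.sum) }

// reduceByKey analogue: fold each value into a running total
// per key without building the intermediate lists
val reduced: Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0) + v)
  }

println(grouped("a"))      // List(1, 1, 1)
println(summed == reduced) // true
```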

(9) sortByKey

`val rdd = sc.textFile("/home/scipio/README.md")`

`val wordcount = rdd.flatMap(_.split(' ')).map((_,1)).reduceByKey(_+_)`

`val wcsort = wordcount.map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1))`

`wcsort.saveAsTextFile("/home/scipio/sort.txt")`

For ascending order, use sortByKey(true).
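The `wcsort` line sorts by count using a swap, sort, swap-back pattern: the (word, count) pairs are flipped so that the count becomes the key, sorted by that key in descending order, then flipped back. The same idea can be traced on a plain Scala list. This is a local sketch with made-up counts, not Spark code.

```scala
// Sample (word, count) pairs, made up for illustration
val wordCount = List(("spark", 3), ("scala", 7), ("rdd", 1))

// Swap so the count is the key, sort by it in descending order,
// then swap back to (word, count)
val swapSorted: List[(String, Int)] =
  wordCount.map(_.swap).sortBy(_._1)(Ordering[Int].reverse).map(_.swap)

// Sorting directly on the count gives the same result
val direct: List[(String, Int)] = wordCount.sortBy(-_._2)

println(swapSorted) // List((scala,7), (spark,3), (rdd,1))
```

On Spark versions whose RDD API provides `sortBy`, a call such as `wordcount.sortBy(_._2, ascending = false)` may avoid the two extra `map` steps; whether it is available depends on the Spark version in use.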
