The Way of Spark Mastery (Advanced): Spark from Beginner to Expert, Section 4: The Spark Programming Model (Part 1)

Posted by 五柳-先生 on 2015-11-14

Topics in this section

  1. Important Spark concepts
  2. Resilient Distributed Dataset (RDD) basics

1. Important Spark concepts

Parts of this section are drawn from the official documentation: http://spark.apache.org/docs/latest/cluster-overview.html

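(1) Spark run modes

The most commonly used Spark run modes are: 
- local: runs Spark with local threads on a single machine; mainly used for developing and debugging Spark applications 
- Standalone: runs a Spark cluster with Spark's built-in resource manager and scheduler in a Master/Slave architecture; to remove the Master as a single point of failure, ZooKeeper can be used to provide High Availability (HA) 
- Apache Mesos: runs on top of the Mesos resource management framework; resource management is delegated to Mesos, and Spark handles only task scheduling and computation 
- Hadoop YARN: the cluster runs on the YARN resource manager; resource management is delegated to YARN, and Spark handles only task scheduling and computation 

Of these run modes, Hadoop YARN is the most widely used, and the first section of this course set up its Spark cluster on Hadoop YARN. Running on YARN lets Spark share cluster resources and data with the rest of the Hadoop ecosystem.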
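As a rough illustration of how these run modes are selected in practice, each mode corresponds to a master URL given to `SparkConf.setMaster` or to `spark-submit --master`. The sketch below is plain Scala with no Spark dependency; `RunMode`, `MasterUrls`, and the host names and ports in the usage note are hypothetical names introduced for illustration. The bare `yarn` master is the form used by Spark 2.x and later, while Spark 1.x releases such as the 1.5.0 used in this course expect `yarn-client` or `yarn-cluster`.

```scala
// Hypothetical helper types: one case per run mode listed above.
sealed trait RunMode
final case class Local(threads: Option[Int]) extends RunMode            // None = use all cores
final case class Standalone(masters: Seq[(String, Int)]) extends RunMode // several masters = ZooKeeper HA
final case class Mesos(host: String, port: Int) extends RunMode
case object Yarn extends RunMode

object MasterUrls {
  // Builds the master URL string for a given run mode.
  def masterUrl(mode: RunMode): String = mode match {
    case Local(Some(n)) => s"local[$n]"
    case Local(None)    => "local[*]"
    case Standalone(ms) => ms.map { case (h, p) => s"$h:$p" }.mkString("spark://", ",", "")
    case Mesos(h, p)    => s"mesos://$h:$p"
    case Yarn           => "yarn" // Spark 1.x: "yarn-client" or "yarn-cluster"
  }

  def main(args: Array[String]): Unit = {
    println(masterUrl(Local(Some(2))))
    println(masterUrl(Standalone(Seq(("master1", 7077), ("master2", 7077)))))
    println(masterUrl(Mesos("mesos-master", 5050)))
    println(masterUrl(Yarn))
  }
}
```

The resulting string would typically be passed along the lines of `new SparkConf().setMaster(MasterUrls.masterUrl(Local(Some(2))))`; listing two masters in the Standalone URL is how a driver addresses a ZooKeeper-backed HA pair.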

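(2) Spark components

When a complete Spark application, such as the SparkWordCount program from the previous section, is submitted to a cluster, it involves the components shown in the figure below: 
[Figure: components of a Spark application running on a cluster]

Each Spark application runs as an independent set of processes on the cluster, coordinated by the SparkContext object. The SparkContext is the entry point of a Spark application and lives in the application's main program, which is called the driver program. The SparkContext can connect to several kinds of cluster managers, such as Hadoop YARN or Mesos, which allocate the resources the application needs. Once resources are granted, the SparkContext acquires executors on the worker nodes of the cluster. Each Spark application has its own executors, which are separate processes that provide distributed computation and data storage for that application. The SparkContext then ships the application code to the executors, and finally sends tasks to the executors to run.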
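The flow described above can be sketched as a minimal driver program. This is an illustrative sketch, not code from the course: the object name `MinimalDriver`, the helper `sumOfDoubles`, and the `local[2]` master are assumptions chosen so that the example runs on a single machine with Spark on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  // Runs a tiny job and returns its result to the driver.
  def sumOfDoubles(sc: SparkContext): Int =
    sc.parallelize(1 to 100, 4) // 4 partitions, so 4 tasks
      .map(_ * 2)               // this closure is shipped to the executors and runs there
      .reduce(_ + _)            // action: partial results are sent back to the driver

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for experimenting on one machine; on a real
    // cluster the master is normally supplied through spark-submit instead.
    val conf = new SparkConf().setAppName("MinimalDriver").setMaster("local[2]")
    val sc = new SparkContext(conf) // the driver's connection to the cluster manager
    try println(sumOfDoubles(sc))   // prints 10100
    finally sc.stop()               // releases the executors
  }
}
```

Everything outside the closures passed to `map` and `reduce` runs in the driver; only those closures, packaged as tasks, run on executors.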

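| Term | Meaning |
| --- | --- |
| Application | A user program built on Spark; consists of a driver program (which holds the SparkContext object) and multiple executor processes on the cluster |
| Application jar | A jar containing the user's Spark application. A jar submitted to the cluster does not need to bundle the Spark libraries, because they are provided at runtime |
| Driver program | The process that runs the application's main method and creates the SparkContext object |
| Cluster manager | The service that manages cluster resources, for example Mesos or Hadoop YARN |
| Deploy mode | Distinguishes where the driver program runs: in cluster mode the driver is launched inside the cluster; in client mode the driver is launched outside the cluster |
| Worker node | Any node in the cluster that can run Spark application code |
| Executor | A process on a worker node that runs the application's tasks and keeps their data in memory or on disk. Each Spark application has its own executors |
| Task | A unit of work that runs in an executor. A Spark application is ultimately broken down into an optimized set of tasks (covered in detail in the next section) |
| Job | A parallel computation made up of multiple tasks, spawned in response to a Spark action such as collect or save |
| Stage | Each job is divided into smaller sets of tasks called stages, which depend on each other (similar to the map and reduce stages in MapReduce). Because a stage is a set of tasks, it is also called a TaskSet |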
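To make the Job, Stage, and Task terms concrete, the following sketch runs a small word count in local mode. It is only an illustration under stated assumptions: `StageDemo`, `wordCounts`, the sample data, and the `local[2]` master are made up for this example, and Spark 1.3 or later is assumed so that `reduceByKey` is available without extra imports.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def wordCounts(sc: SparkContext): Map[String, Int] = {
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "a"), 2) // 2 partitions
    val pairs  = words.map(w => (w, 1))    // narrow dependency: same stage as parallelize
    val counts = pairs.reduceByKey(_ + _)  // shuffle dependency: begins a new stage
    // Each stage runs one task per partition of the RDD it computes.
    println(s"first stage tasks: ${pairs.partitions.length}")
    println(counts.toDebugString)          // the indentation step marks the stage boundary
    counts.collect().toMap                 // action: submits a single job
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageDemo").setMaster("local[2]"))
    try println(wordCounts(sc)) finally sc.stop()
  }
}
```

Here the single `collect` call produces one job, the shuffle introduced by `reduceByKey` splits that job into two stages, and each stage is executed as a set of tasks, one per partition.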

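2. Resilient Distributed Dataset (RDD) basics

Resilient Distributed Datasets (RDDs) were proposed in 2011 by researchers at UC Berkeley. The original paper is titled "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". It is well worth reading as the primary source on RDDs, and most of this section is based on it.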

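(1) RDD design goals

RDDs are designed to let parallel computations reuse intermediate results efficiently and to support a simpler programming model, while retaining the fault tolerance, efficient scheduling, and scalability of parallel frameworks such as MapReduce. Fault tolerance is achieved by recording the lineage of each RDD, that is, the chain of transformations that produced it; when data is lost, the affected partitions are recomputed from the lineage. RDDs are best suited to data mining, machine learning, and graph computation, because these applications involve a large amount of iterative computation, and keeping data in memory greatly improves their performance in a distributed environment. RDDs are not suitable for tasks that need frequent updates to shared state, such as a distributed web crawler.

The following shows how to view an RDD's lineage in spark-shell: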
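The idea of recovering from lineage can be sketched with a toy model in plain Scala. This is not Spark's implementation; `ToyRdd`, `Source`, `Mapped`, and `Filtered` are hypothetical types invented here to show that a dataset which remembers its parent and its deriving function can rebuild any single partition on demand.

```scala
// Toy model of lineage: each dataset knows how to recompute one partition.
sealed trait ToyRdd[T] {
  def compute(partition: Int): Vector[T] // rebuild a partition from the lineage
  def lineage: String                    // human-readable chain of parents
}

final case class Source[T](parts: Vector[Vector[T]]) extends ToyRdd[T] {
  def compute(partition: Int): Vector[T] = parts(partition) // stands in for stable storage
  def lineage: String = "Source"
}

final case class Mapped[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
  def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
  def lineage: String = s"Mapped <- ${parent.lineage}"
}

final case class Filtered[T](parent: ToyRdd[T], p: T => Boolean) extends ToyRdd[T] {
  def compute(partition: Int): Vector[T] = parent.compute(partition).filter(p)
  def lineage: String = s"Filtered <- ${parent.lineage}"
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val source = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
    val result = Filtered(Mapped(source, (x: Int) => x * 10), (x: Int) => x > 20)
    println(result.lineage)    // Filtered <- Mapped <- Source
    // Pretend the in-memory copy of partition 1 was lost: only that partition
    // is rebuilt, by replaying map and filter over the matching source partition.
    println(result.compute(1)) // Vector(40, 50, 60)
  }
}
```

No data is replicated in this model; the recorded chain of functions is enough to regenerate a lost partition, which is the trade-off that lineage-based fault tolerance makes.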

<code class="hljs livecodeserver has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">// textFile reads README.md from the HDFS root directory; filter keeps only the lines that contain "Spark"
scala> val rdd2=sc.textFile("/README.md").filter(line => line.contains("Spark"))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:21
// toDebugString prints the lineage of the RDD
// textFile produces two RDDs, a HadoopRDD and a MapPartitionsRDD,
// and filter produces a further MapPartitionsRDD
scala> rdd2.toDebugString
15/09/20 01:35:27 INFO mapred.FileInputFormat: Total input paths to process : 1
res0: String = 
(2) MapPartitionsRDD[2] at filter at <console>:21 []
 |  MapPartitionsRDD[1] at textFile at <console>:21 []
 |  /README.md HadoopRDD[0] at textFile at <console>:21 []</code>

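(2) The RDD abstraction

In Spark, an RDD is a read-only (immutable, much like a Scala val), partitioned collection of records. There are only two ways to create an RDD: (1) from data in storage; (2) from other RDDs. Creation from storage covers several sources: a local file system, a distributed file system, or a collection held in the memory of the driver program. 
The following code creates an RDD from HDFS: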

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> sc.textFile("/README.md")
res1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at textFile at <console>:22</code>

The following code creates an RDD from an in-memory collection:

<code class="hljs haskell has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">// define an array in memory
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
// create a ParallelCollectionRDD with the parallelize method
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:23
</code>
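Since an RDD is a partitioned collection, it can help to see how `parallelize` splits an in-memory collection. The sketch below is an illustration only: `PartitionDemo`, `partitionsOf`, and the `local[2]` master are assumptions for a single-machine run, while `parallelize` with an explicit partition count and `glom` are standard RDD methods.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  // Returns the contents of each partition as a separate array.
  def partitionsOf(sc: SparkContext): Array[Array[Int]] =
    sc.parallelize(1 to 9, 3) // second argument: number of partitions
      .glom()                 // turns every partition into a single Array
      .collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionDemo").setMaster("local[2]"))
    try {
      // prints [1, 2, 3], then [4, 5, 6], then [7, 8, 9]
      partitionsOf(sc).foreach(p => println(p.mkString("[", ", ", "]")))
    } finally sc.stop()
  }
}
```

When the partition count is omitted, as in `sc.parallelize(data)` above, Spark falls back to its default parallelism setting for the number of partitions.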

The following code creates a new RDD from an existing RDD:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">// filter transforms the distData RDD into a new RDD
scala> val distDataFiltered=distData.filter(e=>e>2)
distDataFiltered: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at filter at <console>:25
// trigger an action (covered later) to view the filtered contents
// note that collect is only suitable for small amounts of data
scala> distDataFiltered.collect
res3: Array[Int] = Array(3, 4, 5)
</code>

(3) The RDD programming model

The earlier examples have already shown how to program with RDDs. Recall this code from above:

<code class="hljs fsharp has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">// filter transforms the distData RDD into a new RDD
scala> val distDataFiltered=distData.filter(e=>e>2)
// trigger an action (covered later) to view the filtered contents
// note that collect is only suitable for small amounts of data
scala> distDataFiltered.collect</code>

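This code already illustrates the core idea of the RDD programming model: "filter transforms the distData RDD into a new RDD" and "trigger an action". In other words, RDD operations fall into two kinds: transformations and actions.

A transformation turns one RDD into a new RDD. Note that all transformations are lazy. As anyone familiar with lazy in Scala will expect, a transformation does not run immediately; Spark only remembers the transformation applied to the dataset and runs it when the result is actually needed. For example, the transformation distData.filter(e=>e>2) is not executed when it is declared, but only when distDataFiltered.collect is called, as shown in the figure below: 
[Figure: execution triggered by distDataFiltered.collect] 
As the figure shows, the transformation is executed only after distDataFiltered.collect is called.

As the description of transformations implies, an action is what triggers the actual execution of the program. An action either returns the result to the driver program, as collect does, or saves the result, as the saveAsTextFile method does in SparkWordCount.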
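The lazy behaviour described above can be imitated in plain Scala, without Spark, by using an `Iterator`, whose `filter` is also deferred until the elements are consumed. This is only an analogy and not how Spark schedules work; `LazyDemo` and its counter are hypothetical names for this sketch.

```scala
object LazyDemo {
  // Returns (predicate calls before the "action", calls after it, final result).
  def run(): (Int, Int, List[Int]) = {
    var evaluated = 0 // counts how many times the predicate has run
    // "Transformation": only records the step, nothing is evaluated yet.
    val filtered = Iterator(1, 2, 3, 4, 5).filter { e => evaluated += 1; e > 2 }
    val before = evaluated       // still 0 at this point
    // "Action": consuming the iterator forces the filter to run.
    val result = filtered.toList
    (before, evaluated, result)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, result) = run()
    println(s"predicate calls before consuming: $before") // 0
    println(s"predicate calls after consuming: $after")   // 5
    println(result)                                       // List(3, 4, 5)
  }
}
```

In Spark the same separation applies at cluster scale: `filter` only extends the lineage, and `collect` is the point at which tasks are actually submitted.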

The transformations supported in Spark 1.5.0 include:

(1) map 
Method signature:

<code class="hljs markdown has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U]</code>

Usage example:

<code class="hljs php has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1=sc.parallelize(Array(1,2,3,4)).map(x=>2*x).collect
rdd1: Array[Int] = Array(2, 4, 6, 8)</code>

(2) filter 
Method signature:

<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">/**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T]</code>

Usage example:

<code class="hljs vbscript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1=sc.parallelize(Array(1,2,3,4)).filter(x=>x>1).collect
rdd1: Array[Int] = Array(2, 3, 4)
</code>

(3) flatMap 
Method signature:

<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">/**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]</code>

Usage example:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala>  val data =Array(Array(1, 2, 3, 4, 5),Array(4,5,6))
data: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(4, 5, 6))

scala> val rdd1=sc.parallelize(data)
rdd1: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[2] at parallelize at <console>:23

scala> val rdd2=rdd1.flatMap(x=>x.map(y=>y))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at flatMap at <console>:25

scala> rdd2.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6)
</code>
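The difference between map and flatMap is easiest to see when both are applied with the same function. The following sketch is illustrative only: `MapVsFlatMap`, `run`, the sample lines, and the `local[2]` master are assumptions made for a single-machine run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapVsFlatMap {
  // Returns (number of words in each line via map, all words via flatMap).
  def run(sc: SparkContext): (Array[Int], Array[String]) = {
    val lines = sc.parallelize(Seq("hello spark", "hello rdd"))
    val viaMap     = lines.map(_.split(" "))     // RDD[Array[String]]: one array per line
    val viaFlatMap = lines.flatMap(_.split(" ")) // RDD[String]: the arrays are flattened
    (viaMap.map(_.length).collect(), viaFlatMap.collect())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapVsFlatMap").setMaster("local[2]"))
    try {
      val (lengths, words) = run(sc)
      println(lengths.mkString(", ")) // 2, 2
      println(words.mkString(", "))   // hello, spark, hello, rdd
    } finally sc.stop()
  }
}
```

map always yields exactly one output element per input element, whereas flatMap may yield zero or more, which is why it is the usual first step when splitting lines into words.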

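(4) mapPartitions(func) 
This mapPartitions example comes from: https://www.zybuluo.com/jewes/note/35032 
mapPartitions is a variant of map. The function passed to map is applied to each element of the RDD, whereas the function passed to mapPartitions is applied to each partition, so the contents of a partition are processed as a whole. It is defined as:

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Here f is the input function that processes the contents of each partition. The contents of a partition are passed to f as an Iterator[T], and f returns an Iterator[U]. The resulting RDD is made up of the results produced by applying f to every partition.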
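A small sketch of the per-partition behaviour described above: summing each partition yields one value per partition rather than one per element. `PartitionSums`, `sums`, and the `local[2]` master are assumptions for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSums {
  // Computes one sum per partition of the numbers 1 to 9 split into 3 partitions.
  def sums(sc: SparkContext): Array[Int] =
    sc.parallelize(1 to 9, 3)
      .mapPartitions(iter => Iterator(iter.sum)) // f is called once per partition
      .collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionSums").setMaster("local[2]"))
    try println(sums(sc).mkString(", ")) // 6, 15, 24
    finally sc.stop()
  }
}
```

Because f sees the whole partition at once, mapPartitions is a common choice when some setup cost, such as opening a connection, should be paid once per partition instead of once per element.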

<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val a = sc.parallelize(1 to 9, 3)
scala> def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = {
    // Pair each element with its successor inside one partition.
    // Assumes the partition is non-empty: iter.next fails on an empty iterator.
    var res = List[(T, T)]()
    var pre = iter.next
    while (iter.hasNext) {
        val cur = iter.next
        res .::= (pre, cur)   // prepends, so pairs come out in reverse order
        pre = cur
    }
    res.iterator
}
scala> a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))</code>

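The function myfunc in the example above pairs each element of a partition with the element that follows it, producing a tuple. The last element of a partition has no successor, and pairs never span two partitions, so (3,4) and (6,7) do not appear in the result.
mapPartitions has further variants. mapPartitionsWithContext passes state information about the running task to the user-supplied function (later Spark 1.x releases mark it as deprecated in favor of TaskContext.get). mapPartitionsWithIndex passes the index of the partition to the user-supplied function.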
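
The per-partition behaviour described above can be reproduced without a cluster. The following is a minimal sketch in plain Scala, not Spark code: partitions are modelled as a `Vector[Vector[T]]`, and the names `split`, `mapPartitionsModel`, and `adjacentPairs` are hypothetical helpers introduced only for illustration. The slicing in `split` is an assumption that happens to reproduce the three partitions (1,2,3), (4,5,6), (7,8,9) used in the example.

```scala
object MapPartitionsSketch {
  // Hypothetical helper: cut a sequence into n contiguous, roughly equal slices.
  def split[T](data: Seq[T], n: Int): Vector[Vector[T]] =
    (0 until n).toVector.map { i =>
      val start = (i.toLong * data.length / n).toInt
      val end = ((i + 1).toLong * data.length / n).toInt
      data.slice(start, end).toVector
    }

  // Model of mapPartitions: call f once per partition, then concatenate the results.
  def mapPartitionsModel[T, U](parts: Vector[Vector[T]])(f: Iterator[T] => Iterator[U]): Vector[U] =
    parts.flatMap(p => f(p.iterator))

  // Lazily pair each element with its successor; returns nothing for an empty partition.
  def adjacentPairs[T](iter: Iterator[T]): Iterator[(T, T)] =
    if (!iter.hasNext) Iterator.empty
    else {
      var prev = iter.next()
      iter.map { cur =>
        val pair = (prev, cur)
        prev = cur
        pair
      }
    }

  def main(args: Array[String]): Unit = {
    val parts = split(1 to 9, 3)
    println(mapPartitionsModel(parts)(it => adjacentPairs(it)))
  }
}
```

Unlike myfunc, which prepends to a List and therefore returns the pairs of each partition in reverse, `adjacentPairs` emits them in input order and does not fail on an empty partition. The pairs (3,4) and (6,7) are still missing, because the function never sees two partitions at once.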

(5) mapPartitionsWithIndex

mapPartitionsWithIndex is a variant of mapPartitions that additionally passes the partition index to the input function. Its signature is:

def mapPartitionsWithIndex[U: ClassTag]( 
f: (Int, Iterator[T]) => Iterator[U], 
preservesPartitioning: Boolean = false): RDD[U]

<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val a = sc.parallelize(1 to 9, 3)
// The function also receives the partition index; the first element of each result tuple is that index.
// The index is always an Int, so it is typed as Int rather than as the element type T.
scala> def myfunc[T](index: Int, iter: Iterator[T]) : Iterator[(Int, T, T)] = {
    var res = List[(Int, T, T)]()
    var pre = iter.next
    while (iter.hasNext) {
        val cur = iter.next
        res .::= (index, pre, cur)
        pre = cur
    }
    res.iterator
}
scala> a.mapPartitionsWithIndex(myfunc).collect
res11: Array[(Int, Int, Int)] = Array((0,2,3), (0,1,2), (1,5,6), (1,4,5), (2,8,9), (2,7,8))
</code>
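
One frequent use of the partition index is to tag every element with the partition it lives in, which makes the data layout visible. The sketch below models this in plain Scala rather than Spark: the partitions are written out by hand as a `Vector[Vector[Int]]`, and `mapPartitionsWithIndexModel` is a hypothetical stand-in for the real method, written only to illustrate the calling convention `(Int, Iterator[T]) => Iterator[U]`.

```scala
object MapPartitionsWithIndexSketch {
  // Model of mapPartitionsWithIndex: f receives the partition index and the
  // partition's iterator; the per-partition results are concatenated in order.
  def mapPartitionsWithIndexModel[T, U](parts: Vector[Vector[T]])(
      f: (Int, Iterator[T]) => Iterator[U]): Vector[U] =
    parts.zipWithIndex.flatMap { case (p, idx) => f(idx, p.iterator) }

  def main(args: Array[String]): Unit = {
    // The same three partitions as in the example: 1 to 9 split three ways.
    val parts = Vector(Vector(1, 2, 3), Vector(4, 5, 6), Vector(7, 8, 9))

    // Tag each element with the index of the partition that holds it.
    val tagged = mapPartitionsWithIndexModel(parts)((idx, it) => it.map(x => (idx, x)))
    println(tagged)

    // Count the elements per partition, a quick check for skewed data.
    val sizes = mapPartitionsWithIndexModel(parts)((idx, it) => Iterator((idx, it.size)))
    println(sizes)
  }
}
```

With a real RDD the same two lambdas could be passed to mapPartitionsWithIndex directly, followed by collect.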


(6) sample
Method signature:

<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"> /**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
   * @param fraction expected size of the sample as a fraction of this RDD's size
   *  without replacement: probability that each element is chosen; fraction must be [0, 1]
   *  with replacement: expected number of times each element is chosen; fraction must be >= 0
   * @param seed seed for the random number generator
   */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] </code>

Usage example:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:21

// With replacement: the same element may appear several times
scala> val sampledA=a.sample(true,0.5)
sampledA: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[13] at sample at <console>:23
scala> sampledA.collect
res12: Array[Int] = Array(3, 3, 3, 5, 6, 8, 8)

// Without replacement: each element appears at most once
scala> val sampledA2=a.sample(false,0.5).collect
sampledA2: Array[Int] = Array(1, 4)</code>
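
The two meanings of fraction in the scaladoc explain why the first sample above contains 3 three times while the second contains no duplicates. The sketch below is a plain Scala model of that behaviour, not Spark's implementation: it assumes a Bernoulli trial per element when sampling without replacement and a Poisson-distributed copy count per element when sampling with replacement. The names `sampleModel` and `poisson` are hypothetical, and because the random number generator differs from Spark's, the same seed will not reproduce Spark's output.

```scala
import scala.util.Random

object SampleSketch {
  // Draw from a Poisson distribution with the given mean (Knuth's method,
  // adequate for the small means used here).
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Model of sample(withReplacement, fraction, seed) on a local sequence.
  def sampleModel[T](data: Seq[T], withReplacement: Boolean, fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    if (withReplacement)
      // fraction = expected number of copies of each element
      data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
    else
      // fraction = probability that each element is kept
      data.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    println(sampleModel(1 to 9, withReplacement = true, fraction = 0.5, seed = 1L))
    println(sampleModel(1 to 9, withReplacement = false, fraction = 0.5, seed = 1L))
  }
}
```

In both modes fraction fixes only the expected sample size, so the 7 and 2 elements returned above for fraction 0.5 on 9 elements are both plausible outcomes.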


Reposted from: http://blog.csdn.net/lovehuangjiaju/article/details/48580863
