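The Way of Spark Mastery (Advanced): Spark from Beginner to Expert, Section 4: The Spark Programming Model (Part 1)
Main topics in this section
- Key Spark concepts
- Resilient Distributed Dataset (RDD) basics
1. Key Spark concepts
Part of this section is drawn from the official documentation: http://spark.apache.org/docs/latest/cluster-overview.html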
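(1) Spark run modes
The most commonly used Spark run modes are:
- local: runs Spark with local threads; used mainly for developing and debugging Spark applications
- Standalone: runs the cluster with Spark's built-in resource manager and scheduler in a Master/Slave architecture; to avoid a single point of failure, ZooKeeper can be used to provide High Availability (HA)
- Apache Mesos: runs on top of the Mesos resource management framework; resource management is delegated to Mesos, and Spark is responsible only for task scheduling and computation
- Hadoop YARN: the cluster runs on the YARN resource manager; resource management is delegated to YARN, and Spark is responsible only for task scheduling and computation
Of these run modes, Hadoop YARN is the most widely used, and the first section of this course set up its Spark cluster on Hadoop YARN. Running on YARN lets Spark share cluster resources and HDFS data with the rest of the Hadoop ecosystem.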
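In practice the run mode is selected through the master URL given to spark-submit (the --master option) or to SparkConf.setMaster. The sketch below maps each mode in the list above to the usual form of that URL. The RunMode types and the masterUrl helper are hypothetical names made up for this illustration, and the host names are placeholders.

```scala
object MasterUrlSketch {
  // Hypothetical model of the four run modes; these types are not part of Spark.
  sealed trait RunMode
  final case class Local(threads: Option[Int]) extends RunMode // None means one thread per core
  final case class Standalone(host: String, port: Int) extends RunMode
  final case class Mesos(host: String, port: Int) extends RunMode
  case object Yarn extends RunMode

  // Builds the master URL string for a given run mode.
  def masterUrl(mode: RunMode): String = mode match {
    case Local(Some(n))      => s"local[$n]"
    case Local(None)         => "local[*]"
    case Standalone(host, p) => s"spark://$host:$p"
    case Mesos(host, p)      => s"mesos://$host:$p"
    case Yarn                => "yarn" // Spark 1.x releases used "yarn-client" or "yarn-cluster"
  }

  def main(args: Array[String]): Unit = {
    // Host names below are placeholders, not real machines.
    val modes = List(Local(None), Standalone("master-host", 7077), Mesos("mesos-host", 5050), Yarn)
    modes.foreach(m => println(masterUrl(m)))
  }
}
```

With YARN the URL carries no host, because the resource manager address is read from the Hadoop configuration instead.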
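(2) Spark components
When a complete Spark application, such as the SparkWordCount program from the previous section, is submitted to a cluster, it involves the components shown in the figure below:
Each Spark application runs on the cluster as an independent set of processes coordinated by the SparkContext object. SparkContext is the entry point of a Spark application, and the program that contains it is called the driver program. SparkContext can connect to several kinds of cluster resource manager (Cluster Manager), such as Hadoop YARN and Mesos, which allocate the resources the application needs. Once resources are allocated, SparkContext acquires Executors on the cluster's worker nodes (Worker Node). Each Spark application has its own Executors, which are separate processes that run computations and store data for that application. SparkContext then ships the application code to the Executors, and finally sends tasks (Task) to the Executors to run.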
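The sequence just described (acquire executors, then hand out tasks) can be imitated with a toy model in plain Scala. Everything here is a hypothetical stand-in: Executor, Task, acquireExecutors and assignTasks are invented for the illustration, and the round-robin assignment is an assumption chosen for simplicity, not how Spark's scheduler decides placement.

```scala
object DriverSketch {
  // Hypothetical stand-ins for the components in the text; none of these are Spark classes.
  final case class Executor(id: Int, host: String)
  final case class Task(partition: Int, executor: Executor)

  // Plays the role of the cluster manager: one executor per worker host.
  def acquireExecutors(workerHosts: Seq[String]): Seq[Executor] =
    workerHosts.zipWithIndex.map { case (host, i) => Executor(i, host) }

  // Plays the role of the driver: one task per partition, assigned round-robin.
  // Assumption for illustration only; a real scheduler also weighs data locality.
  def assignTasks(numPartitions: Int, executors: Seq[Executor]): Seq[Task] = {
    require(executors.nonEmpty, "at least one executor is needed")
    (0 until numPartitions).map(p => Task(p, executors(p % executors.size)))
  }

  def main(args: Array[String]): Unit = {
    val executors = acquireExecutors(Seq("worker-1", "worker-2"))
    assignTasks(4, executors).foreach(println)
  }
}
```

The point of the model is the division of labour: the cluster manager only hands out executors, while deciding which task runs where stays with the driver.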
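Term | Meaning |
---|---|
Application | A user program that runs on Spark, consisting of a driver program (which holds the SparkContext object) and multiple executors on the cluster |
Application jar | A JAR containing the user's Spark application. When the JAR is submitted to a cluster, the Spark libraries do not need to be bundled into it, because they are provided at runtime |
Driver program | The program that contains the main method and creates the SparkContext object |
Cluster manager | The cluster resource manager, for example Mesos or Hadoop YARN |
Deploy mode | Distinguishes where the driver program runs: in cluster mode the driver is launched inside the cluster; in client mode the driver is launched outside the cluster |
Worker node | A node in the cluster that can run Spark application code |
Executor | A process on a worker node that runs the application's tasks and keeps data in memory or on disk across them. Each Spark application has its own executors |
Task | A unit of work that runs in an executor. A Spark application is ultimately broken down into an optimized collection of tasks (described in detail in the next section) |
Job | A parallel computation made up of multiple tasks, spawned in response to a Spark action such as collect or save |
Stage | Each job is divided into smaller sets of tasks called stages, which depend on each other (similar to the map and reduce stages in MapReduce). Because a stage is a set of tasks, it is also called a TaskSet |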
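To make the Job and Stage rows more concrete, here is a rough sketch of how the steps of a word-count job could be grouped into two stages, much like the map and reduce stages of MapReduce. It is a simplification under stated assumptions: Op and splitIntoStages are hypothetical names, and the rule "start a new stage at every operation that needs a shuffle" is a coarse approximation of what the scheduler does.

```scala
object StageSketch {
  // One step of a job; shuffle = true marks a step that needs data from all partitions.
  // Hypothetical type for illustration, not a Spark class.
  final case class Op(name: String, shuffle: Boolean)

  // Groups consecutive steps into stages, opening a new stage at each shuffle step.
  def splitIntoStages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List.empty[List[String]]) {
      case (Nil, op)                  => List(List(op.name))
      case (stages, op) if op.shuffle => stages :+ List(op.name)
      case (stages, op)               => stages.init :+ (stages.last :+ op.name)
    }

  def main(args: Array[String]): Unit = {
    val wordCount = List(
      Op("textFile", shuffle = false),
      Op("flatMap", shuffle = false),
      Op("map", shuffle = false),
      Op("reduceByKey", shuffle = true),
      Op("saveAsTextFile", shuffle = false)
    )
    splitIntoStages(wordCount).foreach(stage => println(stage.mkString(" -> ")))
  }
}
```

Each stage would then be run as one task per partition, which is why a stage can be thought of as a set of tasks.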
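2. Resilient Distributed Dataset (RDD) basics
Resilient Distributed Datasets (RDDs) were proposed by a Berkeley lab in 2011. The original paper is titled Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. It is well worth reading as the primary source on RDDs, and most of this section is based on it.
(1) RDD design goals
RDDs are designed to reuse intermediate results efficiently during parallel computation and to support a simpler programming model, while keeping the strengths of parallel frameworks such as MapReduce: high fault tolerance, efficient scheduling, and scalability. Fault tolerance comes from recording the lineage of the transformations that produced each RDD. The lineage captures an RDD's ancestry, so when a failure occurs the lost data is recovered directly from it. RDDs are best suited to data mining, machine learning, and graph computation, because these workloads involve a large number of iterative computations, and keeping data in memory greatly improves their performance in a distributed environment. RDDs are not suited to tasks that frequently update shared state, such as a distributed web crawler.
The following shows how to view an RDD's lineage in spark-shell: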
```scala
// textFile reads README.md from the HDFS root directory; filter keeps the lines containing "Spark"
scala> val rdd2 = sc.textFile("/README.md").filter(line => line.contains("Spark"))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:21

// toDebugString prints the RDD's lineage.
// textFile produces two RDDs, a HadoopRDD and a MapPartitionsRDD,
// and filter produces another MapPartitionsRDD.
scala> rdd2.toDebugString
15/09/20 01:35:27 INFO mapred.FileInputFormat: Total input paths to process : 1
res0: String =
(2) MapPartitionsRDD[2] at filter at <console>:21 []
 |  MapPartitionsRDD[1] at textFile at <console>:21 []
 |  /README.md HadoopRDD[0] at textFile at <console>:21 []
```
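The idea behind lineage-based recovery can be sketched without Spark. In the plain-Scala model below, each dataset remembers its parent and the function that derives it, so lost results are rebuilt by replaying the chain. MiniRDD, Source and Filtered are hypothetical names for this illustration and are far simpler than Spark's real RDD classes.

```scala
object LineageSketch {
  // A node in the lineage chain: it can describe its ancestry and recompute its data.
  sealed trait MiniRDD[T] {
    def compute(): Seq[T]
    def lineage: List[String]
  }

  // The root of the chain: data loaded from some source (a file, for example).
  final case class Source[T](name: String, load: () => Seq[T]) extends MiniRDD[T] {
    def compute(): Seq[T] = load()
    def lineage: List[String] = List(name)
  }

  // A derived dataset: stores only its parent and the predicate, never the data itself.
  final case class Filtered[T](parent: MiniRDD[T], p: T => Boolean) extends MiniRDD[T] {
    def compute(): Seq[T] = parent.compute().filter(p)
    def lineage: List[String] = "filter" :: parent.lineage
  }

  def main(args: Array[String]): Unit = {
    val lines = Source("README.md", () => Seq("Apache Spark", "Hadoop", "Spark SQL"))
    val sparkLines = Filtered(lines, (s: String) => s.contains("Spark"))
    println(sparkLines.lineage.mkString(" <- "))
    // If a computed result is lost, calling compute() again rebuilds it from the source.
    println(sparkLines.compute())
  }
}
```

Because only the recipe is stored, recovery costs a recomputation rather than a replicated copy of the data, which is the trade-off the paper argues for.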
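(2) The RDD abstraction
An RDD in Spark is a read-only (comparable to a Scala val), partitioned collection of records. RDDs can be created in only two ways: (1) from a storage system; (2) from other RDDs. Creation from storage takes several forms: the source can be the local file system, a distributed file system, or data in memory.
The following code creates an RDD from HDFS: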
```scala
scala> sc.textFile("/README.md")
res1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at textFile at <console>:22
```
The following code creates an RDD from in-memory data:
```scala
// Define an array in memory
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

// Create a ParallelCollectionRDD with the parallelize method
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:23
```
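parallelize also accepts a second argument, the number of partitions, as in the sc.parallelize(1 to 9, 3) call used later in this section. The sketch below shows one way a local collection can be cut into that many contiguous partitions. The even-split arithmetic is an assumption made for illustration, and SliceSketch.slice is a hypothetical helper, not a Spark API.

```scala
object SliceSketch {
  // Splits data into numSlices contiguous, nearly equal partitions.
  // Assumption: slice i covers the index range [i * n / numSlices, (i + 1) * n / numSlices).
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      val start = ((i.toLong * data.length) / numSlices).toInt
      val end = (((i + 1).toLong * data.length) / numSlices).toInt
      data.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // Nine elements in three partitions: 1..3, 4..6 and 7..9.
    slice(1 to 9, 3).foreach(part => println(part.mkString(", ")))
  }
}
```

When the length is not a multiple of the partition count, the integer division makes the partitions differ in size by at most one element.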
The following code creates a new RDD from another RDD:
```scala
// filter transforms the distData RDD into a new RDD
scala> val distDataFiltered = distData.filter(e => e > 2)
distDataFiltered: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at filter at <console>:25

// Trigger an action (covered later) to view the filtered contents.
// Note that collect is only suitable for small amounts of data.
scala> distDataFiltered.collect
res3: Array[Int] = Array(3, 4, 5)
```
(3) The RDD programming model
The earlier examples already showed how to program with RDDs. Recall this code:
```scala
// filter transforms the distData RDD into a new RDD
scala> val distDataFiltered = distData.filter(e => e > 2)

// Trigger an action (covered later) to view the filtered contents.
// Note that collect is only suitable for small amounts of data.
scala> distDataFiltered.collect
```
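This code already captures the core idea of the RDD programming model: "filter transforms the distData RDD into a new RDD" and "trigger an action". In other words, RDD operations fall into two kinds: Transformations and Actions.
A transformation turns one RDD into a new RDD. Note in particular that all transformations are lazy. Anyone familiar with lazy in Scala will recognize the behavior: a transformation does not run immediately but only records what is to be done to the dataset, and it runs only when the result is actually needed. For example, distData.filter(e=>e>2) does not execute when it is called; it executes only when distDataFiltered.collect runs, as shown in the figure below.
As the figure above shows, the transformation is executed only after distDataFiltered.collect is called.
As the description of transformations shows, an action is what finally triggers execution. An action either returns the result to the driver program, as collect does, or saves the result, as the saveAsTextFile method in SparkWordCount does.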
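The same lazy behavior can be observed in plain Scala, where Iterator.filter is also deferred until the iterator is consumed. The sketch below is an analogy rather than Spark code: filter on the iterator stands in for a transformation and toList stands in for an action. LazySketch and the evaluated counter are names invented for this illustration.

```scala
object LazySketch {
  // Returns (predicate calls before forcing, predicate calls after forcing, result).
  def demo(): (Int, Int, List[Int]) = {
    var evaluated = 0
    val data = List(1, 2, 3, 4, 5)
    // Like a transformation: only describes the work, nothing is filtered yet.
    val filtered = data.iterator.filter { e => evaluated += 1; e > 2 }
    val before = evaluated
    // Like an action: consumes the iterator, which finally runs the predicate.
    val result = filtered.toList
    (before, evaluated, result)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, result) = demo()
    println(s"predicate calls before forcing: $before, after forcing: $after, result: $result")
  }
}
```

The counter stays at zero until toList runs, which mirrors how the filter on distData does nothing until collect is called.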
The transformations supported by Spark 1.5.0 include:
(1) map
Method signature of map:
```scala
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U]
```
Usage example:
```scala
scala> val rdd1 = sc.parallelize(Array(1, 2, 3, 4)).map(x => 2 * x).collect
rdd1: Array[Int] = Array(2, 4, 6, 8)
```
(2) filter
Method signature:
```scala
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T]
```
Usage example:
```scala
scala> val rdd1 = sc.parallelize(Array(1, 2, 3, 4)).filter(x => x > 1).collect
rdd1: Array[Int] = Array(2, 3, 4)
```
(3) flatMap
Method signature:
```scala
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
```
Usage example:
```scala
scala> val data = Array(Array(1, 2, 3, 4, 5), Array(4, 5, 6))
data: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(4, 5, 6))

scala> val rdd1 = sc.parallelize(data)
rdd1: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[2] at parallelize at <console>:23

scala> val rdd2 = rdd1.flatMap(x => x.map(y => y))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at flatMap at <console>:25

scala> rdd2.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6)
```
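The difference between map and flatMap is easiest to see side by side. Scala's standard collections offer both operations with the same element-level meaning, so the comparison below runs without a cluster. The sample lines and the name FlatMapSketch are made up for this illustration.

```scala
object FlatMapSketch {
  val lines: List[String] = List("hello spark", "hello scala")

  // map keeps one output element per input element: here, one array of words per line.
  val mapped: List[Array[String]] = lines.map(_.split(" "))

  // flatMap applies the same function and then flattens the arrays into one list of words.
  val flattened: List[String] = lines.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    println(mapped.map(_.mkString("[", ", ", "]")).mkString(" "))
    println(flattened.mkString(", "))
  }
}
```

This is why word-count style programs split lines with flatMap: the result is a flat collection of words rather than a collection of arrays.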
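(4) mapPartitions(func)
This mapPartitions example comes from: https://www.zybuluo.com/jewes/note/35032
mapPartitions is a variant of map. The function passed to map is applied to each element of the RDD, whereas the function passed to mapPartitions is applied to each partition, so the contents of a partition are processed as a whole. It is defined as: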
def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
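f is the input function, which processes the contents of each partition. The contents of each partition are passed to f as an Iterator[T], and f returns an Iterator[U]. The final RDD is the combination of the results that f produces for all partitions.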
```scala
scala> val a = sc.parallelize(1 to 9, 3)

scala> def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
         var res = List[(T, T)]()
         var pre = iter.next
         while (iter.hasNext) {
           val cur = iter.next
           // Prepend the (previous, current) pair to the result list
           res ::= ((pre, cur))
           pre = cur
         }
         res.iterator
       }

scala> a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
```
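In the example above, myfunc pairs each element in a partition with the element that follows it. Pairs are built only within a partition, and the last element of a partition has no successor there, so (3,4) and (6,7) do not appear in the result. Within each partition the pairs come out in reverse order because every new pair is prepended to the list.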
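The per-partition behaviour described above can be tried without a cluster. The following is a minimal sketch in plain Scala: `adjacentPairs` and the `partitions` list are hypothetical names introduced here, and the three partitions are imitated with `grouped(3)`, on the assumption that `1 to 9` is split into contiguous slices of three elements, as the output above suggests. Unlike myfunc, this variant also tolerates an empty partition and yields the pairs in ascending order.

```scala
// Pair each element with its successor inside a single partition.
// Returns an empty iterator for an empty (or single-element) partition.
def adjacentPairs[T](iter: Iterator[T]): Iterator[(T, T)] = {
  if (!iter.hasNext) Iterator.empty
  else {
    var prev = iter.next()
    iter.map { cur =>
      val pair = (prev, cur)
      prev = cur
      pair
    }
  }
}

// Local stand-in for an RDD with 3 partitions (assumption: contiguous slices).
val partitions: List[List[Int]] = (1 to 9).toList.grouped(3).toList

// Apply the function to every partition and concatenate the results.
val pairs: List[(Int, Int)] =
  partitions.flatMap(part => adjacentPairs(part.iterator))

println(pairs)
// List((1,2), (2,3), (4,5), (5,6), (7,8), (8,9))
```

In spark-shell the same function should be usable as `a.mapPartitions(adjacentPairs).collect`, since it has the Iterator-to-Iterator shape that mapPartitions expects.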
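mapPartitions has several variants. For example, mapPartitionsWithContext passes some state information about the processing to the user-supplied function, and mapPartitionsWithIndex passes the partition index to the user-supplied function.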
(5)mapPartitionsWithIndex
mapPartitionsWithIndex is a variant of mapPartitions. Its signature is as follows:
```scala
def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]
```
```scala
scala> val a = sc.parallelize(1 to 9, 3)
// The function receives the partition index; the first element of each returned tuple is that index
scala> def myfunc[T](index: Int, iter: Iterator[T]) : Iterator[(Int, T, T)] = {
  var res = List[(Int, T, T)]()
  var pre = iter.next
  while (iter.hasNext) {
    val cur = iter.next
    res .::= (index, pre, cur)
    pre = cur
  }
  res.iterator
}
scala> a.mapPartitionsWithIndex(myfunc).collect
res11: Array[(Int, Int, Int)] = Array((0,2,3), (0,1,2), (1,5,6), (1,4,5), (2,8,9), (2,7,8))
```
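A common use of the partition index is simply to see which partition each element lives in. The sketch below shows that idea in plain Scala without a cluster: `tagWithPartition` and `partitions` are hypothetical names introduced for illustration, and the partitions are imitated with `grouped(3)` on the assumption of contiguous, equal-sized slices.

```scala
// Tag every element with the index of the partition it came from.
// The signature matches the (Int, Iterator[T]) => Iterator[U] shape
// that mapPartitionsWithIndex expects.
def tagWithPartition[T](index: Int, iter: Iterator[T]): Iterator[(Int, T)] =
  iter.map(x => (index, x))

// Local stand-in for an RDD with 3 partitions (assumption: contiguous slices).
val partitions: List[List[Int]] = (1 to 9).toList.grouped(3).toList

// Call the function once per partition, passing that partition's index.
val tagged: List[(Int, Int)] =
  partitions.zipWithIndex.flatMap { case (part, index) =>
    tagWithPartition(index, part.iterator)
  }

println(tagged)
// List((0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9))
```

Because the function maps lazily over the iterator, it does not need to hold a whole partition in memory, which is usually preferable to building a List first.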
(6)sample
Method signature:
```scala
  /**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
   * @param fraction expected size of the sample as a fraction of this RDD's size
   *  without replacement: probability that each element is chosen; fraction must be [0, 1]
   *  with replacement: expected number of times each element is chosen; fraction must be >= 0
   * @param seed seed for the random number generator
   */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T]
```
Usage example:
```scala
scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:21

scala> val sampledA=a.sample(true,0.5)
sampledA: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[13] at sample at <console>:23

scala> sampledA.collect
res12: Array[Int] = Array(3, 3, 3, 5, 6, 8, 8)

scala> val sampledA2=a.sample(false,0.5).collect
sampledA2: Array[Int] = Array(1, 4)
```
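The two modes in the doc comment explain why the sample sizes above are not exactly half of the data and why duplicates show up only with replacement. The sketch below imitates those documented semantics on a local collection. It is an illustration, not Spark's sampler code: `bernoulliSample`, `poissonCount` and `poissonSample` are hypothetical helpers, and modelling "expected number of times each element is chosen" with a Poisson draw is an assumption made for this example.

```scala
import scala.util.Random

// Without replacement: keep each element independently with probability `fraction`,
// so the result size varies around fraction * data.size and has no duplicates.
def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
  require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
  val rng = new Random(seed)
  data.filter(_ => rng.nextDouble() < fraction)
}

// Draw a Poisson-distributed count with the given mean
// (Knuth's multiplication method; adequate for small means).
def poissonCount(mean: Double, rng: Random): Int = {
  val limit = math.exp(-mean)
  var count = 0
  var product = rng.nextDouble()
  while (product > limit) {
    count += 1
    product *= rng.nextDouble()
  }
  count
}

// With replacement (assumed model): emit each element a Poisson(fraction)
// number of times, so the same element can appear more than once.
def poissonSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
  require(fraction >= 0.0, "fraction must be >= 0")
  val rng = new Random(seed)
  data.flatMap(x => Seq.fill(poissonCount(fraction, rng))(x))
}

val data = 1 to 9
println(bernoulliSample(data, 0.5, seed = 42L)) // a subset, size varies with the seed
println(poissonSample(data, 0.5, seed = 42L))   // may contain repeated elements
```

Passing the same seed twice returns the same sample, which is the reason the real `sample` method exposes a `seed` parameter: it makes a sampled run reproducible.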
Reposted from: http://blog.csdn.net/lovehuangjiaju/article/details/48580863