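The Way of Spark (Advanced), Spark from Beginner to Mastery: Section 5, The Spark Programming Model (Part 2)
Main contents of this article
- Common RDD transformation functions
1. Common RDD transformation functions
(1) union
union merges the elements of two RDDs, much like the union of two sets, except that duplicates are kept.
Signature of union: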
```scala
/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T]
```
After one RDD is unioned with another, elements that exist in both datasets appear more than once in the result.
The code is as follows:
```scala
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:21

// duplicate elements are kept
scala> rdd1.union(rdd2).collect
res13: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
```
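As a small sketch of what union does to partitioning: the result concatenates the partitions of both inputs, so no shuffle is involved. The local master, the app name, and the names `left`, `right`, and `merged` are assumptions made for illustration; inside spark-shell, `SparkContext.getOrCreate` returns the existing context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with 2 threads; spark-shell reuses its existing context here.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("union-sketch").setMaster("local[2]"))

val left  = sc.parallelize(1 to 5, 2) // 2 partitions
val right = sc.parallelize(4 to 8, 3) // 3 partitions

// `++` is an alias for `union`; elements 4 and 5 appear twice in the result.
val merged = left ++ right

val mergedValues = merged.collect().sorted.toList
val mergedPartitions = merged.partitions.length

println(mergedValues)     // List(1, 2, 3, 4, 4, 5, 5, 6, 7, 8)
println(mergedPartitions) // 5 (2 + 3)
```

Because nothing is moved between partitions, union is cheap; the cost of removing duplicates only appears if `distinct` is called afterwards.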
(2) intersection
intersection returns the intersection of two RDDs.
Signature of intersection:
```scala
/**
 * Return the intersection of this RDD and another one. The output will not contain any duplicate
 * elements, even if the input RDDs did.
 *
 * Note that this method performs a shuffle internally.
 */
def intersection(other: RDD[T]): RDD[T]
```
Usage example:
```scala
scala> rdd1.intersection(rdd2).collect
res14: Array[Int] = Array(4, 5)
```
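The Scaladoc for intersection states that the output contains no duplicates even when the inputs do. A minimal sketch of that behaviour follows; the inputs `a` and `b`, the local master, and the app name are illustrative assumptions. It also uses the overload that takes a partition count.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell reuses its existing context here.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("intersection-sketch").setMaster("local[2]"))

// Both inputs contain repeated elements.
val a = sc.parallelize(Seq(1, 1, 2, 2, 3))
val b = sc.parallelize(Seq(2, 2, 3, 3, 4))

// Each common element appears exactly once; sorting makes the order stable for display.
val common = a.intersection(b).collect().sorted.toList

// The second argument sets the number of partitions used by the internal shuffle.
val commonPartitions = a.intersection(b, 4).partitions.length

println(common)           // List(2, 3)
println(commonPartitions) // 4
```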
(3) distinct
distinct removes duplicate elements.
Signature of distinct:
```scala
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T]
```
```scala
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd1.union(rdd2).distinct.collect
res0: Array[Int] = Array(6, 1, 7, 8, 2, 3, 4, 5)
```
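The distinct output above is not in ascending order, because the elements are redistributed across partitions by hash during de-duplication. The sketch below sorts the collected result for a stable display and also uses the `distinct(numPartitions)` overload. The local master, app name, and variable names are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell reuses its existing context here.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("distinct-sketch").setMaster("local[2]"))

val merged = sc.parallelize(1 to 5).union(sc.parallelize(4 to 8))

// Remove duplicates, then sort on the driver so the printed order is deterministic.
val uniqueSorted = merged.distinct().collect().sorted.toList

// The overload with an Int argument fixes the number of result partitions.
val uniquePartitions = merged.distinct(2).partitions.length

println(uniqueSorted)     // List(1, 2, 3, 4, 5, 6, 7, 8)
println(uniquePartitions) // 2
```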
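(4) groupByKey([numTasks])
The input is a dataset of (K, V) pairs and the result is a dataset of (K, Iterable[V]) pairs. numTasks sets the number of tasks and is optional; the definition below is the no-argument form of groupByKey.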
```scala
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
 * within each group is not guaranteed, and may even differ each time the resulting RDD is
 * evaluated.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 */
def groupByKey(): RDD[(K, Iterable[V])]
```
```scala
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd1.union(rdd2).map((_,1)).groupByKey.collect
res2: Array[(Int, Iterable[Int])] = Array((6,CompactBuffer(1)), (1,CompactBuffer(1)), (7,CompactBuffer(1)), (8,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)), (4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)))
```
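The example above uses the no-argument form. As a sketch of the optional argument, the snippet below calls `groupByKey(numPartitions)` and then reduces each group to its size with `mapValues`. The local master, app name, and names such as `pairs` and `groupSizes` are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell reuses its existing context here.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("groupByKey-sketch").setMaster("local[2]"))

// Same data as the transcript: keys 4 and 5 occur twice.
val pairs = sc.parallelize(1 to 5).union(sc.parallelize(4 to 8)).map((_, 1))

// Group with an explicit number of result partitions.
val grouped = pairs.groupByKey(2)
val groupedPartitions = grouped.partitions.length

// Turn each Iterable[Int] into its size; sort by key on the driver for a stable display.
val groupSizes = grouped.mapValues(_.size).collect().sortBy(_._1).toList

println(groupedPartitions) // 2
println(groupSizes)        // List((1,1), (2,1), (3,1), (4,2), (5,2), (6,1), (7,1), (8,1))
```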
(5) reduceByKey(func, [numTasks])
reduceByKey takes a dataset of (K, V) pairs and also returns (K, V) pairs, where each V is the result of aggregating all values for that key with func.
```scala
/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
 */
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
```
Usage example:
```scala
scala> rdd1.union(rdd2).map((_,1)).reduceByKey(_+_).collect
res4: Array[(Int, Int)] = Array((6,1), (1,1), (7,1), (8,1), (2,1), (3,1), (4,2), (5,2))
```
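The signature shown for reduceByKey takes a `numPartitions` argument that the transcript does not use. The sketch below passes it explicitly and checks that the result matches a `groupByKey` followed by a per-key sum, which the groupByKey Scaladoc describes as the more expensive route. The local master, app name, and variable names are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell reuses its existing context here.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("reduceByKey-sketch").setMaster("local[2]"))

val pairs = sc.parallelize(1 to 5).union(sc.parallelize(4 to 8)).map((_, 1))

// Sum the values per key, hash-partitioning the output into 2 partitions.
val counts = pairs.reduceByKey(_ + _, 2)
val countsPartitions = counts.partitions.length
val countsList = counts.collect().sortBy(_._1).toList

// Same result via groupByKey, which ships every value across the shuffle first.
val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().sortBy(_._1).toList

println(countsList)             // List((1,1), (2,1), (3,1), (4,2), (5,2), (6,1), (7,1), (8,1))
println(countsList == viaGroup) // true
println(countsPartitions)       // 2
```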
(6) sortByKey([ascending], [numTasks])
sortByKey sorts the input dataset by key.
Definition of sortByKey:
```scala
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)]
```
Usage example:
```scala
scala> var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(7,9),(2,4)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[20] at parallelize at <console>:21

scala> data.sortByKey(true).collect
res10: Array[(Int, Int)] = Array((1,3), (1,2), (1,4), (2,3), (2,4), (7,9))
```
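In the result above, the keys are ordered but the values under key 1 are not (3, 2, 4), since sortByKey compares keys only. The sketch below shows descending order through the `ascending` flag and, as one possible way to also order the values, `sortBy` on the whole pair. The local master, app name, and variable names are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell reuses its existing context here.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sortByKey-sketch").setMaster("local[2]"))

val data = sc.parallelize(List((1, 3), (1, 2), (1, 4), (2, 3), (7, 9), (2, 4)))

// Descending sort; only the key order is guaranteed, so we inspect just the keys.
val descendingKeys = data.sortByKey(ascending = false).collect().map(_._1).toList

// Sorting by the whole (key, value) pair orders the values within each key as well.
val fullySorted = data.sortBy(pair => pair).collect().toList

println(descendingKeys) // List(7, 2, 2, 1, 1, 1)
println(fullySorted)    // List((1,2), (1,3), (1,4), (2,3), (2,4), (7,9))
```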
(7) join(otherDataset, [numTasks])
For RDDs of types (K, V) and (K, W), join returns an RDD of type (K, (V, W)). There are three join functions:
```scala
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

def leftOuterJoin[W](
    other: RDD[(K, W)],
    partitioner: Partitioner): RDD[(K, (V, Option[W]))]

def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], W))]
```
Usage example:
```scala
scala> val rdd1=sc.parallelize(Array((1,2),(1,3))
     | )
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(Array((1,3)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:21

scala> rdd1.join(rdd2).collect
res13: Array[(Int, (Int, Int))] = Array((1,(3,3)), (1,(2,3)))
```
<code class="hljs vbscript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> rdd1.leftOuterJoin(rdd2).collect
res15: Array[(Int, (Int, Option[Int]))] = Array((1,(3,Some(3))), (1,(2,Some(3))))
</code>
<code class="hljs vbscript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> rdd1.rightOuterJoin(rdd2).collect
res16: Array[(Int, (Option[Int], Int))] = Array((1,(Some(3),3)), (1,(Some(2),3)))
</code>
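In the example above every key appears in both RDDs, so the outer joins never produce a `None`. The following is a minimal sketch with partly overlapping keys. The RDD names `left` and `right` and their contents are hypothetical, chosen only for illustration. It assumes Spark is on the classpath and can be pasted into spark-shell. The results are converted to sets because the order returned by `collect` after a shuffle is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("outer-join-sketch"))

// Hypothetical data: key 1 is on both sides, key 2 only on the left,
// key 3 only on the right.
val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x"), (3, "y")))

// join keeps only the keys present on both sides.
val inner = left.join(right).collect().toSet

// leftOuterJoin keeps every left key; a missing right value becomes None.
val leftOuter = left.leftOuterJoin(right).collect().toSet

// rightOuterJoin keeps every right key; a missing left value becomes None.
val rightOuter = left.rightOuterJoin(right).collect().toSet

println(inner)
println(leftOuter)
println(rightOuter)
```

Key 2 survives only in `leftOuter` (paired with `None`), and key 3 survives only in `rightOuter`.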
(8) cogroup(otherDataset, [numTasks])
If the input RDDs have types (K, V) and (K, W), the returned RDD has type (K, (Iterable[V], Iterable[W])). This operation is equivalent to groupWith.
Method definition:
/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))]
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1 = sc.parallelize(Array((1,2),(1,3)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:21

scala> val rdd2 = sc.parallelize(Array((1,3)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:21

scala> rdd1.cogroup(rdd2).collect
res17: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(3, 2),CompactBuffer(3))))

scala> rdd1.groupWith(rdd2).collect
res18: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))
</code>
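The session above only uses key 1, which exists in both RDDs. As a rough sketch of the "key in this or other" part of the definition, the snippet below adds keys that exist on one side only. The names `a`, `b`, and `grouped` and the data are hypothetical. It assumes Spark is on the classpath and can be pasted into spark-shell. Values are sorted before comparison because the order inside each Iterable is not guaranteed, as the two different outputs above suggest.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("cogroup-sketch"))

// Hypothetical data: key 1 on both sides, key 2 only in a, key 3 only in b.
val a = sc.parallelize(Seq((1, 2), (1, 3), (2, 4)))
val b = sc.parallelize(Seq((1, 3), (3, 5)))

// cogroup emits one record per key found in either RDD; a side without
// that key contributes an empty Iterable.
val grouped = a.cogroup(b)
  .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
  .collect()
  .toMap

grouped.toSeq.sortBy(_._1).foreach(println)
```

Unlike join, no key is dropped: keys 2 and 3 each appear once, with an empty list on the side where they are missing.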
(9) cartesian(otherDataset)
Computes the Cartesian product of two RDDs.
Function definition:
/**
 * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
 * elements (a, b) where a is in `this` and b is in `other`.
 */
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1 = sc.parallelize(Array(1,2,3,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at parallelize at <console>:21

scala> val rdd2 = sc.parallelize(Array(5,6))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[53] at parallelize at <console>:21

scala> rdd1.cartesian(rdd2).collect
res21: Array[(Int, Int)] = Array((1,5), (1,6), (2,5), (2,6), (3,5), (4,5), (3,6), (4,6))
</code>
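Because every element of one RDD is paired with every element of the other, the result size is the product of the two input sizes. The sketch below checks this on two small RDDs. The names `xs`, `ys`, and `pairs` are hypothetical. It assumes Spark is on the classpath and can be pasted into spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("cartesian-sketch"))

// Hypothetical inputs with 4 and 2 elements, each split into 2 partitions.
val xs = sc.parallelize(Seq(1, 2, 3, 4), 2)
val ys = sc.parallelize(Seq("a", "b"), 2)

val pairs = xs.cartesian(ys)

// 4 * 2 = 8 pairs in total.
val pairCount = pairs.count()
val pairSet = pairs.collect().toSet

// The result has one partition per combination of input partitions: 2 * 2 = 4.
val partitionCount = pairs.partitions.length

println(s"pairs=$pairCount partitions=$partitionCount")
```

Both the number of records and the number of partitions multiply, so cartesian is best reserved for small inputs.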
(10) coalesce(numPartitions)
Reduces the number of partitions of the RDD to numPartitions.
Function definition:
/**
 * Return a new RDD that is reduced into `numPartitions` partitions.
 *
 * This results in a narrow dependency, e.g. if you go from 1000 partitions
 * to 100 partitions, there will not be a shuffle, instead each of the 100
 * new partitions will claim 10 of the current partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can pass shuffle = true. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * Note: With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T]
Example code:
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1 = sc.parallelize(1 to 100, 3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at <console>:21

scala> val rdd2 = rdd1.coalesce(2)
rdd2: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[56] at coalesce at <console>:23
</code>
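The session above creates the coalesced RDD but does not show the resulting partition counts. The sketch below inspects them with `partitions.length` and also exercises the `shuffle = true` note from the Scaladoc. The variable names are hypothetical. It assumes Spark is on the classpath and can be pasted into spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch"))

val rdd = sc.parallelize(1 to 100, 3)
val before = rdd.partitions.length

// Narrow dependency: existing partitions are merged, no shuffle.
val fewer = rdd.coalesce(2).partitions.length

// Without a shuffle, coalesce cannot raise the partition count,
// so asking for 6 leaves the RDD at its current 3 partitions.
val notMore = rdd.coalesce(6).partitions.length

// With shuffle = true the data is redistributed, so the count can grow.
val more = rdd.coalesce(6, shuffle = true).partitions.length

println(s"before=$before fewer=$fewer notMore=$notMore more=$more")
```

Requesting more partitions than the RDD has is therefore silently ineffective unless `shuffle = true` is passed.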
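repartition(numPartitions) also changes the number of partitions of an RDD, and it can either increase or decrease that number. Internally it simply calls coalesce with shuffle = true, so it always shuffles and may incur substantial network overhead.
Method definition: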
/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
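As a rough illustration of the definition above, the sketch below repartitions a small RDD both up and down, then looks at the lineage text to see whether a shuffle is involved. The variable names are hypothetical, and the exact wording of `toDebugString` may differ between Spark versions. The check only assumes that a shuffle step shows up as `ShuffledRDD`. It assumes Spark is on the classpath and can be pasted into spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("repartition-sketch"))

val rdd = sc.parallelize(1 to 100, 3)

// repartition can move the partition count in either direction.
val wider = rdd.repartition(6)
val narrower = rdd.repartition(1)
val widerCount = wider.partitions.length
val narrowerCount = narrower.partitions.length

// The elements are unchanged; only their placement differs.
val sameData = wider.collect().sorted.sameElements(1 to 100)

// Lineage check: repartition goes through a shuffle, plain coalesce does not.
val repartitionShuffles = wider.toDebugString.contains("ShuffledRDD")
val coalesceShuffles = rdd.coalesce(2).toDebugString.contains("ShuffledRDD")

// Number of elements that ended up in each of the six partitions.
val sizes = wider.glom().map(_.length).collect()

println(s"wider=$widerCount narrower=$narrowerCount sizes=${sizes.mkString(",")}")
println(s"repartition shuffles: $repartitionShuffles, coalesce shuffles: $coalesceShuffles")
```

When the goal is only to reduce the partition count, `coalesce` without a shuffle is usually the cheaper choice, as the Scaladoc recommends.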
Reposted from: http://blog.csdn.net/lovehuangjiaju/article/details/48603009