Spark修煉之道(進階篇)——Spark入門到精通:第六節 Spark程式設計模型(三)
本節主要內容
- RDD transformation(續)
- RDD actions
1. RDD transformation(續)
(1)repartitionAndSortWithinPartitions(partitioner)
repartitionAndSortWithinPartitions函式是repartition函式的變種,與repartition函式不同的是,repartitionAndSortWithinPartitions在給定的partitioner內部進行排序,效能比repartition要高。
函式定義:
/**
* Repartition the RDD according to the given partitioner and, within each resulting partition,
* sort records by their keys.
*
* This is more efficient than calling repartition
and then sorting within each partition
* because it can push the sorting down into the shuffle machinery.
*/
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]
使用示例:
<code class="hljs haskell has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-title" style="box-sizing: border-box;">scala</span>> val <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span> = sc.parallelize<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">List</span>((1,3)</span>,<span class="hljs-container" style="box-sizing: border-box;">(1,2)</span>,<span class="hljs-container" style="box-sizing: border-box;">(5,4)</span>,<span class="hljs-container" style="box-sizing: border-box;">(1, 4)</span>,<span class="hljs-container" style="box-sizing: border-box;">(2,3)</span>,<span class="hljs-container" style="box-sizing: border-box;">(2,4)</span>),3)</span> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>: org.apache.spark.rdd.<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">RDD</span>[<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>, <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>)</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">ParallelCollectionRDD</span>[3] at parallelize at <console>:21</span> <span class="hljs-title" style="box-sizing: border-box;">scala</span>> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>.repartitionAndSortWithinPartitions<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-title" style="box-sizing: border-box;">new</span> <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">HashPartitioner(3)</span>)</span>.collect</span> <span class="hljs-title" style="box-sizing: border-box;">res3</span>: <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>[(<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>, <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>)] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>))</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul>
(2)aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
aggregateByKey函式對PairRDD中相同Key的值進行聚合操作,在聚合過程中同樣使用了一箇中立的初始值。其函式定義如下:
/**
* Aggregate the values of each key, using given combine functions and a neutral “zero value”.
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)]
示例程式碼:
<code class="hljs haskell has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-import" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> org.apache.spark.SparkContext._</span> <span class="hljs-import" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> org.apache.spark.{SparkConf, SparkContext}</span> <span class="hljs-title" style="box-sizing: border-box;">object</span> <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">SparkWordCount</span>{ def main(args: <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>[<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">String</span>]) { <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">if</span> (args.length == <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>) { <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">System</span>.err.println(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"Usage: SparkWordCount <inputfile> <outputfile>"</span>) <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">System</span>.exit(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>) } val conf = new <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">SparkConf</span>().setAppName(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"SparkWordCount"</span>).setMaster(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"local"</span>) val sc = new <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">SparkContext</span>(conf) val <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span> = sc.parallelize<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">List</span>((1,3)</span>,<span class="hljs-container" style="box-sizing: border-box;">(1,2)</span>,<span class="hljs-container" style="box-sizing: border-box;">(1, 4)</span>,<span class="hljs-container" style="box-sizing: border-box;">(2,3)</span>,<span class="hljs-container" style="box-sizing: border-box;">(2,4)</span>))</span> def seqOp(a:<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>, b:<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>) : <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span> ={ println(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"seq: "</span> + a + <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"\t "</span> + b) math.max(a,b) } def combineOp(a:<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>, b:<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>) : <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span> ={ println(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"comb: "</span> + a + <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"\t "</span> + b) a + b } val localIterator=<span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>.aggregateByKey<span class="hljs-container" style="box-sizing: border-box;">(1)</span><span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-title" style="box-sizing: border-box;">seqOp</span>, <span class="hljs-title" style="box-sizing: border-box;">combineOp</span>)</span>.toLocalIterator</span> for(i<-localIterator) println(i) sc.stop() } }</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li><li style="box-sizing: border-box; padding: 0px 5px;">20</li><li style="box-sizing: border-box; padding: 0px 5px;">21</li><li style="box-sizing: border-box; padding: 0px 5px;">22</li><li style="box-sizing: border-box; padding: 0px 5px;">23</li><li style="box-sizing: border-box; padding: 0px 5px;">24</li><li style="box-sizing: border-box; padding: 0px 5px;">25</li><li style="box-sizing: border-box; padding: 0px 5px;">26</li><li style="box-sizing: border-box; padding: 0px 5px;">27</li><li style="box-sizing: border-box; padding: 0px 5px;">28</li><li style="box-sizing: border-box; padding: 0px 5px;">29</li></ul>
執行結果:
seq: 1 3
seq: 3 2
seq: 3 4
seq: 1 3
seq: 3 4
(1,4)
(2,4)
從輸出結果來看,seqOp函式起作用了,但comineOp函式並沒有起作用,在Spark 1.5、1.4及1.3三個版本中測試,結果都是一樣的。這篇文章http://www.iteblog.com/archives/1261給出了aggregateByKey的使用,其Spark版本是1.1,其返回結果符合預期。個人覺得是版本原因造成的,具體後面有時間再來分析。
RDD中還有其它非常有用的transformation操作,參見API文件:http://spark.apache.org/docs/latest/api/scala/index.html
2. RDD actions
本小節將介紹常用的action操作,前面使用的collect方法便是一種action,它返回RDD中所有的資料元素,方法定義如下:
/**
* Return an array that contains all of the elements in this RDD.
*/
def collect(): Array[T]
(1) reduce(func)
reduce取樣累加或關聯操作減少RDD中元素的數量,其方法定義如下:
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T
使用示例:
<code class="hljs haskell has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-title" style="box-sizing: border-box;">scala</span>> val <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>=sc.parallelize<span class="hljs-container" style="box-sizing: border-box;">(1 <span class="hljs-title" style="box-sizing: border-box;">to</span> 9)</span></span> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>: org.apache.spark.rdd.<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">RDD</span>[<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">ParallelCollectionRDD</span>[6] at parallelize at <console>:22</span> <span class="hljs-title" style="box-sizing: border-box;">scala</span>> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>.reduce<span class="hljs-container" style="box-sizing: border-box;">((<span class="hljs-title" style="box-sizing: border-box;">x</span>,<span class="hljs-title" style="box-sizing: border-box;">y</span>)</span>=>x+y)</span> <span class="hljs-title" style="box-sizing: border-box;">res12</span>: <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span> = <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">45</span> <span class="hljs-title" style="box-sizing: border-box;">scala</span>> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>.reduce<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-title" style="box-sizing: border-box;">_</span>+<span class="hljs-title" style="box-sizing: border-box;">_</span>)</span></span> <span class="hljs-title" style="box-sizing: border-box;">res13</span>: <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span> = <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">45</span></code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li></ul>
(2)count()
/**
* Return the number of elements in the RDD.
*/
def count(): Long
使用示例:
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val data=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span> to <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">9</span>) <span class="hljs-label" style="box-sizing: border-box;">data:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[Int] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">22</span> scala> data<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.count</span> <span class="hljs-label" style="box-sizing: border-box;">res14:</span> Long = <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">9</span> </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul>
(3)first()
/**
* Return the first element in this RDD.
*/
def first()
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val data=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span> to <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">9</span>) <span class="hljs-label" style="box-sizing: border-box;">data:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[Int] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">22</span> scala> data<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.first</span> <span class="hljs-label" style="box-sizing: border-box;">res15:</span> Int = <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span> </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul>
(4)take(n)
/**
* Take the first num elements of the RDD. It works by first scanning one partition, and use the
* results from that partition to estimate the number of additional partitions needed to satisfy
* the limit.
*
* @note due to complications in the internal implementation, this method will raise
* an exception if called on an RDD of Nothing
or Null
.
*/
def take(num: Int): Array[T]
<code class="hljs haskell has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-title" style="box-sizing: border-box;">scala</span>> val <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>=sc.parallelize<span class="hljs-container" style="box-sizing: border-box;">(1 <span class="hljs-title" style="box-sizing: border-box;">to</span> 9)</span></span> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>: org.apache.spark.rdd.<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">RDD</span>[<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">ParallelCollectionRDD</span>[6] at parallelize at <console>:22</span> <span class="hljs-title" style="box-sizing: border-box;">scala</span>> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>.take<span class="hljs-container" style="box-sizing: border-box;">(2)</span></span> <span class="hljs-title" style="box-sizing: border-box;">res16</span>: <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>[<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>) </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul>
(5) takeSample(withReplacement, num, [seed])
對RDD中的資料進行取樣
/**
* Return a fixed-size sampled subset of this RDD in an array
*
* @param withReplacement whether sampling is done with replacement
* @param num size of the returned sample
* @param seed seed for the random number generator
* @return sample of specified size in an array
*/
// TODO: rewrite this without return statements so we can wrap it in a scope
def takeSample(
withReplacement: Boolean,
num: Int,
seed: Long = Utils.random.nextLong): Array[T]
<code class="hljs haskell has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-title" style="box-sizing: border-box;">scala</span>> val <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>=sc.parallelize<span class="hljs-container" style="box-sizing: border-box;">(1 <span class="hljs-title" style="box-sizing: border-box;">to</span> 9)</span></span> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>: org.apache.spark.rdd.<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">RDD</span>[<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">ParallelCollectionRDD</span>[6] at parallelize at <console>:22</span> <span class="hljs-title" style="box-sizing: border-box;">scala</span>> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>.takeSample<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-title" style="box-sizing: border-box;">false</span>,5)</span></span> <span class="hljs-title" style="box-sizing: border-box;">res17</span>: <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>[<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">7</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>) <span class="hljs-title" style="box-sizing: border-box;">scala</span>> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>.takeSample<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-title" style="box-sizing: border-box;">true</span>,5)</span></span> <span class="hljs-title" style="box-sizing: border-box;">res18</span>: <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>[<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Array</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8</span>) </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li></ul>
(6) takeOrdered(n, [ordering])
/**
* Returns the first k (smallest) elements from this RDD as defined by the specified
* implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
* For example:
* {{{
* sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
* // returns Array(2)
*
* sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
* // returns Array(2, 3)
* }}}
*
* @param num k, the number of elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
(6) saveAsTextFile(path)
將RDD儲存到檔案,本地模式時儲存在本地檔案,叢集模式指如果在Hadoop基礎上則儲存在HDFS上
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> data<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.saveAsTextFile</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"/data.txt"</span>)</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li></ul>
(7) countByKey()
將RDD中的資料按Key計數
/**
* Count the number of elements for each key, collecting the results to a local Map.
*
* Note that this method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver’s memory.
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByKey(): Map[K, Long]
使用示例:
<code class="hljs haskell has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-title" style="box-sizing: border-box;">scala</span>> val <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span> = sc.parallelize<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">List</span>((1,3)</span>,<span class="hljs-container" style="box-sizing: border-box;">(1,2)</span>,<span class="hljs-container" style="box-sizing: border-box;">(5,4)</span>,<span class="hljs-container" style="box-sizing: border-box;">(1, 4)</span>,<span class="hljs-container" style="box-sizing: border-box;">(2,3)</span>,<span class="hljs-container" style="box-sizing: border-box;">(2,4)</span>),3)</span> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>: org.apache.spark.rdd.<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">RDD</span>[<span class="hljs-container" style="box-sizing: border-box;">(<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>, <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>)</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">ParallelCollectionRDD</span>[10] at parallelize at <console>:22</span> <span class="hljs-title" style="box-sizing: border-box;">scala</span>> <span class="hljs-typedef" style="box-sizing: border-box;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">data</span>.countByKey<span class="hljs-container" style="box-sizing: border-box;">()</span></span> <span class="hljs-title" style="box-sizing: border-box;">res22</span>: scala.collection.<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Map</span>[<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Int</span>,<span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Long</span>] = <span class="hljs-type" style="box-sizing: border-box; color: rgb(102, 0, 102);">Map</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span> -> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5</span> -> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span> -> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>) </code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li></ul>
(8)foreach(func)
foreach方法遍歷RDD中所有的元素
// Actions (launch a job to return a value to the user program)
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.SparkContext</span>._ import org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span>.{SparkConf, SparkContext} object ForEachDemo{ def main(args: Array[String]) { if (args<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.length</span> == <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>) { System<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.err</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.println</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"Usage: SparkWordCount <inputfile> <outputfile>"</span>) System<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.exit</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>) } val conf = new SparkConf()<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.setAppName</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"SparkWordCount"</span>)<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.setMaster</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"local"</span>) val sc = new SparkContext(conf) val data = sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(List((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>),(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>),(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>),(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>),(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>))) data<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.foreach</span>(<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">x</span>=>println(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"key="</span>+<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">x</span>._1+<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">",value="</span>+<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">x</span>._2)) sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.stop</span>() } }</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li><li style="box-sizing: border-box; padding: 0px 5px;">14</li><li style="box-sizing: border-box; padding: 0px 5px;">15</li><li style="box-sizing: border-box; padding: 0px 5px;">16</li><li style="box-sizing: border-box; padding: 0px 5px;">17</li><li style="box-sizing: border-box; padding: 0px 5px;">18</li><li style="box-sizing: border-box; padding: 0px 5px;">19</li></ul>
Sparkh中還存在其它非常有用的action操作,如foldByKey、sampleByKey等,參見API文件:http://spark.apache.org/docs/latest/api/scala/index.html
轉載: http://blog.csdn.net/lovehuangjiaju/article/details/48622757
相關文章
- Spark修煉之道(進階篇)——Spark入門到精通:第四節 Spark程式設計模型(一)Spark程式設計模型
- Spark修煉之道(進階篇)——Spark入門到精通:第五節 Spark程式設計模型(二)Spark程式設計模型
- Spark修煉之道(進階篇)——Spark入門到精通:第七節 Spark執行原理Spark
- Spark修煉之道(進階篇)——Spark入門到精通:第一節 Spark 1.5.0叢集搭建Spark
- Spark修煉之道(進階篇)——Spark入門到精通:第八節 Spark SQL與DataFrame(一)SparkSQL
- Spark修煉之道(進階篇)——Spark入門到精通:第二節 Hadoop、Spark生成圈簡介SparkHadoop
- Spark修煉之道(進階篇)——Spark入門到精通:第三節 Spark Intellij IDEA開發環境搭建SparkIntelliJIdea開發環境
- Spark修煉之道(高階篇)——Spark原始碼閱讀:第一節 Spark應用程式提交流程Spark原始碼
- Hello Spark! | Spark,從入門到精通Spark
- Spark SQL | Spark,從入門到精通SparkSQL
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第九節:Shell程式設計入門(一)SparkLinux大資料程式設計
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第十節:Shell程式設計入門(二)SparkLinux大資料程式設計
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第十二節:Shell程式設計入門(四)SparkLinux大資料程式設計
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第十三節:Shell程式設計入門(五)SparkLinux大資料程式設計
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第十四節:Shell程式設計入門(六)SparkLinux大資料程式設計
- spark學習之-----spark程式設計模型Spark程式設計模型
- Spark入門篇Spark
- Spark入門(三)--Spark經典的單詞統計Spark
- Spark 程式設計模型(上)Spark程式設計模型
- Spark從入門到放棄——初始Spark(一)Spark
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第七節:程式管理SparkLinux大資料
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第六節:vi、vim編輯器(二)SparkLinux大資料
- spark學習筆記--進階程式設計Spark筆記程式設計
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第三節:使用者和組SparkLinux大資料
- 【Spark篇】---Spark初始Spark
- Spark SQL 程式設計API入門系列之Spark SQL的作用與使用方式SparkSQL程式設計API
- spark架構設計&程式設計模型01Spark架構程式設計模型
- Spark修煉之道(基礎篇)——Linux大資料開發基礎:第八節:網路管理SparkLinux大資料
- Spark下載與入門(Spark自學二)Spark
- RDD程式設計 上(Spark自學三)程式設計Spark
- Spark 快速入門Spark
- Spark從入門到放棄---RDDSpark
- Spark入門(五)--Spark的reduce和reduceByKeySpark
- Spark入門(四)--Spark的map、flatMap、mapToPairSparkAPTAI
- Spark機器學習1·程式設計入門(scala/java/python)Spark機器學習程式設計JavaPython
- spark入門筆記Spark筆記
- Spark Streaming入門Spark
- Spark入門學習Spark