Spark Development: Transformation Operations

Posted by Xlucas on 2017-09-22

Core transformation operations
map(func)
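Returns a new RDD formed by applying func to every element of the source RDD.
func is called once for each input element and returns exactly one object for it.
For example: val rdd1=sc.parallelize(Array(1,2,3,4)).map(x=>2*x).collect
Here map is applied to the data 1, 2, 3, 4 with the function 2*x, so every element is multiplied by 2.
The result is rdd1: Array[Int] = Array(2, 4, 6, 8)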

filter(func)
Returns a new RDD containing only the elements for which func returns true.
For example: val rdd1=sc.parallelize(Array(1,2,3,4)).filter(x=>x>1).collect
Here filter is applied to the array 1, 2, 3, 4 and keeps the elements greater than 1.
rdd1: Array[Int] = Array(2, 3, 4)

flatMap(func)
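Returns a new RDD. Like map, it takes a function as its argument.
Unlike map, func returns a collection of zero or more objects for each input, and all of those collections are flattened into a single RDD.

Example: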

scala> val data =Array(Array(1, 2, 3, 4, 5),Array(4,5,6))
data: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(4, 5, 6))
scala> val rdd1=sc.parallelize(data)
rdd1: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[4] at parallelize at <console>:29
scala> val rdd2=rdd1.flatMap(x=>x.map(y=>y))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at flatMap at <console>:31
scala> rdd2.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6)

mapPartitions(func)
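Similar to map. map operates on each element of the RDD, whereas mapPartitions (like the foreachPartition action) operates on the iterator of each partition.
If the processing has to create expensive extra objects repeatedly, mapPartitions is much more efficient than map. For example, when writing RDD data to a database over JDBC, map would open one connection per element, while mapPartitions opens one connection per partition.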

val a = sc.parallelize(1 to 9, 3)
  def doubleFunc(iter: Iterator[Int]) : Iterator[(Int,Int)] = {
    var res = List[(Int,Int)]()
    while (iter.hasNext)
    {
      val cur = iter.next;
      res .::= (cur,cur*2)
    }
    res.iterator
  }
val result = a.mapPartitions(doubleFunc)
println(result.collect().mkString)

Result: (3,6)(2,4)(1,2)(6,12)(5,10)(4,8)(9,18)(8,16)(7,14)
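
To make the per-element versus per-partition cost concrete, the following is a minimal sketch that counts how often the "expensive setup" step runs. It assumes a local run (or spark-shell); the helper name countOpens and the accumulator names are hypothetical, and a real JDBC connection is replaced here by a simple counter increment.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context in spark-shell, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("mapPartitions-sketch").setMaster("local[2]"))

// Hypothetical helper: returns how many times the setup step ran
// with map and with mapPartitions over the same 9 elements in 3 partitions.
def countOpens(sc: SparkContext): (Long, Long) = {
  val data = sc.parallelize(1 to 9, 3)
  val opensWithMap = sc.longAccumulator("opens with map")
  val opensWithMapPartitions = sc.longAccumulator("opens with mapPartitions")

  // map: the setup step (stand-in for opening a connection) runs for every element.
  data.map { x => opensWithMap.add(1); x * 2 }.collect()

  // mapPartitions: the setup step runs once per partition, then the iterator is reused.
  data.mapPartitions { iter => opensWithMapPartitions.add(1); iter.map(_ * 2) }.collect()

  (opensWithMap.sum, opensWithMapPartitions.sum)
}

val (mapOpens, partitionOpens) = countOpens(sc)
// In a local run without task retries: mapOpens is 9, partitionOpens is 3.
println(s"map: $mapOpens setups, mapPartitions: $partitionOpens setups")
```

Accumulators updated inside transformations can be incremented again if a task is retried, so the counts are only exact when every task runs once.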

mapPartitionsWithIndex(func)
Works like mapPartitions, except that func takes two arguments, the first of which is the index of the partition.

scala> val a = sc.parallelize(1 to 9, 3)
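scala> def myfunc(index: Int, iter: Iterator[Int]): Iterator[(Int, Int, Int)] = {
    // Emits (partition index, previous element, current element) for each adjacent pair
    var res = List[(Int, Int, Int)]()
    if (iter.hasNext) {  // guard against empty partitions
        var pre = iter.next
        while (iter.hasNext) {
            val cur = iter.next
            res ::= ((index, pre, cur))
            pre = cur
        }
    }
    res.iterator
}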
scala> a.mapPartitionsWithIndex(myfunc).collect
res11: Array[(Int, Int, Int)] = Array((0,2,3), (0,1,2), (1,5,6), (1,4,5), (2,8,9), (2,7,8))

sample(withReplacement, fraction, seed)
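sample draws a random sample from the RDD and returns a new RDD that contains only part of the original data. The expected size of the sample is controlled by fraction; the exact size is not guaranteed.
Parameters:
withReplacement: if true, a PoissonSampler (Poisson distribution) is used; otherwise a BernoulliSampler is used.
fraction: without replacement, the probability that each element is selected, between 0 and 1. With replacement, the expected number of times each element is selected, which must be greater than or equal to 0.
seed: the seed for the random number generator. If it is not passed, a randomly generated Long is used.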

val a = sc.parallelize(1 to 9, 3)
val b = a.sample(true,0.5,4)
b.collect()
res7: Array[Int] = Array(2, 3, 4, 4, 6, 8)

val c = a.sample(false,0.5,4)
c.collect()
Array[Int] = Array(2, 3, 5, 6, 8)
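
The role of the seed and of fraction can be checked with a short sketch. It assumes a local run (or spark-shell); the value names are illustrative. No specific sample contents are asserted, because they depend on the sampler and the partitioning.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context in spark-shell, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sample-sketch").setMaster("local[2]"))

val a = sc.parallelize(1 to 9, 3)

// Same RDD, same arguments, same seed: the two samples are identical.
val first = a.sample(false, 0.5, 4).collect().toList
val second = a.sample(false, 0.5, 4).collect().toList
val reproducible = first == second

// Without replacement, no element of this duplicate-free RDD appears twice.
val noRepeats = first.distinct.size == first.size

// With replacement, fraction may exceed 1: it is the expected number of copies per element.
val oversampled = a.sample(true, 2.0, 4).collect().toList
val onlyOriginalValues = oversampled.forall(x => x >= 1 && x <= 9)

println(s"reproducible=$reproducible noRepeats=$noRepeats oversampled size=${oversampled.size}")
```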

union(otherDataset)
Merges two RDDs and returns a new RDD. Duplicate elements are kept.

scala> val rdd1=sc.parallelize(List(('a',2),('b',4),('c',6),('d',9)))
scala> val rdd2=sc.parallelize(List(('c',6),('c',7),('d',8),('e',10)))
scala> val rdd3 = rdd1 union rdd2
scala> rdd3.collect()
res0: Array[(Char, Int)] = Array((a,2), (b,4), (c,6), (d,9), (c,6), (c,7), (d,8), (e,10))

intersection(otherDataset)
Returns the intersection of two RDDs, with duplicates removed.

scala> val rdd1=sc.parallelize(List(('a',2),('b',4),('c',6),('d',9)))
scala> val rdd2=sc.parallelize(List(('c',6),('c',7),('d',8),('e',10)))
scala> val rdd3 = rdd1 intersection rdd2
scala> rdd3.collect()
Array[(Char, Int)] = Array((c,6))

distinct([numTasks])
Returns a new RDD with duplicate elements removed.

scala> val rdd1=sc.parallelize(List(('a',2),('b',4),('c',6),('d',9),('d',9),('d',9),('d',9)))
scala> val rdd2=rdd1.distinct()
scala> rdd2.collect()
res3: Array[(Char, Int)] = Array((a,2), (c,6), (d,9), (b,4))

groupByKey([numTasks])
Takes (K, V) pairs as input and returns (K, Iterable<V>) pairs. numTasks sets the number of tasks and is optional.

scala> val rdd1=sc.parallelize(1 to 5)
scala> val rdd2=sc.parallelize(4 to 9)
scala> rdd1.union(rdd2).map(word=>(word,1)).groupByKey().collect()
Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)),
(4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)), (6,CompactBuffer(1)), (7,CompactBuffer(1)), (8,CompactBuffer(1)), (9,CompactBuffer(1)))

reduceByKey(func, [numTasks])
reduceByKey takes (K, V) pairs as input and also returns (K, V) pairs, where each V is the result of aggregating all values of that key with func.

scala> val rdd1=sc.parallelize(1 to 5)
scala> val rdd2=sc.parallelize(4 to 9)
scala> rdd1.union(rdd2).map(word=>(word,1)).reduceByKey(_+_).collect()
Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,2), (5,2), (6,1), (7,1), (8,1), (9,1))

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
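aggregateByKey aggregates the values that share a key in a pair RDD. Like aggregate, it starts the aggregation from a neutral initial value (zeroValue). seqOp merges a value into the accumulator within a partition, and combOp merges accumulators across partitions.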
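
As a minimal sketch of how zeroValue, seqOp and combOp fit together, the following computes a per-key average by aggregating into a (sum, count) pair. It assumes a local run (or spark-shell); the data and value names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context in spark-shell, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregateByKey-sketch").setMaster("local[2]"))

val scores = sc.parallelize(List(("a", 1), ("a", 3), ("b", 2), ("a", 5), ("b", 6)), 2)

// zeroValue (0, 0) is the neutral (sum, count) accumulator.
val sumAndCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into the accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2))   // combOp: merge accumulators of two partitions

val averages = sumAndCount
  .mapValues { case (sum, n) => sum.toDouble / n }
  .collectAsMap()

// a -> 3.0 (9 / 3), b -> 4.0 (8 / 2)
println(averages)
```

Note that the accumulator type (Int, Int) differs from the value type Int, which is the main thing aggregateByKey allows and reduceByKey does not.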

sortByKey([ascending], [numTasks])
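Sorts the input dataset by key. Only the keys are compared, so the relative order of values that share a key is not guaranteed.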

scala> var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(3,4),(7,9),(2,4)))
scala> data.sortByKey(true).collect()
Array[(Int, Int)] = Array((1,2), (1,4), (1,3), (2,3), (2,4), (3,4), (7,9))

join(otherDataset, [numTasks])
Joins two RDDs by key (an inner join): only keys present in both RDDs appear in the result.

scala> val rdd1=sc.parallelize(List(('a',2),('b',4),('c',6),('d',9)))
scala> val rdd2=sc.parallelize(List(('c',6),('c',7),('d',8),('e',10)))
scala> val rdd3 = rdd1 join rdd2
scala> rdd3.collect()
Array[(Char, (Int, Int))] = Array((c,(6,6)), (c,(6,7)), (d,(9,8)))

cogroup(otherDataset, [numTasks])
If the input RDDs have types (K, V) and (K, W), the returned RDD has type (K, (Iterable<V>, Iterable<W>)). This operation is equivalent to groupWith.

scala> val rdd1=sc.parallelize(Array((1,2),(1,3)))
scala> val rdd2=sc.parallelize(Array((1,3)))
scala> rdd1.cogroup(rdd2).collect
Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))
scala> rdd1.groupWith(rdd2).collect
res10: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))

cartesian(otherDataset)
Computes the Cartesian product of two RDDs.

scala> val rdd1=sc.parallelize(Array(1,2,3,4))
scala> val rdd2=sc.parallelize(Array(5,6))
scala> rdd1.cartesian(rdd2).collect
res12: Array[(Int, Int)] = Array((1,5), (1,6), (2,5), (2,6), (3,5), (3,6), (4,5), (4,6))

coalesce(numPartitions)
Reduces the number of partitions of the RDD to numPartitions. By default shuffle = false, so no shuffle is performed.

scala> val rdd1=sc.parallelize(1 to 100,3)
scala> val rdd2=rdd1.coalesce(2)
scala> rdd1.collect()
17/09/22 08:52:40 INFO spark.SparkContext: Starting job: collect at <console>:30
17/09/22 08:52:40 INFO scheduler.DAGScheduler: Got job 14 (collect at <console>:30) with 3 output partitions
scala> rdd2.collect()
17/09/22 08:52:09 INFO spark.SparkContext: Starting job: collect at <console>:32
17/09/22 08:52:09 INFO scheduler.DAGScheduler: Got job 13 (collect at <console>:32) with 2 output partitions

repartition(numPartitions)
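repartition(numPartitions) serves the same purpose as coalesce and in fact simply calls coalesce with shuffle = true, which means it can incur a large amount of network overhead. Because it shuffles, it can also increase the number of partitions.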
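
The difference between the two calls shows up in the resulting partition counts. This is a small sketch assuming a local run (or spark-shell); the value names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the running context in spark-shell, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("repartition-sketch").setMaster("local[2]"))

val rdd = sc.parallelize(1 to 100, 3)

// coalesce without a shuffle can only merge existing partitions.
val shrunk = rdd.coalesce(2).getNumPartitions                            // 2
val notGrown = rdd.coalesce(6).getNumPartitions                          // stays at 3

// With a shuffle the data is redistributed, so the count can grow.
val grownWithShuffle = rdd.coalesce(6, shuffle = true).getNumPartitions  // 6
val repartitioned = rdd.repartition(6).getNumPartitions                  // 6

println(s"$shrunk $notGrown $grownWithShuffle $repartitioned")
```

A reasonable rule of thumb that follows from this: prefer coalesce when only reducing the partition count, and use repartition when the count must grow or the data needs rebalancing.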

repartitionAndSortWithinPartitions(partitioner)
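repartitionAndSortWithinPartitions is a variant of repartition. It repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by key.
This is more efficient than calling repartition and then sorting within each partition, because the sorting is pushed down into the shuffle machinery.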
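
A short sketch of the effect, assuming a local run (or spark-shell) and a HashPartitioner with two partitions (the choice of partitioner and the sample data are illustrative). glom is used only to make the partition boundaries visible.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuses the running context in spark-shell, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("repartitionAndSort-sketch").setMaster("local[2]"))

val pairs = sc.parallelize(
  List((3, "c"), (1, "a"), (4, "d"), (2, "b"), (6, "f"), (5, "e")), 3)

// HashPartitioner(2) sends Int key k to partition k % 2 (for non-negative keys);
// records inside each partition come out sorted by key.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// One inner list per partition, in partition order.
val byPartition = sorted.glom().collect().map(_.toList).toList

// partition 0: (2,b), (4,d), (6,f)   partition 1: (1,a), (3,c), (5,e)
byPartition.foreach(println)
```

The result is sorted within each partition only; there is no global ordering across partitions, which is what distinguishes it from sortByKey.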
