Introduction to Common Spark RDD Operations, with Demos
Transformation:
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
val list=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4)))
val result = list.map(x => (x._1,x._2+1))
for(each <- result){
print(each)
}
console:(a,2)(a,3)(b,4)(b,5)
val list=sc.parallelize(List(1,2,3,4,5))
val result = list.map(_+1)
for(each <- result){
print(each)
}
console:23456
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
val list=sc.parallelize(List(1,2,3,4,5,6))
val result = list.filter(_%2==0)
for(each <- result){
print(each)
}
console:246
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
eg:
val list = sc.parallelize(List("abc","def"))
val result = list.flatMap(_.toList)
for(each <- result){
print(each)
}
console:abcdef
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
val list=sc.parallelize(List(1,2,3))
val list1=sc.parallelize(List(4,5,6))
val result = list.union(list1)
for(each <- result){
print(each)
}
console:123456
join(otherDataset, [numTasks]): When called on datasets of type (K,V) and (K,W), returns a dataset of (K,(V,W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin (a leftOuterJoin sketch follows the demo below).
val list1=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4),('c',4)))
val list2=sc.parallelize(List(('a',5),('a',6),('b',7),('b',8)))
for(each <- list1.join(list2)){
print(each+" ")
}
console:(a,(1,5)) (a,(1,6)) (a,(2,5)) (a,(2,6)) (b,(3,7)) (b,(3,8)) (b,(4,7)) (b,(4,8))
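The outer-join variants follow the same pattern and wrap the value from the other side in an Option, so keys without a match still appear. A minimal sketch of leftOuterJoin reusing list1 and list2 from the demo above (the order of the printed pairs may vary):
// every key from list1 is kept; 'c' has no match in list2, so its right side is None
for(each <- list1.leftOuterJoin(list2)){
print(each + " ")
}
console (order may vary):(a,(1,Some(5))) (a,(1,Some(6))) (a,(2,Some(5))) (a,(2,Some(6))) (b,(3,Some(7))) (b,(3,Some(8))) (b,(4,Some(7))) (b,(4,Some(8))) (c,(4,None))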
intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.
val list1=sc.parallelize(List(('a',1),('a',5),('b',3),('b',4),('c',4)))
val list2=sc.parallelize(List(('a',5),('a',6),('b',4),('b',8)))
for(each <- list1.intersection(list2)){
print(each+" ")
}
console:(b,4) (a,5)
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
val list1=sc.parallelize(List(('a',1),('a',1),('b',3),('b',4),('c',4)))
for(each <- list1.distinct()){
print(each + " ")
}
console:(a,1) (b,4) (b,3) (c,4)
groupByKey([numTasks]): When called on a dataset of (K,V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance (see the sketch after the demo below).
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
val list1=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4),('c',4)))
for(each <- list1.groupByKey()){
print(each + " ")
}
console:(a,CompactBuffer(1, 2)) (b,CompactBuffer(3, 4)) (c,CompactBuffer(4))
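To illustrate the note above: a per-key sum written with groupByKey can be expressed with reduceByKey instead, which pre-aggregates values on each partition before the shuffle. A minimal sketch reusing list1 from the demo above:
// shuffles every value for a key across the network, then sums the grouped Iterable
val viaGroup = list1.groupByKey().mapValues(_.sum)
// combines values map-side first, so much less data crosses the network
val viaReduce = list1.reduceByKey(_ + _)
// the optional numTasks argument sets the number of output partitions
val viaReducePartitioned = list1.reduceByKey(_ + _, 4)
// both viaGroup and viaReduce contain (a,3) (b,7) (c,4)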
reduceByKey(func, [numTasks]): When called on a dataset of (K,V) pairs, returns a dataset of (K,V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
val list1=sc.parallelize(List(('a',1),('a',2),('b',3),('b',4),('c',4)))
for(each <- list1.reduceByKey(_+_)){
print(each + " ")
}
console:(a,3) (b,7) (c,4)
sortByKey([ascending], [numTasks]): When called on a dataset of (K,V) pairs where K implements Ordered, returns a dataset of (K,V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
val list1=sc.parallelize(List(('a',1),('e',2),('b',3),('d',4),('c',4)))
for(each <- list1.sortByKey(false)){
print(each + " ")
}
console:(e,2) (d,4) (c,4) (b,3) (a,1)
Action:
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
eg:
val rdd=sc.parallelize(List(1,2,3,4))
print(rdd.reduce((_+_)))
console:10
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
val rdd=sc.parallelize(List(1,2,3,4))
val result = rdd.filter(_%2==0).collect()
for(each <- result){
print(each + " ")
}
console:2 4
count(): Return the number of elements in the dataset.
val rdd = sc.parallelize(List(1,2,3,4))
print(rdd.count())
console:4
first(): Return the first element of the dataset (similar to take(1)).
val rdd = sc.parallelize(List(1,2,3,4))
print(rdd.first())
console:1
take(n): Return an array with the first n elements of the dataset. Note that this is currently not executed in parallel; instead, the driver program computes all the elements.
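A minimal sketch of take(n), following the pattern of the other action demos above:
val rdd = sc.parallelize(List(1,2,3,4))
// brings the first two elements back to the driver as a local Array
val result = rdd.take(2)
for(each <- result){
print(each + " ")
}
console:1 2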