Common Spark Transformation Operators (Part 2)
This post covers the following transformation operators:
aggregateByKey
join
cogroup
cartesian
pipe
repartitionAndSortWithinPartitions
glom
randomSplit
zip
zipWithIndex
zipWithUniqueId
(1) aggregateByKey: aggregateByKey(zeroValue)(seqOp, combOp) first folds the values of each key inside every partition with seqOp, starting from zeroValue, and then merges the per-partition results for each key with combOp. The examples below demonstrate the results.
import org.apache.spark.{SparkConf, SparkContext}

object aggregateByKeyTest {

  // seqOp: applied to the values of a key within one partition, starting from the zero value
  def seq(a: Int, b: Int): Int = {
    math.max(a, b)
  }

  // combOp: merges the per-partition results of a key across partitions
  def comb(a: Int, b: Int): Int = {
    a + b
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("aggregateByKeyTest").setMaster("local")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(List((1,3),(1,2),(1,4),(2,3)))
    data.aggregateByKey(1)(seq, comb).foreach(println)
    /*
    (1,4)
    (2,3)
    */

    val data2 = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(2,4),(3,1),(3,2),(3,3),(4,1),(4,2),(4,3),(4,4)), 2)
    data2.aggregateByKey(1)(seq, comb).foreach(println)
    /*
    (4,4)
    (2,4)
    (1,4)
    (3,4)
    */
  }
}
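To make the second result concrete, here is a small plain-Scala trace (a sketch, no Spark involved, assuming the twelve pairs are split evenly across the two partitions, which is how the parallelize call above slices them) of how key 3 ends up as (3,4):

// key 3: partition 0 holds (3,1); partition 1 holds (3,2) and (3,3)
val zero = 1
val valuesInPartition0 = List(1)
val valuesInPartition1 = List(2, 3)
val local0 = valuesInPartition0.foldLeft(zero)(math.max)  // seqOp within partition 0 -> 1
val local1 = valuesInPartition1.foldLeft(zero)(math.max)  // seqOp within partition 1 -> 3
val merged = local0 + local1                              // combOp across partitions -> 4
println(merged)                                           // 4, matching (3,4) above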
(2) join: inner, left outer, right outer, and full outer joins on pair RDDs
import org.apache.spark.{SparkConf, SparkContext}

object JoinTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JoinTest").setMaster("local")
    val sc = new SparkContext(conf)

    val nameList = List(
      (1,"Jed"),
      (2,"Tom"),
      (3,"Bob"),
      (4,"Tony")
    )
    val salaryArr = Array(
      (1,8000),
      (2,6000),
      (3,5000)
    )

    val nameRDD = sc.parallelize(nameList, 2)
    val salaryRDD = sc.parallelize(salaryArr, 3)

    // inner join
    val joinRDD = nameRDD.join(salaryRDD)
    joinRDD.foreach(x => {
      val id = x._1
      val name = x._2._1
      val salary = x._2._2
      println(id + "\t" + name + "\t" + salary)
    })
    /*
    1 Jed 8000
    2 Tom 6000
    3 Bob 5000
    */

    // left outer join
    val leftOuterJoinRDD = nameRDD.leftOuterJoin(salaryRDD)
    leftOuterJoinRDD.foreach(x => {
      val id = x._1
      val name = x._2._1
      val salary = x._2._2
      println(id + "\t" + name + "\t" + salary)
    })
    /*
    1 Jed Some(8000)
    2 Tom Some(6000)
    3 Bob Some(5000)
    4 Tony None
    */

    // right outer join
    val rightOuterJoinRDD = nameRDD.rightOuterJoin(salaryRDD)
    rightOuterJoinRDD.foreach(x => {
      val id = x._1
      val name = x._2._1
      val salary = x._2._2
      println(id + "\t" + name + "\t" + salary)
    })
    /*
    1 Some(Jed) 8000
    2 Some(Tom) 6000
    3 Some(Bob) 5000
    */

    // full outer join
    val fullOuterJoinRDD = nameRDD.fullOuterJoin(salaryRDD)
    fullOuterJoinRDD.foreach(x => {
      val id = x._1
      val name = x._2._1
      val salary = x._2._2
      println(id + "\t" + name + "\t" + salary)
    })
    /*
    1 Some(Jed) Some(8000)
    2 Some(Tom) Some(6000)
    3 Some(Bob) Some(5000)
    4 Some(Tony) None
    */
  }
}
(3) cogroup: for each key, groups the values from multiple RDDs together
val data1 = sc.parallelize(List((1, "Good"), (2, "Morning")))
val data2 = sc.parallelize(List((1, "How"), (2, "Are"), (3, "You")))
val data3 = sc.parallelize(List((1, "I"), (2, "Love"), (3, "U")))
val result = data1.cogroup(data2, data3)
result.foreach(println)
/*
(1,(CompactBuffer(Good),CompactBuffer(How),CompactBuffer(I)))
(2,(CompactBuffer(Morning),CompactBuffer(Are),CompactBuffer(Love)))
(3,(CompactBuffer(),CompactBuffer(You),CompactBuffer(U)))
*/
(4) cartesian: computes the Cartesian product of two RDDs
val rdd1 = sc.makeRDD(Array(1,2,3))
val rdd2 = sc.makeRDD(Array(4,5,6))
rdd1.cartesian(rdd2).foreach(println)
/*
(1,4)
(1,5)
(1,6)
(2,4)
(2,5)
(2,6)
(3,4)
(3,5)
(3,6)
*/
(5) pipe: pipes the elements of each partition through an external shell command
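A minimal sketch, assuming a Unix-like environment where grep is on the PATH: each element of a partition is written to the external command's stdin as one line, and every line the command prints to stdout becomes an element of the resulting RDD.

val textRDD = sc.parallelize(List("hello spark", "hello scala", "goodbye spark"))
// run `grep hello` once per partition; only matching lines come back
val pipedRDD = textRDD.pipe("grep hello")
pipedRDD.collect().foreach(println)
/*
hello spark
hello scala
*/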
(6) repartitionAndSortWithinPartitions: repartitions the RDD with the given partitioner and sorts the records by key within each new partition
import org.apache.spark.HashPartitioner

val arr = Array((1,"Tom"),(18,"Tony"),(23,"Ted"),
(3,"Harry"),(56,"Bob"),(45,"Jack"),
(22,"Jed"),(2,"Kobe"),(4,"Kate"),
(23,"Mary"),(32,"Tracy"),(6,"Allen"),
(7,"Caleb"),(19,"Alexande"),(9,"Nathan"))
val rdd = sc.makeRDD(arr,2)
rdd.foreachPartition(x => {
  println("=============")
  while (x.hasNext) {
    println(x.next())
  }
})
/*
=============
(1,Tom)
(18,Tony)
(23,Ted)
(3,Harry)
(56,Bob)
(45,Jack)
(22,Jed)
=============
(2,Kobe)
(4,Kate)
(23,Mary)
(32,Tracy)
(6,Allen)
(7,Caleb)
(19,Alexande)
(9,Nathan)
*/
// repartition into 4 partitions
rdd.repartitionAndSortWithinPartitions(new HashPartitioner(4))
  .foreachPartition(x => {
    println("=============")
    while (x.hasNext) {
      println(x.next())
    }
  })
/*
=============
(4,Kate)
(32,Tracy)
(56,Bob)
=============
(1,Tom)
(9,Nathan)
(45,Jack)
=============
(2,Kobe)
(6,Allen)
(18,Tony)
(22,Jed)
=============
(3,Harry)
(7,Caleb)
(19,Alexande)
(23,Ted)
(23,Mary)
*/
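The grouping above is determined by HashPartitioner: each record goes to partition key.hashCode modulo 4 (made non-negative), and records are then sorted by key within each partition. A minimal plain-Scala sketch of that assignment rule, applied to the keys of arr just to check that it reproduces the partitions shown above:

val keys = Seq(1, 18, 23, 3, 56, 45, 22, 2, 4, 23, 32, 6, 7, 19, 9)
// non-negative modulo, the same rule HashPartitioner uses for an Int key
keys.groupBy(k => ((k.hashCode % 4) + 4) % 4).toSeq.sortBy(_._1).foreach {
  case (p, ks) => println(s"partition $p -> ${ks.sorted.mkString(", ")}")
}
/*
partition 0 -> 4, 32, 56
partition 1 -> 1, 9, 45
partition 2 -> 2, 6, 18, 22
partition 3 -> 3, 7, 19, 23, 23
*/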
(7) glom: collects the elements of each partition into an array, yielding an RDD of arrays
val rdd = sc.parallelize(1 to 10,2)
val glomRDD = rdd.glom()
glomRDD.foreach(x => {
  println("============")
  x.foreach(println)
})
println("number of elements in glomRDD: " + glomRDD.count())
/*
============
1
2
3
4
5
============
6
7
8
9
10
number of elements in glomRDD: 2
*/
(8) randomSplit: randomly splits an RDD into multiple RDDs according to the given weights
val rdd = sc.parallelize(1 to 10)
// split the original RDD into 4 RDDs in a 1:2:3:4 ratio
rdd.randomSplit(Array(0.1,0.2,0.3,0.4)).foreach(x => {println(x.count)})
Theoretical result:
1
2
3
4
With a small amount of data, the actual counts will not necessarily match these exactly.
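For reproducible splits, randomSplit also accepts an explicit seed. A minimal sketch:

val parts = rdd.randomSplit(Array(0.1, 0.2, 0.3, 0.4), seed = 11L)
parts.foreach(x => println(x.count))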
(9) zip, zipWithIndex, zipWithUniqueId
package com.aura.transformations
import org.apache.spark.{SparkConf, SparkContext}
/**
* Author: Jed
* Description:
* Date: created on 2018/1/11
*/
object ZipTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ZipTest").setMaster("local")
    val sc = new SparkContext(conf)

    val arr = Array(1,2,3,4,5)
    val arr2 = Array("Tom","Jed","Tony","Terry","Kate")
    val rdd1 = sc.makeRDD(arr)
    val rdd2 = sc.makeRDD(arr2)

    // zip pairs the two RDDs element by element
    rdd1.zip(rdd2).foreach(println)
    /*
    (1,Tom)
    (2,Jed)
    (3,Tony)
    (4,Terry)
    (5,Kate)
    */

    // zipWithIndex pairs each element with its global index
    rdd2.zipWithIndex().foreach(println)
    /*
    (Tom,0)
    (Jed,1)
    (Tony,2)
    (Terry,3)
    (Kate,4)
    */

    // zipWithUniqueId pairs each element with a unique (not necessarily consecutive) id
    rdd1.zipWithUniqueId().foreach(println)
    /*
    (1,0)
    (2,1)
    (3,2)
    (4,3)
    (5,4)
    */
  }
}
Principle: zip pairs the two RDDs position by position and requires them to have the same number of partitions and the same number of elements in each partition. zipWithIndex assigns every element its global, 0-based index in partition order. zipWithUniqueId assigns the i-th element of partition k the id i * numPartitions + k, so the ids are unique but not necessarily consecutive.
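In the example above the RDDs have a single partition, so zipWithUniqueId happens to produce consecutive ids. A minimal sketch with 2 partitions (assuming the usual even slicing of parallelize, i.e. "a","b" in partition 0 and "c","d","e" in partition 1) shows the i * numPartitions + k pattern:

val letters = sc.parallelize(Array("a", "b", "c", "d", "e"), 2)
letters.zipWithUniqueId().foreach(println)
/*
partition 0 (k = 0): ids 0, 2
partition 1 (k = 1): ids 1, 3, 5
(a,0)
(b,2)
(c,1)
(d,3)
(e,5)
*/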