Spark RDD Operators (5): Key-Value Aggregation with combineByKey
combineByKey
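Aggregation is straightforward when all the data sits in one place, but how do you implement it over a distributed dataset? This post introduces combineByKey, the general operator that the other per-key aggregations (reduceByKey, foldByKey, aggregateByKey, groupByKey) are built on.
Brief introduction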
def combineByKey[C](createCombiner: V => C,
                    mergeValue: (C, V) => C,
                    mergeCombiners: (C, C) => C): RDD[(K, C)]
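createCombiner: combineByKey() iterates over every element in a partition, so each element's key either has not been seen yet or matches the key of an earlier element. When the key is new, combineByKey() calls the createCombiner() function to create the initial value of the accumulator for that key. Note that this happens the first time a key appears in each partition, not just the first time it appears in the RDD.
mergeValue: if the key has already been seen while processing the current partition, combineByKey() calls mergeValue() to merge the accumulator's current value for that key with the new value.
mergeCombiners: because each partition is processed independently, the same key can end up with several accumulators. If two or more partitions hold an accumulator for the same key, the user-supplied mergeCombiners() function is used to merge the per-partition results.
Example: computing each student's average score
Create a class that describes one score record for a student
Scala version: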
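To make the division of labor between the three functions concrete, here is a minimal plain-Java sketch that imitates the two phases on in-memory lists, with no Spark involved. The names CombineByKeySketch, SumCount, combinePartition, mergePartitions, averageByKey and samplePartitions are all hypothetical and exist only for this illustration; real Spark also handles shuffling, spilling and serialization, none of which is modeled here.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Illustration only: imitates combineByKey's semantics on plain lists.
final class CombineByKeySketch {

    // Accumulator type C: a running (sum, count) pair.
    static final class SumCount {
        final double sum;
        final int count;
        SumCount(double sum, int count) { this.sum = sum; this.count = count; }
    }

    // Phase 1, run once per partition: build one accumulator per key.
    static <K, V, C> Map<K, C> combinePartition(
            List<Map.Entry<K, V>> partition,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue) {
        Map<K, C> accumulators = new HashMap<>();
        for (Map.Entry<K, V> pair : partition) {
            K key = pair.getKey();
            if (accumulators.containsKey(key)) {
                // Key already seen in this partition: fold the value in.
                accumulators.put(key, mergeValue.apply(accumulators.get(key), pair.getValue()));
            } else {
                // First time this key appears in this partition.
                accumulators.put(key, createCombiner.apply(pair.getValue()));
            }
        }
        return accumulators;
    }

    // Phase 2: merge the accumulators that different partitions built for the same key.
    static <K, C> Map<K, C> mergePartitions(List<Map<K, C>> partials, BinaryOperator<C> mergeCombiners) {
        Map<K, C> merged = new HashMap<>();
        for (Map<K, C> partial : partials) {
            partial.forEach((key, acc) -> merged.merge(key, acc, mergeCombiners));
        }
        return merged;
    }

    // Per-key average, expressed with the three callbacks.
    static Map<String, Double> averageByKey(List<List<Map.Entry<String, Double>>> partitions) {
        List<Map<String, SumCount>> partials = new ArrayList<>();
        for (List<Map.Entry<String, Double>> partition : partitions) {
            Map<String, SumCount> partial = combinePartition(
                    partition,
                    score -> new SumCount(score, 1),
                    (acc, score) -> new SumCount(acc.sum + score, acc.count + 1));
            partials.add(partial);
        }
        Map<String, SumCount> totals = mergePartitions(
                partials, (a, b) -> new SumCount(a.sum + b.sum, a.count + b.count));
        Map<String, Double> averages = new HashMap<>();
        totals.forEach((name, acc) -> averages.put(name, acc.sum / acc.count));
        return averages;
    }

    static Map.Entry<String, Double> pair(String name, double score) {
        return new AbstractMap.SimpleEntry<>(name, score);
    }

    // Two hand-made "partitions": LiSA appears in both, Aythna twice in the first.
    static List<List<Map.Entry<String, Double>>> samplePartitions() {
        return Arrays.asList(
                Arrays.asList(pair("Aythna", 96.0), pair("LiSA", 94.0), pair("Aythna", 92.0)),
                Arrays.asList(pair("LiSA", 99.0)));
    }

    public static void main(String[] args) {
        System.out.println(averageByKey(samplePartitions()));
    }
}
```

In the sample data, Aythna's second score goes through mergeValue because both records sit in the same partition, while LiSA's two scores each go through createCombiner and only meet in mergeCombiners. The resulting averages are 94.0 for Aythna and 96.5 for LiSA.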
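import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
// One score record: who, which subject, what score
case class ScoreDetail(studentName: String, subject: String, score: Float)
def main(args: Array[String]): Unit = {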
val conf = new SparkConf().setMaster("local[*]").setAppName("combinebykey")
val sc = SparkContext.getOrCreate(conf)
val scores = List(
ScoreDetail("Aythna","Music",96),
ScoreDetail("Aythna","Math",92),
ScoreDetail("Tomoyo","English",97),
ScoreDetail("Tomoyo","Science",90),
ScoreDetail("LiSA","Sports",94),
ScoreDetail("LiSA","Music",99),
ScoreDetail("Ninja","Music",92),
ScoreDetail("Ninja","Sports",96),
ScoreDetail("Sino","Music",95),
ScoreDetail("Sino","Math",98),
ScoreDetail("Nagisa","Music",96),
ScoreDetail("Nagisa","Math",97)
)
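// Turn the list into (studentName, ScoreDetail) pairs with a for/yield comprehension, so the student name becomes the key
val scoresWithKey = for {i <- scores} yield (i.studentName,i)
// Create the RDD and hash-partition it into three partitions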
val scoresWithKeyRDD = sc.parallelize(scoresWithKey).partitionBy(new HashPartitioner(3)).cache()
// Print the number of elements in each partition, then the records themselves
scoresWithKeyRDD.foreachPartition(partition => println(partition.length))
scoresWithKeyRDD.foreachPartition(partContext => partContext.foreach(
x => println(s"${x._2} name: ${x._2.studentName}, subject: ${x._2.subject}, score: ${x._2.score}")
))
// Aggregate into a (sum, count) pair per student, divide to get the average, then print
val avgScoreRDD = scoresWithKeyRDD.combineByKey(
(x: ScoreDetail) => (x.score, 1),
(acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1),
(acc1: (Float, Int), acc2: (Float, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map({ case (key, value) => (key, value._1 / value._2) })
avgScoreRDD.collect().foreach(println)
}
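The Scala example calls partitionBy(new HashPartitioner(3)) before aggregating. As a rough illustration of what hash partitioning does to the keys, the plain-Java sketch below assigns a key to a partition by taking its hashCode modulo the partition count, kept non-negative, with null keys sent to partition 0. This mirrors the idea behind Spark's HashPartitioner but is not Spark's code; HashPartitionSketch, partitionFor and group are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: the idea behind hash partitioning, without Spark.
final class HashPartitionSketch {

    // Non-negative modulo of the key's hash code; null keys go to partition 0.
    static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0;
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Group keys by the partition they would be assigned to.
    static Map<Integer, List<String>> group(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition.computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> students = Arrays.asList("Aythna", "Tomoyo", "LiSA", "Ninja", "Sino", "Nagisa");
        System.out.println(group(students, 3));
    }
}
```

Because the partition depends only on the key, every record for a given student lands in the same partition. When an RDD is already partitioned this way, combineByKey can typically do all of its work inside each partition with createCombiner and mergeValue, and mergeCombiners has little or nothing left to merge.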
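Java version:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Map;
// The statements below form the body of a method such as main()
SparkConf sparkconf = new SparkConf().setMaster("local[*]").setAppName("JavaWordCount");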
JavaSparkContext sc = new JavaSparkContext(sparkconf);
ArrayList<ScoreDetail> scoreDetails = new ArrayList<>();
scoreDetails.add(new ScoreDetail("Aythna", "Music", 97));
scoreDetails.add(new ScoreDetail("Aythna", "Math", 94));
scoreDetails.add(new ScoreDetail("Nagisa", "Science", 99));
scoreDetails.add(new ScoreDetail("Nagisa", "English", 95));
scoreDetails.add(new ScoreDetail("Tomoyo", "Artist", 92));
scoreDetails.add(new ScoreDetail("Tomoyo", "Sports", 96));
scoreDetails.add(new ScoreDetail("LiSA", "Music", 99));
scoreDetails.add(new ScoreDetail("LiSA", "Sports", 98));
JavaRDD<ScoreDetail> scoreDetailsRDD = sc.parallelize(scoreDetails);
JavaPairRDD<String, ScoreDetail> pairRDD = scoreDetailsRDD.mapToPair(new PairFunction<ScoreDetail, String, ScoreDetail>() {
@Override
public Tuple2<String, ScoreDetail> call(ScoreDetail scoreDetail) throws Exception {
return new Tuple2<>(scoreDetail.studentName, scoreDetail);
}
});
// createCombiner: start a (sum, count) accumulator from the first score seen for a key
Function<ScoreDetail, Tuple2<Float, Integer>> createCombine = new Function<ScoreDetail, Tuple2<Float, Integer>>() {
@Override
public Tuple2<Float, Integer> call(ScoreDetail scoreDetail) throws Exception {
return new Tuple2<>(scoreDetail.score, 1);
}
};
// Function2 takes two arguments and returns one value; mergeValue folds another score into the accumulator
Function2<Tuple2<Float, Integer>, ScoreDetail, Tuple2<Float, Integer>> mergeValue = new Function2<Tuple2<Float, Integer>, ScoreDetail, Tuple2<Float, Integer>>() {
@Override
public Tuple2<Float, Integer> call(Tuple2<Float, Integer> tp, ScoreDetail scoreDetail) throws Exception {
return new Tuple2<>(tp._1 + scoreDetail.score, tp._2 + 1);
}
};
Function2<Tuple2<Float, Integer>, Tuple2<Float, Integer>, Tuple2<Float, Integer>> mergeCombiners = new Function2<Tuple2<Float, Integer>, Tuple2<Float, Integer>, Tuple2<Float, Integer>>() {
@Override
public Tuple2<Float, Integer> call(Tuple2<Float, Integer> tp1, Tuple2<Float, Integer> tp2) throws Exception {
return new Tuple2<>(tp1._1 + tp2._1, tp1._2 + tp2._2);
}
};
JavaPairRDD<String, Tuple2<Float, Integer>> combineByRDD = pairRDD.combineByKey(createCombine, mergeValue, mergeCombiners);
// Print each student's average score
Map<String, Tuple2<Float, Integer>> stringTuple2Map = combineByRDD.collectAsMap();
for (String et : stringTuple2Map.keySet()) {
System.out.println(et + " " + stringTuple2Map.get(et)._1 / stringTuple2Map.get(et)._2);
}
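The Java version relies on a ScoreDetail class that the listing does not show. A minimal sketch of what it might look like follows, assuming public fields (the listing reads scoreDetail.studentName and scoreDetail.score directly), a float score (so that it fits Tuple2<Float, Integer>), and a three-argument constructor. It implements Serializable because Spark needs to ship the objects between JVMs. The field layout is an assumption inferred from how the listing uses the class, not the author's original definition.

```java
import java.io.Serializable;

// Assumed shape of the ScoreDetail class used by the Java listing.
class ScoreDetail implements Serializable {
    private static final long serialVersionUID = 1L;

    public final String studentName;
    public final String subject;
    public final float score;

    public ScoreDetail(String studentName, String subject, float score) {
        this.studentName = studentName;
        this.subject = subject;
        this.score = score;
    }

    @Override
    public String toString() {
        return "ScoreDetail(" + studentName + "," + subject + "," + score + ")";
    }
}
```

With this shape, a call such as new ScoreDetail("Aythna", "Music", 97) compiles because the int literal widens to float.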