Spark RDD Operators (5): Key-Value Aggregation with combineByKey

Posted by jalrs on 2020-11-11

combineByKey
Aggregation functions are easy to write over centralized data, but how do you implement them over a distributed dataset? This post introduces combineByKey, the ancestor of Spark's various per-key aggregation operations.
A brief introduction

def combineByKey[C](createCombiner: V => C,
                    mergeValue: (C, V) => C,
                    mergeCombiners: (C, C) => C): RDD[(K, C)]

createCombiner: combineByKey() walks through every element in a partition, so each element's key is either one it has not seen before or one it has already met. When the key is new, combineByKey() calls the user-supplied createCombiner() function to create the initial value of that key's accumulator.
mergeValue: when the key has already been seen while processing the current partition, mergeValue() merges the key's current accumulator with the new value.
mergeCombiners: because each partition is processed independently, the same key can end up with several accumulators. If two or more partitions hold an accumulator for the same key, the user-supplied mergeCombiners() merges the per-partition results. A minimal sketch below illustrates when each of these three callbacks fires.
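The following sketch is not part of the article's example; it computes per-key sums and assumes a SparkContext named sc already exists:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val sums = pairs.combineByKey(
  (v: Int) => v,                  //createCombiner: start the accumulator with the first value seen for a key in a partition
  (acc: Int, v: Int) => acc + v,  //mergeValue: fold another value into the partition-local accumulator
  (a1: Int, a2: Int) => a1 + a2   //mergeCombiners: merge accumulators from different partitions
)
sums.collect().foreach(println)   //prints (a,4) and (b,2)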
Example: computing students' average scores
First, create a class that describes a student's score record.
Scala version:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

case class ScoreDetail(studentName: String, subject: String, score: Float)
def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("combinebykey")
    val sc = SparkContext.getOrCreate(conf)
    val scores = List(
      ScoreDetail("Aythna","Music",96),
      ScoreDetail("Aythna","Math",92),
      ScoreDetail("Tomoyo","English",97),
      ScoreDetail("Tomoyo","Science",90),
      ScoreDetail("LiSA","Sports",94),
      ScoreDetail("LiSA","Music",99),
      ScoreDetail("Ninja","Music",92),
      ScoreDetail("Ninja","Sports",96),
      ScoreDetail("Sino","Music",95),
      ScoreDetail("Sino","Math",98),
      ScoreDetail("Nagisa","Music",96),
      ScoreDetail("Nagisa","Math",97)
    )
    //Convert the list into key-value pairs keyed by student name (conceptually a map), using a for/yield comprehension
    val scoresWithKey = for {i <- scores} yield (i.studentName,i)
    //Create the RDD and hash-partition it into three partitions
    val scoresWithKeyRDD = sc.parallelize(scoresWithKey).partitionBy(new HashPartitioner(3)).cache()

    //Print each partition's length, then its contents
    scoresWithKeyRDD.foreachPartition(partition => println(partition.length))
    scoresWithKeyRDD.foreachPartition(partContext => partContext.foreach(
      x => println(x._2, "name: " + x._2.studentName, "subject: " + x._2.subject, "score: " + x._2.score)
    ))

    //Aggregate into a (sum, count) accumulator per student, compute the averages, and print them
    val avgScoreRDD = scoresWithKeyRDD.combineByKey(
      (x: ScoreDetail) => (x.score, 1),                                        //createCombiner: (score, count = 1)
      (acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1),   //mergeValue: add the score, bump the count
      (acc1: (Float, Int), acc2: (Float, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2) //mergeCombiners: sum the partials
    ).map({ case (key, value) => (key, value._1 / value._2) })
    avgScoreRDD.collect().foreach(println)
  }

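For comparison, the same average can be expressed with aggregateByKey, which Spark builds on the same combineByKey machinery. This sketch reuses scoresWithKeyRDD from the example above, with the zero value (0f, 0) playing the role of createCombiner's initial accumulator:

val avgViaAggregate = scoresWithKeyRDD
  .aggregateByKey((0f, 0))(
    (acc, d) => (acc._1 + d.score, acc._2 + 1),  //within a partition
    (a1, a2) => (a1._1 + a2._1, a1._2 + a2._2)   //across partitions
  )
  .mapValues { case (sum, count) => sum / count }
avgViaAggregate.collect().foreach(println)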

Java version:

import java.util.ArrayList;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// ScoreDetail is assumed to be a Serializable POJO with public studentName,
// subject, and score fields, mirroring the Scala case class above.
SparkConf sparkconf = new SparkConf().setMaster("local[*]").setAppName("JavaWordCount");
JavaSparkContext sc = new JavaSparkContext(sparkconf);
ArrayList<ScoreDetail> scoreDetails = new ArrayList<>();
scoreDetails.add(new ScoreDetail("Aythna", "Music", 97));
scoreDetails.add(new ScoreDetail("Aythna", "Math", 94));
scoreDetails.add(new ScoreDetail("Nagisa", "Science", 99));
scoreDetails.add(new ScoreDetail("Nagisa", "English", 95));
scoreDetails.add(new ScoreDetail("Tomoyo", "Artist", 92));
scoreDetails.add(new ScoreDetail("Tomoyo", "Sports", 96));
scoreDetails.add(new ScoreDetail("LiSA", "Music", 99));
scoreDetails.add(new ScoreDetail("LiSA", "Sports", 98));

// Key each ScoreDetail by student name.
JavaRDD<ScoreDetail> scoreDetailsRDD = sc.parallelize(scoreDetails);
JavaPairRDD<String, ScoreDetail> pairRDD = scoreDetailsRDD.mapToPair(new PairFunction<ScoreDetail, String, ScoreDetail>() {
    @Override
    public Tuple2<String, ScoreDetail> call(ScoreDetail scoreDetail) throws Exception {
        return new Tuple2<>(scoreDetail.studentName, scoreDetail);
    }
});

// createCombiner: initialize the accumulator for a new key as (score, count = 1).
Function<ScoreDetail, Tuple2<Float, Integer>> createCombine = new Function<ScoreDetail, Tuple2<Float, Integer>>() {
    @Override
    public Tuple2<Float, Integer> call(ScoreDetail scoreDetail) throws Exception {
        return new Tuple2<>(scoreDetail.score, 1);
    }
};

// mergeValue: a Function2 takes two arguments and returns one value; here it
// folds another score into the partition-local (sum, count) accumulator.
Function2<Tuple2<Float, Integer>, ScoreDetail, Tuple2<Float, Integer>> mergeValue = new Function2<Tuple2<Float, Integer>, ScoreDetail, Tuple2<Float, Integer>>() {
    @Override
    public Tuple2<Float, Integer> call(Tuple2<Float, Integer> tp, ScoreDetail scoreDetail) throws Exception {
        return new Tuple2<>(tp._1 + scoreDetail.score, tp._2 + 1);
    }
};

// mergeCombiners: merge the (sum, count) accumulators from different partitions.
Function2<Tuple2<Float, Integer>, Tuple2<Float, Integer>, Tuple2<Float, Integer>> mergeCombiners = new Function2<Tuple2<Float, Integer>, Tuple2<Float, Integer>, Tuple2<Float, Integer>>() {
    @Override
    public Tuple2<Float, Integer> call(Tuple2<Float, Integer> tp1, Tuple2<Float, Integer> tp2) throws Exception {
        return new Tuple2<>(tp1._1 + tp2._1, tp1._2 + tp2._2);
    }
};
JavaPairRDD<String, Tuple2<Float, Integer>> combineByRDD = pairRDD.combineByKey(createCombine, mergeValue, mergeCombiners);

// Print each student's average: sum / count.
Map<String, Tuple2<Float, Integer>> stringTuple2Map = combineByRDD.collectAsMap();
for (String et : stringTuple2Map.keySet()) {
    System.out.println(et + " " + stringTuple2Map.get(et)._1 / stringTuple2Map.get(et)._2);
}

