Spark RDD: reduceByKey vs groupByKey

Posted by zzzzMing on 2018-10-28

Original URL: https://flycode.co/archives/247823

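Spark has two similar APIs, reduceByKey and groupByKey. They serve similar purposes, but their underlying implementations differ. Why were they designed this way? Let's analyze it from the source code.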

First, look at the call chain of each (both use the default Partitioner, that is, the one returned by defaultPartitioner).

Spark version used: Spark 2.1.0

First, reduceByKey

Step1

  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }

Step2

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

Step3

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

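Leaving the details of the method aside for now, all we need to know is that the call ends up in combineByKeyWithClassTag. Two of this method's parameters deserve a closer look: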

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)

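First is the partitioner parameter, which determines how the resulting RDD is partitioned. defaultPartitioner is a helper that picks the Partitioner: it reuses an existing partitioner from the parent RDDs if there is one, and otherwise falls back to a HashPartitioner. Spark ships with HashPartitioner and RangePartitioner, and users can also define their own partitioner. The source code shows that an exception is thrown when the keys are arrays and the partitioner is a HashPartitioner.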
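The array-key check makes more sense once you recall that JVM arrays compare and hash by identity, not by content. The following is a minimal plain-Scala sketch of that behavior (no Spark needed). ArrayKeyDemo and its field names are made up for illustration and are not part of Spark.

```scala
object ArrayKeyDemo {
  // Two distinct arrays holding the same elements.
  val a: Array[Int] = Array(1, 2)
  val b: Array[Int] = Array(1, 2)

  // Arrays inherit equals/hashCode from java.lang.Object, so == compares references.
  val arraysEqual: Boolean = a == b
  // Element-wise comparison has to be requested explicitly.
  val contentsEqual: Boolean = a.sameElements(b)

  // An immutable List has content-based equals and hashCode,
  // which is what a hash-based partitioner relies on.
  val listsEqual: Boolean = a.toList == b.toList
  val listHashesEqual: Boolean = a.toList.hashCode == b.toList.hashCode

  def main(args: Array[String]): Unit = {
    println(s"a == b: $arraysEqual")
    println(s"a.sameElements(b): $contentsEqual")
    println(s"a.toList == b.toList: $listsEqual")
    println(s"list hash codes equal: $listHashesEqual")
  }
}
```

Because two arrays with the same content are not guaranteed to share a hash code, a hash-based partitioner could send logically equal keys to different partitions. Converting array keys to a List or Seq before the pair operation is one way around this.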

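Next is the mapSideCombine parameter, which is the biggest difference between reduceByKey and groupByKey. It decides whether a Combine step runs first on each node, on the map side, before the shuffle. A more concrete example follows below.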

Next, groupByKey

Step1

  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

Step2

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

Step3

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

Combined with the reduceByKey call chain above, we can see that both ultimately call combineByKeyWithClassTag, but with different arguments.
The call made by reduceByKey:

combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

The call made by groupByKey:

combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)

These different calls are what make the two methods differ. Let's look at each difference in turn.

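The type argument for reduceByKey is simply [V], whereas for groupByKey it is [CompactBuffer[V]]. This is why the return types differ: the former returns RDD[(K, V)], the latter RDD[(K, Iterable[V])].
Then there is mapSideCombine = false. This parameter defaults to true, and reduceByKey leaves it at the default. As mentioned above, it controls whether a preliminary merge (Combine) is performed on the map side. See the concrete example below.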
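To make the type difference concrete, here is a minimal sketch that models the combiner functions on a local collection. combineLocally is a hypothetical single-machine stand-in, not Spark code, and Vector is used here in place of Spark's CompactBuffer.

```scala
object CombineByKeyModel {
  // Hypothetical local stand-in for combineByKeyWithClassTag:
  // the first value of a key goes through createCombiner,
  // every later value is folded in with mergeValue.
  def combineLocally[K, V, C](
      records: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case Some(c) => acc.updated(k, mergeValue(c, v))
        case None    => acc.updated(k, createCombiner(v))
      }
    }

  val records: Seq[(String, Int)] = Seq("a" -> 1, "b" -> 2, "a" -> 3, "a" -> 4)

  // reduceByKey style: C = V, createCombiner is the identity, mergeValue is func.
  val reduced: Map[String, Int] =
    combineLocally[String, Int, Int](records, v => v, _ + _)

  // groupByKey style: C = a buffer of V, values are appended rather than merged.
  val grouped: Map[String, Vector[Int]] =
    combineLocally[String, Int, Vector[Int]](records, v => Vector(v), (buf, v) => buf :+ v)

  def main(args: Array[String]): Unit = {
    println(reduced)
    println(grouped)
  }
}
```

With C = V the result holds one value per key, while with a buffer type it holds every value per key, which mirrors the RDD[(K, V)] versus RDD[(K, Iterable[V])] return types.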

[Figure: Spark RDD, reduceByKey vs groupByKey]

Functionally, reduceByKey first performs a merge on each node, while groupByKey does not.
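The effect of the per-node merge can be sketched with plain Scala collections. This is a toy model under stated assumptions, not how Spark implements the shuffle: each inner Seq stands for one map-side partition, and the object, helper names, and sample data are all invented for illustration.

```scala
object MapSideCombineModel {
  type Record = (String, Int)

  // Two hypothetical map-side partitions of word-count style records.
  val partitions: Seq[Seq[Record]] = Seq(
    Seq("a" -> 1, "b" -> 1, "a" -> 1, "a" -> 1),
    Seq("b" -> 1, "a" -> 1, "b" -> 1)
  )

  // mapSideCombine = true: each partition pre-merges its own records per key,
  // so at most one record per key leaves each partition.
  def withMapSideCombine(parts: Seq[Seq[Record]], func: (Int, Int) => Int): Seq[Seq[Record]] =
    parts.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }.toSeq
    }

  // mapSideCombine = false: every record goes through the shuffle unchanged.
  def withoutMapSideCombine(parts: Seq[Seq[Record]]): Seq[Seq[Record]] = parts

  // Number of records that cross the shuffle boundary.
  def shuffledRecords(parts: Seq[Seq[Record]]): Int = parts.map(_.size).sum

  // Reduce side: merge whatever arrived for each key.
  def reduceSide(parts: Seq[Seq[Record]], func: (Int, Int) => Int): Map[String, Int] =
    parts.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }

  val combined: Seq[Seq[Record]] = withMapSideCombine(partitions, _ + _)
  val uncombined: Seq[Seq[Record]] = withoutMapSideCombine(partitions)

  def main(args: Array[String]): Unit = {
    println(s"records shuffled with map-side combine: ${shuffledRecords(combined)}")
    println(s"records shuffled without it: ${shuffledRecords(uncombined)}")
    println(s"final result: ${reduceSide(combined, _ + _)}")
  }
}
```

Both paths produce the same totals, but the pre-merged path moves fewer records. For grouping there is nothing to shrink, since every value must be kept, which matches the comment in the groupByKey source explaining why it turns map-side combine off.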

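Seen this way, reduceByKey should perform much better than groupByKey, because part of the work has already been done on each node before the shuffle. So why does groupByKey exist, and what are its use cases? I'm not sure either. If you know, please share it in the comments. Thank you very much!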
