Feel free to repost this article; please credit the original source.
Overview
One of the major improvements to Spark core in Spark 1.1 is the introduction of the sort-based shuffle mechanism. This article takes a first look at how that mechanism is implemented.
A First Look at Sort-based Shuffle
Let's run a small experiment to get an intuitive feel for which intermediate files the sort-based shuffle produces. The steps are as follows.
Step 1: Edit conf/spark-defaults.conf and add the following line
spark.shuffle.manager SORT
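The same switch can also be made programmatically in a standalone application (spark-shell already creates sc, so there it is simplest to stick with the conf file). A minimal sketch, assuming the property is read the same way as it is from spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone setup: select the sort-based shuffle via SparkConf
// instead of conf/spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("sort-shuffle-demo")
  .set("spark.shuffle.manager", "SORT")   // switch the shuffle implementation
val sc = new SparkContext(conf)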
Step 2: Launch spark-shell
SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-shell
Step 3: Run wordcount
sc.textFile("README.md").flatMap(l => l.split(" ")).map(w=>(w,1)).reduceByKey(_ + _).collect
Step 4: Inspect the generated intermediate files
find /tmp/spark-local* -type f
The files found are listed below
/tmp/spark-local-20140919091822-aa66/0f/shuffle_0_1_0.index
/tmp/spark-local-20140919091822-aa66/30/shuffle_0_0_0.index
/tmp/spark-local-20140919091822-aa66/0c/shuffle_0_0_0.data
/tmp/spark-local-20140919091822-aa66/15/shuffle_0_1_0.data
As you can see, files with two different suffixes are generated, data and index; their purposes will be explained in detail later in the analysis.
For comparison, switch the shuffle mode to hash and look at the generated files again; the difference is obvious. Change SORT to HASH in the configuration file, restart spark-shell, run the same wordcount, and the files found under /tmp are listed below.
/tmp/spark-local-20140919092949-14cc/10/shuffle_0_1_3
/tmp/spark-local-20140919092949-14cc/0f/shuffle_0_1_2
/tmp/spark-local-20140919092949-14cc/0f/shuffle_0_0_3
/tmp/spark-local-20140919092949-14cc/0c/shuffle_0_0_0
/tmp/spark-local-20140919092949-14cc/0d/shuffle_0_1_0
/tmp/spark-local-20140919092949-14cc/0d/shuffle_0_0_1
/tmp/spark-local-20140919092949-14cc/0e/shuffle_0_1_1
/tmp/spark-local-20140919092949-14cc/0e/shuffle_0_0_2
The number of files generated differs dramatically between the two modes; the exact counts work out as follows
- In HASH mode, each shuffle produces M*R files. In the wordcount example above the job contains a single shuffle; the input file is split into 2 partitions by default, so M = 2, and spark.default.parallelism is configured as 4, so R = 4, giving 1*2*4 = 8 files in total. A name such as shuffle_0_1_2 decodes as shuffle + shuffle_id + map_id + reduce_id, so 0_1_2 is the file produced by map task 1 of shuffle 0, whose contents will be consumed by reduce task 2
- In SORT mode, each map task produces exactly one data file, no matter how many reducers will consume it, so the file count is simply M. Since the wordcount input has 2 partitions by default, only two data files are produced (the short snippet after this list just restates the arithmetic)
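To make the comparison concrete, here is a throwaway Scala snippet (not Spark code; the helper names are made up) that reproduces the arithmetic from the two bullets above:

// File-count arithmetic for the two shuffle modes (illustration only).
def hashShuffleFiles(shuffles: Int, m: Int, r: Int): Int = shuffles * m * r
def sortShuffleDataFiles(mapTasksPerShuffle: Seq[Int]): Int = mapTasksPerShuffle.sum

println(hashShuffleFiles(1, 2, 4))     // 8 files, matching the HASH listing above
println(sortShuffleDataFiles(Seq(2)))  // 2 data files (plus 2 index files) in SORT mode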
Multiple shuffles
The example above contains only one shuffle. With a small change we can trigger two shuffles; the code is shown below
sc.textFile("README.md").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _).map(p=>(p._2,p._1)).groupByKey.collect
The code above uses map to invert the result of reduceByKey, turning each (w, count) into (count, w), and then groups the words by their occurrence count. The groupByKey call triggers a second shuffle.
The files produced in HASH mode are listed below
/tmp/spark-local-20140919094531-1cb6/12/shuffle_0_3_3
/tmp/spark-local-20140919094531-1cb6/0c/shuffle_0_0_0
/tmp/spark-local-20140919094531-1cb6/11/shuffle_0_2_3
/tmp/spark-local-20140919094531-1cb6/11/shuffle_0_3_2
/tmp/spark-local-20140919094531-1cb6/11/shuffle_1_1_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_2_2
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_1_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_3_1
/tmp/spark-local-20140919094531-1cb6/10/shuffle_1_0_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_1_1_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_0_3
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_3_0
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_2_1
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_1_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_1_0_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_1_1_1
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_0_0_1
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_0_1_0
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_1_0_0
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_2_0
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_1_1
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_0_2
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_1_0_1
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_1_1_0
Introducing one more shuffle produces a large number of additional intermediate files.
What happens with SORT instead? Only M more data files are added. Since the new shuffle has 4 map tasks, the total comes to 2 + 4 = 6 data files (each with a matching index file), as listed below.
/tmp/spark-local-20140919094731-034a/29/shuffle_0_3_0.data
/tmp/spark-local-20140919094731-034a/30/shuffle_0_0_0.index
/tmp/spark-local-20140919094731-034a/15/shuffle_0_1_0.data
/tmp/spark-local-20140919094731-034a/36/shuffle_0_2_0.data
/tmp/spark-local-20140919094731-034a/0c/shuffle_0_0_0.data
/tmp/spark-local-20140919094731-034a/32/shuffle_0_2_0.index
/tmp/spark-local-20140919094731-034a/32/shuffle_1_1_0.index
/tmp/spark-local-20140919094731-034a/0f/shuffle_0_1_0.index
/tmp/spark-local-20140919094731-034a/0f/shuffle_1_0_0.index
/tmp/spark-local-20140919094731-034a/0a/shuffle_1_1_0.data
/tmp/spark-local-20140919094731-034a/2b/shuffle_1_0_0.data
/tmp/spark-local-20140919094731-034a/0d/shuffle_0_3_0.index
It is worth pointing out the execution order of shuffle_0 and shuffle_1: the higher the number, the earlier it runs. Because a Spark job is submitted by tracing the DAG backwards from the final stage, the shuffle ids are assigned in reverse, so shuffle 0 is the last to execute while the higher-numbered shuffles run before it.
The Design Idea Behind Sort-based Shuffle
The guiding principle of sort-based shuffle is that each map task ultimately produces a single shuffle file. How, then, does a downstream reduce task find its own partition inside that one file? This is where the new file type, the index file, comes in.
The concrete steps are:
- After reading its input partition, a map task writes its computed results into an ExternalSorter
- The ExternalSorter keeps incoming results in a map, bucketed by partition; if a combine operation is defined, each new value is merged with the value already stored for that key
- If the memory used by this map exceeds the configured threshold, its contents are spilled to disk, with each spill producing a separate file
- Once all the data of the input partition has been processed, part of the result may still sit in memory while the rest lives in one or more spill files; a merge step then combines the in-memory data and the spill files into a single data file
- Finally, the start and end offset of each partition inside that data file are written to the index file (a small standalone sketch of this layout follows the list)
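To illustrate the role of the index file, here is a small self-contained sketch of the layout these steps imply (and that the getBlockData code shown later relies on). It assumes the index file is nothing more than numPartitions + 1 cumulative offsets stored as 8-byte longs; writeIndex and readPartitionRange are hypothetical helpers for illustration, not Spark APIs.

import java.io.{DataInputStream, DataOutputStream, File, FileInputStream, FileOutputStream}

// Write the index: offsets delimiting each partition inside the single data file.
def writeIndex(indexFile: File, partitionLengths: Array[Long]): Unit = {
  val out = new DataOutputStream(new FileOutputStream(indexFile))
  try {
    var offset = 0L
    out.writeLong(offset)                 // partition 0 starts at offset 0
    for (length <- partitionLengths) {    // accumulate lengths into absolute offsets
      offset += length
      out.writeLong(offset)
    }
  } finally {
    out.close()
  }
}

// Read back the (start, end) byte range belonging to one reducer.
def readPartitionRange(indexFile: File, reduceId: Int): (Long, Long) = {
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    in.skip(reduceId * 8)                 // each offset is one 8-byte long
    (in.readLong(), in.readLong())        // boundaries of this reducer's slice
  } finally {
    in.close()
  }
}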
The relevant source files
- SortShuffleManager.scala
- SortShuffleWriter.scala
- ExternalSorter.scala
- IndexShuffleBlockManager.scala
A few important functions
SortShuffleWriter.write
override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
  if (dep.mapSideCombine) {
    if (!dep.aggregator.isDefined) {
      throw new IllegalStateException("Aggregator is empty for map-side combine")
    }
    sorter = new ExternalSorter[K, V, C](
      dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    sorter.insertAll(records)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    sorter = new ExternalSorter[K, V, V](
      None, Some(dep.partitioner), None, dep.serializer)
    sorter.insertAll(records)
  }

  val outputFile = shuffleBlockManager.getDataFile(dep.shuffleId, mapId)
  val blockId = shuffleBlockManager.consolidateId(dep.shuffleId, mapId)
  val partitionLengths = sorter.writePartitionedFile(blockId, context, outputFile)
  shuffleBlockManager.writeIndexFile(dep.shuffleId, mapId, partitionLengths)

  mapStatus = new MapStatus(blockManager.blockManagerId,
    partitionLengths.map(MapOutputTracker.compressSize))
}
ExternalSorter.insertAll
def insertAll(records: Iterator[_ <: Product2[K, V]]): Unit = {
  val shouldCombine = aggregator.isDefined

  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    val mergeValue = aggregator.get.mergeValue
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      elementsRead += 1
      kv = records.next()
      map.changeValue((getPartition(kv._1), kv._1), update)
      maybeSpill(usingMap = true)
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      elementsRead += 1
      val kv = records.next()
      buffer.insert((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
      maybeSpill(usingMap = false)
    }
  }
}
writePartitionedFile merges the in-memory data and the contents of the spill files into a single file
def writePartitionedFile(
    blockId: BlockId,
    context: TaskContext,
    outputFile: File): Array[Long] = {

  // Track location of each range in the output file
  val lengths = new Array[Long](numPartitions)

  if (bypassMergeSort && partitionWriters != null) {
    // We decided to write separate files for each partition, so just concatenate them. To keep
    // this simple we spill out the current in-memory collection so that everything is in files.
    spillToPartitionFiles(if (aggregator.isDefined) map else buffer)
    partitionWriters.foreach(_.commitAndClose())
    var out: FileOutputStream = null
    var in: FileInputStream = null
    try {
      out = new FileOutputStream(outputFile)
      for (i <- 0 until numPartitions) {
        in = new FileInputStream(partitionWriters(i).fileSegment().file)
        val size = org.apache.spark.util.Utils.copyStream(in, out, false)
        in.close()
        in = null
        lengths(i) = size
      }
    } finally {
      if (out != null) {
        out.close()
      }
      if (in != null) {
        in.close()
      }
    }
  } else {
    // Either we're not bypassing merge-sort or we have only in-memory data; get an iterator by
    // partition and just write everything directly.
    for ((id, elements) <- this.partitionedIterator) {
      if (elements.hasNext) {
        val writer = blockManager.getDiskWriter(
          blockId, outputFile, ser, fileBufferSize, context.taskMetrics.shuffleWriteMetrics.get)
        for (elem <- elements) {
          writer.write(elem)
        }
        writer.commitAndClose()
        val segment = writer.fileSegment()
        lengths(id) = segment.length
      }
    }
  }

  context.taskMetrics.memoryBytesSpilled += memoryBytesSpilled
  context.taskMetrics.diskBytesSpilled += diskBytesSpilled

  lengths
}
On the read side, IndexShuffleBlockManager is used to find where each partition lives inside the data file
override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // The block is actually going to be a range of a single map output file for this map, so
  // find out the consolidated file, then the offset within that from our index
  val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    in.skip(blockId.reduceId * 8)
    val offset = in.readLong()
    val nextOffset = in.readLong()
    new FileSegmentManagedBuffer(
      getDataFile(blockId.shuffleId, blockId.mapId),
      offset,
      nextOffset - offset)
  } finally {
    in.close()
  }
}
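To tie this back to the first experiment, the following spark-shell snippet dumps the offsets stored in one of the generated index files. It only assumes the plain sequence-of-longs layout that getBlockData reads above; replace the path with whatever find reported on your machine.

import java.io.{DataInputStream, File, FileInputStream}

// Path taken from the first experiment's listing; substitute your own.
val path = "/tmp/spark-local-20140919091822-aa66/0f/shuffle_0_1_0.index"
val numOffsets = (new File(path).length / 8).toInt   // the file is a sequence of 8-byte longs
val in = new DataInputStream(new FileInputStream(path))
try {
  // Prints the cumulative offsets, i.e. the partition boundaries inside shuffle_0_1_0.data
  for (_ <- 0 until numOffsets) println(in.readLong())
} finally {
  in.close()
}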