Apache Spark Source Code Walkthrough, Part 24 -- Design and Implementation of Sort-based Shuffle

Posted by 徽滬一郎 on 2014-09-19

You are welcome to repost this article; please credit the original source when doing so.

Overview

One of the major improvements to Spark Core in Spark 1.1 is the introduction of the sort-based shuffle mechanism. This article gives a preliminary analysis of how that mechanism is implemented.

A First Taste of Sort-based Shuffle

Let's run a small experiment to get an intuitive feel for which intermediate files the sort-based shuffle algorithm produces. The steps are as follows.

Step 1: Edit conf/spark-defaults.conf and add the following line

spark.shuffle.manager SORT
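
As an aside, the same switch can also be flipped programmatically when the SparkContext is constructed. This is only an alternative to editing the configuration file and is not used in the experiment below; the application name here is an arbitrary placeholder.

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent programmatic setting; "HASH" would select the old hash-based shuffle instead.
val conf = new SparkConf()
  .setAppName("sort-shuffle-demo")
  .set("spark.shuffle.manager", "SORT")
val sc = new SparkContext(conf)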

Step 2: Launch spark-shell

SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-shell

Step 3: Run a word count

sc.textFile("README.md").flatMap(l => l.split(" ")).map(w=>(w,1)).reduceByKey(_ + _).collect

Step 4: Inspect the intermediate files that were generated

find /tmp/spark-local* -type f

The find command returns the following files:

/tmp/spark-local-20140919091822-aa66/0f/shuffle_0_1_0.index
/tmp/spark-local-20140919091822-aa66/30/shuffle_0_0_0.index
/tmp/spark-local-20140919091822-aa66/0c/shuffle_0_0_0.data
/tmp/spark-local-20140919091822-aa66/15/shuffle_0_1_0.data

As you can see, files with two different suffixes were generated, data and index. The purpose of each will be explained in detail in the analysis that follows.

For comparison, let's switch the shuffle mode to hash and look at the generated files again to see the difference. Change SORT to HASH in the configuration file, restart spark-shell, and run the same word count; the files found under /tmp are listed below.

/tmp/spark-local-20140919092949-14cc/10/shuffle_0_1_3
/tmp/spark-local-20140919092949-14cc/0f/shuffle_0_1_2
/tmp/spark-local-20140919092949-14cc/0f/shuffle_0_0_3
/tmp/spark-local-20140919092949-14cc/0c/shuffle_0_0_0
/tmp/spark-local-20140919092949-14cc/0d/shuffle_0_1_0
/tmp/spark-local-20140919092949-14cc/0d/shuffle_0_0_1
/tmp/spark-local-20140919092949-14cc/0e/shuffle_0_1_1
/tmp/spark-local-20140919092949-14cc/0e/shuffle_0_0_2

The number of files generated differs dramatically between the two modes. The exact counts work out as follows (the same arithmetic is also written out as a small sketch after this list):

  1. In HASH mode, each shuffle generates M*R files. In the word count example above, the job contains a single shuffle; the input file is split into 2 partitions by default, so M is 2, and spark.default.parallelism is configured to 4, so R is 4. In total 1*2*4 = 8 files are generated. A name such as shuffle_0_1_2 decodes as shuffle + shuffle_id + map_id + reduce_id, so 0_1_2 means the file was produced by map task 1 of shuffle 0 and its contents will be consumed by reduce task 2.
  2. In SORT mode, each map task generates only one data file, no matter how many reducers will consume it, so the number of data files equals M. Since the word count input is split into 2 partitions by default, only two data files are generated (each accompanied by an index file).
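
For reference, here is the same arithmetic written out as a tiny Scala sketch; the variable names are mine, not Spark's.

// File-count arithmetic for the word count job above (illustrative only).
val m = 2                      // map tasks: README.md is split into 2 partitions by default
val r = 4                      // reduce tasks: spark.default.parallelism = 4
val hashShuffleFiles = m * r   // 8 files named shuffle_<shuffleId>_<mapId>_<reduceId>
val sortShuffleDataFiles = m   // 2 .data files, each accompanied by an .index file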

Multiple shuffles

The previous example contains only one shuffle. With a small change we can trigger two shuffles; the code is shown below.

sc.textFile("README.md").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _).map(p=>(p._2,p._1)).groupByKey.collect

The code above uses map to invert the result of reduceByKey, turning the original (w, count) pairs into (count, w), and then groups words by their occurrence count. The groupByKey step triggers a second data shuffle.
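
Broken out step by step, with the element types noted in comments, the same pipeline looks like this:

// The same pipeline split into named stages; the types are inferred by the compiler.
val words   = sc.textFile("README.md").flatMap(_.split(" "))     // RDD[String]
val counts  = words.map(w => (w, 1)).reduceByKey(_ + _)          // RDD[(String, Int)]           -- shuffle #1
val byCount = counts.map(p => (p._2, p._1)).groupByKey()         // RDD[(Int, Iterable[String])] -- shuffle #2
byCount.collect()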

The files produced in HASH mode are as follows:

/tmp/spark-local-20140919094531-1cb6/12/shuffle_0_3_3
/tmp/spark-local-20140919094531-1cb6/0c/shuffle_0_0_0
/tmp/spark-local-20140919094531-1cb6/11/shuffle_0_2_3
/tmp/spark-local-20140919094531-1cb6/11/shuffle_0_3_2
/tmp/spark-local-20140919094531-1cb6/11/shuffle_1_1_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_2_2
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_1_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_3_1
/tmp/spark-local-20140919094531-1cb6/10/shuffle_1_0_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_1_1_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_0_3
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_3_0
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_2_1
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_1_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_1_0_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_1_1_1
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_0_0_1
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_0_1_0
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_1_0_0
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_2_0
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_1_1
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_0_2
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_1_0_1
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_1_1_0

Introducing one additional shuffle produces a large number of extra intermediate files.

What happens with SORT? Only M additional data files are created. The new shuffle has 4 map tasks, so the total is 2 + 4 = 6 data files (each with a matching index file), as shown below.

/tmp/spark-local-20140919094731-034a/29/shuffle_0_3_0.data
/tmp/spark-local-20140919094731-034a/30/shuffle_0_0_0.index
/tmp/spark-local-20140919094731-034a/15/shuffle_0_1_0.data
/tmp/spark-local-20140919094731-034a/36/shuffle_0_2_0.data
/tmp/spark-local-20140919094731-034a/0c/shuffle_0_0_0.data
/tmp/spark-local-20140919094731-034a/32/shuffle_0_2_0.index
/tmp/spark-local-20140919094731-034a/32/shuffle_1_1_0.index
/tmp/spark-local-20140919094731-034a/0f/shuffle_0_1_0.index
/tmp/spark-local-20140919094731-034a/0f/shuffle_1_0_0.index
/tmp/spark-local-20140919094731-034a/0a/shuffle_1_1_0.data
/tmp/spark-local-20140919094731-034a/2b/shuffle_1_0_0.data
/tmp/spark-local-20140919094731-034a/0d/shuffle_0_3_0.index

It is worth pointing out the execution order of shuffle_0 and shuffle_1: the higher the number, the earlier it runs. Because a Spark job is resolved backwards from the final stage at submission time, shuffle 0 is executed last, while the higher-numbered shuffles run before it.

The Design Idea Behind Sort-based Shuffle

The guiding principle of sort-based shuffle is that each map task ultimately produces just one shuffle file. How, then, does a downstream reduce task locate its own partition inside that single file? This is where a new file type, the index file, comes in.

The concrete implementation steps are as follows:

  1. After reading its input partition, the map task writes its results into an ExternalSorter.
  2. The ExternalSorter stores the incoming results in a map, classified by partition. If a combine operation is defined, each new value is merged with the value already stored.
  3. If the memory occupied by the ExternalSorter's map exceeds the configured threshold, the map's contents are spilled to disk, with each spill producing a separate file.
  4. Once all data in the input partition has been processed, part of the result may still be in memory while the rest sits in one or more spill files. A merge operation then combines the in-memory data and the spill files into a single data file.
  5. Finally, the start and end position of every partition within the data file is written to the index file (a sketch of this layout follows the list).
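
To make step 5 concrete, here is a minimal sketch of that index layout. It is not the actual IndexShuffleBlockManager code, but it matches the format implied by the getBlockData method shown later: numPartitions + 1 cumulative offsets stored as 8-byte longs, so that partition i occupies the byte range [offset(i), offset(i+1)) of the data file.

import java.io.{DataOutputStream, File, FileOutputStream}

// Sketch only: write an index file as cumulative offsets into the single .data file.
// partitionLengths(i) is the number of bytes partition i occupies in the data file.
def writeIndexSketch(indexFile: File, partitionLengths: Array[Long]): Unit = {
  val out = new DataOutputStream(new FileOutputStream(indexFile))
  try {
    var offset = 0L
    out.writeLong(offset)              // partition 0 starts at byte 0
    for (length <- partitionLengths) {
      offset += length
      out.writeLong(offset)            // end of this partition == start of the next one
    }
  } finally {
    out.close()
  }
}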

The relevant source files:

  1. SortShuffleManager.scala
  2. SortShuffleWriter.scala
  3. ExternalSorter.scala
  4. IndexShuffleBlockManager.scala

A few important functions

SortShuffleWriter.write

  override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
    if (dep.mapSideCombine) {
      if (!dep.aggregator.isDefined) {
        throw new IllegalStateException("Aggregator is empty for map-side combine")
      }
      sorter = new ExternalSorter[K, V, C](
        dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
      sorter.insertAll(records)
    } else {
      // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
      // care whether the keys get sorted in each partition; that will be done on the reduce side
      // if the operation being run is sortByKey.
      sorter = new ExternalSorter[K, V, V](
        None, Some(dep.partitioner), None, dep.serializer)
      sorter.insertAll(records)
    }

    val outputFile = shuffleBlockManager.getDataFile(dep.shuffleId, mapId)
    val blockId = shuffleBlockManager.consolidateId(dep.shuffleId, mapId)
    val partitionLengths = sorter.writePartitionedFile(blockId, context, outputFile)
    shuffleBlockManager.writeIndexFile(dep.shuffleId, mapId, partitionLengths)

    mapStatus = new MapStatus(blockManager.blockManagerId,
      partitionLengths.map(MapOutputTracker.compressSize))
  }

ExternalSorter.insertAll

def insertAll(records: Iterator[_ <: Product2[K, V]]): Unit = {
    val shouldCombine = aggregator.isDefined

    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
      while (records.hasNext) {
        elementsRead += 1
        kv = records.next()
        map.changeValue((getPartition(kv._1), kv._1), update)
        maybeSpill(usingMap = true)
      }
    } else {
      // Stick values into our buffer
      while (records.hasNext) {
        elementsRead += 1
        val kv = records.next()
        buffer.insert((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
        maybeSpill(usingMap = false)
      }
    }
  }

writePartitionedFile merges the in-memory data and the contents of the spill files into a single file.

def writePartitionedFile(
      blockId: BlockId,
      context: TaskContext,
      outputFile: File): Array[Long] = {

    // Track location of each range in the output file
    val lengths = new Array[Long](numPartitions)

    if (bypassMergeSort && partitionWriters != null) {
      // We decided to write separate files for each partition, so just concatenate them. To keep
      // this simple we spill out the current in-memory collection so that everything is in files.
      spillToPartitionFiles(if (aggregator.isDefined) map else buffer)
      partitionWriters.foreach(_.commitAndClose())
      var out: FileOutputStream = null
      var in: FileInputStream = null
      try {
        out = new FileOutputStream(outputFile)
        for (i <- 0 until numPartitions) {
          in = new FileInputStream(partitionWriters(i).fileSegment().file)
          val size = org.apache.spark.util.Utils.copyStream(in, out, false)
          in.close()
          in = null
          lengths(i) = size
        }
      } finally {
        if (out != null) {
          out.close()
        }
        if (in != null) {
          in.close()
        }
      }
    } else {
      // Either we're not bypassing merge-sort or we have only in-memory data; get an iterator by
      // partition and just write everything directly.
      for ((id, elements) <- this.partitionedIterator) {
        if (elements.hasNext) {
          val writer = blockManager.getDiskWriter(
            blockId, outputFile, ser, fileBufferSize, context.taskMetrics.shuffleWriteMetrics.get)
          for (elem <- elements) {
            writer.write(elem)
          }
          writer.commitAndClose()
          val segment = writer.fileSegment()
          lengths(id) = segment.length
        }
      }
    }

    context.taskMetrics.memoryBytesSpilled += memoryBytesSpilled
    context.taskMetrics.diskBytesSpilled += diskBytesSpilled

    lengths
  }

On the read path, IndexShuffleBlockManager is used to locate the exact position of each partition.

  override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
    // The block is actually going to be a range of a single map output file for this map, so
    // find out the consolidated file, then the offset within that from our index
    val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)

    val in = new DataInputStream(new FileInputStream(indexFile))
    try {
      in.skip(blockId.reduceId * 8)
      val offset = in.readLong()
      val nextOffset = in.readLong()
      new FileSegmentManagedBuffer(
        getDataFile(blockId.shuffleId, blockId.mapId),
        offset,
        nextOffset - offset)
    } finally {
      in.close()
    }
  }
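
As a worked example tied back to the experiment above: the reducer responsible for partition 2 of map output 1 in shuffle 0 asks for ShuffleBlockId(0, 1, 2); getBlockData then skips 2 * 8 = 16 bytes of shuffle_0_1_0.index, reads two longs, and serves the byte range [offset, nextOffset) of shuffle_0_1_0.data. The small standalone helper below (my own, not part of Spark) performs the same lookup and can be pointed at the index files left in /tmp by the earlier experiment.

import java.io.{DataInputStream, File, FileInputStream}

// Return the [start, end) byte range of the given reduce partition inside the .data file.
def partitionRange(indexFile: File, reduceId: Int): (Long, Long) = {
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    in.skip(reduceId * 8L)          // each stored offset is an 8-byte long
    val offset = in.readLong()      // where this reduce partition starts
    val nextOffset = in.readLong()  // where the next partition starts
    (offset, nextOffset)
  } finally {
    in.close()
  }
}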

References

  1. 詳細探究Spark的Shuffle實現 (A detailed exploration of Spark's shuffle implementation)
  2. SPARK-2045: Sort-based shuffle implementation
