Spark Source Code Analysis: Shuffle Writer

Posted by weixin_34337265 on 2017-12-24

Abstract: Shuffle is the most time-consuming step in the MapReduce programming model. Spark splits the Shuffle process into two phases, Shuffle Write and Shuffle Read. In this article we take a detailed look at Spark's Shuffle Write implementation.

ShuffleWriter

The interface for Spark Shuffle Write is org.apache.spark.shuffle.ShuffleWriter.

Let's look at the interface definition:

private[spark] abstract class ShuffleWriter[K, V] {

  /** Write a sequence of records to this task's output */
  @throws[IOException]
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close this writer, passing along whether the map completed */
  def stop(success: Boolean): Option[MapStatus]
}

There are three implementing classes:

[Figure: the implementing classes of ShuffleWriter]

BypassMergeSortShuffleWriter

Assume the first stage (the map side) has m tasks and the second (reduce) stage has r partitions.

BypassMergeSortShuffleWriter works in three steps:

1. For each ShuffleMapTask (each ShuffleMapTask processes one map-side partition), create r temporary files.
2. Iterate over the records of the map partition, group them by getPartition(key), and write each record into the temporary file of its partitionId.
3. Merge the r files produced in step 2 and write each partitionId's index into the index file.

[Figure: BypassMergeSortShuffleWriter flow]

Key code walkthrough

public void write(Iterator<Product2<K, V>> records) throws IOException {
  ...
  // Create one DiskWriter per downstream (reduce-side) partition
  partitionWriters = new DiskBlockObjectWriter[numPartitions];
  partitionWriterSegments = new FileSegment[numPartitions];
  for (int i = 0; i < numPartitions; i++) {
    final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
      blockManager.diskBlockManager().createTempShuffleBlock();
    final File file = tempShuffleBlockIdPlusFile._2();
    final BlockId blockId = tempShuffleBlockIdPlusFile._1();
    partitionWriters[i] =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
  }

  // Use `getPartition(key)` to find the reduce-side partitionId of each record and
  // write the record into the temporary file for that partitionId
  while (records.hasNext()) {
    final Product2<K, V> record = records.next();
    final K key = record._1();
    partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  }

  for (int i = 0; i < numPartitions; i++) {
    final DiskBlockObjectWriter writer = partitionWriters[i];
    partitionWriterSegments[i] = writer.commitAndGet();
    writer.close();
  }

  File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
  File tmp = Utils.tempFileWith(output);
  try {
    // Merge the per-partition temporary files into the `shuffle_${shuffleId}_${mapId}_${reduceId}.data` file
    partitionLengths = writePartitionedFile(tmp);
    // Write the per-partition offsets into the `shuffle_${shuffleId}_${mapId}_${reduceId}.index` file
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
    }
  }
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}

1. The default Partitioner implementation is HashPartitioner (a simplified sketch of its getPartition logic follows below).
2. The default SerializerInstance implementation is JavaSerializerInstance.
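
To make the grouping by getPartition(key) concrete, here is a rough sketch, simplified from the Spark source rather than the full class, of how HashPartitioner maps a key to a reduce-side partition, and therefore to one of the r temporary files:

class SimpleHashPartitioner(val numPartitions: Int) {
  // Roughly what HashPartitioner.getPartition does: hash the key and
  // keep the result non-negative so it is a valid partition index.
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      val rawMod = key.hashCode % numPartitions
      rawMod + (if (rawMod < 0) numPartitions else 0)
  }
}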

FileSegment

Each intermediate temporary file written by BypassMergeSortShuffleWriter is described by a FileSegment:

class FileSegment(val file: File, val offset: Long, val length: Long)

file records the physical file and length records its size; the lengths are later used to write the index file when the FileSegments are merged.

Now let's look at writePartitionedFile, the method that merges the temporary files:

private long[] writePartitionedFile(File outputFile) throws IOException {
  final long[] lengths = new long[numPartitions];
  ...
  final FileChannel out = FileChannel.open(outputFile.toPath(), WRITE, APPEND, CREATE);
  try {
    for (int i = 0; i < numPartitions; i++) {
      final File file = partitionWriterSegments[i].file();
      if (file.exists()) {
        final FileChannel in = FileChannel.open(file.toPath(), READ);
        try {
          long size = in.size();
          // The key merge step: NIO's transferTo makes copying the file streams efficient
          Utils.copyFileStreamNIO(in, out, 0, size);
          lengths[i] = size;
        } finally {
          in.close();
        }
      }
    }
  } finally {
    out.close();
  }
  partitionWriters = null;
  // Return the size of each partition's data, used to write the index file
  return lengths;
}

The writeIndexFileAndCommit method writes the index file:

def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Int,
    lengths: Array[Long],
    dataTmp: File): Unit = {
  val indexFile = getIndexFile(shuffleId, mapId)
  val indexTmp = Utils.tempFileWith(indexFile)
  try {
    val out = new DataOutputStream(
      new BufferedOutputStream(Files.newOutputStream(indexTmp.toPath)))
    Utils.tryWithSafeFinally {
      var offset = 0L
      out.writeLong(offset)
      for (length <- lengths) {
        offset += length
        out.writeLong(offset)
      }
    }
  }
  ...
}

NOTE: 1. The temporary files are merged with Java NIO's transferTo to make the merge more efficient.
2. See the full BypassMergeSortShuffleWriter source for the complete code.

BypassMergeSortShuffleWriter Example

Let's use the following example to see how BypassMergeSortShuffleWriter works.

[Figure: BypassMergeSortShuffleWriter example]

1. In real workloads the data within a partition is usually unordered; the data in this example just happens to be ordered. Do not be misled into thinking that BypassMergeSortShuffleWriter sorts the data for us. A minimal driver that exercises this writer is sketched below.
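
The sketch below is a minimal driver (the object and app names are made up for illustration) that goes through BypassMergeSortShuffleWriter: groupByKey performs no map-side combine, and 3 output partitions is well below the default spark.shuffle.sort.bypassMergeThreshold of 200.

import org.apache.spark.{SparkConf, SparkContext}

object BypassExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[3]").setAppName("bypass example")
    val sc = new SparkContext(conf)
    // No map-side combine and only 3 reduce partitions, so the shuffle
    // write path is BypassMergeSortShuffleWriter.
    sc.parallelize(1 to 100, 1)
      .map(x => (x % 3, x))
      .groupByKey(3)
      .collect()
    sc.stop()
  }
}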

SortShuffleWriter

Prerequisites:

  • org.apache.spark.util.collection.AppendOnlyMap (a toy sketch of its changeValue contract follows this list)
  • org.apache.spark.util.collection.PartitionedPairBuffer
  • TimSort
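
As a warm-up for insertAll below, here is a toy stand-in (not Spark's implementation) that mimics the changeValue contract of AppendOnlyMap: the update function receives (hadValue, oldValue) and returns the value to store.

class ToyAppendOnlyMap[K, V] {
  private val underlying = scala.collection.mutable.HashMap.empty[K, V]

  // Mimics AppendOnlyMap.changeValue: look up the key, hand (hadValue, oldValue)
  // to the update function, and store whatever it returns.
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    val newValue = underlying.get(key) match {
      case Some(old) => updateFunc(true, old)
      case None      => updateFunc(false, null.asInstanceOf[V])
    }
    underlying(key) = newValue
    newValue
  }
}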

The SortShuffleWriter.write() implementation

Let's first look at the concrete implementation of write:

override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)

  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  val tmp = Utils.tempFileWith(output)
  try {
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
    val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
    shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
    }
  }
}

The SortShuffleWriter.write process breaks down into two steps: first insertAll, then merging the SpilledFiles that were spilled to disk.

ExternalSorter can be understood in four steps:

  • Depending on whether a combine is required, the in-memory buffer is either a PartitionedAppendOnlyMap or a PartitionedPairBuffer. In both structures the data is ordered first by partitionId and then, within each partition, by key.
  • When the buffered data reaches the memory limit or the record-count limit, it is spilled; each SpilledFile records how many records each partition holds.
  • When an iterator or a file is requested, all SpilledFiles are merged with the in-memory data that has not yet been spilled.
  • Finally, stop is called to delete the related temporary files.

The implementation of ExternalSorter.insertAll:

def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  val shouldCombine = aggregator.isDefined
  // Decide whether a map-side combine is needed, based on whether an aggregator is defined
  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    // Corresponds to the seqOp argument of rdd.aggregateByKey
    val mergeValue = aggregator.get.mergeValue
    // Corresponds to the zeroValue argument of rdd.aggregateByKey; zeroValue is used to create the combiner
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      addElementsRead()
      kv = records.next()
      map.changeValue((getPartition(kv._1), kv._1), update)
      maybeSpillCollection(usingMap = true)
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}

One thing to note is that the key written into the map/buffer is always the composite (partitionId, key), because the data in a single temporary file must be sorted first by partitionId and then by key; a simplified sketch of that ordering follows.
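
A simplified sketch of that ordering (not the exact Spark code, which lives in WritablePartitionedPairCollection): compare by partitionId first, then fall back to a key comparator within the partition.

import java.util.Comparator

object PartitionKeyOrdering {
  // Order composite (partitionId, key) entries by partition first, then by key,
  // so a spill file is laid out partition after partition with sorted keys inside.
  def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
    new Comparator[(Int, K)] {
      override def compare(a: (Int, K), b: (Int, K)): Int = {
        val partitionDiff = a._1 - b._1
        if (partitionDiff != 0) partitionDiff else keyComparator.compare(a._2, b._2)
      }
    }
}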

When data is spilled to disk

A spill is triggered when either of the following two conditions is met.

  • 1. Every 32 records, the estimated in-memory size of the current map/buffer is checked against myMemoryThreshold, i.e. currentMemory >= myMemoryThreshold. currentMemory is obtained by estimating the current size of the map/buffer.
  • 2. The number of records in the in-memory structure exceeds the force-spill threshold, i.e. _elementsRead > numElementsForceSpillThreshold. The force-spill threshold is controlled by setting spark.shuffle.spill.numElementsForceSpillThreshold in SparkConf.

private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    // Estimate the in-memory size of the map
    estimatedSize = map.estimateSize()
    if (maybeSpill(map, estimatedSize)) {
    // If the in-memory data was spilled to disk, reset the map
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    // Estimate the in-memory size of the buffer
    estimatedSize = buffer.estimateSize()
    if (maybeSpill(buffer, estimatedSize)) {
    // Same as for the map
      buffer = new PartitionedPairBuffer[K, C]
    }
  }

  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    myMemoryThreshold += granted
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    // Spill to disk
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}

The spill-to-disk process

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  // Sort the in-memory data with TimSort
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  // Write the in-memory data to disk
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  // Append to the spills array
  spills += spillFile
}

To summarize, insertAll writes records into the in-memory structure (PartitionedPairBuffer/PartitionedAppendOnlyMap) while continuously checking whether a spill condition has been met; each spill produces one temporary file on disk.

Reading a SpilledFile

A SpilledFile's data is sorted by (partitionId, recordKey), and the offsets of each partition are recorded, so fetching a given partition's data from a SpilledFile is straightforward.

SpilledFiles are read by the SpillReader class.
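
The following is only a conceptual sketch, not the actual SpillReader (which also handles serialization batches and read buffers), of why this is easy: because the file is laid out partition after partition and we know how many records each partition holds, readNextPartition can simply hand out the next block of records.

class ToySpillReader[T](records: Iterator[T], elementsPerPartition: Array[Long]) {
  private var nextPartition = 0

  // Return the records of the next partition; assumes the previous partition's
  // iterator has been fully consumed, which is how ExternalSorter.merge uses it.
  def readNextPartition(): Iterator[T] = {
    val count = elementsPerPartition(nextPartition).toInt
    nextPartition += 1
    Iterator.fill(count)(records.next())
  }
}

// usage: a "spill file" with two partitions holding 3 and 2 records
// val reader = new ToySpillReader(Iterator((0, "a"), (0, "b"), (0, "c"), (1, "d"), (1, "e")), Array(3L, 2L))
// reader.readNextPartition().toList  // List((0,a), (0,b), (0,c))
// reader.readNextPartition().toList  // List((1,d), (1,e))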

The merge process

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    if (aggregator.isDefined) {
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}

The merge process is relatively complex: it depends on whether the current shuffle has an aggregator and an ordering. We analyze each of these cases below.

no aggregator or sorter

partitionBy

case class TestIntKey(i: Int)
val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", 4.toString)
val sc = new SparkContext(conf)
val testData = (1 to 100).toList
sc.parallelize(testData, 1)
  .map(x => {
    (TestIntKey(x % 3), x)
  }).partitionBy(new HashPartitioner(3)).collect()

[Figure: partitionBy flow]

no aggregator but sorter

This case is easy to get confused about, because it is tempting to think that sortByKey is exactly the no-aggregator-with-sorter case. However, as we can see, when SortShuffleWriter initializes ExternalSorter it passes ordering = None. The code is as follows:

sorter = if (dep.mapSideCombine) {
  ...
} else {
  new ExternalSorter[K, V, V](
    context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}

NOTE: the ordering logic of sortByKey is deferred to the Shuffle Read phase, which we will cover in a later article.

Still, let's take a quick look at the mergeSort implementation. In our SpilledFiles, the data within each partition is already sorted by record key, so we only need to take each SpilledFile's iterator for that partition and merge them, always picking the smallest head element:

private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
    : Iterator[Product2[K, C]] =
{
  // NOTE: (fchen) Put all the iterators for this partition into a priority queue; each read compares the heads of the iterators and takes the smallest
  val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
  type Iter = BufferedIterator[Product2[K, C]]
  val heap = new mutable.PriorityQueue[Iter]()(new Ordering[Iter] {
    override def compare(x: Iter, y: Iter): Int = -comparator.compare(x.head._1, y.head._1)
  })
  heap.enqueue(bufferedIters: _*)  // Will contain only the iterators with hasNext = true
  new Iterator[Product2[K, C]] {
    override def hasNext: Boolean = !heap.isEmpty

    override def next(): Product2[K, C] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      val firstBuf = heap.dequeue()
      val firstPair = firstBuf.next()
      if (firstBuf.hasNext) {
        // Put the iterator back into the priority queue
        heap.enqueue(firstBuf)
      }
      firstPair
    }
  }
}

Let's walk through the whole mergeSort process with the following example:

[Figure: mergeSort]

From the figure we can clearly see that a partition's data scattered across multiple SpilledFiles becomes, after mergeSort, a single iterator sorted by record key.
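
To see the same idea outside Spark, here is a small self-contained run of the priority-queue merge (toy code, not Spark's): three already-sorted "spill" iterators for one partition are merged into a single sorted iterator.

import scala.collection.mutable

object MergeSortDemo {
  def mergeSort[T](iterators: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
    val buffered = iterators.filter(_.hasNext).map(_.buffered)
    // PriorityQueue is a max-heap, so reverse the ordering on the head element
    // to always dequeue the iterator with the smallest head.
    val heap = new mutable.PriorityQueue[BufferedIterator[T]]()(
      Ordering.by[BufferedIterator[T], T](_.head)(ord.reverse))
    heap.enqueue(buffered: _*)
    new Iterator[T] {
      override def hasNext: Boolean = heap.nonEmpty
      override def next(): T = {
        val first = heap.dequeue()
        val smallest = first.next()
        if (first.hasNext) heap.enqueue(first) // put the iterator back if it has more data
        smallest
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spills = Seq(Iterator(1, 4, 7), Iterator(2, 5, 8), Iterator(3, 6, 9))
    println(mergeSort(spills).toList) // List(1, 2, 3, 4, 5, 6, 7, 8, 9)
  }
}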

aggregator, but no sorter

reduceByKey

if (!totalOrder) {
  new Iterator[Iterator[Product2[K, C]]] {
    val sorted = mergeSort(iterators, comparator).buffered

    // Buffers reused across elements to decrease memory allocation
    val keys = new ArrayBuffer[K]
    val combiners = new ArrayBuffer[C]

    override def hasNext: Boolean = sorted.hasNext

    override def next(): Iterator[Product2[K, C]] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      keys.clear()
      combiners.clear()
      val firstPair = sorted.next()
      keys += firstPair._1
      combiners += firstPair._2
      val key = firstPair._1
      while (sorted.hasNext && comparator.compare(sorted.head._1, key) == 0) {
        val pair = sorted.next()
        var i = 0
        var foundKey = false
        while (i < keys.size && !foundKey) {
          if (keys(i) == pair._1) {
            combiners(i) = mergeCombiners(combiners(i), pair._2)
            foundKey = true
          }
          i += 1
        }
        if (!foundKey) {
          keys += pair._1
          combiners += pair._2
        }
      }

      keys.iterator.zip(combiners.iterator)
    }
  }.flatMap(i => i)
}

At this point we may wonder: why does the key storage need an ArrayBuffer? The comparator used here orders keys by their hash code, so two keys that compare as equal are not necessarily the same key. All colliding keys in one run therefore have to be buffered and matched by real equality, which is exactly what the keys and combiners buffers do. The example below uses "Aa" and "BB", two strings with the same hashCode, to demonstrate this.

reduceByKey Example:

val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", (4).toString)
val sc = new SparkContext(conf)
val testData = (1 to 10).toList
val keys = Array("Aa", "BB")
val count = sc.parallelize(testData, 1)
  .map(x => {
    (keys(x % 2), x)
  }).reduceByKey(_ + _, 3).collectPartitions().foreach(x => {
  x.foreach(y => {
    println(y._1 + "," + y._2)
  })
})

The figure below shows the whole mergeWithAggregation process for reduceByKey when hash collisions occur.

[Figure: reduceByKey with hash collisions]

aggregator and sorter

Although this code path exists, I could not find an operator that uses both an aggregator and a sorter, so we only skim over this logic here.

Merging SpilledFiles

After the per-partition merge, the data and index files are written exactly as in BypassMergeSortShuffleWriter, so we will not repeat the details here. For reference, this is the metadata kept for each spill:

private[this] case class SpilledFile(
  file: File,
  blockId: BlockId,
  serializerBatchSizes: Array[Long],
  elementsPerPartition: Array[Long])

SortShuffleWriter summary

Records are serialized twice: once when they are written to a SpilledFile, and again when the SpilledFiles are merged, since merging has to deserialize the spilled records and re-serialize them into the final output file.

UnsafeShuffleWriter

The two approaches above both perform the shuffle write on the JVM heap. Their drawback is obvious: with large objects, JVM garbage collection performs poorly. This led to an off-heap shuffle write, namely UnsafeShuffleWriter.

At a high level, UnsafeShuffleWriter is similar in design to SortShuffleWriter: the map-side data is sorted by the reduce-side partitionId, the in-memory records are spilled to disk once a limit is exceeded, and finally the spill files are merged into a single MapOutputFile together with the offset of each partition.

With the two on-heap shuffle write models above in mind, the off-heap model will be easier to follow.

Prerequisites

  • The memory paging model: records are addressed by a page number plus an offset within the page, managed by TaskMemoryManager.

Implementation details

Before going into the details of UnsafeShuffleWriter, let's cover some basics, starting with the PackedRecordPointer class.

final class PackedRecordPointer {
  ...
  public static long packPointer(long recordPointer, int partitionId) {
    final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
    final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
    return (((long) partitionId) << 40) | compressedAddress;
  }

  private long packedRecordPointer;

  public void set(long packedRecordPointer) {
    this.packedRecordPointer = packedRecordPointer;
  }

  public int getPartitionId() {
    return (int) ((packedRecordPointer & MASK_LONG_UPPER_24_BITS) >>> 40);
  }

  public long getRecordPointer() {
    final long pageNumber = (packedRecordPointer << 24) & MASK_LONG_UPPER_13_BITS;
    final long offsetInPage = packedRecordPointer & MASK_LONG_LOWER_27_BITS;
    return pageNumber | offsetInPage;
  }
}

PackedRecordPointer stores the partitionId, pageNumber, and offsetInPage in a single long. A long has 64 bits, and from the code we can see the layout is:

[ 24 bit partitionId ] [ 13 bit pageNumber] [ 27 bit offset in page]
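
As an illustration of that layout (using hand-rolled shifts and masks, not Spark's actual constants), packing and unpacking round-trips like this:

object PackedPointerDemo {
  // [24-bit partitionId][13-bit pageNumber][27-bit offsetInPage] in one 64-bit long
  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Int): Long =
    (partitionId.toLong << 40) | (pageNumber.toLong << 27) | offsetInPage.toLong

  def partitionId(packed: Long): Int = (packed >>> 40).toInt
  def pageNumber(packed: Long): Int = ((packed >>> 27) & ((1L << 13) - 1)).toInt
  def offsetInPage(packed: Long): Int = (packed & ((1L << 27) - 1)).toInt

  def main(args: Array[String]): Unit = {
    val p = pack(partitionId = 5, pageNumber = 3, offsetInPage = 64)
    println((partitionId(p), pageNumber(p), offsetInPage(p))) // (5,3,64)
  }
}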

The insertRecord method:

public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
  throws IOException {
  // If the number of records in memory reaches the force-spill threshold, spill
  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
    spill();
  }

  growPointerArrayIfNecessary();
  // Need 4 bytes to store the record length.
  final int required = length + 4;
  acquireNewPageIfNecessary(required);

  assert(currentPage != null);
  final Object base = currentPage.getBaseObject();
  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
  Platform.putInt(base, pageCursor, length);
  pageCursor += 4;
  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
  pageCursor += length;
  inMemSorter.insertRecord(recordAddress, partitionId);
}

The spill itself is just writing a file, i.e. calling writeSortedFile:

private void writeSortedFile(boolean isLastFile) {
  ...
  // Sort inMemSorter (the PackedRecordPointers) by partitionId
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
    inMemSorter.getSortedIterator();

  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempShuffleBlock();
  final File file = spilledFileInfo._2();
  final TempShuffleBlockId blockId = spilledFileInfo._1();
  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  final DiskBlockObjectWriter writer =
    blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);

  int currentPartition = -1;
  while (sortedRecords.hasNext()) {
    sortedRecords.loadNext();
    final int partition = sortedRecords.packedRecordPointer.getPartitionId();
    if (partition != currentPartition) {
      // Switch to the new partition
      if (currentPartition != -1) {
        final FileSegment fileSegment = writer.commitAndGet();
        spillInfo.partitionLengths[currentPartition] = fileSegment.length();
      }
      currentPartition = partition;
    }

    final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
    final Object recordPage = taskMemoryManager.getPage(recordPointer);
    final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
    int dataRemaining = Platform.getInt(recordPage, recordOffsetInPage);
    long recordReadPosition = recordOffsetInPage + 4; // skip over record length
    while (dataRemaining > 0) {
      final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
      Platform.copyMemory(
        recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
      writer.write(writeBuffer, 0, toTransfer);
      recordReadPosition += toTransfer;
      dataRemaining -= toTransfer;
    }
    writer.recordWritten();
  }

  final FileSegment committedSegment = writer.commitAndGet();
  writer.close();
  if (currentPartition != -1) {
    spillInfo.partitionLengths[currentPartition] = committedSegment.length();
    spills.add(spillInfo);
  }
}

The figure below shows how the data is organized in memory.

[Figure: ShuffleExternalSorter]

Since UnsafeShuffleWriter has no aggregate or sort operation, merging a given partition's data from multiple temporary files is simple, because the offset of each partition has been recorded.

private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
  ...
  if (spills.length == 0) {
    java.nio.file.Files.newOutputStream(outputFile.toPath()).close(); // Create an empty file
    return new long[partitioner.numPartitions()];
  } else if (spills.length == 1) {
    Files.move(spills[0].file, outputFile);
    return spills[0].partitionLengths;
  } else {
    final long[] partitionLengths;
    if (fastMergeEnabled && fastMergeIsSupported) {
      if (transferToEnabled && !encryptionEnabled) {
        logger.debug("Using transferTo-based fast merge");
        partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
      } else {
        logger.debug("Using fileStream-based fast merge");
        partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
      }
    }
  }
  ...
}

Comparison with SortShuffleWriter

  • Data is kept in off-heap memory, reducing GC overhead.
  • Merging the spill files does not require deserializing them.

Trigger conditions

Let's first look at how SortShuffleManager decides which ShuffleWriter to use:

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

Bypass trigger conditions

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    false
  } else {
    val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}

1. The number of reduce-side partitions is no greater than spark.shuffle.sort.bypassMergeThreshold (default 200).
2. There is no map-side combine.

UnsafeShuffleWriter trigger conditions

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
  } else if (dependency.aggregator.isDefined) {
    log.debug(
      s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
    false
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}

1. The Serializer supports relocation of serialized objects.
2. There is no map-side combine.
3. The number of reduce-side partitions is no greater than MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (2^24 = 16777216).

SortShuffleWriter trigger conditions

SortShuffleWriter is used whenever neither of the other two writers can be applied.

Key points

  • Why do the intermediate shuffle files need to be merged?

To reduce the number of file handles on the read side. Each map-side partition produces as many temporary files as there are reducers, so with a very large number of reducers the executor would have to maintain a huge number of file handles. The old HashShuffleWriter implementation had exactly this problem and required reading far too many files.

Notes

  • This article is based on the latest master code at the time of writing; Spark keeps evolving, so re-check the analysis against newer versions.
  • All source snippets keep only the code relevant to the discussion and omit most of the rest; that does not mean the omitted code is unimportant.

Summary

1. A ShuffleWriter always produces files on disk.
2. At a high level, the Shuffle Write process arranges the map-side data according to the reduce-side Partitioner, writes it all into a single data file, and records each partition's offset to prepare for the reduce-side read.

  • Future work

[SPARK-7271] Redesign shuffle interface for binary processing
