Spark Source Code Analysis: Shuffle Writer

Posted by weixin_34337265 on 2017-12-24

Abstract: Shuffle is the most time-consuming step in the MapReduce programming model. Spark splits the Shuffle process into two phases, Shuffle Write and Shuffle Read. In this article we take a detailed look at Spark's Shuffle Write implementation.

ShuffleWriter

The interface for Spark Shuffle Write is org.apache.spark.shuffle.ShuffleWriter.

Let's look at the interface definition:

private[spark] abstract class ShuffleWriter[K, V] {

  /** Write a sequence of records to this task's output */
  @throws[IOException]
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close this writer, passing along whether the map completed */
  def stop(success: Boolean): Option[MapStatus]
}

There are three implementing classes:

[Figure: the implementing classes of ShuffleWriter]

BypassMergeSortShuffleWriter

Assume the first stage (the map side) has m tasks and the second (reduce) stage has r partitions.

BypassMergeSortShuffleWriter works in three steps:

1. For each ShuffleMapTask (each ShuffleMapTask processes one map-side partition), create r temporary files.
2. Iterate over the records of the map partition, group them by getPartition(key), and write each record into the temporary file of its partitionId.
3. Merge the r files produced in step 2 and write each partitionId's index into the index file.

[Figure: BypassMergeSortShuffleWriter flow]

Key code walkthrough

public void write(Iterator<Product2<K, V>> records) throws IOException {
  ...
  // Create one DiskWriter per downstream (reduce-side) partition
  partitionWriters = new DiskBlockObjectWriter[numPartitions];
  partitionWriterSegments = new FileSegment[numPartitions];
  for (int i = 0; i < numPartitions; i++) {
    final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
      blockManager.diskBlockManager().createTempShuffleBlock();
    final File file = tempShuffleBlockIdPlusFile._2();
    final BlockId blockId = tempShuffleBlockIdPlusFile._1();
    partitionWriters[i] =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
  }

  // Use `getPartition(key)` to find the reduce-side partitionId of each record and
  // write the record into the temporary file for that partitionId
  while (records.hasNext()) {
    final Product2<K, V> record = records.next();
    final K key = record._1();
    partitionWriters[partitioner.getPartition(key)].write(key, record._2());
  }

  for (int i = 0; i < numPartitions; i++) {
    final DiskBlockObjectWriter writer = partitionWriters[i];
    partitionWriterSegments[i] = writer.commitAndGet();
    writer.close();
  }

  File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
  File tmp = Utils.tempFileWith(output);
  try {
    // Merge the per-partition temporary files into the `shuffle_${shuffleId}_${mapId}_${reduceId}.data` file
    partitionLengths = writePartitionedFile(tmp);
    // Write the per-partition offsets into the `shuffle_${shuffleId}_${mapId}_${reduceId}.index` file
    shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
    }
  }
  mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}

1. The default Partitioner implementation is HashPartitioner (a simplified sketch of its getPartition logic follows below).
2. The default SerializerInstance implementation is JavaSerializerInstance.
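
To make the grouping by getPartition(key) concrete, here is a rough sketch, simplified from the Spark source rather than the full class, of how HashPartitioner maps a key to a reduce-side partition, and therefore to one of the r temporary files:

class SimpleHashPartitioner(val numPartitions: Int) {
  // Roughly what HashPartitioner.getPartition does: hash the key and
  // keep the result non-negative so it is a valid partition index.
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      val rawMod = key.hashCode % numPartitions
      rawMod + (if (rawMod < 0) numPartitions else 0)
  }
}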

FileSegment

Each intermediate temporary file written by BypassMergeSortShuffleWriter is described by a FileSegment:

class FileSegment(val file: File, val offset: Long, val length: Long)

file records the physical file and length records its size; the lengths are later used to write the index file when the FileSegments are merged.

Now let's look at writePartitionedFile, the method that merges the temporary files:

private long[] writePartitionedFile(File outputFile) throws IOException {
  final long[] lengths = new long[numPartitions];
  ...
  final FileChannel out = FileChannel.open(outputFile.toPath(), WRITE, APPEND, CREATE);
  try {
    for (int i = 0; i < numPartitions; i++) {
      final File file = partitionWriterSegments[i].file();
      if (file.exists()) {
        final FileChannel in = FileChannel.open(file.toPath(), READ);
        try {
          long size = in.size();
          // The key merge step: NIO's transferTo makes copying the file streams efficient
          Utils.copyFileStreamNIO(in, out, 0, size);
          lengths[i] = size;
        } finally {
          in.close();
        }
      }
    }
  } finally {
    out.close();
  }
  partitionWriters = null;
  // Return the size of each partition's data, used to write the index file
  return lengths;
}

The writeIndexFileAndCommit method writes the index file:

def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Int,
    lengths: Array[Long],
    dataTmp: File): Unit = {
  val indexFile = getIndexFile(shuffleId, mapId)
  val indexTmp = Utils.tempFileWith(indexFile)
  try {
    val out = new DataOutputStream(
      new BufferedOutputStream(Files.newOutputStream(indexTmp.toPath)))
    Utils.tryWithSafeFinally {
      var offset = 0L
      out.writeLong(offset)
      for (length <- lengths) {
        offset += length
        out.writeLong(offset)
      }
    }
  }
  ...
}

NOTE: 1. The temporary files are merged with Java NIO's transferTo to make the merge more efficient.
2. See the full BypassMergeSortShuffleWriter source for the complete code.

BypassMergeSortShuffleWriter Example

Let's use the following example to see how BypassMergeSortShuffleWriter works.

[Figure: BypassMergeSortShuffleWriter example]

1. In real workloads the data within a partition is usually unordered; the data in this example just happens to be ordered. Do not be misled into thinking that BypassMergeSortShuffleWriter sorts the data for us. A minimal driver that exercises this writer is sketched below.
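
The sketch below is a minimal driver (the object and app names are made up for illustration) that goes through BypassMergeSortShuffleWriter: groupByKey performs no map-side combine, and 3 output partitions is well below the default spark.shuffle.sort.bypassMergeThreshold of 200.

import org.apache.spark.{SparkConf, SparkContext}

object BypassExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[3]").setAppName("bypass example")
    val sc = new SparkContext(conf)
    // No map-side combine and only 3 reduce partitions, so the shuffle
    // write path is BypassMergeSortShuffleWriter.
    sc.parallelize(1 to 100, 1)
      .map(x => (x % 3, x))
      .groupByKey(3)
      .collect()
    sc.stop()
  }
}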

SortShuffleWriter

Prerequisites:

  • org.apache.spark.util.collection.AppendOnlyMap (a toy sketch of its changeValue contract follows this list)
  • org.apache.spark.util.collection.PartitionedPairBuffer
  • TimSort
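
As a warm-up for insertAll below, here is a toy stand-in (not Spark's implementation) that mimics the changeValue contract of AppendOnlyMap: the update function receives (hadValue, oldValue) and returns the value to store.

class ToyAppendOnlyMap[K, V] {
  private val underlying = scala.collection.mutable.HashMap.empty[K, V]

  // Mimics AppendOnlyMap.changeValue: look up the key, hand (hadValue, oldValue)
  // to the update function, and store whatever it returns.
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    val newValue = underlying.get(key) match {
      case Some(old) => updateFunc(true, old)
      case None      => updateFunc(false, null.asInstanceOf[V])
    }
    underlying(key) = newValue
    newValue
  }
}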

The SortShuffleWriter.write() implementation

Let's first look at the concrete implementation of write:

override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)

  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  val tmp = Utils.tempFileWith(output)
  try {
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
    val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
    shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
    }
  }
}

The SortShuffleWriter.write process breaks down into two steps: first insertAll, then merging the SpilledFiles that were spilled to disk.

ExternalSorter can be understood in four steps:

  • Depending on whether a combine is required, the in-memory buffer is either a PartitionedAppendOnlyMap or a PartitionedPairBuffer. In both structures the data is ordered first by partitionId and then, within each partition, by key.
  • When the buffered data reaches the memory limit or the record-count limit, it is spilled; each SpilledFile records how many records each partition holds.
  • When an iterator or a file is requested, all SpilledFiles are merged with the in-memory data that has not yet been spilled.
  • Finally, stop is called to delete the related temporary files.

The implementation of ExternalSorter.insertAll:

def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  val shouldCombine = aggregator.isDefined
  // Decide whether a map-side combine is needed, based on whether an aggregator is defined
  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    // Corresponds to the seqOp argument of rdd.aggregateByKey
    val mergeValue = aggregator.get.mergeValue
    // Corresponds to the zeroValue argument of rdd.aggregateByKey; zeroValue is used to create the combiner
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      addElementsRead()
      kv = records.next()
      map.changeValue((getPartition(kv._1), kv._1), update)
      maybeSpillCollection(usingMap = true)
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}

One thing to note is that the key written into the map/buffer is always the composite (partitionId, key), because the data in a single temporary file must be sorted first by partitionId and then by key; a simplified sketch of that ordering follows.
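
A simplified sketch of that ordering (not the exact Spark code, which lives in WritablePartitionedPairCollection): compare by partitionId first, then fall back to a key comparator within the partition.

import java.util.Comparator

object PartitionKeyOrdering {
  // Order composite (partitionId, key) entries by partition first, then by key,
  // so a spill file is laid out partition after partition with sorted keys inside.
  def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
    new Comparator[(Int, K)] {
      override def compare(a: (Int, K), b: (Int, K)): Int = {
        val partitionDiff = a._1 - b._1
        if (partitionDiff != 0) partitionDiff else keyComparator.compare(a._2, b._2)
      }
    }
}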

When data is spilled to disk

A spill is triggered when either of the following two conditions is met.

  • 1. Every 32 records, the estimated in-memory size of the current map/buffer is checked against myMemoryThreshold, i.e. currentMemory >= myMemoryThreshold. currentMemory is obtained by estimating the current size of the map/buffer.
  • 2. The number of records in the in-memory structure exceeds the force-spill threshold, i.e. _elementsRead > numElementsForceSpillThreshold. The force-spill threshold is controlled by setting spark.shuffle.spill.numElementsForceSpillThreshold in SparkConf.

private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    // Estimate the in-memory size of the map
    estimatedSize = map.estimateSize()
    if (maybeSpill(map, estimatedSize)) {
    // If the in-memory data was spilled to disk, reset the map
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    // Estimate the in-memory size of the buffer
    estimatedSize = buffer.estimateSize()
    if (maybeSpill(buffer, estimatedSize)) {
    // Same as for the map
      buffer = new PartitionedPairBuffer[K, C]
    }
  }

  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    myMemoryThreshold += granted
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    // Spill to disk
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}

The spill-to-disk process

override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  // Sort the in-memory data with TimSort
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  // Write the in-memory data to disk
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  // Append to the spills array
  spills += spillFile
}

To summarize, insertAll writes records into the in-memory structure (PartitionedPairBuffer/PartitionedAppendOnlyMap) while continuously checking whether a spill condition has been met; each spill produces one temporary file on disk.

Reading a SpilledFile

A SpilledFile's data is sorted by (partitionId, recordKey), and the offsets of each partition are recorded, so fetching a given partition's data from a SpilledFile is straightforward.

SpilledFiles are read by the SpillReader class.
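
The following is only a conceptual sketch, not the actual SpillReader (which also handles serialization batches and read buffers), of why this is easy: because the file is laid out partition after partition and we know how many records each partition holds, readNextPartition can simply hand out the next block of records.

class ToySpillReader[T](records: Iterator[T], elementsPerPartition: Array[Long]) {
  private var nextPartition = 0

  // Return the records of the next partition; assumes the previous partition's
  // iterator has been fully consumed, which is how ExternalSorter.merge uses it.
  def readNextPartition(): Iterator[T] = {
    val count = elementsPerPartition(nextPartition).toInt
    nextPartition += 1
    Iterator.fill(count)(records.next())
  }
}

// usage: a "spill file" with two partitions holding 3 and 2 records
// val reader = new ToySpillReader(Iterator((0, "a"), (0, "b"), (0, "c"), (1, "d"), (1, "e")), Array(3L, 2L))
// reader.readNextPartition().toList  // List((0,a), (0,b), (0,c))
// reader.readNextPartition().toList  // List((1,d), (1,e))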

The merge process

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    if (aggregator.isDefined) {
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}

The merge process is relatively complex: it depends on whether the current shuffle has an aggregator and an ordering. We analyze each of these cases below.

no aggregator or sorter

partitionBy

case class TestIntKey(i: Int)
val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", 4.toString)
val sc = new SparkContext(conf)
val testData = (1 to 100).toList
sc.parallelize(testData, 1)
  .map(x => {
    (TestIntKey(x % 3), x)
  }).partitionBy(new HashPartitioner(3)).collect()

[Figure: partitionBy flow]

no aggregator but sorter

This case is easy to get confused about, because it is tempting to think that sortByKey is exactly the no-aggregator-with-sorter case. However, as we can see, when SortShuffleWriter initializes ExternalSorter it passes ordering = None. The code is as follows:

sorter = if (dep.mapSideCombine) {
  ...
} else {
  new ExternalSorter[K, V, V](
    context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}

NOTE: the ordering logic of sortByKey is deferred to the Shuffle Read phase, which we will cover in a later article.

Still, let's take a quick look at the mergeSort implementation. In our SpilledFiles, the data within each partition is already sorted by record key, so we only need to take each SpilledFile's iterator for that partition and merge them, always picking the smallest head element:

private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
    : Iterator[Product2[K, C]] =
{
  // NOTE: (fchen) Put all the iterators for this partition into a priority queue; each read compares the heads of the iterators and takes the smallest
  val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
  type Iter = BufferedIterator[Product2[K, C]]
  val heap = new mutable.PriorityQueue[Iter]()(new Ordering[Iter] {
    override def compare(x: Iter, y: Iter): Int = -comparator.compare(x.head._1, y.head._1)
  })
  heap.enqueue(bufferedIters: _*)  // Will contain only the iterators with hasNext = true
  new Iterator[Product2[K, C]] {
    override def hasNext: Boolean = !heap.isEmpty

    override def next(): Product2[K, C] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      val firstBuf = heap.dequeue()
      val firstPair = firstBuf.next()
      if (firstBuf.hasNext) {
        // Put the iterator back into the priority queue
        heap.enqueue(firstBuf)
      }
      firstPair
    }
  }
}

Let's walk through the whole mergeSort process with the following example:

[Figure: mergeSort]

From the figure we can clearly see that a partition's data scattered across multiple SpilledFiles becomes, after mergeSort, a single iterator sorted by record key.
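
To see the same idea outside Spark, here is a small self-contained run of the priority-queue merge (toy code, not Spark's): three already-sorted "spill" iterators for one partition are merged into a single sorted iterator.

import scala.collection.mutable

object MergeSortDemo {
  def mergeSort[T](iterators: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
    val buffered = iterators.filter(_.hasNext).map(_.buffered)
    // PriorityQueue is a max-heap, so reverse the ordering on the head element
    // to always dequeue the iterator with the smallest head.
    val heap = new mutable.PriorityQueue[BufferedIterator[T]]()(
      Ordering.by[BufferedIterator[T], T](_.head)(ord.reverse))
    heap.enqueue(buffered: _*)
    new Iterator[T] {
      override def hasNext: Boolean = heap.nonEmpty
      override def next(): T = {
        val first = heap.dequeue()
        val smallest = first.next()
        if (first.hasNext) heap.enqueue(first) // put the iterator back if it has more data
        smallest
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spills = Seq(Iterator(1, 4, 7), Iterator(2, 5, 8), Iterator(3, 6, 9))
    println(mergeSort(spills).toList) // List(1, 2, 3, 4, 5, 6, 7, 8, 9)
  }
}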

aggregator, but no sorter

reduceByKey

if (!totalOrder) {
  new Iterator[Iterator[Product2[K, C]]] {
    val sorted = mergeSort(iterators, comparator).buffered

    // Buffers reused across elements to decrease memory allocation
    val keys = new ArrayBuffer[K]
    val combiners = new ArrayBuffer[C]

    override def hasNext: Boolean = sorted.hasNext

    override def next(): Iterator[Product2[K, C]] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      keys.clear()
      combiners.clear()
      val firstPair = sorted.next()
      keys += firstPair._1
      combiners += firstPair._2
      val key = firstPair._1
      while (sorted.hasNext && comparator.compare(sorted.head._1, key) == 0) {
        val pair = sorted.next()
        var i = 0
        var foundKey = false
        while (i < keys.size && !foundKey) {
          if (keys(i) == pair._1) {
            combiners(i) = mergeCombiners(combiners(i), pair._2)
            foundKey = true
          }
          i += 1
        }
        if (!foundKey) {
          keys += pair._1
          combiners += pair._2
        }
      }

      keys.iterator.zip(combiners.iterator)
    }
  }.flatMap(i => i)
}

At this point we may wonder: why does the key storage need an ArrayBuffer? The comparator used here orders keys by their hash code, so two keys that compare as equal are not necessarily the same key. All colliding keys in one run therefore have to be buffered and matched by real equality, which is exactly what the keys and combiners buffers do. The example below uses "Aa" and "BB", two strings with the same hashCode, to demonstrate this.

reduceByKey Example:

val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", (4).toString)
val sc = new SparkContext(conf)
val testData = (1 to 10).toList
val keys = Array("Aa", "BB")
val count = sc.parallelize(testData, 1)
  .map(x => {
    (keys(x % 2), x)
  }).reduceByKey(_ + _, 3).collectPartitions().foreach(x => {
  x.foreach(y => {
    println(y._1 + "," + y._2)
  })
})

The figure below shows the whole mergeWithAggregation process for reduceByKey when hash collisions occur.

[Figure: reduceByKey with hash collisions]

aggregator and sorter

Although this code path exists, I could not find an operator that uses both an aggregator and a sorter, so we only skim over this logic here.

Merging SpilledFiles

After the per-partition merge, the data and index files are written exactly as in BypassMergeSortShuffleWriter, so we will not repeat the details here. For reference, this is the metadata kept for each spill:

private[this] case class SpilledFile(
  file: File,
  blockId: BlockId,
  serializerBatchSizes: Array[Long],
  elementsPerPartition: Array[Long])

SortShuffleWriter summary

Records are serialized twice: once when they are written to a SpilledFile, and again when the SpilledFiles are merged, since merging has to deserialize the spilled records and re-serialize them into the final output file.

UnsafeShuffleWriter

The two approaches above both perform the shuffle write on the JVM heap. Their drawback is obvious: with large objects, JVM garbage collection performs poorly. This led to an off-heap shuffle write, namely UnsafeShuffleWriter.

At a high level, UnsafeShuffleWriter is similar in design to SortShuffleWriter: the map-side data is sorted by the reduce-side partitionId, the in-memory records are spilled to disk once a limit is exceeded, and finally the spill files are merged into a single MapOutputFile together with the offset of each partition.

With the two on-heap shuffle write models above in mind, the off-heap model will be easier to follow.

Prerequisites

  • The memory paging model: records are addressed by a page number plus an offset within the page, managed by TaskMemoryManager.

Implementation details

Before going into the details of UnsafeShuffleWriter, let's cover some basics, starting with the PackedRecordPointer class.

final class PackedRecordPointer {
  ...
  public static long packPointer(long recordPointer, int partitionId) {
    final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
    final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
    return (((long) partitionId) << 40) | compressedAddress;
  }

  private long packedRecordPointer;

  public void set(long packedRecordPointer) {
    this.packedRecordPointer = packedRecordPointer;
  }

  public int getPartitionId() {
    return (int) ((packedRecordPointer & MASK_LONG_UPPER_24_BITS) >>> 40);
  }

  public long getRecordPointer() {
    final long pageNumber = (packedRecordPointer << 24) & MASK_LONG_UPPER_13_BITS;
    final long offsetInPage = packedRecordPointer & MASK_LONG_LOWER_27_BITS;
    return pageNumber | offsetInPage;
  }
}

PackedRecordPointer stores the partitionId, pageNumber, and offsetInPage in a single long. A long has 64 bits, and from the code we can see the layout is:

[ 24 bit partitionId ] [ 13 bit pageNumber] [ 27 bit offset in page]
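
As an illustration of that layout (using hand-rolled shifts and masks, not Spark's actual constants), packing and unpacking round-trips like this:

object PackedPointerDemo {
  // [24-bit partitionId][13-bit pageNumber][27-bit offsetInPage] in one 64-bit long
  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Int): Long =
    (partitionId.toLong << 40) | (pageNumber.toLong << 27) | offsetInPage.toLong

  def partitionId(packed: Long): Int = (packed >>> 40).toInt
  def pageNumber(packed: Long): Int = ((packed >>> 27) & ((1L << 13) - 1)).toInt
  def offsetInPage(packed: Long): Int = (packed & ((1L << 27) - 1)).toInt

  def main(args: Array[String]): Unit = {
    val p = pack(partitionId = 5, pageNumber = 3, offsetInPage = 64)
    println((partitionId(p), pageNumber(p), offsetInPage(p))) // (5,3,64)
  }
}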

The insertRecord method:

public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
  throws IOException {
  // If the number of records in memory reaches the force-spill threshold, spill
  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
    spill();
  }

  growPointerArrayIfNecessary();
  // Need 4 bytes to store the record length.
  final int required = length + 4;
  acquireNewPageIfNecessary(required);

  assert(currentPage != null);
  final Object base = currentPage.getBaseObject();
  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
  Platform.putInt(base, pageCursor, length);
  pageCursor += 4;
  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
  pageCursor += length;
  inMemSorter.insertRecord(recordAddress, partitionId);
}

The spill itself is just writing a file, i.e. calling writeSortedFile:

private void writeSortedFile(boolean isLastFile) {
  ...
  // Sort inMemSorter (the PackedRecordPointers) by partitionId
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
    inMemSorter.getSortedIterator();

  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempShuffleBlock();
  final File file = spilledFileInfo._2();
  final TempShuffleBlockId blockId = spilledFileInfo._1();
  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  final DiskBlockObjectWriter writer =
    blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);

  int currentPartition = -1;
  while (sortedRecords.hasNext()) {
    sortedRecords.loadNext();
    final int partition = sortedRecords.packedRecordPointer.getPartitionId();
    if (partition != currentPartition) {
      // Switch to the new partition
      if (currentPartition != -1) {
        final FileSegment fileSegment = writer.commitAndGet();
        spillInfo.partitionLengths[currentPartition] = fileSegment.length();
      }
      currentPartition = partition;
    }

    final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
    final Object recordPage = taskMemoryManager.getPage(recordPointer);
    final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
    int dataRemaining = Platform.getInt(recordPage, recordOffsetInPage);
    long recordReadPosition = recordOffsetInPage + 4; // skip over record length
    while (dataRemaining > 0) {
      final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
      Platform.copyMemory(
        recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
      writer.write(writeBuffer, 0, toTransfer);
      recordReadPosition += toTransfer;
      dataRemaining -= toTransfer;
    }
    writer.recordWritten();
  }

  final FileSegment committedSegment = writer.commitAndGet();
  writer.close();
  if (currentPartition != -1) {
    spillInfo.partitionLengths[currentPartition] = committedSegment.length();
    spills.add(spillInfo);
  }
}

The figure below shows how the data is organized in memory.

[Figure: ShuffleExternalSorter]

Since UnsafeShuffleWriter has no aggregate or sort operation, merging a given partition's data from multiple temporary files is simple, because the offset of each partition has been recorded.

private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
  ...
  if (spills.length == 0) {
    java.nio.file.Files.newOutputStream(outputFile.toPath()).close(); // Create an empty file
    return new long[partitioner.numPartitions()];
  } else if (spills.length == 1) {
    Files.move(spills[0].file, outputFile);
    return spills[0].partitionLengths;
  } else {
    final long[] partitionLengths;
    if (fastMergeEnabled && fastMergeIsSupported) {
      if (transferToEnabled && !encryptionEnabled) {
        logger.debug("Using transferTo-based fast merge");
        partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
      } else {
        logger.debug("Using fileStream-based fast merge");
        partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
      }
    }
  }
  ...
}

Comparison with SortShuffleWriter

  • Data is kept in off-heap memory, reducing GC overhead.
  • Merging the spill files does not require deserializing them.

Trigger conditions

Let's first look at how SortShuffleManager decides which ShuffleWriter to use:

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

Bypass trigger conditions

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    false
  } else {
    val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}

1. The number of reduce-side partitions is no greater than spark.shuffle.sort.bypassMergeThreshold (default 200).
2. There is no map-side combine.

UnsafeShuffleWriter trigger conditions

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
  } else if (dependency.aggregator.isDefined) {
    log.debug(
      s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
    false
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}

1. The Serializer supports relocation of serialized objects.
2. There is no map-side combine.
3. The number of reduce-side partitions is no greater than MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (2^24 = 16777216).

SortShuffleWriter trigger conditions

SortShuffleWriter is used whenever neither of the other two writers can be applied.

Key points

  • Why do the intermediate shuffle files need to be merged?

To reduce the number of file handles on the read side. Each map-side partition produces as many temporary files as there are reducers, so with a very large number of reducers the executor would have to maintain a huge number of file handles. The old HashShuffleWriter implementation had exactly this problem and required reading far too many files.

Notes

  • This article is based on the latest master code at the time of writing; Spark keeps evolving, so re-check the analysis against newer versions.
  • All source snippets keep only the code relevant to the discussion and omit most of the rest; that does not mean the omitted code is unimportant.

Summary

1. A ShuffleWriter always produces files on disk.
2. At a high level, the Shuffle Write process arranges the map-side data according to the reduce-side Partitioner, writes it all into a single data file, and records each partition's offset to prepare for the reduce-side read.

  • Future work

[SPARK-7271] Redesign shuffle interface for binary processing
