A Detailed Source-Code Walkthrough of the Spark Shuffle Mechanism

Posted by 馬晟 on 2020-11-12

The shuffle process consists of two stages, shuffle write and shuffle read. Hash shuffle was removed in version 2.0, leaving only sort shuffle. The code is analyzed below:

1.ShuffleManager

When Spark initializes SparkEnv, the create() method instantiates the ShuffleManager:

// Let the user specify short names for shuffle managers
    val shortShuffleMgrNames = Map(
      "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
      "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
    val shuffleMgrName = conf.get(config.SHUFFLE_MANAGER)
    val shuffleMgrClass =
      shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
    val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)

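Two short names are accepted here, sort and tungsten-sort, but both map to the same class, SortShuffleManager, which is instantiated through reflection. ShuffleManager is a trait whose core methods are: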

private[spark] trait ShuffleManager {

  /**
   * Registers a shuffle and returns a handle to it
   */
  def registerShuffle[K, V, C](
      shuffleId: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  /** Get a writer for a given partition. Called on executors by map tasks. */
  def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Long,
      context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]

  /**
   * Get a reader for a range of reduce partitions. Called on executors by reduce tasks.
   */
  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext,
      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]
	...
}

2.SortShuffleManager

SortShuffleManager is the only implementation of ShuffleManager. It implements the three methods above as follows:

2.1 registerShuffle

/**
   * Obtains a [[ShuffleHandle]] to pass to tasks.
   */
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    // 1. First check whether the bypass merge-sort path applies
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
      // need map-side aggregation, then write numPartitions files directly and just concatenate
      // them at the end. This avoids doing serialization and deserialization twice to merge
      // together the spilled files, which would happen with the normal code path. The downside is
      // having multiple files open at a time and thus more memory allocated to buffers.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
      // 2. Otherwise check whether serialized shuffle can be used
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
      new SerializedShuffleHandle[K, V](
        shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise, buffer map outputs in a deserialized form:
      new BaseShuffleHandle(shuffleId, dependency)
    }
  }

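1.First, check whether the bypass merge-sort path applies. Two conditions must hold: the shuffle dependency has no map-side aggregation, and the number of partitions does not exceed spark.shuffle.sort.bypassMergeThreshold (200 by default; note that the code compares with <=). If both hold, a BypassMergeSortShuffleHandle is returned and the bypass merge-sort shuffle mechanism is used.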

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  // We cannot bypass sorting if we need to do map-side aggregation.
  if (dep.mapSideCombine) {
    false
  } else {
    // Defaults to 200
    val bypassMergeThreshold: Int = conf.get(config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}

2.If those conditions are not met, canUseSerializedShuffle() is checked. If all three of its conditions hold, a SerializedShuffleHandle is returned and the tungsten-sort shuffle mechanism is used.

def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val shufId = dependency.shuffleId
  val numPartitions = dependency.partitioner.numPartitions
  // The serializer must support relocation of serialized objects
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
      s"${dependency.serializer.getClass.getName}, does not support object relocation")
    false
    // There must be no map-side aggregation
  } else if (dependency.mapSideCombine) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because we need to do " +
      s"map-side aggregation")
    false
    // The number of partitions must not exceed 16777215 + 1 (= 2^24)
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
      s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
    false
  } else {
    log.debug(s"Can use serialized shuffle for shuffle $shufId")
    true
  }
}

3.If neither of the above applies, a BaseShuffleHandle is returned and the basic sort shuffle mechanism is used.
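
The three-way decision above can be condensed into a small sketch. This is only an illustration of the order of the checks, not Spark code: the class HandleChoiceSketch, the enum HandleKind, and the choose() method are hypothetical names, and the serializer check is reduced to a boolean parameter.

```java
// Illustrative sketch only: mirrors the order of the checks in registerShuffle().
// HandleChoiceSketch, HandleKind and choose() are hypothetical names, not Spark APIs.
final class HandleChoiceSketch {
    enum HandleKind { BYPASS_MERGE_SORT, SERIALIZED, BASE }

    // Same value as MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (16777215 + 1).
    static final int MAX_SERIALIZED_PARTITIONS = 1 << 24;

    static HandleKind choose(boolean mapSideCombine,
                             int numPartitions,
                             boolean serializerSupportsRelocation,
                             int bypassMergeThreshold) {
        // 1. Bypass path: no map-side aggregation and few enough partitions.
        if (!mapSideCombine && numPartitions <= bypassMergeThreshold) {
            return HandleKind.BYPASS_MERGE_SORT;
        }
        // 2. Serialized (tungsten-sort) path: all three conditions must hold.
        if (serializerSupportsRelocation
                && !mapSideCombine
                && numPartitions <= MAX_SERIALIZED_PARTITIONS) {
            return HandleKind.SERIALIZED;
        }
        // 3. Fallback: deserialized sort shuffle.
        return HandleKind.BASE;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, 100, false, 200));
        System.out.println(choose(false, 1000, true, 200));
        System.out.println(choose(true, 100, true, 200));
    }
}
```

Because the bypass check comes first, a shuffle with few partitions and no map-side aggregation takes the bypass path even when its serializer would also qualify for serialized shuffle.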

2.2 getReader

/**
 * Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive).
 * Called on executors by reduce tasks.
 */
override def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext,
    metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] = {
  val blocksByAddress = SparkEnv.get.mapOutputTracker.getMapSizesByExecutorId(
    handle.shuffleId, startPartition, endPartition)
  new BlockStoreShuffleReader(
    handle.asInstanceOf[BaseShuffleHandle[K, _, C]], blocksByAddress, context, metrics,
    shouldBatchFetch = canUseBatchFetch(startPartition, endPartition, context))
}

This returns a BlockStoreShuffleReader.

2.3 getWriter

/** Get a writer for a given partition. Called on executors by map tasks. */
override def getWriter[K, V](
    handle: ShuffleHandle,
    mapId: Long,
    context: TaskContext,
    metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] = {
  val mapTaskIds = taskIdMapsForShuffle.computeIfAbsent(
    handle.shuffleId, _ => new OpenHashSet[Long](16))
  mapTaskIds.synchronized { mapTaskIds.add(context.taskAttemptId()) }
  val env = SparkEnv.get
  // Pick a ShuffleWriter according to the handle type
  handle match {
    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
      new UnsafeShuffleWriter(
        env.blockManager,
        context.taskMemoryManager(),
        unsafeShuffleHandle,
        mapId,
        context,
        env.conf,
        metrics,
        shuffleExecutorComponents)
    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
      new BypassMergeSortShuffleWriter(
        env.blockManager,
        bypassMergeSortHandle,
        mapId,
        env.conf,
        metrics,
        shuffleExecutorComponents)
    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
      new SortShuffleWriter(
        shuffleBlockResolver, other, mapId, context, shuffleExecutorComponents)
  }
}

The writer is chosen by handle type: a SerializedShuffleHandle gets an UnsafeShuffleWriter, a BypassMergeSortShuffleHandle gets a BypassMergeSortShuffleWriter, and anything else gets a SortShuffleWriter.

3.Implementations of the three Writers

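As described above, BypassMergeSortShuffleWriter is used when the bypass conditions are met. Otherwise, if the serializer supports relocation, there is no map-side aggregation, and the number of partitions does not exceed 16777215 + 1, UnsafeShuffleWriter is used. In all other cases SortShuffleWriter is used.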

3.1 BypassMergeSortShuffleWriter

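BypassMergeSortShuffleWriter extends ShuffleWriter and is implemented in Java. It writes one temporary file per partition, then merges these map-side files into a single output file and generates an index file that records where each partition starts. Its write() method is: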

@Override
public void write(Iterator<Product2<K, V>> records) throws IOException {
  assert (partitionWriters == null);
  // Create a ShuffleMapOutputWriter
  ShuffleMapOutputWriter mapOutputWriter = shuffleExecutorComponents
  .createMapOutputWriter(shuffleId, mapId, numPartitions);
  try {
    // No records at all
    if (!records.hasNext()) {
      // Commit and get the written length of every partition
      partitionLengths = mapOutputWriter.commitAllPartitions();
      // Update mapStatus
      mapStatus = MapStatus$.MODULE$.apply(
        blockManager.shuffleServerId(), partitionLengths, mapId);
      return;
    }
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    // Create one DiskBlockObjectWriter and one FileSegment slot per partition
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];
    // For each partition
    for (int i = 0; i < numPartitions; i++) {
      // Create a temporary shuffle block
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
        blockManager.diskBlockManager().createTempShuffleBlock();
      // Get the temp block's file and id
      final File file = tempShuffleBlockIdPlusFile._2();
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();
      // Create a DiskBlockObjectWriter for this partition
      partitionWriters[i] =
      blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    // i.e. the file-opening time measured from openStartTime is added to the write metrics
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

    // Write the records
    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();
      // Route each record by key to the file of its partition
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }

    for (int i = 0; i < numPartitions; i++) {
      try (DiskBlockObjectWriter writer = partitionWriters[i]) {
        // Commit
        partitionWriterSegments[i] = writer.commitAndGet();
      }
    }

    // Merge all partition files into a single file
    partitionLengths = writePartitionedData(mapOutputWriter);
    // Update mapStatus
    mapStatus = MapStatus$.MODULE$.apply(
      blockManager.shuffleServerId(), partitionLengths, mapId);
  } catch (Exception e) {
    try {
      mapOutputWriter.abort(e);
    } catch (Exception e2) {
      logger.error("Failed to abort the writer after failing to write map output.", e2);
      e.addSuppressed(e2);
    }
    throw e;
  }
}

The merge method writePartitionedData() is shown below. By default it merges the files with zero-copy transfer (transferToEnabled):

private long[] writePartitionedData(ShuffleMapOutputWriter mapOutputWriter) throws IOException {
  // Track location of the partition starts in the output file
  if (partitionWriters != null) {
    // Start time
    final long writeStartTime = System.nanoTime();
    try {
      for (int i = 0; i < numPartitions; i++) {
        // Get the temporary file of each partition
        final File file = partitionWriterSegments[i].file();
        ShufflePartitionWriter writer = mapOutputWriter.getPartitionWriter(i);
        if (file.exists()) {
          // Prefer zero-copy transfer
          if (transferToEnabled) {
            // Using WritableByteChannelWrapper to make resource closing consistent between
            // this implementation and UnsafeShuffleWriter.
            Optional<WritableByteChannelWrapper> maybeOutputChannel = writer.openChannelWrapper();
            // This path calls Utils.copyFileStreamNIO, which ends up in FileChannel.transferTo
            if (maybeOutputChannel.isPresent()) {
              writePartitionedDataWithChannel(file, maybeOutputChannel.get());
            } else {
              writePartitionedDataWithStream(file, writer);
            }
          } else {
            // Otherwise copy through streams
            writePartitionedDataWithStream(file, writer);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
    } finally {
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    partitionWriters = null;
  }
  return mapOutputWriter.commitAllPartitions();
}
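
The zero-copy merge described above can be illustrated with plain java.nio. The sketch below is not Spark's implementation: ConcatSketch, concat() and demo() are hypothetical names. It only shows the core idea of appending each per-partition file to one output with FileChannel.transferTo while recording each partition's length.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; class and method names are hypothetical, not Spark APIs.
final class ConcatSketch {

    /** Appends each per-partition file to `output`, deletes it, and returns the byte lengths. */
    static long[] concat(List<Path> partitionFiles, Path output) throws IOException {
        long[] lengths = new long[partitionFiles.size()];
        try (FileChannel out = FileChannel.open(output,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < partitionFiles.size(); i++) {
                Path partFile = partitionFiles.get(i);
                try (FileChannel in = FileChannel.open(partFile, StandardOpenOption.READ)) {
                    long size = in.size();
                    long pos = 0;
                    // transferTo may move fewer bytes than requested, so loop until done.
                    while (pos < size) {
                        pos += in.transferTo(pos, size - pos, out);
                    }
                    lengths[i] = size;
                }
                Files.delete(partFile);
            }
        }
        return lengths;
    }

    /** Writes two small temp files, merges them, and returns the partition lengths. */
    static long[] demo() {
        try {
            Path p0 = Files.createTempFile("part0", ".bin");
            Path p1 = Files.createTempFile("part1", ".bin");
            Path merged = Files.createTempFile("merged", ".bin");
            Files.write(p0, new byte[] {1, 2, 3});
            Files.write(p1, new byte[] {4, 5});
            long[] lengths = concat(Arrays.asList(p0, p1), merged);
            Files.delete(merged);
            return lengths;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

The returned lengths play the role of partitionLengths: together with the single data file they are enough to locate where each partition starts.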

3.2 UnsafeShuffleWriter

UnsafeShuffleWriter also extends ShuffleWriter and is implemented in Java. Its write() method is:

@Override
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
  // Keep track of success so we know if we encountered an exception
  // We do this rather than a standard try/catch/re-throw to handle
  // generic throwables.
  // Tracks success so that a cleanup error does not mask a write failure
  boolean success = false;
  try {
    while (records.hasNext()) {
      // Insert the record into ShuffleExternalSorter for external sorting
      insertRecordIntoSorter(records.next());
    }
    // Merge the spills and write the output file
    closeAndWriteOutput();
    success = true;
  } finally {
    if (sorter != null) {
      try {
        sorter.cleanupResources();
      } catch (Exception e) {
        // Only throw this error if we won't be masking another
        // error.
        if (success) {
          throw e;
        } else {
          logger.error("In addition to a failure during writing, we failed during " +
                       "cleanup.", e);
        }
      }
    }
  }
}

Two methods do the main work here:

3.2.1 insertRecordIntoSorter()

@VisibleForTesting
void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
  assert(sorter != null);
  // Get the key and its partition
  final K key = record._1();
  final int partitionId = partitioner.getPartition(key);
  // Reset the buffer
  serBuffer.reset();
  // Serialize the key and value into the buffer
  serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
  serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
  serOutputStream.flush();

  // Size of the serialized record
  final int serializedRecordSize = serBuffer.size();
  assert (serializedRecordSize > 0);

  // Hand the serialized bytes to ShuffleExternalSorter
  sorter.insertRecord(
    serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}

This method serializes the record and passes the serialized bytes to the external sorter through insertRecord(), shown below:

public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
  throws IOException {

  // for tests
  assert(inMemSorter != null);
  // If the record count has reached the spill threshold, spill to disk right away
  if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
    logger.info("Spilling data because number of spilledRecords crossed the threshold " +
      numElementsForSpillThreshold);
    spill();
  }

  // Checks whether there is enough space to insert an additional record in to the sort pointer
  // array and grows the array if additional space is required. If the required space cannot be
  // obtained, then the in-memory data will be spilled to disk.
  // (This call may therefore spill; see growPointerArrayIfNecessary() below.)
  growPointerArrayIfNecessary();
  final int uaoSize = UnsafeAlignedOffset.getUaoSize();
  // Need 4 or 8 bytes to store the record length.
  // So each record occupies length + uaoSize bytes in the page
  final int required = length + uaoSize;
  // If more memory is needed, request a new page from TaskMemoryManager
  acquireNewPageIfNecessary(required);

  assert(currentPage != null);
  final Object base = currentPage.getBaseObject();
  //Given a memory page and offset within that page, encode this address into a 64-bit long.
  //This address will remain valid as long as the corresponding page has not been freed.
  // Encode the record's logical address (page number and offset) into a single long
  final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
  // Write the length
  UnsafeAlignedOffset.putSize(base, pageCursor, length);
  // Advance the cursor
  pageCursor += uaoSize;
  // Write the record bytes
  Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
  // Advance the cursor
  pageCursor += length;
  // Pass the encoded address and the partition id to ShuffleInMemorySorter for sorting
  inMemSorter.insertRecord(recordAddress, partitionId);
}

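Here the data is cached and spilled without relying on any higher-level data structure; the memory is manipulated directly. Each record is copied into a memory page as a length prefix followed by its serialized bytes, and only an encoded pointer is handed to the in-memory sorter.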
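
The length-prefixed layout can be sketched with a heap ByteBuffer standing in for a memory page. This is a simplified illustration under stated assumptions, not Spark code: PageSketch and its methods are hypothetical, the length prefix is fixed at 4 bytes (Spark uses uaoSize, which is 4 or 8), and plain int offsets replace the encoded 64-bit addresses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; a heap ByteBuffer stands in for a Tungsten memory page.
final class PageSketch {
    private final ByteBuffer page;
    private final int[] offsets;      // stands in for the sort pointer array
    private int numRecords = 0;

    PageSketch(int pageSizeBytes, int maxRecords) {
        this.page = ByteBuffer.allocate(pageSizeBytes);
        this.offsets = new int[maxRecords];
    }

    /** Returns false when the record does not fit; the real sorter would spill or grow instead. */
    boolean insert(byte[] serialized) {
        int required = Integer.BYTES + serialized.length;   // length prefix + payload
        if (numRecords == offsets.length || page.remaining() < required) {
            return false;
        }
        offsets[numRecords++] = page.position();  // remember where the record starts
        page.putInt(serialized.length);           // write the length, cursor advances by 4
        page.put(serialized);                     // write the bytes, cursor advances by length
        return true;
    }

    /** Reads record `index` back by following its stored offset and length prefix. */
    byte[] read(int index) {
        int offset = offsets[index];
        int length = page.getInt(offset);
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            out[i] = page.get(offset + Integer.BYTES + i);
        }
        return out;
    }

    public static void main(String[] args) {
        PageSketch sketch = new PageSketch(64, 4);
        sketch.insert("spark".getBytes(StandardCharsets.UTF_8));
        sketch.insert("shuffle".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(sketch.read(1), StandardCharsets.UTF_8));
    }
}
```

Sorting then only has to move the small pointers around; the record bytes stay where they were written.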

growPointerArrayIfNecessary() is as follows:

/**
 * Checks whether there is enough space to insert an additional record in to the sort pointer
 * array and grows the array if additional space is required. If the required space cannot be
 * obtained, then the in-memory data will be spilled to disk.
 */
private void growPointerArrayIfNecessary() throws IOException {
  assert(inMemSorter != null);
  // No room for another record
  if (!inMemSorter.hasSpaceForAnotherRecord()) {
    // Current memory usage of the pointer array
    long used = inMemSorter.getMemoryUsage();
    LongArray array;
    try {
      // could trigger spilling
      // Allocate an array with twice the current capacity
      array = allocateArray(used / 8 * 2);
    } catch (TooLargePageException e) {
      // The pointer array is too big to fix in a single page, spill.
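      // TaskMemoryManager throws TooLargePageException when the requested array exceeds its
      // maximum page size; in that case spill directly (spill() is shown further below).
      // Note: the 128 MB limit in PackedRecordPointer applies to the data pages of this sorter:
      // static final int MAXIMUM_PAGE_SIZE_BYTES = 1 << 27;  // 128 megabytes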
      spill();
      return;
    } catch (SparkOutOfMemoryError e) {
      // should have trigger spilling
      if (!inMemSorter.hasSpaceForAnotherRecord()) {
        logger.error("Unable to grow the pointer array");
        throw e;
      }
      return;
    }
    // check if spilling is triggered or not
    if (inMemSorter.hasSpaceForAnotherRecord()) {
      // Spilling already freed space, so the new array is not needed: release it
      freeArray(array);
    } else {
      // Otherwise copy the old array into the new one
      inMemSorter.expandPointerArray(array);
    }
  }
}

The spill() method is as follows:

@Override
public long spill(long size, MemoryConsumer trigger) throws IOException {
  if (trigger != this || inMemSorter == null || inMemSorter.numRecords() == 0) {
    return 0L;
  }

  logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)",
    Thread.currentThread().getId(),
    Utils.bytesToString(getMemoryUsage()),
    spills.size(),
    spills.size() > 1 ? " times" : " time");

  // Sorts the in-memory records and writes the sorted records to an on-disk file.
  // This method does not free the sort data structures.
  // writeSortedFile(false) writes a spill file, not the final output file
  writeSortedFile(false);
  final long spillSize = freeMemory();
  // Reset ShuffleInMemorySorter
  inMemSorter.reset();
  // Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the
  // records. Otherwise, if the task is over allocated memory, then without freeing the memory
  // pages, we might not be able to get memory for the pointer array.
  taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
  return spillSize;
}

The writeSortedFile() method:

private void writeSortedFile(boolean isLastFile) {

  // This call performs the actual sort.
  // Returns an iterator over the sorted records
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
    inMemSorter.getSortedIterator();

  // If there are no sorted records, so we don't need to create an empty spill file.
  if (!sortedRecords.hasNext()) {
    return;
  }

  final ShuffleWriteMetricsReporter writeMetricsToUse;

  // true: this is the final output file; false: this is a spill file
  if (isLastFile) {
    // We're writing the final non-spill file, so we _do_ want to count this as shuffle bytes.
    writeMetricsToUse = writeMetrics;
  } else {
    // We're spilling, so bytes written should be counted towards spill rather than write.
    // Create a dummy WriteMetrics object to absorb these metrics, since we don't want to count
    // them towards shuffle bytes written.
    writeMetricsToUse = new ShuffleWriteMetrics();
  }

  // Small writes to DiskBlockObjectWriter will be fairly inefficient. Since there doesn't seem to
  // be an API to directly transfer bytes from managed memory to the disk writer, we buffer
  // data through a byte array. This array does not need to be large enough to hold a single
  // record;
  // Create a byte buffer array (1 MB by default)
  final byte[] writeBuffer = new byte[diskWriteBufferSize];

  // Because this output will be read during shuffle, its compression codec must be controlled by
  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
  // createTempShuffleBlock here; see SPARK-3426 for more details.
  // Create a temporary shuffle block
  final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempShuffleBlock();
  // Get its file and id
  final File file = spilledFileInfo._2();
  final TempShuffleBlockId blockId = spilledFileInfo._1();
  final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
  // Our write path doesn't actually use this serializer (since we end up calling the `write()`
  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
  // around this, we pass a dummy no-op serializer.
  // A no-op serializer, needed only to construct the DiskBlockObjectWriter
  final SerializerInstance ser = DummySerializerInstance.INSTANCE;

  int currentPartition = -1;
  final FileSegment committedSegment;
  try (DiskBlockObjectWriter writer =
      blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse)) {

    final int uaoSize = UnsafeAlignedOffset.getUaoSize();
    // Iterate over the sorted records
    while (sortedRecords.hasNext()) {
      sortedRecords.loadNext();
      final int partition = sortedRecords.packedRecordPointer.getPartitionId();
      assert (partition >= currentPartition);
      if (partition != currentPartition) {
        // Switch to the new partition
        // On switching to a new partition, commit the current one and record its length
        if (currentPartition != -1) {
          final FileSegment fileSegment = writer.commitAndGet();
          spillInfo.partitionLengths[currentPartition] = fileSegment.length();
        }
        // Then move on to the next partition
        currentPartition = partition;
      }

      // Get the record pointer, then resolve its page and offset
      final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
      final Object recordPage = taskMemoryManager.getPage(recordPointer);
      final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
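      // Read the record length, i.e. the number of bytes still to be written
      int dataRemaining = UnsafeAlignedOffset.getSize(recordPage, recordOffsetInPage);
      // Skip the length prefix stored in front of the data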
      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
        // Copy the data into the buffer array
        Platform.copyMemory(
          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
        // Write from the buffer array to the DiskBlockObjectWriter
        writer.write(writeBuffer, 0, toTransfer);
        // Advance the read position
        recordReadPosition += toTransfer;
        // Update the remaining byte count
        dataRemaining -= toTransfer;
      }
      writer.recordWritten();
    }

    // Commit
    committedSegment = writer.commitAndGet();
  }
  // If `writeSortedFile()` was called from `closeAndGetSpills()` and no records were inserted,
  // then the file might be empty. Note that it might be better to avoid calling
  // writeSortedFile() in that case.
  // Add this file to the list of spills
  if (currentPartition != -1) {
    spillInfo.partitionLengths[currentPartition] = committedSegment.length();
    spills.add(spillInfo);
  }

  // For a spill file, update the spill metrics
  if (!isLastFile) {  
    writeMetrics.incRecordsWritten(
      ((ShuffleWriteMetrics)writeMetricsToUse).recordsWritten());
    taskContext.taskMetrics().incDiskBytesSpilled(
      ((ShuffleWriteMetrics)writeMetricsToUse).bytesWritten());
  }
}

The encodePageNumberAndOffset() method is as follows:

public long encodePageNumberAndOffset(MemoryBlock page, long offsetInPage) {
  // Off-heap mode: convert the absolute address into an offset relative to the page base
  if (tungstenMemoryMode == MemoryMode.OFF_HEAP) {
    // In off-heap mode, an offset is an absolute address that may require a full 64 bits to
    // encode. Due to our page size limitation, though, we can convert this into an offset that's
    // relative to the page's base offset; this relative offset will fit in 51 bits.
    offsetInPage -= page.getBaseOffset();
  }
  return encodePageNumberAndOffset(page.pageNumber, offsetInPage);
}

@VisibleForTesting
public static long encodePageNumberAndOffset(int pageNumber, long offsetInPage) {
  assert (pageNumber >= 0) : "encodePageNumberAndOffset called with invalid page";
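  // The upper 13 bits hold the page number and the lower 51 bits hold the offset.
  // Shift the page number left by 51 bits, then OR it with the offset masked by
  // 0x7FFFFFFFFFFFFL (the lower 51 bits all set)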
  return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & MASK_LONG_LOWER_51_BITS);
}
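
To make the 13 + 51 bit layout concrete, here is a small round-trip sketch. The class PageAddressSketch and its decode helpers are hypothetical names written for illustration; only the shift and mask arithmetic follows the method above.

```java
// Illustrative sketch only; PageAddressSketch and its helpers are hypothetical names.
final class PageAddressSketch {
    static final int OFFSET_BITS = 51;
    // Lower 51 bits set, equal to 0x7FFFFFFFFFFFFL.
    static final long MASK_LOWER_51_BITS = (1L << OFFSET_BITS) - 1;

    /** Page number goes into the upper 13 bits, the offset into the lower 51 bits. */
    static long encode(int pageNumber, long offsetInPage) {
        return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & MASK_LOWER_51_BITS);
    }

    /** Unsigned shift, so page numbers that set the top bit still decode correctly. */
    static int decodePageNumber(long address) {
        return (int) (address >>> OFFSET_BITS);
    }

    static long decodeOffset(long address) {
        return address & MASK_LOWER_51_BITS;
    }

    public static void main(String[] args) {
        long address = encode(5, 1000L);
        System.out.println(decodePageNumber(address) + " / " + decodeOffset(address));
    }
}
```

With 13 bits for the page number, at most 8192 pages can be addressed this way.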

ShuffleInMemorySorter's insertRecord() method is as follows:

public void insertRecord(long recordPointer, int partitionId) {
  if (!hasSpaceForAnotherRecord()) {
    throw new IllegalStateException("There is no space for new record");
  }
  array.set(pos, PackedRecordPointer.packPointer(recordPointer, partitionId));
  pos++;
}

The PackedRecordPointer.packPointer() method:

public static long packPointer(long recordPointer, int partitionId) {
  assert (partitionId <= MAXIMUM_PARTITION_ID);
  // Note that without word alignment we can address 2^27 bytes = 128 megabytes per page.
  // Also note that this relies on some internals of how TaskMemoryManager encodes its addresses.
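  // Shift the page number right by 24 bits and combine it with the lower 27 bits of the
  // offset, compressing the logical address to 40 bits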
  final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
  final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
  // Put the partition id in the upper 24 bits
  return (((long) partitionId) << 40) | compressedAddress;
}
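
The resulting 64-bit word is laid out as 24 bits of partition id, 13 bits of page number, and 27 bits of in-page offset. The sketch below packs and unpacks such a word. PackedPointerSketch and its method names are hypothetical; the masks are rebuilt locally so the example is self-contained.

```java
// Illustrative sketch only; PackedPointerSketch and its methods are hypothetical names.
final class PackedPointerSketch {
    static final long MASK_UPPER_13_BITS = ~((1L << 51) - 1);
    static final long MASK_LOWER_27_BITS = (1L << 27) - 1;

    /** recordPointer uses the 13-bit page / 51-bit offset encoding shown earlier. */
    static long pack(long recordPointer, int partitionId) {
        // Move the page number from bits 51..63 down to bits 27..39.
        long pageNumber = (recordPointer & MASK_UPPER_13_BITS) >>> 24;
        long compressedAddress = pageNumber | (recordPointer & MASK_LOWER_27_BITS);
        // Partition id occupies the upper 24 bits.
        return (((long) partitionId) << 40) | compressedAddress;
    }

    static int partitionId(long packed) {
        return (int) (packed >>> 40);
    }

    /** Restores the 13-bit page / 51-bit offset form of the address. */
    static long recordPointer(long packed) {
        long pageNumber = (packed << 24) & MASK_UPPER_13_BITS;
        long offsetInPage = packed & MASK_LOWER_27_BITS;
        return pageNumber | offsetInPage;
    }

    public static void main(String[] args) {
        long pointer = (3L << 51) | 12345L;   // page 3, offset 12345
        long packed = pack(pointer, 7);
        System.out.println(partitionId(packed) + " " + (recordPointer(packed) == pointer));
    }
}
```

Only 27 offset bits survive the packing, which is why a page used by this sorter cannot exceed 2^27 bytes (128 MB), and the 24 partition bits explain the limit of 16777216 partitions for serialized shuffle.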

The getSortedIterator() method:

public ShuffleSorterIterator getSortedIterator() {
  int offset = 0;
  // Use radix sort to sort the pointers by partition id. Radix sort is much faster, but it
  // requires extra memory to be reserved as pointers are added
  if (useRadixSort) {
    offset = RadixSort.sort(
      array, pos,
      PackedRecordPointer.PARTITION_ID_START_BYTE_INDEX,
      PackedRecordPointer.PARTITION_ID_END_BYTE_INDEX, false, false);
    // Otherwise use TimSort
  } else {
    MemoryBlock unused = new MemoryBlock(
      array.getBaseObject(),
      array.getBaseOffset() + pos * 8L,
      (array.size() - pos) * 8L);
    LongArray buffer = new LongArray(unused);
    Sorter<PackedRecordPointer, LongArray> sorter =
      new Sorter<>(new ShuffleSortDataFormat(buffer));

    sorter.sort(array, 0, pos, SORT_COMPARATOR);
  }
  return new ShuffleSorterIterator(pos, array, offset);
}
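
Since only the partition id (the upper 24 bits of each packed pointer) takes part in the comparison, the sort amounts to grouping pointers by partition. The sketch below uses a plain counting sort to show this. It is a deliberately simpler technique swapped in for illustration, not Spark's RadixSort or TimSort, and PartitionSortSketch with its methods is a hypothetical name.

```java
// Illustrative sketch only: a counting sort by partition id, standing in for
// Spark's RadixSort/TimSort. All names here are hypothetical.
final class PartitionSortSketch {

    static int partitionId(long packed) {
        return (int) (packed >>> 40);
    }

    /** Stable sort of packed pointers by partition id; ids must be in [0, numPartitions). */
    static long[] sortByPartition(long[] packed, int numPartitions) {
        int[] start = new int[numPartitions + 1];
        // Count how many pointers fall into each partition.
        for (long p : packed) {
            start[partitionId(p) + 1]++;
        }
        // Prefix sums: start[k] becomes the first output index of partition k.
        for (int i = 0; i < numPartitions; i++) {
            start[i + 1] += start[i];
        }
        long[] sorted = new long[packed.length];
        for (long p : packed) {
            sorted[start[partitionId(p)]++] = p;
        }
        return sorted;
    }

    public static void main(String[] args) {
        long[] packed = {
            (2L << 40) | 1, (0L << 40) | 2, (1L << 40) | 3, (0L << 40) | 4
        };
        for (long p : sortByPartition(packed, 3)) {
            System.out.println(partitionId(p) + " -> " + (p & ((1L << 40) - 1)));
        }
    }
}
```

Records inside one partition are not ordered by key here, which matches the serialized shuffle path: it sorts by partition id only.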

3.2.2 closeAndWriteOutput()

@VisibleForTesting
void closeAndWriteOutput() throws IOException {
  assert(sorter != null);
  updatePeakMemoryUsed();
  serBuffer = null;
  serOutputStream = null;
  // Get the spill files
  final SpillInfo[] spills = sorter.closeAndGetSpills();
  sorter = null;
  final long[] partitionLengths;
  try {
    // Merge the spill files
    partitionLengths = mergeSpills(spills);
  } finally {
    // Delete the spill files
    for (SpillInfo spill : spills) {
      if (spill.file.exists() && !spill.file.delete()) {
        logger.error("Error while deleting spill file {}", spill.file.getPath());
      }
    }
  }
  // Update mapStatus
  mapStatus = MapStatus$.MODULE$.apply(
    blockManager.shuffleServerId(), partitionLengths, mapId);
}

The mergeSpills() method:

private long[] mergeSpills(SpillInfo[] spills) throws IOException {
  long[] partitionLengths;
  // No spill files: create an empty output
  if (spills.length == 0) {
    final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents
        .createMapOutputWriter(shuffleId, mapId, partitioner.numPartitions());
    return mapWriter.commitAllPartitions();
    // Exactly one spill file: transfer it directly as the output when possible
  } else if (spills.length == 1) {
    Optional<SingleSpillShuffleMapOutputWriter> maybeSingleFileWriter =
        shuffleExecutorComponents.createSingleFileMapOutputWriter(shuffleId, mapId);
    if (maybeSingleFileWriter.isPresent()) {
      // Here, we don't need to perform any metrics updates because the bytes written to this
      // output file would have already been counted as shuffle bytes written.
      partitionLengths = spills[0].partitionLengths;
      maybeSingleFileWriter.get().transferMapSpillFile(spills[0].file, partitionLengths);
    } else {
      partitionLengths = mergeSpillsUsingStandardWriter(spills);
    }
    // Several spill files: merge them, using either NIO (transferTo) or stream-based I/O
  } else {
    partitionLengths = mergeSpillsUsingStandardWriter(spills);
  }
  return partitionLengths;
}
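
When several spill files exist, each one already holds its records grouped by partition, so merging means concatenating, partition by partition, the matching segment of every spill. The sketch below shows that loop on in-memory byte arrays. It is a simplified illustration, not Spark's merge code: MergeSpillsSketch, Spill and Merged are hypothetical types, and it assumes the segments can simply be appended (no compression or encryption handling).

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative sketch only; all types and names are hypothetical, and spills are
// modeled as byte arrays instead of files.
final class MergeSpillsSketch {

    static final class Spill {
        final byte[] data;              // all partitions back to back, in partition order
        final long[] partitionLengths;  // bytes per partition inside `data`

        Spill(byte[] data, long[] partitionLengths) {
            this.data = data;
            this.partitionLengths = partitionLengths;
        }
    }

    static final class Merged {
        final byte[] data;
        final long[] partitionLengths;

        Merged(byte[] data, long[] partitionLengths) {
            this.data = data;
            this.partitionLengths = partitionLengths;
        }
    }

    static Merged merge(Spill[] spills, int numPartitions) {
        long[] mergedLengths = new long[numPartitions];
        int[] readOffsets = new int[spills.length];   // next unread byte in each spill
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int partition = 0; partition < numPartitions; partition++) {
            for (int s = 0; s < spills.length; s++) {
                int length = (int) spills[s].partitionLengths[partition];
                out.write(spills[s].data, readOffsets[s], length);
                readOffsets[s] += length;
                mergedLengths[partition] += length;
            }
        }
        return new Merged(out.toByteArray(), mergedLengths);
    }

    public static void main(String[] args) {
        Spill a = new Spill(new byte[] {1, 2, 3}, new long[] {1, 2});
        Spill b = new Spill(new byte[] {4, 5, 6}, new long[] {2, 1});
        Merged merged = merge(new Spill[] {a, b}, 2);
        System.out.println(Arrays.toString(merged.data));
        System.out.println(Arrays.toString(merged.partitionLengths));
    }
}
```

The merged partition lengths are what the writer reports through mapStatus, so reducers can locate their partition inside the single output file.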

3.3 SortShuffleWriter

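SortShuffleWriter uses a PartitionedAppendOnlyMap or a PartitionedPairBuffer to sort in memory and spills to files when the memory limit is exceeded. When the final ordered output file is written, all previously spilled files and the data still in memory are merge-sorted together, and elements with the same key are aggregated with the defined function. The entry point is write():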

override def write(records: Iterator[Product2[K, V]]): Unit = {
  // Create an external sorter. With map-side combine, pass the aggregator and keyOrdering; otherwise pass neither
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  // Feed the records into ExternalSorter
  sorter.insertAll(records)

  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  // Create the output writer
  val mapOutputWriter = shuffleExecutorComponents.createMapOutputWriter(
    dep.shuffleId, mapId, dep.partitioner.numPartitions)
  // Write the sorted data to the writer
  sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)
  val partitionLengths = mapOutputWriter.commitAllPartitions()
  // Update mapStatus
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths, mapId)
}

The insertAll() method:

def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  // TODO: stop combining if we find that the reduction factor isn't high
  val shouldCombine = aggregator.isDefined

  // Is map-side aggregation needed?
  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    // mergeValue() folds a new value into the current combined result
    val mergeValue = aggregator.get.mergeValue
    // createCombiner() creates the initial combined value
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    // If the key already has a combined value, merge into it; otherwise create the initial value
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
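    // Iterate over the records
    while (records.hasNext) {
      // Increment the count of records read
      addElementsRead()
      kv = records.next()
      // map is a PartitionedAppendOnlyMap: (partition, key) is the key, the combined value is the value
      map.changeValue((getPartition(kv._1), kv._1), update)
      // Spill to disk if necessary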
      maybeSpillCollection(usingMap = true)
    }
    // No map-side aggregation
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      // buffer is a PartitionedPairBuffer: append the partition, key and value
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      // Spill to disk if necessary
      maybeSpillCollection(usingMap = false)
    }
  }
}

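This method decides, as records are inserted, whether map-side pre-aggregation is needed, and stores the records in one of two data structures accordingly.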
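
The difference between the two paths can be sketched with standard Java collections. This is an illustration only: InsertAllSketch and its methods are hypothetical, a HashMap and an ArrayList stand in for PartitionedAppendOnlyMap and PartitionedPairBuffer, the aggregation is assumed to be an integer sum, and spilling is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and spilling is omitted.
final class InsertAllSketch {

    /** Hash-style partitioning, assumed here for the sake of the example. */
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Map-side combine: one entry per (partition, key), values folded as they arrive. */
    static Map<List<Object>, Integer> insertWithCombine(
            String[] keys, int[] values, int numPartitions) {
        Map<List<Object>, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            List<Object> compositeKey =
                Arrays.<Object>asList(partitionOf(keys[i], numPartitions), keys[i]);
            // First value for a key acts as createCombiner; later ones go through mergeValue (sum).
            map.merge(compositeKey, values[i], Integer::sum);
        }
        return map;
    }

    /** No combine: every record is appended as a (partition, key, value) triple. */
    static List<Object[]> insertWithoutCombine(
            String[] keys, int[] values, int numPartitions) {
        List<Object[]> buffer = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            buffer.add(new Object[] {partitionOf(keys[i], numPartitions), keys[i], values[i]});
        }
        return buffer;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a"};
        int[] values = {1, 2, 3};
        System.out.println(insertWithCombine(keys, values, 4).size());
        System.out.println(insertWithoutCombine(keys, values, 4).size());
    }
}
```

With combining, memory grows with the number of distinct keys; without it, memory grows with the number of records, which is why the buffer path tends to spill sooner on repetitive data.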

maybeSpillCollection() calls maybeSpill() to check whether a spill is needed. If a spill happens, a fresh map or buffer is constructed and caching starts over:

private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    estimatedSize = map.estimateSize()
    // Spill if necessary
    if (maybeSpill(map, estimatedSize)) {
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    estimatedSize = buffer.estimateSize()
    // Spill if necessary
    if (maybeSpill(buffer, estimatedSize)) {
      buffer = new PartitionedPairBuffer[K, C]
    }
  }

  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}


  protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
    var shouldSpill = false
    // Every 32 records read, check whether the estimated size of the map or buffer has
    // reached the threshold (initially 5 MB by default)
    if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
      // Claim up to double our current memory from the shuffle memory pool
      // Try to acquire 2 * currentMemory - myMemoryThreshold more memory
      val amountToRequest = 2 * currentMemory - myMemoryThreshold
      val granted = acquireMemory(amountToRequest)
      // Raise the threshold by the amount granted
      myMemoryThreshold += granted
      // If we were granted too little memory to grow further (either tryToAcquire returned 0,
      // or we already had more memory than myMemoryThreshold), spill the current collection
      // If that is still not enough, spill
      shouldSpill = currentMemory >= myMemoryThreshold
    }
    // Even if shouldSpill is false, spill when the number of records read exceeds
    // numElementsForceSpillThreshold (Integer.MAX_VALUE by default)
    shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
    // Actually spill
    if (shouldSpill) {
      // Increment the spill count
      _spillCount += 1
      logSpillage(currentMemory)
      // Spill the cached collection
      spill(collection)
      _elementsRead = 0
      _memoryBytesSpilled += currentMemory
      // Release memory
      releaseMemory()
    }
    shouldSpill
  }
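
The threshold logic is easier to follow with numbers. The sketch below simulates it against a fixed-size memory pool. It is an illustration under stated assumptions, not Spark code: SpillThresholdSketch is a hypothetical class, the pool is a simple counter, the record counter is advanced inside maybeSpill() (Spark does this separately in addElementsRead()), and the forced spill on record count is omitted.

```java
// Illustrative sketch only; names are hypothetical and the memory pool is simulated.
final class SpillThresholdSketch {
    private final long initialThreshold;
    private long myMemoryThreshold;
    private long poolFree;            // stand-in for the shared execution memory pool
    private long elementsRead = 0;
    private int spillCount = 0;

    SpillThresholdSketch(long initialThreshold, long poolFree) {
        this.initialThreshold = initialThreshold;
        this.myMemoryThreshold = initialThreshold;
        this.poolFree = poolFree;
    }

    private long acquireMemory(long requested) {
        long granted = Math.max(0, Math.min(requested, poolFree));
        poolFree -= granted;
        return granted;
    }

    /** Call once per inserted record; returns true when the collection should be spilled. */
    boolean maybeSpill(long currentMemory) {
        elementsRead++;
        boolean shouldSpill = false;
        if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
            // Ask for enough to double the current size, then raise the threshold by the grant.
            long granted = acquireMemory(2 * currentMemory - myMemoryThreshold);
            myMemoryThreshold += granted;
            shouldSpill = currentMemory >= myMemoryThreshold;
        }
        if (shouldSpill) {
            spillCount++;
            elementsRead = 0;
            // Give back everything above the initial threshold.
            poolFree += myMemoryThreshold - initialThreshold;
            myMemoryThreshold = initialThreshold;
        }
        return shouldSpill;
    }

    int spillCount() {
        return spillCount;
    }

    long threshold() {
        return myMemoryThreshold;
    }

    public static void main(String[] args) {
        SpillThresholdSketch sketch = new SpillThresholdSketch(100, 150);
        long size = 0;
        for (int i = 0; i < 200; i++) {
            size += 10;                       // pretend each record adds 10 bytes
            if (sketch.maybeSpill(size)) {
                size = 0;                     // a fresh collection starts empty
            }
        }
        System.out.println("spills: " + sketch.spillCount());
    }
}
```

Because the check only runs on every 32nd record, the collection can overshoot the threshold between checks; the estimate is a trade-off between accuracy and per-record overhead.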

maybeSpill() calls spill() to do the actual spilling:

  override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
    // 根據給定的比較器進行排序,返回排序結果的迭代器
    val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
    // 將迭代器中的資料溢寫到磁碟檔案中
    val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
    // Track every spill file in an ArrayBuffer
    spills += spillFile
  }

spillMemoryIteratorToDisk() is shown below:

private[this] def spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator)
    : SpilledFile = {
  // Because these files may be read during shuffle, their compression must be controlled by
  // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
  // createTempShuffleBlock here; see SPARK-3426 for more context.
  // Create a temporary shuffle block
  val (blockId, file) = diskBlockManager.createTempShuffleBlock()

  // These variables are reset after each flush
  var objectsWritten: Long = 0
  val spillMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics
  // Create the DiskBlockObjectWriter for the spill file
  val writer: DiskBlockObjectWriter =
    blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)

  // List of batch sizes (bytes) in the order they are written to disk
  // Records the size of each batch written
  val batchSizes = new ArrayBuffer[Long]

  // How many elements we have in each partition
  // Records the number of elements in each partition
  val elementsPerPartition = new Array[Long](numPartitions)

  // Flush the disk writer's contents to disk, and update relevant variables.
  // The writer is committed at the end of this process.
  // Flushes the in-memory data to disk batch by batch
  def flush(): Unit = {
    val segment = writer.commitAndGet()
    batchSizes += segment.length
    _diskBytesSpilled += segment.length
    objectsWritten = 0
  }

  var success = false
  try {
    // Iterate over the records in the map or buffer
    while (inMemoryIterator.hasNext) {
      val partitionId = inMemoryIterator.nextPartition()
      require(partitionId >= 0 && partitionId < numPartitions,
        s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
      // Write the record and update the counters
      inMemoryIterator.writeNext(writer)
      elementsPerPartition(partitionId) += 1
      objectsWritten += 1

      // Once serializerBatchSize records (10000 by default) have been written, flush the batch to disk
      if (objectsWritten == serializerBatchSize) {
        flush()
      }
    }
    // After the loop, flush whatever remains
    if (objectsWritten > 0) {
      flush()
    } else {
      writer.revertPartialWritesAndClose()
    }
    success = true
  } finally {
    if (success) {
      writer.close()
    } else {
      // This code path only happens if an exception was thrown above before we set success;
      // close our stuff and let the exception be thrown further
      writer.revertPartialWritesAndClose()
      if (file.exists()) {
        if (!file.delete()) {
          logWarning(s"Error deleting ${file}")
        }
      }
    }
  }

  // Return the spill file
  SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
}
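
The batching bookkeeping in spillMemoryIteratorToDisk() can be illustrated without touching disk. The sketch below is a simplification under stated assumptions: SpillBatching and SpillSummary are hypothetical names, records are plain (partitionId, value) pairs, and each batch is recorded as a record count, whereas the real method records the byte length of each committed segment.

```scala
// Illustrative only: mimics the flush-per-batch bookkeeping, not Spark's disk writer.
object SpillBatching {
  /** Records per flushed batch, and records per partition. */
  final case class SpillSummary(batchCounts: Vector[Int], elementsPerPartition: Vector[Long])

  def spill(records: Iterator[(Int, String)], numPartitions: Int, batchSize: Int): SpillSummary = {
    require(batchSize > 0, "batchSize must be positive")
    val batches = Vector.newBuilder[Int]
    val perPartition = Array.fill(numPartitions)(0L)
    var objectsWritten = 0

    // Close the current batch and start a new one.
    def flush(): Unit = {
      batches += objectsWritten
      objectsWritten = 0
    }

    records.foreach { case (partitionId, _) =>
      require(partitionId >= 0 && partitionId < numPartitions,
        s"partition Id: $partitionId should be in the range [0, $numPartitions)")
      perPartition(partitionId) += 1
      objectsWritten += 1
      // A full batch is flushed immediately.
      if (objectsWritten == batchSize) flush()
    }
    // Flush the final, possibly partial, batch.
    if (objectsWritten > 0) flush()

    SpillSummary(batches.result(), perPartition.toVector)
  }
}
```

Keeping the per-batch sizes and per-partition counts alongside the file is what lets a later reader walk the spill file batch by batch and know where each partition ends.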

Next comes the sort-and-merge step, which calls ExternalSorter.writePartitionedMapOutput():

def writePartitionedMapOutput(
    shuffleId: Int,
    mapId: Long,
    mapOutputWriter: ShuffleMapOutputWriter): Unit = {
  var nextPartitionId = 0
  // No spill has happened
  if (spills.isEmpty) {
    // Case where we only have in-memory data
    val collection = if (aggregator.isDefined) map else buffer
    // Sort with the given comparator
    val it = collection.destructiveSortedWritablePartitionedIterator(comparator)
    while (it.hasNext()) {
      val partitionId = it.nextPartition()
      var partitionWriter: ShufflePartitionWriter = null
      var partitionPairsWriter: ShufflePartitionPairsWriter = null
      TryUtils.tryWithSafeFinally {
        partitionWriter = mapOutputWriter.getPartitionWriter(partitionId)
        val blockId = ShuffleBlockId(shuffleId, mapId, partitionId)
        partitionPairsWriter = new ShufflePartitionPairsWriter(
          partitionWriter,
          serializerManager,
          serInstance,
          blockId,
          context.taskMetrics().shuffleWriteMetrics)
        // Write out this partition's records one by one
        while (it.hasNext && it.nextPartition() == partitionId) {
          it.writeNext(partitionPairsWriter)
        }
      } {
        if (partitionPairsWriter != null) {
          partitionPairsWriter.close()
        }
      }
      nextPartitionId = partitionId + 1
    }
    // A spill has happened: merge-sort the spill files with the in-memory data, then write each partition in turn to a ShufflePartitionPairsWriter
  } else {
    // We must perform merge-sort; get an iterator by partition and write everything directly.
    // The merge sort happens here
    for ((id, elements) <- this.partitionedIterator) {
      val blockId = ShuffleBlockId(shuffleId, mapId, id)
      var partitionWriter: ShufflePartitionWriter = null
      var partitionPairsWriter: ShufflePartitionPairsWriter = null
      TryUtils.tryWithSafeFinally {
        partitionWriter = mapOutputWriter.getPartitionWriter(id)
        partitionPairsWriter = new ShufflePartitionPairsWriter(
          partitionWriter,
          serializerManager,
          serInstance,
          blockId,
          context.taskMetrics().shuffleWriteMetrics)
        if (elements.hasNext) {
          for (elem <- elements) {
            partitionPairsWriter.write(elem._1, elem._2)
          }
        }
      } {
        if (partitionPairsWriter != null) {
          partitionPairsWriter.close()
        }
      }
      nextPartitionId = id + 1
    }
  }

  context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
  context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
  context.taskMetrics().incPeakExecutionMemory(peakMemoryUsedBytes)
}
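
The no-spill branch above relies on the iterator being sorted by partition ID, so each partition appears as one contiguous run. A minimal sketch of that run-draining pattern follows. PartitionRuns and partitionLengths are hypothetical names, and records are modeled as already-serialized byte arrays purely for illustration. The per-partition lengths it produces are the kind of input that the index file is later built from.

```scala
// Illustrative only: drains a partition-sorted stream run by run.
object PartitionRuns {
  /** Sums serialized record sizes per partition from a stream sorted by partition ID. */
  def partitionLengths(sorted: Iterator[(Int, Array[Byte])], numPartitions: Int): Array[Long] = {
    val lengths = new Array[Long](numPartitions)
    val it = sorted.buffered
    while (it.hasNext) {
      val partitionId = it.head._1
      // Consume the contiguous run of records that belong to this partition.
      while (it.hasNext && it.head._1 == partitionId) {
        lengths(partitionId) += it.next()._2.length
      }
    }
    lengths
  }
}
```

Partitions that never appear in the stream simply keep a length of zero, which is why empty partitions cost nothing in the data file.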

The partitionedIterator() method:

def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = {
  val usingMap = aggregator.isDefined
  val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer
  if (spills.isEmpty) {
    // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
    // we don't even need to sort by anything other than partition ID
    // No spill and no ordering: sort by partition ID only
    if (ordering.isEmpty) {
      // The user hasn't requested sorted keys, so only sort by partition ID, not key
      groupByPartition(destructiveIterator(collection.partitionedDestructiveSortedIterator(None)))
      // No spill but an ordering is defined: sort by partition ID first, then by key
    } else {
      // We do need to sort by both partition ID and key
      groupByPartition(destructiveIterator(
        collection.partitionedDestructiveSortedIterator(Some(keyComparator))))
    }
  } else {
    // Merge spilled and in-memory data
    // Spills exist: merge-sort the spill files with the in-memory data
    merge(spills, destructiveIterator(
      collection.partitionedDestructiveSortedIterator(comparator)))
  }
}
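
The difference between the two in-memory branches comes down to which comparator is applied to the ((partitionId, key), value) records. The sketch below shows both orderings on a small sequence. PartitionOrdering, Rec, byPartition, and byPartitionThenKey are hypothetical names chosen for this example, with String keys and Int values assumed for simplicity.

```scala
// Illustrative only: the two sort orders used for in-memory data.
object PartitionOrdering {
  type Rec = ((Int, String), Int) // ((partitionId, key), value)

  // Used when no key ordering was requested: compare partition IDs only.
  val byPartition: Ordering[Rec] = Ordering.by[Rec, Int](_._1._1)

  // Used when a key ordering exists: partition ID first, then key.
  def byPartitionThenKey(keyOrd: Ordering[String]): Ordering[Rec] =
    new Ordering[Rec] {
      def compare(a: Rec, b: Rec): Int = {
        val byId = Integer.compare(a._1._1, b._1._1)
        if (byId != 0) byId else keyOrd.compare(a._1._2, b._1._2)
      }
    }
}
```

Sorting by partition ID alone is cheaper and is enough to group records for output; the key comparison is only paid for when the shuffle actually needs keys in order within each partition.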

The merge method is as follows:

private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  // Open a reader for each spill file
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  // Iterate over the partitions
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    // Combine the spill-file iterators with the in-memory iterator
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    // With an aggregator: aggregate within the partition, sorting keys with keyComparator
    if (aggregator.isDefined) {
      // Perform partial aggregation across partitions
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
      // No aggregator but an ordering is defined: merge-sort by ordering
    } else if (ordering.isDefined) {
      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
      // sort the elements without trying to merge them
      (p, mergeSort(iterators, ordering.get))
      // Neither: simply concatenate the iterators
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}
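
The mergeSort(iterators, ordering.get) call merges several already-sorted streams into one. A common way to do this is a k-way merge driven by a priority queue of buffered iterators. The sketch below shows that technique under the assumption that every input iterator is already sorted. KWayMerge is a hypothetical name, and this is not a copy of Spark's implementation.

```scala
import scala.collection.mutable

// Illustrative k-way merge of pre-sorted iterators using a heap keyed on each head element.
object KWayMerge {
  def mergeSort[T](iterators: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
    // PriorityQueue dequeues the largest element, so reverse the ordering
    // to get the iterator with the smallest head first.
    val heapOrd: Ordering[BufferedIterator[T]] =
      Ordering.by[BufferedIterator[T], T](_.head)(ord.reverse)
    val heap = mutable.PriorityQueue.empty[BufferedIterator[T]](heapOrd)
    // Empty inputs are dropped up front because they have no head to compare.
    iterators.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))

    new Iterator[T] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): T = {
        if (!hasNext) throw new NoSuchElementException("merge exhausted")
        val it = heap.dequeue()
        val elem = it.next()
        // Re-insert the iterator so it is ordered by its new head.
        if (it.hasNext) heap.enqueue(it)
        elem
      }
    }
  }
}
```

Because only one head element per input is held at a time, the merge streams through spill files without loading any of them fully into memory.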

The write() method then calls commitAllPartitions() to output the data, which in turn calls writeIndexFileAndCommit() to write the data and index files:

def writeIndexFileAndCommit(
    shuffleId: Int,
    mapId: Long,
    lengths: Array[Long],
    dataTmp: File): Unit = {
  // Resolve the index file and create a temporary index file
  val indexFile = getIndexFile(shuffleId, mapId)
  val indexTmp = Utils.tempFileWith(indexFile)
  try {
    // Get the shuffle data file
    val dataFile = getDataFile(shuffleId, mapId)
    // There is only one IndexShuffleBlockResolver per executor, this synchronization make sure
    // the following check and rename are atomic.
    // The lock below makes the check-and-rename step atomic within this executor
    synchronized {
      // Check whether the index file and data file already match each other
      val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
      if (existingLengths != null) {
        // Another attempt for the same task has already written our map outputs successfully,
        // so just use the existing partition lengths and delete our temporary map outputs.
        // A match means the shuffle write has already completed, so delete our temporary data file
        System.arraycopy(existingLengths, 0, lengths, 0, lengths.length)
        if (dataTmp != null && dataTmp.exists()) {
          dataTmp.delete()
        }
      } else {
        // No match: open a buffered output stream on the temporary index file
        // This is the first successful attempt in writing the map outputs for this task,
        // so override any existing index and data files with the ones we wrote.
        val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))
        Utils.tryWithSafeFinally {
          // We take in lengths of each block, need to convert it to offsets.
          // Accumulate each partition's length into an offset and write it to the temporary index file
          var offset = 0L
          out.writeLong(offset)
          for (length <- lengths) {
            offset += length
            out.writeLong(offset)
          }
        } {
          out.close()
        }

        // Delete any existing index file
        if (indexFile.exists()) {
          indexFile.delete()
        }
        // Delete any existing data file
        if (dataFile.exists()) {
          dataFile.delete()
        }
        // Rename the temporary files to their final names
        if (!indexTmp.renameTo(indexFile)) {
          throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
        }
        if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
          throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
        }
      }
    }
  } finally {
    if (indexTmp.exists() && !indexTmp.delete()) {
      logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
    }
  }
}
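
The index file written above is just a sequence of cumulative offsets: a leading 0 followed by one running total per partition. The sketch below reproduces that layout in memory and shows how a reader could find one partition's segment from it. ShuffleIndexSketch, writeIndex, and locate are hypothetical helpers written for this example; only the java.io stream classes are real APIs.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Illustrative only: builds and reads the offset layout in memory instead of on disk.
object ShuffleIndexSketch {
  /** Serializes cumulative offsets: one leading 0 plus one entry per partition. */
  def writeIndex(lengths: Array[Long]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    try {
      var offset = 0L
      out.writeLong(offset)
      for (length <- lengths) {
        offset += length
        out.writeLong(offset)
      }
    } finally {
      out.close()
    }
    bytes.toByteArray
  }

  /** Returns (startOffset, length) of one partition's segment in the data file. */
  def locate(index: Array[Byte], partitionId: Int): (Long, Long) = {
    val in = new DataInputStream(new ByteArrayInputStream(index))
    try {
      in.skipBytes(partitionId * 8) // each offset is an 8-byte long
      val start = in.readLong()
      val end = in.readLong()
      (start, end - start)
    } finally {
      in.close()
    }
  }
}
```

Storing offsets rather than lengths means a reader needs only two adjacent longs to locate any partition, regardless of how many partitions precede it.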

4.Summary

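  • When Spark initializes SparkEnv, the create() method instantiates the ShuffleManager. Two short names are accepted, sort and tungsten-sort, and both map to SortShuffleManager.

  • ShuffleManager is a trait whose core methods are registerShuffle(), getReader(), and getWriter().

  • SortShuffleManager is the only implementation of ShuffleManager. registerShuffle() chooses which shuffle mechanism to use, getReader() always returns a BlockStoreShuffleReader, and getWriter() picks one of three Writers according to the handle type.

  • BypassMergeSortShuffleWriter: used when the shuffle dependency has no map-side aggregation and the number of partitions does not exceed spark.shuffle.sort.bypassMergeThreshold (200 by default). Core methods: write() and writePartitionedData() (concatenates the per-partition files, using zero-copy transfer by default).

  • UnsafeShuffleWriter: used when all three of these conditions hold: the serializer supports relocation, there is no map-side aggregation, and the number of partitions is no more than 16777215 + 1. Core methods: write(), insertRecordIntoSorter() (inserts records into the external sorter), and closeAndWriteOutput() (merges the spills and writes the output file). insertRecordIntoSorter() relies on insertRecord() (inserts the serialized record into the external sorter), growPointerArrayIfNecessary() (grows the pointer array when more space is needed, or spills to disk), spill() (spills to disk), writeSortedFile() (sorts the in-memory data and writes it to a disk file), and encodePageNumberAndOffset() (encodes the record's logical address as a long). closeAndWriteOutput() relies on mergeSpills() (merges the spill files), which can merge using either BIO or NIO.

  • SortShuffleWriter: used when neither of the above applies. It sorts in memory with a PartitionedAppendOnlyMap or a PartitionedPairBuffer and spills to files once the memory limit is exceeded. When producing the final sorted output file, it merge-sorts all earlier spill files together with the data still in memory, and combines elements that share a key using the user-defined function. Core methods: write(), insertAll() (feeds records into the ExternalSorter for sorting), maybeSpillCollection() (checks whether to spill to disk), maybeSpill(), spill(), spillMemoryIteratorToDisk() (spills in-memory data to disk), writePartitionedMapOutput(), and commitAllPartitions(), which calls writeIndexFileAndCommit() to write the data and index files.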
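
The selection rules summarized in the bullets above can be condensed into a small decision function. This is a sketch under stated assumptions, not Spark's registerShuffle(): WriterSelection, WriterKind, and choose are hypothetical names, the three conditions are passed in as plain values, and the bypass comparison is assumed to be inclusive of the threshold.

```scala
// Illustrative only: condenses the three writer-selection rules into one function.
object WriterSelection {
  sealed trait WriterKind
  case object BypassMergeSort extends WriterKind
  case object UnsafeSerialized extends WriterKind
  case object BaseSort extends WriterKind

  // 16777215 + 1, the partition limit quoted for the serialized (Unsafe) path.
  val MaxSerializedPartitions: Int = 1 << 24

  def choose(
      numPartitions: Int,
      mapSideCombine: Boolean,
      serializerRelocatable: Boolean,
      bypassThreshold: Int = 200): WriterKind = {
    if (!mapSideCombine && numPartitions <= bypassThreshold) {
      // No map-side aggregation and few partitions: one file per partition, then concatenate.
      BypassMergeSort
    } else if (serializerRelocatable && !mapSideCombine &&
        numPartitions <= MaxSerializedPartitions) {
      // Serialized records can be sorted by pointer without deserializing them.
      UnsafeSerialized
    } else {
      // Fallback: sort (and optionally aggregate) deserialized records.
      BaseSort
    }
  }
}
```

The order of the checks matters: the bypass rule is tested first, so a small shuffle without aggregation takes the bypass path even when its serializer would also qualify for the serialized path.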
