Spark Source Code Analysis: Shuffle Writer
Abstract: Shuffle is the most time-consuming step in the MapReduce programming model. Spark splits the shuffle process into two phases, Shuffle Write and Shuffle Read. In this article we take a detailed look at Spark's Shuffle Write implementation.
ShuffleWriter
The entry point for Spark Shuffle Write is org.apache.spark.shuffle.ShuffleWriter. Let's look at the interface definition:
private[spark] abstract class ShuffleWriter[K, V] {
/** Write a sequence of records to this task's output */
@throws[IOException]
def write(records: Iterator[Product2[K, V]]): Unit
/** Close this writer, passing along whether the map completed */
def stop(success: Boolean): Option[MapStatus]
}
It has three concrete implementations: BypassMergeSortShuffleWriter, SortShuffleWriter, and UnsafeShuffleWriter.
BypassMergeSortShuffleWriter
Assume the first stage (the map stage) has m tasks and the second stage has r partitions. BypassMergeSortShuffleWriter proceeds in three steps:
1. For each ShuffleMapTask (each ShuffleMapTask processes one map-side partition), create r temporary files.
2. Iterate over the records of that map partition, use getPartition(key) to decide which reduce partition each record belongs to, and append it to the temporary file for that partitionId.
3. Merge the r files produced in step 2 into a single data file, and write each partitionId's offset into an index file.
Key code walkthrough
public void write(Iterator<Product2<K, V>> records) throws IOException {
...
// Create one DiskBlockObjectWriter per partition of the downstream (reduce) stage
partitionWriters = new DiskBlockObjectWriter[numPartitions];
partitionWriterSegments = new FileSegment[numPartitions];
for (int i = 0; i < numPartitions; i++) {
final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
blockManager.diskBlockManager().createTempShuffleBlock();
final File file = tempShuffleBlockIdPlusFile._2();
final BlockId blockId = tempShuffleBlockIdPlusFile._1();
partitionWriters[i] =
blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
}
// Use `getPartition(key)` to get the reduce-side partitionId of each record and write the record to that partition's temporary file
while (records.hasNext()) {
final Product2<K, V> record = records.next();
final K key = record._1();
partitionWriters[partitioner.getPartition(key)].write(key, record._2());
}
for (int i = 0; i < numPartitions; i++) {
final DiskBlockObjectWriter writer = partitionWriters[i];
partitionWriterSegments[i] = writer.commitAndGet();
writer.close();
}
File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
File tmp = Utils.tempFileWith(output);
try {
// Merge the per-partition temporary files into the `shuffle_${shuffleId}_${mapId}_${reduceId}.data` file
partitionLengths = writePartitionedFile(tmp);
// Write each partitionId's offsets into the `shuffle_${shuffleId}_${mapId}_${reduceId}.index` file
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
}
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}
1. The default Partitioner implementation is HashPartitioner.
2. The default SerializerInstance implementation is JavaSerializerInstance.
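As a refresher on what getPartition(key) does in step 2 above, here is a minimal sketch of hash partitioning in the spirit of Spark's HashPartitioner; it is a simplified illustration, not the actual class.

import org.apache.spark.Partitioner

// Simplified sketch: map a key to a reduce partition by its hash code.
class SimpleHashPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      // Keep the result non-negative even when hashCode is negative.
      val mod = key.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod
  }
}

// new SimpleHashPartitioner(3).getPartition("a") returns a value in [0, 3)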
FileSegment
Each intermediate temporary file produced by BypassMergeSortShuffleWriter is described by a FileSegment:
class FileSegment(val file: File, val offset: Long, val length: Long)
file records the physical file and length records its size; the lengths are used to write the index file when the FileSegments are merged.
Now let's look at writePartitionedFile, the method that merges the temporary files:
private long[] writePartitionedFile(File outputFile) throws IOException {
  final long[] lengths = new long[numPartitions];
  ...
  final FileChannel out = FileChannel.open(outputFile.toPath(), WRITE, APPEND, CREATE);
  for (int i = 0; i < numPartitions; i++) {
    final File file = partitionWriterSegments[i].file();
    if (file.exists()) {
      final FileChannel in = FileChannel.open(file.toPath(), READ);
      try {
        long size = in.size();
        // Key merging step: NIO's transferTo copies the file streams efficiently
        Utils.copyFileStreamNIO(in, out, 0, size);
        lengths[i] = size;
      } finally {
        in.close();
      }
    }
  }
  partitionWriters = null;
  // Return each partition's length, used later to write the index file
  return lengths;
}
The index file is written by writeIndexFileAndCommit:
def writeIndexFileAndCommit(
shuffleId: Int,
mapId: Int,
lengths: Array[Long],
dataTmp: File): Unit = {
val indexFile = getIndexFile(shuffleId, mapId)
val indexTmp = Utils.tempFileWith(indexFile)
try {
val out = new DataOutputStream(
new BufferedOutputStream(Files.newOutputStream(indexTmp.toPath)))
Utils.tryWithSafeFinally {
var offset = 0L
out.writeLong(offset)
for (length <- lengths) {
offset += length
out.writeLong(offset)
}
}
}
...
}
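To make the index layout concrete, here is a toy illustration (not Spark code) of what the loop above stores: the running sum of the per-partition lengths, i.e. numPartitions + 1 offsets.

// Toy illustration: partition lengths -> offsets written to the index file.
val lengths = Array(10L, 0L, 25L)            // bytes written for partitions 0, 1, 2
val offsets = lengths.scanLeft(0L)(_ + _)    // Array(0, 10, 10, 35)
// A reducer that wants partition i reads bytes [offsets(i), offsets(i + 1)) of the .data file.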
NOTE:
1. File merging uses Java NIO's transferTo to make merging the file streams more efficient.
2. See the full source of BypassMergeSortShuffleWriter for the complete code.
BypassMergeSortShuffleWriter Example
The following example illustrates how BypassMergeSortShuffleWriter works (see the sketch below).
1. In real workloads the data within a partition is usually unordered; the data in this example happens to be ordered, so do not be misled into thinking that BypassMergeSortShuffleWriter sorts the data for us.
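Here is a small, hypothetical job that takes this code path: it has no map-side combine and only 3 reduce partitions, well below the default spark.shuffle.sort.bypassMergeThreshold of 200, so each map task writes 3 temporary files and then merges them.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object BypassExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("bypass demo"))
    val grouped = sc.parallelize(1 to 100, numSlices = 2) // 2 map tasks
      .map(x => (x % 3, x))                               // no map-side combine
      .partitionBy(new HashPartitioner(3))                // 3 reduce partitions => bypass path
    println(grouped.glom().map(_.length).collect().toSeq) // sizes of the 3 shuffled partitions
    sc.stop()
  }
}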
SortShuffleWriter
Prerequisites:
org.apache.spark.util.collection.AppendOnlyMap
org.apache.spark.util.collection.PartitionedPairBuffer
org.apache.spark.util.collection.TimSort
The SortShuffleWriter.write() implementation
Let's first look at the concrete implementation of write():
override def write(records: Iterator[Product2[K, V]]): Unit = {
sorter = if (dep.mapSideCombine) {
require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
new ExternalSorter[K, V, C](
context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
} else {
new ExternalSorter[K, V, V](
context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}
sorter.insertAll(records)
val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
val tmp = Utils.tempFileWith(output)
try {
val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally {
if (tmp.exists() && !tmp.delete()) {
logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
}
}
}
The SortShuffleWriter write process can roughly be divided into two steps: first insertAll, then merging the SpilledFiles that were spilled to disk.
ExternalSorter can be understood in four steps:
- Depending on whether a combine is needed, the in-memory buffer is either a PartitionedAppendOnlyMap or a PartitionedPairBuffer. In both data structures the records are sorted first by partitionId and then, within each partition, by key.
- When the buffered data reaches the memory limit or the record-count limit, it is spilled to disk, and each SpilledFile records how many records each partition contains.
- When an iterator or a file is requested, all SpilledFiles and the not-yet-spilled in-memory data are merged.
- Finally, stop is called to delete the temporary files.
The implementation of ExternalSorter.insertAll:
def insertAll(records: Iterator[Product2[K, V]]): Unit = {
val shouldCombine = aggregator.isDefined
// Whether a map-side combine is needed is determined by whether an aggregator is defined
if (shouldCombine) {
// Combine values in-memory first using our AppendOnlyMap
// Corresponds to the seqOp parameter of rdd.aggregateByKey
val mergeValue = aggregator.get.mergeValue
// Corresponds to the zeroValue parameter of rdd.aggregateByKey; the combiner is created from zeroValue
val createCombiner = aggregator.get.createCombiner
var kv: Product2[K, V] = null
val update = (hadValue: Boolean, oldValue: C) => {
if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
}
while (records.hasNext) {
addElementsRead()
kv = records.next()
map.changeValue((getPartition(kv._1), kv._1), update)
maybeSpillCollection(usingMap = true)
}
} else {
// Stick values into our buffer
while (records.hasNext) {
addElementsRead()
val kv = records.next()
buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
maybeSpillCollection(usingMap = false)
}
}
}
Note that the key written into the map/buffer is always (partitionId, key), because the data destined for one temporary file must be sorted first by partitionId and then by key.
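A minimal sketch of that ordering (the real comparator lives in Spark's WritablePartitionedPairCollection): compare by partitionId first, and fall back to the key ordering inside a partition.

// Sketch only: order (partitionId, key) pairs by partition first, then by key.
def partitionKeyComparator[K](keyOrdering: Ordering[K]): Ordering[(Int, K)] =
  new Ordering[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val byPartition = Integer.compare(a._1, b._1)
      if (byPartition != 0) byPartition else keyOrdering.compare(a._2, b._2)
    }
  }

// Example: Seq((1, "b"), (0, "z"), (1, "a")).sorted(partitionKeyComparator(Ordering.String))
// yields Seq((0, "z"), (1, "a"), (1, "b")).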
When data is spilled to disk
There are two conditions that trigger a spill; meeting either one causes a spill operation.
- 1. Every 32 elements we sample and check whether the current memory usage exceeds myMemoryThreshold, i.e. currentMemory >= myMemoryThreshold. currentMemory is obtained by estimating the current size of the map/buffer.
- 2. Check whether the number of records in the in-memory structure exceeds the forced-spill threshold, i.e. _elementsRead > numElementsForceSpillThreshold. The forced-spill threshold can be controlled by setting spark.shuffle.spill.numElementsForceSpillThreshold in SparkConf.
private def maybeSpillCollection(usingMap: Boolean): Unit = {
var estimatedSize = 0L
if (usingMap) {
// Estimate the in-memory size of the map
estimatedSize = map.estimateSize()
if (maybeSpill(map, estimatedSize)) {
// If the in-memory data has been spilled to disk, reset the map
map = new PartitionedAppendOnlyMap[K, C]
}
} else {
// Estimate the in-memory size of the buffer
estimatedSize = buffer.estimateSize()
if (maybeSpill(buffer, estimatedSize)) {
// Same as for the map
buffer = new PartitionedPairBuffer[K, C]
}
}
if (estimatedSize > _peakMemoryUsedBytes) {
_peakMemoryUsedBytes = estimatedSize
}
}
protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
var shouldSpill = false
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
val amountToRequest = 2 * currentMemory - myMemoryThreshold
val granted = acquireMemory(amountToRequest)
myMemoryThreshold += granted
shouldSpill = currentMemory >= myMemoryThreshold
}
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
if (shouldSpill) {
_spillCount += 1
logSpillage(currentMemory)
// Spill to disk
spill(collection)
_elementsRead = 0
_memoryBytesSpilled += currentMemory
releaseMemory()
}
shouldSpill
}
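A worked example with made-up numbers may help illustrate the first branch of maybeSpill:

// Illustrative numbers only (not from Spark): how the memory check in maybeSpill plays out.
val currentMemory     = 7L * 1024 * 1024          // estimated size of the map/buffer: 7 MB
var myMemoryThreshold = 5L * 1024 * 1024          // memory granted to this sorter so far: 5 MB
val amountToRequest   = 2 * currentMemory - myMemoryThreshold // ask the memory manager for 9 MB more
val granted           = 1L * 1024 * 1024          // suppose only 1 MB is actually granted
myMemoryThreshold += granted                      // threshold grows to 6 MB
val shouldSpill = currentMemory >= myMemoryThreshold // 7 MB >= 6 MB, so we spill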
The spill-to-disk process
override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
// Sort the in-memory data using the TimSort algorithm
val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
// Write the in-memory data to disk
val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
// Append the result to the spills array
spills += spillFile
}
To summarize, insertAll writes records into the in-memory structure (PartitionedPairBuffer / PartitionedAppendOnlyMap) while checking whether the spill conditions have been reached; each spill produces one temporary file on disk.
Reading a SpilledFile
A SpilledFile's data is sorted by (partitionId, recordKey), and for each partition we have recorded how many records it contains, so fetching the data of a given partition from a SpilledFile is straightforward.
SpilledFiles are read by the SpillReader class.
The merge process
private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
: Iterator[(Int, Iterator[Product2[K, C]])] = {
val readers = spills.map(new SpillReader(_))
val inMemBuffered = inMemory.buffered
(0 until numPartitions).iterator.map { p =>
val inMemIterator = new IteratorForPartition(p, inMemBuffered)
val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
if (aggregator.isDefined) {
(p, mergeWithAggregation(
iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
} else if (ordering.isDefined) {
(p, mergeSort(iterators, ordering.get))
} else {
(p, iterators.iterator.flatten)
}
}
}
The merge step is relatively involved: it depends on whether the current shuffle defines an aggregator and/or an ordering. Below we analyze each of these cases.
no aggregator or sorter
partitionBy
case class TestIntKey(i: Int)
val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", 4.toString)
val sc = new SparkContext(conf)
val testData = (1 to 100).toList
sc.parallelize(testData, 1)
.map(x => {
(TestIntKey(x % 3), x)
}).partitionBy(new HashPartitioner(3)).collect()
no aggregator but sorter
This case is easy to get confused about: it is tempting to think that sortByKey is the "no aggregator but sorter" case. However, when SortShuffleWriter initializes the ExternalSorter it passes ordering = None, as the following code shows:
sorter = if (dep.mapSideCombine) {
...
} else {
new ExternalSorter[K, V, V](
context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}
NOTE: the ordering logic of sortByKey is deferred to the Shuffle Read phase, which we will cover in a later article.
Still, let's take a quick look at the implementation of mergeSort. Within a SpilledFile, each partition's data is already sorted by record key, so we only need to take each SpilledFile's iterator for that partition and merge them head-to-head with a priority queue:
private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
: Iterator[Product2[K, C]] =
{
// NOTE: (fchen) put all iterators for this partition into a priority queue; on each read, compare the heads of the iterators and take the smallest
val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
type Iter = BufferedIterator[Product2[K, C]]
val heap = new mutable.PriorityQueue[Iter]()(new Ordering[Iter] {
override def compare(x: Iter, y: Iter): Int = -comparator.compare(x.head._1, y.head._1)
})
heap.enqueue(bufferedIters: _*) // Will contain only the iterators with hasNext = true
new Iterator[Product2[K, C]] {
override def hasNext: Boolean = !heap.isEmpty
override def next(): Product2[K, C] = {
if (!hasNext) {
throw new NoSuchElementException
}
val firstBuf = heap.dequeue()
val firstPair = firstBuf.next()
if (firstBuf.hasNext) {
// Put the iterator back into the priority queue
heap.enqueue(firstBuf)
}
firstPair
}
}
}
Let's walk through the whole mergeSort process with the example below. A partition's data that is scattered across multiple SpilledFiles becomes, after mergeSort, a single iterator sorted by record key.
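This is a standalone sketch (plain Scala, not Spark code) that mimics the priority-queue merge above on three already-sorted "spill" iterators for one partition:

import scala.collection.mutable

object MergeSortDemo {
  def main(args: Array[String]): Unit = {
    // Three sorted iterators stand in for one partition's data in three SpilledFiles.
    val spills = Seq(Iterator(1, 4, 7), Iterator(2, 5, 8), Iterator(3, 6, 9))
    // Dequeue the iterator whose head is smallest (hence the negated ordering).
    val heap = mutable.PriorityQueue.empty[BufferedIterator[Int]](
      Ordering.by((it: BufferedIterator[Int]) => -it.head))
    heap.enqueue(spills.map(_.buffered).filter(_.hasNext): _*)
    val merged = new Iterator[Int] {
      override def hasNext: Boolean = heap.nonEmpty
      override def next(): Int = {
        val it = heap.dequeue()
        val v = it.next()
        if (it.hasNext) heap.enqueue(it) // put the iterator back if it still has data
        v
      }
    }
    println(merged.toList) // List(1, 2, 3, 4, 5, 6, 7, 8, 9)
  }
}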
aggregator, but no sorter
reduceByKey
if (!totalOrder) {
new Iterator[Iterator[Product2[K, C]]] {
val sorted = mergeSort(iterators, comparator).buffered
// Buffers reused across elements to decrease memory allocation
val keys = new ArrayBuffer[K]
val combiners = new ArrayBuffer[C]
override def hasNext: Boolean = sorted.hasNext
override def next(): Iterator[Product2[K, C]] = {
if (!hasNext) {
throw new NoSuchElementException
}
keys.clear()
combiners.clear()
val firstPair = sorted.next()
keys += firstPair._1
combiners += firstPair._2
val key = firstPair._1
while (sorted.hasNext && comparator.compare(sorted.head._1, key) == 0) {
val pair = sorted.next()
var i = 0
var foundKey = false
while (i < keys.size && !foundKey) {
if (keys(i) == pair._1) {
combiners(i) = mergeCombiners(combiners(i), pair._2)
foundKey = true
}
i += 1
}
if (!foundKey) {
keys += pair._1
combiners += pair._2
}
}
keys.iterator.zip(combiners.iterator)
}
}.flatMap(i => i)
}
At this point we might wonder why an ArrayBuffer is needed to store the keys. The reason is that without a total ordering the keys are only compared by their hash, so different keys with the same hash (a hash collision) look "equal" in the sorted stream; the ArrayBuffer keeps every distinct key in such a collision group so that each key's values are combined separately.
reduceByKey Example:
val conf = new SparkConf()
conf.setMaster("local[3]")
conf.setAppName("shuffle debug")
conf.set("spark.shuffle.sort.bypassMergeThreshold", "0")
conf.set("spark.shuffle.spill.numElementsForceSpillThreshold", (4).toString)
val sc = new SparkContext(conf)
val testData = (1 to 10).toList
val keys = Array("Aa", "BB")
val count = sc.parallelize(testData, 1)
.map(x => {
(keys(x % 2), x)
}).reduceByKey(_ + _, 3).collectPartitions().foreach(x => {
x.foreach(y => {
println(y._1 + "," + y._2)
})
})
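The key choice in this example is deliberate: "Aa" and "BB" are a classic Java hashCode collision, so they compare as equal under the hash-based comparator and exercise the collision-handling branch above.

// "Aa" and "BB" collide: both hash to 2112, so the hash comparator treats them as equal keys.
println("Aa".hashCode) // 2112
println("BB".hashCode) // 2112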
The diagram below illustrates the whole mergeWithAggregation process for reduceByKey when a hash collision occurs.
aggregator and sorter
Although this branch exists in the code, I have not found an operation that defines both an aggregator and an ordering, so we will only skim this logic here.
Merging the SpilledFiles
After the per-partition merge, the data and index files can be written. The writing process is the same as in BypassMergeSortShuffleWriter, so we won't explain it again here.
private[this] case class SpilledFile(
file: File,
blockId: BlockId,
serializerBatchSizes: Array[Long],
elementsPerPartition: Array[Long])
SortShuffleWriter summary
Records are serialized twice: once when writing the SpilledFiles and once when merging the SpilledFiles.
UnsafeShuffleWriter
The two approaches above do the shuffle write on the JVM heap. Their drawback is obvious: with large objects, JVM garbage collection performs poorly. This gave rise to an off-heap shuffle write: UnsafeShuffleWriter.
At a high level, UnsafeShuffleWriter is designed much like SortShuffleWriter: the map-side data is sorted by the reduce-side partitionId, and once a limit is exceeded the in-memory records are spilled to disk. Finally these spill files are merged into a single map output file, and each partition's offset is recorded.
Having seen the two on-heap shuffle-write models above, we already know the overall shape of the write path; what changes in UnsafeShuffleWriter is how records are stored and sorted in memory.
Prerequisites
The memory page management model
Implementation details
Before diving into UnsafeShuffleWriter, let's cover some basics, starting with the PackedRecordPointer class.
final class PackedRecordPointer {
...
public static long packPointer(long recordPointer, int partitionId) {
final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
return (((long) partitionId) << 40) | compressedAddress;
}
private long packedRecordPointer;
public void set(long packedRecordPointer) {
this.packedRecordPointer = packedRecordPointer;
}
public int getPartitionId() {
return (int) ((packedRecordPointer & MASK_LONG_UPPER_24_BITS) >>> 40);
}
public long getRecordPointer() {
final long pageNumber = (packedRecordPointer << 24) & MASK_LONG_UPPER_13_BITS;
final long offsetInPage = packedRecordPointer & MASK_LONG_LOWER_27_BITS;
return pageNumber | offsetInPage;
}
}
PackedRecordPointer stores the partitionId, pageNumber, and offsetInPage in a single long. A long is 64 bits, and from the code we can see the layout:
[ 24 bit partitionId ][ 13 bit pageNumber ][ 27 bit offset in page ]
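A minimal sketch of that bit packing (using my own descriptive constants, not Spark's actual masks) shows how the three fields fit into one long and come back out:

object PackedPointerDemo {
  // Assumed field widths matching the layout above: 24 + 13 + 27 = 64 bits.
  val OffsetBits = 27
  val PageBits   = 13

  def pack(partitionId: Int, pageNumber: Long, offsetInPage: Long): Long =
    (partitionId.toLong << (PageBits + OffsetBits)) |
      (pageNumber << OffsetBits) |
      offsetInPage

  def partitionId(p: Long): Int   = (p >>> (PageBits + OffsetBits)).toInt
  def pageNumber(p: Long): Long   = (p >>> OffsetBits) & ((1L << PageBits) - 1)
  def offsetInPage(p: Long): Long = p & ((1L << OffsetBits) - 1)

  def main(args: Array[String]): Unit = {
    val packed = pack(partitionId = 7, pageNumber = 3, offsetInPage = 1024)
    // Prints (7,3,1024): each field survives the round trip.
    println((partitionId(packed), pageNumber(packed), offsetInPage(packed)))
  }
}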
The insertRecord method:
public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
throws IOException {
// Spill if the number of records held in memory exceeds the forced-spill threshold
if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
spill();
}
growPointerArrayIfNecessary();
// Need 4 bytes to store the record length.
final int required = length + 4;
acquireNewPageIfNecessary(required);
assert(currentPage != null);
final Object base = currentPage.getBaseObject();
final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
Platform.putInt(base, pageCursor, length);
pageCursor += 4;
Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
pageCursor += length;
inMemSorter.insertRecord(recordAddress, partitionId);
}
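To visualize what insertRecord writes into a page, here is a toy, pure-JVM illustration (a ByteBuffer stands in for an off-heap page, which is an assumption of this sketch): each record is laid out as a 4-byte length followed by the record bytes, and the sorter keeps only a packed pointer to that position.

import java.nio.ByteBuffer

object PageLayoutDemo {
  def main(args: Array[String]): Unit = {
    val page = ByteBuffer.allocate(64)        // stand-in for one memory page
    val record = "hello".getBytes("UTF-8")

    val recordOffset = page.position()        // this is what the packed pointer remembers
    page.putInt(record.length)                // 4-byte length prefix, as in insertRecord
    page.put(record)                          // the serialized record bytes

    // Reading the record back through the "pointer":
    val len = page.getInt(recordOffset)
    val out = new Array[Byte](len)
    ByteBuffer.wrap(page.array(), recordOffset + 4, len).get(out)
    println(new String(out, "UTF-8"))         // hello
  }
}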
The spill process is essentially writing a file, i.e. calling writeSortedFile:
private void writeSortedFile(boolean isLastFile) {
...
// Sort the inMemSorter entries (the PackedRecordPointers) by partitionId
final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
inMemSorter.getSortedIterator();
final byte[] writeBuffer = new byte[diskWriteBufferSize];
final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
blockManager.diskBlockManager().createTempShuffleBlock();
final File file = spilledFileInfo._2();
final TempShuffleBlockId blockId = spilledFileInfo._1();
final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);
final SerializerInstance ser = DummySerializerInstance.INSTANCE;
final DiskBlockObjectWriter writer =
blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);
int currentPartition = -1;
while (sortedRecords.hasNext()) {
sortedRecords.loadNext();
final int partition = sortedRecords.packedRecordPointer.getPartitionId();
if (partition != currentPartition) {
// Switch to the new partition
if (currentPartition != -1) {
final FileSegment fileSegment = writer.commitAndGet();
spillInfo.partitionLengths[currentPartition] = fileSegment.length();
}
currentPartition = partition;
}
final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
final Object recordPage = taskMemoryManager.getPage(recordPointer);
final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
int dataRemaining = Platform.getInt(recordPage, recordOffsetInPage);
long recordReadPosition = recordOffsetInPage + 4; // skip over record length
while (dataRemaining > 0) {
final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
Platform.copyMemory(
recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
writer.write(writeBuffer, 0, toTransfer);
recordReadPosition += toTransfer;
dataRemaining -= toTransfer;
}
writer.recordWritten();
}
final FileSegment committedSegment = writer.commitAndGet();
writer.close();
if (currentPartition != -1) {
spillInfo.partitionLengths[currentPartition] = committedSegment.length();
spills.add(spillInfo);
}
}
The diagram below illustrates how the data flows through memory during this process.
Since UnsafeShuffleWriter performs no aggregation and no sorting by key, merging one partition's data from multiple temporary files is simple, because we have recorded each partition's offset:
private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
...
if (spills.length == 0) {
java.nio.file.Files.newOutputStream(outputFile.toPath()).close(); // Create an empty file
return new long[partitioner.numPartitions()];
} else if (spills.length == 1) {
Files.move(spills[0].file, outputFile);
return spills[0].partitionLengths;
} else {
final long[] partitionLengths;
if (fastMergeEnabled && fastMergeIsSupported) {
if (transferToEnabled && !encryptionEnabled) {
logger.debug("Using transferTo-based fast merge");
partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
} else {
logger.debug("Using fileStream-based fast merge");
partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
}
}
}
...
}
Comparison with SortShuffleWriter
- Data is kept in off-heap memory, reducing GC overhead.
- Merging the spill files does not require deserializing them.
Trigger conditions
Let's first look at how SortShuffleManager decides which ShuffleWriter to use:
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int,
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
new BypassMergeSortShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
// Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
new SerializedShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else {
// Otherwise, buffer map outputs in a deserialized form:
new BaseShuffleHandle(shuffleId, numMaps, dependency)
}
}
Bypass trigger conditions
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
// We cannot bypass sorting if we need to do map-side aggregation.
if (dep.mapSideCombine) {
require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
false
} else {
val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
dep.partitioner.numPartitions <= bypassMergeThreshold
}
}
1. The number of reduce-side partitions is no greater than spark.shuffle.sort.bypassMergeThreshold (default 200).
2. There is no map-side combine.
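The threshold is an ordinary SparkConf setting. Illustrative only: the examples earlier in this post set it to 0 precisely to push the job off the bypass path, while raising it has the opposite effect.

// Illustrative: control which writer a no-combine shuffle can take.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.sort.bypassMergeThreshold", "0")    // never bypass (as in the demos above)
// .set("spark.shuffle.sort.bypassMergeThreshold", "400") // bypass for up to 400 reduce partitions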
UnsafeShuffleWriter trigger conditions
def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
val shufId = dependency.shuffleId
val numPartitions = dependency.partitioner.numPartitions
if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
s"${dependency.serializer.getClass.getName}, does not support object relocation")
false
} else if (dependency.aggregator.isDefined) {
log.debug(
s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
false
} else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
false
} else {
log.debug(s"Can use serialized shuffle for shuffle $shufId")
true
}
}
1. The Serializer supports relocation of serialized objects (for example, KryoSerializer does).
2. There is no map-side combine (no aggregator is defined).
3. The number of reduce-side partitions is no greater than MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (2^24 = 16,777,216).
SortShuffleWriter trigger conditions
If neither of the above writers can be used, SortShuffleWriter is chosen.
Key points
- Why merge the intermediate shuffle files? To reduce the number of file handles needed when reading. As we saw, one map partition produces as many temporary files as there are reduce partitions; when the reduce count is very large, the executor has to maintain a huge number of file handles. The old HashShuffleWriter implementation suffered from having to read far too many files (see the back-of-the-envelope numbers below).
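A back-of-the-envelope illustration with made-up numbers:

// Made-up numbers: file counts with and without merging the per-reducer files.
val m = 1000                      // map tasks
val r = 1000                      // reduce partitions
val hashStyleFiles = m * r        // 1,000,000 small files if nothing is merged
val sortStyleFiles = m * 2        // 2,000 files: one .data plus one .index per map task
println(s"$hashStyleFiles vs $sortStyleFiles")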
Notes
- This article is based on the latest master branch at the time of writing; Spark keeps evolving, so readers should continue the analysis against the current code as Spark moves on.
- Every source snippet in this article keeps only the key code and omits most code that is irrelevant to the analysis; that does not mean the omitted code is unimportant.
Summary
1. A ShuffleWriter always produces files on disk.
2. At a high level, the ShuffleWriter process groups the map-side data for the reduce side according to the Partitioner, finally writes the data into a single data file, and records each partition's offset to prepare for the reduce-side read.
- Future work
[SPARK-7271] Redesign shuffle interface for binary processing