Feel free to repost this article; please credit the original source.
Overview
One of the major improvements to Spark core in Spark 1.1 is the introduction of the sort-based shuffle mechanism. This article takes a first look at how that mechanism is implemented.
A First Look at Sort-based Shuffle
Let's run a small experiment to get an intuitive feel for which intermediate files the sort-based shuffle produces. The steps are as follows.
Step 1: Edit conf/spark-defaults.conf and add the following line
spark.shuffle.manager SORT
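The same switch can also be made programmatically in a standalone application (spark-shell already creates sc, so there it is simplest to stick with the conf file). A minimal sketch, assuming the property is read the same way as it is from spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone setup: select the sort-based shuffle via SparkConf
// instead of conf/spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("sort-shuffle-demo")
  .set("spark.shuffle.manager", "SORT")   // switch the shuffle implementation
val sc = new SparkContext(conf)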
Step 2: Launch spark-shell
SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-shell
Step 3: Run wordcount
sc.textFile("README.md").flatMap(l => l.split(" ")).map(w=>(w,1)).reduceByKey(_ + _).collect
Step 4: Inspect the generated intermediate files
find /tmp/spark-local* -type f
The files found are listed below
/tmp/spark-local-20140919091822-aa66/0f/shuffle_0_1_0.index
/tmp/spark-local-20140919091822-aa66/30/shuffle_0_0_0.index
/tmp/spark-local-20140919091822-aa66/0c/shuffle_0_0_0.data
/tmp/spark-local-20140919091822-aa66/15/shuffle_0_1_0.data
As you can see, files with two different suffixes are generated, data and index; their purposes will be explained in detail later in the analysis.
For comparison, switch the shuffle mode to hash and look at the generated files again; the difference is obvious. Change SORT to HASH in the configuration file, restart spark-shell, run the same wordcount, and the files found under /tmp are listed below.
/tmp/spark-local-20140919092949-14cc/10/shuffle_0_1_3
/tmp/spark-local-20140919092949-14cc/0f/shuffle_0_1_2
/tmp/spark-local-20140919092949-14cc/0f/shuffle_0_0_3
/tmp/spark-local-20140919092949-14cc/0c/shuffle_0_0_0
/tmp/spark-local-20140919092949-14cc/0d/shuffle_0_1_0
/tmp/spark-local-20140919092949-14cc/0d/shuffle_0_0_1
/tmp/spark-local-20140919092949-14cc/0e/shuffle_0_1_1
/tmp/spark-local-20140919092949-14cc/0e/shuffle_0_0_2
The number of files generated differs dramatically between the two modes; the exact counts work out as follows
- In HASH mode, each shuffle produces M*R files. In the wordcount example above the job contains a single shuffle; the input file is split into 2 partitions by default, so M = 2, and spark.default.parallelism is configured as 4, so R = 4, giving 1*2*4 = 8 files in total. A name such as shuffle_0_1_2 decodes as shuffle + shuffle_id + map_id + reduce_id, so 0_1_2 is the file produced by map task 1 of shuffle 0, whose contents will be consumed by reduce task 2
- In SORT mode, each map task produces exactly one data file, no matter how many reducers will consume it, so the file count is simply M. Since the wordcount input has 2 partitions by default, only two data files are produced (the short snippet after this list just restates the arithmetic)
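To make the comparison concrete, here is a throwaway Scala snippet (not Spark code; the helper names are made up) that reproduces the arithmetic from the two bullets above:

// File-count arithmetic for the two shuffle modes (illustration only).
def hashShuffleFiles(shuffles: Int, m: Int, r: Int): Int = shuffles * m * r
def sortShuffleDataFiles(mapTasksPerShuffle: Seq[Int]): Int = mapTasksPerShuffle.sum

println(hashShuffleFiles(1, 2, 4))     // 8 files, matching the HASH listing above
println(sortShuffleDataFiles(Seq(2)))  // 2 data files (plus 2 index files) in SORT mode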
Multiple shuffles
The example above contains only one shuffle. With a small change we can trigger two shuffles; the code is shown below
sc.textFile("README.md").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _).map(p=>(p._2,p._1)).groupByKey.collect
The code above uses map to invert the result of reduceByKey, turning each (w, count) into (count, w), and then groups the words by their occurrence count. The groupByKey call triggers a second shuffle.
The files produced in HASH mode are listed below
/tmp/spark-local-20140919094531-1cb6/12/shuffle_0_3_3
/tmp/spark-local-20140919094531-1cb6/0c/shuffle_0_0_0
/tmp/spark-local-20140919094531-1cb6/11/shuffle_0_2_3
/tmp/spark-local-20140919094531-1cb6/11/shuffle_0_3_2
/tmp/spark-local-20140919094531-1cb6/11/shuffle_1_1_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_2_2
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_1_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_0_3_1
/tmp/spark-local-20140919094531-1cb6/10/shuffle_1_0_3
/tmp/spark-local-20140919094531-1cb6/10/shuffle_1_1_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_0_3
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_3_0
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_2_1
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_0_1_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_1_0_2
/tmp/spark-local-20140919094531-1cb6/0f/shuffle_1_1_1
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_0_0_1
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_0_1_0
/tmp/spark-local-20140919094531-1cb6/0d/shuffle_1_0_0
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_2_0
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_1_1
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_0_0_2
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_1_0_1
/tmp/spark-local-20140919094531-1cb6/0e/shuffle_1_1_0
Introducing one more shuffle produces a large number of additional intermediate files.
What happens with SORT instead? Only M more data files are added. Since the new shuffle has 4 map tasks, the total comes to 2 + 4 = 6 data files (each with a matching index file), as listed below.
/tmp/spark-local-20140919094731-034a/29/shuffle_0_3_0.data
/tmp/spark-local-20140919094731-034a/30/shuffle_0_0_0.index
/tmp/spark-local-20140919094731-034a/15/shuffle_0_1_0.data
/tmp/spark-local-20140919094731-034a/36/shuffle_0_2_0.data
/tmp/spark-local-20140919094731-034a/0c/shuffle_0_0_0.data
/tmp/spark-local-20140919094731-034a/32/shuffle_0_2_0.index
/tmp/spark-local-20140919094731-034a/32/shuffle_1_1_0.index
/tmp/spark-local-20140919094731-034a/0f/shuffle_0_1_0.index
/tmp/spark-local-20140919094731-034a/0f/shuffle_1_0_0.index
/tmp/spark-local-20140919094731-034a/0a/shuffle_1_1_0.data
/tmp/spark-local-20140919094731-034a/2b/shuffle_1_0_0.data
/tmp/spark-local-20140919094731-034a/0d/shuffle_0_3_0.index
It is worth pointing out the execution order of shuffle_0 and shuffle_1: the higher the number, the earlier it runs. Because a Spark job is submitted by tracing the DAG backwards from the final stage, the shuffle ids are assigned in reverse, so shuffle 0 is the last to execute while the higher-numbered shuffles run before it.
The Design Idea Behind Sort-based Shuffle
The guiding principle of sort-based shuffle is that each map task ultimately produces a single shuffle file. How, then, does a downstream reduce task find its own partition inside that one file? This is where the new file type, the index file, comes in.
The concrete steps are:
- After reading its input partition, a map task writes its computed results into an ExternalSorter
- The ExternalSorter keeps incoming results in a map, bucketed by partition; if a combine operation is defined, each new value is merged with the value already stored for that key
- If the memory used by this map exceeds the configured threshold, its contents are spilled to disk, with each spill producing a separate file
- Once all the data of the input partition has been processed, part of the result may still sit in memory while the rest lives in one or more spill files; a merge step then combines the in-memory data and the spill files into a single data file
- Finally, the start and end offset of each partition inside that data file are written to the index file (a small standalone sketch of this layout follows the list)
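To illustrate the role of the index file, here is a small self-contained sketch of the layout these steps imply (and that the getBlockData code shown later relies on). It assumes the index file is nothing more than numPartitions + 1 cumulative offsets stored as 8-byte longs; writeIndex and readPartitionRange are hypothetical helpers for illustration, not Spark APIs.

import java.io.{DataInputStream, DataOutputStream, File, FileInputStream, FileOutputStream}

// Write the index: offsets delimiting each partition inside the single data file.
def writeIndex(indexFile: File, partitionLengths: Array[Long]): Unit = {
  val out = new DataOutputStream(new FileOutputStream(indexFile))
  try {
    var offset = 0L
    out.writeLong(offset)                 // partition 0 starts at offset 0
    for (length <- partitionLengths) {    // accumulate lengths into absolute offsets
      offset += length
      out.writeLong(offset)
    }
  } finally {
    out.close()
  }
}

// Read back the (start, end) byte range belonging to one reducer.
def readPartitionRange(indexFile: File, reduceId: Int): (Long, Long) = {
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    in.skip(reduceId * 8)                 // each offset is one 8-byte long
    (in.readLong(), in.readLong())        // boundaries of this reducer's slice
  } finally {
    in.close()
  }
}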
The relevant source files
- SortShuffleManager.scala
- SortShuffleWriter.scala
- ExternalSorter.scala
- IndexShuffleBlockManager.scala
A few important functions
SortShuffleWriter.write
override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
  if (dep.mapSideCombine) {
    if (!dep.aggregator.isDefined) {
      throw new IllegalStateException("Aggregator is empty for map-side combine")
    }
    sorter = new ExternalSorter[K, V, C](
      dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    sorter.insertAll(records)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    sorter = new ExternalSorter[K, V, V](
      None, Some(dep.partitioner), None, dep.serializer)
    sorter.insertAll(records)
  }

  val outputFile = shuffleBlockManager.getDataFile(dep.shuffleId, mapId)
  val blockId = shuffleBlockManager.consolidateId(dep.shuffleId, mapId)
  val partitionLengths = sorter.writePartitionedFile(blockId, context, outputFile)
  shuffleBlockManager.writeIndexFile(dep.shuffleId, mapId, partitionLengths)

  mapStatus = new MapStatus(blockManager.blockManagerId,
    partitionLengths.map(MapOutputTracker.compressSize))
}
ExternalSorter.insertAll
def insertAll(records: Iterator[_ <: Product2[K, V]]): Unit = {
  val shouldCombine = aggregator.isDefined

  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    val mergeValue = aggregator.get.mergeValue
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      elementsRead += 1
      kv = records.next()
      map.changeValue((getPartition(kv._1), kv._1), update)
      maybeSpill(usingMap = true)
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      elementsRead += 1
      val kv = records.next()
      buffer.insert((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
      maybeSpill(usingMap = false)
    }
  }
}
writePartitionedFile merges the in-memory data and the contents of the spill files into a single file
def writePartitionedFile(
    blockId: BlockId,
    context: TaskContext,
    outputFile: File): Array[Long] = {

  // Track location of each range in the output file
  val lengths = new Array[Long](numPartitions)

  if (bypassMergeSort && partitionWriters != null) {
    // We decided to write separate files for each partition, so just concatenate them. To keep
    // this simple we spill out the current in-memory collection so that everything is in files.
    spillToPartitionFiles(if (aggregator.isDefined) map else buffer)
    partitionWriters.foreach(_.commitAndClose())
    var out: FileOutputStream = null
    var in: FileInputStream = null
    try {
      out = new FileOutputStream(outputFile)
      for (i <- 0 until numPartitions) {
        in = new FileInputStream(partitionWriters(i).fileSegment().file)
        val size = org.apache.spark.util.Utils.copyStream(in, out, false)
        in.close()
        in = null
        lengths(i) = size
      }
    } finally {
      if (out != null) {
        out.close()
      }
      if (in != null) {
        in.close()
      }
    }
  } else {
    // Either we're not bypassing merge-sort or we have only in-memory data; get an iterator by
    // partition and just write everything directly.
    for ((id, elements) <- this.partitionedIterator) {
      if (elements.hasNext) {
        val writer = blockManager.getDiskWriter(
          blockId, outputFile, ser, fileBufferSize, context.taskMetrics.shuffleWriteMetrics.get)
        for (elem <- elements) {
          writer.write(elem)
        }
        writer.commitAndClose()
        val segment = writer.fileSegment()
        lengths(id) = segment.length
      }
    }
  }

  context.taskMetrics.memoryBytesSpilled += memoryBytesSpilled
  context.taskMetrics.diskBytesSpilled += diskBytesSpilled

  lengths
}
On the read side, IndexShuffleBlockManager is used to find where each partition lives inside the data file
override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // The block is actually going to be a range of a single map output file for this map, so
  // find out the consolidated file, then the offset within that from our index
  val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    in.skip(blockId.reduceId * 8)
    val offset = in.readLong()
    val nextOffset = in.readLong()
    new FileSegmentManagedBuffer(
      getDataFile(blockId.shuffleId, blockId.mapId),
      offset,
      nextOffset - offset)
  } finally {
    in.close()
  }
}
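To tie this back to the first experiment, the following spark-shell snippet dumps the offsets stored in one of the generated index files. It only assumes the plain sequence-of-longs layout that getBlockData reads above; replace the path with whatever find reported on your machine.

import java.io.{DataInputStream, File, FileInputStream}

// Path taken from the first experiment's listing; substitute your own.
val path = "/tmp/spark-local-20140919091822-aa66/0f/shuffle_0_1_0.index"
val numOffsets = (new File(path).length / 8).toInt   // the file is a sequence of 8-byte longs
val in = new DataInputStream(new FileInputStream(path))
try {
  // Prints the cumulative offsets, i.e. the partition boundaries inside shuffle_0_1_0.data
  for (_ <- 0 until numOffsets) println(in.readLong())
} finally {
  in.close()
}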