Based on Spark 1.4.
The shuffle in Spark works roughly like this: the map side writes the map output to local files, and the reduce side reads those files and then performs the reduce operation.
So, the question is:
How does a reducer know where its input is?
First, after a mapper finishes writing its files, it can certainly provide information about its output. In Spark, this information is represented by MapStatus:
private[spark] sealed trait MapStatus {
  def location: BlockManagerId
  def getSizeForBlock(reduceId: Int): Long
}
When a ShuffleMapTask completes, its MapStatus is passed back to the driver as the task result. ShuffleMapTask's runTask method is declared as:
override def runTask(context: TaskContext): MapStatus
Once a reducer obtains the MapStatuses relevant to it from the driver, it knows which BlockManagers hold the map output it needs.
But several questions remain:
1. What does the driver do with the MapStatus it receives?
2. How does the reducer obtain the MapStatus?
3. How does the reducer fetch the map output based on the MapStatus?
What does the driver do with the MapStatus it receives?
First, the executor passes the MapStatus to the driver as the task result, via the statusUpdate method:
override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}
After the DriverEndpoint receives the StatusUpdate, it calls the TaskScheduler's statusUpdate method:
case StatusUpdate(executorId, taskId, state, data) =>
  scheduler.statusUpdate(taskId, state, data.value)
Then, through a long call chain, DAGScheduler's handleTaskCompletion method is eventually invoked; it pattern-matches on the task's type:
case smt: ShuffleMapTask =>
After the match, a number of operations run; the shuffle-related ones include the following:
val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
updateAccumulators(event)
val status = event.result.asInstanceOf[MapStatus]
val execId = status.location.executorId
if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
  logInfo("Ignoring possibly bogus ShuffleMapTask completion from " + execId)
} else {
  shuffleStage.addOutputLoc(smt.partitionId, status)
}
The key point is that the output location is added to the ShuffleMapStage's outputLocs, an array of MapStatus lists held by the ShuffleMapStage. When all tasks of this stage have completed, the MapStatuses of all of them are registered with the MapOutputTracker:
mapOutputTracker.registerMapOutputs(
  shuffleStage.shuffleDep.shuffleId,
  shuffleStage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
  changeEpoch = true)
Like the BlockManager, the MapOutputTracker has a master-worker structure: workers ask the master for information via RPC.
In this way, MapStatus information travels from the executors to the driver and is finally registered with the MapOutputTracker.
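To illustrate the master-worker arrangement, here is a minimal sketch with hypothetical names (SketchTrackerMaster/SketchTrackerWorker are stand-ins, not Spark's MapOutputTrackerMaster and MapOutputTrackerWorker, and String stands in for MapStatus): the master holds the registry from shuffleId to the status array, and a worker caches what it has already fetched, falling back to an RPC-style call to the master.

import scala.collection.concurrent.TrieMap

// Hypothetical sketch of the master side: shuffleId -> statuses of all maps.
class SketchTrackerMaster {
  private val registry = TrieMap[Int, Array[String]]()

  def registerMapOutputs(shuffleId: Int, statuses: Array[String]): Unit =
    registry(shuffleId) = statuses

  def getStatuses(shuffleId: Int): Array[String] =
    registry.getOrElse(shuffleId, Array.empty)
}

// Hypothetical sketch of the worker side: cache first, then "RPC" to the master.
class SketchTrackerWorker(master: SketchTrackerMaster) {
  private val cache = TrieMap[Int, Array[String]]()

  def getStatuses(shuffleId: Int): Array[String] =
    cache.getOrElseUpdate(shuffleId, master.getStatuses(shuffleId))
}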
How does the reducer obtain the MapStatus?
First, transformations that trigger a shuffle generate special RDDs such as ShuffledRDD and CoGroupedRDD. When these RDDs' compute methods are called, the reduce process is set in motion.
Let's take ShuffledRDD as the example again.
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}
Currently, the shuffleManager's getReader method only returns readers of type HashShuffleReader, the sole subclass of ShuffleReader.
Its read method calls BlockStoreShuffleFetcher's fetch method to obtain the map output:
val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context, ser)
This fetch method asks the MapOutputTracker for the locations and sizes of the map outputs; the MapOutputTracker's getServerStatuses method returns the map output information relevant to this reducer.
// statuses: Array[(BlockManagerId, Long)]
// Fetch the locations and sizes of the map outputs for this shuffleId and reduceId
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
How does the reducer fetch the map output based on the MapStatus?
The type of statuses is Array[(BlockManagerId, Long)]; these are exactly the two pieces of information a MapStatus provides.
The fetch method uses the information in the retrieved MapStatuses to assemble ShuffleBlockIds:
val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
for (((address, size), index) <- statuses.zipWithIndex) {
  splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
}

val blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])] = splitsByAddress.toSeq.map {
  case (address, splits) =>
    (address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
}
Note that the statuses array contains an entry for every map's output, even for maps that produced nothing for this particular reduce. The entry at index i describes the output of the map whose mapId is i. That is why splitsByAddress is built using statuses.zipWithIndex, which recovers the mapId. Assembling blocksByAddress then produces the ShuffleBlockIds:
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}
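For example, given the name method above, the block holding map 3's output for reducer 1 in shuffle 2 would be named:

ShuffleBlockId(2, 3, 1).name  // "shuffle_2_3_1"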
This blocksByAddress is used to construct a ShuffleBlockFetcherIterator, which asks the BlockManagers for the corresponding shuffle blocks. Here is the code in the fetch method that constructs the ShuffleBlockFetcherIterator:
val blockFetcherItr = new ShuffleBlockFetcherIterator(
  context,
  SparkEnv.get.blockManager.shuffleClient,
  blockManager,
  blocksByAddress,
  serializer,
  // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
  SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)
ShuffleBlockFetcherIterator is an iterator whose primary constructor calls an initialize method. The main job of initialize is to build fetch requests for the shuffle blocks and send them.
So by the time a ShuffleBlockFetcherIterator object has been constructed, it has already sent many FetchRequests. And because sending requests and receiving results happen asynchronously, while the asynchronously fetched results must be exposed through a synchronous iterator, the implementation is somewhat involved (see the sketch after the code below).
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup())

  // Split blocks into local and remote ones
  val remoteRequests = splitLocalRemoteBlocks()
  // Randomize the remote requests and add them to the queue
  fetchRequests ++= Utils.randomize(remoteRequests)
  // Send out requests for remote blocks
  while (fetchRequests.nonEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))

  // Fetch the local blocks
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
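The core trick for wrapping asynchronous fetches in a synchronous iterator is a blocking results queue: fetch callbacks enqueue results as they arrive, and next() blocks until one is available. Below is a minimal sketch of this pattern with a hypothetical class, not the actual ShuffleBlockFetcherIterator code, which additionally tracks bytesInFlight, issues follow-up requests, and surfaces fetch failures.

import java.util.concurrent.LinkedBlockingQueue

// Sketch: async producers put fetch results on a blocking queue;
// the iterator's next() blocks until a result arrives.
class FetchResultIterator[T](expectedResults: Int) extends Iterator[T] {
  private val results = new LinkedBlockingQueue[T]()
  private var returned = 0

  // Called from asynchronous fetch callbacks.
  def offer(result: T): Unit = results.put(result)

  override def hasNext: Boolean = returned < expectedResults

  override def next(): T = {
    returned += 1
    results.take() // blocks until an asynchronous fetch completes
  }
}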
It distinguishes remote blocks from local ones. A local block is one managed by the current executor's own BlockManager, which can be determined by checking whether the block's BlockManagerId equals the local BlockManagerId.
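A self-contained sketch of that split follows, with simplified types (SketchBlockManagerId is a stand-in for BlockManagerId). The real splitLocalRemoteBlocks additionally packs remote blocks into FetchRequests capped at maxBytesInFlight / 5 each, so fetches can run against several nodes in parallel.

import scala.collection.mutable.ArrayBuffer

case class SketchBlockManagerId(executorId: String, host: String, port: Int)

def splitBlocks(
    blocksByAddress: Seq[(SketchBlockManagerId, Seq[(String, Long)])],
    localId: SketchBlockManagerId) = {
  val localBlocks = ArrayBuffer[String]()
  val remoteBlocks = ArrayBuffer[(SketchBlockManagerId, Seq[(String, Long)])]()
  for ((address, blockInfos) <- blocksByAddress) {
    // A block is local exactly when it belongs to this executor's own BlockManager.
    if (address.executorId == localId.executorId) localBlocks ++= blockInfos.map(_._1)
    else remoteBlocks += ((address, blockInfos))
  }
  (localBlocks, remoteBlocks)
}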
The fetchLocalBlocks process is simple: it just asks the local BlockManager:
val buf = blockManager.getBlockData(blockId)
Fetching remote blocks is trickier and needs help from a ShuffleClient:
shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
  new BlockFetchingListener { ... })
This shuffleClient is provided by the BlockManager. Its type, ShuffleClient, is the parent class of BlockTransferService, and there are two kinds of ShuffleClient:
private[spark] val shuffleClient = if (externalShuffleServiceEnabled) {
  val transConf = SparkTransportConf.fromSparkConf(conf, numUsableCores)
  new ExternalShuffleClient(transConf, securityManager,
    securityManager.isAuthenticationEnabled(), securityManager.isSaslEncryptionEnabled())
} else {
  blockTransferService
}
By default, the BlockTransferService kind of ShuffleClient is used, and it in turn comes in two flavors:
val blockTransferService =
  conf.get("spark.shuffle.blockTransferService", "netty").toLowerCase match {
    case "netty" =>
      new NettyBlockTransferService(conf, securityManager, numUsableCores)
    case "nio" =>
      new NioBlockTransferService(conf, securityManager)
  }
The default is NettyBlockTransferService. It starts a NettyBlockRpcServer that provides the block transfer service; the ShuffleClient reaches it via host and port.
After a chain of calls, this server receives a message of type OpenBlocks and handles it like this:
message match {
  case openBlocks: OpenBlocks =>
    val blocks: Seq[ManagedBuffer] =
      openBlocks.blockIds.map(BlockId.apply).map(blockManager.getBlockData)
    val streamId = streamManager.registerStream(blocks.iterator)
    logTrace(s"Registered streamId $streamId with ${blocks.size} buffers")
    responseContext.onSuccess(new StreamHandle(streamId, blocks.size).toByteArray)
Here it calls BlockDataManager's getBlockData method to obtain the blocks. BlockManager extends BlockDataManager and registers itself with the BlockTransferService.
This registration happens in BlockManager's initialize method:
def initialize(appId: String): Unit = {
  // Register this BlockManager with the BlockTransferService,
  // so the BlockTransferService can access blocks through it
  blockTransferService.init(this)
So ultimately BlockManager's getBlockData method is called:
override def getBlockData(blockId: BlockId): ManagedBuffer = {
  if (blockId.isShuffle) {
    shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
  } else {
    val blockBytesOpt = doGetLocal(blockId, asBlockResult = false)
      .asInstanceOf[Option[ByteBuffer]]
    if (blockBytesOpt.isDefined) {
      val buffer = blockBytesOpt.get
      new NioManagedBuffer(buffer)
    } else {
      throw new BlockNotFoundException(blockId.toString)
    }
  }
}
So for a ShuffleBlockId, it calls the ShuffleBlockResolver to obtain the block's data.
The ShuffleBlockResolver is an interesting piece of machinery. It acts as a translator between the ShuffleManager and the BlockManager; each shuffle implementation uses its own ShuffleBlockResolver.
Spark has two shuffle implementations, hash and sort, handled by HashShuffleManager and SortShuffleManager respectively, which in turn use FileShuffleBlockResolver and IndexShuffleBlockResolver respectively.
The hash approach has each map write one file per reduce, whereas with sort each map writes only one file. These different file layouts are produced by different ShuffleWriters, and each ShuffleWriter uses a different ShuffleBlockResolver to determine file structure and naming. The arithmetic below makes the contrast concrete.
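With M maps and R reducers (illustrative numbers, not from the source), non-consolidated hash shuffle writes one file per (map, reduce) pair, while sort shuffle writes one data file plus one index file per map, as described further below:

val (m, r) = (1000, 500)
val hashShuffleFiles = m * r  // 500000 files: one per (map, reduce) pair
val sortShuffleFiles = m * 2  // 2000 files: one data file + one index file per map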
The correspondence is:
HashShuffleManager -> HashShuffleWriter, both using FileShuffleBlockResolver. HashShuffleManager's getWriter method and related code:
private val fileShuffleBlockResolver = new FileShuffleBlockResolver(conf)
override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
    : ShuffleWriter[K, V] = {
  new HashShuffleWriter(
    shuffleBlockResolver, handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
}
SortShuffleManager -> SortShuffleWriter, both using IndexShuffleBlockResolver. SortShuffleManager's getWriter method and related code:
private val indexShuffleBlockResolver = new IndexShuffleBlockResolver(conf)

override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
    : ShuffleWriter[K, V] = {
  val baseShuffleHandle = handle.asInstanceOf[BaseShuffleHandle[K, V, _]]
  shuffleMapNumber.putIfAbsent(baseShuffleHandle.shuffleId, baseShuffleHandle.numMaps)
  new SortShuffleWriter(
    shuffleBlockResolver, baseShuffleHandle, mapId, context)
}
The difference between these two ShuffleBlockResolvers captures exactly how a reducer reads map output files differently under the hash and sort shuffle.
How hash and sort shuffle differ when reading map output files
HashShuffleManager uses FileShuffleBlockResolver. Its getBlockData method behaves differently depending on whether consolidated shuffle files are enabled; consolidation is disabled by default, in which case it runs:
override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  if (consolidateShuffleFiles) {
    ...
  } else {
    val file = blockManager.diskBlockManager.getFile(blockId)
    new FileSegmentManagedBuffer(transportConf, file, 0, file.length)
  }
}
It simply asks the DiskBlockManager for the file corresponding to the blockId and wraps it in a FileSegmentManagedBuffer starting at offset 0 with length file.length, i.e. the whole file.
SortShuffleManager uses IndexShuffleBlockResolver. Since in sort shuffle each map writes one data file and one index file, and the data file contains data for multiple reducers, the index file must be read first to determine where a given reducer's data starts:
override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // The block is actually going to be a range of a single map output file for this map, so
  // find out the consolidated file, then the offset within that from our index
  val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    ByteStreams.skipFully(in, blockId.reduceId * 8)
    val offset = in.readLong()
    val nextOffset = in.readLong()
    new FileSegmentManagedBuffer(
      transportConf,
      getDataFile(blockId.shuffleId, blockId.mapId),
      offset,
      nextOffset - offset)
  } finally {
    in.close()
  }
}
The index file, as noted, is a series of long values; the i-th value is the offset in the data file at which the i-th reducer's data starts. So the FileSegmentManagedBuffer it returns covers only a segment of the file, unlike the hash case, where it covers the whole file.
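Here is a worked, self-contained sketch of that layout, assuming the format implied by the code above: consecutive longs starting at 0, with each subsequent entry being the cumulative offset, so the segment for reduceId r sits between entries r and r + 1. The segment lengths below are made up for illustration.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Write an index for three reducer segments of 100, 250 and 75 bytes.
val segmentLengths = Array[Long](100, 250, 75)
val bytes = new ByteArrayOutputStream()
val out = new DataOutputStream(bytes)
var offset = 0L
out.writeLong(offset) // entry 0 is always 0: reducer 0's data starts at the beginning
for (len <- segmentLengths) { offset += len; out.writeLong(offset) }
out.close()

// Mirror getBlockData: skip reduceId * 8 bytes, then read two consecutive offsets.
val reduceId = 1
val in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray))
in.skipBytes(reduceId * 8)
val start = in.readLong()       // 100
val end = in.readLong()         // 350
val segmentLength = end - start // 250, matches segmentLengths(1)
in.close()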
Summary:
Information about map output is encapsulated in MapStatus objects. These are carried back to the driver by the task-result collection machinery around the DAGScheduler, then registered with the driver-side MapOutputTrackerMaster, becoming available through the MapOutputTracker system. A reducer obtains the locations and sizes of the map outputs via the MapOutputTracker, then uses a BlockTransferService to fetch the blocks it needs. Resolving a BlockId to map output data is handled by the ShuffleManager through the appropriate ShuffleBlockResolver, which goes to the BlockManager, or more precisely the DiskBlockManager, to finally obtain the file or file segment.
To implement shuffle, Spark relies on two master-slave systems, the scheduling system and the MapOutputTracker system, plus a dedicated block transfer system, the BlockTransferService, composed of a set of servers. The call relationships among them are fairly intricate.