Spark Source Code Analysis: DAG Division and Submission in the DAGScheduler
1. Spark Execution Architecture
The Spark execution architecture is shown in the figure below.

Dependencies between RDDs form a directed acyclic graph (DAG). The DAGScheduler divides this DAG into stages, and the rule is simple: backtrack from the final RDD, absorbing narrow dependencies into the current stage and cutting a new stage at each wide (shuffle) dependency. Once the stages are divided, the DAGScheduler builds a TaskSet for each stage and submits it to the TaskScheduler, which handles the actual task scheduling and launches the tasks on Worker nodes.
[Figure: Spark execution architecture]
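To get a concrete feel for the division rule, consider the classic word count. This is a minimal sketch added here for illustration (the input path and app name are placeholders, not from the original article): flatMap and map are narrow dependencies and stay in one stage, while reduceByKey introduces a shuffle and therefore a stage boundary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stage-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines  = sc.textFile("input.txt")       // "input.txt" is a placeholder path
    val words  = lines.flatMap(_.split(" "))    // narrow dependency -> same stage
    val pairs  = words.map(word => (word, 1))   // narrow dependency -> same stage
    val counts = pairs.reduceByKey(_ + _)       // wide (shuffle) dependency -> stage boundary

    // collect() is the action that triggers SparkContext.runJob and,
    // through it, the DAGScheduler machinery analyzed below.
    counts.collect().foreach(println)
    sc.stop()
  }
}
```

This job is divided into two stages: a ShuffleMapStage covering textFile/flatMap/map, and a ResultStage for the part after the shuffle.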
2. Source Code Analysis: DAG Division in the DAGScheduler
When an RDD triggers an action (such as collect), SparkContext.runJob is executed. SparkContext.runJob calls DAGScheduler.runJob, which ultimately invokes DAGScheduler.submitJob:
```scala
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post a JobSubmitted message to eventProcessLoop
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
```
In DAGScheduler.submitJob, a JobSubmitted message is posted to the eventProcessLoop object. eventProcessLoop is an instance of DAGSchedulerEventProcessLoop:
```scala
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
```
DAGSchedulerEventProcessLoop receives the various messages and dispatches them; the dispatch logic lives in its doOnReceive method:
```scala
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // Job submission
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)

  case StageCancelled(stageId) =>
    dagScheduler.handleStageCancellation(stageId)

  case JobCancelled(jobId) =>
    dagScheduler.handleJobCancellation(jobId)

  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)

  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()

  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)

  case ExecutorLost(execId) =>
    dagScheduler.handleExecutorLost(execId, fetchFailed = false)

  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)

  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)

  case completion: CompletionEvent =>
    dagScheduler.handleTaskCompletion(completion)

  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)

  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}
```
DAGSchedulerEventProcessLoop can be seen as the DAGScheduler's external interface: it hides the internal implementation details. Whether a message originates inside or outside the scheduler, the DAGScheduler handles it through the same message-processing code, which keeps the logic clear and the handling uniform.
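Under the hood this follows Spark's generic event-loop pattern (org.apache.spark.util.EventLoop): callers post events onto a blocking queue, and a single daemon thread drains the queue and calls onReceive for each event. The following is a stripped-down sketch of that pattern, written for this article, not the actual Spark class:

```scala
import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

// Minimal event-loop sketch: post() enqueues an event; one background
// thread drains the queue, so all handling is serialized on one thread.
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()             // blocks until an event arrives
          try onReceive(event)
          catch { case t: Throwable => onError(t) } // a bad event must not kill the loop
        }
      } catch {
        case _: InterruptedException =>             // stop() interrupts the blocked take()
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped.set(true); eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)

  protected def onReceive(event: E): Unit           // subclasses implement dispatch here
  protected def onError(e: Throwable): Unit = e.printStackTrace()
}
```

DAGSchedulerEventProcessLoop is essentially such a subclass: its onReceive delegates to the doOnReceive dispatch shown above.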
Next, let's analyze stage division in the DAGScheduler. The handleJobSubmitted method first creates the ResultStage:
```scala
try {
  // Creating a new stage may throw an exception, e.g. if an HDFS file
  // the job depends on has been deleted
  finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
  case e: Exception =>
    logWarning("Creating new stage failed due to exception - job: " + jobId, e)
    listener.jobFailed(e)
    return
}
```
It then calls the submitStage method, which drives the stage division.
[Figure: stage division flow in submitStage]
Starting from the final RDD, the scheduler inspects each parent dependency's type: for a narrow dependency, the parent RDD is pushed onto a stack to be visited later; for a wide dependency, a parent stage is created.

Here is the process in the source code:
```scala
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]     // parent stages to be returned
  val visited = new HashSet[RDD[_]]    // RDDs already visited
  // Maintain our own stack to avoid stack overflow from recursive calls
  val waitingForVisit = new Stack[RDD[_]]

  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage               // wide dependency: add a parent stage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd) // narrow dependency: push onto the stack
          }
        }
      }
    }
  }

  // Push the starting RDD of the backtracking onto the stack
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
```
getMissingParentStages takes the current stage and returns its missing parent stages. A parent stage is created via getShuffleMapStage, which ultimately calls newOrUsedShuffleStage to return a ShuffleMapStage:
```scala
private def newOrUsedShuffleStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // The stage has been computed before; fetch its results from the MapOutputTracker
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
```
Now that the parent stages have been determined, let's look at the stage submission logic:
```scala
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // No missing parent stages: submit the current stage
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          // Missing parent stages exist: recursively submit them first
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
```
The submission process is straightforward. The current stage first looks up its missing parent stages. If there are none, the current stage has no unfinished ancestors and is handed to submitMissingTasks. Otherwise, submitStage is called recursively on each missing parent, and the current stage is parked in waitingStages until its parents complete.
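To see the submission order concretely, here is a toy model of the recursion (hypothetical types and names, not Spark source; the real scheduler parks the child in waitingStages and resubmits it when a parent finishes, rather than recursing a second time as this sketch does):

```scala
// Toy model of submitStage's recursion: a stage runs only after every parent has run.
case class ToyStage(id: Int, parents: List[ToyStage])

object SubmitOrderDemo extends App {
  val submitted = scala.collection.mutable.LinkedHashSet[Int]()

  def submit(stage: ToyStage): Unit = {
    val missing = stage.parents.filterNot(p => submitted(p.id)).sortBy(_.id)
    if (missing.isEmpty) {
      submitted += stage.id   // no missing parents: run this stage now
    } else {
      missing.foreach(submit) // recurse into the parents first
      submit(stage)           // then this stage is runnable
    }
  }

  // result2 depends on shuffleMap1, which depends on shuffleMap0 (two shuffle boundaries)
  val s0 = ToyStage(0, Nil)
  val s1 = ToyStage(1, List(s0))
  val s2 = ToyStage(2, List(s1))
  submit(s2)
  println(submitted.mkString(" -> "))  // prints: 0 -> 1 -> 2
}
```

The earliest ancestor stage is always submitted first, which matches the backtracking division: the stage containing the root RDDs runs before any stage that consumes its shuffle output.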
That covers DAG division and submission in the DAGScheduler. Next time we will look at how these stages are packaged into TaskSets and handed to the TaskScheduler, and at the TaskScheduler's scheduling process.
Author: 飛帥記憶
Source: ITPUB blog, http://blog.itpub.net/3486/viewspace-2818589/ (please credit the source when reposting).