Spark Stage

lotus_ruan, published 2021-09-09
  • Concept

A stage is a set of parallel tasks ① all computing the same function that need to run as part of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the DAGScheduler runs these stages in topological order.
Each Stage can ② either be a shuffle map stage, in which case its tasks' results are input for other stage(s), or a result stage, in which case its tasks directly compute a Spark action (e.g. count(), save(), etc.) by running a function on an RDD. ③ For shuffle map stages, we also track the nodes that each output partition is on.
Each Stage also has a firstJobId, identifying the job that first submitted the stage. When FIFO scheduling is used, this ④ allows Stages from earlier jobs to be computed first or recovered faster on failure. Finally, a single stage can be re-executed in multiple attempts due to fault recovery. In that case, the Stage object will track multiple StageInfo objects to pass to listeners or the web UI. The latest one will be accessible through latestInfo.
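
To make the two stage types concrete, here is a minimal sketch (a local-mode example assumed for illustration, not taken from the Spark source): reduceByKey introduces a shuffle, so everything before it becomes a shuffle map stage, and the count() action runs in the final result stage.

import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaryExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-boundary").setMaster("local[2]"))
    val words = sc.parallelize(Seq("a", "b", "a", "c"))

    val counts = words
      .map(w => (w, 1))      // runs inside the shuffle map stage
      .reduceByKey(_ + _)    // shuffle boundary: a new stage starts here

    // count() is the action: the DAGScheduler submits a ResultStage whose parent
    // is the ShuffleMapStage created by reduceByKey.
    println(counts.count())
    sc.stop()
  }
}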

  • Code walkthrough

  1. [DAGScheduler]->private[scheduler] def handleJobSubmitted

{
  var finalStage: ResultStage = null
  try {
    /**
      ② There are only two kinds of stage: shuffle map stages and result
      stages, and the result stage is always the stage containing the RDD on
      which the action was invoked. Parameter meaning: func is the operation
      applied to each partition and depends on the action; for example, when
      the action is count, func computes the size of each partition, and the
      final result is collected by the JobWaiter (see the submitJob method),
      which sums the per-partition results and returns the total.
    **/
    finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
                                  ...
  /**
    [1] The first stage submitted is always finalStage, i.e. the ResultStage.
    The scheduler then recursively walks this stage's dependencies until it
    finds a stage with no missing parents; only then is a TaskSet generated
    and submitted.
    [2] While recursively walking the parent stages, any stage that still has
    missing parents is put into the waiting queue for later scheduling.
  **/
  submitStage(finalStage)
  submitWaitingStages()
}
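
As a hedged illustration of what the comment above says about func for count: conceptually, the per-partition function just measures the size of each partition's iterator, and the driver sums the values returned by runJob (the collection step is what the JobWaiter handles internally). The snippet below is an assumed example, not the actual RDD.count implementation.

import org.apache.spark.{SparkConf, SparkContext}

object CountAsRunJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)

    // One Long per partition (the role played by func when the action is count),
    // summed on the driver side.
    val perPartitionSizes: Array[Long] = sc.runJob(rdd, (it: Iterator[Int]) => it.size.toLong)
    println(perPartitionSizes.sum)   // same value as rdd.count()
    sc.stop()
  }
}
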
  2. [DAGScheduler]->private def submitStage(stage: Stage)

{
      ...
      //[1]
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        //[1]
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        //[2]
        waitingStages += stage
      }
      ...
}
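
The recursion marked [1] and the waiting queue marked [2] can be pictured with a small standalone sketch (not Spark's code; Stage, waiting and submit here are simplified stand-ins, and task completion is pretended to happen synchronously):

import scala.collection.mutable

object SubmitStageSketch {
  case class Stage(id: Int, parents: List[Stage], var available: Boolean = false)

  val waiting = mutable.Set[Stage]()

  def submit(stage: Stage): Unit = {
    val missing = stage.parents.filterNot(_.available).sortBy(_.id)
    if (missing.isEmpty) {
      println(s"submitting tasks for stage ${stage.id}")   // submitMissingTasks(...)
      stage.available = true                               // pretend its tasks finished
    } else {
      missing.foreach(submit)                              // [1] parents first
      waiting += stage                                     // [2] park until parents are done
    }
  }

  // Simplified stand-in for submitWaitingStages(): retry parked stages whose parents finished.
  def submitWaiting(): Unit = {
    val ready = waiting.filter(_.parents.forall(_.available))
    ready.foreach { s => waiting -= s; submit(s) }
  }

  def main(args: Array[String]): Unit = {
    val a = Stage(0, Nil)
    val b = Stage(1, List(a))
    val c = Stage(2, List(a, b))
    submit(c)          // submits stages 0 and 1 first, parks stage 2
    submitWaiting()    // now stage 2 has no missing parents and is submitted
  }
}
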
  3. [DAGScheduler]->getMissingParentStages(stage: Stage): List[Stage]
     Stages are split according to whether a dependency is a shuffle dependency or a narrow dependency (i.e. wide vs. narrow).

{
      ...
      for (dep <- rdd.dependencies) {
        dep match {
          case shufDep: ShuffleDependency[_, _, _] =>
            val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
            if (!mapStage.isAvailable) {
              missing += mapStage
            }
          case narrowDep: NarrowDependency[_] =>
            //[2]
            waitingForVisit.push(narrowDep.rdd)
        }
      }
      ...
}
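
The two match arms above distinguish wide from narrow dependencies. A quick way to see both kinds from user code (an assumed illustrative snippet, not from the post): map yields a narrow OneToOneDependency, while reduceByKey yields a ShuffleDependency, which is exactly where getMissingParentStages cuts a new stage.

import org.apache.spark.{SparkConf, SparkContext}

object DependencyKindsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[2]"))
    val base = sc.parallelize(1 to 10)

    val mapped  = base.map(i => (i % 2, i))     // narrow dependency
    val reduced = mapped.reduceByKey(_ + _)     // wide (shuffle) dependency

    println(mapped.dependencies.map(_.getClass.getSimpleName))   // List(OneToOneDependency)
    println(reduced.dependencies.map(_.getClass.getSimpleName))  // List(ShuffleDependency)
    sc.stop()
  }
}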

ShuffleMapStage

  • Concept

ShuffleMapStages are intermediate stages in the execution DAG that produce data for a shuffle. ⑤ They occur right before each shuffle operation, and might contain multiple pipelined operations before that (e.g. map and filter). When executed, ⑥ they save map output files that can later be fetched by reduce tasks. ⑦ The shuffleDep field describes the shuffle each stage is part of, and ⑧ variables like outputLocs and numAvailableOutputs track how many map outputs are ready. ShuffleMapStages can also be submitted independently as jobs with DAGScheduler.submitMapStage. For such stages, the ActiveJobs that submitted them are tracked in mapStageJobs. ⑨ Note that there can be multiple ActiveJobs trying to compute the same shuffle map stage.
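
As a small assumed example (not from the original post) of the pipelining mentioned at ⑤: the map and filter below run inside the same ShuffleMapStage that feeds reduceByKey, and toDebugString prints the lineage with the shuffle boundary where the map output files (⑥) are written.

import org.apache.spark.{SparkConf, SparkContext}

object PipelinedShuffleMapStage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline").setMaster("local[2]"))
    val result = sc.parallelize(1 to 100)
      .map(i => (i % 10, i))   // pipelined into the ShuffleMapStage
      .filter(_._2 > 5)        // still the same ShuffleMapStage
      .reduceByKey(_ + _)      // shuffle: map output files are written here

    // The indentation step below the ShuffledRDD marks the stage boundary.
    println(result.toDebugString)
    sc.stop()
  }
}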

  • Code walkthrough

⑤ When stages are split, a shuffle map stage contains all of the non-shuffle operations that follow the previous shuffle, such as map and filter.
⑥ Output information is maintained for each partition:

/**
   List of [[MapStatus]] for each partition. The index of the array
   is the map partition id, and each value in the array is the list of
   possible [[MapStatus]] for a partition (a single task might run
   multiple times).
   ③⑧ This records the location and status of the current RDD's output,
   i.e. which executor each partition ran on and produced its output.
   The DAGScheduler uses this information when scheduling tasks.
**/
  private[this] val outputLocs = Array.fill[List[MapStatus]](numPartitions)(Nil)
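
A minimal sketch (not the actual ShuffleMapStage code; OutputLocTracker and its simplified MapStatus are hypothetical) of how per-partition output tracking along the lines of outputLocs and numAvailableOutputs can work:

object OutputLocTrackerSketch {
  // Hypothetical stand-in for MapStatus: just the executor holding the map output.
  final case class MapStatus(executorId: String)

  class OutputLocTracker(numPartitions: Int) {
    private val outputLocs = Array.fill[List[MapStatus]](numPartitions)(Nil)
    private var numAvailableOutputs = 0

    def addOutputLoc(partition: Int, status: MapStatus): Unit = {
      val prev = outputLocs(partition)
      outputLocs(partition) = status :: prev
      if (prev.isEmpty) numAvailableOutputs += 1   // first output for this partition
    }

    // Downstream stages can run only once every partition has a map output.
    def isAvailable: Boolean = numAvailableOutputs == numPartitions
  }

  def main(args: Array[String]): Unit = {
    val tracker = new OutputLocTracker(numPartitions = 2)
    tracker.addOutputLoc(0, MapStatus("executor-1"))
    println(tracker.isAvailable)   // false: partition 1 has no output yet
    tracker.addOutputLoc(1, MapStatus("executor-2"))
    println(tracker.isAvailable)   // true
  }
}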

[DAGScheduler]->submitMissingTasks(stage: Stage, jobId: Int)

                               ...
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
                               ...
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
      case s: ResultStage =>
        val job = s.activeJob.get
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  }
                               ...

⑦ shuffleDep describes the whole shuffle. Each stage's shuffleDep field identifies which shuffle the stage belongs to and what operations it should perform; this information is needed when the stage is submitted for execution.

class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false) {
                                     ...
}
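
To illustrate two of the constructor parameters above with a hedged example (an assumed local-mode snippet, not from the post): reduceByKey enables map-side combining, i.e. partial aggregation before the shuffle, while groupByKey disables it; both carry the partitioner that determines how data is laid out for the reduce side.

import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapSideCombineExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-dep").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // The first dependency of an RDD produced by a shuffle is its ShuffleDependency.
    def shuffleDep(rdd: RDD[_]): ShuffleDependency[_, _, _] =
      rdd.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]]

    println(shuffleDep(pairs.reduceByKey(_ + _)).mapSideCombine)  // true
    println(shuffleDep(pairs.groupByKey()).mapSideCombine)        // false
    sc.stop()
  }
}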


