Spark Job Dispatch Flow: Source Code Analysis

Posted by 破棉襖 on 2015-10-13


The overall execution flow of job submission:

[Figure: Spark job dispatch flow diagram]


SparkContext: creating the scheduler

    // Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
SparkContext.createTaskScheduler(): builds the TaskScheduler and SchedulerBackend from the master URL that was passed in:

    private def createTaskScheduler(
        sc: SparkContext,
        master: String): (SchedulerBackend, TaskScheduler) = {
      // Regular expression used for local[N] and local[*] master formats
      val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
      // Regular expression for local[N, maxRetries], used in tests with failing tasks
      val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
      // Regular expression for simulating a Spark cluster of [N, cores, memory] locally
      val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
      // Regular expression for connecting to Spark deploy clusters
      val SPARK_REGEX = """spark://(.*)""".r
      // Regular expression for connection to Mesos cluster by mesos:// or zk:// url
      val MESOS_REGEX = """(mesos|zk)://.*""".r
      // Regular expression for connection to Simr cluster
      val SIMR_REGEX = """simr://(.*)""".r

      // When running locally, don't try to re-execute tasks on failure.
      val MAX_LOCAL_TASK_FAILURES = 1

      master match {
        case "local" =>
          val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
          val backend = new LocalBackend(sc.getConf, scheduler, 1)
          scheduler.initialize(backend)
          (backend, scheduler)

        case LOCAL_N_REGEX(threads) =>
          def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
          // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
          val threadCount = if (threads == "*") localCpuCount else threads.toInt
          if (threadCount <= 0) {
            throw new SparkException(s"Asked to run locally with $threadCount threads")
          }
          val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
          val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
          scheduler.initialize(backend)
          (backend, scheduler)

        case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
          def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
          // local[*, M] means the number of cores on the computer with M failures
          // local[N, M] means exactly N threads with M failures
          val threadCount = if (threads == "*") localCpuCount else threads.toInt
          val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
          val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
          scheduler.initialize(backend)
          (backend, scheduler)

        case SPARK_REGEX(sparkUrl) =>
          val scheduler = new TaskSchedulerImpl(sc)
          val masterUrls = sparkUrl.split(",").map("spark://" + _)
          val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
          scheduler.initialize(backend)
          (backend, scheduler)

        case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
          // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
          val memoryPerSlaveInt = memoryPerSlave.toInt
          if (sc.executorMemory > memoryPerSlaveInt) {
            throw new SparkException(
              "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
                memoryPerSlaveInt, sc.executorMemory))
          }

          val scheduler = new TaskSchedulerImpl(sc)
          val localCluster = new LocalSparkCluster(
            numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
          val masterUrls = localCluster.start()
          val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
          scheduler.initialize(backend)
          backend.shutdownCallback = (backend: SparkDeploySchedulerBackend) => {
            localCluster.stop()
          }
          (backend, scheduler)

        case "yarn-standalone" | "yarn-cluster" =>
          if (master == "yarn-standalone") {
            logWarning(
              "\"yarn-standalone\" is deprecated as of Spark 1.0. Use \"yarn-cluster\" instead.")
          }
          val scheduler = try {
            val clazz = Utils.classForName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
            val cons = clazz.getConstructor(classOf[SparkContext])
            cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
          } catch {
            // TODO: Enumerate the exact reasons why it can fail
            // But irrespective of it, it means we cannot proceed !
            case e: Exception => {
              throw new SparkException("YARN mode not available ?", e)
            }
          }
          val backend = try {
            val clazz =
              Utils.classForName("org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend")
            val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
            cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
          } catch {
            case e: Exception => {
              throw new SparkException("YARN mode not available ?", e)
            }
          }
          scheduler.initialize(backend)
          (backend, scheduler)

        case "yarn-client" =>
          val scheduler = try {
            val clazz = Utils.classForName("org.apache.spark.scheduler.cluster.YarnScheduler")
            val cons = clazz.getConstructor(classOf[SparkContext])
            cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
          } catch {
            case e: Exception => {
              throw new SparkException("YARN mode not available ?", e)
            }
          }
          val backend = try {
            val clazz =
              Utils.classForName("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend")
            val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
            cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
          } catch {
            case e: Exception => {
              throw new SparkException("YARN mode not available ?", e)
            }
          }
          scheduler.initialize(backend)
          (backend, scheduler)

        case mesosUrl @ MESOS_REGEX(_) =>
          MesosNativeLibrary.load()
          val scheduler = new TaskSchedulerImpl(sc)
          val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", false)
          val url = mesosUrl.stripPrefix("mesos://") // strip scheme from raw Mesos URLs
          val backend = if (coarseGrained) {
            new CoarseMesosSchedulerBackend(scheduler, sc, url, sc.env.securityManager)
          } else {
            new MesosSchedulerBackend(scheduler, sc, url)
          }
          scheduler.initialize(backend)
          (backend, scheduler)

        case SIMR_REGEX(simrUrl) =>
          val scheduler = new TaskSchedulerImpl(sc)
          val backend = new SimrSchedulerBackend(scheduler, sc, simrUrl)
          scheduler.initialize(backend)
          (backend, scheduler)

        case _ =>
          throw new SparkException("Could not parse Master URL: '" + master + "'")
      }
    }
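
As the pattern match shows, the scheduler and backend are chosen purely from the master URL string. A minimal sketch of how a driver program ends up in one of these branches; the master URL and app name below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[4]" matches LOCAL_N_REGEX -> TaskSchedulerImpl + LocalBackend with 4 threads.
    // "spark://host:7077" would match SPARK_REGEX -> SparkDeploySchedulerBackend instead.
    val conf = new SparkConf()
      .setMaster("local[4]")           // placeholder master URL
      .setAppName("scheduler-demo")    // placeholder app name
    val sc = new SparkContext(conf)    // createTaskScheduler() runs inside this constructor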


An RDD action ultimately calls runJob on the SparkContext.
1. SparkContext.runJob():


    def runJob[T, U: ClassTag](
        rdd: RDD[T],                            // the concrete RDD instance
        func: (TaskContext, Iterator[T]) => U,  // the action logic to execute, e.g. the function behind reduceByKey
        partitions: Seq[Int],                   // partition ids, 0 to partitions.size - 1
        // result handling logic, invoked once per task
        resultHandler: (Int, U) => Unit): Unit = {
      if (stopped.get()) {
        throw new IllegalStateException("SparkContext has been shutdown")
      }
      val callSite = getCallSite
      val cleanedFunc = clean(func)
      logInfo("Starting job: " + callSite.shortForm)
      if (conf.getBoolean("spark.logLineage", false)) {
        logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
      }
      dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
      progressBar.foreach(_.finishAll())
      rdd.doCheckpoint()
    }
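
For orientation, here is what reaches runJob when a simple action fires. The data is a placeholder; the count() expansion quoted in the comment is the RDD source line cited later in this article:

    val rdd = sc.parallelize(1 to 100, numSlices = 4)   // placeholder data, 4 partitions

    // rdd.count() is essentially:
    //   sc.runJob(rdd, Utils.getIteratorSize _).sum
    // i.e. one task per partition computes a partial count, and the driver sums the results.
    val total = rdd.count()
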
Stepping into DAGScheduler.

2. DAGScheduler.runJob():
    def runJob[T, U](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        callSite: CallSite,
        resultHandler: (Int, U) => Unit,
        properties: Properties): Unit = {
      val start = System.nanoTime
      val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
      // Wait for the job to finish.
      waiter.awaitResult() match {
        case JobSucceeded =>
          logInfo("Job %d finished: %s, took %f s".format
            (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        case JobFailed(exception: Exception) =>
          logInfo("Job %d failed: %s, took %f s".format
            (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
          // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
          val callerStackTrace = Thread.currentThread().getStackTrace.tail
          exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
          throw exception
      }
    }
Continuing into submitJob:
    def submitJob[T, U](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        callSite: CallSite,
        resultHandler: (Int, U) => Unit,
        properties: Properties): JobWaiter[U] = {
      // Check to make sure we are not launching a task on a partition that does not exist.
      val maxPartitions = rdd.partitions.length
      partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
        throw new IllegalArgumentException(
          "Attempting to access a non-existent partition: " + p + ". " +
            "Total number of partitions: " + maxPartitions)
      }

      val jobId = nextJobId.getAndIncrement()
      if (partitions.size == 0) {
        // Return immediately if the job is running 0 tasks
        return new JobWaiter[U](this, jobId, 0, resultHandler)
      }

      assert(partitions.size > 0)
      val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
      val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
      eventProcessLoop.post(JobSubmitted( // post a JobSubmitted event onto the event queue
        jobId, rdd, func2, partitions.toArray, callSite, waiter,
        SerializationUtils.clone(properties)))
      waiter
    }

submitJob posts a JobSubmitted event onto a queue (the Actor model is no longer used here; since 1.4 Spark has gone back to the queue-based approach it used before 1.0). So where is that queue consumed? Read on.
Near the end of the DAGScheduler class there is this snippet:

    // Start the event thread and register the metrics source at the end of the constructor
    env.metricsSystem.registerSource(metricsSource)
    eventProcessLoop.start()
These statements sit outside of any method body; in a Java class that would be a syntax error. It works because of a Scala feature: any code written directly in a class body belongs to the primary constructor, so such statements run when an instance is created.
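
A minimal illustration of that language feature (the class and field names here are made up for the example):

    class EventLoopLike(name: String) {
      private val queue = new java.util.concurrent.LinkedBlockingDeque[String]()

      // These two statements are not inside any method; they belong to the primary
      // constructor and run as soon as `new EventLoopLike(...)` is evaluated.
      println(s"constructing event loop '$name'")
      queue.put("bootstrap-event")
    }

    new EventLoopLike("demo")   // prints the message and enqueues the bootstrap event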

DAGSchedulerEventProcessLoop extends EventLoop. Here are EventLoop's event thread and its start() method:
      
    // A dedicated thread keeps draining the event queue and calls onReceive() for each event.
    private val eventThread = new Thread(name) {
      setDaemon(true)

      override def run(): Unit = {
        try {
          while (!stopped.get) {
            val event = eventQueue.take()
            try {
              onReceive(event)
            } catch {
              case NonFatal(e) => {
                try {
                  onError(e)
                } catch {
                  case NonFatal(e) => logError("Unexpected error in " + name, e)
                }
              }
            }
          }
        } catch {
          case ie: InterruptedException => // exit even if eventQueue is not empty
          case NonFatal(e) => logError("Unexpected error in " + name, e)
        }
      }
    }

    def start(): Unit = {
      if (stopped.get) {
        throw new IllegalStateException(name + " has already been stopped")
      }
      // Call onStart before starting the event thread to make sure it happens before onReceive
      onStart()
      eventThread.start()
    }
So when the DAGScheduler is constructed, a daemon thread is started that consumes the event queue.
Next, the onReceive() method:
    /**
     * The main event loop of the DAG scheduler.
     */
    override def onReceive(event: DAGSchedulerEvent): Unit = {
      val timerContext = timer.time()
      try {
        doOnReceive(event)
      } finally {
        timerContext.stop()
      }
    }

    private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
      case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
        dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

      case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
        dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)

      case StageCancelled(stageId) =>
        dagScheduler.handleStageCancellation(stageId)

      case JobCancelled(jobId) =>
        dagScheduler.handleJobCancellation(jobId)

      case JobGroupCancelled(groupId) =>
        dagScheduler.handleJobGroupCancelled(groupId)

      case AllJobsCancelled =>
        dagScheduler.doCancelAllJobs()

      case ExecutorAdded(execId, host) =>
        dagScheduler.handleExecutorAdded(execId, host)

      case ExecutorLost(execId) =>
        dagScheduler.handleExecutorLost(execId, fetchFailed = false)

      case BeginEvent(task, taskInfo) =>
        dagScheduler.handleBeginEvent(task, taskInfo)

      case GettingResultEvent(taskInfo) =>
        dagScheduler.handleGetTaskResult(taskInfo)

      case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
        dagScheduler.handleTaskCompletion(completion)

      case TaskSetFailed(taskSet, reason, exception) =>
        dagScheduler.handleTaskSetFailed(taskSet, reason, exception)

      case ResubmitFailedStages =>
        dagScheduler.resubmitFailedStages()
    }

Stepping into handleJobSubmitted:
    private[scheduler] def handleJobSubmitted(jobId: Int,
        finalRDD: RDD[_],
        func: (TaskContext, Iterator[_]) => _,
        partitions: Array[Int],
        callSite: CallSite,
        listener: JobListener,
        properties: Properties) {
      var finalStage: ResultStage = null
      try {
        // New stage creation may throw an exception if, for example, jobs are run on a
        // HadoopRDD whose underlying HDFS files have been deleted.
        // Build the finalStage. Every job has exactly one finalStage, from which the other
        // stages are derived; the RDD submitted here is always the last RDD of the job,
        // which is why it is wrapped in a ResultStage.
        // In RDD: def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
        finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
      } catch {
        case e: Exception =>
          logWarning("Creating new stage failed due to exception - job: " + jobId, e)
          listener.jobFailed(e)
          return
      }

      val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
      clearCacheLocs()
      logInfo("Got job %s (%s) with %d output partitions".format(
        job.jobId, callSite.shortForm, partitions.length))
      logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
      logInfo("Parents of final stage: " + finalStage.parents)
      logInfo("Missing parents: " + getMissingParentStages(finalStage))

      val jobSubmissionTime = clock.getTimeMillis()
      jobIdToActiveJob(jobId) = job
      activeJobs += job
      finalStage.resultOfJob = Some(job)
      val stageIds = jobIdToStageIds(jobId).toArray
      val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
      listenerBus.post(
        SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
      submitStage(finalStage)

      submitWaitingStages()
    }
Stepping into submitStage:
    /** Submits stage, but first recursively submits any missing parents. */
    // The first call comes in with the ResultStage; this method recursively submits the
    // ShuffleMapStages built from the parent RDDs before the stage itself.
    private def submitStage(stage: Stage) {
      val jobId = activeJobForStage(stage)
      if (jobId.isDefined) {
        logDebug("submitStage(" + stage + ")")
        if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
          val missing = getMissingParentStages(stage).sortBy(_.id) // look up the missing parent stages
          logDebug("missing: " + missing)
          if (missing.isEmpty) {
            logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
            submitMissingTasks(stage, jobId.get)
          } else {
            for (parent <- missing) {
              submitStage(parent)
            }
            waitingStages += stage
          }
        }
      } else {
        abortStage(stage, "No active job for stage " + stage.id, None)
      }
    }
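
To make the recursion concrete, a hedged example (the input path is a placeholder): a classic word count has a single shuffle dependency, so submitStage(finalStage) first submits the ShuffleMapStage covering everything before reduceByKey, and only then the ResultStage that produces the result.

    // Placeholder input path; the shape of the DAG is what matters here.
    val counts = sc.textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)     // shuffle dependency -> parent ShuffleMapStage
      .collect()              // action -> ResultStage (the finalStage for this job)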


Continuing into submitMissingTasks:
    /** Called when stage's parents are available and we can now do its task. */
    // Submits the stage's tasks via taskScheduler.submitTasks.
    private def submitMissingTasks(stage: Stage, jobId: Int) {
      logDebug("submitMissingTasks(" + stage + ")")
      // Get our pending tasks and remember them in our pendingTasks entry
      stage.pendingTasks.clear()

      // First figure out the indexes of partition ids to compute.
      val (allPartitions: Seq[Int], partitionsToCompute: Seq[Int]) = {
        stage match { // stages come in two flavors: ShuffleMapStage and ResultStage
          case stage: ShuffleMapStage =>
            val allPartitions = 0 until stage.numPartitions
            val filteredPartitions = allPartitions.filter { id => stage.outputLocs(id).isEmpty }
            (allPartitions, filteredPartitions)
          case stage: ResultStage =>
            val job = stage.resultOfJob.get
            val allPartitions = 0 until job.numPartitions
            val filteredPartitions = allPartitions.filter { id => !job.finished(id) }
            (allPartitions, filteredPartitions)
        }
      }

      // Create internal accumulators if the stage has no accumulators initialized.
      // Reset internal accumulators only if this stage is not partially submitted
      // Otherwise, we may override existing accumulator values from some tasks
      if (stage.internalAccumulators.isEmpty || allPartitions == partitionsToCompute) {
        stage.resetInternalAccumulators()
      }

      val properties = jobIdToActiveJob.get(stage.firstJobId).map(_.properties).orNull

      runningStages += stage
      // SparkListenerStageSubmitted should be posted before testing whether tasks are
      // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
      // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
      // event.
      outputCommitCoordinator.stageStart(stage.id)
      val taskIdToLocations = try {
        stage match {
          case s: ShuffleMapStage =>
            partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
          case s: ResultStage =>
            val job = s.resultOfJob.get
            partitionsToCompute.map { id =>
              val p = s.partitions(id)
              (id, getPreferredLocs(stage.rdd, p))
            }.toMap
        }
      } catch {
        case NonFatal(e) =>
          stage.makeNewStageAttempt(partitionsToCompute.size)
          listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
          abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
          runningStages -= stage
          return
      }

      stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

      // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
      // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
      // the serialized copy of the RDD and for each task we will deserialize it, which means each
      // task gets a different copy of the RDD. This provides stronger isolation between tasks that
      // might modify state of objects referenced in their closures. This is necessary in Hadoop
      // where the JobConf/Configuration object is not thread-safe.
      var taskBinary: Broadcast[Array[Byte]] = null
      try {
        // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
        // For ResultTask, serialize and broadcast (rdd, func).
        val taskBinaryBytes: Array[Byte] = stage match {
          case stage: ShuffleMapStage =>
            closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
          case stage: ResultStage =>
            closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
        }

        taskBinary = sc.broadcast(taskBinaryBytes)
      } catch {
        // In the case of a failure during serialization, abort the stage.
        case e: NotSerializableException =>
          abortStage(stage, "Task not serializable: " + e.toString, Some(e))
          runningStages -= stage

          // Abort execution
          return
        case NonFatal(e) =>
          abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}", Some(e))
          runningStages -= stage
          return
      }

      val tasks: Seq[Task[_]] = try {
        stage match {
          case stage: ShuffleMapStage =>
            partitionsToCompute.map { id =>
              val locs = taskIdToLocations(id)
              val part = stage.rdd.partitions(id)
              new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
                taskBinary, part, locs, stage.internalAccumulators)
            }

          case stage: ResultStage =>
            val job = stage.resultOfJob.get
            partitionsToCompute.map { id =>
              val p: Int = stage.partitions(id)
              val part = stage.rdd.partitions(p)
              val locs = taskIdToLocations(id)
              new ResultTask(stage.id, stage.latestInfo.attemptId,
                taskBinary, part, locs, id, stage.internalAccumulators)
            }
        }
      } catch {
        case NonFatal(e) =>
          abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
          runningStages -= stage
          return
      }

      if (tasks.size > 0) {
        logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
        stage.pendingTasks ++= tasks
        logDebug("New pending tasks: " + stage.pendingTasks)
        // Turn the stage's partitions into a TaskSet and hand the tasks to the TaskScheduler.
        taskScheduler.submitTasks(new TaskSet(
          tasks.toArray, stage.id, stage.latestInfo.attemptId, stage.firstJobId, properties))
        stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
      } else {
        // Because we posted SparkListenerStageSubmitted earlier, we should mark
        // the stage as completed here in case there are no tasks to run
        markStageAsFinished(stage, None)

        val debugString = stage match {
          case stage: ShuffleMapStage =>
            s"Stage ${stage} is actually done; " +
              s"(available: ${stage.isAvailable}," +
              s"available outputs: ${stage.numAvailableOutputs}," +
              s"partitions: ${stage.numPartitions})"
          case stage: ResultStage =>
            s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
        }
        logDebug(debugString)
      }
    }

We will not dig deeper into that method here; instead, look at taskScheduler.submitTasks (TaskSchedulerImpl):
    override def submitTasks(taskSet: TaskSet) {
      val tasks = taskSet.tasks
      logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
      this.synchronized {
        // Create a TaskSetManager. It tracks every task in this TaskSet; as tasks finish,
        // the TaskScheduler removes them from the manager.
        val manager = createTaskSetManager(taskSet, maxTaskFailures)
        val stage = taskSet.stageId
        val stageTaskSets =
          taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
        stageTaskSets(taskSet.stageAttemptId) = manager
        val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
          ts.taskSet != taskSet && !ts.isZombie
        }
        if (conflictingTaskSet) {
          throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
            s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
        }
        // Add the manager to the rootPool; the pool (FIFO or FAIR) was chosen when the
        // schedulableBuilder was initialized.
        schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

        if (!isLocal && !hasReceivedTask) {
          starvationTimer.scheduleAtFixedRate(new TimerTask() {
            override def run() {
              if (!hasLaunchedTask) {
                logWarning("Initial job has not accepted any resources; " +
                  "check your cluster UI to ensure that workers are registered " +
                  "and have sufficient resources")
              } else {
                this.cancel()
              }
            }
          }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
        }
        hasReceivedTask = true
      }
      // Kick off execution through the backend. The backend is a concrete SchedulerBackend
      // implementation, e.g. CoarseGrainedSchedulerBackend in yarn-cluster mode.
      backend.reviveOffers()
    }
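
As the comment above notes, the TaskSetManager lands in the rootPool built by the schedulableBuilder, which is FIFO by default and can be switched to fair scheduling through configuration. A minimal sketch; the master URL and allocation file path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")                                     // placeholder master URL
      .setAppName("fair-pool-demo")
      .set("spark.scheduler.mode", "FAIR")                                  // use FairSchedulableBuilder instead of FIFO
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // placeholder pool definitions
    val sc = new SparkContext(conf)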


CoarseGrainedSchedulerBackend:

    override def reviveOffers() {
      driverEndpoint.send(ReviveOffers)
    }
So reviveOffers simply sends a ReviveOffers message to the driver endpoint, an actor-style RpcEndpoint owned by the SchedulerBackend. The endpoint is registered in the backend's start() method, which TaskSchedulerImpl calls while initializing:

    override def start() {
      val properties = new ArrayBuffer[(String, String)]
      for ((key, value) <- scheduler.sc.conf.getAll) {
        if (key.startsWith("spark.")) {
          properties += ((key, value))
        }
      }

      // TODO (prashant) send conf instead of properties
      // Register the driver endpoint used for communication.
      driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties))
    }

    // Create the driver-side endpoint (actor-style).
    protected def createDriverEndpoint(properties: Seq[(String, String)]): DriverEndpoint = {
      new DriverEndpoint(rpcEnv, properties)
    }




Now look at the DriverEndpoint's receive method:

    override def receive: PartialFunction[Any, Unit] = {
      case StatusUpdate(executorId, taskId, state, data) =>
        scheduler.statusUpdate(taskId, state, data.value)
        if (TaskState.isFinished(state)) {
          executorDataMap.get(executorId) match {
            case Some(executorInfo) =>
              executorInfo.freeCores += scheduler.CPUS_PER_TASK
              makeOffers(executorId)
            case None =>
              // Ignoring the update since we don't know about the executor.
              logWarning(s"Ignored task status update ($taskId state $state) " +
                s"from unknown executor with ID $executorId")
          }
        }

      // A timer set up at initialization also sends ReviveOffers to this endpoint every
      // spark.scheduler.revive.interval (1s by default), so makeOffers() runs periodically too.
      case ReviveOffers =>
        makeOffers()

      case KillTask(taskId, executorId, interruptThread) =>
        executorDataMap.get(executorId) match {
          case Some(executorInfo) =>
            executorInfo.executorEndpoint.send(KillTask(taskId, executorId, interruptThread))
          case None =>
            // Ignoring the task kill since the executor is not registered.
            logWarning(s"Attempted to kill task $taskId for unknown executor $executorId.")
        }
    }
The ReviveOffers case above leads into makeOffers():
    // Make fake resource offers on all executors
    private def makeOffers() {
      // Filter out executors under killing
      val activeExecutors = executorDataMap.filterKeys(!executorsPendingToRemove.contains(_)) // keep only live executors
      val workOffers = activeExecutors.map { case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores) // one WorkerOffer per executor
      }.toSeq
      launchTasks(scheduler.resourceOffers(workOffers))
    }
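
A WorkerOffer is just a small record of spare capacity. A hedged illustration of the kind of input resourceOffers() receives (the executor ids, hosts, and core counts are made up, and WorkerOffer itself is Spark-internal, so this only shows the shape of the data):

    // case class WorkerOffer(executorId: String, host: String, cores: Int)
    val offers = Seq(
      new WorkerOffer("exec-1", "node-a.example.com", 4),  // 4 free cores on node-a
      new WorkerOffer("exec-2", "node-b.example.com", 2)   // 2 free cores on node-b
    )
    // scheduler.resourceOffers(offers) returns, per offer, the TaskDescriptions to launch there.
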
The key piece is TaskSchedulerImpl.resourceOffers():

    // Called from the SchedulerBackend: builds the TaskDescriptions that will be sent to executors.
    def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
      // Mark each slave as alive and remember its hostname
      // Also track if new executor is added
      var newExecAvail = false
      // Walk the resources offered by the workers and update the executor-related maps.
      for (o <- offers) {
        executorIdToHost(o.executorId) = o.host
        activeExecutorIds += o.executorId
        if (!executorsByHost.contains(o.host)) {
          executorsByHost(o.host) = new HashSet[String]()
          executorAdded(o.executorId, o.host)
          newExecAvail = true
        }
        for (rack <- getRackForHost(o.host)) {
          hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
        }
      }
      // Randomly shuffle offers to avoid always placing tasks on the same set of workers.
      val shuffledOffers = Random.shuffle(offers)
      // Build a list of tasks to assign to each worker.
      val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
      val availableCpus = shuffledOffers.map(o => o.cores).toArray
      // Drain the scheduling pool into a sorted ArrayBuffer[TaskSetManager].
      val sortedTaskSets = rootPool.getSortedTaskSetQueue
      // Tasks are then scheduled with locality preference.
      for (taskSet <- sortedTaskSets) {
        logDebug("parentName: %s, name: %s, runningTasks: %s".format(
          taskSet.parent.name, taskSet.name, taskSet.runningTasks))
        if (newExecAvail) {
          taskSet.executorAdded()
        }
      }

      // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
      // of locality levels so that it gets a chance to launch local tasks on all of them.
      // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
      var launchedTask = false
      for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
        do {
          launchedTask = resourceOfferSingleTaskSet(
              taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
        } while (launchedTask)
      }

      if (tasks.size > 0) {
        hasLaunchedTask = true
      }
      return tasks
    }
Now resourceOfferSingleTaskSet():

    private def resourceOfferSingleTaskSet(
        taskSet: TaskSetManager,
        maxLocality: TaskLocality,
        shuffledOffers: Seq[WorkerOffer],
        availableCpus: Array[Int],
        tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
      var launchedTask = false
      for (i <- 0 until shuffledOffers.size) { // iterate over every executor offer
        val execId = shuffledOffers(i).executorId
        val host = shuffledOffers(i).host
        if (availableCpus(i) >= CPUS_PER_TASK) { // enough free cores for at least one task
          try {
            // resourceOffer: given this single resource (host, executor) and the locality
            // constraint handed down by the TaskScheduler, the TaskSetManager returns the most
            // suitable task, if any.
            // Internally the TaskSetManager adjusts its locality policy based on how long ago the
            // last task was successfully launched: a long gap relaxes the requirement (e.g. from
            // at-worst PROCESS_LOCAL down to at-worst NODE_LOCAL), while steady launches keep it
            // strict. This delay-scheduling behaviour raises the chance that tasks run at their
            // best locality, since resources usually reach the TaskSetManager in small batches
            // over a short period, and the moving locality threshold improves the overall
            // locality distribution.
            for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
              tasks(i) += task
              val tid = task.taskId
              taskIdToTaskSetManager(tid) = taskSet
              taskIdToExecutorId(tid) = execId
              executorsByHost(host) += execId
              availableCpus(i) -= CPUS_PER_TASK
              assert(availableCpus(i) >= 0)
              launchedTask = true
            }
          } catch {
            case e: TaskNotSerializableException =>
              logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
              // Do not offer resources for this task, but don't throw an error to allow other
              // task sets to be submitted.
              return launchedTask
          }
        }
      }
      return launchedTask
    }
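
The locality back-off described in the comment above is governed by the spark.locality.wait family of settings; a hedged example of tuning them (the values here are illustrative only):

    val conf = new SparkConf()
      .set("spark.locality.wait", "3s")          // base wait before relaxing one locality level
      .set("spark.locality.wait.process", "1s")  // give up on PROCESS_LOCAL faster
      .set("spark.locality.wait.node", "3s")     // wait longer before falling back to RACK_LOCAL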


The launchTasks() method:
    // Launch tasks returned by a set of resource offers
    private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
      for (task <- tasks.flatten) {
        val serializedTask = ser.serialize(task)
        if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
          scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
            try {
              var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
                "spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
                "spark.akka.frameSize or using broadcast variables for large values."
              msg = msg.format(task.taskId, task.index, serializedTask.limit, akkaFrameSize,
                AkkaUtils.reservedSizeBytes)
              taskSetMgr.abort(msg)
            } catch {
              case e: Exception => logError("Exception in error callback", e)
            }
          }
        }
        else {
          val executorData = executorDataMap(task.executorId)
          executorData.freeCores -= scheduler.CPUS_PER_TASK
          // Look up the executor's endpoint and send it the LaunchTask message (ExecutorBackend side).
          executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
        }
      }
    }
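
The size check above compares the serialized task against the Akka frame size. A hedged configuration sketch (the value shown is arbitrary; spark.akka.frameSize is specified in MB):

    val conf = new SparkConf()
      .set("spark.akka.frameSize", "128")  // MB; raise only if tasks legitimately carry large serialized data,
                                           // otherwise prefer broadcast variables as the error message suggests
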
On the executor side, here is CoarseGrainedExecutorBackend's receive method:
    override def receive: PartialFunction[Any, Unit] = {
      case RegisteredExecutor =>
        logInfo("Successfully registered with driver")
        val (hostname, _) = Utils.parseHostPort(hostPort)
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)

      case RegisterExecutorFailed(message) =>
        logError("Slave registration failed: " + message)
        System.exit(1)

      case LaunchTask(data) => // a task arrives from the driver
        if (executor == null) {
          logError("Received LaunchTask command but executor was null")
          System.exit(1)
        } else {
          val taskDesc = ser.deserialize[TaskDescription](data.value)
          logInfo("Got assigned task " + taskDesc.taskId)
          // Hand the task to the Executor for execution.
          executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
            taskDesc.name, taskDesc.serializedTask)
        }

      case KillTask(taskId, _, interruptThread) =>
        if (executor == null) {
          logError("Received KillTask command but executor was null")
          System.exit(1)
        } else {
          executor.killTask(taskId, interruptThread)
        }

      case StopExecutor =>
        logInfo("Driver commanded a shutdown")
        executor.stop()
        stop()
        rpcEnv.shutdown()
    }
The Executor holds a thread pool; every task dispatched to it is run on a thread from that pool.
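
A hedged sketch of that pattern, not the actual Executor code: the real class wraps each task in a TaskRunner, while the made-up MiniExecutor below uses a plain Runnable just to show the thread-pool shape.

    import java.util.concurrent.Executors

    object MiniExecutor {
      private val threadPool = Executors.newCachedThreadPool()

      def launchTask(taskId: Long, serializedTask: Array[Byte]): Unit = {
        threadPool.execute(new Runnable {
          override def run(): Unit = {
            // The real code deserializes the Task, calls task.run(), and reports a
            // StatusUpdate back to the driver through the ExecutorBackend.
            println(s"running task $taskId (${serializedTask.length} bytes)")
          }
        })
      }
    }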



Brief summary:

DAGScheduler:
    1. handleJobSubmitted(): wraps the final RDD of the submitted job into a ResultStage.
    2. submitStage(): starting from that ResultStage, recursively submits the ShuffleMapStages built from its parent RDDs.
    3. submitMissingTasks(): turns each stage into ShuffleMapTasks or ResultTasks, wraps them into a TaskSet, and submits it through the TaskScheduler.

TaskScheduler:
    1. submitTasks(): wraps the TaskSet into a TaskSetManager, adds it to the scheduling pool, and asks the backend to revive offers so the tasks can reach the executors.
    2. resourceOffers(): shuffles the WorkerOffers, pulls the sorted TaskSetManagers out of the pool, and calls each manager's resourceOffer() to pick the most suitable task for every node, packaged as TaskDescriptions.

SchedulerBackend:
    1. makeOffers(): builds WorkerOffers from the executors registered with the backend and feeds them to TaskScheduler.resourceOffers().
    2. launchTasks(): serializes the TaskDescriptions returned by resourceOffers() and sends them to the corresponding ExecutorBackends.

Executor:
    1. A pool thread deserializes the Task inside run() and then invokes the task's run() method.
    2. run() calls runTask() on the concrete subclass (ShuffleMapTask or ResultTask).
    3. runTask() calls the RDD's iterator() method, which iterates over and computes the parent RDDs.
    4. Inside iterator(), if the RDD's storage level is not NONE, the RDD has been cached and should be stored in the BlockManager, so CacheManager.getOrCompute() is called to fetch or compute it.

Source: ITPUB blog, link: http://blog.itpub.net/29754888/viewspace-1813916/. Please credit the source when reposting.
