【Spark】 Spark Job Execution Internals: Fetching Task Results

Posted by weixin_33724059 on 2019-01-08

Original article: https://blog.csdn.net/weixin_33724059/article/details/86912400

1. Producing and Serializing the Result

Once a task has finished running, TaskRunner returns its result to the Driver in the second half of its run method:

override def run(): Unit = {
    ...
    // Run the task
    val value = try {
      val res = task.run(
        taskAttemptId = taskId,
        attemptNumber = attemptNumber,
        metricsSystem = env.metricsSystem)
      threwException = false
      res
    } 
    ...
    val taskFinish = System.currentTimeMillis()
    val taskFinishCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L

    // If the task has been killed, let's fail it.
    if (task.killed) {
      throw new TaskKilledException
    }
    
    // Serialize the result value
    val resultSer = env.serializer.newInstance()
    val beforeSerialization = System.currentTimeMillis()
    val valueBytes = resultSer.serialize(value)
    val afterSerialization = System.currentTimeMillis()

    // Deserialization happens in two parts: first, we deserialize a Task object, which
    // includes the Partition. Second, Task.run() deserializes the RDD and function to be run.
    task.metrics.setExecutorDeserializeTime(
      (taskStart - deserializeStartTime) + task.executorDeserializeTime)
    task.metrics.setExecutorDeserializeCpuTime(
      (taskStartCpu - deserializeStartCpuTime) + task.executorDeserializeCpuTime)
    // We need to subtract Task.run()'s deserialization time to avoid double-counting
    task.metrics.setExecutorRunTime((taskFinish - taskStart) - task.executorDeserializeTime)
    task.metrics.setExecutorCpuTime(
      (taskFinishCpu - taskStartCpu) - task.executorDeserializeCpuTime)
    task.metrics.setJvmGCTime(computeTotalGcTime() - startGCTime)
    task.metrics.setResultSerializationTime(afterSerialization - beforeSerialization)

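    // Wrap the serialized value in a DirectTaskResult. `ser` is the closure serializer
    // (env.closureSerializer.newInstance()) created at the start of run().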
    // Note: accumulator updates must be collected after TaskMetrics is updated
    val accumUpdates = task.collectAccumulatorUpdates()
    // TODO: do not serialize value twice
    val directResult = new DirectTaskResult(valueBytes, accumUpdates)
    val serializedDirectResult = ser.serialize(directResult)
    val resultSize = serializedDirectResult.limit

    // directSend = sending directly back to the driver
    val serializedResult: ByteBuffer = {
      // Result larger than maxResultSize (default 1 GB): drop it and send back only its size
      if (maxResultSize > 0 && resultSize > maxResultSize) {
        logWarning(s"Finished $taskName (TID $taskId). Result is larger than maxResultSize " +
          s"(${Utils.bytesToString(resultSize)} > ${Utils.bytesToString(maxResultSize)}), " +
          s"dropping it.")
        ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
      // Result larger than maxDirectResultSize but within maxResultSize: store it in the
      // BlockManager and send back only the block ID
      } else if (resultSize > maxDirectResultSize) {
        val blockId = TaskResultBlockId(taskId)
        env.blockManager.putBytes(
          blockId,
          new ChunkedByteBuffer(serializedDirectResult.duplicate()),
          StorageLevel.MEMORY_AND_DISK_SER)
        logInfo(
          s"Finished $taskName (TID $taskId). $resultSize bytes result sent via BlockManager)")
        ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
      // Any other result is sent back directly
      } else {
        logInfo(s"Finished $taskName (TID $taskId). $resultSize bytes result sent to driver")
        serializedDirectResult
      }
    }
    // Tell the driver endpoint that the task has finished
    execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
    ...
}

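As the code above shows, the Executor handles a task's result differently depending on its serialized size:

(1) If the result is larger than maxResultSize (default 1 GB, set via spark.driver.maxResultSize), the executor drops it and sends back only an IndirectTaskResult carrying the size; when the driver sees that size it aborts the task set. A value of 0 disables the check.

(2) If the result is larger than maxDirectResultSize but no larger than maxResultSize, the executor stores it in its BlockManager and sends back only an IndirectTaskResult holding the block ID; the driver then fetches the bytes from that BlockManager over the network (Netty). maxDirectResultSize is the smaller of spark.task.maxDirectResultSize (default 1 MB) and spark.rpc.message.maxSize (default 128 MB).

(3) If the result is no larger than maxDirectResultSize (1 MB with the defaults), it is sent to the Driver directly inside the status update message.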
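The three-way decision can be modeled in a few lines. This is a minimal sketch, not Spark code: `ResultRouting`, `ResultRoute` and its cases are hypothetical names, and sizes are plain byte counts. It only mirrors the comparisons made in TaskRunner.run and how maxDirectResultSize is derived from the two settings.

```scala
// Simplified model of the executor-side result-size check in TaskRunner.run.
// All names here are hypothetical; Spark returns serialized Direct/IndirectTaskResult bytes.
object ResultRouting {
  sealed trait ResultRoute
  case object Dropped extends ResultRoute         // IndirectTaskResult with a size, no stored block
  case object ViaBlockManager extends ResultRoute // IndirectTaskResult pointing at a stored block
  case object Direct extends ResultRoute          // DirectTaskResult bytes inside the RPC message

  val MiB: Long = 1L << 20

  // maxDirectResultSize = min(spark.task.maxDirectResultSize, spark.rpc.message.maxSize), in bytes.
  def maxDirectResultSize(taskMaxDirect: Long, rpcMaxMessage: Long): Long =
    math.min(taskMaxDirect, rpcMaxMessage)

  // maxResultSize <= 0 means "no limit", mirroring the `maxResultSize > 0` guard.
  def route(resultSize: Long, maxResultSize: Long, maxDirect: Long): ResultRoute =
    if (maxResultSize > 0 && resultSize > maxResultSize) Dropped
    else if (resultSize > maxDirect) ViaBlockManager
    else Direct
}
```

With the defaults (1 MB and 128 MB) the direct threshold is 1 MB, so a 512 KB result travels inside the status update, while a 2 MB result goes through the BlockManager.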

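2. Sending the Result

After the task has run, TaskRunner calls execBackend.statusUpdate. In CoarseGrainedExecutorBackend this wraps the result in a StatusUpdate message and sends it to the DriverEndpoint: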

override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}
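The essential behavior is a fire-and-forget send when a driver reference exists and a drop otherwise. The sketch below assumes hypothetical stand-ins (`DriverRef`, `RecordingDriverRef`, `ExecutorBackendSketch`) in place of Spark's RpcEndpointRef and backend; where Spark only logs a warning for a dropped message, the sketch records it so the behavior can be checked.

```scala
// Hypothetical stand-ins for RpcEndpointRef and the StatusUpdate message.
object StatusSending {
  final case class StatusUpdate(executorId: String, taskId: Long, state: String, data: Array[Byte])

  trait DriverRef { def send(msg: Any): Unit }

  // Test double that just remembers what it was sent.
  final class RecordingDriverRef extends DriverRef {
    val received = scala.collection.mutable.ArrayBuffer.empty[Any]
    def send(msg: Any): Unit = received += msg
  }

  final class ExecutorBackendSketch(executorId: String, driver: Option[DriverRef]) {
    val dropped = scala.collection.mutable.ArrayBuffer.empty[StatusUpdate]

    def statusUpdate(taskId: Long, state: String, data: Array[Byte]): Unit = {
      val msg = StatusUpdate(executorId, taskId, state, data)
      driver match {
        case Some(ref) => ref.send(msg)  // one-way send, no reply expected
        case None      => dropped += msg // Spark logs a warning and drops the message
      }
    }
  }
}
```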

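3. Fetching the Result

When the DriverEndpoint (in CoarseGrainedSchedulerBackend) receives the StatusUpdate message, it passes it to the TaskScheduler through scheduler.statusUpdate. If the task has reached a finished state, it also returns the task's cores to the executor and makes new resource offers on it: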

case StatusUpdate(executorId, taskId, state, data) =>
  scheduler.statusUpdate(taskId, state, data.value)
  if (TaskState.isFinished(state)) {
    executorDataMap.get(executorId) match {
      case Some(executorInfo) =>
        executorInfo.freeCores += scheduler.CPUS_PER_TASK
        makeOffers(executorId)
      case None =>
        // Ignoring the update since we don't know about the executor.
        logWarning(s"Ignored task status update ($taskId state $state) " +
          s"from unknown executor with ID $executorId")
    }
  }
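The core bookkeeping on the driver side can be sketched as follows. `ExecutorData` and `DriverEndpointSketch` are simplified, hypothetical versions of Spark's classes; the `offersMadeFor` buffer only records where makeOffers(executorId) would be called. The set of finished states (FINISHED, FAILED, KILLED, LOST) matches TaskState.isFinished.

```scala
// Simplified model of the DriverEndpoint's StatusUpdate handling (hypothetical names).
object CoreBookkeeping {
  final class ExecutorData(var freeCores: Int)

  final class DriverEndpointSketch(cpusPerTask: Int) {
    val executorDataMap = scala.collection.mutable.Map.empty[String, ExecutorData]
    val offersMadeFor = scala.collection.mutable.ArrayBuffer.empty[String]
    val ignored = scala.collection.mutable.ArrayBuffer.empty[Long]

    private def isFinished(state: String): Boolean =
      Set("FINISHED", "FAILED", "KILLED", "LOST").contains(state)

    def onStatusUpdate(executorId: String, taskId: Long, state: String): Unit =
      if (isFinished(state)) {
        executorDataMap.get(executorId) match {
          case Some(info) =>
            info.freeCores += cpusPerTask // the task's cores become free again
            offersMadeFor += executorId   // stands in for makeOffers(executorId)
          case None =>
            ignored += taskId             // unknown executor: the update is ignored
        }
      }
  }
}
```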

TaskSchedulerImpl.statusUpdate handles each task state differently:

case Some(taskSet) =>
  if (state == TaskState.LOST) {
    // TaskState.LOST is only used by the deprecated Mesos fine-grained scheduling mode,
    // where each executor corresponds to a single task, so mark the executor as failed.
    val execId = taskIdToExecutorId.getOrElse(tid, throw new IllegalStateException(
      "taskIdToTaskSetManager.contains(tid) <=> taskIdToExecutorId.contains(tid)"))
    if (executorIdToRunningTaskIds.contains(execId)) {
      reason = Some(
        SlaveLost(s"Task $tid was lost, so marking the executor as lost as well."))
      removeExecutor(execId, reason.get)
      failedExecutor = Some(execId)
    }
  }
  if (TaskState.isFinished(state)) {
    cleanupTaskState(tid)
    taskSet.removeRunningTask(tid)
    if (state == TaskState.FINISHED) {
      taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
    } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
      taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
    }
  }
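Reduced to its decision logic, the dispatch looks roughly like this. The sketch uses hypothetical types; in particular `markExecutorLost` simplifies the LOST branch, which in Spark only removes the executor if it still has running tasks.

```scala
// Simplified model of the state dispatch in TaskSchedulerImpl.statusUpdate (hypothetical names).
object StateDispatch {
  sealed trait TaskState
  case object Running extends TaskState
  case object Finished extends TaskState
  case object Failed extends TaskState
  case object Killed extends TaskState
  case object Lost extends TaskState

  def isFinished(s: TaskState): Boolean = s match {
    case Finished | Failed | Killed | Lost => true
    case Running                           => false
  }

  sealed trait Action
  case object NoAction extends Action                               // task still running
  case object EnqueueSuccessful extends Action                      // taskResultGetter.enqueueSuccessfulTask
  final case class EnqueueFailed(markExecutorLost: Boolean) extends Action // enqueueFailedTask

  def dispatch(state: TaskState): Action =
    if (!isFinished(state)) NoAction
    else if (state == Finished) EnqueueSuccessful
    else EnqueueFailed(markExecutorLost = state == Lost)
}
```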

3.1 TaskState.FINISHED

If the state is TaskState.FINISHED, the scheduler calls taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData):

def enqueueSuccessfulTask(
    taskSetManager: TaskSetManager,
    tid: Long,
    serializedData: ByteBuffer): Unit = {
  getTaskResultExecutor.execute(new Runnable {
    override def run(): Unit = Utils.logUncaughtExceptions {
      try {
        val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
          case directResult: DirectTaskResult[_] =>
            if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
              return
            }
            // deserialize "value" without holding any lock so that it won't block other threads.
            // We should call it here, so that when it's called again in
            // "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
            directResult.value(taskResultSerializer.get())
            (directResult, serializedData.limit())
          case IndirectTaskResult(blockId, size) =>
            if (!taskSetManager.canFetchMoreResults(size)) {
              // dropped by executor if size is larger than maxResultSize
              sparkEnv.blockManager.master.removeBlock(blockId)
              return
            }
            logDebug("Fetching indirect task result for TID %s".format(tid))
            scheduler.handleTaskGettingResult(taskSetManager, tid)
            val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
            if (!serializedTaskResult.isDefined) {
              /* We won't be able to get the task result if the machine that ran the task failed
               * between when the task ended and when we tried to fetch the result, or if the
               * block manager had to flush the result. */
              scheduler.handleFailedTask(
                taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
              return
            }
            val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
              serializedTaskResult.get.toByteBuffer)
            // force deserialization of referenced value
            deserializedResult.value(taskResultSerializer.get())
            sparkEnv.blockManager.master.removeBlock(blockId)
            (deserializedResult, size)
        }

        // Set the task result size in the accumulator updates received from the executors.
        // We need to do this here on the driver because if we did this on the executors then
        // we would have to serialize the result again after updating the size.
        result.accumUpdates = result.accumUpdates.map { a =>
          if (a.name == Some(InternalAccumulator.RESULT_SIZE)) {
            val acc = a.asInstanceOf[LongAccumulator]
            assert(acc.sum == 0L, "task result size should not have been set on the executors")
            acc.setValue(size.toLong)
            acc
          } else {
            a
          }
        }

        scheduler.handleSuccessfulTask(taskSetManager, tid, result)
      } catch {
        case cnf: ClassNotFoundException =>
          val loader = Thread.currentThread.getContextClassLoader
          taskSetManager.abort("ClassNotFound with classloader: " + loader)
        // Matching NonFatal so we don't catch the ControlThrowable from the "return" above.
        case NonFatal(ex) =>
          logError("Exception while getting task result", ex)
          taskSetManager.abort("Exception while getting task result: %s".format(ex))
      }
    }
  })
}

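enqueueSuccessfulTask deserializes the TaskResult on the task-result-getter thread pool. If it is a DirectTaskResult, the value is already in hand. If it is an IndirectTaskResult, the driver fetches the bytes remotely with sparkEnv.blockManager.getRemoteBytes(blockId) and then removes the block. In both cases taskSetManager.canFetchMoreResults first checks that the running total of result sizes stays within maxResultSize.

After that, enqueueSuccessfulTask calls scheduler.handleSuccessfulTask: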
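The Direct/Indirect branching can be sketched with an in-memory map standing in for the remote BlockManager. Everything here is hypothetical (`DriverSketch`, `Fetched`, `Aborted`, `ResultLost`, the string block IDs); `canFetchMoreResults` follows the idea of TaskSetManager's method, which adds each size to a running total and aborts the task set once the total exceeds maxResultSize.

```scala
// Simplified model of enqueueSuccessfulTask's result handling (hypothetical names).
object ResultFetching {
  sealed trait TaskResult
  final case class DirectTaskResult(value: Vector[Byte]) extends TaskResult
  final case class IndirectTaskResult(blockId: String, size: Long) extends TaskResult

  sealed trait Outcome
  final case class Fetched(bytes: Vector[Byte], size: Long) extends Outcome
  case object Aborted extends Outcome    // total result size exceeded maxResultSize
  case object ResultLost extends Outcome // block vanished, like TaskResultLost

  // `remoteBlocks` stands in for blocks held by executors' BlockManagers.
  final class DriverSketch(maxResultSize: Long,
                           remoteBlocks: scala.collection.mutable.Map[String, Vector[Byte]]) {
    private var totalResultSize = 0L

    // Every result counts toward the total, whether or not it is later fetched successfully.
    def canFetchMoreResults(size: Long): Boolean = {
      totalResultSize += size
      maxResultSize <= 0 || totalResultSize <= maxResultSize
    }

    def fetch(result: TaskResult, serializedSize: Long): Outcome = result match {
      case DirectTaskResult(value) =>
        if (!canFetchMoreResults(serializedSize)) Aborted
        else Fetched(value, serializedSize)
      case IndirectTaskResult(blockId, size) =>
        if (!canFetchMoreResults(size)) { remoteBlocks.remove(blockId); Aborted }
        else {
          remoteBlocks.get(blockId) match {
            case None => ResultLost
            case Some(bytes) =>
              remoteBlocks.remove(blockId) // the driver removes the block once it has the bytes
              Fetched(bytes, size)
          }
        }
    }
  }
}
```

Note that the size used for an IndirectTaskResult is the one reported by the executor, since the driver has not downloaded the bytes yet when it performs the check.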

def handleSuccessfulTask(
    taskSetManager: TaskSetManager,
    tid: Long,
    taskResult: DirectTaskResult[_]): Unit = synchronized {
  taskSetManager.handleSuccessfulTask(tid, taskResult)
}

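Through the call chain TaskSetManager.handleSuccessfulTask → DAGScheduler.taskEnded (which posts a CompletionEvent to the DAGScheduler event loop), the result finally reaches DAGScheduler.handleTaskCompletion. If the task is a ResultTask, this method marks its output partition as finished and checks whether the whole job is done; if it is, it marks the stage as finished, cleans up the state the job depended on, and posts a SparkListenerJobEnd event to the listener bus. It also hands the result to job.listener.taskSucceeded (the JobWaiter created when the job was submitted).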

case Success =>
  stage.pendingPartitions -= task.partitionId
  task match {
    case rt: ResultTask[_, _] =>
      // Cast to ResultStage here because it's part of the ResultTask
      // TODO Refactor this out to a function that accepts a ResultStage
      val resultStage = stage.asInstanceOf[ResultStage]
      resultStage.activeJob match {
        case Some(job) =>
          if (!job.finished(rt.outputId)) {
            updateAccumulators(event)
            job.finished(rt.outputId) = true
            job.numFinished += 1
            // If the whole job has finished, remove it
            if (job.numFinished == job.numPartitions) {
              markStageAsFinished(resultStage)
              cleanupStateForJobAndIndependentStages(job)
              listenerBus.post(
                SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))
            }

            // taskSucceeded runs some user code that might throw an exception. Make sure
            // we are resilient against that.
            try {
              job.listener.taskSucceeded(rt.outputId, event.result)
            } catch {
              case e: Exception =>
                // TODO: Perhaps we want to mark the resultStage as failed?
                job.listener.jobFailed(new SparkDriverExecutionException(e))
            }
          }
        case None =>
          logInfo("Ignoring result from " + rt + " because its job has finished")
      }
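The ResultTask bookkeeping above can be sketched as follows. `ActiveJobSketch` is a hypothetical, stripped-down ActiveJob: `results` stands in for what taskSucceeded would pass to the job listener, and `ended` stands in for posting SparkListenerJobEnd. A duplicate success for an already finished partition (for example from a speculative copy) is ignored.

```scala
// Simplified model of ResultTask completion handling (hypothetical names).
object JobBookkeeping {
  final class ActiveJobSketch(val numPartitions: Int) {
    val finished = Array.fill(numPartitions)(false)
    var numFinished = 0
    val results = scala.collection.mutable.Map.empty[Int, Any] // handed to the job listener
    var ended = false                                          // stands in for SparkListenerJobEnd
  }

  /** Returns true if the result was accepted, false if the partition was already finished. */
  def onResultTaskSuccess(job: ActiveJobSketch, outputId: Int, result: Any): Boolean =
    if (job.finished(outputId)) false
    else {
      job.finished(outputId) = true
      job.numFinished += 1
      if (job.numFinished == job.numPartitions) job.ended = true
      job.results(outputId) = result
      true
    }
}
```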

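If the task is a ShuffleMapTask, its result is a MapStatus, which travels back inside a DirectTaskResult or an IndirectTaskResult like any other result. DAGScheduler.handleTaskCompletion takes this result and records the output location in the ShuffleMapStage; once the stage has no pending partitions left, it registers all the map outputs with MapOutputTrackerMaster: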

  case smt: ShuffleMapTask =>
    val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
    updateAccumulators(event)
    val status = event.result.asInstanceOf[MapStatus]
    val execId = status.location.executorId
    logDebug("ShuffleMapTask finished on " + execId)
    if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
      logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
    } else {
      shuffleStage.addOutputLoc(smt.partitionId, status)
    }

    if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
      markStageAsFinished(shuffleStage)
      logInfo("looking for newly runnable stages")
      logInfo("running: " + runningStages)
      logInfo("waiting: " + waitingStages)
      logInfo("failed: " + failedStages)

      // We supply true to increment the epoch number here in case this is a
      // recomputation of the map outputs. In that case, some nodes may have cached
      // locations with holes (from when we detected the error) and will need the
      // epoch incremented to refetch them.
      // TODO: Only increment the epoch number if this is not the first time
      //       we registered these map outputs.
      mapOutputTracker.registerMapOutputs(
        shuffleStage.shuffleDep.shuffleId,
        shuffleStage.outputLocInMapOutputTrackerFormat(),
        changeEpoch = true)

      clearCacheLocs()

      if (!shuffleStage.isAvailable) {
        // Some tasks had failed; let's resubmit this shuffleStage
        // TODO: Lower-level scheduler should also deal with this
        logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
          ") because some of its tasks had failed: " +
          shuffleStage.findMissingPartitions().mkString(", "))
        submitStage(shuffleStage)
      } else {
        // Mark any map-stage jobs waiting on this stage as finished
        if (shuffleStage.mapStageJobs.nonEmpty) {
          val stats = mapOutputTracker.getStatistics(shuffleStage.shuffleDep)
          for (job <- shuffleStage.mapStageJobs) {
            markMapStageJobAsFinished(job, stats)
          }
        }
        submitWaitingChildStages(shuffleStage)
      }
    }
}
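A stripped-down model of the ShuffleMapTask branch is sketched below, under simplifying assumptions: the stage is represented by a hypothetical `ShuffleMapStageSketch`, and the registration with MapOutputTrackerMaster (which Spark performs before the availability check) is only noted in a comment. It shows why output from an executor that failed at a later or equal epoch is skipped and how that leads to a resubmission.

```scala
// Simplified model of ShuffleMapTask completion handling (hypothetical names).
object MapOutputBookkeeping {
  final case class MapStatusSketch(executorId: String)

  final class ShuffleMapStageSketch(val numPartitions: Int) {
    val outputLocs = Array.fill[Option[MapStatusSketch]](numPartitions)(None)
    val pendingPartitions = scala.collection.mutable.Set[Int]() ++= (0 until numPartitions)
    def isAvailable: Boolean = outputLocs.forall(_.isDefined)
    def findMissingPartitions(): Seq[Int] = outputLocs.indices.filter(i => outputLocs(i).isEmpty)
  }

  sealed trait StageOutcome
  case object StillRunning extends StageOutcome
  final case class Resubmit(missing: Seq[Int]) extends StageOutcome
  case object SubmitChildren extends StageOutcome

  def onShuffleMapTaskSuccess(stage: ShuffleMapStageSketch,
                              partition: Int,
                              status: MapStatusSketch,
                              taskEpoch: Long,
                              failedEpoch: Map[String, Long]): StageOutcome = {
    stage.pendingPartitions -= partition
    // Output from an executor that failed at this task's epoch or later may be bogus: skip it.
    val bogus = failedEpoch.get(status.executorId).exists(failed => taskEpoch <= failed)
    if (!bogus) stage.outputLocs(partition) = Some(status)
    // In Spark, map outputs are registered (changeEpoch = true) before this availability check.
    if (stage.pendingPartitions.nonEmpty) StillRunning
    else if (!stage.isAvailable) Resubmit(stage.findMissingPartitions())
    else SubmitChildren
  }
}
```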

3.2 TaskState.FAILED, TaskState.KILLED, TaskState.LOST

If the state is TaskState.FAILED, TaskState.KILLED or TaskState.LOST, the scheduler calls taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData):

def enqueueFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: TaskState,
  serializedData: ByteBuffer) {
  var reason : TaskFailedReason = UnknownReason
  try {
    getTaskResultExecutor.execute(new Runnable {
      override def run(): Unit = Utils.logUncaughtExceptions {
        val loader = Utils.getContextOrSparkClassLoader
        try {
          if (serializedData != null && serializedData.limit() > 0) {
            reason = serializer.get().deserialize[TaskFailedReason](
              serializedData, loader)
          }
        } catch {
          case cnd: ClassNotFoundException =>
            // Log an error but keep going here -- the task failed, so not catastrophic
            // if we can't deserialize the reason.
            logError(
              "Could not deserialize TaskEndReason: ClassNotFound with classloader " + loader)
          case ex: Exception => // No-op
        }
        scheduler.handleFailedTask(taskSetManager, tid, taskState, reason)
      }
    })
  } catch {
    case e: RejectedExecutionException if sparkEnv.isStopped =>
      // ignore it
  }
}
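The key point of enqueueFailedTask is that a failure reason which cannot be deserialized never stops failure handling; the reason just falls back to UnknownReason. A minimal sketch, assuming a hypothetical `decode` function in place of serializer.deserialize and a reduced set of reason types:

```scala
// Simplified model of failure-reason decoding in enqueueFailedTask (hypothetical names).
object FailureReasonDecoding {
  sealed trait TaskFailedReason
  case object UnknownReason extends TaskFailedReason
  final case class ExceptionFailure(message: String) extends TaskFailedReason

  // `decode` stands in for serializer.get().deserialize[TaskFailedReason](data, loader).
  def reasonFor(data: Array[Byte], decode: Array[Byte] => TaskFailedReason): TaskFailedReason =
    if (data == null || data.isEmpty) UnknownReason
    else
      try decode(data)
      catch {
        case _: ClassNotFoundException => UnknownReason // Spark logs an error and keeps going
        case _: Exception              => UnknownReason // other errors are swallowed as well
      }
}
```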

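enqueueFailedTask then calls scheduler.handleFailedTask. The TaskSetManager records the failure, and unless the task set has become a zombie or the task was killed, backend.reviveOffers() asks for new resources so the task can be retried: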

def handleFailedTask(
    taskSetManager: TaskSetManager,
    tid: Long,
    taskState: TaskState,
    reason: TaskFailedReason): Unit = synchronized {
  taskSetManager.handleFailedTask(tid, taskState, reason)
  if (!taskSetManager.isZombie && taskState != TaskState.KILLED) {
    // Need to revive offers again now that the task set manager state has been updated to
    // reflect failed tasks that need to be re-run.
    backend.reviveOffers()
  }
}
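The retry loop can be sketched as below. This is a heavily simplified, hypothetical model: `maxTaskFailures` mirrors spark.task.maxFailures (default 4), killed tasks do not count toward the limit, reaching the limit turns the task set into a zombie (Spark aborts it), and `reviveCalls` counts where backend.reviveOffers() would be called.

```scala
// Simplified model of failure handling and retries (hypothetical names).
object RetrySketch {
  final class TaskSetSketch(numTasks: Int, maxTaskFailures: Int) {
    val numFailures = Array.fill(numTasks)(0)
    var isZombie = false
    var reviveCalls = 0
    val pending = scala.collection.mutable.Queue.empty[Int]

    def handleFailedTask(index: Int, killed: Boolean): Unit = {
      if (!killed) {
        numFailures(index) += 1
        if (numFailures(index) >= maxTaskFailures) isZombie = true // Spark aborts the task set here
      }
      if (!isZombie) pending.enqueue(index) // queue the task for another attempt
      // Mirrors TaskSchedulerImpl.handleFailedTask: revive offers unless zombie or killed.
      if (!isZombie && !killed) reviveCalls += 1
    }
  }
}
```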
