Flink -- Failover

Posted by weixin_34067049 on 2016-11-24

JobManager Failover

LeaderLatch

private synchronized void setLeadership(boolean newValue)
{
    boolean oldValue = hasLeadership.getAndSet(newValue);

    if ( oldValue && !newValue ) // was the leader, is no longer the leader: leadership lost
    { // Lost leadership, was true, now false
        listeners.forEach(new Function<LeaderLatchListener, Void>()
            {
                @Override
                public Void apply(LeaderLatchListener listener)
                {
                    listener.notLeader();
                    return null;
                }
            });
    }
    else if ( !oldValue && newValue )
    { // Gained leadership, was false, now true
        listeners.forEach(new Function<LeaderLatchListener, Void>()
            {
                @Override
                public Void apply(LeaderLatchListener input)
                {
                    input.isLeader();
                    return null;
                }
            });
    }

    notifyAll();
}

ZooKeeperLeaderElectionService

@Override
public void isLeader() {
   synchronized (lock) {
      issuedLeaderSessionID = UUID.randomUUID();


      leaderContender.grantLeadership(issuedLeaderSessionID);
   }
}

@Override
public void notLeader() {
   synchronized (lock) {
      issuedLeaderSessionID = null;
      confirmedLeaderSessionID = null;



      leaderContender.revokeLeadership();
   }
}

As shown, these callbacks simply delegate to leaderContender.grantLeadership and leaderContender.revokeLeadership respectively.
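The delegation can be sketched in isolation. This is a minimal sketch, not Flink's actual classes: `Contender`, `ElectionServiceSketch`, and `currentSessionId` are hypothetical names introduced for illustration, and only the issue-on-grant, clear-on-revoke behaviour from the snippet above is modeled.

```java
import java.util.UUID;

/** Hypothetical stand-in for the contender side of the callback pair. */
interface Contender {
    void grantLeadership(UUID leaderSessionId);
    void revokeLeadership();
}

/** Simplified election service: one fresh session ID per leadership term. */
class ElectionServiceSketch {
    private final Object lock = new Object();
    private final Contender contender;
    private UUID issuedLeaderSessionId; // null while not the leader

    ElectionServiceSketch(Contender contender) {
        this.contender = contender;
    }

    // Invoked when leadership is gained: issue a new ID and hand it to the contender.
    void isLeader() {
        synchronized (lock) {
            issuedLeaderSessionId = UUID.randomUUID();
            contender.grantLeadership(issuedLeaderSessionId);
        }
    }

    // Invoked when leadership is lost: forget the ID and tell the contender to stop.
    void notLeader() {
        synchronized (lock) {
            issuedLeaderSessionId = null;
            contender.revokeLeadership();
        }
    }

    // Accessor added for illustration only.
    UUID currentSessionId() {
        synchronized (lock) {
            return issuedLeaderSessionId;
        }
    }
}
```

Issuing a new UUID for every term gives each leadership period its own session ID, so a contender can tell the current term apart from an earlier one.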

JobManager implements the LeaderContender interface:

revokeLeadership

val newFuturesToComplete = cancelAndClearEverything(
  new Exception("JobManager is no longer the leader."))

The key step in cancelAndClearEverything is suspending the ExecutionGraph: execution stops, but the job is not removed, so another JobManager can still resubmit it.

* The SUSPENDED state is a local terminal state which stops the execution of the job but does
* not remove the job from the HA job store so that it can be recovered by another JobManager.
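The difference between a locally terminal and a globally terminal state can be sketched as an enum. This is a simplified illustration, assuming only the states mentioned in this article; `JobStatusSketch`, `Terminal`, and `removeFromHaStore` are hypothetical names, not Flink's API.

```java
/** Simplified job status; only the terminal-state distinction is modeled. */
enum JobStatusSketch {
    RUNNING(Terminal.NO),
    FAILING(Terminal.NO),
    RESTARTING(Terminal.NO),
    FINISHED(Terminal.GLOBALLY),
    CANCELED(Terminal.GLOBALLY),
    FAILED(Terminal.GLOBALLY),
    SUSPENDED(Terminal.LOCALLY); // stops here, but may continue on another JobManager

    enum Terminal { NO, LOCALLY, GLOBALLY }

    private final Terminal terminal;

    JobStatusSketch(Terminal terminal) {
        this.terminal = terminal;
    }

    boolean isTerminalState() {
        return terminal != Terminal.NO;
    }

    boolean isGloballyTerminalState() {
        return terminal == Terminal.GLOBALLY;
    }

    // Hypothetical helper: only globally terminal jobs leave the HA job store.
    boolean removeFromHaStore() {
        return isGloballyTerminalState();
    }
}
```

With this split, SUSPENDED ends execution on the current JobManager while leaving the job recoverable.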

private def cancelAndClearEverything(cause: Throwable)
  : Seq[Future[Unit]] = {
  val futures = for ((jobID, (eg, jobInfo)) <- currentJobs) yield {
    future {
      eg.suspend(cause) //suspend Execution Graph

      if (jobInfo.listeningBehaviour != ListeningBehaviour.DETACHED) {
        jobInfo.client ! decorateMessage(
          Failure(new JobExecutionException(jobID, "All jobs are cancelled and cleared.", cause)))
      }
    }(context.dispatcher)
  }

  currentJobs.clear()

  futures.toSeq
}

grantLeadership

context.system.scheduler.scheduleOnce(
  jobRecoveryTimeout,
  self,
  decorateMessage(RecoverAllJobs))(
  context.dispatcher)

The main purpose is to recover all jobs, via the RecoverAllJobs message:

case RecoverAllJobs =>
  future {
    try {
      // The ActorRef, which is part of the submitted job graph can only be
      // de-serialized in the scope of an actor system.
      akka.serialization.JavaSerializer.currentSystem.withValue(
        context.system.asInstanceOf[ExtendedActorSystem]) {

        log.info(s"Attempting to recover all jobs.")

        // read all submitted job graphs from the submittedJobGraphs store, which is backed by ZooKeeper
        val jobGraphs = submittedJobGraphs.recoverJobGraphs().asScala

        if (!leaderElectionService.hasLeadership()) {
          // we've lost leadership. mission: abort.
          log.warn(s"Lost leadership during recovery. Aborting recovery of ${jobGraphs.size} " +
            s"jobs.")
        } else {
          log.info(s"Re-submitting ${jobGraphs.size} job graphs.")

          jobGraphs.foreach{
            submittedJobGraph =>
              self ! decorateMessage(RecoverSubmittedJob(submittedJobGraph)) //recover job
          }
        }
      }
    } catch {
      case t: Throwable => log.error("Fatal error: Failed to recover jobs.", t)
    }
  }(context.dispatcher)

Recovering a single job:

case RecoverSubmittedJob(submittedJobGraph) =>
  if (!currentJobs.contains(submittedJobGraph.getJobId)) {
    submitJob(
      submittedJobGraph.getJobGraph(),
      submittedJobGraph.getJobInfo(),
      isRecovery = true)
  }
  else {
    log.info(s"Ignoring job recovery for ${submittedJobGraph.getJobId}, " +
      s"because it is already submitted.")
  }

This simply resubmits the job; note the isRecovery = true argument.

When a job is submitted with isRecovery = true, the following is executed; see the Checkpoint article for what happens afterwards:

if (isRecovery) {
  executionGraph.restoreLatestCheckpointedState()
}
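The recovery path above can be condensed into a small sketch. It assumes jobs are identified by plain strings and that "restoring checkpointed state" is just recorded in a list; `JobRecoverySketch` and all of its members are hypothetical names, and the asynchronous message passing of the real actor is replaced by direct method calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

class JobRecoverySketch {
    private final Set<String> currentJobs = new LinkedHashSet<>();
    private final List<String> restoredJobs = new ArrayList<>();

    // Normal and recovery submissions share one path; recovery also restores state.
    void submitJob(String jobId, boolean isRecovery) {
        currentJobs.add(jobId);
        if (isRecovery) {
            restoredJobs.add(jobId); // stands in for restoreLatestCheckpointedState()
        }
    }

    /** Resubmits stored jobs that are not already running; returns how many were resubmitted. */
    int recoverAllJobs(List<String> storedJobIds, BooleanSupplier hasLeadership) {
        if (!hasLeadership.getAsBoolean()) {
            return 0; // leadership was lost in the meantime: abort recovery
        }
        int resubmitted = 0;
        for (String jobId : storedJobIds) {
            if (!currentJobs.contains(jobId)) { // skip jobs that are already submitted
                submitJob(jobId, true);
                resubmitted++;
            }
        }
        return resubmitted;
    }

    List<String> restoredJobs() {
        return restoredJobs;
    }
}
```

The check against currentJobs makes recovery idempotent: running it twice resubmits nothing the second time.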

TaskManager Failover

Inside the JobManager, Akka death watch detects that a TaskManager has died:

/**
    * Handler to be executed when a task manager terminates.
    * (Akka Deathwatch or notification from ResourceManager)
    *
    * @param taskManager The ActorRef of the taskManager
    */
  private def handleTaskManagerTerminated(taskManager: ActorRef): Unit = {
    if (instanceManager.isRegistered(taskManager)) {
      log.info(s"Task manager ${taskManager.path} terminated.")

      instanceManager.unregisterTaskManager(taskManager, true)
      context.unwatch(taskManager)
    }
  }

instanceManager.unregisterTaskManager:

/**
* Unregisters the TaskManager with the given {@link ActorRef}. Unregistering means to mark
* the given instance as dead and notify {@link InstanceListener} about the dead instance.
*
* @param instanceID TaskManager which is about to be marked dead.
*/
public void unregisterTaskManager(ActorRef instanceID, boolean terminated){
    Instance instance = registeredHostsByConnection.get(instanceID);
    
    if (instance != null){
        ActorRef host = instance.getActorGateway().actor();
        
        registeredHostsByConnection.remove(host);
        registeredHostsById.remove(instance.getId());
        registeredHostsByResource.remove(instance.getResourceId());
        
        if (terminated) {
            deadHosts.add(instance.getActorGateway().actor());
        }
        
        instance.markDead();
        
        totalNumberOfAliveTaskSlots -= instance.getTotalNumberOfSlots();
        
        notifyDeadInstance(instance);
    }
}

instance.markDead()

public void markDead() {

    // create a copy of the slots to avoid concurrent modification exceptions
    List<Slot> slots;
    
    synchronized (instanceLock) {
        if (isDead) {
            return;
        }
        isDead = true;

        // no more notifications for the slot releasing
        this.slotAvailabilityListener = null;

        slots = new ArrayList<Slot>(allocatedSlots);

        allocatedSlots.clear();
        availableSlots.clear();
    }
    
    /*
    * releaseSlot must not own the instanceLock in order to avoid dead locks where a slot
    * owning the assignment group lock wants to give itself back to the instance which requires
    * the instance lock
    */
    for (Slot slot : slots) {
        slot.releaseSlot();
    }
}
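The locking pattern in markDead (snapshot under the lock, call out after releasing it) can be shown on its own. A minimal sketch, assuming slots are represented by plain `Runnable` release callbacks; `InstanceSketch`, `allocate`, and `holdsLock` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class InstanceSketch {
    private final Object instanceLock = new Object();
    private final Set<Runnable> allocatedSlots = new HashSet<>();
    private boolean dead;

    // Registers a slot, represented here only by its release callback.
    boolean allocate(Runnable releaseCallback) {
        synchronized (instanceLock) {
            if (dead) {
                return false; // a dead instance hands out no more slots
            }
            allocatedSlots.add(releaseCallback);
            return true;
        }
    }

    void markDead() {
        List<Runnable> toRelease;
        synchronized (instanceLock) {
            if (dead) {
                return; // already marked dead: nothing to do
            }
            dead = true;
            // Copy, then clear, so the callbacks can run without the lock.
            toRelease = new ArrayList<>(allocatedSlots);
            allocatedSlots.clear();
        }
        // Outside the lock: a callback may take other locks without risking a deadlock.
        for (Runnable release : toRelease) {
            release.run();
        }
    }

    // Helper for the illustration: does the calling thread hold the instance lock?
    boolean holdsLock() {
        return Thread.holdsLock(instanceLock);
    }
}
```

Because the callbacks run after the synchronized block, a slot that calls back into the instance never waits on a lock held by its own caller.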

SimpleSlot.releaseSlot

@Override 
public void releaseSlot() { 

    if (!isCanceled()) { 

        // kill all tasks currently running in this slot 
        Execution exec = this.executedTask; 
        if (exec != null && !exec.isFinished()) { 
            exec.fail(new Exception( 
                    "The slot in which the task was executed has been released. Probably loss of TaskManager " 
                            + getInstance())); 
        } 

        // release directly (if we are directly allocated), 
        // otherwise release through the parent shared slot 
        if (getParent() == null) { 
            // we have to give back the slot to the owning instance 
            if (markCancelled()) { 
                getInstance().returnAllocatedSlot(this); 
            } 
        } else { 
            // we have to ask our parent to dispose us 
            getParent().releaseChild(this); 
        }
    }
}

Execution.fail

public void fail(Throwable t) {
   processFail(t, false);
}

Execution.processFail

First, the Execution's state is set to FAILED:

transitionState(current, FAILED, t)

private boolean transitionState(ExecutionState currentState, ExecutionState targetState, Throwable error) { 

    if (STATE_UPDATER.compareAndSet(this, currentState, targetState)) {
        markTimestamp(targetState); 

        try {
            vertex.notifyStateTransition(attemptId, targetState, error);
        }
        catch (Throwable t) {
            LOG.error("Error while notifying execution graph of execution state transition.", t);
        }
        return true;
    } else {
        return false;
    }
}
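The compare-and-set idea behind transitionState can be tried out in a few lines. This sketch swaps the field updater used above for a plain `AtomicReference` and omits timestamps and notifications; `ExecutionStateSketch` and its `State` enum are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

class ExecutionStateSketch {
    enum State { RUNNING, CANCELING, CANCELED, FAILED }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

    /**
     * Moves to target only if the state is still the one the caller observed.
     * Returns false when another thread changed the state first.
     */
    boolean transitionState(State expected, State target) {
        return state.compareAndSet(expected, target);
    }

    State current() {
        return state.get();
    }
}
```

When two threads race to move the same execution out of RUNNING, only one compareAndSet succeeds; the loser gets false and can re-read the state before deciding what to do.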

After the state is set, notifyStateTransition is called:

getExecutionGraph().notifyExecutionChange(getJobvertexId(), subTaskIndex, executionId, newState, error);

void notifyExecutionChange(JobVertexID vertexId, int subtask, ExecutionAttemptID executionID, ExecutionState
                        newExecutionState, Throwable error)
{
    ExecutionJobVertex vertex = getJobVertex(vertexId);

    if (executionListenerActors.size() > 0) {
        String message = error == null ? null : ExceptionUtils.stringifyException(error);
        ExecutionGraphMessages.ExecutionStateChanged actorMessage =
                new ExecutionGraphMessages.ExecutionStateChanged(jobID, vertexId,  vertex.getJobVertex().getName(),
                                                                vertex.getParallelism(), subtask,
                                                                executionID, newExecutionState,
                                                                System.currentTimeMillis(), message);

        for (ActorGateway listener : executionListenerActors) {
            listener.tell(actorMessage);
        }
    }

    // see what this means for us. currently, the first FAILED state means -> FAILED
    if (newExecutionState == ExecutionState.FAILED) {
        fail(error);
    }
}

This mainly sends an ExecutionGraphMessages.ExecutionStateChanged message to all listeners.

The listeners are registered in the JobManager when the job is submitted:

     if (jobInfo.listeningBehaviour == ListeningBehaviour.EXECUTION_RESULT_AND_STATE_CHANGES) {
          // the sender wants to be notified about state changes
          val gateway = new AkkaActorGateway(jobInfo.client, leaderSessionID.orNull)

          executionGraph.registerExecutionListener(gateway)
          executionGraph.registerJobStatusListener(gateway)
      }

On the client side, JobClientActor merely logs and prints these messages:

if (message instanceof ExecutionGraphMessages.ExecutionStateChanged) {
    logAndPrintMessage((ExecutionGraphMessages.ExecutionStateChanged) message);
} else if (message instanceof ExecutionGraphMessages.JobStatusChanged) {
    logAndPrintMessage((ExecutionGraphMessages.JobStatusChanged) message);
}

Note that when newExecutionState == ExecutionState.FAILED, ExecutionGraph.fail is called.
As the code comment says, the first failed execution means the whole job has failed.

public void fail(Throwable t) {
    while (true) {
        JobStatus current = state;
        // stay in these states
        if (current == JobStatus.FAILING ||
            current == JobStatus.SUSPENDED ||
            current.isGloballyTerminalState()) {
            return;
        } else if (current == JobStatus.RESTARTING && transitionState(current, JobStatus.FAILED, t)) {
            synchronized (progressLock) {
                postRunCleanup();
                progressLock.notifyAll();
                return;
            }
        } else if (transitionState(current, JobStatus.FAILING, t)) { // set the job status to JobStatus.FAILING
            this.failureCause = t;

            if (!verticesInCreationOrder.isEmpty()) {
                // cancel all. what is failed will not cancel but stay failed
                for (ExecutionJobVertex ejv : verticesInCreationOrder) {
                    ejv.cancel();
                }
            } else {
                // set the state of the job to failed
                transitionState(JobStatus.FAILING, JobStatus.FAILED, t);
            }

            return;
        }

    }
}

As shown, the job status is set to FAILING directly and cancel is called on every ExecutionJobVertex.

Next, back in Execution.processFail, the execution is deregistered from the ExecutionGraph:

vertex.getExecutionGraph().deregisterExecution(this);

Execution contained = currentExecutions.remove(exec.getAttemptId());

Finally, the following is called:

vertex.executionFailed(t);

void executionFailed(Throwable t) {
    jobVertex.vertexFailed(subTaskIndex, t);
}

ExecutionJobVertex

void vertexFailed(int subtask, Throwable error) {
    subtaskInFinalState(subtask);
}

private void subtaskInFinalState(int subtask) {
    synchronized (stateMonitor) {
        if (!finishedSubtasks[subtask]) {
            finishedSubtasks[subtask] = true;
            
            if (numSubtasksInFinalState+1 == parallelism) { // check whether all subtasks of this vertex have finished
                
                // call finalizeOnMaster hook
                try {
                    getJobVertex().finalizeOnMaster(getGraph().getUserClassLoader());
                }
                catch (Throwable t) {
                    getGraph().fail(t);
                }

                numSubtasksInFinalState++;
                
                // we are in our final state
                stateMonitor.notifyAll();
                
                // tell the graph
                graph.jobVertexInFinalState();
            } else {
                numSubtasksInFinalState++;
            }
        }
    }
}
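The counting logic in subtaskInFinalState reduces to "mark each subtask once, fire when the count reaches the parallelism". A minimal sketch under that reading; `VertexCompletionSketch` and the `onAllFinished` callback are hypothetical, and the finalizeOnMaster hook is left out.

```java
class VertexCompletionSketch {
    private final boolean[] finishedSubtasks;
    private final Runnable onAllFinished;
    private int numSubtasksInFinalState;

    VertexCompletionSketch(int parallelism, Runnable onAllFinished) {
        this.finishedSubtasks = new boolean[parallelism];
        this.onAllFinished = onAllFinished;
    }

    // Safe to call repeatedly for the same subtask: only the first report counts.
    synchronized void subtaskInFinalState(int subtask) {
        if (finishedSubtasks[subtask]) {
            return;
        }
        finishedSubtasks[subtask] = true;
        numSubtasksInFinalState++;

        // The last subtask to finish notifies the graph, exactly once.
        if (numSubtasksInFinalState == finishedSubtasks.length) {
            onAllFinished.run();
        }
    }
}
```

The per-subtask flag is what keeps a duplicate report from inflating the counter and triggering the callback too early.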

graph.jobVertexInFinalState()

void jobVertexInFinalState() {
    synchronized (progressLock) {
        numFinishedJobVertices++;

        if (numFinishedJobVertices == verticesInCreationOrder.size()) { // have all job vertices finished?

            // we are done, transition to the final state
            JobStatus current;
            while (true) {
                current = this.state;

                if (current == JobStatus.RUNNING) {
                    if (transitionState(current, JobStatus.FINISHED)) {
                        postRunCleanup();
                        break;
                    }
                }
                else if (current == JobStatus.CANCELLING) {
                    if (transitionState(current, JobStatus.CANCELED)) {
                        postRunCleanup();
                        break;
                    }
                }
                else if (current == JobStatus.FAILING) {
                    boolean allowRestart = !(failureCause instanceof SuppressRestartsException);

                    if (allowRestart && restartStrategy.canRestart() && transitionState(current, JobStatus.RESTARTING)) {
                        restartStrategy.restart(this);
                        break;
                    } else if ((!allowRestart || !restartStrategy.canRestart()) && transitionState(current, JobStatus.FAILED, failureCause)) {
                        postRunCleanup();
                        break;
                    }
                }
                else if (current == JobStatus.SUSPENDED) {
                    // we've already cleaned up when entering the SUSPENDED state
                    break;
                }
                else if (current.isGloballyTerminalState()) {
                    LOG.warn("Job has entered globally terminal state without waiting for all " +
                        "job vertices to reach final state.");
                    break;
                }
                else {
                    fail(new Exception("ExecutionGraph went into final state from state " + current));
                    break;
                }
            }
            // done transitioning the state

            // also, notify waiters
            progressLock.notifyAll();
        }
    }
}

If the job status is JobStatus.FAILING and the restart conditions are met, the status moves to RESTARTING via transitionState(current, JobStatus.RESTARTING), followed by:

restartStrategy.restart(this);

The restart strategy is configurable, but whichever strategy is used, it eventually calls:

executionGraph.restart();
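To make the canRestart/restart contract concrete, here is a sketch of one possible strategy that allows a fixed number of attempts. It is an illustration only: `RestartStrategySketch` and `FixedAttemptsRestartStrategy` are hypothetical names, the execution graph is replaced by a `Runnable`, and the restart runs synchronously with no delay.

```java
/** Hypothetical, simplified restart strategy contract. */
interface RestartStrategySketch {
    boolean canRestart();
    void restart(Runnable restartAction);
}

/** Allows at most maxAttempts restarts over the lifetime of the job. */
class FixedAttemptsRestartStrategy implements RestartStrategySketch {
    private final int maxAttempts;
    private int attempts;

    FixedAttemptsRestartStrategy(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    @Override
    public boolean canRestart() {
        return attempts < maxAttempts;
    }

    @Override
    public void restart(Runnable restartAction) {
        attempts++;
        restartAction.run(); // stands in for executionGraph.restart()
    }
}
```

Once canRestart returns false, the FAILING branch shown earlier takes the other path and the job ends up FAILED.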

public void restart() {
    try {
        synchronized (progressLock) {
            JobStatus current = state;

            if (current == JobStatus.CANCELED) {
                LOG.info("Canceled job during restart. Aborting restart.");
                return;
            } else if (current == JobStatus.FAILED) {
                LOG.info("Failed job during restart. Aborting restart.");
                return;
            } else if (current == JobStatus.SUSPENDED) {
                LOG.info("Suspended job during restart. Aborting restart.");
                return;
            } else if (current != JobStatus.RESTARTING) {
                throw new IllegalStateException("Can only restart job from state restarting.");
            }

            if (scheduler == null) {
                throw new IllegalStateException("The execution graph has not been scheduled before - scheduler is null.");
            }

            this.currentExecutions.clear();

            Collection<CoLocationGroup> colGroups = new HashSet<>();

            for (ExecutionJobVertex jv : this.verticesInCreationOrder) {

                CoLocationGroup cgroup = jv.getCoLocationGroup();
                if(cgroup != null && !colGroups.contains(cgroup)){
                    cgroup.resetConstraints();
                    colGroups.add(cgroup);
                }

                jv.resetForNewExecution();
            }

            for (int i = 0; i < stateTimestamps.length; i++) {
                if (i != JobStatus.RESTARTING.ordinal()) {
                    // Only clear the non restarting state in order to preserve when the job was
                    // restarted. This is needed for the restarting time gauge
                    stateTimestamps[i] = 0;
                }
            }
            numFinishedJobVertices = 0;
            transitionState(JobStatus.RESTARTING, JobStatus.CREATED);

            // if we have checkpointed state, reload it into the executions
            if (checkpointCoordinator != null) {
                boolean restored = checkpointCoordinator
                        .restoreLatestCheckpointedState(getAllVertices(), false, false); // reload the latest checkpoint and its state

                // TODO(uce) Temporary work around to restore initial state on
                // failure during recovery. Will be superseded by FLINK-3397.
                if (!restored && savepointCoordinator != null) {
                    String savepointPath = savepointCoordinator.getSavepointRestorePath();
                    if (savepointPath != null) {
                        savepointCoordinator.restoreSavepoint(getAllVertices(), savepointPath);
                    }
                }
            }
        }

        scheduleForExecution(scheduler); // schedule the ExecutionGraph again, i.e. resubmit it
    }
    catch (Throwable t) {
        fail(t);
    }
}
