Source code analysis: how dolphinscheduler implements failover when a master goes down

Posted by 明月照江江 on 2024-03-10

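DS (dolphinscheduler) masters are decentralized, and failover is carried out by the masters themselves. So do multiple masters perform failover at the same time, or is a single master elected to do it?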

Let's go back to the source code to find out.

1. The master startup method


@PostConstruct
public void run() throws SchedulerException {

....

    this.failoverExecuteThread.start();

....
}

In other words, every master starts its own failoverExecuteThread to handle failover. Let's look at its internal logic.

2. The failoverExecuteThread run method

public void run() {
    // when startup, wait 10s for ready
    ThreadUtils.sleep(Constants.SLEEP_TIME_MILLIS * 10);

    while (!ServerLifeCycleManager.isStopped()) {
        try {
            if (!ServerLifeCycleManager.isRunning()) {
                continue;
            }
            // The failover check happens here; the failover itself is also executed inside this call
            masterFailoverService.checkMasterFailover();
        } catch (Exception e) {
            log.error("Master failover thread execute error", e);
        } finally {
            // Interval between periodic failover checks
            ThreadUtils.sleep(masterConfig.getFailoverInterval().toMillis());
        }
    }
}

3. masterFailoverService.checkMasterFailover()

public void checkMasterFailover() {
    List<String> needFailoverMasterHosts = processService.queryNeedFailoverProcessInstanceHost()
            .stream()
            // failover myself || dead server
            .filter(host -> localAddress.equals(host)
                    || !registryClient.checkNodeExists(host, RegistryNodeType.MASTER))
            .distinct()
            .collect(Collectors.toList());
    if (CollectionUtils.isEmpty(needFailoverMasterHosts)) {
        return;
    }
    log.info("Master failover service {} begin to failover hosts:{}", localAddress, needFailoverMasterHosts);

    for (String needFailoverMasterHost : needFailoverMasterHosts) {
        failoverMaster(needFailoverMasterHost);
    }
}

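processService.queryNeedFailoverProcessInstanceHost returns the hosts of all workflow instances that are in one of the following states: submitted successfully, running, delayed execution, ready to pause, or ready to stop. failoverMaster later compares against these.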

private static final int[] NEED_FAILOVER_STATES = new int[]{
        SUBMITTED_SUCCESS.getCode(),
        RUNNING_EXECUTION.getCode(),
        DELAY_EXECUTION.getCode(),
        READY_PAUSE.getCode(),
        READY_STOP.getCode()
};
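
The host filter in checkMasterFailover can be tried out in isolation. The following is a minimal sketch, assuming the registry lookup is replaced by an in-memory set of live masters; the class FailoverHostFilter, the method selectHosts, and the sample addresses are hypothetical and not part of dolphinscheduler.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class FailoverHostFilter {

    // Keep a host if it is this master's own address ("failover myself")
    // or if it is no longer registered as an alive master ("dead server").
    // aliveMasters is a stand-in for registryClient.checkNodeExists(...).
    static List<String> selectHosts(List<String> instanceHosts,
                                    String localAddress,
                                    Set<String> aliveMasters) {
        return instanceHosts.stream()
                .filter(host -> localAddress.equals(host) || !aliveMasters.contains(host))
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hosts recorded on workflow instances that are in a need-failover state
        List<String> hosts = Arrays.asList(
                "10.0.0.1:5678", "10.0.0.2:5678", "10.0.0.3:5678", "10.0.0.3:5678");
        // Masters currently registered; 10.0.0.3 is missing, so it counts as dead
        Set<String> alive = new HashSet<>(Arrays.asList("10.0.0.1:5678", "10.0.0.2:5678"));

        // The local master is 10.0.0.1
        System.out.println(selectHosts(hosts, "10.0.0.1:5678", alive));
        // prints [10.0.0.1:5678, 10.0.0.3:5678]
    }
}
```

Note that the local address always passes the filter, even though this master is alive. Whether its instances are really failed over is decided later, in checkProcessInstanceNeedFailover.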

4. failoverMaster: the real entry point for checking and executing failover

public void failoverMaster(String masterHost) {
    String failoverPath = RegistryNodeType.MASTER_FAILOVER_LOCK.getRegistryPath() + "/" + masterHost;
    try {
        // Step 1: acquire the distributed lock
        registryClient.getLock(failoverPath);
        // Lock acquired: do the work
        doFailoverMaster(masterHost);
    } catch (Exception e) {
        log.error("Master server failover failed, host:{}", masterHost, e);
    } finally {
        // Release the lock
        registryClient.releaseLock(failoverPath);
    }
}

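In other words, a Zookeeper distributed lock guarantees that only one master at a time performs failover for a given host (the lock path contains masterHost).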
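
The effect of the per-host lock can be simulated in a single JVM. This is a rough sketch under stated assumptions: a ReentrantLock per path stands in for the Zookeeper lock, a concurrent set stands in for the database rows that still need failover, and the lock path prefix, class name, and method names are all hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class PerHostFailoverLock {

    // Stand-in for the registry: one lock object per failover path
    private static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();
    // Stand-in for the database: hosts whose instances still need failover
    private static final Set<String> PENDING_HOSTS = ConcurrentHashMap.newKeySet();
    // Counts how many masters actually performed the failover work
    private static final AtomicInteger HANDLED = new AtomicInteger();

    static void failoverMaster(String masterHost) {
        // Hypothetical path prefix; the real one comes from RegistryNodeType
        String failoverPath = "/hypothetical-failover-lock/" + masterHost;
        ReentrantLock lock = LOCKS.computeIfAbsent(failoverPath, path -> new ReentrantLock());
        lock.lock();
        try {
            // The first lock holder finds pending work; later holders find none
            if (PENDING_HOSTS.remove(masterHost)) {
                HANDLED.incrementAndGet();
            }
        } finally {
            lock.unlock();
        }
    }

    // Every simulated master detects the same dead host and tries to fail it over
    static int simulate(int masterCount, String deadHost) {
        PENDING_HOSTS.add(deadHost);
        HANDLED.set(0);
        Thread[] masters = new Thread[masterCount];
        for (int i = 0; i < masterCount; i++) {
            masters[i] = new Thread(() -> failoverMaster(deadHost));
            masters[i].start();
        }
        for (Thread master : masters) {
            try {
                master.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return HANDLED.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(4, "10.0.0.9:5678")); // prints 1
    }
}
```

All four simulated masters enter failoverMaster, but they pass through the lock one after another, and only the first still finds work to do.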

5. doFailoverMaster

// Some non-core code has been removed here
private void doFailoverMaster(@NonNull String masterHost) {
    // Get the startup time of the given master from zk; if the master is already down, the Optional is empty
    Optional<Date> masterStartupTimeOptional =
            getServerStartupTime(registryClient.getServerList(RegistryNodeType.MASTER),
                    masterHost);
    // Get the list of all workflow instances handled by the given master
    List<ProcessInstance> needFailoverProcessInstanceList = processService.queryNeedFailoverProcessInstances(
            masterHost);
    if (CollectionUtils.isEmpty(needFailoverProcessInstanceList)) {
        return;
    }


    for (ProcessInstance processInstance : needFailoverProcessInstanceList) {
        try {
            // Check whether this workflow instance needs to be failed over
            if (!checkProcessInstanceNeedFailover(masterStartupTimeOptional, processInstance)) {
                continue;
            }

            ProcessInstanceMetrics.incProcessInstanceByStateAndProcessDefinitionCode("failover",
                    processInstance.getProcessDefinitionCode().toString());
            // Handle the instance that needs failover
            processService.processNeedFailoverProcessInstances(processInstance);
        } finally {
        }
    }

}

6. checkProcessInstanceNeedFailover

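The logic of checkProcessInstanceNeedFailover is simple: it checks whether the instance needs to be failed over. For example, it checks whether the workflow instance's start time is earlier than the startup time of the corresponding master; if it is, the master has been restarted since, so the instance needs failover. There are a few other checks of this kind.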
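
The two cases mentioned so far (master down, master restarted) can be sketched as follows. This is a simplified illustration, not the actual method body: the class NeedFailoverCheck and the method needFailover are hypothetical, and the real check covers more conditions than shown here.

```java
import java.util.Date;
import java.util.Optional;

public class NeedFailoverCheck {

    // masterStartupTime is empty when the master is no longer registered.
    static boolean needFailover(Optional<Date> masterStartupTime, Date instanceStartTime) {
        if (!masterStartupTime.isPresent()) {
            // The master is down: everything it was handling must be failed over
            return true;
        }
        // The instance started before the master's current startup time,
        // so the master restarted in between and lost the instance
        return instanceStartTime.before(masterStartupTime.get());
    }

    public static void main(String[] args) {
        Date masterStartup = new Date(2_000L);
        System.out.println(needFailover(Optional.empty(), new Date(1_000L)));           // true
        System.out.println(needFailover(Optional.of(masterStartup), new Date(1_000L))); // true
        System.out.println(needFailover(Optional.of(masterStartup), new Date(3_000L))); // false
    }
}
```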

7. processService.processNeedFailoverProcessInstances

public void processNeedFailoverProcessInstances(ProcessInstance processInstance) {
    // updateProcessInstance host is null to mark this processInstance has been failover
    // and insert a failover command
    processInstance.setHost(Constants.NULL);
    processInstanceMapper.updateById(processInstance);

    // 2 insert into recover command
    Command cmd = new Command();
    cmd.setProcessDefinitionCode(processInstance.getProcessDefinitionCode());
    cmd.setProcessDefinitionVersion(processInstance.getProcessDefinitionVersion());
    // Note: the id of the current instance is written here so it can be reused later
    cmd.setProcessInstanceId(processInstance.getId());
    cmd.setCommandParam(JSONUtils.toJsonString(createCommandParams(processInstance)));
    cmd.setExecutorId(processInstance.getExecutorId());
    cmd.setCommandType(CommandType.RECOVER_TOLERANCE_FAULT_PROCESS);
    cmd.setProcessInstancePriority(processInstance.getProcessInstancePriority());
    cmd.setTestFlag(processInstance.getTestFlag());
    commandService.createCommand(cmd);
}

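This part is simple: set the instance's host to null, then create a new Command (RECOVER_TOLERANCE_FAULT_PROCESS). The masters poll for commands, and whichever master picks up this Command executes it, which completes the failover. Because the Command carries the instance id, that master can look up the same processInstance and continue with it.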
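
The handoff through the Command can be sketched with in-memory structures. This is only an illustration under stated assumptions: a map stands in for the process instance table, a queue stands in for the command table, and the way the polling master claims the instance (writing its own address into host) is an assumption of this sketch. All class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class FailoverCommandSketch {

    static class Instance {
        final int id;
        String host;
        Instance(int id, String host) { this.id = id; this.host = host; }
    }

    static class Command {
        final int processInstanceId;
        final String commandType;
        Command(int processInstanceId, String commandType) {
            this.processInstanceId = processInstanceId;
            this.commandType = commandType;
        }
    }

    // Stand-ins for the process instance table and the command table
    static final Map<Integer, Instance> INSTANCES = new HashMap<>();
    static final Queue<Command> COMMANDS = new ArrayDeque<>();

    // Done by the master that holds the failover lock
    static void markForFailover(Instance instance) {
        instance.host = null; // the original writes Constants.NULL; plain null is used here
        COMMANDS.add(new Command(instance.id, "RECOVER_TOLERANCE_FAULT_PROCESS"));
    }

    // Done by whichever master polls the command next (assumed behavior)
    static Instance consumeOne(String pollingMaster) {
        Command command = COMMANDS.poll();
        if (command == null) {
            return null;
        }
        // The id stored in the command leads back to the same instance
        Instance instance = INSTANCES.get(command.processInstanceId);
        instance.host = pollingMaster;
        return instance;
    }

    public static void main(String[] args) {
        Instance instance = new Instance(42, "10.0.0.9:5678");
        INSTANCES.put(instance.id, instance);

        markForFailover(instance);
        Instance recovered = consumeOne("10.0.0.1:5678");
        System.out.println(recovered.id + " now runs on " + recovered.host);
    }
}
```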
