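DolphinScheduler (DS) masters are decentralized, and failover is handled by the masters themselves. So do several masters perform failover at the same time, or is one master elected to do it?
Let's go back to the source code to find out.
1. Master startup method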
@PostConstruct
public void run() throws SchedulerException {
....
this.failoverExecuteThread.start();
....
}
In other words, every master starts its own failoverExecuteThread to handle failover. Let's look at its internal logic.
2. failoverExecuteThread run method
public void run() {
// when startup, wait 10s for ready
ThreadUtils.sleep(Constants.SLEEP_TIME_MILLIS * 10);
while (!ServerLifeCycleManager.isStopped()) {
try {
if (!ServerLifeCycleManager.isRunning()) {
continue;
}
// Check for failover here; the failover itself is also performed inside this call
masterFailoverService.checkMasterFailover();
} catch (Exception e) {
log.error("Master failover thread execute error", e);
} finally {
// Interval between periodic failover checks
ThreadUtils.sleep(masterConfig.getFailoverInterval().toMillis());
}
}
}
3. masterFailoverService.checkMasterFailover()
public void checkMasterFailover() {
List<String> needFailoverMasterHosts = processService.queryNeedFailoverProcessInstanceHost()
.stream()
// failover myself || dead server
.filter(host -> localAddress.equals(host)
|| !registryClient.checkNodeExists(host, RegistryNodeType.MASTER))
.distinct()
.collect(Collectors.toList());
if (CollectionUtils.isEmpty(needFailoverMasterHosts)) {
return;
}
log.info("Master failover service {} begin to failover hosts:{}", localAddress, needFailoverMasterHosts);
for (String needFailoverMasterHost : needFailoverMasterHosts) {
failoverMaster(needFailoverMasterHost);
}
}
processService.queryNeedFailoverProcessInstanceHost
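This fetches the hosts of all workflow instances that are in the submitted, running, delayed-execution, ready-to-pause, or ready-to-stop state, for the later check in failoverMaster.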
private static final int[] NEED_FAILOVER_STATES = new int[]{
SUBMITTED_SUCCESS.getCode(),
RUNNING_EXECUTION.getCode(),
DELAY_EXECUTION.getCode(),
READY_PAUSE.getCode(),
READY_STOP.getCode()
};
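The host filter in checkMasterFailover can be sketched in isolation as follows. This is a minimal sketch, not the DolphinScheduler implementation: the class name `FailoverHostFilter`, the method `selectHosts`, and the `aliveMasters` set (standing in for `registryClient.checkNodeExists`) are hypothetical, and the addresses are made up.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class FailoverHostFilter {

    // Keep a host if it is this master's own address, or if it is no longer
    // registered as an alive master. Duplicates are removed, order is preserved.
    static List<String> selectHosts(List<String> hostsWithUnfinishedInstances,
                                    String localAddress,
                                    Set<String> aliveMasters) {
        return hostsWithUnfinishedInstances.stream()
                .filter(host -> localAddress.equals(host) || !aliveMasters.contains(host))
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hosts that still own unfinished workflow instances (assumed sample data).
        List<String> hosts = List.of(
                "10.0.0.1:5678", "10.0.0.2:5678", "10.0.0.3:5678", "10.0.0.3:5678");
        // Masters currently registered (assumed): 10.0.0.3 has gone away.
        Set<String> alive = Set.of("10.0.0.1:5678", "10.0.0.2:5678");

        // Seen from master 10.0.0.1: itself plus the dead host, but not the healthy peer.
        System.out.println(selectHosts(hosts, "10.0.0.1:5678", alive));
        // prints [10.0.0.1:5678, 10.0.0.3:5678]
    }
}
```

Note that every alive master computes its own candidate list this way, so several masters can select the same dead host in the same round.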
4. failoverMaster: the real entry point for checking and performing failover
public void failoverMaster(String masterHost) {
String failoverPath = RegistryNodeType.MASTER_FAILOVER_LOCK.getRegistryPath() + "/" + masterHost;
try {
// Step 1: acquire the distributed lock
registryClient.getLock(failoverPath);
// Lock acquired, do the work
doFailoverMaster(masterHost);
} catch (Exception e) {
log.error("Master server failover failed, host:{}", masterHost, e);
} finally {
// Release the lock
registryClient.releaseLock(failoverPath);
}
}
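In other words, a ZooKeeper distributed lock guarantees that only one master at a time performs this work. Because the lock path ends with masterHost, the lock is per failed host.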
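The effect of the per-host lock can be illustrated with a small simulation. This is a sketch under stated assumptions: an in-JVM `ReentrantLock` per path stands in for the registry lock, threads stand in for masters, and the class name, lock path, and host address are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class PerHostFailoverLock {

    // Hypothetical stand-in for the registry: one lock object per lock path.
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    // Counters used only to observe how many callers are inside the critical section.
    private final AtomicInteger inCriticalSection = new AtomicInteger();
    private final AtomicInteger maxConcurrent = new AtomicInteger();

    void failoverMaster(String masterHost) {
        String path = "/failover-lock/" + masterHost; // illustrative path, not the real one
        ReentrantLock lock = locks.computeIfAbsent(path, p -> new ReentrantLock());
        lock.lock(); // acquired before try, so unlock only runs if the lock is held
        try {
            int now = inCriticalSection.incrementAndGet();
            maxConcurrent.accumulateAndGet(now, Math::max);
            // doFailoverMaster(masterHost) would run here
            inCriticalSection.decrementAndGet();
        } finally {
            lock.unlock();
        }
    }

    // Lets `masters` threads compete for the same dead host and reports the
    // highest number of threads ever observed inside the critical section.
    static int simulate(int masters) {
        PerHostFailoverLock registry = new PerHostFailoverLock();
        ExecutorService pool = Executors.newFixedThreadPool(masters);
        for (int i = 0; i < masters; i++) {
            pool.submit(() -> registry.failoverMaster("10.0.0.3:5678"));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return registry.maxConcurrent.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(8)); // prints 1
    }
}
```

All eight simulated masters attempt the failover, but they pass through the critical section one at a time. Whoever comes second still runs the body, which is why the later per-instance check matters.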
5. doFailoverMaster
// Some non-core code has been removed here
private void doFailoverMaster(@NonNull String masterHost) {
// Get the startup time of the given master from ZK; if that master is down, the Optional is empty
Optional<Date> masterStartupTimeOptional =
getServerStartupTime(registryClient.getServerList(RegistryNodeType.MASTER),
masterHost);
// Get the list of all workflow instances handled by the given master
List<ProcessInstance> needFailoverProcessInstanceList = processService.queryNeedFailoverProcessInstances(
masterHost);
if (CollectionUtils.isEmpty(needFailoverProcessInstanceList)) {
return;
}
for (ProcessInstance processInstance : needFailoverProcessInstanceList) {
try {
// Check whether this workflow instance needs to be failed over
if (!checkProcessInstanceNeedFailover(masterStartupTimeOptional, processInstance)) {
continue;
}
ProcessInstanceMetrics.incProcessInstanceByStateAndProcessDefinitionCode("failover",
processInstance.getProcessDefinitionCode().toString());
// Handle the instance that needs failover
processService.processNeedFailoverProcessInstances(processInstance);
} finally {
}
}
}
6. checkProcessInstanceNeedFailover
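The logic of checkProcessInstanceNeedFailover is simple: it checks whether the instance needs to be failed over. For example, if the workflow instance's start time is earlier than the startup time of its master, the master has restarted since the instance began, so the instance must be failed over. There are a few other checks as well.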
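A simplified version of that decision might look like the sketch below. It covers only two cases: the master is gone (empty Optional), or the master restarted after the instance began. The real method contains further checks that are not reproduced here, and the class and method names are hypothetical.

```java
import java.util.Date;
import java.util.Optional;

class FailoverCheckSketch {

    // Simplified decision: should an instance owned by the given master be failed over?
    static boolean needFailover(Optional<Date> masterStartupTime, Date instanceStartTime) {
        if (!masterStartupTime.isPresent()) {
            // The master is not registered: treat it as dead and fail over what it owned.
            return true;
        }
        // The master is alive: only instances that started before its current
        // startup belong to a previous run of that master process.
        return instanceStartTime.before(masterStartupTime.get());
    }

    public static void main(String[] args) {
        Date masterStartup = new Date(2_000L);
        // Dead master: fail over.
        System.out.println(needFailover(Optional.empty(), new Date(1_000L)));            // true
        // Instance started before the master's current startup: fail over.
        System.out.println(needFailover(Optional.of(masterStartup), new Date(1_000L)));  // true
        // Instance started after the master came up: leave it alone.
        System.out.println(needFailover(Optional.of(masterStartup), new Date(3_000L)));  // false
    }
}
```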
7. processService.processNeedFailoverProcessInstances
public void processNeedFailoverProcessInstances(ProcessInstance processInstance) {
// updateProcessInstance host is null to mark this processInstance has been failover
// and insert a failover command
processInstance.setHost(Constants.NULL);
processInstanceMapper.updateById(processInstance);
// 2 insert into recover command
Command cmd = new Command();
cmd.setProcessDefinitionCode(processInstance.getProcessDefinitionCode());
cmd.setProcessDefinitionVersion(processInstance.getProcessDefinitionVersion());
// Note: the current instance's id is written here so it can be reused later
cmd.setProcessInstanceId(processInstance.getId());
cmd.setCommandParam(JSONUtils.toJsonString(createCommandParams(processInstance)));
cmd.setExecutorId(processInstance.getExecutorId());
cmd.setCommandType(CommandType.RECOVER_TOLERANCE_FAULT_PROCESS);
cmd.setProcessInstancePriority(processInstance.getProcessInstancePriority());
cmd.setTestFlag(processInstance.getTestFlag());
commandService.createCommand(cmd);
}
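This is simple: set the instance's host to NULL, then create a new Command (RECOVER_TOLERANCE_FAULT_PROCESS). Failover completes when one of the masters picks up this Command while polling for commands and executes it. Because the Command carries the reused id, that master can look up the same processInstance and continue with it.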
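The hand-off through the command table can be sketched with in-memory structures. Everything here is an illustrative assumption: the `Instance` and `Command` classes, the map and queue standing in for database tables, the `"NULL"` marker standing in for Constants.NULL, and the `consumeNext` method standing in for whatever a master does when it picks up a command.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class RecoverCommandSketch {

    static final String NULL_HOST = "NULL"; // assumed stand-in for Constants.NULL

    static class Instance {
        final int id;
        String host;
        Instance(int id, String host) { this.id = id; this.host = host; }
    }

    static class Command {
        final int processInstanceId;
        Command(int processInstanceId) { this.processInstanceId = processInstanceId; }
    }

    // In-memory stand-ins for the process instance table and the command table.
    final Map<Integer, Instance> instanceTable = new HashMap<>();
    final Queue<Command> commandTable = new ArrayDeque<>();

    // Failover side: detach the instance from its old host and enqueue a
    // recovery command that carries the instance id.
    void markForFailover(Instance instance) {
        instance.host = NULL_HOST;
        commandTable.add(new Command(instance.id));
    }

    // Consumer side: a master takes the next command and resumes the SAME
    // instance, found again through the id stored in the command.
    Instance consumeNext(String newMasterHost) {
        Command cmd = commandTable.poll();
        if (cmd == null) {
            return null; // nothing to recover
        }
        Instance instance = instanceTable.get(cmd.processInstanceId);
        instance.host = newMasterHost;
        return instance;
    }

    public static void main(String[] args) {
        RecoverCommandSketch db = new RecoverCommandSketch();
        Instance inst = new Instance(42, "10.0.0.3:5678");
        db.instanceTable.put(inst.id, inst);

        db.markForFailover(inst);
        System.out.println(inst.host); // prints NULL

        Instance resumed = db.consumeNext("10.0.0.1:5678");
        System.out.println(resumed.id + " -> " + resumed.host); // prints 42 -> 10.0.0.1:5678
    }
}
```

The point of the design is that no new instance is created: the command only points back at the existing one, and the master that consumes the command becomes its new owner.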