TiCDC Source Code Reading (4): How the TiCDC Scheduler Works

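This is the fourth article in the TiCDC source code reading series. It explains how the Scheduler module in TiCDC works and covers the following topics:

How the Scheduler module works
The principle of two-phase scheduling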

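Introduction to the Scheduler Module

The Scheduler is an important module inside a Changefeed. It is responsible for two things:

Distributing all the tables that a Changefeed needs to replicate across different TiCDC nodes, so that the load is balanced.
Maintaining the replication progress of each table and advancing the global replication progress of the Changefeed.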

The Scheduler code covered here lives under tiflow/cdc/scheduler/internal/v3, which contains several folders:

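Coordinator runs in the Changefeed. It is the global scheduling center of the Scheduler: it sends table scheduling tasks and maintains the overall replication state.
Agent runs in the Processor. It receives table scheduling tasks and reports the replication state of the tables on its node to the Coordinator.
Transport wraps the underlying peer-to-peer mechanism and carries network messages between the Coordinator and the Agents.
Member manages and maintains the state of the Captures in the cluster.
Replication manages the replication state of each table. A ReplicationSet records the replication information of one table, and the ReplicationManager manages all ReplicationSets.
Scheduler implements several scheduling rules, which can be triggered through the OpenAPI.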

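The following sections describe how the Scheduler module works in detail.

Tables, Table Scheduling Tasks, and Table Replication Units

TiCDC replicates data to the downstream target on a per-table basis. A table can therefore be represented by the following data structure, which captures its current replication progress.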

type Table struct {
    TableID model.TableID
    Checkpoint uint64
    ResolvedTs uint64
}

The Scheduler balances the number of tables being replicated on each TiCDC node through three kinds of table scheduling tasks: Add Table, Remove Table, and Move Table. They can be summarized as follows:

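Add Table: (TableID, Checkpoint, CaptureID). On the Capture identified by CaptureID, load the table replication unit identified by TableID and start replicating it from Checkpoint.
Remove Table: (TableID, CaptureID). Remove the table replication unit identified by TableID from the Capture identified by CaptureID.
Move Table: (TableID, Source CaptureID, Target CaptureID). Move the table replication unit identified by TableID from the Capture identified by Source CaptureID to the Capture identified by Target CaptureID.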
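The three task types can be sketched as plain Go values. This is a minimal illustration, not tiflow's actual definitions: the type names `AddTable`, `RemoveTable`, and `MoveTable` and the `Describe` method are hypothetical, and `TableID` and `CaptureID` are local stand-ins for `model.TableID` and `model.CaptureID`.

```go
package main

import "fmt"

// TableID and CaptureID are local stand-ins for model.TableID and
// model.CaptureID, declared here so that the sketch compiles on its own.
type (
	TableID   int64
	CaptureID string
)

// AddTable loads TableID on Capture and replicates it from Checkpoint.
type AddTable struct {
	TableID    TableID
	Checkpoint uint64
	Capture    CaptureID
}

// RemoveTable removes the replication unit of TableID from Capture.
type RemoveTable struct {
	TableID TableID
	Capture CaptureID
}

// MoveTable moves the replication unit of TableID from Source to Target.
type MoveTable struct {
	TableID TableID
	Source  CaptureID
	Target  CaptureID
}

// Describe is a hypothetical helper that renders each task for logging.
func (t AddTable) Describe() string {
	return fmt.Sprintf("add table %d on %s from checkpoint %d", t.TableID, t.Capture, t.Checkpoint)
}

func (t RemoveTable) Describe() string {
	return fmt.Sprintf("remove table %d from %s", t.TableID, t.Capture)
}

func (t MoveTable) Describe() string {
	return fmt.Sprintf("move table %d from %s to %s", t.TableID, t.Source, t.Target)
}

func main() {
	tasks := []interface{ Describe() string }{
		AddTable{TableID: 1, Checkpoint: 5, Capture: "capture-0"},
		MoveTable{TableID: 1, Source: "capture-0", Target: "capture-1"},
		RemoveTable{TableID: 1, Capture: "capture-1"},
	}
	for _, task := range tasks {
		fmt.Println(task.Describe())
	}
}
```

Note that only Add Table carries a Checkpoint: it is the only task that has to tell a Capture where replication should start.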

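A table replication unit is responsible for replicating the data of one table. In TiCDC it is implemented by the Table Pipeline, whose components (KV-Client, Puller, Sorter, Mounter, and Sink) appear in the two steps described below.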

When a Processor starts replicating a table, it creates a Table Pipeline for that table. This process has two parts:

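Loading the table: create the Table Pipeline and allocate the related system resources. The KV-Client pulls data from the upstream TiKV, and the Puller writes it into the Sorter, but no data is written to the downstream target system yet.
Replicating the table: once the table is loaded, start the Mounter and the Sink, which read data from the Sorter and write it to the downstream target system.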

The Processor implements the TableExecutor interface, shown below:

type TableExecutor interface {
        // AddTable add a new table with `startTs`
        // if `isPrepare` is true, the 1st phase of the 2 phase scheduling protocol.
        // if `isPrepare` is false, the 2nd phase.
        AddTable(
                ctx context.Context, tableID model.TableID, startTs model.Ts, isPrepare bool,
        ) (done bool, err error)

        // IsAddTableFinished make sure the requested table is in the proper status
        IsAddTableFinished(tableID model.TableID, isPrepare bool) (done bool)

        // RemoveTable remove the table, return true if the table is already removed
        RemoveTable(tableID model.TableID) (done bool)
        // IsRemoveTableFinished convince the table is fully stopped.
        // return false if table is not stopped
        // return true and corresponding checkpoint otherwise.
        IsRemoveTableFinished(tableID model.TableID) (model.Ts, bool)

        // GetAllCurrentTables should return all tables that are being run,
        // being added and being removed.
        //
        // NOTE: two subsequent calls to the method should return the same
        // result, unless there is a call to AddTable, RemoveTable, IsAddTableFinished
        // or IsRemoveTableFinished in between two calls to this method.
        GetAllCurrentTables() []model.TableID

        // GetCheckpoint returns the local checkpoint-ts and resolved-ts of
        // the processor. Its calculation should take into consideration all
        // tables that would have been returned if GetAllCurrentTables had been
        // called immediately before.
        GetCheckpoint() (checkpointTs, resolvedTs model.Ts)

        // GetTableStatus return the checkpoint and resolved ts for the given table
        GetTableStatus(tableID model.TableID) tablepb.TableStatus
}

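The Scheduler keeps working throughout the lifetime of a Changefeed. The Agent uses the Processor's implementation of the interface above to actually execute table scheduling tasks, to find out how far each task has progressed, and to obtain the current running state of each table replication unit, all of which feed into later scheduling decisions.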
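To make the `isPrepare` flag concrete, here is a toy in-memory executor. It is only a sketch: `toyExecutor` and `unitState` are hypothetical names, the method signatures are simplified from the interface above (no context, no startTs), and requiring a prepare call before replication can start is an assumption of this example rather than documented Processor behavior.

```go
package main

import (
	"errors"
	"fmt"
)

type unitState int

const (
	unitPrepared    unitState = iota + 1 // loaded, not yet writing downstream
	unitReplicating                      // loaded and writing downstream
)

// toyExecutor is a hypothetical in-memory stand-in for a Processor.
type toyExecutor struct {
	tables map[int64]unitState
}

// AddTable follows the two-phase shape of TableExecutor.AddTable:
// isPrepare=true only loads the table, isPrepare=false starts replication.
// Requiring a prior prepare call is an assumption of this sketch.
func (e *toyExecutor) AddTable(tableID int64, isPrepare bool) error {
	_, loaded := e.tables[tableID]
	if isPrepare {
		if !loaded {
			e.tables[tableID] = unitPrepared
		}
		return nil
	}
	if !loaded {
		return errors.New("table must be prepared before it is replicated")
	}
	e.tables[tableID] = unitReplicating
	return nil
}

// RemoveTable drops the table and reports that it is gone.
func (e *toyExecutor) RemoveTable(tableID int64) bool {
	delete(e.tables, tableID)
	return true
}

func main() {
	e := &toyExecutor{tables: map[int64]unitState{}}
	fmt.Println(e.AddTable(1, true), e.tables[1] == unitPrepared)
	fmt.Println(e.AddTable(1, false), e.tables[1] == unitReplicating)
	fmt.Println(e.RemoveTable(1), len(e.tables))
}
```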

Coordinator & Agent

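The Scheduler module consists of two parts, the Coordinator and the Agent. The Coordinator runs inside the Changefeed and the Agent runs inside the Processor, so together they form the communication interface between the Changefeed and the Processor. They exchange data over the network through a peer-to-peer framework built on gRPC. The figure below shows the communication topology of one Changefeed's Scheduler module in a cluster with 3 TiCDC nodes. The Coordinator and the Agents exchange two kinds of network messages. The message formats are defined with Protobuf, and the source is in tiflow/cdc/scheduler/schedulepb.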

image (1).png

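The first kind is the Heartbeat message. The Coordinator periodically sends a HeartbeatRequest to each Agent, and the Agent replies with a HeartbeatResponse. Its main purpose is to let the Coordinator promptly learn the replication state of all tables on the different TiCDC nodes.
The second kind is the DispatchTable message. When a table needs to be scheduled, the Coordinator sends a DispatchTableRequest to a specific Agent, which replies with a DispatchTableResponse so that the scheduling progress of each table is reported promptly.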

Next, we look at the working logic of the Coordinator and the Agent from the perspective of message passing.

How the Coordinator Works

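The Coordinator receives two kinds of messages from the Agents: HeartbeatResponse and DispatchTableResponse. CaptureM inside the Coordinator maintains the state of the Captures. Each time it receives a HeartbeatResponse, it updates the Capture state it maintains, including whether each Capture is alive and which tables each Capture is currently replicating. It also generates new HeartbeatRequest messages and sends them to all Agents. ReplicationM maintains the replication state of all tables. After receiving a HeartbeatResponse or a DispatchTableResponse, it updates the replication state of the corresponding tables according to the table information in the message. CaptureM provides the set of Captures currently alive in the cluster, and ReplicationM provides the replication state of all tables. SchedulerM takes both as input and generates table scheduling tasks, aiming to keep the number of table replication units on each Capture as balanced as possible. ReplicationM then processes these tasks further into DispatchTableRequest messages and sends them to the corresponding Agents.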
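The balancing goal can be illustrated with a small sketch. It is not the algorithm tiflow uses: `assignAbsentTables` is a hypothetical function that takes the alive Captures (what CaptureM would provide) and the current table placement (what ReplicationM would provide), and assigns every unplaced table to the least-loaded Capture, breaking ties by Capture name.

```go
package main

import (
	"fmt"
	"sort"
)

// assignAbsentTables returns a table -> capture assignment for every table in
// all that is not currently placed on an alive capture. Each table goes to
// the capture with the fewest tables at that moment.
func assignAbsentTables(captures []string, assigned map[int64]string, all []int64) map[int64]string {
	load := make(map[string]int, len(captures))
	for _, c := range captures {
		load[c] = 0
	}
	for _, c := range assigned {
		if _, alive := load[c]; alive {
			load[c]++
		}
	}
	// Sort a copy so that ties are broken deterministically by name.
	sorted := append([]string(nil), captures...)
	sort.Strings(sorted)

	tasks := make(map[int64]string)
	if len(sorted) == 0 {
		return tasks
	}
	for _, tableID := range all {
		if c, ok := assigned[tableID]; ok {
			if _, alive := load[c]; alive {
				continue // already placed on an alive capture
			}
		}
		target := sorted[0]
		for _, c := range sorted[1:] {
			if load[c] < load[target] {
				target = c
			}
		}
		tasks[tableID] = target
		load[target]++
	}
	return tasks
}

func main() {
	tasks := assignAbsentTables(
		[]string{"capture-1", "capture-0"},
		map[int64]string{1: "capture-0", 2: "capture-0"},
		[]int64{1, 2, 3, 4, 5},
	)
	fmt.Println(tasks)
}
```

Each entry of the returned map corresponds to one Add Table task that would then be turned into a DispatchTableRequest.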

image (2).png

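How the Agent Works

The Agent receives two kinds of messages from the Coordinator: HeartbeatRequest and DispatchTableRequest. For the former, the Agent collects the running state of all table replication units currently on its TiCDC node and builds a HeartbeatResponse. For the latter, it calls into the Processor to add or remove a table replication unit, obtains the progress of the table scheduling task, builds the corresponding DispatchTableResponse, and sends it to the Coordinator.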
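The heartbeat half of this logic can be sketched as follows. The types here (`tableStatus`, `heartbeatResponse`, `toyAgent`) are hypothetical simplifications and do not mirror the Protobuf messages in schedulepb; the sketch only shows an Agent gathering the state of every local table into one response.

```go
package main

import (
	"fmt"
	"sort"
)

// tableStatus is a simplified, hypothetical view of one table's state.
type tableStatus struct {
	TableID      int64
	State        string
	CheckpointTs uint64
}

// heartbeatResponse carries the status of every table on one node.
type heartbeatResponse struct {
	Tables []tableStatus
}

// toyAgent keeps the status of the table replication units on its node.
type toyAgent struct {
	tables map[int64]tableStatus
}

// handleHeartbeat collects the status of all local tables, ordered by
// table ID so that the response is deterministic.
func (a *toyAgent) handleHeartbeat() heartbeatResponse {
	resp := heartbeatResponse{Tables: make([]tableStatus, 0, len(a.tables))}
	for _, status := range a.tables {
		resp.Tables = append(resp.Tables, status)
	}
	sort.Slice(resp.Tables, func(i, j int) bool {
		return resp.Tables[i].TableID < resp.Tables[j].TableID
	})
	return resp
}

func main() {
	agent := &toyAgent{tables: map[int64]tableStatus{
		2: {TableID: 2, State: "Prepared", CheckpointTs: 7},
		1: {TableID: 1, State: "Replicating", CheckpointTs: 5},
	}}
	for _, status := range agent.handleHeartbeat().Tables {
		fmt.Printf("%+v\n", status)
	}
}
```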

image (3).png

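Computing the Changefeed Replication Progress

A Changefeed replicates multiple tables. Each table has a Checkpoint and a ResolvedTs that mark its replication progress. The Coordinator periodically collects the progress of all tables through HeartbeatResponse messages and then computes the replication progress of the Changefeed as follows: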

// AdvanceCheckpoint tries to advance checkpoint and returns current checkpoint.
func (r *Manager) AdvanceCheckpoint(currentTables []model.TableID) (newCheckpointTs, newResolvedTs model.Ts) {
    newCheckpointTs, newResolvedTs = math.MaxUint64, math.MaxUint64
    for _, tableID := range currentTables {
        table, ok := r.tables[tableID]
        if !ok {
            // Can not advance checkpoint there is a table missing.
            return checkpointCannotProceed, checkpointCannotProceed
        }
        // Find the minimum checkpoint ts and resolved ts.
        if newCheckpointTs > table.Checkpoint.CheckpointTs {
            newCheckpointTs = table.Checkpoint.CheckpointTs
        }
        if newResolvedTs > table.Checkpoint.ResolvedTs {
            newResolvedTs = table.Checkpoint.ResolvedTs
        }
    }
    return newCheckpointTs, newResolvedTs
}

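As the sample code above shows, the Checkpoint and ResolvedTs of a Changefeed are the minimum of the corresponding values across all the tables it replicates. The Changefeed's Checkpoint means that the replication progress of every table is no smaller than this value and that all data change events with timestamps smaller than it have been replicated downstream. The ResolvedTs means that TiCDC has captured all data change events with timestamps smaller than it. Another key point is that the progress can be advanced only after every table has been dispatched to a Capture and its table replication unit has been created.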
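The same minimum computation can be tried out in a self-contained form. In this sketch `tableProgress` and `advance` are simplified stand-ins for the types in the excerpt, and `cannotProceed` is a hypothetical sentinel chosen for the example, since the excerpt uses `checkpointCannotProceed` without showing its definition.

```go
package main

import (
	"fmt"
	"math"
)

// tableProgress is a simplified stand-in for a table's checkpoint record.
type tableProgress struct {
	CheckpointTs uint64
	ResolvedTs   uint64
}

// cannotProceed is a hypothetical sentinel value for this sketch only.
const cannotProceed uint64 = 0

// advance returns the minimum checkpoint and resolved ts over the tables
// listed in current, or the sentinel if any of them is not tracked yet.
func advance(current []int64, tables map[int64]tableProgress) (checkpoint, resolved uint64) {
	checkpoint, resolved = math.MaxUint64, math.MaxUint64
	for _, id := range current {
		progress, ok := tables[id]
		if !ok {
			// A table is missing, so the progress cannot be advanced.
			return cannotProceed, cannotProceed
		}
		if progress.CheckpointTs < checkpoint {
			checkpoint = progress.CheckpointTs
		}
		if progress.ResolvedTs < resolved {
			resolved = progress.ResolvedTs
		}
	}
	return checkpoint, resolved
}

func main() {
	tables := map[int64]tableProgress{
		1: {CheckpointTs: 5, ResolvedTs: 10},
		2: {CheckpointTs: 7, ResolvedTs: 8},
	}
	fmt.Println(advance([]int64{1, 2}, tables))
	fmt.Println(advance([]int64{1, 3}, tables)) // table 3 is not tracked yet
}
```

With the values above, the slowest checkpoint comes from table 1 and the slowest resolved ts from table 2, so the two minimums can come from different tables.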

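That concludes a brief introduction to the basic working principle of the Scheduler module from the perspective of message passing. Next, we look in more detail at how the Scheduler handles table scheduling tasks.

The Principle of Two-Phase Scheduling

Two-phase scheduling is how the Scheduler executes table scheduling tasks internally. Its main purpose is to reduce the impact of the Move Table operation on replication latency.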

image (4).png

The figure above shows the process of moving table X from the Capture where Agent-1 runs to the Capture where Agent-2 runs:

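The Coordinator asks Agent-2 to prepare the data of table X.
Once the data is ready, Agent-2 informs the Coordinator.
The Coordinator sends a message to Agent-1, telling it to remove the replication task of table X.
After removing the replication task of table X, Agent-1 informs the Coordinator.
The Coordinator sends another message to Agent-2, telling it to start replicating the data of table X downstream.
Agent-2 sends another message to the Coordinator, reporting that table X is now being replicated downstream.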

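The key point of this process is that, before the table is removed from the source node, the related resources are allocated on the target node and the data to be replicated is prepared there. Preparing the data is often time-consuming and is the main reason moving a table takes a long time. By preparing the table data on the target node in advance while the replication unit on the source node keeps replicating data downstream, two-phase scheduling keeps the table being replicated while it is prepared, which shortens the overall move and reduces its impact on replication latency.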
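The ordering of the six steps can be replayed in a few lines of Go. This is a sketch under simplifying assumptions: `moveTable` and `unitState` are hypothetical names, each request and response pair is collapsed into a single state change, and no network messages are modeled.

```go
package main

import (
	"errors"
	"fmt"
)

type unitState string

const (
	absent      unitState = "absent"
	prepared    unitState = "prepared"
	replicating unitState = "replicating"
)

// moveTable replays the two-phase move of one table. units maps a capture
// name to the state of the table's replication unit on that capture. It
// returns a snapshot of both captures after each pair of steps.
func moveTable(units map[string]unitState, source, target string) ([]string, error) {
	if units[source] != replicating {
		return nil, errors.New("source is not replicating the table")
	}
	if state, ok := units[target]; ok && state != absent {
		return nil, errors.New("target already has the table")
	}
	var log []string
	snapshot := func() {
		log = append(log, fmt.Sprintf("source=%s target=%s", units[source], units[target]))
	}

	units[target] = prepared // steps 1-2: target loads the table
	snapshot()
	units[source] = absent // steps 3-4: source removes the table
	snapshot()
	units[target] = replicating // steps 5-6: target starts replicating
	snapshot()
	return log, nil
}

func main() {
	units := map[string]unitState{"capture-1": replicating}
	log, err := moveTable(units, "capture-1", "capture-2")
	if err != nil {
		panic(err)
	}
	for _, line := range log {
		fmt.Println(line)
	}
}
```

The first snapshot is the important one: the source is still replicating while the target is already prepared, so the slow preparation work overlaps with normal replication.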

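ReplicationSet State Transitions

In the two-phase move described above, after the first two steps have been carried out, table X has a table replication unit on both the Capture of Agent-1 and the Capture of Agent-2. The difference is that Agent-1 is replicating the table at this point, while Agent-2 has only loaded it.

The Coordinator uses a ReplicationSet to track the state of a table's replication units across multiple Captures, and from that it maintains the table's actual replication state. The basic definition is as follows: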

// ReplicationSet is a state machine that manages replication states.
type ReplicationSet struct {
    TableID    model.TableID
    State      ReplicationSetState
    Primary    model.CaptureID
    Secondary  model.CaptureID
    Checkpoint tablepb.Checkpoint
    ...
}

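TableID uniquely identifies a table. State records the current state of the ReplicationSet. Primary records the ID of the Capture that is currently replicating the table, and Secondary records the ID of the Capture that has loaded the table but is not yet replicating its data. Checkpoint records the table's current replication progress.
While a table is being scheduled, its ReplicationSet goes through several states, as shown in the figure below: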

image (5).png

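Absent means that no node has loaded the table's replication unit.
Prepare can occur in two cases. In the first, the table is in the Absent state and Add Table is called to start loading it on some Capture. In the second, a table that is being replicated needs to be moved to another node, so a Move Table request is issued and the table is loaded on the target node.
Commit means that data ready to be replicated downstream has been prepared on at least one node.
Replicating means that exactly one node is replicating the table's data to the downstream target system.
Removing means that only one node has the table's replication unit loaded, and that it is currently stopping replication to the downstream and releasing the unit. This usually happens when DROP TABLE is executed upstream. Once the table is completely removed, the ReplicationSet returns to the Absent state.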
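The states and the transitions mentioned in this article can be written down as a small lookup table. This sketch encodes only the transitions described in the text; the real state machine may allow others (for example when a Capture goes offline), and the names `State`, `transitions`, and `canTransit` are hypothetical.

```go
package main

import "fmt"

// State is a hypothetical stand-in for ReplicationSetState.
type State int

const (
	Absent State = iota
	Prepare
	Commit
	Replicating
	Removing
)

// transitions lists only the state changes described in the article.
var transitions = map[State][]State{
	Absent:      {Prepare},           // Add Table starts loading the table
	Prepare:     {Commit},            // the loading capture reports it is ready
	Commit:      {Replicating},       // the ready capture starts replicating
	Replicating: {Prepare, Removing}, // Move Table or Remove Table
	Removing:    {Absent},            // the table is fully removed
}

// canTransit reports whether the sketch allows moving from one state to another.
func canTransit(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransit(Absent, Prepare))
	fmt.Println(canTransit(Absent, Replicating))
	fmt.Println(canTransit(Replicating, Prepare))
}
```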

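Suppose there is a table, table-0, and consider what happens to it as it is scheduled. First, consider how to load table-0 onto the Capture where Agent-0 runs and replicate its data downstream.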

image (6).png

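Initially table-0 is in the Absent state. An Add Table scheduling task is issued, asking Agent-0 to start replicating the table from checkpoint = 5. Agent-0 creates the corresponding table replication unit, establishes network connections with the Regions in the upstream TiKV cluster, and pulls data. Once data ready for downstream replication has been prepared, Agent-0 tells the Coordinator that the table replication unit is now in the Prepared state. Based on this message, the Coordinator switches the ReplicationSet from Prepare to Commit and then sends a second message to Agent-0, asking it to start replicating data to the downstream from checkpoint = 5. When Agent-0 has finished and returned its response, the Coordinator updates the ReplicationSet of table-0 again, and it enters the Replicating state.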

image (7).png

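Now consider removing table-0, as shown above. Initially the table is in the Replicating state and is being replicated on capture-0. The Coordinator sends a Remove Table request to Agent-0. Agent-0 cancels the table's replication unit through the Processor and releases the related resources. After all resources have been released, it returns a message to the Coordinator saying that the table is no longer being replicated, along with the last replicated Checkpoint. While Agent-0 is cancelling the table, the Coordinator and Agent-0 keep exchanging state through Heartbeat messages, so the Coordinator learns promptly that table-0 is in the Removing state. When it later receives the message that the table has been completely cancelled, it switches the ReplicationSet from Removing to Absent.

Finally, consider Move Table. In essence, it first performs Add Table on the target node and then Remove Table on the source node.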

image (8).png

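As the figure above shows, suppose table-0 is being replicated on capture-0 and is in the Replicating state, and it now needs to be moved from capture-0 to capture-1. First, the Coordinator moves the ReplicationSet from Replicating to Prepare and sends Agent-1 a request to add table-0. After Agent-1 has loaded the table's replication unit, it tells the Coordinator, which then updates table-0 to the Commit state. At this point table-0 is still being replicated on capture-0, and capture-1 also has its replication unit and data ready to replicate. The Coordinator then sends Remove Table to Agent-0. On receiving this instruction, Agent-0 stops and releases the replication unit of table-0 and returns the result to the Coordinator. Once the Coordinator knows that capture-0 no longer has the table's replication unit, it changes Primary from capture-0 to capture-1 and tells Agent-1 to start replicating table-0 downstream. After receiving the response from Agent-1, the Coordinator updates the state of table-0 to Replicating again.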

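These three scheduling operations show that the ReplicationSet maintained by the Coordinator records a table's replication state throughout the scheduling process, and that its state changes are driven by the messages received from the Agents. The messages also carry the Checkpoint and ResolvedTs, which are updated continuously. When processing the received Checkpoint and ResolvedTs, the Coordinator ensures that neither of them regresses.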
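One simple way to get the "never regresses" property is to accept a reported value only when it is larger than the stored one. The sketch below shows that idea; the `checkpoint` type and its `merge` method are hypothetical, and the actual Coordinator code may perform additional checks.

```go
package main

import "fmt"

// checkpoint is a simplified stand-in for a table's stored progress.
type checkpoint struct {
	CheckpointTs uint64
	ResolvedTs   uint64
}

// merge applies a reported checkpoint, ignoring any field that would move
// the stored progress backwards.
func (c *checkpoint) merge(reported checkpoint) {
	if reported.CheckpointTs > c.CheckpointTs {
		c.CheckpointTs = reported.CheckpointTs
	}
	if reported.ResolvedTs > c.ResolvedTs {
		c.ResolvedTs = reported.ResolvedTs
	}
}

func main() {
	current := checkpoint{CheckpointTs: 10, ResolvedTs: 20}
	// A stale CheckpointTs is ignored, while the newer ResolvedTs is kept.
	current.merge(checkpoint{CheckpointTs: 8, ResolvedTs: 25})
	fmt.Printf("%+v\n", current)
}
```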

Summary

That is all for this article. We hope that after reading it, readers have a basic understanding of how the Scheduler module in TiCDC works.