Kafka's Coordinator

Posted by devos on 2016-07-11

(Based on version 0.10)

Group Management Protocol

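The job of Kafka's coordinator is group management: managing the members of a group. Group management covers the following:

Maintaining group membership. This includes admitting new members, detecting whether members are alive, and removing members that are no longer alive.
Coordinating the behavior of the group's members.

Kafka designed a protocol for this, called the Group Management Protocol.

Clearly, what a consumer group needs can be achieved with a group management protocol. Indeed, the coordinator and this protocol were proposed and implemented to support a high-level consumer that does not depend on ZooKeeper. Kafka, however, abstracted the membership-management behavior of the high-level consumer, extracted the logic common to group management, and designed the Group Management Protocol around it. As a result the protocol serves not only the Kafka consumer (Kafka Connect currently uses it too) but can also act as the management protocol for other kinds of "group".

So what group management logic does the protocol abstract? The Javadoc of the Kafka consumer's AbstractCoordinator gives some answers.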

AbstractCoordinator

AbstractCoordinator implements group management for a single group member by interacting with a designated Kafka broker (the coordinator). Group semantics are provided by extending this class.

See ConsumerCoordinator for example usage.

From a high level, Kafka's group management protocol consists of the following sequence of actions:

Group Registration: Group members register with the coordinator providing their own metadata (such as the set of topics they are interested in).

Group/Leader Selection: The coordinator select the members of the group and chooses one member as the leader.

State Assignment: The leader collects the metadata from all the members of the group and assigns state.

Group Stabilization: Each member receives the state assigned by the leader and begins processing.

To leverage this protocol, an implementation must define the format of metadata provided by each member for group registration in metadata() and the format of the state assignment provided by the leader in performAssignment(String, String, Map) and becomes available to members in onJoinComplete(int, String, String, ByteBuffer).

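First, AbstractCoordinator is the client of the coordinator that lives on the broker. Every "the coordinator" in this comment means the broker-side coordinator, not AbstractCoordinator. The division of labor between AbstractCoordinator and the broker-side coordinator can be roughly read from the comment, which says that Kafka's group management protocol consists of the following sequence of actions:

Group Registration: group members register with the coordinator and provide their own metadata (for example, the topics a consumer wants to consume).
Group/Leader Selection: the coordinator determines which members the group contains and chooses one of them as the leader.
State Assignment: the leader collects the metadata of all members and assigns them state (which can be understood as resources or tasks).
Group Stabilization: each member receives the state assigned by the leader and begins processing.

There are three roles here: coordinator, group member, and group leader.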

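A few observations:

All members first register with the coordinator, the coordinator picks a leader, and the leader assigns state. With three roles, the division of labor is unlike Storm's nimbus and supervisor or other master-slave systems; it is closer to YARN's ResourceManager, ApplicationMaster, and NodeManager. Both designs address scalability. A single Kafka cluster may serve far more consumers and consumer groups than it has brokers, and consumers may be unstable and change frequently. Every change requires a round of coordination, and having the broker do the actual coordination work would burden it heavily. So one group member is chosen as the leader and performs the expensive coordination task. Moving this load to the client side relieves the brokers and lets the cluster support more consumer groups.
But how exactly do the leader and the followers behave? Do followers send heartbeats directly to the leader? Does the leader send state assignments directly to the followers?
1. This must differ from YARN, since Kafka has nothing like a NodeManager. So the leader, at the very least, has to send heartbeats to the coordinator.
2. YARN's RM is only responsible for resource allocation, whereas, according to the comment above, Kafka's coordinator also determines group membership. Even after a leader is chosen, the leader is not responsible for determining membership. It follows that every group member must send heartbeats to the coordinator so that the coordinator can determine the membership. Why not send heartbeats directly to the leader? Perhaps for reliability: a network partition may separate the leader from a follower. The coordinator, on the other hand, is a broker, and a member that cannot communicate with the coordinator certainly cannot be part of the group. This also means that the Group Management Protocol should not depend on reliable network communication between followers and the leader. The leader should not interact with followers directly; the group should be managed through the coordinator. This differs clearly from YARN, where every node is inside the cluster, while Kafka clients are not part of the cluster and may sit in all kinds of network environments and geographic locations.
3. A Kafka consumer in fact has to stay connected to the coordinator, because it also commits offsets to the coordinator. Since the coordinator is responsible for committed offsets, even if the leader decides the state assignment, the starting offset for each partition still has to come from the coordinator. This raises a question: does the leader request each partition's starting offset from the coordinator and hand it out as part of the state, or does the leader assign only partitions, with each follower asking the coordinator for its starting offsets?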

Answering these questions requires reading the code. The AbstractCoordinator comment is not finished; it goes on to say:

To leverage this protocol, an implementation must define the format of metadata provided by each member for group registration in metadata() and the format of the state assignment provided by the leader in performAssignment(String, String, Map) and becomes available to members in onJoinComplete(int, String, String, ByteBuffer).

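That is, an implementation of AbstractCoordinator must implement three methods: metadata(), performAssignment(String, String, Map), and onJoinComplete(int, String, String, ByteBuffer).

Starting from these three methods, we can learn some details of the Group Management Protocol.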

Metadata

metadata()

protected abstract List<ProtocolMetadata> metadata();
Get the current list of protocols and their associated metadata supported by the local member. The order of the protocols in the list indicates the preference of the protocol (the first entry is the most preferred). The coordinator takes this preference into account when selecting the generation protocol (generally more preferred protocols will be selected as long as all members support them and there is no disagreement on the preference).
Returns:
Non-empty map of supported protocols and metadata

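This method returns the protocols this group member supports, together with the metadata that applies to each protocol. The data is submitted to the coordinator, which considers the protocols supported by all members and selects one that is common to them.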
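
The selection described above can be illustrated with a small sketch. It assumes one plausible rule: keep only the protocols every member supports, let each member vote for its most preferred protocol among those, and take the one with the most votes. The class ProtocolSelectionSketch and its method are hypothetical names for illustration, not part of Kafka's API, and the broker's real selection logic may differ in detail.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not Kafka's actual broker code.
class ProtocolSelectionSketch {

    // memberProtocols: member id -> protocol names, most preferred first
    static String selectProtocol(Map<String, List<String>> memberProtocols) {
        if (memberProtocols.isEmpty())
            throw new IllegalStateException("Cannot select a protocol for an empty group");

        // Candidates are the protocols that every member supports.
        Set<String> candidates = null;
        for (List<String> supported : memberProtocols.values()) {
            if (candidates == null)
                candidates = new LinkedHashSet<>(supported);
            else
                candidates.retainAll(supported);
        }
        if (candidates.isEmpty())
            throw new IllegalStateException("Members share no common protocol");

        // Each member votes for its most preferred candidate.
        Map<String, Integer> votes = new HashMap<>();
        for (List<String> supported : memberProtocols.values()) {
            for (String protocol : supported) {
                if (candidates.contains(protocol)) {
                    votes.merge(protocol, 1, Integer::sum);
                    break;
                }
            }
        }

        // The candidate with the most votes wins; a tie keeps the first one encountered.
        String winner = null;
        for (String protocol : candidates) {
            if (winner == null || votes.getOrDefault(protocol, 0) > votes.getOrDefault(winner, 0))
                winner = protocol;
        }
        return winner;
    }

    public static void main(String[] args) {
        Map<String, List<String>> members = new HashMap<>();
        members.put("m1", List.of("range", "roundrobin"));
        members.put("m2", List.of("roundrobin", "range"));
        members.put("m3", List.of("roundrobin"));
        // Only "roundrobin" is supported by all three members.
        System.out.println(selectProtocol(members));
    }
}
```

In this sketch a member that prefers "range" still ends up on "roundrobin" when another member does not support "range", which is the "as long as all members support them" condition from the Javadoc.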

Here is ConsumerCoordinator's implementation:

    @Override
    public List<ProtocolMetadata> metadata() {
        List<ProtocolMetadata> metadataList = new ArrayList<>();
        for (PartitionAssignor assignor : assignors) {
            Subscription subscription = assignor.subscription(subscriptions.subscription());
            ByteBuffer metadata = ConsumerProtocol.serializeSubscription(subscription);
            metadataList.add(new ProtocolMetadata(assignor.name(), metadata));
        }
        return metadataList;
    }

Here the metadata the consumer provides for each protocol takes the same form: the data contained in a Subscription object. Subscription is a nested class of PartitionAssignor with two fields:

    class Subscription {
        private final List<String> topics;
        private final ByteBuffer userData;
     ...
    }

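In other words, the consumer gives the coordinator two pieces of information: (1) which topics it subscribes to, and (2) userData. For the consumer, userData is effectively an empty array. PartitionAssignor defines Subscription this way for a reason, though. What is userData for? The PartitionAssignor Javadoc explains, and it also helps show where the assignors used in ConsumerCoordinator#metadata() come from.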

PartitionAssignor

This interface is used to define custom partition assignment for use in org.apache.kafka.clients.consumer.KafkaConsumer. Members of the consumer group subscribe to the topics they are interested in and forward their subscriptions to a Kafka broker serving as the group coordinator. The coordinator selects one member to perform the group assignment and propagates the subscriptions of all members to it. Then assign(Cluster, Map) is called to perform the assignment and the results are forwarded back to each respective members. In some cases, it is useful to forward additional metadata to the assignor in order to make assignment decisions. For this, you can override subscription(Set) and provide custom userData in the returned Subscription. For example, to have a rack-aware assignor, an implementation can use this user data to forward the rackId belonging to each member.

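This comment also answers some of the questions raised earlier while analyzing the AbstractCoordinator comment. It tells us the following:

The PartitionAssignor interface defines the partition assignment strategy used by KafkaConsumer. Users can implement it to define their own strategy.
Members of a consumer group send the topics they subscribe to to the coordinator. The coordinator then selects a leader and sends it the subscriptions of all members, and the leader performs the partition assignment.
The leader calls PartitionAssignor's assign method to perform the assignment and sends the result to the coordinator, which forwards each member its part of the result.
Sometimes extra information about each consumer is needed to decide the assignment, such as the rack a consumer sits in. In that case a PartitionAssignor implementation can override subscription(Set) and supply its own userData in the returned Subscription.

I think some resource scheduling frameworks could benefit from a custom PartitionAssignor. Beyond rack awareness, they could balance load better by taking into account the number of consumers placed on each machine and the machine's capacity. It could also be used to make partition assignment "sticky", so that a consumer keeps being assigned particular partitions and can maintain local state.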
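
To make the userData idea concrete, here is a minimal sketch of how a rack id might travel as opaque bytes. It deliberately avoids the Kafka client classes: RackUserDataSketch, encodeRack, decodeRack, and membersByRack are hypothetical helpers, and the UTF-8 encoding is an assumption. In a real assignor, the encoded buffer would be supplied as the userData of the Subscription returned from subscription(Set), and the decoding would happen on the leader inside assign(Cluster, Map).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helpers; none of these names exist in the Kafka client.
class RackUserDataSketch {

    // Member side: turn the local rack id into the opaque userData bytes.
    static ByteBuffer encodeRack(String rackId) {
        return ByteBuffer.wrap(rackId.getBytes(StandardCharsets.UTF_8));
    }

    // Leader side: read the rack id back without disturbing the caller's buffer position.
    static String decodeRack(ByteBuffer userData) {
        ByteBuffer view = userData.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Leader side: group member ids by rack, as input for a rack-aware decision.
    static Map<String, List<String>> membersByRack(Map<String, ByteBuffer> userDataByMember) {
        Map<String, ByteBuffer> sortedMembers = new TreeMap<>(userDataByMember);
        Map<String, List<String>> byRack = new TreeMap<>();
        for (Map.Entry<String, ByteBuffer> entry : sortedMembers.entrySet()) {
            String rack = decodeRack(entry.getValue());
            byRack.computeIfAbsent(rack, k -> new ArrayList<>()).add(entry.getKey());
        }
        return byRack;
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> userData = new HashMap<>();
        userData.put("c0", encodeRack("rack-a"));
        userData.put("c1", encodeRack("rack-b"));
        userData.put("c2", encodeRack("rack-a"));
        System.out.println(membersByRack(userData));
    }
}
```

The coordinator never interprets these bytes; only the members and the leader need to agree on the format, which is why the protocol can stay generic.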

performAssignment

protected abstract Map<String, ByteBuffer> performAssignment(String leaderId, String protocol, Map<String, ByteBuffer> allMemberMetadata)

Perform assignment for the group. This is used by the leader to push state to all the members of the group (e.g. to push partition assignments in the case of the new consumer)

Parameters:
leaderId - The id of the leader (which is this member)
allMemberMetadata - Metadata from all members of the group

Returns:
A map from each member to their state assignment

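Here leaderId and allMemberMetadata are both sent to the leader by the coordinator in the JoinGroupResponse. The leader makes the assignment from this information, writes the result into a SyncGroupRequest, and sends it back to the coordinator, which then delivers to each member the state assigned to it.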
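
The round trip just described can be sketched as an in-memory simulation. Everything here is a simplification under stated assumptions: GroupRoundSketch and its nested Coordinator are hypothetical stand-ins, metadata and state are plain strings instead of ByteBuffers, there is no networking, and the leader is simply the first member to join.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of one join/sync round; not Kafka code.
class GroupRoundSketch {

    // Stand-in for the broker-side coordinator: it relays data but never computes an assignment.
    static class Coordinator {
        private final Map<String, String> metadataByMember = new LinkedHashMap<>();
        private Map<String, String> assignmentByMember = Map.of();

        // JoinGroup request: a member registers with its metadata.
        void join(String memberId, String metadata) {
            metadataByMember.put(memberId, metadata);
        }

        // Leader selection: the first member to join (an assumption of this sketch).
        String leaderId() {
            return metadataByMember.keySet().iterator().next();
        }

        // JoinGroup response for the leader: the metadata of all members.
        Map<String, String> allMemberMetadata() {
            return new LinkedHashMap<>(metadataByMember);
        }

        // SyncGroup request from the leader: the full assignment.
        void syncFromLeader(Map<String, String> assignment) {
            assignmentByMember = assignment;
        }

        // SyncGroup response for any member: only that member's own state.
        String stateFor(String memberId) {
            return assignmentByMember.get(memberId);
        }
    }

    // Plays the role of performAssignment on the leader: one numbered task per member.
    static Map<String, String> performAssignment(Map<String, String> allMemberMetadata) {
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        for (String memberId : allMemberMetadata.keySet())
            result.put(memberId, "task-" + i++);
        return result;
    }

    // Runs one round and returns what each member ends up receiving.
    static Map<String, String> runRound(Map<String, String> metadataByMember) {
        Coordinator coordinator = new Coordinator();
        metadataByMember.forEach(coordinator::join);

        // Only the leader runs the assignment; it hands the result back to the coordinator.
        coordinator.syncFromLeader(performAssignment(coordinator.allMemberMetadata()));

        Map<String, String> received = new LinkedHashMap<>();
        for (String memberId : metadataByMember.keySet())
            received.put(memberId, coordinator.stateFor(memberId));
        return received;
    }

    public static void main(String[] args) {
        Map<String, String> members = new LinkedHashMap<>();
        members.put("m1", "topics=t0");
        members.put("m2", "topics=t0,t1");
        System.out.println(runRound(members));
    }
}
```

Note that members never talk to each other: every piece of data passes through the coordinator, which matches the earlier reasoning that the protocol should not depend on leader-to-follower connectivity.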

Now let's look at ConsumerCoordinator's implementation of this method:

    @Override
    protected Map<String, ByteBuffer> performAssignment(String leaderId,
                                                        String assignmentStrategy,
                                                        Map<String, ByteBuffer> allSubscriptions) {
        // look up the PartitionAssignor for the protocol the coordinator selected
        PartitionAssignor assignor = lookupAssignor(assignmentStrategy);
        if (assignor == null)
            throw new IllegalStateException("Coordinator selected invalid assignment protocol: " + assignmentStrategy);

        // collect every topic the group subscribes to, and each member's own subscription
        Set<String> allSubscribedTopics = new HashSet<>();
        Map<String, Subscription> subscriptions = new HashMap<>();
        for (Map.Entry<String, ByteBuffer> subscriptionEntry : allSubscriptions.entrySet()) {
            Subscription subscription = ConsumerProtocol.deserializeSubscription(subscriptionEntry.getValue());
            subscriptions.put(subscriptionEntry.getKey(), subscription);
            allSubscribedTopics.addAll(subscription.topics());
        }

        // the leader will begin watching for changes to any of the topics the group is interested in,
        // which ensures that all metadata changes will eventually be seen
        // (that is, the leader watches metadata changes for every topic the group subscribes to)
        this.subscriptions.groupSubscribe(allSubscribedTopics);
        metadata.setTopics(this.subscriptions.groupSubscription());

        // update metadata (if needed) and keep track of the metadata used for assignment so that
        // we can check after rebalance completion whether anything has changed
        // (the metadata used for the assignment is recorded in assignmentSnapshot)
        client.ensureFreshMetadata();
        assignmentSnapshot = metadataSnapshot;

        log.debug("Performing assignment for group {} using strategy {} with subscriptions {}",
                groupId, assignor.name(), subscriptions);

        // perform the assignment. metadata.fetch() returns the current metadata; KafkaConsumer is
        // single-threaded, so it is consistent with the snapshot saved above
        Map<String, Assignment> assignment = assignor.assign(metadata.fetch(), subscriptions);

        log.debug("Finished assignment for group {}: {}", groupId, assignment);

        // build groupAssignment, which states which TopicPartitions each group member should consume
        Map<String, ByteBuffer> groupAssignment = new HashMap<>();
        for (Map.Entry<String, Assignment> assignmentEntry : assignment.entrySet()) {
            ByteBuffer buffer = ConsumerProtocol.serializeAssignment(assignmentEntry.getValue());
            groupAssignment.put(assignmentEntry.getKey(), buffer);
        }

        return groupAssignment;
    }

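There are two built-in assignors here: RangeAssignor and RoundRobinAssignor.

Both hand out a topic's partitions in order to the consumers subscribed to that topic. Below, t stands for topic, c for consumer, and p for partition; the digit after each letter is the id of the topic, consumer, or partition.

The difference between RangeAssignor and RoundRobinAssignor is whether the assignment of one topic's partitions is affected by the assignment of other topics' partitions.

RangeAssignor

For every topic, RangeAssignor starts assigning from consumer0. For example, if topic0 has 3 partitions and two consumers subscribe to it, consumer0 gets t0p0 and t0p1, and consumer1 gets t0p2.

If the two consumers also subscribe to another three-partition topic, topic1, then consumer0 additionally gets t1p0 and t1p1, and consumer1 gets t1p2. The exact algorithm is described in RangeAssignor's Javadoc.

So RangeAssignor is unfair in some situations. In the example above, if the two consumers subscribe to more three-partition topics, consumer0 will always hold twice as many partitions as consumer1.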
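
The per-topic range split can be sketched in a few lines. This is a simplified model that works on strings such as "t0p0" rather than Kafka's TopicPartition type; RangeAssignSketch and assignTopic are hypothetical names, and the assumption is that consumers are sorted by id and that the first (numPartitions mod numConsumers) consumers receive one extra partition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a per-topic range split; not the Kafka class itself.
class RangeAssignSketch {

    // Splits partitions 0..numPartitions-1 of one topic into contiguous ranges over the sorted consumers.
    static Map<String, List<String>> assignTopic(String topic, int numPartitions, List<String> consumers) {
        Map<String, List<String>> assignment = new TreeMap<>();
        if (consumers.isEmpty())
            return assignment;

        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted);

        int perConsumer = numPartitions / sorted.size();
        int extra = numPartitions % sorted.size(); // the first `extra` consumers get one more partition

        for (int i = 0; i < sorted.size(); i++) {
            int start = perConsumer * i + Math.min(i, extra);
            int length = perConsumer + (i < extra ? 1 : 0);
            List<String> owned = new ArrayList<>();
            for (int p = start; p < start + length; p++)
                owned.add(topic + "p" + p);
            assignment.put(sorted.get(i), owned);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> consumers = List.of("c0", "c1");
        // Every topic is split independently, so c0 receives the extra partition each time.
        System.out.println(assignTopic("t0", 3, consumers));
        System.out.println(assignTopic("t1", 3, consumers));
    }
}
```

Because each topic is split independently and the split always starts from the first consumer, the imbalance repeats for every topic, which is the unfairness noted above.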

RoundRobinAssignor

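RoundRobinAssignor first sorts all TopicPartitions the group subscribes to: by topic first, then by partition id within a topic. The exact algorithm is described in RoundRobinAssignor's Javadoc. For example, with two topics of three partitions each, the sorted TopicPartitions are t0p0 t0p1 t0p2 t1p0 t1p1 t1p2.

The sorted list is then assigned in turn to the consumers that subscribe to each partition's topic. For example, if c0 and c1 both subscribe to the two topics, the result is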

t0p0	t0p1	t0p2	t1p0	t1p1	t1p2
c0	c1	c0	c1	c0	c1

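So c0 gets t0p0, t0p2, t1p1, and c1 gets t0p1, t1p0, t1p2.

Now suppose there are three consumers:

c0 subscribes to t0, t1, t3.

c1 subscribes to t0, t2, t4.

c2 subscribes to t0, t2, t4.

t0 has two partitions, and every other topic has one.

Then the sorted TopicPartitions and the resulting assignment are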

t0p0	t0p1	t1p0	t2p0	t3p0	t4p0
c0	c1	c0	c1	c0	c1

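So c2 gets no partitions at all: whenever its turn comes, the next partition belongs to a topic it does not subscribe to (t1 or t3), so it is skipped. RoundRobinAssignor therefore cannot guarantee absolute fairness either, although this is a fairly extreme example.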
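
Both examples can be reproduced with a short sketch of the round-robin walk. It is a simplified model using strings such as "t0p0" instead of TopicPartition; RoundRobinAssignSketch is a hypothetical name, and the assumed rule is that a single rotating pointer moves over the sorted consumers and skips any consumer not subscribed to the current partition's topic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the round-robin walk; not the Kafka class itself.
class RoundRobinAssignSketch {

    static Map<String, List<String>> assign(Map<String, Integer> partitionsPerTopic,
                                            Map<String, Set<String>> subscriptions) {
        List<String> consumers = new ArrayList<>(new TreeSet<>(subscriptions.keySet()));
        Map<String, List<String>> assignment = new TreeMap<>();
        for (String consumer : consumers)
            assignment.put(consumer, new ArrayList<>());

        // Only topics with at least one subscriber are considered, sorted by name.
        TreeSet<String> topics = new TreeSet<>();
        for (Set<String> subscribed : subscriptions.values())
            topics.addAll(subscribed);

        int next = 0; // rotating pointer shared across all topics
        for (String topic : topics) {
            int numPartitions = partitionsPerTopic.getOrDefault(topic, 0);
            for (int p = 0; p < numPartitions; p++) {
                // Skip consumers that do not subscribe to this topic.
                while (!subscriptions.get(consumers.get(next)).contains(topic))
                    next = (next + 1) % consumers.size();
                assignment.get(consumers.get(next)).add(topic + "p" + p);
                next = (next + 1) % consumers.size();
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, Integer> partitions = new HashMap<>();
        partitions.put("t0", 2);
        for (String t : List.of("t1", "t2", "t3", "t4"))
            partitions.put(t, 1);

        Map<String, Set<String>> subscriptions = new HashMap<>();
        subscriptions.put("c0", Set.of("t0", "t1", "t3"));
        subscriptions.put("c1", Set.of("t0", "t2", "t4"));
        subscriptions.put("c2", Set.of("t0", "t2", "t4"));

        // c2 is skipped each time the pointer reaches it.
        System.out.println(assign(partitions, subscriptions));
    }
}
```

Unlike the range strategy, the pointer carries over from one topic to the next, which is why the assignment of one topic depends on the topics before it.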

onJoinComplete

    /**
     * Invoked when a group member has successfully joined a group.
     * @param generation The generation that was joined
     * @param memberId The identifier for the local member in the group
     * @param protocol The protocol selected by the coordinator
     * @param memberAssignment The assignment propagated from the group leader
     */
    protected abstract void onJoinComplete(int generation,
                                           String memberId,
                                           String protocol,
                                           ByteBuffer memberAssignment);

ConsumerCoordinator implements it as follows:

    @Override
    protected void onJoinComplete(int generation,
                                  String memberId,
                                  String assignmentStrategy,
                                  ByteBuffer assignmentBuffer) {
        // if we were the assignor, then we need to make sure that there have been no metadata updates
        // since the rebalance begin. Otherwise, we won't rebalance again until the next metadata change
        if (assignmentSnapshot != null && !assignmentSnapshot.equals(metadataSnapshot)) {
            subscriptions.needReassignment();
            return;
        }

        PartitionAssignor assignor = lookupAssignor(assignmentStrategy);
        if (assignor == null)
            throw new IllegalStateException("Coordinator selected invalid assignment protocol: " + assignmentStrategy);

        Assignment assignment = ConsumerProtocol.deserializeAssignment(assignmentBuffer);

        // set the flag to refresh last committed offsets
        subscriptions.needRefreshCommits();

        // update partition assignment
        subscriptions.assignFromSubscribed(assignment.partitions());

        // give the assignor a chance to update internal state based on the received assignment
        assignor.onAssignment(assignment);

        // reschedule the auto commit starting from now
        if (autoCommitEnabled)
            autoCommitTask.reschedule();

        // execute the user's callback after rebalance
        ConsumerRebalanceListener listener = subscriptions.listener();
        log.info("Setting newly assigned partitions {} for group {}", subscriptions.assignedPartitions(), groupId);
        try {
            Set<TopicPartition> assigned = new HashSet<>(subscriptions.assignedPartitions());
            listener.onPartitionsAssigned(assigned);
        } catch (WakeupException e) {
            throw e;
        } catch (Exception e) {
            log.error("User provided listener {} for group {} failed on partition assignment",
                    listener.getClass().getName(), groupId, e);
        }
    }

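First, the leader checks whether the metadata used for the assignment matches the current metadata. If not, it marks that another round of assignment is needed and returns.

Otherwise, it does the following:

Sets the flag indicating that the last committed offsets need to be refreshed.
Updates the set of TopicPartitions assigned to this consumer.
Calls the assignor's onAssignment method so the assignor can update its internal state.
Reschedules the auto-commit task (if auto commit is enabled), which commits offsets periodically.
Invokes the ConsumerRebalanceListener, which is the listener the user passed to KafkaConsumer.

Note that all KafkaConsumer operations run on a single thread, and most of them happen inside the poll call. So in the code above,

subscriptions.needReassignment() and subscriptions.needRefreshCommits()

only change the state of the subscriptions object; they do not actually perform the reassignment or the commit refresh. When KafkaConsumer runs poll, it checks the state of the subscriptions object and performs whatever is needed. So for these two statements in the code: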

        // set the flag to refresh last committed offsets
        subscriptions.needRefreshCommits();

        // update partition assignment
        subscriptions.assignFromSubscribed(assignment.partitions());

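by the time the commit refresh actually runs, the second statement, assignFromSubscribed, has already executed, so the refresh fetches the last committed offsets of all partitions assigned to this consumer.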
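
The "set a flag now, act on it in poll" pattern can be shown with a toy model. DeferredFlagSketch is a hypothetical class: its method names mirror those in the snippet above for readability, but the bodies are illustrative stand-ins and not the real SubscriptionState or KafkaConsumer logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of flag-based deferral on a single thread.
class DeferredFlagSketch {
    private boolean needsRefreshCommits = false;
    private List<String> assigned = new ArrayList<>();
    final List<String> actions = new ArrayList<>(); // records what poll() really did

    // Only flips a flag; no offsets are fetched here.
    void needRefreshCommits() {
        needsRefreshCommits = true;
    }

    // Takes effect immediately.
    void assignFromSubscribed(List<String> partitions) {
        assigned = new ArrayList<>(partitions);
    }

    // Same statement order as in onJoinComplete: flag first, assignment second.
    void onJoinComplete(List<String> partitions) {
        needRefreshCommits();
        assignFromSubscribed(partitions);
    }

    // The deferred work happens here, and it sees the assignment made after the flag was set.
    void poll() {
        if (needsRefreshCommits) {
            actions.add("refresh committed offsets for " + assigned);
            needsRefreshCommits = false;
        }
    }

    public static void main(String[] args) {
        DeferredFlagSketch consumer = new DeferredFlagSketch();
        consumer.onJoinComplete(List.of("t0p0", "t0p1"));
        System.out.println(consumer.actions); // still empty: nothing has run yet
        consumer.poll();
        System.out.println(consumer.actions);
    }
}
```

Since the flag is consumed only on the next poll, the relative order of the two statements inside onJoinComplete does not affect which partitions the refresh covers.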

Kafka Client-side Assignment Proposal

For the detailed behavior of the Kafka coordinator, refer to the wiki page named above.
