A Translation of the Kafka Community's KIP-500 (Removing ZooKeeper)

Published by 昔久 on 2024-10-31

Original link: https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum

Translator's note: the biggest change in Kafka 3.x is the removal of the dependency on ZooKeeper. This KIP was written by Colin, who lays out the whole process of removing ZooKeeper from a high vantage point, focusing mostly on the overall design. Understanding it helps us grasp KRaft from that same high level.

Motivation

Currently, Kafka uses ZooKeeper to store its metadata about partitions and brokers, and to elect a broker to be the Kafka Controller. We would like to remove this dependency on ZooKeeper. This will enable us to manage metadata in a more scalable and robust way, enabling support for more partitions. It will also simplify the deployment and configuration of Kafka.

Currently, Kafka uses ZooKeeper to store metadata such as partitions and brokers, and the election of the Kafka controller also relies on ZooKeeper, so we would like to remove this dependency on ZooKeeper. That will let us manage metadata in a more scalable and robust way and support more partitions, and it will also simplify Kafka's deployment and configuration.

Metadata as an Event Log

We often talk about the benefits of managing state as a stream of events. A single number, the offset, describes a consumer's position in the stream. Multiple consumers can quickly catch up to the latest state simply by replaying all the events newer than their current offset. The log establishes a clear ordering between events, and ensures that the consumers always move along a single timeline.

We often talk about the benefits of managing state as a stream of events. The offset, a single number, describes a consumer's position in the stream. Every consumer can catch up quickly simply by replaying the events newer than its current offset. The log establishes a clear ordering between events and ensures that consumers always move along a single timeline.
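
As a toy illustration of this idea (not Kafka's actual code; the class and field names below are hypothetical), the sketch rebuilds a key/value view of state by replaying only the events newer than the last applied offset:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event type: each event carries the offset it was written at.
record MetadataEvent(long offset, String key, String value) {}

class ReplayingView {
    private final Map<String, String> state = new HashMap<>();
    private long lastAppliedOffset = -1L;

    // Catch up by applying only the events newer than the current offset.
    void catchUp(List<MetadataEvent> log) {
        for (MetadataEvent event : log) {
            if (event.offset() <= lastAppliedOffset) {
                continue; // already applied
            }
            state.put(event.key(), event.value());
            lastAppliedOffset = event.offset();
        }
    }

    long position() {
        return lastAppliedOffset;
    }
}
```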

However, although our users enjoy these benefits, Kafka itself has been left out. We treat changes to metadata as isolated changes with no relationship to each other. When the controller pushes out state change notifications (such as LeaderAndIsrRequest) to other brokers in the cluster, it is possible for brokers to get some of the changes, but not all. Although the controller retries several times, it eventually gives up. This can leave brokers in a divergent state.

However, while Kafka's users enjoy these benefits, Kafka itself has been left out. We treat metadata changes as isolated changes with no relationship to each other. When the controller pushes state changes (such as LeaderAndIsrRequest) to the other brokers in the cluster, a broker may receive some of the change requests but not all of them. Although the controller retries several times, it eventually gives up, which can leave brokers in a divergent state.

Worse still, although ZooKeeper is the store of record, the state in ZooKeeper often doesn't match the state that is held in memory in the controller. For example, when a partition leader changes its ISR in ZK, the controller will typically not learn about these changes for many seconds. There is no generic way for the controller to follow the ZooKeeper event log. Although the controller can set one-shot watches, the number of watches is limited for performance reasons. When a watch triggers, it doesn't tell the controller the current state-- only that the state has changed. By the time the controller re-reads the znode and sets up a new watch, the state may have changed from what it was when the watch originally fired. If there is no watch set, the controller may not learn about the change at all. In some cases, restarting the controller is the only way to resolve the discrepancy.

Worse still, although ZooKeeper is the store of record, the state held in ZooKeeper often does not match the metadata in the controller's memory. For example, when a partition leader changes its ISR, the controller typically only learns about it many seconds later (Translator's note: the partition leader writes the ISR into ZooKeeper directly rather than going through the controller, so the controller does not necessarily hold the latest metadata). The controller has no generic way to follow ZooKeeper's event log. Although it can set one-shot watches, the number of watches is limited for performance reasons. When a ZooKeeper watch fires, it does not tell the controller what the current state is, only that something has changed. By the time the controller re-reads the znode and sets a new watch, the state may already have changed again compared with when the watch originally fired. And if no watch is set, the controller may not notice the metadata change at all. In some cases, restarting the controller becomes the only way to resolve the inconsistency. (Translator's note: this whole section is about the inconsistency problems that ZooKeeper brings.)

Rather than being stored in a separate system, metadata should be stored in Kafka itself. This will avoid all the problems associated with discrepancies between the controller state and the Zookeeper state. Rather than pushing out notifications to brokers, brokers should simply consume metadata events from the event log. This ensures that metadata changes will always arrive in the same order. Brokers will be able to store metadata locally in a file. When they start up, they will only need to read what has changed from the controller, not the full state. This will let us support more partitions with less CPU consumption.

Rather than storing metadata in a separate system, Kafka should manage it itself. This avoids all the problems caused by inconsistencies between the controller state and the ZooKeeper state. And rather than having the controller push data to the brokers, the brokers should consume metadata change events from the event log, which guarantees that the metadata changes a broker receives are always ordered. A broker can also store the metadata in a local file; when it restarts, it does not need to read the full metadata from the controller, only the incremental part. This lets us handle more partitions with less CPU.
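
A minimal sketch of that last point, with entirely hypothetical names: the broker keeps its metadata image and the offset it has applied up to in a local file, so a restart can resume from that offset instead of re-reading everything:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical local cache: the broker stores the metadata image together with
// the offset of the last record it applied, so a restart resumes from there.
class LocalMetadataCache {
    private final Path file;
    private final Properties image = new Properties();

    LocalMetadataCache(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            try (var in = Files.newInputStream(file)) {
                image.load(in); // includes the "last.applied.offset" entry
            }
        }
    }

    long lastAppliedOffset() {
        return Long.parseLong(image.getProperty("last.applied.offset", "-1"));
    }

    // Apply one metadata record and remember how far we have read.
    void apply(long offset, String key, String value) throws IOException {
        image.setProperty(key, value);
        image.setProperty("last.applied.offset", Long.toString(offset));
        try (var out = Files.newOutputStream(file)) {
            image.store(out, "broker-local metadata image");
        }
    }
}
```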

Simpler Deployment and Configuration

ZooKeeper is a separate system, with its own configuration file syntax, management tools, and deployment patterns. This means that system administrators need to learn how to manage and deploy two separate distributed systems in order to deploy Kafka. This can be a daunting task for administrators, especially if they are not very familiar with deploying Java services. Unifying the system would greatly improve the "day one" experience of running Kafka, and help broaden its adoption.

ZooKeeper is a separate system, with its own configuration files, management tools, and deployment patterns. This means that system administrators have to learn to manage and deploy two separate distributed systems at once, which can be a daunting task, especially for administrators who are not very familiar with deploying Java services. Unifying the system would greatly improve the "day one" experience of running Kafka (Translator's note: "day one" is an English idiom; the point is that the first-time experience of using Kafka becomes better) and help broaden its adoption.

Because the Kafka and ZooKeeper configurations are separate, it is easy to make mistakes. For example, administrators may set up SASL on Kafka, and incorrectly think that they have secured all of the data travelling over the network. In fact, it is also necessary to configure security in the separate, external ZooKeeper system in order to do this. Unifying the two systems would give a uniform security configuration model.

Because the Kafka and ZooKeeper configurations are separate, it is easy to make mistakes. For example, administrators may set up SASL on Kafka and wrongly believe that they have secured all of the data travelling over the network. In fact, to achieve that they would also have to configure SASL in the separate, external ZooKeeper system. Only by unifying the two systems do we get a single, uniform security configuration model.
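
For example, the Kafka-side settings below (standard kafka-clients configuration keys) secure client-to-broker traffic with SASL, but say nothing about the broker-to-ZooKeeper connection, which has to be secured separately in ZooKeeper's own configuration:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;

public class SaslOnlyOnKafka {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");
        // These settings secure traffic between clients and brokers...
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "SCRAM-SHA-256");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"admin\" password=\"admin-secret\";");
        // ...but they say nothing about broker<->ZooKeeper traffic, which must be
        // secured separately in the external ZooKeeper system.
        try (Admin admin = Admin.create(props)) {
            admin.describeCluster();
        }
    }
}
```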

Finally, in the future we may want to support a single-node Kafka mode. This would be useful for people who wanted to quickly test out Kafka without starting multiple daemons. Removing the ZooKeeper dependency makes this possible.

Finally, in the future we may want to support a single-node Kafka mode. This would be very useful for quickly testing Kafka without having to start multiple daemons, and removing the ZooKeeper dependency makes it possible.

Architecture

Introduction

This KIP presents an overall vision for a scalable post-ZooKeeper Kafka. In order to present the big picture, I have mostly left out details like RPC formats, on-disk formats, and so on. We will want to have follow-on KIPs to describe each step in greater detail. This is similar to KIP-4, which presented an overall vision which subsequent KIPs enlarged upon.

This KIP presents an overall vision for a scalable, post-ZooKeeper Kafka. To keep the focus on the big picture, I have mostly left out details such as RPC formats and on-disk formats. We expect follow-on KIPs to describe each step in more detail. This is similar to KIP-4, which presented an overall vision that subsequent KIPs then elaborated on.

Overview

[Figure: the current architecture, with four broker nodes and a three-node ZooKeeper quorum]

Currently, a Kafka cluster contains several broker nodes, and an external quorum of ZooKeeper nodes. We have pictured 4 broker nodes and 3 ZooKeeper nodes in this diagram. This is a typical size for a small cluster. The controller (depicted in orange) loads its state from the ZooKeeper quorum after it is elected. The lines extending from the controller to the other nodes in the broker represent the updates which the controller pushes, such as LeaderAndIsr and UpdateMetadata messages.

Currently, a Kafka cluster contains several broker nodes plus an external quorum of ZooKeeper nodes. The diagram above shows 4 broker nodes and 3 ZooKeeper nodes, a typical size for a small cluster. The controller (shown in orange) loads its state, including the metadata, from ZooKeeper after it is elected. The lines from the controller to the other brokers represent the updates the controller pushes, such as LeaderAndIsr and UpdateMetadata messages. (Translator's note: note that in ZooKeeper mode it is the controller that pushes metadata to the other brokers.)

Note that this diagram is slightly misleading. Other brokers besides the controller can and do communicate with ZooKeeper. So really, a line should be drawn from each broker to ZK. However, drawing that many lines would make the diagram difficult to read. Another issue which this diagram leaves out is that external command line tools and utilities can modify the state in ZooKeeper, without the involvement of the controller. As discussed earlier, these issues make it difficult to know whether the state in memory on the controller truly reflects the persistent state in ZooKeeper.

Note that this diagram is slightly misleading. Brokers other than the controller also communicate with ZooKeeper, so strictly speaking a line should be drawn from every broker to ZK; drawing that many lines, however, would make the diagram hard to read. Another issue the diagram leaves out is that external command-line tools and utilities can modify the state in ZooKeeper without involving the controller. As discussed earlier, these issues make it hard to know whether the state in the controller's memory truly reflects the persistent state in ZooKeeper.

In the proposed architecture, three controller nodes substitute for the three ZooKeeper nodes. The controller nodes and the broker nodes run in separate JVMs. The controller nodes elect a single leader for the metadata partition, shown in orange. Instead of the controller pushing out updates to the brokers, the brokers pull metadata updates from this leader. That is why the arrows point towards the controller rather than away.

In the proposed architecture, three controller nodes replace the three ZooKeeper nodes. The controller nodes and the broker nodes run in separate JVMs. The controller nodes elect a leader for the metadata partition, shown in orange, and that leader becomes the actual cluster controller (Translator's note: the metadata partition is partition 0 of the internal topic '__cluster_metadata', which only has a single partition). Instead of the controller pushing updates to the brokers, the brokers pull metadata updates from this leader. That is why the arrows point towards the controller rather than away from it.

Note that although the controller processes are logically separate from the broker processes, they need not be physically separate. In some cases, it may make sense to deploy some or all of the controller processes on the same node as the broker processes. This is similar to how ZooKeeper processes may be deployed on the same nodes as Kafka brokers today in smaller clusters. As per usual, all sorts of deployment options are possible, including running in the same JVM.

Note that although the controller processes are logically separate from the broker processes, they do not need to be physically separate. In some cases it may make sense to deploy some or all of the controller processes on the same nodes as the broker processes, similar to how ZooKeeper processes are often deployed on the same nodes as Kafka brokers in smaller clusters today. As usual, all kinds of deployment options are possible, including running in the same JVM.
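
For orientation only: the concrete configuration keys were defined in later KIPs (notably KIP-631), not in KIP-500 itself, but in today's KRaft releases the choice between a combined node and a dedicated controller looks roughly like this:

```java
import java.util.Properties;

// Rough sketch of KRaft node configuration (keys come from later KIPs such as KIP-631);
// shown here only to illustrate the deployment options described above.
public class KraftNodeConfig {
    public static Properties combinedMode() {
        Properties p = new Properties();
        p.put("process.roles", "broker,controller");   // one JVM plays both roles
        p.put("node.id", "1");
        p.put("controller.quorum.voters", "1@host1:9093,2@host2:9093,3@host3:9093");
        p.put("controller.listener.names", "CONTROLLER");
        return p;
    }

    public static Properties dedicatedController() {
        Properties p = new Properties();
        p.put("process.roles", "controller");          // controller-only process
        p.put("node.id", "2");
        p.put("controller.quorum.voters", "1@host1:9093,2@host2:9093,3@host3:9093");
        p.put("controller.listener.names", "CONTROLLER");
        return p;
    }
}
```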

The Controller Quorum

The controller nodes comprise a Raft quorum which manages the metadata log. This log contains information about each change to the cluster metadata. Everything that is currently stored in ZooKeeper, such as topics, partitions, ISRs, configurations, and so on, will be stored in this log.

The controller nodes form a Raft quorum that manages the metadata log. This metadata log contains information about every change to the cluster metadata. Everything that used to be stored in ZooKeeper, such as topics, partitions, ISRs, and configurations, is now stored in this log.

Using the Raft algorithm, the controller nodes will elect a leader from amongst themselves, without relying on any external system. The leader of the metadata log is called the active controller. The active controller handles all RPCs made from the brokers. The follower controllers replicate the data which is written to the active controller, and serve as hot standbys if the active controller should fail. Because the controllers will now all track the latest state, controller failover will not require a lengthy reloading period where we transfer all the state to the new controller.

Using the Raft consensus algorithm, these controller candidates elect a leader from among themselves without relying on any external system. The leader of the metadata log is called the active controller. The active controller handles all RPC requests coming from the brokers, while the follower controllers replicate data from the active controller and stand by as hot standbys, ready to take over if the active controller fails. Because all of these controllers track the latest state, a controller failover does not require a lengthy reload period in which all the state is transferred to the new controller.

Just like ZooKeeper, Raft requires a majority of nodes to be running in order to continue running. Therefore, a three-node controller cluster can survive one failure. A five-node controller cluster can survive two failures, and so on.

Just like ZooKeeper, Raft needs a majority of nodes to stay alive in order to keep serving. Therefore, a 3-node controller cluster can tolerate 1 failure, a 5-node cluster can tolerate 2 failures, and so on.
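
The arithmetic is the usual majority-quorum rule: a quorum of n voters tolerates floor((n-1)/2) failures.

```java
public class QuorumMath {
    // Number of controller failures a Raft quorum of `voters` nodes can tolerate.
    static int toleratedFailures(int voters) {
        return (voters - 1) / 2;          // a majority (voters/2 + 1) must stay up
    }

    public static void main(String[] args) {
        System.out.println(toleratedFailures(3)); // 1
        System.out.println(toleratedFailures(5)); // 2
    }
}
```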

Periodically, the controllers will write out a snapshot of the metadata to disk. While this is conceptually similar to compaction, the code path will be a bit different because we can simply read the state from memory rather than re-reading the log from disk.

Periodically, the controllers write a snapshot of the metadata out to disk. While this is conceptually similar to compaction, the code path is slightly different, because the state can simply be read from memory rather than re-reading the log from disk. (Translator's note: in other words, the controller writes a compacted copy of its in-memory metadata to disk, while the running program always reads the data from memory. "Compaction" here is the same concept as Kafka's compacted topics: for example, if a topic's partition count starts at 3, later becomes 5, and finally 10, there are three log records in total, but after compaction only the most recent one needs to be kept.)
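
A toy example of the compaction idea from the translator's note: when rebuilding state, only the latest record per key matters, so the snapshot for the example below keeps only the final partition count of 10.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of why a snapshot is conceptually like compaction:
// only the latest record per key matters when rebuilding state.
public class SnapshotSketch {
    record Record(String key, String value) {}

    static Map<String, String> snapshot(List<Record> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Record r : log) {
            latest.put(r.key(), r.value()); // later records overwrite earlier ones
        }
        return latest;
    }

    public static void main(String[] args) {
        var log = List.of(
            new Record("topic-foo.partitions", "3"),
            new Record("topic-foo.partitions", "5"),
            new Record("topic-foo.partitions", "10"));
        System.out.println(snapshot(log)); // {topic-foo.partitions=10}
    }
}
```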

Broker Metadata Management

Instead of the controller pushing out updates to the other brokers, those brokers will fetch updates from the active controller via the new MetadataFetch API.

Instead of the ZooKeeper-era model in which the controller pushes data to the brokers, the brokers will fetch metadata from the active controller through the newly defined MetadataFetch API.

A MetadataFetch is similar to a fetch request. Just like with a fetch request, the broker will track the offset of the last updates it fetched, and only request newer updates from the active controller.

A metadata fetch is similar to a fetch request. Just like with a fetch request, the broker tracks the offset of the last update it fetched and asks the active controller only for newer updates. (Translator's note: a "fetch request" here is the same API that consumers call to consume messages; followers also use it to fetch messages from the leader.)

The broker will persist the metadata it fetched to disk. This will allow the broker to start up very quickly, even if there are hundreds of thousands or even millions of partitions. (Note that since this persistence is an optimization, we can leave it out of the first version, if it makes development easier.)

The broker will persist the metadata it fetches to disk. This allows the broker to start up very quickly, even with hundreds of thousands or millions of partitions. (Note that since this persistence is an optimization, it can be left out of the first version if that makes development easier.)

Most of the time, the broker should only need to fetch the deltas, not the full state. However, if the broker is too far behind the active controller, or if the broker has no cached metadata at all, the controller will send a full metadata image rather than a series of deltas.

In most cases, the broker only needs to fetch the incremental updates to the metadata rather than the full state. However, if the broker's metadata has fallen too far behind the active controller, or the broker has no cached metadata at all, the active controller will send a full metadata image instead of a series of deltas.
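
A hedged sketch of that decision (hypothetical names; the concrete protocol was worked out in follow-on KIPs): serve a delta when the requested offset is still covered by the retained log, otherwise fall back to the full image.

```java
import java.util.List;

// Hypothetical controller-side logic: serve incremental records when possible,
// fall back to a full metadata image when the fetcher is too far behind.
class MetadataFetchHandler {
    private final long logStartOffset;   // oldest offset still retained in the metadata log
    private final List<String> records;  // retained records, starting at logStartOffset
    private final String fullImage;      // latest snapshot of the complete metadata

    MetadataFetchHandler(long logStartOffset, List<String> records, String fullImage) {
        this.logStartOffset = logStartOffset;
        this.records = records;
        this.fullImage = fullImage;
    }

    // fetchOffset is the next offset the broker wants to read (-1 means "no cached metadata").
    Object handleFetch(long fetchOffset) {
        if (fetchOffset < logStartOffset) {
            return fullImage;            // too far behind (or empty cache): send everything
        }
        int from = (int) Math.min(fetchOffset - logStartOffset, records.size());
        return records.subList(from, records.size());   // only the records it has not seen yet
    }
}
```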

The broker will periodically ask for metadata updates from the active controller. This request will double as a heartbeat, letting the controller know that the broker is alive.

The broker will periodically send metadata fetch requests to the active controller. This request doubles as a heartbeat, letting the controller know that the broker is still alive.
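
A rough broker-side sketch of this loop, with hypothetical interfaces: the same periodic call both pulls new metadata records and serves as the liveness signal.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical broker-side loop: the periodic MetadataFetch request is also the
// broker's liveness signal to the active controller.
class MetadataFetchLoop {
    interface ControllerChannel {
        // Returns records newer than lastOffset; the call itself acts as a heartbeat.
        List<String> fetchMetadata(long lastOffset);
    }

    private final ControllerChannel controller;
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private volatile long lastOffset = -1L;

    MetadataFetchLoop(ControllerChannel controller) {
        this.controller = controller;
    }

    void start(long intervalMs) {
        scheduler.scheduleAtFixedRate(() -> {
            List<String> updates = controller.fetchMetadata(lastOffset);
            // (apply the updates here) then advance the fetch offset;
            // only this scheduler thread touches lastOffset.
            lastOffset += updates.size();
        }, 0, intervalMs, TimeUnit.MILLISECONDS);
    }
}
```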

Note that while this KIP only discusses broker metadata management, client metadata management is important for scalability as well. Once the infrastructure for sending incremental metadata updates exists, we will want to use it for clients as well as for brokers. After all, there are typically a lot more clients than brokers. As the number of partitions grows, it will become more and more important to deliver metadata updates incrementally to clients that are interested in many partitions. We will discuss this further in follow-on KIPs.

Note that while this KIP only discusses broker metadata management, client metadata management is also important for scalability. Once the infrastructure for incremental metadata updates exists, we will want to use it for clients as well as brokers; after all, there are usually many more clients than brokers. As the number of partitions grows, delivering metadata updates incrementally to clients that are interested in many partitions becomes more and more important. We will discuss this in future KIPs.

The Broker State Machine

Currently, brokers register themselves with ZooKeeper right after they start up. This registration accomplishes two things: it lets the broker know whether it has been elected as the controller, and it lets other nodes know how to contact it.

Currently, a broker registers itself with ZooKeeper as soon as it starts. The registration accomplishes two things: it lets the broker know whether it has been elected as the controller, and it lets other nodes know how to contact it.

In the post-ZooKeeper world, brokers will register themselves with the controller quorum, rather than with ZooKeeper.

In the post-ZooKeeper mode (Translator's note: the author uses "post-ZooKeeper" repeatedly below; it literally means Kafka with ZooKeeper removed, i.e. KRaft), brokers register themselves directly with the controller quorum rather than with ZooKeeper.

Currently, if a broker loses its ZooKeeper session, the controller removes it from the cluster metadata. In the post-ZooKeeper world, the active controller removes a broker from the cluster metadata if it has not sent a MetadataFetch heartbeat in a long enough time.

Currently, if a broker's ZooKeeper session expires, the controller removes it from the cluster metadata. In the post-ZooKeeper mode, a broker that has not sent a MetadataFetch heartbeat within the allowed time is removed from the cluster metadata by the active controller.
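
The controller-side bookkeeping might look roughly like the hypothetical sketch below: record the time of each broker's MetadataFetch and fence any broker whose last fetch is older than the session timeout.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical controller-side bookkeeping: brokers that have not sent a
// MetadataFetch within the session timeout are removed from the cluster metadata.
class BrokerLivenessTracker {
    private final long sessionTimeoutMs;
    private final Map<Integer, Long> lastFetchTimeMs = new ConcurrentHashMap<>();

    BrokerLivenessTracker(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    void onMetadataFetch(int brokerId, long nowMs) {
        lastFetchTimeMs.put(brokerId, nowMs);   // the fetch doubles as a heartbeat
    }

    void expireStaleBrokers(long nowMs) {
        lastFetchTimeMs.entrySet().removeIf(e -> {
            boolean expired = nowMs - e.getValue() > sessionTimeoutMs;
            if (expired) {
                System.out.println("Fencing broker " + e.getKey()
                    + ": no MetadataFetch heartbeat within the timeout");
            }
            return expired;
        });
    }
}
```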

In the current world, a broker which can contact ZooKeeper but which is partitioned from the controller will continue serving user requests, but will not receive any metadata updates. This can lead to some confusing and difficult situations. For example, a producer using acks=1 might continue to produce to a leader that actually was not the leader any more, but which failed to receive the controller's LeaderAndIsrRequest moving the leadership.

In the current ZooKeeper mode, a broker that can reach ZooKeeper but cannot reach the controller keeps serving user requests but no longer receives metadata updates. This can lead to some confusing and awkward situations. For example, a producer using acks=1 may keep producing to a broker that is actually no longer the leader, because that broker never received the controller's LeaderAndIsrRequest that moved the leadership.

In the post-ZK world, cluster membership is integrated with metadata updates. Brokers cannot continue to be members of the cluster if they cannot receive metadata updates. While it is still possible for a broker to be partitioned from a particular client, the broker will be removed from the cluster if it is partitioned from the controller.

In the post-ZooKeeper mode, cluster membership is integrated with metadata updates: a broker that cannot receive metadata updates is no longer a member of the cluster. It is still possible for a broker to be partitioned from a particular client, but a broker that cannot reach the controller will be removed from the cluster.

Broker States

[Figure: the broker state machine]

Offline

When the broker process is in the Offline state, it is either not running at all, or in the process of performing single-node tasks needed to starting up such as initializing the JVM or performing log recovery.

When a broker is in the Offline state, it is either not running at all, or it has just started and is performing the single-node tasks needed to start up, such as initializing the JVM or performing log recovery.

Fenced

When the broker is in the Fenced state, it will not respond to RPCs from clients. The broker will be in the fenced state when starting up and attempting to fetch the newest metadata. It will re-enter the fenced state if it can't contact the active controller. Fenced brokers should be omitted from the metadata sent to clients.

When a broker is in the Fenced state, it does not respond to RPC requests from clients. A broker is in the fenced state while it is starting up and trying to fetch the newest metadata, and it re-enters the fenced state if it finds it cannot reach the active controller. Fenced brokers should be omitted from the metadata sent to clients.

Online

When a broker is online, it is ready to respond to requests from clients.

When a broker is in the Online state, it is ready to respond to requests sent by clients.

Stopping

Brokers enter the stopping state when they receive a SIGINT. This indicates that the system administrator wants to shut down the broker.

A broker enters the Stopping state when it receives a SIGINT, which indicates that the system administrator wants to shut the broker down.

When a broker is stopping, it is still running, but we are trying to migrate the partition leaders off of the broker.

While a broker is stopping, it is in fact still running; we are trying to migrate the partition leaders off of it to other brokers.

Eventually, the active controller will ask the broker to finally go offline, by returning a special result code in the MetadataFetchResponse. Alternately, the broker will shut down if the leaders can't be moved in a predetermined amount of time.

Eventually, the active controller tells the broker that it can finally go offline by returning a special result code in the MetadataFetchResponse. Alternatively, if the leaders cannot be moved within a predetermined amount of time, the broker shuts down anyway.
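
The four states described above can be summarized as a small state machine; the sketch below follows the KIP's prose (the transition set is an illustration, not the exact implementation that later landed in Kafka).

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Sketch of the broker states described in this section and plausible transitions
// between them, following the prose above.
enum BrokerState {
    OFFLINE, FENCED, ONLINE, STOPPING;

    private static final Map<BrokerState, Set<BrokerState>> ALLOWED = new EnumMap<>(Map.of(
        OFFLINE,  EnumSet.of(FENCED),            // start up, then fetch the newest metadata
        FENCED,   EnumSet.of(ONLINE, OFFLINE),   // caught up, or shut down while fenced
        ONLINE,   EnumSet.of(FENCED, STOPPING),  // lost the controller, or received SIGINT
        STOPPING, EnumSet.of(OFFLINE)));         // controlled shutdown finished

    boolean canTransitionTo(BrokerState next) {
        return ALLOWED.get(this).contains(next);
    }
}
```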

Transitioning Some Existing APIs to Controller-Only

Many operations that were formerly performed by a direct write to ZooKeeper will become controller operations instead. For example, changing configurations, altering ACLs that are stored with the default Authorizer, and so on.

Many operations that used to be performed by writing directly to ZooKeeper will become controller operations instead, for example changing configurations, altering ACLs stored with the default Authorizer, and so on.

New versions of the clients should send these operations directly to the active controller. This is a backwards compatible change: it will work with both old and new clusters. In order to preserve compatibility with old clients that sent these operations to a random broker, the brokers will forward these requests to the active controller.

New client versions should send these operations directly to the active controller. This is a backward-compatible change: it works with both old and new clusters. Old clients used to send these requests to a random broker; to preserve compatibility with them, the brokers will forward such requests to the active controller.

New Controller APIs

In some cases, we will need to create a new API to replace an operation that was formerly done via ZooKeeper. One example of this is that when the leader of a partition wants to modify the in-sync replica set, it currently modifies ZooKeeper directly. In the post-ZK world, the leader will make an RPC to the active controller instead.

In some cases, we will need to create new APIs to replace operations that used to be done through ZooKeeper. A typical example: when a partition leader needs to modify the ISR, it currently modifies ZooKeeper directly; in the post-ZooKeeper mode, the leader sends an RPC request to the active controller instead.
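
The real inter-broker API was later specified in KIP-497 (AlterIsr, listed under follow-on work below); the sketch here only contrasts the two code paths, using hypothetical interfaces.

```java
import java.util.List;

// Hypothetical contrast between the two ways a partition leader can change its ISR:
// writing ZooKeeper directly (old world) vs. an RPC to the active controller
// (post-ZK world; the actual API is defined in KIP-497).
class IsrUpdater {
    interface ZkClient { void setData(String path, byte[] data); }
    interface ControllerChannel { void alterIsr(String topic, int partition, List<Integer> newIsr); }

    // Old world: the leader mutates ZooKeeper state behind the controller's back.
    static void updateViaZk(ZkClient zk, String topic, int partition, List<Integer> newIsr) {
        String path = "/brokers/topics/" + topic + "/partitions/" + partition + "/state";
        zk.setData(path, newIsr.toString().getBytes());
    }

    // Post-ZK world: the change goes through the active controller,
    // so it lands in the metadata log in order.
    static void updateViaController(ControllerChannel controller, String topic, int partition,
                                    List<Integer> newIsr) {
        controller.alterIsr(topic, partition, newIsr);
    }
}
```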

Removing Direct ZooKeeper Access from Tools

Currently, some tools and scripts directly contact ZooKeeper. In a post-ZooKeeper world, these tools must use Kafka APIs instead. Fortunately, "KIP-4: Command line and centralized administrative operations" began the task of removing direct ZooKeeper access several years ago, and it is nearly complete.

Currently, some tools and scripts talk to ZooKeeper directly. In the post-ZooKeeper mode, these tools must use Kafka APIs instead. Fortunately, "KIP-4: Command line and centralized administrative operations" started the work of removing direct ZooKeeper access several years ago, and it is nearly complete.

Compatibility, Deprecation, and Migration Plan

Client Compatibility

We will preserve compatibility with the existing Kafka clients. In some cases, the existing clients will take a less efficient code path. For example, the brokers may need to forward their requests to the active controller.

We will preserve compatibility with existing Kafka clients. In some cases, an existing client will take a less efficient code path; for example, the brokers may need to forward its requests to the active controller. (Translator's note: in this case the client does not need to change at all; the broker just does some forwarding on its side.)

Bridge Release

The overall plan for compatibility is to create a "bridge release" of Kafka where the ZooKeeper dependency is well-isolated.

The overall compatibility plan is to create a "bridge release" of Kafka in which the ZooKeeper dependency is well isolated. (Translator's note: Kafka 3.x, which supports both KRaft and ZooKeeper modes, is this bridge release; Kafka 4.x removes the ZooKeeper dependency entirely.)

Rolling Upgrade

The rolling upgrade from the bridge release will take several steps.

The rolling upgrade from the bridge release takes several steps.

Upgrade to the Bridge Release

The cluster must be upgraded to the bridge release, if it isn't already.

If the cluster has not yet been upgraded to the bridge release, it must be upgraded first.

Start the Controller Quorum Nodes

We will configure the controller quorum nodes with the address of the ZooKeeper quorum. Once the controller quorum is established, the active controller will enter its node information into /brokers/ids and overwrite the /controller node with its ID. This will prevent any of the un-upgraded broker nodes from becoming the controller at any future point during the rolling upgrade.

We will configure the controller quorum nodes with the address of the ZooKeeper quorum. Once the controller quorum is established, that is, once the active controller has been elected, the active controller adds its broker ID under ZooKeeper's /brokers/ids path and overwrites the contents of the /controller node with its own ID. This prevents any broker node that has not yet been upgraded to KRaft from becoming the controller at any point during the rolling upgrade.
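
Using the plain ZooKeeper client API, the takeover step might look roughly like the sketch below (error handling and the exact znode payload format are omitted; this is an illustration of the idea, not the actual upgrade code).

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Rough sketch of the takeover step described above: the new active controller
// claims /controller so that no un-upgraded broker can win a controller election
// during the rolling upgrade.
public class ControllerTakeoverSketch {
    public static void claimController(ZooKeeper zk, int controllerId) throws Exception {
        byte[] payload = ("{\"brokerid\":" + controllerId + "}").getBytes();
        try {
            zk.create("/controller", payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        } catch (KeeperException.NodeExistsException e) {
            // An old controller still holds the znode: remove it and take over.
            zk.delete("/controller", -1);
            zk.create("/controller", payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }
    }
}
```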

Once it has taken over the /controller node, the active controller will proceed to load the full state of ZooKeeper. It will write out this information to the quorum's metadata storage. After this point, the metadata quorum will be the metadata store of record, rather than the data in ZooKeeper.

Once it has taken over the /controller node, the active controller loads the full state from ZooKeeper and writes it out to the quorum's metadata storage (Translator's note: that is, it writes the metadata into the internal topic __cluster_metadata). From this point on, the store of record for metadata moves from ZooKeeper to KRaft.

We do not need to worry about the ZooKeeper state getting concurrently modified during this loading process. In the bridge release, neither the tools nor the non-controller brokers will modify ZooKeeper.

We do not need to worry about the ZooKeeper state being modified concurrently by other brokers or clients during this loading process. In the bridge release, neither the tools nor the non-controller brokers modify ZooKeeper.

The new active controller will monitor ZooKeeper for legacy broker node registrations. It will know how to send the legacy "push" metadata requests to those nodes, during the transition period.

During this transition period, the new active controller monitors ZooKeeper for registrations of legacy (not yet upgraded) broker nodes, and it knows how to send the legacy "push" metadata requests to those nodes.

Roll the Broker Nodes

We will roll the broker nodes as usual. The new broker nodes will not contact ZooKeeper. If the configuration for the zookeeper server addresses is left in the configuration, it will be ignored.

We roll the broker nodes as usual. The new broker nodes no longer contact ZooKeeper; if the ZooKeeper server addresses are still present in the configuration, they are simply ignored.

Roll the Controller Quorum

Once the last broker node has been rolled, there will be no more need for ZooKeeper. We will remove it from the configuration of the controller quorum nodes, and then roll the controller quorum to fully remove it.

Once the last broker node has been rolled, nothing needs ZooKeeper any more. We remove it from the configuration of the controller quorum nodes and then roll the controller quorum; after that, the cluster's dependency on ZooKeeper is completely removed.

Rejected Alternatives

Pluggable Consensus

Rather than managing metadata ourselves, we could make the metadata storage layer pluggable so that it could work with systems other than ZooKeeper. For example, we could make it possible to store metadata in etcd, Consul, or similar systems.

Instead of managing metadata ourselves, we could make the metadata storage layer pluggable so that it could work with systems other than ZooKeeper; for example, we could make it possible to store metadata in etcd, Consul, or similar systems.

Unfortunately, this strategy would not address either of the two main goals of ZooKeeper removal. Because they have ZooKeeper-like APIs and design goals, these external systems would not let us treat metadata as an event log. Because they are still external systems that are not integrated with the project, deployment and configuration would still remain more complex than they needed to be.

Unfortunately, this strategy would not address either of the two main goals of removing ZooKeeper. Because these external systems have ZooKeeper-like APIs and design goals, they would not let us treat metadata as an event log. And because they are still external systems that are not integrated with the project, deployment and configuration would remain more complex than they need to be.

Supporting multiple metadata storage options would inevitably decrease the amount of testing we could give to each configuration. Our system tests would have to either run with every possible configuration storage mechanism, which would greatly increase the resources needed, or choose to leave some users under-tested. Increasing the size of the test matrix in this fashion would really hurt the project.

Supporting multiple metadata storage options would inevitably decrease the amount of testing we could give to each configuration. Our system tests would either have to run with every possible storage mechanism, which would greatly increase the resources needed, or leave some users' configurations under-tested. Growing the test matrix in this way would really hurt the project.

Additionally, if we supported multiple metadata storage options, we would have to use "least common denominator" APIs. In other words, we could not use any API unless all possible metadata storage options supported it. In practice, this would make it difficult to optimize the system.

Additionally, if we supported multiple metadata storage options, we would have to use "least common denominator" APIs. In other words, we could not use an API unless every possible metadata storage option supported it. In practice, this would make it difficult to optimize the system.

Follow-on Work

This KIP expresses a vision of how we would like to evolve Kafka in the future. We will create follow-on KIPs to hash out the concrete details of each change.

This KIP expresses a vision of how we would like Kafka to evolve in the future. We will create follow-on KIPs to hash out the concrete details of each change.

  • KIP-455: Create an Administrative API for Replica Reassignment
  • KIP-497: Add inter-broker API to alter ISR
  • KIP-543: Expand ConfigCommand's non-ZK functionality
  • KIP-555: Deprecate Direct Zookeeper access in Kafka Administrative Tools
  • KIP-589: Add API to update Replica state in Controller
  • KIP-590: Redirect Zookeeper Mutation Protocols to The Controller
  • KIP-595: A Raft Protocol for the Metadata Quorum
  • KIP-631: The Quorum-based Kafka Controller

References

The Raft consensus algorithm

  • Ongaro, D., Ousterhout, J. In Search of an Understandable Consensus Algorithm

Handling Metadata via Write-Ahead Logging

  • Shvachko, K., Kuang, H., Radia, S. Chansler, R. The Hadoop Distributed Filesystem
  • Balakrishnan, M., Malkhi, D., Wobber, T. Tango: Distributed Data Structures over a Shared Log
