Purgatory is an important data structure that the Kafka server uses when processing requests. I came across this proposal while studying the ReplicaManager source code, so I decided to translate it. Much of this proposal is best understood alongside the source code, but there is quite a lot of code, so I will first outline some of the concepts and principles involved, to make the article that follows easier to read.
1. The purgatory caches delayed requests. These requests are parked in the purgatory because some of their conditions are not yet satisfied, and are removed once those conditions are met.
2. The conditions for a request to be satisfied come in two kinds: (1) business conditions, such as the minimum number of bytes for a fetch, and (2) a timeout. These two kinds of conditions correspond to two different caches. The first cache is a hash map whose keys are the conditions and whose values are lists of all the requests waiting on that condition (the watcher lists in the article; each request waiting on a key is a watcher). The second cache is a timer, which actively completes a request once it times out. (See the sketch after this list.)
3. Keys and requests in the first hash map are in a many-to-many relationship. So when a request is found through one key and completed, it can be removed from that key's watcher list, but that does not mean it is gone from the first cache: it may still sit in the watcher lists of other keys, and traversing all the watcher lists is an expensive operation, so we cannot sweep the whole hash map every time an element is removed. Instead, the hash map needs to be cleaned up periodically, which is the purge operation mentioned in the article below. The 0.8.x implementation decides when to purge based on the current total size of the watcher lists, but that size represents neither the number of requests in the first cache nor the number of completed requests, and it is really the number of completed requests that should trigger a purge. The old design handles this badly, wasting a lot of CPU and limiting the purgatory's throughput. The new design partially solves this problem, and is at least much better than 0.8.x.
4. Even when an element in the second cache, i.e. the timeout queue, is deleted, the corresponding entries in the first cache cannot be located and deleted directly. So requests that have already expired cannot be removed from the first cache promptly either, which adds to the need to clean up the first cache.
5. The 0.8.x timer is implemented with a java.util.concurrent.DelayQueue, wrapping each request as a DelayedItem and putting it in the queue. Java's DelayQueue is backed by a priority queue, whose enqueue and remove operations cost O(log n). So if the DelayQueue is large, every enqueue and remove is relatively expensive. The new implementation, using a timing wheel with buckets backed by doubly linked lists, brings the cost of inserting a request into the timer and removing it from the timer down to O(1), which also reduces CPU usage.
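To make the two caches concrete, here is a minimal, hypothetical sketch (the names `NaivePurgatory`, `DelayedRequest`, etc. are my own, not Kafka's classes), assuming a condition-keyed watcher map plus a 0.8.x-style DelayQueue timer:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// A delayed request: completed at most once, either because its condition
// is satisfied or because the timeout timer forces completion.
class DelayedRequest implements Delayed {
    private final long deadlineMs;
    final AtomicBoolean completed = new AtomicBoolean(false);

    DelayedRequest(long timeoutMs) {
        this.deadlineMs = System.currentTimeMillis() + timeoutMs;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(deadlineMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                            other.getDelay(TimeUnit.MILLISECONDS));
    }
}

// The two caches: a watcher map keyed by condition, and a timeout timer.
class NaivePurgatory<K> {
    // Cache 1: condition key -> watcher list of requests waiting on that key.
    final Map<K, List<DelayedRequest>> watchers = new ConcurrentHashMap<>();
    // Cache 2: the timeout timer (0.8.x style: one DelayQueue entry per request).
    final DelayQueue<DelayedRequest> timer = new DelayQueue<>();

    void watch(K key, DelayedRequest request) {
        watchers.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(request);
        timer.offer(request);
    }

    // Completing a request via one key removes it from THIS key's list only;
    // it may linger in other keys' watcher lists and in the timer. Those
    // stale entries are exactly why a periodic purge is needed.
    void checkAndComplete(K key) {
        List<DelayedRequest> watched = watchers.get(key);
        if (watched == null) return;
        for (DelayedRequest r : watched) {
            if (r.completed.compareAndSet(false, true)) {
                watched.remove(r);
            }
        }
    }
}
```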
Purgatory Redesign Proposal
Introduction
Kafka implements several request types that cannot immediately be answered with a response. Examples:
- A produce request with acks=all cannot be considered complete until all replicas have acknowledged the write and we can guarantee it will not be lost if the leader fails.
- A fetch request with min.bytes=1 won't be answered until there is at least one new byte of data for the consumer to consume. This allows a "long poll" so that the consumer need not busy wait checking for new data to arrive.
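For illustration, these are the client settings that give rise to such delayed requests. A minimal sketch using the modern Java client configuration keys (`acks`, `fetch.min.bytes`, `fetch.max.wait.ms`; the 0.8.x-era property names differed):

```java
import java.util.Properties;

public class DelayedRequestConfigs {
    public static void main(String[] args) {
        // Producer: acks=all parks the produce request in the purgatory until
        // all in-sync replicas have acknowledged the write.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("acks", "all");

        // Consumer: fetch.min.bytes turns a fetch into a long poll; the broker
        // holds the request until enough bytes arrive or the wait times out.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("fetch.min.bytes", "1");
        consumerProps.put("fetch.max.wait.ms", "500");
    }
}
```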
These requests are considered complete when either (a) the criteria they requested is complete or (b) some timeout occurs.
We intend to expand the use of this delayed request facility for various additional purposes including partition assignment and potentially quota enforcement.
The number of these asynchronous operations in flight at any time scales with the number of connections, which for Kafka is often tens of thousands.
A naive implementation of these would simply block the thread on the criteria, but this would not scale to the high number of in flight requests Kafka has.
The current approach uses a data structure called the "request purgatory". The purgatory holds any request that hasn't yet met its criteria to succeed but also hasn't yet resulted in an error. This structure holds onto these uncompleted requests and allows non-blocking event-based generation of the responses. This approach is obviously better than having a thread per in-flight request but our implementation of the data structure that accomplishes this has a number of deficiencies. The goal of this proposal is to improve the efficiency of this data structure.
Current Design
The request purgatory consists of a timeout timer and a hash map of watcher lists for event driven processing. A request is put into the purgatory when it is not immediately satisfiable because of unmet conditions. A request in the purgatory is completed later when the conditions are met, or is forced to be completed (timeout) when it passes beyond the time specified in the timeout parameter of the request. Currently (0.8.x) it uses Java DelayQueue to implement the timer.
When a request is completed, it is not deleted from the timer or watcher lists immediately. Instead, completed requests are deleted as they are found during condition checking. When the deletion does not keep up, the server may exhaust the JVM heap and cause an OutOfMemoryError. To alleviate the situation, the reaper thread purges completed requests from the purgatory when the number of requests in the purgatory (both pending and completed) exceeds the configured number. The purge operation scans the timer queue and all watcher lists to find completed requests and deletes them.
By setting this configuration parameter low, the server can virtually avoid the memory problem. However, the server must pay a significant performance penalty if it scans all lists too frequently.
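Continuing the hypothetical `NaivePurgatory` sketch above, the 0.8.x-style purge might look like the following method on that class: it is triggered by the total entry count (pending and completed alike) and must scan everything, which is what makes it expensive.

```java
// 0.8.x-style reaper logic: purge when total entries exceed the configured
// threshold (purgeIntervalRequests), then scan every list.
void maybePurge(int purgeIntervalRequests) {
    // Count everything still cached; completed requests linger here too,
    // so this size says little about how many entries a purge would reclaim.
    int totalEntries = timer.size();
    for (List<DelayedRequest> watched : watchers.values())
        totalEntries += watched.size();
    if (totalEntries > purgeIntervalRequests) {
        // Full scan of the timer queue and every watcher list: O(total size),
        // and pure overhead when few requests are actually completed.
        timer.removeIf(r -> r.completed.get());
        for (List<DelayedRequest> watched : watchers.values())
            watched.removeIf(r -> r.completed.get());
    }
}
```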
New Design
The goal of the new design is to allow immediate deletion of a completed request and to reduce the load of the expensive purge process significantly. It requires cross-referencing of entries in the timer and the requests. It is also strongly desirable to have O(1) insert/delete cost, since an insert/delete operation happens for each request/completion.
To satisfy these requirements, we propose a new purgatory implementation based on Hierarchical Timing Wheels.
Hierarchical Timing Wheel
Doubly Linked List for Buckets in Timing Wheels
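The proposal's body text for these two sections is not included in this excerpt. As a hedged sketch of both ideas together (the names `TimerTaskEntry`, `TimerBucket`, and `TimingWheel` are illustrative, loosely modeled on the design): buckets are intrusive doubly linked lists, so a completed request can unlink itself in O(1), and each wheel overflows into a coarser parent wheel whose tick equals the child's whole span.

```java
// A timer entry in an intrusive doubly linked list: it can unlink itself
// from its bucket in O(1), with no list scan.
class TimerTaskEntry {
    TimerTaskEntry prev, next;
    TimerBucket bucket;
    final Runnable task;
    final long expirationMs;

    TimerTaskEntry(Runnable task, long expirationMs) {
        this.task = task;
        this.expirationMs = expirationMs;
    }

    void remove() {
        TimerBucket b = bucket;
        if (b != null) b.remove(this); // O(1) unlink on completion
    }
}

// A bucket: a circular doubly linked list with a sentinel root node.
class TimerBucket {
    private final TimerTaskEntry root = new TimerTaskEntry(null, -1);
    TimerBucket() { root.next = root; root.prev = root; }

    synchronized void add(TimerTaskEntry e) {
        e.prev = root.prev;           // append just before the sentinel
        e.next = root;
        root.prev.next = e;
        root.prev = e;
        e.bucket = this;
    }

    synchronized void remove(TimerTaskEntry e) {
        e.prev.next = e.next;         // relink neighbours: O(1)
        e.next.prev = e.prev;
        e.prev = e.next = null;
        e.bucket = null;
    }
}

// A hierarchical timing wheel: one wheel spans tickMs * wheelSize; tasks
// further out overflow into a coarser parent wheel.
class TimingWheel {
    private final long tickMs;          // time per bucket
    private final int wheelSize;        // buckets per wheel
    private final long interval;        // tickMs * wheelSize
    private long currentTime;           // rounded down to a multiple of tickMs
    private final TimerBucket[] buckets;
    private TimingWheel overflowWheel;  // created lazily

    TimingWheel(long tickMs, int wheelSize, long startMs) {
        this.tickMs = tickMs;
        this.wheelSize = wheelSize;
        this.interval = tickMs * wheelSize;
        this.currentTime = startMs - (startMs % tickMs);
        this.buckets = new TimerBucket[wheelSize];
        for (int i = 0; i < wheelSize; i++) buckets[i] = new TimerBucket();
    }

    // O(1) insert: pick the bucket by expiration, or push to a coarser wheel.
    boolean add(TimerTaskEntry e) {
        if (e.expirationMs < currentTime + tickMs) {
            return false;               // already expired; caller runs it now
        } else if (e.expirationMs < currentTime + interval) {
            long virtualId = e.expirationMs / tickMs;
            buckets[(int) (virtualId % wheelSize)].add(e);
            return true;
        } else {
            if (overflowWheel == null)
                overflowWheel = new TimingWheel(interval, wheelSize, currentTime);
            return overflowWheel.add(e); // coarser granularity
        }
    }

    void advanceClock(long timeMs) {
        if (timeMs >= currentTime + tickMs) {
            currentTime = timeMs - (timeMs % tickMs);
            if (overflowWheel != null) overflowWheel.advanceClock(currentTime);
        }
    }
}
```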
Driving Clock using DelayQueue
A simple implementation may use a thread that wakes up every unit time and does the ticking, which checks if there is any task in the bucket. This can be wasteful if requests are sparse. We want the thread to wake up only when there is a non-empty bucket to expire. We will do so by using java.util.concurrent.DelayQueue similarly to the current implementation, but we will enqueue task buckets instead of individual tasks. This design has a performance advantage: the number of items in the DelayQueue is capped by the number of buckets, which is usually much smaller than the number of tasks, so the number of offer/poll operations on the priority queue inside the DelayQueue will be significantly smaller.
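A hedged sketch of this driver, continuing the types above (again, the names are my own): whole buckets, not individual tasks, are offered to the DelayQueue, so the driver thread sleeps until the earliest non-empty bucket expires.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

class BucketDriver {
    static class ExpiringBucket implements Delayed {
        final TimerBucket bucket;
        final long expirationMs;

        ExpiringBucket(TimerBucket bucket, long expirationMs) {
            this.bucket = bucket;
            this.expirationMs = expirationMs;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(expirationMs - System.currentTimeMillis(),
                                TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                                other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    private final DelayQueue<ExpiringBucket> delayQueue = new DelayQueue<>();

    // Called when a bucket goes from empty to non-empty. The number of queue
    // entries is capped by the number of buckets, not the number of tasks.
    void onBucketScheduled(TimerBucket bucket, long expirationMs) {
        delayQueue.offer(new ExpiringBucket(bucket, expirationMs));
    }

    // One tick of the driver: block until the earliest bucket expires, then
    // advance the wheel clock so its tasks can be expired.
    void advanceOnce(TimingWheel wheel) throws InterruptedException {
        ExpiringBucket expired = delayQueue.take(); // wakes only on expiry
        wheel.advanceClock(expired.expirationMs);
        // A real implementation would now drain expired.bucket, running tasks
        // that are due and re-inserting the rest into a finer wheel.
    }
}
```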
Purge of Watcher Lists
In the current implementation, the purge operation of the watcher lists is triggered by the total size of the watcher lists. The problem is that the watcher lists may exceed the threshold even when there aren't many requests to purge. When this happens it increases the CPU load a lot. Ideally, the purge operation should be triggered by the number of completed requests in the watcher lists.
In the new design, a completed request is removed from the timer queue immediately with O(1) cost. It means that the number of requests in the timer queue is exactly the number of pending requests at any time. So, if we know the total number of distinct requests in the purgatory, which is the sum of the number of pending requests and the number of completed-but-still-watched requests, we can avoid unnecessary purge operations. It is not trivial to keep track of the exact number of distinct requests in the purgatory, because a request may or may not be watched. In the new design, we estimate the total number of requests in the purgatory rather than trying to maintain the exact number.
The estimated number of requests is maintained as follows. The estimated total number of requests, E, is incremented whenever a new request is watched. Before starting the purge operation, we reset the estimated total number of requests to the size of the timer queue. If no requests are added to the purgatory during the purge, E is the correct number of requests after the purge. If some requests are added to the purgatory during the purge, E is incremented to E plus the number of newly watched requests. This may be an overestimate, because some of the new requests may be completed and removed from the watcher lists during the purge operation. We expect both the chance and the amount of overestimation to be small.
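A minimal sketch of this bookkeeping, assuming hypothetical names (`PurgatorySizeEstimator` is mine, not Kafka's):

```java
import java.util.concurrent.atomic.AtomicInteger;

class PurgatorySizeEstimator {
    // E: estimated total number of distinct requests in the purgatory.
    private final AtomicInteger estimatedTotal = new AtomicInteger(0);

    void onRequestWatched() {
        estimatedTotal.incrementAndGet(); // E++ for every newly watched request
    }

    // A purge is worthwhile only when the estimated number of completed but
    // still-watched requests (E minus pending) exceeds a threshold.
    boolean shouldPurge(int timerQueueSize, int purgeThreshold) {
        return estimatedTotal.get() - timerQueueSize > purgeThreshold;
    }

    // Before purging, reset E to the timer queue size (exactly the number of
    // pending requests). Watches added during the purge keep incrementing E,
    // which may slightly overestimate the true total.
    void onPurgeStart(int timerQueueSize) {
        estimatedTotal.set(timerQueueSize);
    }
}
```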
Parameters
- the tick size (the minimum time unit)
- the wheel size (the number of buckets per wheel)
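For intuition, here is my own worked example (not from the proposal): n hierarchical wheels with tick u and wheel size m together cover u * m^n of time, so with the benchmark's parameters below a 200ms timeout already fits in the second wheel.

```java
public class WheelSpan {
    public static void main(String[] args) {
        long tickMs = 1;     // the tick size (minimum time unit)
        int wheelSize = 20;  // buckets per wheel
        long span = tickMs;
        for (int level = 1; level <= 3; level++) {
            span *= wheelSize;
            System.out.printf("wheel %d covers up to %d ms%n", level, span);
        }
        // wheel 1 covers up to 20 ms
        // wheel 2 covers up to 400 ms
        // wheel 3 covers up to 8000 ms
    }
}
```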
Benchmark
We compared the enqueue performance of two purgatory implementations: the current implementation and the proposed new implementation. This is a micro benchmark that measures the purgatory enqueue performance. The purgatory was separated from the rest of the system and uses a fake request which does nothing useful. So, the throughput of the purgatory in a real system may be much lower than the number shown by the test.
In the test, the intervals of the requests are assumed to follow the exponential distribution. Each request takes a time drawn from a log-normal distribution. By adjusting the shape of the log-normal distribution, we can test different timeout rate.
(Translator's note: the time drawn here should be the time it takes to complete the request, i.e. from entering the purgatory to being completed.)
The tick size is 1ms and the wheel size is 20. The timeout was set to 200ms. The data size of a request was 100 bytes. For the low timeout rate case, we chose the 75th percentile = 60ms and the 50th percentile = 20ms; for the high timeout rate case, we chose the 75th percentile = 400ms and the 50th percentile = 200ms. A total of 1 million requests are enqueued in each run.
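As my own sketch (not the proposal's benchmark harness) of how completion times can be drawn from a log-normal distribution pinned at two percentiles: the median of a log-normal is e^mu, and the 75th percentile sits 0.6745 standard deviations above the mean of the underlying normal, so mu = ln(p50) and sigma = ln(p75/p50)/0.6745.

```java
import java.util.Random;

public class LogNormalLatency {
    // z-score of the 75th percentile of the standard normal distribution
    private static final double Z75 = 0.6745;

    public static void main(String[] args) {
        double p50 = 20.0, p75 = 60.0;     // low-timeout-rate case from the text
        double mu = Math.log(p50);          // median of a log-normal is e^mu
        double sigma = Math.log(p75 / p50) / Z75;

        Random rng = new Random(42);
        int timeouts = 0, n = 1_000_000;
        for (int i = 0; i < n; i++) {
            double completionMs = Math.exp(mu + sigma * rng.nextGaussian());
            if (completionMs > 200.0) timeouts++;  // 200 ms timeout from the text
        }
        // Roughly 8% of requests exceed the 200 ms timeout with these parameters.
        System.out.printf("timeout rate = %.2f%%%n", 100.0 * timeouts / n);
    }
}
```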
Requests are actively completed by a separate thread. Requests that are supposed to be completed before timeout are enqueued to another DelayQueue. And a separate thread keeps polling and completes them. There is no guarantee of accuracy in terms of actual completion time.
(Translator's note: this paragraph describes how requests are completed in the benchmark: requests that are not supposed to time out are placed in another DelayQueue, and a separate thread keeps polling that queue and completing them. But as mentioned earlier, a DelayQueue poll costs O(log n), so wouldn't this scheme itself add CPU load? Especially considering that when a request is actually completed, it is looked up in the hash map, which is much cheaper.)
The JVM heap size is set to 200m to reproduce a memory tight situation.
The result shows a dramatic difference at high enqueue rates. As the target rate increases, both implementations keep up with the requests initially. However, in the low timeout scenario the old implementation saturated around 40,000 RPS (requests per second), whereas the proposed implementation didn't show any significant performance degradation; and in the high timeout scenario the old implementation saturated around 25,000 RPS, whereas the proposed implementation saturated around 105,000 RPS in this benchmark.
CPU usage is significantly better in the new implementation.
Finally, we measured the total GC time (in milliseconds) for ParNew collection and CMS collection. There isn't much difference between the old implementation and the new implementation in the range of enqueue rates that the old implementation can sustain.
(Translator's note: ParNew for the young generation, CMS for the old generation.)
Summary
In the new design, we use Hierarchical Timing Wheels for the timeout timer and a DelayQueue of timer buckets to advance the clock on demand. Completed requests are removed from the timer queue immediately with O(1) cost. The buckets remain in the delay queue; however, the number of buckets is bounded. And, in a healthy system, most of the requests are satisfied before timeout, and many of the buckets become empty before being pulled out of the delay queue. Thus, the timer should rarely have buckets of the lower interval. The advantage of this design is that the number of requests in the timer queue is exactly the number of pending requests at any time. This allows us to estimate the number of requests that need to be purged, so we can avoid unnecessary purge operations on the watcher lists. As a result, we achieve higher scalability in terms of request rate with much better CPU usage.