[Source Code Analysis] Load Balancing in Celery, a Parallel Distributed Task Queue


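0x00 Abstract

Celery is a simple, flexible, and reliable distributed system for processing large volumes of messages. It is an asynchronous task queue focused on real-time processing that also supports task scheduling. This article describes Celery's load-balancing mechanisms.

The Autoscaler adjusts the size of the process pool at runtime. Because this also helps relieve load, it is covered here as well.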

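0x01 Load balancing

Celery's load balancing can be divided into three levels, and it is tightly coupled with Kombu (this article uses Redis as the broker).

When a worker decides which queues to interact with, there is load balancing across the queues.
When workers interact with the broker and fetch messages with BRPOP, there is load balancing that decides which worker handles a task.
After a worker has received a message from the broker and is about to invoke the task, it distributes work across its child processes, and there is load balancing that decides which process inside the worker runs the task.

Note that this order follows the path a worker takes when it reads and processes a task, not the system architecture.

From the architectural point of view, the order would be: which worker ----> which queue in the worker ----> which subprocess in the worker.

The analysis below follows the worker's read-and-process order.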

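1.1 Which queues

Kombu reads messages from a specific queue by using the Redis BRPOP command.

Kombu calls BRPOP in a loop, and each call specifies which internal queues to read from.
A queue is a logical concept that maps to a physical key in Redis, so reading from a queue means telling BRPOP which key to listen on.
Each time it starts listening, Kombu maps these queues to their physical Redis keys, which determines the set of keys BRPOP watches.
BRPOP accepts multiple keys. When several keys are given, it checks the lists in the order the keys appear and pops the tail element of the first non-empty list. That element is the message for the corresponding logical queue.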
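
The multi-key behavior described above can be sketched without a Redis server. This is a simplified, non-blocking stand-in, not Kombu or Redis code: the function name and the dict-of-lists store are assumptions made for illustration, and the real BRPOP would block until a timeout instead of returning None.

```python
def brpop_first_non_empty(store, keys):
    """Return (key, value) from the first non-empty list, or None.

    Hypothetical stand-in for BRPOP: `store` is a plain dict of lists.
    """
    for key in keys:
        items = store.get(key)
        if items:
            # BRPOP pops from the tail (right end) of the list.
            return key, items.pop()
    return None


store = {"queue1": [], "queue2": ["m1", "m2"], "queue3": ["m3"]}

# queue1 is empty, so the first non-empty list in key order is queue2.
first = brpop_first_non_empty(store, ["queue1", "queue2", "queue3"])

# With a different key order, queue3 is checked first and wins.
second = brpop_first_non_empty(store, ["queue3", "queue1", "queue2"])
```

The order of the keys therefore decides which queue is served first when several of them hold messages, which is why the choice of key order matters in the next section.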

Because tasks may be spread across several queues, which queues should be read from? This is where a strategy comes in.

1.1.1 _brpop_start selects the queues to read next

Each time Kombu starts listening, it calls _brpop_start, whose job is to select the queues for the next read.

_brpop_start is shown below:

def _brpop_start(self, timeout=1):
    # Get the internal queues to read from
    queues = self._queue_cycle.consume(len(self.active_queues))
    if not queues:
        return
    # Map the queues to their Redis keys (one key per queue and priority step)
    keys = [self._q_for_pri(queue, pri) for pri in self.priority_steps
            for queue in queues] + [timeout or 0]
    self._in_poll = self.client.connection
    self.client.connection.send_command('BRPOP', *keys)  # ask Redis for a message from these keys

The variables at this point are:

self.active_queues = {set: 1} {'celery'}

len(self.active_queues) = {int} 1
  
self._queue_cycle = {round_robin_cycle} <kombu.utils.scheduling.round_robin_cycle object at 0x0000015A7EE9DE88>
  
self = {Channel} <kombu.transport.redis.Channel object at 0x0000015A7EE31048>

So _brpop_start obtains the queues to read from self._queue_cycle.
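
To make the key list concrete, here is a minimal sketch of how the BRPOP arguments could be assembled from the selected queues. The helper names, the separator value, and the priority steps are assumptions for illustration only; the article shows that _q_for_pri and priority_steps exist but not their definitions.

```python
def q_for_pri(queue, pri, sep="\x06\x16"):
    """Hypothetical stand-in for Channel._q_for_pri.

    Assumes priority 0 maps to the bare queue name and other priorities
    are appended after a separator.
    """
    return f"{queue}{sep}{pri}" if pri else queue


def build_brpop_args(queues, priority_steps=(0, 3, 6, 9), timeout=1):
    """Mirror the list comprehension in _brpop_start."""
    keys = [q_for_pri(queue, pri)
            for pri in priority_steps
            for queue in queues]
    # The trailing element is the BRPOP timeout, not a key.
    return keys + [timeout or 0]


args = build_brpop_args(["celery", "mail"], priority_steps=(0, 3))
```

Because the outer loop runs over the priority steps, all queues at one priority level are listed before any queue at the next level.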

This is illustrated below:

                                                              +
                                                  Kombu       |          Redis
                                                              |
                                                              |
+--------------------------------------------+                |
|  Worker                                    |                |
|                                            |                |           queue 1 key
|   +-----------+                            |                |
|   | queue 1   |                            |   BRPOP(keys)  |
|   | queue 2   |                    keys    |                |
|   | ......    | +--------+---------------------------------------->     queue 2 key
|   | queue n   |          ^                 |                |
|   +-----------+          | keys            |                |
|                          |                 |                |
|                          |                 |                |           queue 3 key
|            +-------------+------------+    |                |
|            |         Keys list        |    |                |
|            |                          |    |                |
|            +--------------------------+    |                |
+--------------------------------------------+                |
                                                              |
                                                              |
                                                              |
                                                              |
                                                              |
                                                              +

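1.1.2 round_robin_cycle sets the queues to read next

From the code above we can see that consume returns the first n queues held by round_robin_cycle, i.e. return self.items[:n].

self.items is maintained by rotate, which moves the most recently used queue to the end of the list so that the other queues get a turn. This is the round-robin idea.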

class round_robin_cycle:
    """Iterator that cycles between items in round-robin."""

    def __init__(self, it=None):
        self.items = it if it is not None else []

    def update(self, it):
        """Update items from iterable."""
        self.items[:] = it

    def consume(self, n):
        """Consume n items."""
        return self.items[:n]

    def rotate(self, last_used):
        """Move most recently used item to end of list."""
        items = self.items
        try:
            items.append(items.pop(items.index(last_used)))
        except ValueError:
            pass
        return last_used

For example, in the code below, once a message has been read, self._queue_cycle.rotate(dest) is called to adjust the order.

    def _brpop_read(self, **options):
        try:
            try:
                dest__item = self.client.parse_response(self.client.connection,
                                                        'BRPOP',
                                                        **options)
            except self.connection_errors:
                # if there's a ConnectionError, disconnect so the next
                # iteration will reconnect automatically.
                self.client.connection.disconnect()
                raise
            if dest__item:
                dest, item = dest__item
                dest = bytes_to_str(dest).rsplit(self.sep, 1)[0]
                self._queue_cycle.rotate(dest)  # move the queue just served to the end
                self.connection._deliver(loads(bytes_to_str(item)), dest)
                return True
            else:
                raise Empty()
        finally:
            self._in_poll = None
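
The interplay of consume and rotate can be tried out in isolation. The class below is a minimal restatement of the round_robin_cycle shown earlier (renamed RoundRobinCycle here to make clear it is a sketch, not the Kombu class), and the queue names are made up.

```python
class RoundRobinCycle:
    """Minimal restatement of the round_robin_cycle class shown above."""

    def __init__(self, items=None):
        self.items = list(items) if items is not None else []

    def consume(self, n):
        # Return the first n items without removing them.
        return self.items[:n]

    def rotate(self, last_used):
        # Move the most recently used item to the end of the list.
        try:
            self.items.append(self.items.pop(self.items.index(last_used)))
        except ValueError:
            pass  # unknown item: leave the order unchanged
        return last_used


cycle = RoundRobinCycle(["a", "b", "c"])
order_before = cycle.consume(3)   # all three queues, "a" first
cycle.rotate("a")                 # a message was just read from "a"
order_after = cycle.consume(3)    # "a" now comes last
```

Since _brpop_start asks for len(self.active_queues) items, every active queue is still passed to BRPOP on each call. What rotate changes is the order of the keys, and therefore which queue BRPOP checks first.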

This is illustrated below:

                                                              +
                                                  Kombu       |          Redis
                                                              |
                                                              |
+--------------------------------------------+                |
|  Worker                                    |                |
|                                            |                |           queue 1 key
|   +-----------+                            |                |
|   | queue 1   |                            |   BRPOP(keys)  |
|   | queue 2   |                    keys    |                |
|   | ......    | +--------+---------------------------------------->     queue 2 key
|   | queue n   |          ^                 |                |
|   +-----------+          | keys            |                |
|                          |                 |                |
|                          +                 |                |           queue 3 key
|                 round_robin_cycle          |                |
|                          +                 |                |
|                          |                 |                |
|                          |                 |                |
|            +-------------+------------+    |                |
|            |         Keys list        |    |                |
|            +--------------------------+    |                |
+--------------------------------------------+                |
                                                              |
                                                              +

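1.2 Which worker

If multiple workers use BRPOP at the same time to fetch messages from the broker, which one actually gets a message is decided by competition: Redis executes commands on a single thread, so each message is delivered to exactly one worker.

This is itself a form of load balancing, and it closely resembles how Spring Quartz implements load balancing.

With Spring Quartz, multiple nodes read the same database record to decide who starts the next run; whichever node acquires the database lock wins.
With Kombu, multiple workers read the same Redis key (or set of keys), and the actual outcome of those reads decides which worker starts the next piece of work.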
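
The competition between workers can be illustrated by analogy with threads sharing one queue. This is only an analogy under stated assumptions: queue.Queue stands in for the Redis list, threads stand in for worker processes, and the function name is made up. It is not how Kombu talks to Redis.

```python
import queue
import threading


def run_competing_workers(messages, num_workers=3):
    """Let several workers compete for messages from one shared queue."""
    broker = queue.Queue()  # stand-in for a single Redis list
    for message in messages:
        broker.put(message)

    received = {idx: [] for idx in range(num_workers)}

    def worker(idx):
        while True:
            try:
                # Blocking pop with a timeout, similar in spirit to BRPOP.
                message = broker.get(timeout=0.1)
            except queue.Empty:
                return  # nothing left to read
            received[idx].append(message)

    threads = [threading.Thread(target=worker, args=(idx,))
               for idx in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received


result = run_competing_workers(list(range(20)))
all_received = sorted(m for msgs in result.values() for m in msgs)
```

How the messages are split between the workers varies from run to run, but every message ends up with exactly one worker, which is the property the competition relies on.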

This is illustrated below:

                                                            +
                                            Kombu           |    Redis
                                                            |
                                                            |
+--------------------------------------+                    |
|  Worker 1                            |                    |
|                                      |                    |
|   +-----------+                      |                    |
|   | queue 1   |                      |   BRPOP(keys)      |
|   | queue 2   |              keys    |                    |
|   | ......    | +--------+-----------------------------+  |
|   | queue n   |          ^           |                 |  |
|   +-----------+          |  keys     |                 |  |
|                          |           |                 |  |
|                          +           |                 |  |
|                  round_robin_cycle   |                 |  |               +--> queue 1 key
|                          ^           |                 |  |               |
|                          |           |                 |  |               |
|                          |           |                 |  | Single Thread |
|             +------------+---------+ |                 +---------------------> queue 2 key
|             |     keys list        | |                 |  |               |
|             +----------------------+ |                 |  |               |
+--------------------------------------+                 |  |               |
                                                         |  |               +--> queue 3 key
                                                         |  |
+--------------------------------------+                 |  |
| Worker 2                             |   BRPOP(keys)   |  |
|                                      | +---------------+  |
|                                      |                 |  |
+--------------------------------------+                 |  |
                                                         |  |
+--------------------------------------+   BRPOP(keys)   |  |
| Worker 3                             |                 |  |
|                                      |  +--------------+  |
|                                      |                    +
|                                      |
+--------------------------------------+

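1.3 Which process

Within the process pool, a strategy decides which process handles a given task.

1.3.1 Strategy

Let's start with the strategy. It is configured when AsynPool is initialized: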

class AsynPool(_pool.Pool):
    """AsyncIO Pool (no threads)."""

    def __init__(self, processes=None, synack=False,
                 sched_strategy=None, proc_alive_timeout=None,
                 *args, **kwargs):
        self.sched_strategy = SCHED_STRATEGIES.get(sched_strategy,
                                                   sched_strategy)

The strategies are defined as follows; their names largely explain what they do:

SCHED_STRATEGY_FCFS = 1  # first come, first served
SCHED_STRATEGY_FAIR = 4  # fair

SCHED_STRATEGIES = {
    None: SCHED_STRATEGY_FAIR,
    'default': SCHED_STRATEGY_FAIR,
    'fast': SCHED_STRATEGY_FCFS,
    'fcfs': SCHED_STRATEGY_FCFS,
    'fair': SCHED_STRATEGY_FAIR,
}
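
The lookup in AsynPool.__init__ uses dict.get with the argument itself as the fallback. A small sketch makes the consequences visible; resolve_strategy is a hypothetical helper name, while the constants and the mapping are copied from the code above.

```python
SCHED_STRATEGY_FCFS = 1
SCHED_STRATEGY_FAIR = 4

SCHED_STRATEGIES = {
    None: SCHED_STRATEGY_FAIR,
    'default': SCHED_STRATEGY_FAIR,
    'fast': SCHED_STRATEGY_FCFS,
    'fcfs': SCHED_STRATEGY_FCFS,
    'fair': SCHED_STRATEGY_FAIR,
}


def resolve_strategy(sched_strategy):
    """Same expression as in AsynPool.__init__ (helper name is made up)."""
    return SCHED_STRATEGIES.get(sched_strategy, sched_strategy)


default_strategy = resolve_strategy(None)      # no strategy given
named_strategy = resolve_strategy('fcfs')      # looked up by name
numeric_strategy = resolve_strategy(SCHED_STRATEGY_FAIR)  # passed through
```

With this mapping, omitting the strategy selects fair scheduling, and a value that is not a known name (for example an already resolved constant) is passed through unchanged.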

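1.3.2 Fair scheduling

Let's look at the concept of fair scheduling.

Different systems interpret fair scheduling in broadly similar ways. Here are a few examples.

In Linux, the scheduler must share CPU time among processes as fairly as possible while also taking task priorities into account. The general principle is to give every process in the system the greatest possible fairness relative to the computing power it needs, or, put another way, to try to ensure that no process is treated unfairly.
In Hadoop, fair scheduling is a method of assigning resources to jobs so that, over time, all jobs get an equal share of the shared resources on average. When a single job is running, it uses the entire cluster. When other jobs are submitted, the system assigns task slots that become free to the new jobs, so that each job gets roughly the same amount of CPU time.
In YARN, the fair share is the maximum amount of resources that can be allocated to a queue, which YARN computes from each queue's weight and its minimum and maximum resources.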

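1.3.3 Fair scheduling in Celery

asynpool checks whether the fair strategy is in use:

is_fair_strategy = self.sched_strategy == SCHED_STRATEGY_FAIR

Celery's fair scheduling shows up in several places, all based on the is_fair_strategy variable.

When polling starts under the fair strategy, write file descriptors are registered only if an idle worker exists, which gives idle workers a chance to be scheduled.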

def on_poll_start():
    # Determine which io descriptors are not busy
    inactive = diff(active_writes)

    # Determine hub_add vs hub_remove strategy conditional
    if is_fair_strategy:
        # outbound buffer present and idle workers exist
        add_cond = outbound and len(busy_workers) < len(all_inqueues)
    else:  # default is add when data exists in outbound buffer
        add_cond = outbound

    if add_cond:  # calling hub_add vs hub_remove
        iterate_file_descriptors_safely(
            inactive, all_inqueues, hub_add,
            None, WRITE | ERR, consolidate=True)
    else:
        iterate_file_descriptors_safely(
            inactive, all_inqueues, hub_remove)
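
The condition computed in on_poll_start can be isolated as a pure function. This is a sketch with a made-up function name and plain Python containers in place of the pool's internal state; only the boolean logic is taken from the code above.

```python
def should_add_writers(outbound, busy_workers, all_inqueues, is_fair_strategy):
    """Decide between hub_add (True) and hub_remove (False)."""
    if is_fair_strategy:
        # Pending data AND at least one worker that is not busy.
        return bool(outbound) and len(busy_workers) < len(all_inqueues)
    # Otherwise pending data alone is enough.
    return bool(outbound)


inqueues = {10: "w1", 11: "w2"}  # fd -> worker (illustrative values)

all_busy_fair = should_add_writers(["job"], {10, 11}, inqueues, True)
all_busy_fcfs = should_add_writers(["job"], {10, 11}, inqueues, False)
one_idle_fair = should_add_writers(["job"], {10}, inqueues, True)
```

Under the fair strategy, pending jobs stay in the outbound buffer while every worker is busy; without it, they are written to a worker's pipe as soon as it is writable.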

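When a write is actually scheduled, the pool also checks whether the worker is already busy executing a task. If so, it is skipped, which gives other, idle workers a chance to be scheduled.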

        def schedule_writes(ready_fds, total_write_count=None):
            if not total_write_count:
                total_write_count = [0]
            # Schedule write operation to ready file descriptor.
            # The file descriptor is writable, but that does not
            # mean the process is currently reading from the socket.
            # The socket is buffered so writable simply means that
            # the buffer can accept at least 1 byte of data.

            # This means we have to cycle between the ready fds.
            # the first version used shuffle, but this version
            # using `total_writes % ready_fds` is about 30% faster
            # with many processes, and also leans more towards fairness
            # in write stats when used with many processes
            # [XXX On macOS, this may vary depending
            # on event loop implementation (i.e, select/poll vs epoll), so
            # have to test further]
            num_ready = len(ready_fds)

            for _ in range(num_ready):
                ready_fd = ready_fds[total_write_count[0] % num_ready]
                total_write_count[0] += 1
                if ready_fd in active_writes:
                    # already writing to this fd
                    continue 
                if is_fair_strategy and ready_fd in busy_workers:  # skip busy workers under the fair strategy
                    # worker is already busy with another task
                    continue
                if ready_fd not in all_inqueues:
                    hub_remove(ready_fd)
                    continue
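
The selection loop in schedule_writes can be reduced to a function that returns the file descriptor it would write to. This is a simplified sketch: pick_ready_fd is a hypothetical name, the hub_remove branch is left out, and the actual write is not shown.

```python
def pick_ready_fd(ready_fds, total_write_count, active_writes,
                  busy_workers, is_fair_strategy):
    """Return the next fd to write to, or None if all are skipped."""
    num_ready = len(ready_fds)
    for _ in range(num_ready):
        # Cycle through the ready fds using a running write counter.
        ready_fd = ready_fds[total_write_count[0] % num_ready]
        total_write_count[0] += 1
        if ready_fd in active_writes:
            continue  # already writing to this fd
        if is_fair_strategy and ready_fd in busy_workers:
            continue  # worker is already busy with another task
        return ready_fd
    return None


ready = [10, 11, 12]

fair_counter = [0]
fair_choice = pick_ready_fd(ready, fair_counter, {10}, {11}, True)

fcfs_counter = [0]
fcfs_choice = pick_ready_fd(ready, fcfs_counter, {10}, {11}, False)
```

In this example fd 10 has a write in progress and fd 11 belongs to a busy worker. The fair strategy skips both and picks fd 12, while the non-fair strategy only skips fd 10 and picks fd 11, queueing the job behind the task that worker is already running.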

The overall logic is as follows:

                                                          +
                                            Kombu         |    Redis
                                                          |
                                         BRPOP(keys)      |
+------------------------------------+                    |
|  Worker 1                          | +---------------+  |
|                                    |                 |  |
+------------------------------------+                 |  |                     queue 1 key
                                                       |  |                 +->
                                                       |  |                 |
+------------------------------------+     BRPOP(keys) |  |  Single thread  |
| Worker 2                           | +--------------------------------------> queue 2 key
|                                    |                 |  |  (which worker) |
+------------------------------------+                 |  |                 |
                                                       |  |                 |
+------------------------------------+                 |  |                 +-> queue 3 key
| Worker 3                           |                 |  |
|                                    |                 |  |
|     +-----------+                  |                 |  |
|     | queue 1   |                  |     BRPOP(keys) |  |
|     | queue 2   |          keys    |                 |  |
|     | ......    | +--------+-------------------------+  |
|     | queue n   |          ^       |                    |
|     +-----------+          | keys  |                    |
|                            |       |                    |
|                            +       |                    |
|                 round_robin_cycle (which queues)        |
|                            ^       |                    |
|                            |       |                    |
|                            |       |                    |
|                       +----+----+  |                    |
|             +         |keys list|  |                    |
|             |         +---------+  |                    |
+------------------------------------+                    |
              |                                           |
              |  fair_strategy(which subprocess)          |
              |                                           |
      +-------+----------+----------------+               |
      |                  |                |               |
      v                  v                v               |
+-----+--------+  +------+-------+  +-----+--------+      |
| subprocess 1 |  | subprocess 2 |  | subprocess 3 |      +
+--------------+  +--------------+  +--------------+

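0x02 Autoscaler

The Autoscaler adjusts the size of the process pool at runtime. Because this also helps relieve load, it is covered here as well.

2.1 When it is called

In WorkerComponent we can see that two call paths are registered for the Autoscaler:

It is registered in the consumer's message handler, so the pool is adjusted, if needed, whenever a message is consumed.
It is registered as a periodic task through the Hub's call_repeatedly method, so the need for adjustment is also checked at regular intervals.

Together these ensure that maybe_scale is called as often as possible.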

class WorkerComponent(bootsteps.StartStopStep):
    """Bootstep that starts the autoscaler thread/timer in the worker."""

    def create(self, w):
        scaler = w.autoscaler = self.instantiate(
            w.autoscaler_cls,
            w.pool, w.max_concurrency, w.min_concurrency,
            worker=w, mutex=DummyLock() if w.use_eventloop else None,
        )
        return scaler if not w.use_eventloop else None

    def register_with_event_loop(self, w, hub):
        w.consumer.on_task_message.add(w.autoscaler.maybe_scale)  # adjust, if needed, when a message is consumed

        hub.call_repeatedly(  # periodically check whether an adjustment is needed
            w.autoscaler.keepalive, w.autoscaler.maybe_scale,
        )

2.2 Implementation

2.2.1 bgThread

Autoscaler is a background thread (a subclass of bgThread), so it can run in the background:

class bgThread(threading.Thread):
    """Background service thread."""

    def run(self):
        body = self.body
        shutdown_set = self._is_shutdown.is_set
        try:
            while not shutdown_set():
                body()
        finally:
            self._set_stopped()

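2.2.2 Definition

The Autoscaler is defined below. Its logic is to check periodically whether the pool needs to be resized:

If the number of reserved requests, capped at max_concurrency, exceeds the current number of processes, it scales up.
Otherwise, if the number of reserved requests, raised to at least min_concurrency, is below the current number of processes, it scales down.
The actual scaling is done by the pool's grow and shrink methods, whose behavior depends on the pool implementation and the operating system, so it is not covered here.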

class Autoscaler(bgThread):
    """Background thread to autoscale pool workers."""

    def __init__(self, pool, max_concurrency,
                 min_concurrency=0, worker=None,
                 keepalive=AUTOSCALE_KEEPALIVE, mutex=None):
        super().__init__()
        self.pool = pool
        self.mutex = mutex or threading.Lock()
        self.max_concurrency = max_concurrency
        self.min_concurrency = min_concurrency
        self.keepalive = keepalive
        self._last_scale_up = None
        self.worker = worker

    def body(self):
        with self.mutex:
            self.maybe_scale()
        sleep(1.0)

    def _maybe_scale(self, req=None):
        procs = self.processes
        cur = min(self.qty, self.max_concurrency)
        if cur > procs:
            self.scale_up(cur - procs)
            return True
        cur = max(self.qty, self.min_concurrency)
        if cur < procs:
            self.scale_down(procs - cur)
            return True

    def maybe_scale(self, req=None):
        if self._maybe_scale(req):
            self.pool.maintain_pool()

    def update(self, max=None, min=None):
        with self.mutex:
            if max is not None:
                if max < self.processes:
                    self._shrink(self.processes - max)
                self._update_consumer_prefetch_count(max)
                self.max_concurrency = max
            if min is not None:
                if min > self.processes:
                    self._grow(min - self.processes)
                self.min_concurrency = min
            return self.max_concurrency, self.min_concurrency

    def scale_up(self, n):
        self._last_scale_up = monotonic()
        return self._grow(n)

    def scale_down(self, n):
        if self._last_scale_up and (
                monotonic() - self._last_scale_up > self.keepalive):
            return self._shrink(n)

    def _grow(self, n):
        self.pool.grow(n)

    def _shrink(self, n):
        self.pool.shrink(n)

    def _update_consumer_prefetch_count(self, new_max):
        diff = new_max - self.max_concurrency
        if diff:
            self.worker.consumer._update_prefetch_count(
                diff
            )

    @property
    def qty(self):
        return len(state.reserved_requests)

    @property
    def processes(self):
        return self.pool.num_processes
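
The decision made in _maybe_scale is easier to follow with concrete numbers. The function below is a sketch that restates the same comparisons as a pure function; decide_scale is a made-up name, and it ignores the keepalive check that scale_down applies before actually shrinking the pool.

```python
def decide_scale(qty, procs, min_concurrency, max_concurrency):
    """Return ('up', n), ('down', n), or None, following _maybe_scale.

    qty   -- number of reserved requests
    procs -- current number of pool processes
    """
    cur = min(qty, max_concurrency)
    if cur > procs:
        return ('up', cur - procs)
    cur = max(qty, min_concurrency)
    if cur < procs:
        return ('down', procs - cur)
    return None


# 8 reserved requests, 3 processes, limits 2..5: grow to the maximum.
grow = decide_scale(qty=8, procs=3, min_concurrency=2, max_concurrency=5)

# 1 reserved request, 5 processes, limits 2..10: shrink to the minimum.
shrink = decide_scale(qty=1, procs=5, min_concurrency=2, max_concurrency=10)

# 4 reserved requests and 4 processes: nothing to do.
steady = decide_scale(qty=4, procs=4, min_concurrency=2, max_concurrency=10)
```

In the real class, a 'down' decision only takes effect if a scale-up has happened before and more than keepalive seconds have passed since then, which keeps the pool from shrinking right after it has grown.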

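0xEE About the author

★★★★★★ Thoughts on life and technology ★★★★★★

WeChat official account: 羅西的思考

Follow the account if you would like to be notified of new articles or to see the author's recommended technical resources.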


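0xFF References

Hadoop Fair Scheduler guide

A brief analysis of the Completely Fair Scheduler (CFS) in Linux

Detailed analysis of YARN fair scheduling (Part 1)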
