深入淺出 Locust 實現

bugVanisher發表於2020-06-17

原文網址 : https://testerhome.com/topics/24300

寫在前面

本文將嘗試通過一篇文章講清楚開源壓測工具——Locust的原理和實現過程，聚焦在實現上，不拘泥在一堆原始碼中（本身Locust的原始碼就比較簡單）。本人並不是Locust鐵粉，只是對它的實現方式感興趣，所以jmeter、wrk的粉絲們就不要來battle我了。在我看來，工具都有各自的優勢和缺陷，區別在於使用的人，能否選擇最合適的工具達到目的，當然你可以掌握所有的主流工具，對於特定的任務，哪個簡單用哪個，也可以鍾情於一個工具，信手拈來，只要你能在預期時間內完成也無可厚非。
網上有很多關於Locust原始碼實現的講解，但是我覺得都太片面了，並且沒有講清楚，恰巧Locust在上個月（5月）釋出了1.x版本，整個重構力度還是蠻大的，因此我想通過這篇文章，介紹它最新的設計以及展示最全面的Locust實現，讓大家感受到Locust的簡單與優雅（整個Locust專案不過4M大小）。

一、架構與核心類

基本介紹

Locust是開源、使用Python開發、基於事件、支援分散式並且提供Web UI進行測試執行和結果展示的效能測試工具。

Locust的主要特性有兩個：

模擬使用者操作:支援多協議，Locust可以用於壓測任意協議型別的系統
併發機制:摒棄了程式和執行緒，採用協程（gevent）的機制，單臺測試機可以產生數千併發壓力

Locust使用了以下幾個核心庫：

1) gevent
    gevent是一種基於協程的Python網路庫，它用到Greenlet提供的，封裝了libevent事件迴圈的高層同步API。
2) flask
    Python編寫的輕量級Web應用框架。
3) requests
    Python Http庫
4) msgpack-python
    MessagePack是一種快速、緊湊的二進位制序列化格式，適用於類似JSON的資料格式。msgpack-python主要提供MessagePack資料序列化及反序列化的方法。
5) pyzmq
    pyzmq是zeromq(一種通訊佇列)的Python實現,主要用來實現Locust的分散式模式執行

系統架構及對比

我們知道，完整的壓測系統應該包含以下元件：

相比於主流的壓測系統LoadRunner和Jmeter，Locust是一個更為純粹的開源壓測系統，框架本身的結構並不複雜，甚至只提供了最基礎的元件，但也正因為如此，Locust具有極高的可程式設計性和擴充套件性。對於測試開發同學來說，可以比較輕鬆地使用Locust實現對任意協議的壓測。

工具簡單對比

	LoadRunner	Jmeter	Locust
壓力生成器	√	√	√
負載控制器	√	√	√
系統資源監控器	√	x	x
結果採集器	√	√	√
結果分析器	√	√	√

上表簡單展示了幾個工具包含的壓測元件，Locust的架構非常簡單，部分元件的實現甚至都不完善，比如結果分析器，Locust本身並沒有很詳細的測試報告。但這並不妨礙它成為優秀的開源框架。

Locust的架構

Locust 最近釋出了1.x 版本，程式碼進行了重構，重封裝，對很多類進行了重新命名，本文儘量使用新名詞，但涉及的舊名詞不影響理解。

locust架構上使用master-slave模型，支援單機和分散式
master和slave（即worker）使用 ZeroMQ 協議通訊
提供web頁面管理master，從而控制slave，同時展示壓測過程和彙總結果
可選no-web模式（一般用於除錯）
基於Python本身已經支援跨平臺

為了減少CPython的GIL限制，充分利用多核CPU，建議單機啟動CPU核數的slave（多程式）。

核心類

上面展示的Locust架構，從程式碼層面來看究竟是如何實現的呢？下面我們就來窺視一番：

簡單來說，Locust的程式碼分為以下模組：

Runner-執行器，Locust的核心類，定義了整個框架的執行邏輯，實現了Master、Slave（worker）等執行器
User-壓測用例，提供了HttpUser壓測http協議，使用者可以定義事務，斷言等，也可以實現特定協議的User
Socket-通訊器，提供了分散式模式下master和slave的互動方式
RequestStats-採集、分析器，定義了結果分析和資料上報格式
EventHook-事件鉤子，通過預先定義的事件使得我們可以在這些事件發生時(比如slave上報)做一些額外的操作
WebUI-提供web介面的操作檯和壓測過程展示

下面我們看看核心類的主要成員變數和方法

使用者定義的User類作為Runner的user_classes傳入
master的client_listener監聽施壓端client訊息
slave的worker方法監聽master訊息
slave的stats_reporter方法上報壓測資料，預設3s上報一次
slave的start啟動協程，使用master分配的併發數開始壓測
slave預設1s上報一次心跳，如果master超過3s未收到某個slave的心跳則會將其標記為missing狀態
TaskSet和User持有client，可以在類中直接發起客戶端請求，client可以自己實現，Locust只實現了HttpUser

二、Runner的狀態與通訊機制

接下來我們繼續瞭解Locust分散式壓測的核心：Runner的狀態和通訊機制。我們知道Locust等壓測工具支援分散式壓測，就是說理論上可以通過不斷新增壓力機(worker)提高併發數量，這個機制讓使用者可以自由地增減機器資源，從而達到期望的施壓能力。

Runner狀態機

在分散式場景下，除了資料一致性，狀態同步也是非常重要的。在Locust的master-slave架構下，需要管理master和slave的狀態，不僅為了控制壓測的開始或停止，也是為了掌握當前的壓力機情況。那麼都有哪些狀態？

狀態	說明
ready	準備就緒，master和slave啟動後預設狀態
hatching	正在孵化壓力機，對master來說正在告訴slave們開始幹活，對slave來說是過渡狀態，因為它們馬上要running
running	正在壓測
cleanup	當發生GreenletExit時的狀態，一般不會出現
stopping	表示正在通知slave們停止，只有master有這個狀態
stopped	壓測已經停止
missing	狀態丟失，只有slave有的狀態，預設3秒如果master沒有收到slave的心跳就會認為它missing了，一般是程式沒有正常退出導致

Runner的狀態不多，但是在壓測過程中起到非常重要的作用，狀態之間是按約定的方式進行扭轉的，我們使用Locust的web介面管理master的狀態，master根據我們的操作通過通訊機制推進slave的狀態。

通訊機制

Master與Slave之間是通過Zeromq建立的TCP連線進行通訊的（一對多）。

ZeroMQ（簡稱ZMQ）是一個基於訊息佇列的多執行緒網路庫，其對套接字型別、連線處理、幀、甚至路由的底層細節進行抽象，提供跨越多種傳輸協議的套接字。

ZMQ是網路通訊中新的一層，介於應用層和傳輸層之間（按照TCP/IP劃分），其是一個可伸縮層，可並行執行，分散在分散式系統間。

master與各個slave各維持一個TCP連線，在每個連線中，master下發的命令，slave上報的資訊等自由地的傳輸著。

訊息格式

class Message(object):
    def __init__(self, message_type, data, node_id):
        self.type = message_type
        self.data = data
        self.node_id = node_id

    def serialize(self):
        return msgpack.dumps((self.type, self.data, self.node_id))

    @classmethod
    def unserialize(cls, data):
        msg = cls(*msgpack.loads(data, raw=False))
        return msg

其中message_type指明訊息型別，data是實際的訊息內容，node_id指明機器ID。Locust使用msgpack做序列化與反序列化處理。

訊息型別和結構

master和slave之間的訊息型別不過10種，其中大部分訊息由slave上報給master，下方表格可以清楚的看到，實現一套分散式系統並沒有那麼複雜。

序號	message_type	傳送者	data格式	傳送時機
1	client_ready	slave	空	slave啟動後或壓測停止完成
2	hatching	slave	空	接受到master的hatch先傳送一個確認
3	hatch_complete	slave	{"count":n}	所有併發使用者已經孵化完成
4	client_stopped	slave	空	停止所有併發使用者後
5	heatbeat	slave	{"state": x,"current_cpu_usage":x}	預設每3秒上報一次心跳
6	stats	slave	{"stats":[], "stats_total":{}, "errors":{},"user_count":x}	每3秒上報一次壓測資訊
7	exception	slave	{"msg":x, "traceback":x}	TaskSet.run出現異常
8	hatch	master	{"hatch_rate":x, "num_users":x, "host":x, "stop_timeout":x}	開始swarm
9	stop	master	空	點選stop
10	quit	master，slave	空	手動或其他方式退出的時候

可以看到上面有一種非常重要的訊息型別——stats，壓測的結果採集都封裝在這個訊息裡。

三、結果採集器

統計物件

從上面我們知道有一個非常重要的訊息型別——stats，這個是slave給master傳送的訊息，預設每3秒鐘上報一次。stats訊息的結構如下所示：

{
    "stats": [],
    "stats_total": {},
    "errors": {},
    "user_count": 10
}

實際上，slave也是持有著類似如上json格式的三個物件：

RequestStats
StatsEntry
StatsError

其中每一個locust程式會維護一個全域性RequestStats單例global_stats，這個例項包含一個StatsEntry例項total（對應json的stats_total），以及兩個字典entries（對應json的stats）和errors（對應json的errors），其中entries字典key為(name,method),對應值為一個StatsEntry例項，另一個errors字典的key為(name,method,error),對應值為一個StatsError例項。可見，global_stats包含了單個slave彙總的資訊，以及各個請求url或name的統計資訊。在分散式的場景下，每個slave都會維護一個global_stats, 在一個上報週期達到後將資訊傳送到master，傳送完成後就會重置所有資料，進入下一個週期的統計。

上圖中紅色的欄位是slave真正上報給master的資料。

統計過程

那麼slave是如何統計訊息，又究竟需要上報什麼內容給master？master又是如何彙總的呢？下面我們來看看整個統計過程：

在每一次請求後，無論是成功還是失敗，都會觸發對應的request_success或者request_failure事件，stats.py檔案中的資料統計模組訂閱了對應的事件，會呼叫global_stats對資料進行統計。

在slave的一個上報週期達到後，觸發on_report_to_master事件，此時global_stats就會依次呼叫以下方法對資料進行序列化：

serialize_stats
total.get_stripped_report
serialize_errors

其實也就是對上面提到的total和兩個字典中的內容進行序列化，其實就是轉為json字串。

def on_report_to_master(client_id, data):
    data["stats"] = global_stats.serialize_stats()
    data["stats_total"] = global_stats.total.get_stripped_report()
    data["errors"] =  global_stats.serialize_errors()
    global_stats.errors = {}

下圖是斷點過程看到的stats訊息內容（在msgpack序列化之前）:

每秒請求數和響應時間及其對應請求個數

StatsEntry有兩個比較重要的物件，分別是num_reqs_per_sec和response_times，它們都是字典型別，其中num_reqs_per_sec的key是秒時間戳，顯示當前秒完成了多少個請求，統計的時間是完成請求的時刻，比如如果一個請求從第0秒開始，第3秒完成，那麼這個請求統計在第3秒的時間戳上，這個物件可以很方便的計算出rps。response_times的key是響應時間，單位是豪秒，為了防止key過多，做了取整，比如147 取 150, 3432 取 3400 和 58760 取 59000，這個是為了方便獲得類似90%請求的完成時間(小於等於該時間)，99%請求的完成時間，下面具體的原始碼：

def _log_time_of_request(self, current_time):
    t = int(current_time)
    self.num_reqs_per_sec[t] = self.num_reqs_per_sec.setdefault(t, 0) + 1
    self.last_request_timestamp = current_time

def _log_response_time(self, response_time):
    if response_time is None:
        self.num_none_requests += 1
        return

    self.total_response_time += response_time

    if self.min_response_time is None:
        self.min_response_time = response_time

    self.min_response_time = min(self.min_response_time, response_time)
    self.max_response_time = max(self.max_response_time, response_time)

    # to avoid to much data that has to be transfered to the master node when
    # running in distributed mode, we save the response time rounded in a dict
    # so that 147 becomes 150, 3432 becomes 3400 and 58760 becomes 59000
    if response_time < 100:
        rounded_response_time = response_time
    elif response_time < 1000:
        rounded_response_time = int(round(response_time, -1))
    elif response_time < 10000:
        rounded_response_time = int(round(response_time, -2))
    else:
        rounded_response_time = int(round(response_time, -3))

    # increase request count for the rounded key in response time dict
    self.response_times.setdefault(rounded_response_time, 0)
    self.response_times[rounded_response_time] += 1

master彙總資訊

slave的每一個stats訊息到達master後，都會觸發master的slave_report事件，master也擁有自己的global_stats，因此只需要將對應的資訊進行累加（可以理解是所有slave對應內容的彙總）。具體在StatsEntry的extend方法：

def extend(self, other):
        """
        Extend the data from the current StatsEntry with the stats from another
        StatsEntry instance. 
        """
        if self.last_request_timestamp is not None and other.last_request_timestamp is not None:
            self.last_request_timestamp = max(self.last_request_timestamp, other.last_request_timestamp)
        elif other.last_request_timestamp is not None:
            self.last_request_timestamp = other.last_request_timestamp
        self.start_time = min(self.start_time, other.start_time)

        self.num_requests = self.num_requests + other.num_requests
        self.num_none_requests = self.num_none_requests + other.num_none_requests
        self.num_failures = self.num_failures + other.num_failures
        self.total_response_time = self.total_response_time + other.total_response_time
        self.max_response_time = max(self.max_response_time, other.max_response_time)
        if self.min_response_time is not None and other.min_response_time is not None:
            self.min_response_time = min(self.min_response_time, other.min_response_time)
        elif other.min_response_time is not None:
            # this means self.min_response_time is None, so we can safely replace it
            self.min_response_time = other.min_response_time
        self.total_content_length = self.total_content_length + other.total_content_length

        for key in other.response_times:
            self.response_times[key] = self.response_times.get(key, 0) + other.response_times[key]
        for key in other.num_reqs_per_sec:
            self.num_reqs_per_sec[key] = self.num_reqs_per_sec.get(key, 0) + other.num_reqs_per_sec[key]
        for key in other.num_fail_per_sec:
            self.num_fail_per_sec[key] = self.num_fail_per_sec.get(key, 0) + other.num_fail_per_sec[key]

核心指標

Locust核心的指標其實就4個：

併發數
RPS
響應時間
異常率

我們還是回到文章開頭的那個json：

{
    "stats": [],
    "stats_total": {},
    "errors": {},
    "user_count": 10
}

結合上方Locust的壓測過程截圖，我們可以看到，各個介面的指標其實就是stats物件裡的各個欄位，而最下方彙總的Aggregated這一行則對應stats_total的各個欄位，儘管這個json只是slave單個stats訊息的內容，卻也是最終要顯示的內容，只是master對各個訊息做了彙總而已。彙總的方式也相當簡單，請見上方的StatsEntry的extend方法。

因為master和web模組是一起部署的，因此web可以直接使用master的global_stats物件並展示其內容，可以做到動態顯示。

本文講解的eventHook基於Locust的1.x版本。

四、事件鉤子

Locust中的事件鉤子，我認為設計得非常巧妙，所以最後再來講一下它。event.py模組包含了兩個類，一個是事件鉤子定義類EventHook，一個是事件鉤子型別類Events，為不同的事件提供hook。事件處理函式註冊相應的hook以後，我們可以很方便的的基於event觸發處理函式，實現事件驅動。

EventHook定義了三個方法：

def add_listener(self, handler):
    self._handlers.append(handler)
    return handler

def remove_listener(self, handler):
    self._handlers.remove(handler)

def fire(self, *, reverse=False, **kwargs):
    if reverse:
        handlers = reversed(self._handlers)
    else:
        handlers = self._handlers
    for handler in handlers:
        handler(**kwargs)

可以看到add_listener、remove_listener的作用是註冊或刪除監聽函式，fire方法是觸發處理函式。EventHook的實現相比Locust 0.x版本有較大改變（不再使用Python的內建魔方方法）。

Events中包含了11個事件鉤子，分別是：

事件鉤子的實現原理

事件鉤子的原理可以簡單理解成，1.定義處理函式 ——> 2.註冊到某個eventHook ——> 3.在某個時機觸發eventHook ——> 4.該eventHook遍歷執行所有處理函式。在程式碼層面就是定義函式，然後add_listener，最後在想要的位置觸發eventHook的fire方法並傳入定義好的引數，這裡引數是固定的，不能隨意傳入，之所以每個處理函式都能對引數進行修改，是因為這裡的引數傳遞是『引用傳遞』，明白了這一點就算是掌握了EventHook的基本原理。

其中Locust本身預定義並註冊了一些處理函式，比如事件鉤子report_to_master、worker_report都有對應的處理函式，實現分散式模式下資料上報時資料的構造和計算，比如事件鉤子init，初始化master的WebUI。

事件鉤子的作用

那麼，事件鉤子究竟有什麼作用？

在我看來有以下作用：

程式碼解耦，比如解耦Runner和Stats模組。
提供擴充套件性，通過預置的鉤子和觸發點，為使用者提供了可擴充套件的特性。

舉擴充套件性的例子，使用者可以很輕鬆地：

往worker上報的stats訊息新增自定義資料
在Locust例項啟動時在WebUI中新增自定義介面
在每次請求成功或失敗之後做一些額外的事情

如何使用鉤子

在Locust 1.x版本之前，使用以下方法定義和註冊鉤子：

def on_report_to_master(client_id, data):
    data["stats"] = global_stats.serialize_stats()
    data["stats_total"] = global_stats.total.get_stripped_report()
    data["errors"] =  global_stats.serialize_errors()
    global_stats.errors = {}

def on_slave_report(client_id, data):
    for stats_data in data["stats"]:
        entry = StatsEntry.unserialize(stats_data)
        request_key = (entry.name, entry.method)
        if not request_key in global_stats.entries:
            global_stats.entries[request_key] = StatsEntry(global_stats, entry.name, entry.method)
        global_stats.entries[request_key].extend(entry)
        ...

events.report_to_master += on_report_to_master
events.slave_report += on_slave_report

在1.x及之後的版本有兩種使用方式：

# 方式一與之前類似
def on_report_to_master(client_id, data):
    data["stats"] = stats.serialize_stats()
    data["stats_total"] = stats.total.get_stripped_report()
    data["errors"] =  stats.serialize_errors()
    stats.errors = {}

def on_worker_report(client_id, data):
    for stats_data in data["stats"]:
        entry = StatsEntry.unserialize(stats_data)
        request_key = (entry.name, entry.method)
        if not request_key in stats.entries:
            stats.entries[request_key] = StatsEntry(stats, entry.name, entry.method, use_response_times_cache=True)
        stats.entries[request_key].extend(entry)

    for error_key, error in data["errors"].items():
        if error_key not in stats.errors:
            stats.errors[error_key] = StatsError.from_dict(error)
        else:
            stats.errors[error_key].occurrences += error["occurrences"]

    stats.total.extend(StatsEntry.unserialize(data["stats_total"]))

events.report_to_master.add_listener(on_report_to_master)
events.worker_report.add_listener(on_worker_report)

# 方式二，使用裝飾器
@events.report_to_master.add_listener
def on_report_to_master(client_id, data):
    """
    This event is triggered on the worker instances every time a stats report is
    to be sent to the locust master. It will allow us to add our extra content-length
    data to the dict that is being sent, and then we clear the local stats in the worker.
    """
    data["content-length"] = stats["content-length"]
    stats["content-length"] = 0

@events.worker_report.add_listener
def on_worker_report(client_id, data):
    """
    This event is triggered on the master instance when a new stats report arrives
    from a worker. Here we just add the content-length to the master's aggregated
    stats dict.
    """
    stats["content-length"] += data["content-length"]

第二種方式更加簡潔，不容易忘記註冊處理函式。

寫在最後

首先需要宣告的是，本人並不是效能測試方向的專家，只是做過一些效能測試，搗鼓過一些工具，之所以寫出這篇文章，初衷只是為了在組內分享。瞭解Locust的人，很少會真的用他來進行大型業務場景壓測，一是它的壓測效能受限，二是它沒有很好的測試報告，現在這些在我看來已經不是問題，我想在後面分兩篇文章介紹，敬請期待：

一、如何提高Locust的壓測效能——boomer
二、重新定義Locust的測試報告

深入淺出開源效能測試工具 Locust (使用篇 2)
2019-02-14
深入淺出--梯度下降法及其實現
2021-09-09
梯度
深入淺出 Vue 系列 -- 資料劫持實現原理
2019-04-22
Vue
深入淺出 Java 中列舉的實現原理
2021-12-14
Java
深入淺出FE（十四）深入淺出websocket
2020-11-29
Web
深入淺出換膚相關技術以及如何實現
2019-02-20
淺讀-《深入淺出Nodejs》
2023-03-30
NodeJS
深入淺出mongoose
2019-03-03
Go
HTTP深入淺出
2019-04-05
HTTP
深入淺出Webpack
2018-04-09
Web
深入淺出HTTP
2018-06-09
HTTP
mysqldump 深入淺出
2018-08-08
MySql
深入淺出——MVC
2020-05-26
MVC
深入淺出IO
2019-05-14
深入淺出decorator
2018-12-28
ArrayList 深入淺出
2021-07-23
深入淺出 RabbitMQ
2021-09-09
MQ
深入淺出Promise
2021-06-08
Promise
深入淺出 ZooKeeper
2020-12-29
基於React 原始碼深入淺出setState:setState非同步實現
2019-03-03
React原始碼非同步
IO多路複用——深入淺出理解select、poll、epoll的實現
2024-03-27
Locust 斷言的實現？
2024-03-19
深入淺出Nginx實戰與架構
2022-05-29
Nginx架構
深入淺出：使用Java和Spring Security實現認證與授權
2024-07-20
JavaSpring
【深入淺出 Yarn 架構與實現】4-3 RM 管理 NodeManager
2022-12-26
Yarn架構
【深入淺出 Yarn 架構與實現】4-4 RM 管理 Application
2023-01-11
Yarn架構APP
【深入淺出 Yarn 架構與實現】4-1 ResourceManager 功能概述
2022-11-22
Yarn架構
Flutter | 深入淺出Key
2019-04-02
Flutter
深入淺出 Laravel Echo
2018-09-20
Laravel
深入淺出理解Redux
2018-05-11
Redux
深入淺出 Laravel Macroable
2019-11-04
LaravelMac
flutter ScopedModel深入淺出
2019-09-10
Flutter
反射的深入淺出
2019-08-07
反射
《深入淺出webpack》有感
2019-10-09
Web
深入淺出Spring MVC
2020-05-31
SpringMVC
深入淺出Tomcat系列
2019-01-31
Tomcat
[譯] 深入淺出 SVG
2018-04-19
SVG
SSL/TLS 深入淺出
2024-07-29
TLS