[Source Code Analysis] Model-Parallel Distributed Training with Megatron (5) -- PipeDream-Flush

Posted by 羅西的思考 on 2022-02-14

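0x00 Abstract

NVIDIA Megatron is a PyTorch-based distributed training framework for training very large Transformer language models. It combines data parallelism, tensor parallelism, and pipeline parallelism to reproduce GPT-3, and the mechanisms behind it are worth a close look. This series has five articles that work through the papers and the source code. This article looks at how Megatron schedules the execution order of the pipeline stages.

Other articles in this series:

[Source Code Analysis] Model-Parallel Distributed Training with Megatron (1) --- Papers & Basics

[Source Code Analysis] Model-Parallel Distributed Training with Megatron (2) --- Overall Architecture

[Source Code Analysis] Model-Parallel Distributed Training with Megatron (3) --- Model Parallelism Implementation

[Source Code Analysis] Model-Parallel Distributed Training with Megatron (4) --- How to Configure the Various Kinds of Parallelism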

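0x01 Background

In pipeline training, deciding the order in which each stage executes its work is key, so here we look at how the schedule is built.

In Megatron, training calls get_forward_backward_func to obtain the pipeline schedule. There are two kinds, flush and interleaving. For lack of time we only analyze the flush schedule; interested readers can study interleaving on their own.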

def get_forward_backward_func():
    args = get_args()
    if mpu.get_pipeline_model_parallel_world_size() > 1:
        if args.virtual_pipeline_model_parallel_size is not None:
            forward_backward_func = forward_backward_pipelining_with_interleaving
        else:
            forward_backward_func = forward_backward_pipelining_without_interleaving
    else:
        forward_backward_func = forward_backward_no_pipelining
    return forward_backward_func

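In short, Megatron implements periodic flushes on top of PipeDream-2BW.

PipeDream-2BW maintains two versions of the model weights in the pipeline; "2BW" stands for double-buffered weights. It generates a new weight version every m microbatches (m ≥ d, where d is the pipeline depth). Because some in-flight backward passes still depend on the old version, the new version cannot replace it immediately. Since only two versions are ever stored, however, the memory footprint drops sharply.
PipeDream-Flush adds a globally synchronized pipeline flush on top of PipeDream-2BW, similar in spirit to GPipe. It gives up some throughput in exchange for a smaller memory footprint (only one version of the model weights is maintained).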

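0x02 Papers

"Memory-Efficient Pipeline-Parallel DNN Training" and "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" are the papers behind Megatron, so we start from them. Note: the discussion below reflects the state of things when the papers were published. The other open-source systems have kept evolving, so the papers' claims about them may no longer be fully accurate.

2.1 Introduction

Recent work has proposed pipeline model parallelism to speed up model-parallel training. GPipe (Huang et al., 2019) and PipeDream (Harlap et al., 2018; Narayanan et al., 2019), for example, push multiple inputs in sequence through a series of workers, each responsible for one model partition, which lets different workers process different inputs in parallel.

Naive pipelining can keep a model from converging, because the weight versions used in the forward and backward passes of a given input are inconsistent. Existing techniques avoid this in different ways, trading off memory footprint against throughput.
GPipe maintains a single weight version but flushes the pipeline periodically (Figure 1a), namely once the inputs in the pipeline have been processed and the weights need to be updated. Resources sit idle during these flushes, which limits overall throughput.
PipeDream does not flush the pipeline periodically but stores multiple weight versions. This raises throughput but also raises memory footprint, so large models cannot be trained within memory limits.

Training large models efficiently therefore needs an approach with both high throughput and low memory footprint. In addition, the performance of a pipeline-parallel system depends on how the DNN model's operators are partitioned across workers. This is challenging for three reasons:

Memory capacity constraints: the parameters and intermediate activations of a model partition must fit in the accelerator's main device memory.
Heterogeneous network interconnects: today's training deployments have heterogeneous network topologies, with higher-bandwidth links between devices on the same server.
Large search space for operator placement: as model size grows, partitioning the operator graph becomes computationally expensive, because the number of distinct partitionings is exponential.

Figure 1a. Timelines of different pipeline-parallel executions. The backward pass is assumed to take twice as long as the forward pass; forward passes are shown in blue and backward passes in green. Numbers indicate microbatch IDs, time runs along the x-axis, and per-worker utilization is shown along the y-axis. GPipe maintains a single weight version but flushes the pipeline periodically. PipeDream introduces no periodic pipeline flushes but maintains multiple weight versions.

In the paper the authors present PipeDream-2BW, an efficient pipeline-parallel training system for DNN models. PipeDream-2BW achieves high throughput and low memory footprint through two key contributions. First, the authors propose double-buffered weight updates (2BW), a technique that reduces training memory footprint while avoiding pipeline flushes.

The authors exploit the fact that the gradient generated by each input does not need to be applied to the weights immediately; it can instead be accumulated into a "coalesced" gradient, which bounds the number of weight versions that must be kept. Rather than flushing the pipeline before using freshly updated weights, 2BW uses the new weights for inputs newly entering the pipeline while using the previous weight version (called the shadow version) for inputs that are already in flight.

Double-buffering the weights on each worker yields a pipelining scheme with higher throughput than GPipe (no pipeline flushes) and better memory efficiency than PipeDream (2 weight versions here, versus d weight versions in the worst case for a depth-d PipeDream pipeline).

The authors also introduce a variant of 2BW, called PipeDream-Flush, that trades some throughput for an even lower memory footprint and vanilla weight update semantics.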

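2.2 Background

In this section the authors briefly survey techniques for distributed training of DNN models.

Data parallelism.
- Data parallelism is used to scale out model training. With data parallelism (Xing et al., 2015), every worker holds a copy of the entire model and the input dataset is sharded across workers. Workers periodically aggregate their gradients so that all of them see a consistent version of the weights. Data parallelism cannot train a large model that does not fit on a single worker, but it can be applied to smaller model partitions.
- Data-parallel scale-out usually works well but has two limits: a) beyond a certain point the per-GPU batch size becomes too small, which lowers GPU utilization and raises communication cost; b) the maximum number of devices that can be used equals the batch size, which caps the number of accelerators available for training. Various model-parallel techniques were proposed to address these two challenges.
Model parallelism. Large models that do not fit on a single worker are generally trained with model parallelism.
- With model parallelism (Dean et al., 2012; Chilimbi et al., 2014), the model's weight parameters are split across the available workers (for example, the matrix multiplications inside each transformer layer are split across multiple GPUs), and workers exchange intermediate activations and gradients. Inter-layer model parallelism underutilizes resources, because at most one worker is active at any point in time.
- Tensor (intra-layer) model parallelism (Shoeybi et al., 2019) tends to put expensive all-to-all communication on the critical path. Once a tensor-parallel group spans servers, the all-reduce communication it requires has to cross inter-server links, which are slower than the high-bandwidth NVLink available inside a multi-GPU server. In practice this limits the number of model partitions to the number of GPUs in a single server. A high degree of model parallelism also creates many small matrix multiplications (GEMMs), which can lower GPU utilization.
- FlexFlow (Jia et al., 2018) shows how to split a model graph using model and data parallelism, but still suffers from low resource utilization when model parallelism is used.
Pipeline parallelism. To address the shortcomings of model parallelism, recent work such as PipeDream and GPipe proposed pipeline parallelism.
- With pipeline parallelism, multiple inputs (instead of one) are injected into a pipeline made up of inter-layer model partitions. This keeps compute resources better utilized.
- A batch is split into smaller microbatches, and execution is pipelined across these microbatches. Layers can be assigned to workers in various ways, and the forward and backward passes of the inputs can follow various schedules.
- However, naive pipelining can cause a weight version mismatch between the forward and backward passes of a given input. Specifically, if weight updates are applied immediately with the latest weight version, an input may see weights in its backward pass that have been updated since its forward pass, which yields incorrect gradients.
- Layer assignment and scheduling strategy lead to different performance trade-offs. Whatever the schedule, preserving strict optimizer semantics requires synchronizing optimizer steps across devices, so the pipeline is flushed at the end of every batch: in-flight microbatches are allowed to finish and no new microbatches are injected. Depending on the number of microbatches injected into the pipeline, up to 50% of the time can be spent flushing. The larger the ratio of microbatches to pipeline size, the smaller the share of time spent in the flush. High efficiency therefore usually needs a larger batch size.

Users can train their large models with a variety of techniques, each with different trade-offs, and these techniques can be combined. Combining them, however, leads to non-trivial interactions. Getting good performance requires careful reasoning in order to maximize large-model training throughput for a given batch size while preserving strict optimizer semantics.

Achieving throughput at scale requires innovation and careful engineering along several axes: efficient kernel implementations that make most of the training compute-bound rather than memory-bound; smart partitioning of the computation graph across devices to reduce the number of bytes sent over network links while limiting device idle time; and domain-specific communication optimizations together with fast hardware (state-of-the-art GPUs and high-bandwidth links between GPUs on the same and on different servers).

The authors also study, both empirically and analytically, how the various components that affect throughput interact. Based on these studies they offer the following guidelines for configuring distributed training:

- Different forms of parallelism interact in non-trivial ways: the parallelization strategy affects the amount of communication, the compute efficiency of the kernels that are executed, and the idle time workers spend waiting because of pipeline flushes (pipeline bubbles). For example, a suboptimal combination of tensor and pipeline model parallelism can lead to up to 2× lower throughput, even with high-bandwidth network links between servers. Tensor model parallelism is effective within a multi-GPU server, but pipeline model parallelism must be used for larger models.
- The schedule used for pipeline parallelism affects the amount of communication, the pipeline bubble size, and the memory used to store activations.
- The values of hyperparameters such as the microbatch size affect memory footprint, the arithmetic efficiency of the kernels executed on the workers, and the pipeline bubble size. The best microbatch size is problem-dependent, and a good choice can raise throughput by 15%.
- Distributed training is communication-intensive. Slower inter-node interconnects or more communication-intensive partitionings will hinder scaling.
- The paper does not explore how to search the space of parallelization strategies automatically (as FlexFlow, PipeDream, Tarnawski et al., and DAPPLE do); it instead recommends heuristics that work well in practice.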

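2.3 The pipeline weight problem

Here we review the weight problem in pipelining. The figure below shows naive pipelined execution, which is essentially a form of async SGD.

2.3.1 Problem 1

The first problem: normally, the second iteration should be computed on the model as updated by the first iteration. But as the figure below shows, on machine 1, when the second iteration starts (dark blue cell 2, circled in red), the backward pass of the first iteration (light green cell 1) has not even started.

2.3.2 Problem 2

The second problem: when machine 2 runs the forward pass of the 5th mini-batch (blue 5 in the second row), it computes with weights that have been updated twice (two green cells precede blue 5 in the second row, meaning two weight updates).

But when it runs the backward pass of the 5th mini-batch (light green 5 in the second row), the weights have been updated 4 times (light green 1, 2, 3, and 4 earlier in the second row each update the weights once). This violates the assumption of single-node deep learning and degrades training quality.

To solve these problems, the PipeDream authors proposed Weight Stashing, which ensures that the forward and backward passes of the same input use the same weight version (Figure 1b in the original paper). Concretely, each machine keeps several versions of the weights; whichever version the forward pass used, the backward pass uses the same one.

In the figure above, machine 1 must store 4 weight versions, machine 2 must store 3, machine 3 must store 2, and machine 4 must store 1. In the worst case the number of stored weight versions is d, where d is the pipeline depth, which is too much memory for large models. With PipeDream's default weight update semantics, the weight updates at each stage have a different delay term, and no accumulation is performed inside the pipeline.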
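The per-machine counts above follow a simple pattern. The snippet below is a minimal sketch of that pattern, assuming a 1-based stage index and the worst case described in the text; the function name stashed_weight_versions is hypothetical and does not come from PipeDream.

```python
def stashed_weight_versions(stage: int, depth: int) -> int:
    """Worst-case number of weight versions that stage `stage` (1-based)
    keeps under weight stashing in a pipeline with `depth` stages."""
    if not 1 <= stage <= depth:
        raise ValueError("stage must be in [1, depth]")
    # The first stage has the most in-flight inputs, the last stage the fewest.
    return depth - stage + 1


# Four machines, as in the figure discussed above.
versions = [stashed_weight_versions(s, 4) for s in range(1, 5)]
print(versions)
```

For a depth of 4 this reproduces the counts 4, 3, 2, 1 quoted in the text, and the first stage always needs d versions.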

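2.3.3 Problem 3

Another problem: during the forward pass, the machines compute with weights that have been updated a different number of times. Take the 5th mini-batch (dark blue 5): when machine 1 computes 5, its weights have been updated once (one green cell precedes it), but when machine 2 computes 5, its weights have been updated twice (two green cells precede it).

The fix: in each forward pass, every machine computes with the least-updated weights. Machine 2 ignores the update from green 2, and machine 3 ignores the updates from green 2 and 3; they all use the weights as updated once by green 1 (the yellow 1 in the rectangle in the figure).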

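2.4 PipeDream-2BW system design

PipeDream-2BW uses memory-efficient pipeline parallelism to train large models that do not fit on a single accelerator. Its double-buffered weight update (2BW) and flush mechanisms ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. PipeDream-2BW splits the model into stages over multiple workers and replicates every stage the same number of times (with data-parallel updates across the replicas of a stage). Such parallel pipelines suit models in which each layer is repeated a fixed number of times (for example, transformer models).

2.4.1 GPipe

GPipe maintains a single version of the model weights. The input batch is split into smaller microbatches. Weight gradients are accumulated rather than applied immediately, and the pipeline is flushed periodically so that multiple weight versions never need to be kept. GPipe provides weight update semantics similar to data parallelism. Figure 1a in the original paper shows the timeline of GPipe execution. Periodic pipeline flushes can be expensive, which limits throughput. One way to reduce this overhead is to accumulate more inside the pipeline, but that is not always practical: a) at large scale factors the minimum supported batch size is large (proportional to the scale factor), and large batches can hurt model convergence; b) GPipe needs to keep activation stashes proportional to the batch size.

2.4.2 Double-Buffered Weight Updates (2BW)

PipeDream-2BW combines a novel double-buffered weight update (2BW) scheme with 1F1B scheduling (Narayanan et al., 2019), in which each worker alternates between the forward and backward passes of different inputs, to ensure that the forward and backward passes of a given input use the same weight version (Figure 2 in the paper). 2BW has a lower memory footprint than PipeDream and GPipe and avoids GPipe's expensive pipeline flushes.

Gradients are computed at the granularity of small microbatches. For any input microbatch, PipeDream-2BW uses the same weight version for the forward and backward passes of that input. Updates are accumulated over multiple microbatches before being applied at batch granularity, which bounds the number of weight versions that are generated and maintained. Figure 2 shows an example 2BW timeline.

PipeDream-2BW generates a new weight version every m microbatches (m ≥ d, where d is the pipeline depth). For simplicity the authors first assume m = d (d = 4 in Figure 2). A new weight version cannot be used right away. In particular, in-flight inputs cannot use the latest weight version for their backward passes (for example, input 7 on worker 3 at t = 21), because the forward passes of these inputs were already started with an older weight version at other stages.

Newly generated weight versions therefore have to be buffered for future use. The total number of weight versions that must be maintained is still at most 2, because the version used to generate the new version can be discarded immediately (no future input passing through the stage uses the old version). In Figure 2, for example, every worker can discard W(0) after finishing the backward pass of input 8, because the forward and backward passes of all later inputs use a higher weight version.

The weight version used by a given input microbatch k (1-based index) is \(\max(\lfloor (k - 1)/m \rfloor - 1, 0)\), where m is the number of microbatches in a batch (4 in Figure 2). This version is the same for the forward and backward passes of input k. m can be any number ≥ d; extra gradient accumulation (a larger m) increases the global batch size.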
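The version formula is easy to check numerically. Below is a minimal sketch that evaluates it directly; the function name weight_version is hypothetical, and the version numbers count batch-level updates (0 is the initial weights), which is an assumption about how to read the formula.

```python
def weight_version(k: int, m: int) -> int:
    """Weight version used by microbatch k (1-based) when a new version
    is generated every m microbatches: max(floor((k - 1) / m) - 1, 0)."""
    if k < 1 or m < 1:
        raise ValueError("k and m must be positive")
    return max((k - 1) // m - 1, 0)


m = 4  # microbatches per batch, as in Figure 2
versions = {k: weight_version(k, m) for k in range(1, 14)}
for k, v in versions.items():
    print(f"microbatch {k:2d} -> weight version {v}")
```

With m = 4, microbatches 1 through 8 all use version 0, and microbatch 9 is the first to use version 1. In other words, the weights produced by the first batch only take effect one full batch later, which is the one-batch staleness that double buffering introduces.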

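Figure 2 in the paper. Timeline of PipeDream-2BW's double-buffered weight update (2BW) scheme, with time along the x-axis. Without loss of generality, the backward pass is assumed to take twice as long as the forward pass. PipeDream-2BW stores only two weight versions on each worker, which reduces total memory footprint while no longer needing expensive pipeline stalls. \(W_i^{(v)}\) denotes the weights on worker i with version v (containing the weight gradients generated from input v). New weight versions are generated in the checkered green boxes; \(W_4^{(4)}\) is first used in the forward pass of input 9.

In the figure above, Before refers to the system's two weight buffers before a version is discarded, and After refers to the two weight buffers after the discard.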

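2.4.3 Weight Updates with Flushes (PipeDream-Flush)

The authors also propose a second memory-efficient pipeline schedule, called PipeDream-Flush. It has a lower memory footprint than 2BW and vanilla optimizer semantics, at the cost of lower throughput. The schedule reuses the 1F1B schedule of PipeDream (Narayanan et al., 2019) but keeps a single weight version and introduces periodic pipeline flushes to guarantee a consistent weight version across weight updates. Figure 3 shows the timelines of PipeDream-Flush and GPipe with two pipeline stages.

Why choose 1F1B? Because it cuts the number of in-flight microbatches down to the pipeline depth d instead of GPipe's microbatch count m, so 1F1B is memory-efficient. To keep bubble time low, we generally have m >> d.

Memory footprint. With PipeDream-Flush, the total number of in-flight "active" input activations is at most the pipeline depth, which gives it a lower memory footprint than GPipe. GPipe has to keep input activations proportional to the number of microbatches over which gradients are averaged (m). PipeDream-Flush also has a lower memory footprint than PipeDream-2BW, because it maintains only one weight version (whereas PipeDream-2BW maintains 2).

Semantics. Periodic pipeline flushes ensure that weight updates are performed with gradients computed using the latest weight version. Weight updates then take the form \(W^{(t+1)} = W^{(t)} - \nu \cdot \nabla f(W^{(t)})\).

Figure 3 in the paper. Timelines of GPipe and PipeDream-Flush with two pipeline stages. Both GPipe and PipeDream-Flush use pipeline flushes; PipeDream-Flush alternates forward and backward passes in steady state, keeping memory footprint low by stashing activations only for in-flight microbatches.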

0x03 PipeDream-Flush implementation

As mentioned earlier, when virtual_pipeline_model_parallel_size is not set, the pipeline schedule is obtained with the flush method. The function is forward_backward_pipelining_without_interleaving.

def get_forward_backward_func():
    args = get_args()
    if mpu.get_pipeline_model_parallel_world_size() > 1:
        if args.virtual_pipeline_model_parallel_size is not None:
            forward_backward_func = forward_backward_pipelining_with_interleaving
        else:
            # Flush schedule
            forward_backward_func = forward_backward_pipelining_without_interleaving
    else:
        forward_backward_func = forward_backward_no_pipelining
    return forward_backward_func

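Why choose 1F1B? As the paper's authors note, it cuts the number of in-flight microbatches down to the pipeline depth d instead of GPipe's microbatch count m, so 1F1B is memory-efficient. To keep bubble time low, we generally have m >> d.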

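3.1 Overall approach

3.1.1 Default schedule

GPipe proposed a schedule that first runs the forward passes of all microbatches in a batch and then the backward passes of all microbatches (see Figure 3). We can quantify the size of GPipe's pipeline bubble, \(t_{pb}\). Let \(m\) be the number of microbatches in a batch, \(p\) the number of pipeline stages (the number of devices used for pipeline parallelism), \(t_{id}\) the ideal time per iteration (assuming perfect or ideal scaling), and \(t_f\) and \(t_b\) the time to run the forward and backward pass of a single microbatch.

In this schedule the pipeline bubble consists of:

\(p - 1\) forward passes at the start of the batch.
\(p - 1\) backward passes at the end of the batch.

The total time spent in the pipeline bubble is \(t_{pb} = (p - 1) \cdot (t_f + t_b)\), and the ideal processing time for the batch is \(t_{id} = m \cdot (t_f + t_b)\). The ideal fraction of compute time spent in the pipeline bubble is therefore:

\[\text{Bubble time fraction (pipeline bubble size)} = \frac{t_{pb}}{t_{id}} = \frac{p-1}{m}\]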
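To get a feel for the formula, the sketch below computes the bubble time and the bubble fraction for a few configurations. The function names and the example values of p, m, t_f, and t_b are illustrative assumptions, not numbers from the paper.

```python
import math


def bubble_time(p: int, t_f: float, t_b: float) -> float:
    """t_pb = (p - 1) * (t_f + t_b)."""
    return (p - 1) * (t_f + t_b)


def ideal_time(m: int, t_f: float, t_b: float) -> float:
    """t_id = m * (t_f + t_b)."""
    return m * (t_f + t_b)


def bubble_fraction(p: int, m: int) -> float:
    """t_pb / t_id = (p - 1) / m; the per-microbatch times cancel out."""
    return (p - 1) / m


# Illustrative values: 4 stages, backward pass twice as long as forward.
p, t_f, t_b = 4, 1.0, 2.0
for m in (4, 8, 32, 128):
    ratio = bubble_time(p, t_f, t_b) / ideal_time(m, t_f, t_b)
    assert math.isclose(ratio, bubble_fraction(p, m))
    print(f"p={p}, m={m}: bubble fraction = {bubble_fraction(p, m):.3f}")
```

The fraction depends only on p and m, so the only way to shrink the bubble for a fixed pipeline depth is to push more microbatches through per batch.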

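Figure 3: GPipe pipeline schedule, with the forward passes (blue) of all microbatches (shown by numbers) followed by the backward passes (green). Gray areas are the pipeline bubble. For simplicity we assume the backward pass takes twice as long as the forward pass; the efficiency of the pipeline schedule does not depend on this factor. Each batch in this example consists of 8 microbatches, and the number in each blue or green box is the unique identifier of the corresponding microbatch (the first batch consists of microbatches 1-8, the second of microbatches 9-16, and so on). The optimizer steps and updates the weight parameters at the pipeline flush to ensure strict optimizer semantics.

To keep the bubble time fraction small we need \(m \gg p\). But for such a large \(m\) this approach has a high memory footprint, because the intermediate activations (or, with activation recomputation, just the input activations of each pipeline stage) of all \(m\) microbatches must be kept in memory for the whole lifetime of a training iteration.

3.1.2 PipeDream-Flush schedule

PipeDream-Flush splits an iteration into three phases:

Warm-up forward passes: every worker except the last stage runs a different number of forward passes and sends activations downstream, until the last stage is triggered. The schedule limits the number of in-flight microbatches (microbatches whose backward pass has not completed and whose activations must be kept) to the pipeline depth rather than the number of microbatches in a batch.
Run 1F1B in steady state: once in steady state, every worker performs 1F1B, one forward pass followed by one backward pass.
Cooldown backward passes: this phase finishes the in-flight microbatches; it only runs backward computation and sends gradients to the next stage in the backward direction (the forward-pass upstream).

This new schedule spends the same time in the bubble as GPipe, but the number of outstanding forward passes is at most the number of pipeline stages. The schedule therefore only needs to stash activations for \(p\) or fewer microbatches (versus \(m\) microbatches for the GPipe schedule). So when \(m \gg p\), PipeDream-Flush is much more memory-efficient than GPipe.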
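The memory claim can be checked by counting stashed activations. The sketch below is a simplified model rather than Megatron code: it assumes the warm-up count min(p - rank - 1, m) used by the function analyzed in this article, treats each stashed microbatch as one unit of activation memory, and uses hypothetical function names.

```python
def peak_stashed_1f1b(rank: int, p: int, m: int) -> int:
    """Peak number of microbatches whose activations are stashed on
    pipeline stage `rank` (0-based) under the flush 1F1B schedule."""
    warmup = min(p - rank - 1, m)
    remaining = m - warmup
    stashed = peak = 0
    for _ in range(warmup):      # warm-up: forward passes only
        stashed += 1
        peak = max(peak, stashed)
    for _ in range(remaining):   # steady state: one forward, one backward
        stashed += 1
        peak = max(peak, stashed)
        stashed -= 1
    for _ in range(warmup):      # cooldown: backward passes only
        stashed -= 1
    assert stashed == 0          # the pipeline is fully drained (flushed)
    return peak


def peak_stashed_gpipe(m: int) -> int:
    """GPipe runs all m forward passes before any backward pass."""
    return m


p, m = 4, 8
peaks = [peak_stashed_1f1b(r, p, m) for r in range(p)]
print("1F1B peaks per stage:", peaks)
print("GPipe peak per stage:", peak_stashed_gpipe(m))
```

Under these assumptions the first stage stashes at most p microbatches and later stages fewer, while GPipe stashes all m on every stage.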

We first give the full code below and analyze it step by step afterwards.

def forward_backward_pipelining_without_interleaving(forward_step_func, data_iterator,
                                                     model, optimizer, timers,
                                                     forward_only):
    """Run non-interleaved 1F1B schedule, with communication between pipeline
    stages.

    Returns dictionary with losses if the last stage, empty dict otherwise."""
    timers = get_timers()

    assert len(model) == 1
    model = model[0]

    # Compute number of warmup microbatches.
    num_microbatches = get_num_microbatches()
    num_warmup_microbatches = \
        (mpu.get_pipeline_model_parallel_world_size() -
         mpu.get_pipeline_model_parallel_rank() - 1)
    num_warmup_microbatches = min(
        num_warmup_microbatches,
        num_microbatches)
    num_microbatches_remaining = \
        num_microbatches - num_warmup_microbatches

    unwrapped_model = unwrap_model(
        model, (torchDDP, LocalDDP, Float16Module))
    model_type = unwrapped_model.model_type
    rank = mpu.get_pipeline_model_parallel_rank()
    recv_tensor_shapes = get_tensor_shapes(rank-1, model_type)
    send_tensor_shapes = get_tensor_shapes(rank, model_type)

    # Input, output tensors only need to be saved when doing backward passes
    input_tensors = None
    output_tensors = None
    if not forward_only:
        input_tensors = []
        output_tensors = []
    losses_reduced = []

    # Run warmup forward passes.
    for i in range(num_warmup_microbatches):
        input_tensor = recv_forward(recv_tensor_shapes, timers=timers)
        output_tensor = forward_step(forward_step_func, data_iterator, model,
                                     input_tensor, losses_reduced)
        send_forward(output_tensor, send_tensor_shapes, timers=timers)

        if not forward_only:
            input_tensors.append(input_tensor)
            output_tensors.append(output_tensor)

    # Before running 1F1B, need to receive first forward tensor.
    # If all microbatches are run in warmup / cooldown phase, then no need to
    # receive this tensor here.
    if num_microbatches_remaining > 0:
        input_tensor = recv_forward(recv_tensor_shapes, timers=timers)

    # Run 1F1B in steady state.
    for i in range(num_microbatches_remaining):
        last_iteration = (i == (num_microbatches_remaining - 1))

        output_tensor = forward_step(forward_step_func, data_iterator, model,
                                     input_tensor, losses_reduced)
        if forward_only:
            send_forward(output_tensor, send_tensor_shapes, timers=timers)

            if not last_iteration:
                input_tensor = recv_forward(recv_tensor_shapes, timers=timers)

        else:
            output_tensor_grad = \
                send_forward_recv_backward(output_tensor,
                                           send_tensor_shapes,
                                           timers=timers)

            # Add input_tensor and output_tensor to end of list.
            input_tensors.append(input_tensor)
            output_tensors.append(output_tensor)

            # Pop input_tensor and output_tensor from the start of the list for
            # the backward pass.
            input_tensor = input_tensors.pop(0)
            output_tensor = output_tensors.pop(0)

            input_tensor_grad = \
                backward_step(optimizer, input_tensor, output_tensor,
                              output_tensor_grad)

            if last_iteration:
                input_tensor = None
                send_backward(input_tensor_grad, recv_tensor_shapes, timers=timers)
            else:
                input_tensor = \
                    send_backward_recv_forward(
                        input_tensor_grad, recv_tensor_shapes, timers=timers)

    # Run cooldown backward passes.
    if not forward_only:
        for i in range(num_warmup_microbatches):
            input_tensor = input_tensors.pop(0)
            output_tensor = output_tensors.pop(0)

            output_tensor_grad = recv_backward(send_tensor_shapes, timers=timers)

            input_tensor_grad = \
                backward_step(optimizer, input_tensor, output_tensor,
                              output_tensor_grad)

            send_backward(input_tensor_grad, recv_tensor_shapes, timers=timers)

    return losses_reduced

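3.2 Startup phase

Every worker runs this phase, each with its own rank. The logic is:

First determine how many microbatches this worker runs in the warm-up phase: min((world-size - rank - 1), num_microbatches). Ranks increase along the pipeline, so the warm-up count decreases stage by stage until it reaches 0, at which point the worker enters the steady state directly. For example, with world size 5, ranks 0-4, and 4 microbatches, the warm-up counts from the first stage to the last are 5 - 0 - 1 = 4, 5 - 1 - 1 = 3, 5 - 2 - 1 = 2, 5 - 3 - 1 = 1, and 5 - 4 - 1 = 0 (straight into the steady state).
Next compute the number of microbatches to run in the steady state.
When a backward pass is needed, create two FIFO queues: input_tensors stores the activations received from upstream, and output_tensors stores the locally computed activations that are sent downstream.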
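The split between warm-up and steady-state microbatches can be reproduced without any distributed setup. The following is a minimal sketch that mirrors the arithmetic described above; split_microbatches is a hypothetical helper, not a Megatron function.

```python
def split_microbatches(world_size: int, rank: int, num_microbatches: int):
    """Return (warm-up count, steady-state count) for one pipeline stage."""
    num_warmup = min(world_size - rank - 1, num_microbatches)
    num_remaining = num_microbatches - num_warmup
    return num_warmup, num_remaining


# The example from the text: world size 5, ranks 0-4, 4 microbatches.
plan = [split_microbatches(5, rank, 4) for rank in range(5)]
for rank, (warmup, remaining) in enumerate(plan):
    print(f"rank {rank}: warm-up={warmup}, steady state={remaining}")
```

In this example rank 0 spends all 4 microbatches in warm-up and cooldown and never enters the steady-state loop, while the last rank does no warm-up at all.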

timers = get_timers()

assert len(model) == 1
model = model[0]

# Compute number of warmup microbatches.
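num_microbatches = get_num_microbatches() # number of microbatches
# Number of microbatches this worker runs during warm-up: min((world-size - rank - 1), num_microbatches).
# Ranks increase along the pipeline, so the warm-up count decreases until it reaches 0 and the worker enters the steady state directly.
# E.g. with world size 5, ranks 0-4, and 4 microbatches, the warm-up counts per stage are 5 - 0 - 1, 5 - 1 - 1, 5 - 2 - 1, 5 - 3 - 1, 5 - 4 - 1.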
num_warmup_microbatches = \
    (mpu.get_pipeline_model_parallel_world_size() -
     mpu.get_pipeline_model_parallel_rank() - 1) 
num_warmup_microbatches = min(
    num_warmup_microbatches,
    num_microbatches) 
# Number of microbatches to run in the steady state
num_microbatches_remaining = \
    num_microbatches - num_warmup_microbatches

unwrapped_model = unwrap_model(
    model, (torchDDP, LocalDDP, Float16Module))
model_type = unwrapped_model.model_type
rank = mpu.get_pipeline_model_parallel_rank()
recv_tensor_shapes = get_tensor_shapes(rank-1, model_type)
send_tensor_shapes = get_tensor_shapes(rank, model_type)

# Input, output tensors only need to be saved when doing backward passes
# When backward passes are needed, create two queues: input_tensors holds activations from upstream, output_tensors holds the local activations sent downstream
input_tensors = None
output_tensors = None
if not forward_only:
    input_tensors = []
    output_tensors = []
losses_reduced = []

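3.3 Warm-up phase

The warm-up phase processes the microbatches assigned to this worker for warm-up, one after another:

Receive the input activation from upstream.
Run the forward computation locally; the activation received from upstream is the input of this stage.
Send the local activation downstream.
If a backward pass is needed, each worker saves the upstream activation in input_tensors and the activation sent downstream in output_tensors.
Earlier stages run as many forward passes as possible so that later stages can start 1F1B immediately.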

# Run warmup forward passes.
for i in range(num_warmup_microbatches):
    # Receive the input activation from upstream
    input_tensor = recv_forward(recv_tensor_shapes, timers=timers)
    # Run the local forward computation; the upstream activation is this stage's input
    output_tensor = forward_step(forward_step_func, data_iterator, model,
                                 input_tensor, losses_reduced)
    # Send the local activation downstream
    send_forward(output_tensor, send_tensor_shapes, timers=timers)

    if not forward_only:
        input_tensors.append(input_tensor) # save the upstream activation
        output_tensors.append(output_tensor) # save the locally computed activation, i.e. the one sent downstream

The first stage has no upstream, so recv_forward yields None as its input; in every other case it yields an activation from upstream.

def recv_forward(tensor_shapes, timers):
    input_tensors = []
    for tensor_shape in tensor_shapes:
        if tensor_shape is None:
            input_tensors.append(None)
        else:
            input_tensors.append(p2p_communication.recv_forward(tensor_shape,
                                                                timers=timers))
    return input_tensors

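3.4 Communication module

3.4.1 Basic communication method

Pipeline parallelism needs inter-stage P2P communication. It is mainly implemented by the _communicate function, which wraps PyTorch's basic communication primitives and gives pipeline parallelism bidirectional P2P communication between stages. A number of API methods are built on top of it. The function's docstring is well written and explains things clearly. One point to note: how does each stage know its upstream and downstream ranks in the pipeline? Through calls such as mpu.get_pipeline_model_parallel_next_rank().

The code of _communicate is: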

def _communicate(tensor_send_next, tensor_send_prev, recv_prev, recv_next,
                 tensor_shape,
                 use_ring_exchange=False,
                 dtype_=None):
    """Communicate tensors between stages. Used as helper method in other
    communication methods that are used in megatron/schedules.py.

    Takes the following arguments:
        tensor_send_next: tensor to send to next rank (no tensor sent if
                          set to None).
        tensor_send_prev: tensor to send to prev rank (no tensor sent if
                          set to None).
        recv_prev: boolean for whether tensor should be received from
                   previous rank.
        recv_next: boolean for whether tensor should be received from
                   next rank.
        tensor_shape: shape of tensor to receive (this method assumes that all
                      tensors sent and received in a single function call are
                      the same shape).
        use_ring_exchange: boolean for whether torch.distributed.ring_exchange()
                           API should be used.
        dtype_: optional, this is used when the tensor that needs to be
                communicated is different from args.params_dtype.
    Returns:
        (tensor_recv_prev, tensor_recv_next)
    """
    args = get_args()

    # Create placeholder tensors for receive in forward and backward directions
    # if needed.
    tensor_recv_prev = None
    tensor_recv_next = None

    # Some legacy inference code doesn't set the tensor shape, do so now
    # for the normal values for gpt/bert. This could be removed if inference
    # code is changed to provide tensor_shape.
    if tensor_shape is None:
        tensor_shape = (args.seq_length, args.micro_batch_size, args.hidden_size)

    override_scatter_gather_tensors_in_pipeline = False
    if args.scatter_gather_tensors_in_pipeline:
        tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
        if tensor_chunk_shape % mpu.get_tensor_model_parallel_world_size() == 0:
            tensor_chunk_shape = tensor_chunk_shape // \
                mpu.get_tensor_model_parallel_world_size()
        else:
            tensor_chunk_shape = tensor_shape
            override_scatter_gather_tensors_in_pipeline = True
    else:
        tensor_chunk_shape = tensor_shape
    dtype = args.params_dtype
    if args.fp32_residual_connection:
        dtype = torch.float

    requires_grad = True
    if dtype_ is not None:
        dtype = dtype_
        requires_grad = False

    # If a tensor is to be received, allocate an empty tensor first; the received data is written into it
    if recv_prev:
        tensor_recv_prev = torch.empty(tensor_chunk_shape,
                                       requires_grad=requires_grad,
                                       device=torch.cuda.current_device(),
                                       dtype=dtype)
    if recv_next:
        tensor_recv_next = torch.empty(tensor_chunk_shape,
                                       requires_grad=requires_grad,
                                       device=torch.cuda.current_device(),
                                       dtype=dtype)

    # Split tensor into smaller chunks if using scatter-gather optimization.
    if not override_scatter_gather_tensors_in_pipeline and \
            args.scatter_gather_tensors_in_pipeline:
        if tensor_send_next is not None:
            tensor_send_next = mpu.split_tensor_into_1d_equal_chunks(tensor_send_next)

        if tensor_send_prev is not None:
            tensor_send_prev = mpu.split_tensor_into_1d_equal_chunks(tensor_send_prev)

    # Send tensors in both the forward and backward directions as appropriate.
    if use_ring_exchange:
        # Use ring exchange if requested; this requires a PyTorch build that provides torch.distributed.ring_exchange
        torch.distributed.ring_exchange(tensor_send_prev=tensor_send_prev,
                                        tensor_recv_prev=tensor_recv_prev,
                                        tensor_send_next=tensor_send_next,
                                        tensor_recv_next=tensor_recv_next,
                                        group=mpu.get_pipeline_model_parallel_group())
    else:
        # Build a torch.distributed.P2POp for each peer rank and collect them in a list
        ops = []
        if tensor_send_prev is not None:
            send_prev_op = torch.distributed.P2POp(
                torch.distributed.isend, tensor_send_prev,
                mpu.get_pipeline_model_parallel_prev_rank())
            ops.append(send_prev_op)
        if tensor_recv_prev is not None:
            recv_prev_op = torch.distributed.P2POp(
                torch.distributed.irecv, tensor_recv_prev,
                mpu.get_pipeline_model_parallel_prev_rank())
            ops.append(recv_prev_op)
        if tensor_send_next is not None:
            send_next_op = torch.distributed.P2POp(
                torch.distributed.isend, tensor_send_next,
                mpu.get_pipeline_model_parallel_next_rank())
            ops.append(send_next_op)
        if tensor_recv_next is not None:
            recv_next_op = torch.distributed.P2POp(
                torch.distributed.irecv, tensor_recv_next,
                mpu.get_pipeline_model_parallel_next_rank())
            ops.append(recv_next_op)
         
        # Then issue the batched asynchronous send/recv
        if len(ops) > 0:
            reqs = torch.distributed.batch_isend_irecv(ops)
            for req in reqs:
                req.wait() # wait() blocks until the request completes
                
    # To protect against race condition when using batch_isend_irecv().
    torch.cuda.synchronize()

    # If using scatter-gather optimization, gather smaller chunks.
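    # Special optimization described in the 2021 paper: presumably the tensor is identical across tensor-parallel ranks after the all-reduce,
    # so each rank can send only its chunk and the receiver gathers the chunks back into the full tensor. See the paper and code for details.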
    if not override_scatter_gather_tensors_in_pipeline and \
            args.scatter_gather_tensors_in_pipeline:
        if recv_prev:
            tensor_recv_prev = mpu.gather_split_1d_tensor(
                tensor_recv_prev).view(tensor_shape).requires_grad_()

        if recv_next:
            tensor_recv_next = mpu.gather_split_1d_tensor(
                tensor_recv_next).view(tensor_shape).requires_grad_()

    return tensor_recv_prev, tensor_recv_next

3.4.2 API

Many API functions are built on top of _communicate; they mainly differ in the arguments they pass. For example:

def send_backward_recv_forward(input_tensor_grad, tensor_shape=None, timers=None):
    """Batched send and recv with previous rank in pipeline."""
    if mpu.is_pipeline_first_stage():
        input_tensor = None
    else:
        input_tensor, _ = _communicate(
            tensor_send_next=None,
            tensor_send_prev=input_tensor_grad,
            recv_prev=True,
            recv_next=False,
            tensor_shape=tensor_shape)
    return input_tensor

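3.4.3 Pipeline upstream and downstream

The following functions determine the upstream and downstream ranks in the pipeline. From the previous article we know that if this process is rank 2, its pipeline process group is [g2, g6, g10, g14], so its downstream is rank 6.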

def get_pipeline_model_parallel_first_rank():
    return _PIPELINE_GLOBAL_RANKS[0]

def get_pipeline_model_parallel_last_rank():
    last_rank_local = get_pipeline_model_parallel_world_size() - 1
    return _PIPELINE_GLOBAL_RANKS[last_rank_local]

def get_pipeline_model_parallel_next_rank():
    rank_in_pipeline = get_pipeline_model_parallel_rank()
    world_size = get_pipeline_model_parallel_world_size()
    return _PIPELINE_GLOBAL_RANKS[(rank_in_pipeline + 1) % world_size]

def get_pipeline_model_parallel_prev_rank():
    rank_in_pipeline = get_pipeline_model_parallel_rank()
    world_size = get_pipeline_model_parallel_world_size()
    return _PIPELINE_GLOBAL_RANKS[(rank_in_pipeline - 1) % world_size]
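
The modular indexing in these functions can be tried out in isolation. The sketch below is a standalone illustration using the example group [2, 6, 10, 14] from the text; next_rank and prev_rank are hypothetical stand-ins for the mpu functions and take the rank list as an explicit argument.

```python
PIPELINE_GLOBAL_RANKS = [2, 6, 10, 14]  # example pipeline group from the text


def next_rank(global_rank: int, ranks: list) -> int:
    """Global rank of the downstream stage, wrapping around at the end."""
    local = ranks.index(global_rank)
    return ranks[(local + 1) % len(ranks)]


def prev_rank(global_rank: int, ranks: list) -> int:
    """Global rank of the upstream stage, wrapping around at the start."""
    local = ranks.index(global_rank)
    # Python's % returns a non-negative result, so -1 % 4 == 3.
    return ranks[(local - 1) % len(ranks)]


for r in PIPELINE_GLOBAL_RANKS:
    print(f"rank {r}: prev={prev_rank(r, PIPELINE_GLOBAL_RANKS)}, "
          f"next={next_rank(r, PIPELINE_GLOBAL_RANKS)}")
```

Note that the lookup forms a ring: the last stage's "next" is the first stage and the first stage's "prev" is the last. The API wrappers shown earlier guard against using these wrap-around neighbors with checks such as mpu.is_pipeline_first_stage().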

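3.5 Steady state

The overall logic of the steady state is: forward computation -> send the activation downstream & receive the gradient from downstream -> backward computation -> send this worker's gradient upstream & receive the next activation from upstream.

3.5.1 Logic

The detailed steps are:

forward_step: take one microbatch (the upstream activation) and run the local forward computation.
send_forward:
1. If only forward passes are being run, call send_forward to send the local result downstream.
2. Otherwise call send_forward_recv_backward: send the local result downstream, then receive its gradient from downstream.
Each worker saves the upstream activation in input_tensors and the activation sent downstream in output_tensors.
backward_step: local backward computation.
1. Pop the first (that is, oldest) unprocessed upstream activation from the queue.
2. Pop the matching local activation from the queue.
3. Run the backward computation on the oldest unprocessed microbatch using (upstream activation, local activation, downstream gradient) to obtain the local gradient.
send_backward:
1. If this is the last microbatch, just pass the local gradient input_tensor_grad to the forward-pass upstream.
2. Otherwise call send_backward_recv_forward to pass the local gradient input_tensor_grad to the forward-pass upstream and also receive another activation from upstream.
Go back to forward_step and process the next microbatch (upstream activation).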

# Before running 1F1B, need to receive first forward tensor.
# If all microbatches are run in warmup / cooldown phase, then no need to
# receive this tensor here.
if num_microbatches_remaining > 0:
    # Entering the steady state, so first fetch the activation from the previous stage
    input_tensor = recv_forward(recv_tensor_shapes, timers=timers)

# Run 1F1B in steady state.
for i in range(num_microbatches_remaining):
    last_iteration = (i == (num_microbatches_remaining - 1))

    # Forward computation
    output_tensor = forward_step(forward_step_func, data_iterator, model,
                                 input_tensor, losses_reduced)
    if forward_only:
        send_forward(output_tensor, send_tensor_shapes, timers=timers)

        if not last_iteration:
            input_tensor = recv_forward(recv_tensor_shapes, timers=timers)

    else:
        # Send the intermediate activation downstream and receive its gradient from downstream
        output_tensor_grad = \
            send_forward_recv_backward(output_tensor,
                                       send_tensor_shapes,
                                       timers=timers)
          
        # Add input_tensor and output_tensor to end of list.
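        input_tensors.append(input_tensor) # enqueue the upstream activation
        output_tensors.append(output_tensor) # enqueue the locally computed activation, i.e. the one sent downstream

        # Pop input_tensor and output_tensor from the start of the list for
        # the backward pass.
        input_tensor = input_tensors.pop(0) # pop the first (oldest) unprocessed upstream activation
        output_tensor = output_tensors.pop(0) # pop the matching local activation

        # Backward computation on the oldest unprocessed microbatch, using (upstream activation, local activation, downstream gradient), to obtain the local gradient
        input_tensor_grad = \
            backward_step(optimizer, input_tensor, output_tensor,
                          output_tensor_grad) # the gradient received from downstream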

        if last_iteration:
            input_tensor = None
            # Last microbatch: just pass the local gradient input_tensor_grad to the forward-pass upstream
            send_backward(input_tensor_grad, recv_tensor_shapes, timers=timers)
        else:
            # Not the last microbatch: pass the local gradient input_tensor_grad upstream and receive another activation from upstream
            input_tensor = \
                send_backward_recv_forward(
                    input_tensor_grad, recv_tensor_shapes, timers=timers)

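3.5.2 Serial execution

The logic of send_forward_recv_backward is clear from its name: it sends the activation to the downstream stage and receives the gradient back from it.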

def send_forward_recv_backward(output_tensors, tensor_shapes, timers):
    if not isinstance(output_tensors, list):
        output_tensors = [output_tensors]
    output_tensor_grads = []
    for (output_tensor, tensor_shape) in zip(output_tensors, tensor_shapes):
        if tensor_shape is None:
            output_tensor_grads.append(None)
            continue
        # Send our activation, then receive the gradient coming back from downstream
        output_tensor_grad = p2p_communication.send_forward_recv_backward(
                output_tensor, tensor_shape, timers=timers)
        output_tensor_grads.append(output_tensor_grad)
    return output_tensor_grads # return the gradients

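As we can see, a single worker proceeds in a blocking fashion: both send and recv block (each request is waited on), so communication and computation are serialized and cannot overlap. Because the warm-up phase has already propagated forward passes from worker 0 all the way to the last worker, the last worker can pick up its input right away, process it, run the backward computation immediately, and return the gradient upstream. So serial execution costs little here. The figure in the paper shows this as well:

Figure: PipeDream-Flush alternates forward and backward passes in steady state, keeping memory footprint low by limiting activation stashing to in-flight microbatches only. From the figure we can see:

Worker 1 executes: 1 FW (warm-up), 2 FW, 1 BW, 3 FW, 2 BW, 4 FW, 3 BW, 4 BW (cooldown).
Worker 2 executes: 1 FW, 1 BW, 2 FW, 2 BW, 3 FW, 3 BW, 4 FW, 4 BW. Worker 2 enters the steady state directly.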
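The two sequences above can be regenerated from the three-phase structure of the schedule. The sketch below is a communication-free model of the loop structure in forward_backward_pipelining_without_interleaving; one_f_one_b_order is a hypothetical helper, ranks are 0-based (so Worker 1 is rank 0), and microbatches are numbered from 1 to match the figure.

```python
def one_f_one_b_order(rank: int, p: int, m: int) -> list:
    """Order of forward (F) and backward (B) passes on one stage under
    the non-interleaved flush schedule, ignoring communication."""
    warmup = min(p - rank - 1, m)
    remaining = m - warmup
    order = []
    fwd = bwd = 1
    for _ in range(warmup):       # warm-up: forward passes only
        order.append(f"F{fwd}")
        fwd += 1
    for _ in range(remaining):    # steady state: one forward, one backward
        order.append(f"F{fwd}")
        fwd += 1
        order.append(f"B{bwd}")
        bwd += 1
    for _ in range(warmup):       # cooldown: remaining backward passes
        order.append(f"B{bwd}")
        bwd += 1
    return order


# Two stages and four microbatches, as in the figure.
worker1 = one_f_one_b_order(rank=0, p=2, m=4)
worker2 = one_f_one_b_order(rank=1, p=2, m=4)
print("Worker 1:", " ".join(worker1))
print("Worker 2:", " ".join(worker2))
```

Every stage runs each microbatch's forward pass exactly once and its backward pass exactly once; only the interleaving differs between stages.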

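3.6 Cooldown phase

The cooldown phase mirrors the warm-up phase: it also runs num_warmup_microbatches steps, but only backward passes. Since this phase cleans up the backward passes that are still outstanding, it only pops from the queues. Concretely, it pops the upstream activation and the activation that was sent downstream, then computes the gradient.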

# Run cooldown backward passes.
if not forward_only:
    for i in range(num_warmup_microbatches):
        input_tensor = input_tensors.pop(0)
        output_tensor = output_tensors.pop(0)

        output_tensor_grad = recv_backward(send_tensor_shapes, timers=timers)
        input_tensor_grad = \
            backward_step(optimizer, input_tensor, output_tensor,
                          output_tensor_grad)

        send_backward(input_tensor_grad, recv_tensor_shapes, timers=timers)

return losses_reduced

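3.7 Where does the flush show up?

We need to look at megatron/training.py, which contains the flow of a single training step. The call update_successful, grad_norm, num_zeros_in_grad = optimizer.step() invokes the optimizer to update the parameters. By that point the two internal activation queues have been fully drained, so the flush is complete at this point in time.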

def train_step(forward_step_func, data_iterator,
               model, optimizer, lr_scheduler):
    """Single training step."""
    args = get_args()
    timers = get_timers()

    # 1. Zero the gradients
    # Set grad to zero.
    if args.DDP_impl == 'local' and args.use_contiguous_buffers_in_local_ddp:
        for partition in model:
            partition.zero_grad_buffer()
    optimizer.zero_grad()

    # 2. Run forward and backward passes; in this article that means calling forward_backward_pipelining_without_interleaving
    forward_backward_func = get_forward_backward_func()
    losses_reduced = forward_backward_func(
        forward_step_func, data_iterator, model,
        optimizer, timers, forward_only=False)

    # At this point the whole pipeline has drained; losses and gradients are all computed
    # Empty unused memory
    if args.empty_unused_memory_level >= 1:
        torch.cuda.empty_cache()

    # 3. Data-parallel all-reduce
    # All-reduce if needed.
    if args.DDP_impl == 'local':
        for model_module in model:
            model_module.allreduce_gradients()

    # All-reduce word_embeddings' grad across first and last stages to ensure
    # that word_embeddings parameters stay in sync.
    # This should only run for models that support pipelined model parallelism
    # (BERT and GPT-2).
    # 4. Embedding all-reduce: the embedding weights are shared between the first and last stages, so an all-reduce keeps them in sync
    if mpu.is_rank_in_embedding_group(ignore_virtual=True) and \
            mpu.get_pipeline_model_parallel_world_size() > 1:
        if mpu.is_pipeline_first_stage(ignore_virtual=True):
            unwrapped_model = model[0]
        elif mpu.is_pipeline_last_stage(ignore_virtual=True):
            unwrapped_model = model[-1]
        else:  # We do not support the interleaved schedule for T5 yet.
            unwrapped_model = model[0]
        unwrapped_model = unwrap_model(
            unwrapped_model, (torchDDP, LocalDDP, Float16Module))

        if unwrapped_model.share_word_embeddings:
            word_embeddings_weight = unwrapped_model.word_embeddings_weight()
            if args.DDP_impl == 'local':
                grad = word_embeddings_weight.main_grad
            else:
                grad = word_embeddings_weight.grad
            torch.distributed.all_reduce(grad, group=mpu.get_embedding_group())

    # Update parameters.
    # 5. Update the parameters; this is where the flush takes effect
    update_successful, grad_norm, num_zeros_in_grad = optimizer.step()

    # Update learning rate.
    if update_successful:
        increment = get_num_microbatches() * \
                    args.micro_batch_size * \
                    args.data_parallel_size
        lr_scheduler.step(increment=increment)
        skipped_iter = 0
    else:
        skipped_iter = 1

    # Empty unused memory
    if args.empty_unused_memory_level >= 2:
        torch.cuda.empty_cache()

    if mpu.is_pipeline_last_stage(ignore_virtual=True):
        # Average loss across microbatches.
        loss_reduced = {}
        for key in losses_reduced[0]:
            losses_reduced_for_key = [x[key] for x in losses_reduced]
            loss_reduced[key] = sum(losses_reduced_for_key) / len(losses_reduced_for_key)
        return loss_reduced, skipped_iter, grad_norm, num_zeros_in_grad
    return {}, skipped_iter, grad_norm, num_zeros_in_grad
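
The ordering in train_step (accumulate gradients over all microbatches with fixed weights, then take one optimizer step) is what gives the flush schedule its vanilla update semantics. The toy example below illustrates this with plain Python numbers; the scalar loss 0.5 * (w - x)^2, the function names, and the data are all illustrative assumptions and have nothing to do with Megatron's actual optimizer.

```python
import math


def grad(w: float, x: float) -> float:
    """Gradient of the toy loss 0.5 * (w - x) ** 2 with respect to w."""
    return w - x


def step_with_flush(w: float, microbatches: list, lr: float) -> float:
    """Accumulate gradients over all microbatches at the SAME weights,
    then apply a single update, as happens after a pipeline flush."""
    accumulated = 0.0
    num_samples = 0
    for microbatch in microbatches:
        for x in microbatch:
            accumulated += grad(w, x)
            num_samples += 1
    return w - lr * accumulated / num_samples


def step_full_batch(w: float, batch: list, lr: float) -> float:
    """Reference: one ordinary gradient step on the whole batch."""
    mean_grad = sum(grad(w, x) for x in batch) / len(batch)
    return w - lr * mean_grad


microbatches = [[0.0, 2.0], [4.0, 6.0]]
batch = [x for mb in microbatches for x in mb]
w_flush = step_with_flush(1.0, microbatches, lr=0.1)
w_full = step_full_batch(1.0, batch, lr=0.1)
print(w_flush, w_full)
assert math.isclose(w_flush, w_full)
```

Because every microbatch sees the same weights, splitting the batch into microbatches does not change the update, which is the behavior that W(t+1) = W(t) - ν · ∇f(W(t)) describes.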

This concludes our analysis of NVIDIA Megatron. Next we will use NVIDIA HugeCTR to see how large sparse embeddings are handled.