[原始碼解析] 模型並行分散式訓練 Megatron (4) --- 如何設定各種並行

0x00 摘要

NVIDIA Megatron 是一個基於 PyTorch 的分散式訓練框架，用來訓練超大Transformer語言模型，其通過綜合應用了資料並行，Tensor並行和Pipeline並行來複現 GPT3，值得我們深入分析其背後機理。

本系列大概有 5 篇文章，通過論文和原始碼和大家一起學習研究。本文將看看 Megatron 如何處理設定並行。

本系列其他文章為：

[原始碼解析] 模型並行分散式訓練Megatron (1) --- 論文 & 基礎

[原始碼解析] 模型並行分散式訓練Megatron (2) --- 整體架構

[原始碼解析] 模型並行分散式訓練 Megatron (3) ---模型並行實現

0x01 前文回顧

前文我們對模型並行的原理和程式碼進行了分析，對於給定的模型，現在還需要解決幾個問題：

如何把模型切分給節點，比如哪個節點負責哪些層。
資料並行，模型並行，流水線並行這幾種並行之中，每個節點分別屬於哪個部分？
如何避免流水線帶來的問題。

我們接下來就仔細分析一下。

0x02 初始化

initialize_model_parallel 方法用來設定模型並行，所以我們接下來就具體分析。

2.1 全域性變數

因為前文_initialize_distributed之中呼叫了torch.distributed.init_process_group 初始化分散式環境，所以我們知道，每個程式都有自己的 gloabl rank 和 local rank，都有自己的全域性變數。

主要變數如下（具體例子可以結合 initialize_model_parallel 之中的註釋來看）：

_TENSOR_MODEL_PARALLEL_GROUP ：當前 rank 所屬於的Intra-layer model parallel group，就是tensor 並行程式組。
- 假如每一層分為兩個tensor，則 _TENSOR_MODEL_PARALLEL_GROUP 例子為：[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]。
_PIPELINE_MODEL_PARALLEL_GROUP ：當前 rank 所屬於的Intra-layer model parallel group，就是流水線程式組。
- 假如流水線深度為4，則例子為 [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]。
_MODEL_PARALLEL_GROUP ：當前 rank 所屬於的模型並行程式組，包括了以上兩組。
- 針對我們例子，就是完整模型被複制了兩份，其 GPU 節點具體是[0, 1, 4, 5, 8, 9, 12, 13]，[2, 3, 6, 7, 10, 11, 14, 15]
_EMBEDDING_GROUP ：嵌入對應的程式組。
_DATA_PARALLEL_GROUP ：當前 rank 所屬於的Data parallel group。
- 假如資料並行度數為2，則例子為[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]。

具體如下：

# Intra-layer model parallel group that the current rank belongs to.
_TENSOR_MODEL_PARALLEL_GROUP = None
# Inter-layer model parallel group that the current rank belongs to.
_PIPELINE_MODEL_PARALLEL_GROUP = None
# Model parallel group (both intra- and pipeline) that the current rank belongs to.
_MODEL_PARALLEL_GROUP = None
# Embedding group.
_EMBEDDING_GROUP = None
# Data parallel group that the current rank belongs to.
_DATA_PARALLEL_GROUP = None

_VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK = None
_VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = None
_PIPELINE_MODEL_PARALLEL_SPLIT_RANK = None

# These values enable us to change the mpu sizes on the fly.
_MPU_TENSOR_MODEL_PARALLEL_WORLD_SIZE = None
_MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = None
_MPU_TENSOR_MODEL_PARALLEL_RANK = None
_MPU_PIPELINE_MODEL_PARALLEL_RANK = None

# A list of ranks that have a copy of the embedding.
_EMBEDDING_GLOBAL_RANKS = None

# A list of global ranks for each pipeline group to ease calculation of the source
# rank when broadcasting from the first or last pipeline stage.
_PIPELINE_GLOBAL_RANKS = None

2.2 初始化程式碼

我們首先把 initialize_model_parallel 程式碼摘錄出來。initialize_model_parallel 作用就是對模型進行分組，然後初始化程式組相關的各種全域性變數。

def initialize_model_parallel(tensor_model_parallel_size_=1,
                              pipeline_model_parallel_size_=1,
                              virtual_pipeline_model_parallel_size_=None,
                              pipeline_model_parallel_split_rank_=None):
    """
    Initialize model data parallel groups.

    Arguments:
        tensor_model_parallel_size: number of GPUs used for tensor model parallelism.
        pipeline_model_parallel_size: number of GPUs used for pipeline model parallelism.
        virtual_pipeline_model_parallel_size: number of virtual stages (interleaved
                                              pipeline).
        pipeline_model_parallel_split_rank: for models with both encoder and decoder,
                                            rank in pipeline with split point.


    Let's say we have a total of 16 GPUs denoted by g0 ... g15 and we
    use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
    the model pipeline. The present function will
    create 8 tensor model-parallel groups, 4 pipeline model-parallel groups
    and 8 data-parallel groups as:
        8 data_parallel groups:
            [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]
        8 tensor model-parallel groups:
            [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
        4 pipeline model-parallel groups:
            [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]
    Note that for efficiency, the caller should make sure adjacent ranks
    are on the same DGX box. For example if we are using 2 DGX-1 boxes
    with a total of 16 GPUs, rank 0 to 7 belong to the first box and
    ranks 8 to 15 belong to the second box.
    """
    if torch.distributed.get_rank() == 0:
        print('> initializing tensor model parallel with size {}'.format(
            tensor_model_parallel_size_))
        print('> initializing pipeline model parallel with size {}'.format(
            pipeline_model_parallel_size_))
    # Get world size and rank. Ensure some consistencies.
    world_size = torch.distributed.get_world_size()
    tensor_model_parallel_size = min(tensor_model_parallel_size_, world_size)
    pipeline_model_parallel_size = min(pipeline_model_parallel_size_, world_size)
    ensure_divisibility(world_size,
                        tensor_model_parallel_size * pipeline_model_parallel_size)
    data_parallel_size = world_size // (tensor_model_parallel_size *
                                        pipeline_model_parallel_size)

    num_tensor_model_parallel_groups = world_size // tensor_model_parallel_size
    num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size
    num_data_parallel_groups = world_size // data_parallel_size

    if virtual_pipeline_model_parallel_size_ is not None:
        global _VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK
        global _VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
        _VIRTUAL_PIPELINE_MODEL_PARALLEL_RANK = 0
        _VIRTUAL_PIPELINE_MODEL_PARALLEL_WORLD_SIZE = virtual_pipeline_model_parallel_size_

    if pipeline_model_parallel_split_rank_ is not None:
        global _PIPELINE_MODEL_PARALLEL_SPLIT_RANK
        _PIPELINE_MODEL_PARALLEL_SPLIT_RANK = pipeline_model_parallel_split_rank_

    rank = torch.distributed.get_rank()

    # Build the data-parallel groups.
    global _DATA_PARALLEL_GROUP
    all_data_parallel_group_ranks = []
    for i in range(pipeline_model_parallel_size):
        start_rank = i * num_pipeline_model_parallel_groups
        end_rank = (i + 1) * num_pipeline_model_parallel_groups
        for j in range(tensor_model_parallel_size):
            ranks = range(start_rank + j, end_rank,
                          tensor_model_parallel_size)
            all_data_parallel_group_ranks.append(list(ranks))
            group = torch.distributed.new_group(ranks)
            if rank in ranks:
                _DATA_PARALLEL_GROUP = group

    # Build the model-parallel groups.
    global _MODEL_PARALLEL_GROUP
    for i in range(data_parallel_size):
        ranks = [data_parallel_group_ranks[i]
                 for data_parallel_group_ranks in all_data_parallel_group_ranks]
        group = torch.distributed.new_group(ranks)
        if rank in ranks:
            _MODEL_PARALLEL_GROUP = group

    # Build the tensor model-parallel groups.
    global _TENSOR_MODEL_PARALLEL_GROUP
    for i in range(num_tensor_model_parallel_groups):
        ranks = range(i * tensor_model_parallel_size,
                      (i + 1) * tensor_model_parallel_size)
        group = torch.distributed.new_group(ranks)
        if rank in ranks:
            _TENSOR_MODEL_PARALLEL_GROUP = group

    # Build the pipeline model-parallel groups and embedding groups
    # (first and last rank in each pipeline model-parallel group).
    global _PIPELINE_MODEL_PARALLEL_GROUP
    global _PIPELINE_GLOBAL_RANKS
    global _EMBEDDING_GROUP
    global _EMBEDDING_GLOBAL_RANKS
    for i in range(num_pipeline_model_parallel_groups):
        ranks = range(i, world_size,
                      num_pipeline_model_parallel_groups)
        group = torch.distributed.new_group(ranks)
        if rank in ranks:
            _PIPELINE_MODEL_PARALLEL_GROUP = group
            _PIPELINE_GLOBAL_RANKS = ranks
        # Setup embedding group (to exchange gradients between
        # first and last stages).
        if len(ranks) > 1:
            embedding_ranks = [ranks[0], ranks[-1]]
            if pipeline_model_parallel_split_rank_ is not None and \
                    pipeline_model_parallel_split_rank_ not in embedding_ranks:
                embedding_ranks = [ranks[0],
                                   ranks[pipeline_model_parallel_split_rank_],
                                   ranks[-1]]
        else:
            embedding_ranks = ranks
        group = torch.distributed.new_group(embedding_ranks)
        if rank in embedding_ranks:
            _EMBEDDING_GROUP = group
        if rank in ranks:
            _EMBEDDING_GLOBAL_RANKS = embedding_ranks

0x03 切分樣例

我們使用註釋內容來進行學習如何切分模型，如何把多種並行模式組合在一起。

3.1 註釋

initialize_model_parallel 的註釋值得我們深入學習，具體如下：

Let's say we have a total of 16 GPUs denoted by g0 ... g15 and we
use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
the model pipeline. The present function will
create 8 tensor model-parallel groups, 4 pipeline model-parallel groups
and 8 data-parallel groups as:
    8 data_parallel groups:
        [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]
    8 tensor model-parallel groups:
        [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
    4 pipeline model-parallel groups:
        [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]
Note that for efficiency, the caller should make sure adjacent ranks
are on the same DGX box. For example if we are using 2 DGX-1 boxes
with a total of 16 GPUs, rank 0 to 7 belong to the first box and
ranks 8 to 15 belong to the second box.

從註釋可以知道如下資訊：

假定目前有16個GPU，屬於兩個node，rank 0 ～7 屬於第一個節點，rank 8 ～ 15 屬於第二個節點。
create 8 tensor model-parallel groups, 4 pipeline model-parallel groups，這說明將一個完整模型切分如下：
- 沿著行橫向切了一刀：tensor_model_parallel_size = 16 / 8 = 2，就是2個 GPUs 來進行模型張量並行。
- 沿著列縱向切了三刀：pipeline_model_parallel_size = 16 /4 = 4，就是4個GPUs 進行流水線並行。
- 因此，一個模型分為8塊，每一塊放在一個GPU之上，就是8個GPU。而通過如下計算可以知 16 GPUs / 8 GPUs = 2 models。即，16張卡可以放置兩個完整模型。
因為張量模型並行組大小是2，即16個GPU被分成8組，則這8組內容是 [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]。
因為流水線並行組大小是4，即16個GPU被分成4組，則這4組內容是[g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]。
因為資料並行組大小是2，16個GPU被分成8組，則這8組內容是[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]。
以上這些程式組都是通過 torch.distributed.new_group 來完成，這樣組內程式之間就知道哪些程式是在同一個組內，是在一起訓練的，也知道怎麼通訊。

3.2 切分情況

模型原始圖如下

模型切分之後如下，一共被分成8塊。其中，第一層被切分為 A，B，所以 A，B 之間就是 Tensor Model parallel。後面 C，D 之間也是 Tensor Model parallel，把兩層都做了切分，依次類推。

我們的目標就是用程式碼來看看如何生成註釋裡面的各種模型組。

3.3 切分策略

我們接下來看看具體切分的策略，也就是GPU分配策略。切分需要綜合考慮多種情況，首先看看模型並行的通訊狀況。

張量並行：通訊發生在每層的前向傳播和後向傳播過程之中，通訊型別是all-reduce，不但單次通訊資料量大，並且通訊頻繁。
流水線並行：通訊在流水線階段相鄰的切分點之上，通訊型別是P2P通訊，單詞通訊資料量較少但是比較頻繁，而且因為流水線的特點，會產生GPU空閒時間，這裡稱為流水線氣泡（Bubble）。

我們接下來看看各種並行機制的對比。

Tensor versus Pipeline Parallelism. 張量模型的並行性在節點內是最好的，因為它會減少通訊量。另一方面，流水線模型並行使用更便宜的點對點通訊，可以跨節點執行，而不會限制整個計算。然而，流水線並行性會在流水線氣泡中花費大量時間，因此，應限制流水線級的總數，以便流水線中的microbatches數量是流水線深度的合理倍數。當張量並行大小等於單個節點中的GPU數量時會達到峰值效能。
Pipeline versus Data Parallelism. 對於每個batch size，吞吐量隨著流水線並行規模的增加而降低。流水線模型並行應該主要用於支援不適合單個 worker 的大型模型訓練。而資料並行應該用於擴大訓練規模。
Tensor versus Data Parallelism. 接下來看看資料和張量模型的並行性對效能的影響。在較大的批處理量和微批處理量為1的情況下，資料並行通訊並不頻繁；張量模型並行需要對批處理中的每個微批進行all-to-all通訊。這種all-to-all的通訊主導了端到端的訓練時間，特別是當通訊需要在多GPU節點上進行時。此外，隨著張量模型並行規模的增加，我們在每個GPU上執行較小的矩陣乘法（因為會把模型張量進行切分），這降低了每個GPU的利用率。

最後看看結論

Tensor模型並行被用於intra-node transformer 層，因為張量平行計算密集且是耗費大量頻寬，這樣會在HGX based系統上高效執行。
Pipeline 模型並行主要被用於inter-node transformer 層，因為Pipeline 並行的通訊頻寬佔用少，其可以有效利用叢集中多網路卡設計。
資料並行則在前兩者基礎之上進行加持，使得訓練可以擴充套件到更大規模和更快的速度。我們應該注意到，儘管資料並行可以帶來高效的擴充套件，但我們不能單獨使用資料並行來處理訓練超大模型，因為 a）記憶體容量不足，b）資料並行的擴充套件限制。

3.4 實驗

我們接下來做一個實驗看看。

import torch

world_size = 16
tensor_model_parallel_size = 2 # 2 GPUs to parallelize the model tensor
pipeline_model_parallel_size = 4 # 4 GPUs to parallelize the model pipeline
data_parallel_size = world_size // (tensor_model_parallel_size *
                                    pipeline_model_parallel_size) # 2
num_tensor_model_parallel_groups = world_size // tensor_model_parallel_size # 8
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size # 4
num_data_parallel_groups = world_size // data_parallel_size # 8

# Build the data-parallel groups.
print("------ Build the data-parallel groups -----")
all_data_parallel_group_ranks = []
for i in range(pipeline_model_parallel_size):
    start_rank = i * num_pipeline_model_parallel_groups
    end_rank = (i + 1) * num_pipeline_model_parallel_groups
    for j in range(tensor_model_parallel_size):
        ranks = range(start_rank + j, end_rank,
                      tensor_model_parallel_size)
        all_data_parallel_group_ranks.append(list(ranks))
print(all_data_parallel_group_ranks)

# Build the model-parallel groups.
print("------ Build the model-parallel groups -----")
for i in range(data_parallel_size):
    ranks = [data_parallel_group_ranks[i]
             for data_parallel_group_ranks in all_data_parallel_group_ranks]
    print(list(ranks))

# Build the tensor model-parallel groups.
print("------ Build the tensor model-parallel groups -----")
for i in range(num_tensor_model_parallel_groups):
    ranks = range(i * tensor_model_parallel_size,
                  (i + 1) * tensor_model_parallel_size)
    print(list(ranks))

# Build the pipeline model-parallel groups and embedding groups
# (first and last rank in each pipeline model-parallel group).
print("------ Build the pipeline model-parallel groups -----")
for i in range(num_pipeline_model_parallel_groups):
    ranks = range(i, world_size,
                  num_pipeline_model_parallel_groups)
    print(list(ranks))

輸出如下。需要注意，這裡都是 GPU 的序列號，[0,2] 就是 [g0, g2]：

------ Build the data-parallel groups -----
[[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]
------ Build the model-parallel groups -----
[0, 1, 4, 5, 8, 9, 12, 13]
[2, 3, 6, 7, 10, 11, 14, 15]
------ Build the tensor model-parallel groups -----
[0, 1]
[2, 3]
[4, 5]
[6, 7]
[8, 9]
[10, 11]
[12, 13]
[14, 15]
------ Build the pipeline model-parallel groups -----
[0, 4, 8, 12]
[1, 5, 9, 13]
[2, 6, 10, 14]
[3, 7, 11, 15]

我們對比一下注釋，發現程式碼列印結果可以和註釋對應上：
    Let's say we have a total of 16 GPUs denoted by g0 ... g15 and we
    use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
    the model pipeline. The present function will
    create 8 tensor model-parallel groups, 4 pipeline model-parallel groups
    and 8 data-parallel groups as:
        8 data_parallel groups:
            [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]
        8 tensor model-parallel groups:
            [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
        4 pipeline model-parallel groups:
            [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]

我們接下來會進行具體分析。

0x04 起始狀態

4.1 GPU 狀況

從註釋中可以看到：

Note that for efficiency, the caller should make sure adjacent ranks are on the same DGX box. For example if we are using 2 DGX-1 boxes with a total of 16 GPUs, rank 0 to 7 belong to the first box and ranks 8 to 15 belong to the second box.

意思就是：呼叫者需要確保相鄰的rank在同一個節點上，我們例子有兩個Node，其中第一個Node擁有 GPU 0 ～ 7，就是 rank 0 ～ 7，第二個Node是 GPU 8～15，就是 rank 8 ～ 15。

具體如下，這裡每行4個GPU，是因為 4 GPUs to parallelize the model pipeline，所以流水線每個stage是4個GPU。

4.2 符號說明

下面是論文之中提到的一些符號，這裡有必要再取出來溫習一下：

(?, ?, ?): Parallelization dimensions.
? for the pipeline-modelparallel size,
? for the tensor-model-parallel size, and ? for the data-parallel size.
?: Number of GPUs. We require ? · ? · ? = ?.

4.3 初始分組

依據註釋，我們得出目前分組情況和一些全域性資訊。

一共16個GPU，所以 world_size 為 16。就是 Notation 之中的 n。
使用兩個GPU進行 model tensor 並行，所以 tensor_model_parallel_size = 2。就是 Notation 之中的 t。
使用四個GPU進行模型流水線並行，所以 pipeline_model_parallel_size = 4。就是 Notation 之中的 p。其實，就是流水線深度為 4，即，4 個 GPU 是序列的。
依據上面定義，d = n / ( t * p) = 2，就是 data_parallel_size = 2。因為 t * p 就是一個模型所需要的 GPU，d = (總 GPU / 一個模型需要的 GPU)，結果是這些GPU可以訓練 d 個模型，就是可以用 d 個 mini-batches 進行這個 d個模型一起訓練，所以資料並行度為 d。

接下來結合程式碼看看需要分成多少個process groups，他們在程式碼之中的變數是什麼。

num_tensor_model_parallel_groups 就是從 tensor model 並行角度看，分成8 個程式roup。
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size 就是從 model 並行角度看，分成 4 個程式group。
num_data_parallel_groups = world_size // data_parallel_size 就是從data 並行角度看，分成8 個程式group。就是會有 8 個 DDP，每個 DDP 包括 2 個 rank。
還有一個 _MODEL_PARALLEL_GROUP，

具體如下：

world_size = 16
tensor_model_parallel_size = 2 # 2 GPUs to parallelize the model tensor
pipeline_model_parallel_size = 4 # 4 GPUs to parallelize the model pipeline
data_parallel_size = world_size // (tensor_model_parallel_size *
                                    pipeline_model_parallel_size) # 2
num_tensor_model_parallel_groups = world_size // tensor_model_parallel_size # 8
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size # 4
num_data_parallel_groups = world_size // data_parallel_size # 8

0x05 Tensor model-parallel

本節我們分析的是，如何將 Node 上的 GPU 分給 tensor model 並行組。

5.1 分組

對於註釋例子，16 / 2 = 8，分成 8 個程式組，每個組兩個 rank。這些分組分別是：[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]，我們得到了如下資訊：

[g0, g1] 就是某一層分切為2半，分別被 g0, g1 來執行，[g2, g3] 表示另一層被分為兩層，分別被 g2，g3 來執行。
我們可以看到，每一個 tensor-model-parallel group的 rank一定是相鄰的，比如 [g0, g1], [g2, g3]。
注意，0 ~ 7 不代表是同一個模型。0 ~ 7 是同一個 Node 上的 GPU，這點容易被混淆。

我們再看看程式碼：

    # Build the tensor model-parallel groups.
    global _TENSOR_MODEL_PARALLEL_GROUP
    for i in range(num_tensor_model_parallel_groups): # 8
        ranks = range(i * tensor_model_parallel_size,
                      (i + 1) * tensor_model_parallel_size)
        group = torch.distributed.new_group(ranks) # 就有生成 8 組
        if rank in ranks: 
            # 如果本rank在某一list之中，即1 在 [0,1] 之中，則本 rank 就屬於 new_group([0,1])
            _TENSOR_MODEL_PARALLEL_GROUP = group

我們實驗之中在這裡得到：

------ Build the tensor model-parallel groups -----
[0, 1]
[2, 3]
[4, 5]
[6, 7]
[8, 9]
[10, 11]
[12, 13]
[14, 15]

對應我們圖上如下，每個 tensor model group 用一個虛線小矩形框標示，一共8個：

_TENSOR_MODEL_PARALLEL_GROUP = group 就記錄了本rank的程式組資訊，比如 rank 2，它的 _TENSOR_MODEL_PARALLEL_GROUP 內容就是：group([g2, g3])。

5.2 使用

我們接下來看看如何使用。

get_tensor_model_parallel_group 返回了自己 rank 對應的 tensor model group。

def get_tensor_model_parallel_group():
    """Get the tensor model parallel group the caller rank belongs to."""
    return _TENSOR_MODEL_PARALLEL_GROUP

在 megatron/mpu/mappings.py 之中有對 tensor model group 的使用：

def _reduce(input_):
    """All-reduce the input tensor across model parallel group."""

    # Bypass the function if we are using only 1 GPU.
    if get_tensor_model_parallel_world_size()==1:
        return input_

    # All-reduce.
    torch.distributed.all_reduce(input_, group=get_tensor_model_parallel_group())

    return input_

就是當流水線反向傳播時候，利用 _TENSOR_MODEL_PARALLEL_GROUP 進行在組內進行集合通訊。

0x06 Pipe-parallel

本節我們分析的是，如何將 Node 上的 GPU 分給 pipeline model 並行組。

6.1 分組

從註釋可以看到，流水線分組就是把這個16個GPU 分成 4 組，每組 4 個 GPU，得到 [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]，我們得到了如下資訊：

每組的四個GPU進行模型流水線並行，所以 pipeline_model_parallel_size = 4。就是 Notation 之中的 p。其實，就是流水線深度為 4，每組內 4 個 GPU 是序列的。即， [g0, g4, g8, g12] 這4個 GPU是序列的。
再看看流水線的每一層，含有 16 / 4 = 4 個 GPU，能看到第一層是 0 ~ 4，第二層是 5 ~ 8，......。
可以看到，流水線的 group是隔 n // p個取一個，比如[0, 4, 8, 12]。
對於流水線每個stage，則是stage i 的 rank 範圍是：[(i-1) * n//p, (i) * n//p]，即 rank 2 所在的stage 的rank是 [0,1,2,3]。
_PIPELINE_MODEL_PARALLEL_GROUP 得到了本rank對應的流水線程式組。
_PIPELINE_GLOBAL_RANKS 得到了程式組的ranks。
假如本程式是 rank 2，則流水線程式組 ranks 是 [g2, g6, g10, g14]。

具體程式碼如下：

    # Build the pipeline model-parallel groups and embedding groups
    # (first and last rank in each pipeline model-parallel group).
    global _PIPELINE_MODEL_PARALLEL_GROUP
    global _PIPELINE_GLOBAL_RANKS
    global _EMBEDDING_GROUP
    for i in range(num_pipeline_model_parallel_groups): # 4
        ranks = range(i, world_size, # 每隔 n // p個取一個
                      num_pipeline_model_parallel_groups)
        group = torch.distributed.new_group(ranks)
        if rank in ranks:
            _PIPELINE_MODEL_PARALLEL_GROUP = group
            _PIPELINE_GLOBAL_RANKS = ranks
        # Setup embedding group (to exchange gradients between
        # first and last stages).
        if len(ranks) > 1:
            embedding_ranks = [ranks[0], ranks[-1]]
        else:
            embedding_ranks = ranks
        group = torch.distributed.new_group(embedding_ranks)
        if rank in embedding_ranks:
            _EMBEDDING_GROUP = group

我們擴充之前圖如下，現在看到增加了 4 條從上到下的虛線箭頭，分別對應了 4 組流水線序列。橫向層是從 Stage 0 ~ Stage 3。

6.2 使用

接下來看看如何使用。

get_pipeline_model_parallel_group 返回了自己 rank 對應的 pipeline model group。

def get_pipeline_model_parallel_group():
    """Get the pipeline model parallel group the caller rank belongs to."""
    return _PIPELINE_MODEL_PARALLEL_GROUP

具體使用是在 megatron/p2p_communication.py，_communicate 之中會用流水線組資訊來進行通訊。這裡省略了大部分程式碼。

def _communicate(tensor_send_next, tensor_send_prev, recv_prev, recv_next,
                 use_ring_exchange=False, tensor_shape=None,
                 override_scatter_gather_tensors_in_pipeline=False,
                 dtype_=None):
    """Communicate tensors between stages. Used as helper method in other
    communication methods that are used in megatron/schedules.py.
    """

    # Send tensors in both the forward and backward directions as appropriate.
    if use_ring_exchange: # 這裡使用get_pipeline_model_parallel_group 進行通訊
        torch.distributed.ring_exchange(tensor_send_prev=tensor_send_prev,
                                        tensor_recv_prev=tensor_recv_prev,
                                        tensor_send_next=tensor_send_next,
                                        tensor_recv_next=tensor_recv_next,
                                        group=mpu.get_pipeline_model_parallel_group())
    else:
        ops = []
        if tensor_send_prev is not None:
            send_prev_op = torch.distributed.P2POp(
                torch.distributed.isend, tensor_send_prev,
                mpu.get_pipeline_model_parallel_prev_rank()) # 得到流水線前一個rank
            ops.append(send_prev_op)
        if tensor_recv_prev is not None:
            recv_prev_op = torch.distributed.P2POp(
                torch.distributed.irecv, tensor_recv_prev,
                mpu.get_pipeline_model_parallel_prev_rank())
            ops.append(recv_prev_op)
        if tensor_send_next is not None:
            send_next_op = torch.distributed.P2POp(
                torch.distributed.isend, tensor_send_next,
                mpu.get_pipeline_model_parallel_next_rank()) # 得到流水線下一個rank
            ops.append(send_next_op)
        if tensor_recv_next is not None:
            recv_next_op = torch.distributed.P2POp(
                torch.distributed.irecv, tensor_recv_next,
                mpu.get_pipeline_model_parallel_next_rank())
            ops.append(recv_next_op)

6.2.1 上下游rank

具體如何得到流水線上下游的rank？是通過 get_pipeline_model_parallel_next_rank 和 get_pipeline_model_parallel_prev_rank 來完成。其中_PIPELINE_GLOBAL_RANKS 得到了程式組的ranks，假如本程式是 rank 2，則流水線程式組 ranks 是 [g2, g6, g10, g14]。

def get_pipeline_model_parallel_next_rank():
    rank_in_pipeline = get_pipeline_model_parallel_rank()
    world_size = get_pipeline_model_parallel_world_size()
    return _PIPELINE_GLOBAL_RANKS[(rank_in_pipeline + 1) % world_size]

def get_pipeline_model_parallel_prev_rank():
    rank_in_pipeline = get_pipeline_model_parallel_rank()
    world_size = get_pipeline_model_parallel_world_size()
    return _PIPELINE_GLOBAL_RANKS[(rank_in_pipeline - 1) % world_size]

6.2.2 world size

get_pipeline_model_parallel_world_size 得到了程式組的 world size。

def get_pipeline_model_parallel_world_size():
    """Return world size for the pipeline model parallel group."""
    global _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
    if _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE is not None:
        return _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
    return torch.distributed.get_world_size(group=get_pipeline_model_parallel_group())

0x07 Data-parallel

我們接下來看看資料並行。

7.1 分組

對於註釋例子，16 / 2 = 8，分成 8 個程式組，每個組兩個 rank。這些分組分別是：[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]，我們得到了如下資訊：

依據上面分析， t * p 就是一個模型所需要的 GPU，因此，d = (總 GPU 數目 / 一個模型需要的 GPU 數目) = n / ( t * p)，就是說，目前提供的這 n 個GPU可以同時訓練 d 個模型，就是可以用 d 個 mini-batches 輸入到這 d 個模型一起訓練，所以資料並行度為 d。
對應註釋例子，就是data_parallel_size = 16 / (2 * 4) = 2。
rank 2 對應的資料並行程式組是[g0, g2]。

我們再看看用程式碼怎麼確定有哪些group，每個group裡面包含什麼。

首先，流水線被分成了 p 個 stage，對於流水線每個stage，其有 n // p 個GPU，stage i 的 rank 範圍是：[i * n//p, (i+1) * n//p]，即 rank 2所在的stage 的rank是 [0,1,2,3]。
其次，在每一個stage之中，ranks = range(start_rank + j, end_rank, tensor_model_parallel_size) ，意思是這stage的n//p個GPUs中，每隔 t 個取一個作為資料並行 group 之中的一份子，因此每個data-parallel group大小為 n // p // t = d。

具體程式碼如下：

    # Build the data-parallel groups.
    global _DATA_PARALLEL_GROUP
    assert _DATA_PARALLEL_GROUP is None, \
        'data parallel group is already initialized'
    all_data_parallel_group_ranks = []
    for i in range(pipeline_model_parallel_size): # 遍歷流水線深度
        start_rank = i * num_pipeline_model_parallel_groups # 找到每個stage的起始rank
        end_rank = (i + 1) * num_pipeline_model_parallel_groups # 找到每個stage的終止rank
        for j in range(tensor_model_parallel_size): # 遍歷tensor model分組size
            ranks = range(start_rank + j, end_rank, # 每隔 t 個取一個作為資料並行group中的一份子
                          tensor_model_parallel_size)
            all_data_parallel_group_ranks.append(list(ranks))
            group = torch.distributed.new_group(ranks)
            if rank in ranks:
                _DATA_PARALLEL_GROUP = group

列印輸出如下，和註釋一致。

------ Build the data-parallel groups -----
[[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]

對應圖片擴充如下：其中，每個新增的雙箭頭對應一個DDP（兩個rank），比如[2, 3]對應一個DDP。

7.2 如何使用

我們接下來看看如何使用。

get_data_parallel_group 會得到本rank對應的 _DATA_PARALLEL_GROUP。

def get_data_parallel_group():
    """Get the data parallel group the caller rank belongs to."""
    return _DATA_PARALLEL_GROUP

在 allreduce_gradients之中，會對本資料並行組進行all-reduce。

    def allreduce_gradients(self):
        """Reduce gradients across data parallel ranks."""
        # If we have buffers, simply reduce the data in the buffer.
        if self._grad_buffers is not None:
            for _, buffer_ in self._grad_buffers.items():
                buffer_.data /= mpu.get_data_parallel_world_size() # 資料並行 world size
                torch.distributed.all_reduce(
                    buffer_.data, group=mpu.get_data_parallel_group()) # 資料並行組
        else:
            # Otherwise, bucketize and all-reduce
            buckets = {}
            # Pack the buckets.
            for param in self.module.parameters():
                if param.requires_grad and param.grad is not None:
                    tp = param.data.type()
                    if tp not in buckets:
                        buckets[tp] = []
                    buckets[tp].append(param)
                    param.main_grad = param.grad

            # For each bucket, all-reduce and copy all-reduced grads.
            for tp in buckets:
                bucket = buckets[tp]
                grads = [param.grad.data for param in bucket]
                coalesced = _flatten_dense_tensors(grads)
                coalesced /= mpu.get_data_parallel_world_size()
                torch.distributed.all_reduce(
                    coalesced, group=mpu.get_data_parallel_group())
                for buf, synced in zip(grads, _unflatten_dense_tensors(
                        coalesced, grads)):
                    buf.copy_(synced)

0x08 模型組

前面實驗中，我們得到模型並行組如下：[0, 1, 4, 5, 8, 9, 12, 13] [2, 3, 6, 7, 10, 11, 14, 15]。生成程式碼如下：

    # Build the model-parallel groups.
    global _MODEL_PARALLEL_GROUP
    for i in range(data_parallel_size):
        ranks = [data_parallel_group_ranks[i]
                 for data_parallel_group_ranks in all_data_parallel_group_ranks]
        group = torch.distributed.new_group(ranks)
        if rank in ranks:
            _MODEL_PARALLEL_GROUP = group

_MODEL_PARALLEL_GROUP 會得到本rank對應的模型組。

def get_model_parallel_group():
    """Get the model parallel group the caller rank belongs to."""
    return _MODEL_PARALLEL_GROUP

這裡是裁剪梯度會用到，就是在本模型的全部rank之中進行梯度裁剪相關操作。

def clip_grad_norm_fp32(parameters, max_norm, norm_type=2):
    """Clips gradient norm of an iterable of parameters whose gradients
       are in fp32.

    This is adapted from torch.nn.utils.clip_grad.clip_grad_norm_ and
    added functionality to handle model parallel parameters. Note that
    the gradients are modified in place.

    Arguments:
        parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a
            single Tensor that will have gradients normalized
        max_norm (float or int): max norm of the gradients
        norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for
            infinity norm.

    Returns:
        Total norm of the parameters (viewed as a single vector).
    """

    if isinstance(parameters, torch.Tensor):
        parameters = [parameters]

    # Filter parameters based on:
    #   - grad should not be none
    #   - parameter should not be shared
    #   - should not be a replica due to tensor model parallelism
    grads = []
    grads_for_norm = []
    for param in parameters:
        grad_not_none = param.grad is not None
        is_not_shared = param_is_not_shared(param)
        is_not_tp_duplicate = param_is_not_tensor_parallel_duplicate(param)
        grad = param.grad.detach()
        if grad_not_none:
            # Make sure the grads are in fp32
            grads.append(grad)
        if grad_not_none and is_not_shared and is_not_tp_duplicate:
            grads_for_norm.append(grad)

    # Norm parameters.
    max_norm = float(max_norm)
    norm_type = float(norm_type)
    total_norm = 0.0

    # Calculate norm.
    if norm_type == inf:
        total_norm = max(grad.abs().max() for grad in grads_for_norm)
        total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)])
        # Take max across all model-parallel GPUs.
        torch.distributed.all_reduce(total_norm_cuda,
                                     op=torch.distributed.ReduceOp.MAX,
                                     group=mpu.get_model_parallel_group()) # 模型組資訊
        total_norm = total_norm_cuda[0].item()

    else:
        if norm_type == 2.0:
            dummy_overflow_buf = torch.cuda.IntTensor([0])
            # Use apex's multi-tensor applier for efficiency reasons.
            # Multi-tensor applier takes a function and a list of list
            # and performs the operation on that list all in one kernel.
            grad_norm, _ = multi_tensor_applier(
                amp_C.multi_tensor_l2norm,
                dummy_overflow_buf,
                [grads_for_norm],
                False # no per-parameter norm
            )
            # Since we will be summing across data parallel groups,
            # we need the pow(norm-type).
            total_norm = grad_norm ** norm_type

        else:
            for grad in grads_for_norm:
                grad_norm = torch.norm(grad, norm_type)
                total_norm += grad_norm ** norm_type

        # Sum across all model-parallel GPUs.
        torch.distributed.all_reduce(total_norm,
                                     op=torch.distributed.ReduceOp.SUM,
                                     group=mpu.get_model_parallel_group()) # 模型組資訊
        total_norm = total_norm.item() ** (1.0 / norm_type)

    # Scale.
    clip_coeff = max_norm / (total_norm + 1.0e-6)
    if clip_coeff < 1.0:
        dummy_overflow_buf = torch.cuda.IntTensor([0])
        multi_tensor_applier(amp_C.multi_tensor_scale,
                             dummy_overflow_buf,
                             [grads, grads],
                             clip_coeff)

    return total_norm

之前的圖如下，利用看到分成兩組，左邊是Model 0 對應的全部ranks，右面是model 1 的ranks。

0x09 如何把模型分到GPU

我們最後還有一個問題沒有涉及，就是如何把模型分塊放到對應的GPU之上。就是如何與最初分成A，B，..., H 的那個圖對應起來。其實，不是根據模型來把模型部分拷貝到對應的rank或者GPU，而是rank或者GPU主動過來拷貝自己對應的層。

因為呼叫了 mpu.initialize_model_parallel 來設定模型並行，資料並行等各種程式組，所以每個 rank 對應的程式都有自己的全域性變數，具體其實就是程式自動就被對映到GPU上了。比如 rank 2 對應的程式在啟動之後才知道自己是 rank 2，然後從初始化的全域性變數之中知道自己的 data_parallel group 是 [g0, g2]，tensor model-parallel group 是[g2, g3]，pipeline model-parallel group 是 [g2, g6, g10, g14]。
ParallelTransformer 的初始化之中，offset 就是根據 rank 知道自己應該生成模型的那些層，然後通過 self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)]) 來生成對應的層。
get_model 方法也會根據自己的 pipeline rank 和 is_pipeline_first_stage 來知道是不是第一層或者最後一層，然後做相應處理。
最後把模型引數拷貝到了自己對應的 GPU 之上。

具體 ParallelTransformer 初始化程式碼如下：

class ParallelTransformer(MegatronModule):
    """Transformer class."""

    def __init__(self, init_method, output_layer_init_method,
                 layer_type=LayerType.encoder,
                 self_attn_mask_type=AttnMaskType.padding,
                 pre_process=True, post_process=True):
        super(ParallelTransformer, self).__init__()
        args = get_args()
        
        # 省略程式碼
        
        # Transformer layers.
        def build_layer(layer_number):
            return ParallelTransformerLayer(
                init_method,
                output_layer_init_method,
                layer_number,
                layer_type=layer_type,
                self_attn_mask_type=self_attn_mask_type)
      
        # 下面 offset 就是根據rank知道自己應該生成模型的那些層
        if args.virtual_pipeline_model_parallel_size is not None:
            # Number of layers in each model chunk is the number of layers in the stage,
            # divided by the number of model chunks in a stage.
            self.num_layers = self.num_layers // args.virtual_pipeline_model_parallel_size
            # With 8 layers, 2 stages, and 4 model chunks, we want an assignment of
            # layers to stages like (each list is a model chunk):
            # Stage 0: [0]  [2]  [4]  [6]
            # Stage 1: [1]  [3]  [5]  [7]
            # With 8 layers, 2 stages, and 2 virtual stages, we want an assignment of
            # layers to stages like (each list is a model chunk):
            # Stage 0: [0, 1]  [4, 5]
            # Stage 1: [2, 3]  [6, 7]
            offset = mpu.get_virtual_pipeline_model_parallel_rank() * (
                args.num_layers // args.virtual_pipeline_model_parallel_size) + \
                (mpu.get_pipeline_model_parallel_rank() * self.num_layers)
        else:
            # Each stage gets a contiguous set of layers.
            offset = mpu.get_pipeline_model_parallel_rank() * self.num_layers

        self.layers = torch.nn.ModuleList(
            [build_layer(i + 1 + offset) for i in range(self.num_layers)])

        if self.post_process:
            # Final layer norm before output.
            self.final_layernorm = LayerNorm(
                args.hidden_size,
                eps=args.layernorm_epsilon)

所以，最終效果如下，其中同名子模組具有同樣的引數，可以資料並行，即兩個A可以資料並行。一列上的層之間可以流水線序列，比如 A--> C --> E --> G 就是序列，而一個橫行4個是流水線的一個stage，其中從0開始，橫向相鄰兩個GPU是 tensor model 並行。