[Source Code Analysis] PyTorch Distributed (9) ----- Initialization of DistributedDataParallel

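distributed.py:
- This is the Python entry point for DDP. It implements the initialization steps and the forward function of the nn.parallel.DistributedDataParallel module, which calls into the C++ libraries.
- Its _sync_param function performs intra-process parameter synchronization when one DDP process works on multiple devices, and it also broadcasts model buffers from the rank 0 process to all other processes.
- Inter-process parameter synchronization is implemented in Reducer.cpp.
comm.h: implements the coalesced broadcast helper function, which is called during initialization to broadcast model states and before the forward pass to synchronize model buffers.
reducer.h: provides the core implementation of gradient synchronization in the backward pass. It has three entry point functions:
- Reducer: its constructor is called in distributed.py, and it registers Reducer::autograd_hook() on the gradient accumulators.
- autograd_hook(): the autograd engine calls this function when a gradient becomes ready.
- prepare_for_backward(): called in distributed.py at the end of the DDP forward pass. If find_unused_parameters is set to True in the DDP constructor, DDP traverses the autograd graph to find unused parameters.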

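1.2.2 Process

The following two components relate to processes.

ProcessGroup.hpp: contains the abstract API of all process group implementations. The c10d library provides 3 implementations out of the box, namely ProcessGroupGloo, ProcessGroupNCCL, and ProcessGroupMPI. DistributedDataParallel uses ProcessGroup::broadcast() to send model states from the rank 0 process to the other processes during initialization, and ProcessGroup::allreduce() to sum gradients.
Store.hpp: assists the rendezvous service so that process group instances can find each other.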

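1.3 Overall DDP implementation

We combine the paper with https://pytorch.org/docs/master/notes/ddp.html to look at the overall DDP implementation.

The steps in one DistributedDataParallel iteration are summarized below (they do not match the figure above exactly; some steps are refined):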

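Prerequisite:
- DDP relies on the c10d ProcessGroup for communication. The application must therefore create a ProcessGroup instance before constructing DDP.
Constructor:
- The rank 0 process takes a reference to the local module and broadcasts the model's state_dict() to all processes, which guarantees that every process trains from the same initial values and the same model replica.
- Each DDP process creates a local Reducer, which later handles gradient synchronization during the backward pass.
- To improve communication efficiency, the Reducer organizes parameter gradients into buckets and reduces one bucket at a time.
  - The buckets are initialized, and parameters are assigned to them in reverse order, which improves communication efficiency.
  - The bucket size can be configured through the bucket_cap_mb argument of the DDP constructor.
  - The mapping from parameter gradients to buckets is determined at construction time from the bucket size limit and the parameter sizes. Model parameters are assigned to buckets in (roughly) the reverse order of Model.parameters() for the given model. The reverse order is used because DDP expects gradients to become ready in approximately that order during the backward pass.
  - The figure below shows an example. Note that grad0 and grad1 are in bucket1, and the other two gradients are in bucket0. This ordering assumption may not always hold; when it does not, DDP backward speed can suffer because the Reducer cannot start communication as early as possible.
- Besides bucketing, the Reducer also registers autograd hooks during construction, one hook per parameter. These hooks fire during the backward pass when the gradient becomes ready. Concretely, it iterates over the parameters and attaches a grad_accumulator and an autograd_hook to each one.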
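Forward Pass:
- Each process reads its own training data; DistributedSampler ensures that each process reads different data.
- DDP takes the input and passes it to the local model.
- The model runs the forward computation and the result is stored in out. At this point the computation is done independently in each process (on each CUDA device).
- If find_unused_parameters is set to True, DDP analyzes the output of the local model: it traverses the computation graph starting from out and marks unused parameters as ready. Because the graph can change on every iteration, the traversal is repeated every time.
  - This mode allows running backward on a subgraph of the model. DDP finds out which parameters take part in the backward pass by traversing the autograd graph from the model output out and marking all unused parameters as ready for reduction.
  - During the backward pass, the Reducer only waits for parameters that are not yet ready, but it still reduces all buckets. Marking a parameter gradient as ready does not help DDP skip buckets, but it prevents DDP from waiting forever for gradients that never arrive.
  - Note that traversing the autograd graph introduces extra overhead, so applications should set find_unused_parameters to True only when necessary.
- out is returned. Unlike DP, the model outputs do not need to be gathered to the rank 0 process.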
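
The graph traversal behind find_unused_parameters can be illustrated with a small pure-Python sketch. It is not the Reducer's C++ code: the Node class, the find_unused_parameters helper, and the toy graph are hypothetical names invented for illustration. The idea is to walk the graph backwards from the output and treat every parameter that is never reached as unused, so that it can be marked ready up front.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for an autograd graph node."""

    def __init__(self, name, inputs=(), param=None):
        self.name = name
        self.inputs = list(inputs)  # nodes that produced this node's inputs
        self.param = param          # parameter name if this is a leaf accumulator


def find_unused_parameters(output, all_params):
    """Walk backwards from `output`; parameters never reached are unused."""
    seen, used = set(), set()
    queue = deque([output])
    while queue:
        node = queue.popleft()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.param is not None:
            used.add(node.param)
        queue.extend(node.inputs)
    return [p for p in all_params if p not in used]


# Toy graph: the output depends on w1 and b1 only; w2 is skipped in this forward.
w1, b1 = Node("acc_w1", param="w1"), Node("acc_b1", param="b1")
w2 = Node("acc_w2", param="w2")
out = Node("add", inputs=[Node("matmul", inputs=[w1]), b1])

unused = find_unused_parameters(out, ["w1", "b1", "w2"])
print(unused)  # ['w2']
```

Because the walk has to be repeated whenever the graph may change, its cost is paid on every iteration, which is why the flag should stay off unless it is needed.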
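Backward Pass:
- backward() is called directly on the loss Tensor, which is outside DDP's control. DDP uses the autograd hooks registered at construction time to trigger gradient synchronization. When a gradient becomes ready, its corresponding DDP hook on that gradient accumulator fires.
- The all-reduce is driven from autograd_hook. Given the parameter index param_index, the hook looks up the parameter and marks it as ready; once all gradients in a bucket are ready, the bucket itself is ready.
- When all gradients in a bucket are ready, the Reducer starts an asynchronous allreduce on that bucket to compute the mean of the gradients across all processes.
- When all buckets are ready, the Reducer blocks until all allreduce operations have finished. After that, the averaged gradients are written to the param.grad field of all parameters.
- The gradients of all processes are reduced, so after the update every replica holds the same model weights. Once the backward pass completes, the grad field of the same parameter on different DDP processes should therefore be equal.
- Unlike DP, there is no need to broadcast parameters after every iteration. Buffers, however, still have to be broadcast from the rank 0 process to the other processes on every iteration.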
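
The hook-driven bookkeeping in the backward pass can be sketched in a few lines of Python. This is a simplified, single-process model and not the actual Reducer: ToyReducer, Bucket, and the launched list are hypothetical names, and the real implementation starts an asynchronous allreduce where this sketch only records the bucket index.

```python
class Bucket:
    def __init__(self, param_indices):
        self.param_indices = list(param_indices)
        self.pending = len(self.param_indices)  # gradients still missing


class ToyReducer:
    """Hypothetical, single-process model of the Reducer's bookkeeping."""

    def __init__(self, buckets):
        self.buckets = [Bucket(b) for b in buckets]
        self.param_to_bucket = {
            p: i for i, b in enumerate(self.buckets) for p in b.param_indices
        }
        self.launched = []  # order in which bucket all-reduces were started

    def autograd_hook(self, param_index):
        """Called when the gradient of `param_index` is ready."""
        bucket_index = self.param_to_bucket[param_index]
        bucket = self.buckets[bucket_index]
        bucket.pending -= 1
        if bucket.pending == 0:
            # The real Reducer would start an asynchronous allreduce here.
            self.launched.append(bucket_index)

    def all_buckets_ready(self):
        return len(self.launched) == len(self.buckets)


# bucket0 holds the last two parameters, bucket1 the first two.
reducer = ToyReducer([[2, 3], [0, 1]])
for param_index in [3, 2, 1, 0]:  # gradients become ready in reverse order
    reducer.autograd_hook(param_index)

print(reducer.launched)             # [0, 1]
print(reducer.all_buckets_ready())  # True
```

With gradients arriving in reverse parameter order, bucket0 completes first, so its communication can overlap with the remaining backward computation.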
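Optimizer Step:
- From the optimizer's point of view, it is optimizing a local model.
- The model replicas on all DDP processes stay in sync because they all start from the same state and receive the same averaged gradients on every iteration.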
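
A minimal numeric sketch of why the replicas stay in sync, assuming plain SGD and three simulated ranks. The helpers average and sgd_step are illustrative names, not part of PyTorch; average stands in for the allreduce followed by division by the world size.

```python
def average(grads_per_rank):
    """Element-wise mean across ranks, standing in for allreduce plus division."""
    world_size = len(grads_per_rank)
    return [sum(col) / world_size for col in zip(*grads_per_rank)]


def sgd_step(weights, grads, lr=0.1):
    return [w - lr * g for w, g in zip(weights, grads)]


# Every rank starts from the same broadcast weights.
initial = [1.0, -2.0]
weights = [list(initial) for _ in range(3)]

# Each rank computes different local gradients from its own data shard.
local_grads = [[0.3, 0.6], [0.9, 0.0], [0.6, 0.3]]

avg = average(local_grads)  # identical on every rank after the all-reduce
weights = [sgd_step(w, avg) for w in weights]

print(all(w == weights[0] for w in weights))  # True
```

Each rank applies the same update to the same starting point, so no parameter broadcast is needed after the step.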

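0x02 Initialization

Because in Python member variables can be attached to a class at many points, we still start from __init__.

2.1 __init__

Its core logic is:

Set the device type.
Set the device IDs.
Set self.process_group; the default is GroupMember.WORLD.
Configure the various member variables.
Check the parameters.
Set the bucket sizes.
Build the parameters.
Broadcast the state_dict() of rank 0 to the other workers so that all workers start from the same model state.
Create the reducer.

The code is as follows: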

class DistributedDataParallel(Module):

    def __init__(
        self,
        module,
        device_ids=None,
        output_device=None,
        dim=0,
        broadcast_buffers=True,
        process_group=None,
        bucket_cap_mb=25,
        find_unused_parameters=False,
        check_reduction=False,
        gradient_as_bucket_view=False,
    ):

        super(DistributedDataParallel, self).__init__()

        # Set the device type
        self.is_multi_device_module = len({p.device for p in module.parameters()}) > 1
        distinct_device_types = {p.device.type for p in module.parameters()}
        self.device_type = list(distinct_device_types)[0]

        # Set the device IDs
        if (
            device_ids is None
            or len(device_ids) == 0  # For backward compatibility.
            or self.device_type == "cpu"
            or self.is_multi_device_module
        ):

            self.device_ids = None
            self.output_device = None
        else:
            self.device_ids = [_get_device_index(x, True) for x in device_ids]
            if output_device is None:
                output_device = device_ids[0]
            self.output_device = _get_device_index(output_device, True)

        # Set the process group
        if process_group is None:
            self.process_group = _get_default_group()
        else:
            self.process_group = process_group

        # Configure member variables
        self.static_graph = False
        self.dim = dim
        self.module = module
        self.device = list(self.module.parameters())[0].device
        self.broadcast_buffers = broadcast_buffers
        self.find_unused_parameters = find_unused_parameters
        self.require_backward_grad_sync = True
        self.require_forward_param_sync = True
        self.ddp_uneven_inputs_config = _DDPUnevenInputsConfig(
            ddp_join_enabled=False,
            ddp_join_divide_by_initial_world_size=False,
            ddp_join_throw_on_early_termination=False,
        )
        self.gradient_as_bucket_view = gradient_as_bucket_view
        if hasattr(module, "_ddp_params_and_buffers_to_ignore"):
            self.parameters_to_ignore = module._ddp_params_and_buffers_to_ignore
        else:
            self.parameters_to_ignore = []

        # Check parameters
        # Check that a module does not have Uninitialized parameters
        for param in module.parameters():
            if isinstance(param, torch.nn.parameter.UninitializedParameter):
                raise RuntimeError(
                    "Modules with uninitialized parameters can't be used with `DistributedDataParallel`. "
                    "Run a dummy forward pass to correctly initialize the modules"
                )
        # used for intra-node param sync and inter-node sync as well
        self.broadcast_bucket_size = int(250 * 1024 * 1024)

        # reduction bucket size
        self.bucket_bytes_cap = int(bucket_cap_mb * 1024 * 1024)
        # Whether to perform input tensor CPU to GPU copies on a side-stream
        self.use_side_stream_for_tensor_copies = (
            os.environ.get("PYTORCH_DDP_USE_SIDE_STREAM", "1") == "1"
        )

        # Build parameters
        # TODO(wayi@): Remove this field since SPMD is no longer supported,
        # and also remove all the relevant unnecessary loops.
        # Module replication within process (single-process multi device)
        # Note: this will no longer be supported in the future
        self._module_copies = [self.module]
        # Build parameters for reducer.
        parameters, expect_sparse_gradient = self._build_params_for_reducer()
        # Verify model equivalence.
        dist._verify_model_across_ranks(self.process_group, parameters)
        
        
        # Sync params and buffers. Ensures all DDP models start off at the same value.
        # Broadcast the state_dict() of rank 0 to the other workers so that all workers start from the same model state
        self._sync_params_and_buffers(authoritative_rank=0)
        
        # In debug mode, build a mapping of parameter index -> parameter.
        if dist._get_debug_mode() != dist._DistributedDebugLevel.OFF:
            param_to_name_mapping = self._build_param_to_name_mapping(parameters)
        else:
            param_to_name_mapping = {}
            
        # Builds reducer.
        self._ddp_init_helper(parameters, expect_sparse_gradient, param_to_name_mapping)

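Next we analyze some of the important steps.

2.2 Building parameters

For DDP, the first key step is building the parameters. Note that in the single-machine multi-GPU case run as a single process with multiple devices (the same setup as DP), the model would have to be replicated inside the process.

This mode will no longer be supported and is going to be removed. As a result, parameters is a list with one entry per module in [ToyModel], and parameters[0] holds the parameters of ToyModel. This comes up again later when BucketReplica is introduced.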

    # TODO(wayi@): Remove this field since SPMD is no longer supported,
    # and also remove all the relevant unnecessary loops.
    # Module replication within process (single-process multi device)
    
    self._module_copies = [self.module] # Build a list such as [ToyModel]
    # Build parameters for reducer.
    parameters, expect_sparse_gradient = self._build_params_for_reducer()

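Let us look at the important kinds of state in a model:

parameter: state that the optimizer updates using the gradients from the backward pass. It can be obtained through model.parameters().
buffer: state that the optimizer does not update, such as the running statistics of BatchNorm. It can be obtained through model.buffers().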

2.2.1 _build_params_for_reducer

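_build_params_for_reducer builds the parameters for the reducer. Its logic is roughly:

Iterate over _module_copies to obtain modules_and_parameters, a list of (module, parameter) pairs. These parameters must require gradients and must not be in the ignore list.
Use a set to remove parameters that may be shared across multiple modules.
Build a list of parameters.
Check whether each module expects a sparse gradient and store the results in expect_sparse_gradient.
Collect the module's parameters; together with the buffers below, they are used for synchronization to other workers.
Collect the module's buffers; modules_buffers is used later during synchronization.
Return the parameter list and expect_sparse_gradient.

# Earlier, during initialization, self._module_copies = [self.module] was set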

def _build_params_for_reducer(self):
        
        # Build tuple of (module, parameter) for all parameters that require grads.
        modules_and_parameters = [
            [
                (module, parameter)
                # Iterate over the submodules
                for module_name, module in replica.named_modules()
                # Collect the parameters that require gradients and are not in the ignore list
                for parameter in [
                    param
                    # Note that we access module.named_parameters instead of
                    # parameters(module). parameters(module) is only needed in the
                    # single-process multi device case, where it accesses replicated
                    # parameters through _former_parameters.
                    for param_name, param in module.named_parameters(recurse=False)
                    if param.requires_grad
                    and f"{module_name}.{param_name}" not in self.parameters_to_ignore
                ]
            ]
            for replica in self._module_copies
        ]

        # Deduplicate any parameters that might be shared across child modules.
        # Use a set to remove parameters that may be shared across multiple modules
        memo = set()
        modules_and_parameters = [
            # "p not in memo" is the deduplication check.
            # "not memo.add(p)" is always True, and it's only there to cause "add(p)" if needed.
            [(m, p) for m, p in replica_mps if p not in memo and not memo.add(p)]
            for replica_mps in modules_and_parameters
        ]

        # Build list of parameters.
        # Build a list of parameters
        parameters = [
            list(parameter for _, parameter in replica)
            for replica in modules_and_parameters
        ]

        # Checks if a module will produce a sparse gradient.
        def produces_sparse_gradient(module):
            if isinstance(module, torch.nn.Embedding) or isinstance(
                module, torch.nn.EmbeddingBag
            ):
                return module.sparse
            return False

        # Build list of booleans indicating whether or not to expect sparse
        # gradients for the corresponding parameters.
        # Whether each parameter expects sparse gradients
        expect_sparse_gradient = [
            list(produces_sparse_gradient(module) for module, _ in replica)
            for replica in modules_and_parameters
        ]

        # The following modules_params and modules_buffers are used for
        # param/buffer sync in _sync_params.
        # Collect the module's parameters; together with the buffers below they are synchronized to other workers
        self.modules_params = [
            list(self._get_parameters(m)) for m in self._module_copies
        ]
        # Collect buffers for modules, filtering out buffers that should be ignored.
        # Collect the module's buffers; modules_buffers is used later during synchronization
        named_module_buffers = [
            [(buffer, buffer_name) for buffer_name, buffer in m.named_buffers()]
            for m in self._module_copies
        ]
        self.modules_buffers = [
            [
                buffer
                for (buffer, buffer_name) in module_buffers
                if buffer_name not in self.parameters_to_ignore
            ]
            for module_buffers in named_module_buffers
        ]

        return parameters, expect_sparse_gradient

At this point parameters looks as follows. Only element [0] is meaningful, and element [0] itself contains 4 elements:

parameters = {list: 1} 
0 = {list: 4}           
 0 = {Parameter: 10} Parameter containing:\ntensor([[-4.0381e-02,  3.8828e-02, 1  )   
 1 = {Parameter: 10} Parameter containing:\ntensor([-0.0438, -0.2033,  0.2771,  0.0721,  ) 
 2 = {Parameter: 5} Parameter containing:\ntensor([[-0.0094, -0.1319,  0.0713,  0.3155,  )
 3 = {Parameter: 5} Parameter containing:\ntensor([-0.0008,  0.0582, -0.1245, -0.2538, )
 __len__ = {int} 4
__len__ = {int} 1
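
The deduplication idiom used in _build_params_for_reducer is compact enough to deserve a standalone illustration. The sketch below is pure Python; FakeParam and the module names are hypothetical stand-ins for real Parameter objects and submodules.

```python
class FakeParam:
    """Hypothetical stand-in for a Parameter; hashed by identity like a tensor."""

    def __init__(self, name):
        self.name = name


shared = FakeParam("shared.weight")
other = FakeParam("head.bias")

# The same parameter object is reachable through two child modules.
modules_and_parameters = [
    [("encoder", shared), ("decoder", shared), ("head", other)]
]

memo = set()
deduplicated = [
    # set.add returns None, so "not memo.add(p)" is always True; it only
    # records p the first time it passes the "p not in memo" check.
    [(m, p) for m, p in replica if p not in memo and not memo.add(p)]
    for replica in modules_and_parameters
]

print([(m, p.name) for m, p in deduplicated[0]])
# [('encoder', 'shared.weight'), ('head', 'head.bias')]
```

The first occurrence of a shared parameter is kept and later occurrences are dropped, so each parameter is registered with the reducer exactly once.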

2.2.2 modules_buffers

A side note: where is self.modules_buffers used? It is used later when buffers are broadcast, for example:

    # When running in join mode, checks and performs sync of module buffers if
    # the models have buffers that should be synchronized in the forward pass.
    def _check_and_sync_module_buffers(self):
        if self.will_sync_module_buffers():
            authoritative_rank = self._find_common_rank(self._distributed_rank, False)
            self._distributed_broadcast_coalesced(
                self.modules_buffers[0], self.broadcast_bucket_size, authoritative_rank
            )

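_find_common_rank is used here to agree on a single authoritative rank: every process contributes either its own rank or -1, and a MAX all-reduce returns the largest contributed rank to all processes.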

def _find_common_rank(self, input_rank, rank_cond):
    # -1 indicates that this rank is not under consideration to be the
    # common_rank
    rank_to_use = torch.tensor(
        [input_rank if rank_cond else -1],
        device=self.device,
    )
    # Use the MAX operation to obtain the largest value
    dist.all_reduce(rank_to_use, op=ReduceOp.MAX, group=self.process_group)
    if rank_to_use.item() == -1:
        raise ValueError(
            "BUG! Expected rank_cond to be true for at least one process."
        )
    return rank_to_use.item() # Return the agreed common rank (the largest one)
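
The behaviour of _find_common_rank can be simulated without a process group. In this sketch, all_reduce_max is a hypothetical stand-in for dist.all_reduce with ReduceOp.MAX, and the lists hold one entry per simulated rank.

```python
def all_reduce_max(values):
    """Stand-in for an all-reduce with MAX: every rank receives the maximum."""
    result = max(values)
    return [result for _ in values]


def find_common_rank(ranks, rank_conds):
    # Each rank contributes its own rank if its condition holds, else -1.
    contributions = [r if cond else -1 for r, cond in zip(ranks, rank_conds)]
    reduced = all_reduce_max(contributions)
    if reduced[0] == -1:
        raise ValueError("Expected rank_cond to be true for at least one process.")
    return reduced  # the same single rank on every process


# Ranks 0 and 2 satisfy the condition; ranks 1 and 3 do not.
print(find_common_rank([0, 1, 2, 3], [True, False, True, False]))  # [2, 2, 2, 2]
```

Every process ends up with the same value, which is what makes it usable as the source rank of the following broadcast.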

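2.3 Verifying the model

Next comes the model verification stage.

2.3.1 Background

Because the code below is used later, we first look at some background on broadcast. Readers unfamiliar with this part may wonder: why does broadcast send from rank 0 to the other ranks when every rank calls exactly the same broadcast code?

process_group->broadcast(vec)->wait(); // Broadcast the metadata of rank 0 to the corresponding devices

We go to torch/lib/c10d/ProcessGroupMPI.cpp. It uses the MPI_Bcast API of MPI to perform the broadcast, and opts.rootRank is the key.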

c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupMPI::broadcast(
    std::vector<at::Tensor>& tensors,
    const BroadcastOptions& opts) {
  checkSingleTensor(tensors);
  std::function<void(std::unique_ptr<WorkEntry>&)> runFunc =
      [opts, this](std::unique_ptr<WorkEntry>& entry) {
        auto data = (entry->src)[0];
        c10::DeviceGuard guard(data.device());
        std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
        MPI_CHECK(MPI_Bcast( // Call the MPI API
            data.data_ptr(),
            data.numel(),
            mpiDatatype.at(data.scalar_type()),
            opts.rootRank, // This is the key: data is broadcast only from the root to the other ranks
            pgComm_));
      };
  auto entry = std::make_unique<WorkEntry>(&tensors, &tensors, std::move(runFunc));
  return enqueue(
      std::move(entry),
      "mpi:broadcast",
      c10::optional<std::vector<at::Tensor>>(tensors));
}

opts is an instance of BroadcastOptions.

class BroadcastOptions:
    rootRank: int
    rootTensor: int
    timeout: timedelta

The corresponding definition on the C++ side is:

struct BroadcastOptions {
  int rootRank = 0;
  int rootTensor = 0;
  std::chrono::milliseconds timeout = kUnsetTimeout;
};

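The definition shows that the default member initializers of BroadcastOptions set rootRank to 0. Since default-constructed options are used, every rank calls MPI_Bcast with rootRank = 0, and the result is a broadcast from rank 0 to the other ranks.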

c10::intrusive_ptr<ProcessGroup::Work> broadcast(
    std::vector<at::Tensor>& data,
    const BroadcastOptions& opts = BroadcastOptions()) override;
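
The point that every rank runs the same call while only the root's data survives can be shown with a tiny simulation. The broadcast function below is a hypothetical model of the collective, not a c10d or MPI API: buffers[r] plays the role of the buffer that rank r passes in.

```python
def broadcast(buffers, root_rank=0):
    """Simulate a collective broadcast over per-rank buffers.

    `buffers[r]` is the buffer that rank r passes to the call. Every rank
    passes the same root_rank, so only the root's contents survive.
    """
    source = list(buffers[root_rank])
    for rank, buf in enumerate(buffers):
        if rank != root_rank:
            buf[:] = source  # non-root ranks receive into their own buffer
    return buffers


per_rank = [[1, 2, 3], [7, 7, 7], [9, 9, 9]]  # different data on each rank
broadcast(per_rank)  # default root_rank=0, mirroring BroadcastOptions()
print(per_rank)  # [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
```

The root acts as sender and every other rank as receiver, even though all of them execute the same line of code.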

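2.3.2 Code

Next we look at how the model is verified.

_verify_model_across_ranks verifies that the parameters of the model (replica 0) have the same sizes/strides across processes, by broadcasting the metadata of rank 0 and comparing it with the local values.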

    # Verify model equivalence.
    dist._verify_model_across_ranks(self.process_group, parameters)

The code below shows that _verify_model_across_ranks actually calls verify_replica0_across_processes.

module.def(
    "_verify_model_across_ranks",
    &::c10d::verify_replica0_across_processes,
    py::arg("process_group"),
    py::arg("replicas"),
    py::call_guard<py::gil_scoped_release>());

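In verify_replica0_across_processes, the argument model_replicas is the parameters built earlier. Its logic is:

First, build metadata from model_replicas.
Then clone metadata into metadata_dev.
Then broadcast the metadata_dev of process 0 to the corresponding devices.
- Every process runs the same code, but in process_group->broadcast every process uses rank 0 as the root rank, so only the data of rank 0 is broadcast.
- After the broadcast, metadata_dev is identical on all processes: it holds the data of process 0.
Then copy metadata_dev back into control and compare control with model_replicas[0] to check that it matches the local values.
- Check that control matches the sizes and strides of model_replicas[0].
- An accessor is used here. LibTorch uses accessors for fast element access to a Tensor: accessor for tensors on the CPU and packed_accessor for tensors on the GPU. This is mentioned in "核心開發者全面解讀PyTorch 內部機制".

The code is as follows: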

// Verifies corresponding params in replica 0 have the same sizes/strides
// across processes.
void verify_replica0_across_processes(
    c10::intrusive_ptr<c10d::ProcessGroup> process_group,
    std::vector<std::vector<at::Tensor>> model_replicas) {
  size_t i = 0;
  for (const auto& t : model_replicas[0]) {
    i += 2 * t.dim();
  }
  at::TensorOptions options;
  options = options.dtype(at::kLong);
  auto metadata = at::empty({static_cast<long>(i)}, options);

  // Technically, process 0 is the broadcast source, so only process 0 needs
  // to populate metadata.  But no harm keeping work aligned across processes.
  auto metadata_accessor = metadata.accessor<int64_t, 1>();
  i = 0;
  // Copy the sizes and strides of model_replicas[0] into metadata_accessor, i.e. into metadata
  for (const auto& t : model_replicas[0]) {
    for (const auto& sz : t.sizes()) {
      metadata_accessor[i++] = sz;
    }
    for (const auto& str : t.strides()) {
      metadata_accessor[i++] = str;
    }
  }

  // Then clone metadata into metadata_dev
  auto metadata_dev = metadata.clone().to(model_replicas[0][0].device());
  std::vector<at::Tensor> vec{metadata_dev};
  // Broadcast metadata_dev
  process_group->broadcast(vec)->wait(); // Broadcast the metadata of process 0 to the corresponding devices

  // After this, metadata_dev holds the same contents on every process
  // Technically, process 0 doesn't need to double-check metadata, because it
  // was the source.  But no harm keeping work aligned.
  auto control = at::empty({static_cast<long>(i)}, options);
  // Copy metadata_dev back into control
  control.copy_(metadata_dev, /*non_blocking=*/false);

  // Then compare control with model_replicas[0] to check that they match
  auto control_accessor = control.accessor<int64_t, 1>();
  i = 0;
  for (size_t p = 0; p < model_replicas[0].size(); p++) {
    const auto& t = model_replicas[0][p];
    // I'd like to include which process we are in the message,
    // but ProcessGroup::getRank is not public!
    for (const auto& sz : t.sizes()) {
      TORCH_CHECK(
          sz == control_accessor[i++],
          "replicas[0][",
          p,
          "] in this process"
          " with sizes ",
          t.sizes(),
          " appears not to match sizes of the same param in process 0.");
    }
    for (const auto& str : t.strides()) {
      TORCH_CHECK(
          str == control_accessor[i++],
          "replicas[0][",
          p,
          "] in this process"
          " with strides ",
          t.strides(),
          " appears not to match strides of the same param in process 0.");
    }
  }
}
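
The same check can be sketched in pure Python, with shapes as tuples instead of tensors. The helper names (contiguous_strides, build_metadata, verify_against_rank0) are hypothetical, the strides are assumed to be those of contiguous tensors, and the explicit length check is an addition of this sketch rather than something the C++ function does.

```python
def contiguous_strides(sizes):
    """Row-major strides for a shape, as a contiguous tensor would have."""
    strides, step = [], 1
    for size in reversed(sizes):
        strides.append(step)
        step *= size
    return list(reversed(strides))


def build_metadata(shapes):
    """Flatten sizes followed by strides for every parameter, in order."""
    metadata = []
    for sizes in shapes:
        metadata.extend(sizes)
        metadata.extend(contiguous_strides(sizes))
    return metadata


def verify_against_rank0(local_shapes, control):
    """Compare local sizes/strides with the metadata broadcast from rank 0."""
    local = build_metadata(local_shapes)
    if len(local) != len(control):
        raise RuntimeError("parameter metadata length differs from rank 0")
    for i, (mine, theirs) in enumerate(zip(local, control)):
        if mine != theirs:
            raise RuntimeError(f"metadata entry {i}: {mine} != {theirs}")


rank0_shapes = [(10, 10), (10,), (5, 10), (5,)]  # ToyModel parameter shapes
control = build_metadata(rank0_shapes)            # what rank 0 would broadcast

verify_against_rank0(rank0_shapes, control)       # passes silently
try:
    verify_against_rank0([(10, 10), (10,), (5, 8), (5,)], control)
except RuntimeError as err:
    print("mismatch detected:", err)
```

Each tensor contributes 2 * dim entries, which matches how the C++ code sizes the metadata tensor.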

2.4 Broadcasting state

The next step is broadcasting state: the initial model parameters and buffers are broadcast from rank 0 to the other ranks.

    # Sync params and buffers. Ensures all DDP models start off at the same value.
    # Broadcast the state_dict() of rank 0 to the other workers so that all workers start from the same model state
    self._sync_params_and_buffers(authoritative_rank=0)

2.4.1 state_dict

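Let us first look at what needs to be broadcast.

A PyTorch state_dict is a dictionary object that maps each layer of the model to its parameters, such as the weights and biases of each layer. Only layers that own parameters or persistent buffers (for example convolution, linear, and BatchNorm layers) appear in the state_dict; layers without any state of their own, such as pooling and activation layers, do not. Consider the following model.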

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

Its state_dict is:

self.module.state_dict() = {OrderedDict: 4} 
 'net1.weight' = {Tensor: 10} tensor([[ 0.2687,  0.0840, -0.1032,  0.3079,  0.0385, -0.0495, -0.3068, -0.1271,\n         -0.1067, -0.1966],\n        [-0.1203,  0.1789,  0.0666,  0.1882,  0.1335,  0.1921, -0.1145, -0.1781,\n          0.0661, -0.2339],\n        [ 0.1865, -0.2076,  0.2071,  0
 'net1.bias' = {Tensor: 10} tensor([ 0.2146, -0.1599,  0.2350, -0.2843, -0.0773, -0.2151,  0.1864, -0.3068,\n        -0.2093,  0.1365])
 'net2.weight' = {Tensor: 5} tensor([[ 0.1922, -0.0148, -0.1884,  0.2124, -0.1361,  0.0172, -0.2371,  0.1946,\n          0.2047, -0.2697],\n        [-0.2690,  0.1372,  0.2269,  0.0436, -0.1353, -0.2054, -0.2418, -0.2300,\n          0.1987,  0.0007],\n        [ 0.0995, -0.2659, -0.2374, -0
 'net2.bias' = {Tensor: 5} tensor([0.1488, 0.0791, 0.1667, 0.1449, 0.0545])

2.4.2 _sync_params_and_buffers

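_sync_params_and_buffers collects the tensors in the module's state_dict (parameters and persistent buffers), skipping those in the ignore list, and then broadcasts them.

The code is: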

    def _sync_params_and_buffers(self, authoritative_rank=0):
        module_states = []
        for name, param in self.module.state_dict().items():
            if name not in self.parameters_to_ignore:
                module_states.append(param)

# module_states = {list: 4} [tensor([[ 0.2687,  0.0840, -0.1032,  0.3079,  0.0385, -0.0495, -0.3068, -0.1271,\n         -0.1067, -0.1966],\n        [-0.1203,  0.1789,  0.0666,  0.1882,  0.1335,  0.1921, -0.1145, -0.1781,\n          0.0661, -0.2339],\n        [ 0.1865, -0.2076,  0.2071,  
                
        if len(module_states) > 0:
            self._distributed_broadcast_coalesced(
                module_states, self.broadcast_bucket_size, authoritative_rank
            )

As we can see, _distributed_broadcast_coalesced calls dist._broadcast_coalesced.

import torch.distributed as dist

def _distributed_broadcast_coalesced(
        self, tensors, buffer_size, authoritative_rank=0
    ):
        dist._broadcast_coalesced(
            self.process_group, tensors, buffer_size, authoritative_rank
        )

2.4.3 dist._broadcast_coalesced

Following the code, we first arrive at torch\distributed\__init__.py, which imports _broadcast_coalesced.

if is_available():
    from torch._C._distributed_c10d import (
        Store,
        FileStore,
        TCPStore,
        ProcessGroup,
        PrefixStore,
        Reducer,
        Logger,
        BuiltinCommHookType,
        GradBucket,
        _DEFAULT_FIRST_BUCKET_BYTES,
        _register_comm_hook,
        _register_builtin_comm_hook,
        _broadcast_coalesced, # imported here
        _compute_bucket_assignment_by_size,
        _verify_model_across_ranks,
        _test_python_store,
        _DistributedDebugLevel,
        _get_debug_mode
    )
    if sys.platform != 'win32':
        from torch._C._distributed_c10d import (
            HashStore,
            _round_robin_process_groups,
        )

    from .distributed_c10d import *  # noqa: F403
    # Variables prefixed with underscore are not auto imported
    # See the comment in `distributed_c10d.py` above `_backend` on why we expose
    # this.

    from .distributed_c10d import _backend, _all_gather_base

We continue to torch\csrc\distributed\c10d\init.cpp.

  module.def(
      "_broadcast_coalesced",
      // Define a lambda such that the pybind11 prototype can take a std::vector
      // for the tensor list argument, but still pass it to the underlying
      // function as a c10::ArrayRef.
      [](c10::intrusive_ptr<::c10d::ProcessGroup> process_group,
         std::vector<at::Tensor> tensors, // NOLINT
         size_t buffer_size,
         int rank) {
        broadcast_coalesced( // here
            std::move(process_group), tensors, buffer_size, rank);
      },
      py::arg("process_group"),
      py::arg("tensors"),
      py::arg("buffer_size"),
      // The source of truth rank to broadcast the tensors from.
      py::arg("src") = 0,
      py::call_guard<py::gil_scoped_release>());

Finally we arrive at torch/lib/c10d/comm.cpp, where ProcessGroup is used to broadcast the tensors.

// Broadcast many tensors to all processes in the process group.
void broadcast_coalesced(
    c10::intrusive_ptr<c10d::ProcessGroup> process_group,
    at::TensorList tensors,
    size_t buffer_size,
    int rank) {
  // Coalesce tensors into buckets taking into account the maximum buffer size.
  // This routine is multi-device aware, so the tensors can be split across
  // multiple devices and can contain a mix of CPU and CUDA tensors.
  // First compute the buckets
  const auto buckets =
      compute_bucket_assignment_by_size(tensors.vec(), {buffer_size});

  // Returns tensor at specified index in input tensor list.
  const auto lookup = [&tensors](size_t index) { return tensors[index]; };

  // We maintain a maximum of 2 in flight broadcast operations to avoid
  // allocating too much memory (in case the specified tensors are very large).
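  std::deque<BroadcastWork> in_flight; // Queue of in-flight broadcast work
  constexpr auto max_in_flight = 2;
  for (const auto& bucket : buckets) { // Iterate over the buckets
    if (in_flight.size() >= max_in_flight) { // As the comment above says, at most 2 broadcasts are in flight, which limits memory usage
      in_flight.front().finish(); // Wait for the oldest broadcast and copy its result back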
      in_flight.pop_front();
    }

    in_flight.emplace_back(process_group, c10::fmap(bucket, lookup), rank);
  }

  while (!in_flight.empty()) {
    in_flight.front().finish();
    in_flight.pop_front();
  }
}

A note on BroadcastWork: it uses ProcessGroup to broadcast the tensors. For details on ProcessGroup, see the earlier articles in this series.

class BroadcastWork {
 public:
  BroadcastWork(
      const c10::intrusive_ptr<c10d::ProcessGroup>& process_group,
      std::vector<at::Tensor> bucket_tensors,
      int root_rank = 0)
      : bucket_tensors_(std::move(bucket_tensors)),
        flat_tensor_({torch::utils::flatten_dense_tensors(bucket_tensors_)}) {
    BroadcastOptions broadcastOptions;
    broadcastOptions.rootRank = root_rank;
    work_ = process_group->broadcast(flat_tensor_, broadcastOptions);
  }

  void finish() {
    work_->wait();

    // Copy the output of the broadcast operation back.
    auto output_tensors = torch::utils::unflatten_dense_tensors(
        flat_tensor_.front(), bucket_tensors_);
    TORCH_INTERNAL_ASSERT(output_tensors.size() == bucket_tensors_.size());
    for (size_t i = 0; i < output_tensors.size(); i++) {
      bucket_tensors_[i].copy_(output_tensors[i], /*non_blocking=*/true);
    }
  }

 protected:
  // The list of tensors to broadcast. They are guaranteed to be
  // placed on the same device and have the same dtype.
  std::vector<at::Tensor> bucket_tensors_;

  // The vector with a single flattened tensor containing the contents
  // of the tensors in bucket_tensors_. It must be stored in a vector
  // because c10d::ProcessGroup::broadcast takes a vector argument.
  std::vector<at::Tensor> flat_tensor_;

 private:
  // The broadcast work that is kicked off upon construction.
  c10::intrusive_ptr<c10d::ProcessGroup::Work> work_;
};
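
The flatten, broadcast, unflatten pattern with a bounded number of in-flight operations can be sketched in pure Python. ToyBroadcastWork and this broadcast_coalesced are hypothetical analogues working on plain lists: the "broadcast" is simulated by reading the root's tensors directly, and the bucket assignment is passed in rather than computed.

```python
from collections import deque


class ToyBroadcastWork:
    """Hypothetical analogue of BroadcastWork for plain Python lists."""

    def __init__(self, root_tensors, local_tensors):
        self.local_tensors = local_tensors
        # Flatten the root's bucket into one buffer: a single "broadcast".
        self.flat = [x for t in root_tensors for x in t]

    def finish(self):
        # Unflatten the received buffer back into the local tensors, in place.
        offset = 0
        for t in self.local_tensors:
            t[:] = self.flat[offset:offset + len(t)]
            offset += len(t)


def broadcast_coalesced(root_tensors, local_tensors, buckets, max_in_flight=2):
    in_flight = deque()
    peak = 0
    for bucket in buckets:
        if len(in_flight) >= max_in_flight:
            in_flight.popleft().finish()  # wait for the oldest broadcast first
        in_flight.append(ToyBroadcastWork(
            [root_tensors[i] for i in bucket],
            [local_tensors[i] for i in bucket],
        ))
        peak = max(peak, len(in_flight))
    while in_flight:
        in_flight.popleft().finish()
    return peak


root = [[1, 2], [3], [4, 5, 6], [7]]
local = [[0, 0], [0], [0, 0, 0], [0]]
peak = broadcast_coalesced(root, local, buckets=[[0, 1], [2], [3]])
print(local)  # [[1, 2], [3], [4, 5, 6], [7]]
print(peak)   # 2
```

Capping the queue at two entries bounds the extra memory held by flattened buffers while still letting one broadcast overlap with the preparation of the next.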

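2.5 Initialization helper

Next, _ddp_init_helper is called to perform the rest of the initialization.

2.5.1 _ddp_init_helper

_ddp_init_helper is the function that initializes the working state. Its main logic is:

Bucket the parameters. Parameters are assigned to buckets, as far as possible, in the reverse order of the forward pass, because gradients of parameters used last in the forward pass are produced first in the backward pass. This improves communication and reduction speed.
Reset the bucketing state.
Create a Reducer, which internally registers autograd_hook to synchronize gradients during the backward pass.
Configure logging.
Pass a DDP handle to the SyncBatchNorm layers.

The code is as follows: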

    def _ddp_init_helper(self, parameters, expect_sparse_gradient, param_to_name_mapping):
        """
        Initialization helper function that does the following:
        (1) bucketing the parameters for reductions
        (2) resetting the bucketing states
        (3) registering the grad hooks
        (4) Logging construction-time DDP logging data
        (5) passing a handle of DDP to SyncBatchNorm Layer
        """
        self.num_iterations = 0
        # The bucket size limit is specified in the constructor.
        # Additionally, we allow for a single small bucket for parameters
        # that are defined first, such that their gradients don't spill into
        # a much larger bucket, adding unnecessary latency after gradient
        # computation finishes. Experiments showed 1MB is a reasonable value.
        bucket_indices = dist._compute_bucket_assignment_by_size(
            parameters[0],
            [dist._DEFAULT_FIRST_BUCKET_BYTES, self.bucket_bytes_cap],
            expect_sparse_gradient[0],
        )

        # Note: reverse list of buckets because we want to approximate the
        # order in which their gradients are produced, and assume they
        # are used in the forward pass in the order they are defined.
        self.reducer = dist.Reducer(
            parameters,
            list(reversed(bucket_indices)), # use the bucket indices
            self.process_group,
            expect_sparse_gradient,
            self.bucket_bytes_cap,
            self.find_unused_parameters,
            self.gradient_as_bucket_view,
            param_to_name_mapping,
        )

        self.logger = dist.Logger(self.reducer)

        # Set logging data that can be got during construction time.
        self.logger.set_construction_data_and_log(
            self.module.__class__.__name__,
            [] if self.device_ids is None else self.device_ids,
            -1 if self.output_device is None else self.output_device,
            self.broadcast_buffers,
        )

        # passing a handle to torch.nn.SyncBatchNorm layer
        self._passing_sync_batchnorm_handle(self._module_copies)

2.5.2 Computing the bucket assignment

First, _compute_bucket_assignment_by_size performs the bucketing. Here parameters[0] is the corresponding list of tensors.

_DEFAULT_FIRST_BUCKET_BYTES = 1048576
# reduction bucket size
self.bucket_bytes_cap = int(bucket_cap_mb * 1024 * 1024)
        
bucket_indices = dist._compute_bucket_assignment_by_size(
            parameters[0],
            # The bucket size limits are given as a list
            [dist._DEFAULT_FIRST_BUCKET_BYTES, self.bucket_bytes_cap],
            expect_sparse_gradient[0],
        )

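2.5.2.1 Paper content

Next we analyze this together with the paper.

The idea of gradient bucketing is based on the observation that collective communications are more efficient on large tensors.

Experiments show that DDP achieves higher throughput and lower latency if it waits for a short time and buckets multiple gradients into one AllReduce operation, instead of launching a dedicated AllReduce as soon as each gradient tensor becomes available. This is especially helpful for models with many small parameters. However, DDP should not communicate all gradients in one single AllReduce, otherwise no communication can start before the computation is over.

Parameter-to-Bucket Mapping has a considerable impact on DDP speed. In every backward pass, tensors are copied from all parameter gradients into buckets, and the averaged gradients are copied back out of the buckets after AllReduce. To speed up the copy operations, buckets are always created on the same device as the parameters. If the model spans multiple devices, DDP takes device affinity into account to make sure that all parameters in the same bucket are on the same device. The order of AllReduce also matters, because it determines how much communication can overlap with computation. DDP launches AllReduce in the reverse order of model.parameters().

So, to improve communication efficiency, the Reducer in DDP organizes parameter gradients into buckets and reduces one bucket at a time. The mapping from parameter gradients to buckets is determined at construction time from the bucket size limit and the parameter sizes. Users can configure the bucket size by setting bucket_cap_mb.

Model parameters are assigned to buckets in (roughly) the reverse order of Model.parameters() for the given model. The reasons for using the reverse order are:

The backward pass runs in the reverse order of the forward computation.
DDP expects gradients to become ready during the backward pass in approximately that reverse order.

2.5.2.2 Grouping key

DDP groups tensors using their type and device as the key: tensors on different devices should not share a bucket, and a bucket should hold tensors of a single type. Using type and device as the key guarantees that each bucket only contains tensors of the same type on the same device.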

// Tensors may be coalesced into buckets. Buckets must contain tensors of
// the same type, on the same device, so a bucket can identified by a
// composite key of a tensor's type identifier and its device.
struct BucketKey {
  BucketKey(c10::ScalarType type, c10::Device device)
      : type(std::move(type)), device(std::move(device)) {}

  const c10::ScalarType type;
  const c10::Device device;

  // See torch/csrc/utils/hash.h for dispatch code.
  static size_t hash(const BucketKey& key) {
    return c10::get_hash(key.type, key.device); // Use type and device as the key
  }
};

2.5.2.3 compute_bucket_assignment_by_size

Its key structures are shown below. A BucketAccumulator can be regarded as an actual bucket.

struct BucketAccumulator {
    std::vector<size_t> indices; // Bucket contents: the list of tensor indices
    size_t size = 0; // Accumulated bucket size in bytes
  }; // Logical contents of a bucket

  // Keep vector of indices and size accumulator by tensor type and device.
std::unordered_map<BucketKey, BucketAccumulator, c10::hash<BucketKey>>
      buckets; // All buckets; each actual bucket can be regarded as a BucketAccumulator

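Let us look at the logic of compute_bucket_assignment_by_size:

Define bucket_size_limit_iterators, which tracks the current bucket size limit for each key.
Define buckets, the collection of all buckets; each actual bucket can be regarded as a BucketAccumulator.
Iterate over all the tensors passed in:
- Give each tensor an index, counting up from 0 to tensors.size() - 1; if tensor_indices was passed in, take the tensor's index from it instead.
- If a sparse gradient is expected for the tensor, put the tensor into a bucket of its own, because it cannot be grouped with other tensors.
- Build the bucket key from the tensor's information and find the corresponding bucket.
  - Take the BucketAccumulator and append the index of the new tensor to its indices, which is the list of tensor indices.
  - Increase the size of the bucket accordingly.
- If necessary, initialize the size limit iterator for this key to the first limit.
- Get the current size limit.
- If the bucket size has reached the current size limit, the bucket is full and accumulation has to move on to a new bucket.
  - Logically a new bucket is started, but the same accumulator keeps being used: the type and device are unchanged, so accumulation continues under the same key. The indices of the old bucket have been moved into result, which effectively empties it.
  - Append the bucket contents to result; that is, once a bucket is full, it is flushed into result.
  - Recreate the bucket. bucket is a reference, so assigning to it directly empties the existing accumulator: the same slot keeps being used, but its previous indices now live in result.
  - Advance to the next size limit.
After the loop, append the indices remaining in the accumulators to result, since some buckets were already flushed into result earlier.
Sort result:
- If tensor_indices is not empty, the tensors are already in the order in which their gradients become ready, so no sorting is needed.
- If tensor_indices is empty, sort by the smallest tensor index in each bucket. This assumes that the order of the tensors is the order in which they are used (in other words, the reverse of the order in which their gradients are produced). This sorting ensures that buckets become ready in consecutive order.
- Note that this is ascending order; the list is only reversed when the Reducer is created: list(reversed(bucket_indices)).
Finally return result. Each vector in result corresponds to one bucket and holds tensor indices, sorted in ascending order.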
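
The bucketing steps can be sketched in pure Python. This is a simplified model under stated assumptions, not the C++ function: tensors are represented as hypothetical (dtype, device, nbytes) tuples, tensor_indices is not supported, and the byte sizes in the example are tiny on purpose so the result is easy to follow.

```python
def compute_bucket_assignment_by_size(tensors, bucket_size_limits,
                                      expect_sparse_gradient=()):
    """tensors: list of (dtype, device, nbytes) tuples (hypothetical format)."""
    result = []
    buckets = {}      # (dtype, device) -> [indices, accumulated bytes]
    limit_index = {}  # (dtype, device) -> position in bucket_size_limits

    for index, (dtype, device, nbytes) in enumerate(tensors):
        if expect_sparse_gradient and expect_sparse_gradient[index]:
            result.append([index])  # sparse gradients get their own bucket
            continue
        key = (dtype, device)
        bucket = buckets.setdefault(key, [[], 0])
        bucket[0].append(index)
        bucket[1] += nbytes
        limit_index.setdefault(key, 0)
        if bucket[1] >= bucket_size_limits[limit_index[key]]:
            result.append(bucket[0])
            buckets[key] = [[], 0]  # start a fresh accumulator for this key
            if limit_index[key] + 1 < len(bucket_size_limits):
                limit_index[key] += 1  # advance to the next size limit

    # Flush whatever is left in the accumulators.
    result.extend(b[0] for b in buckets.values() if b[0])
    # Sort buckets by their smallest tensor index.
    result.sort(key=min)
    return result


tensors = [("float32", "cuda:0", n) for n in (3, 2, 6, 5, 1)]
bucket_indices = compute_bucket_assignment_by_size(tensors, [4, 10])
print(bucket_indices)                  # [[0, 1], [2, 3], [4]]
print(list(reversed(bucket_indices)))  # [[4], [2, 3], [0, 1]]
```

The first limit (4 bytes here, 1 MB in DDP) produces a small first bucket, after which the larger cap applies; reversing the list gives the order that is handed to the Reducer.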

std::vector<std::vector<size_t>> compute_bucket_assignment_by_size(
    const std::vector<at::Tensor>& tensors,
    const std::vector<size_t>& bucket_size_limits, // bucket size limits
    const std::vector<bool>& expect_sparse_gradient,
    const std::vector<int64_t>& tensor_indices) { // in practice, tensor_indices is not passed in during initialization
  // Either expect_sparse_gradient is not specified or it has as many elements
  // as the vector with tensors.
  TORCH_INTERNAL_ASSERT(
      expect_sparse_gradient.empty() ||
      (tensors.size() == expect_sparse_gradient.size()));
  TORCH_INTERNAL_ASSERT(tensors.size() > 0);

  std::vector<std::vector<size_t>> result;
  result.reserve(tensors.size()); // reserve capacity up front

  // Keep iterator into the size_limit vector by tensor type and device.
  // This is done so that we can use the consecutive bucket limits per type.
  std::unordered_map<
      BucketKey,
      std::vector<size_t>::const_iterator,
      c10::hash<BucketKey>>
      bucket_size_limit_iterators;

  // Local accumulator type for a single bucket.
  struct BucketAccumulator {
    std::vector<size_t> indices; // bucket contents: the list of tensor indices
    size_t size = 0; // accumulated bucket size in bytes
  }; // logical contents of one bucket

  // Keep vector of indices and size accumulator by tensor type and device.
  std::unordered_map<BucketKey, BucketAccumulator, c10::hash<BucketKey>>
      buckets; // all buckets; each BucketAccumulator can be regarded as one actual bucket

  for (size_t i = 0; i < tensors.size(); i++) { // iterate over all tensors passed in
    const auto& tensor = tensors[i]; // the current tensor
    TORCH_CHECK(!tensor.is_sparse(), "No support for sparse tensors.");

    // when tensor_indices is empty, the index of tensors[i] assigned to
    // bucket is i, otherwise the tensor index is tensor_indices[i].
    auto tensor_index = i; // by default, tensors are indexed 0 .. tensors.size() - 1
    if (!tensor_indices.empty()) {
      tensor_index = tensor_indices[i]; // use the caller-supplied index if there is one
    }
    // If we expect a sparse gradient to be produced for this tensor, it cannot
    // be grouped together with other gradients and gets its own bucket.
    if (!expect_sparse_gradient.empty() &&
        expect_sparse_gradient[tensor_index]) {
      result.push_back({tensor_index});
      continue;
    }

    auto key = BucketKey(tensor.scalar_type(), tensor.device()); // build the bucket key from the tensor's type and device
    auto& bucket = buckets[key]; // look up the bucket (a BucketAccumulator) for this key
    bucket.indices.push_back(tensor_index); // append the tensor's index to the bucket's index list
    bucket.size += tensor.numel() * tensor.element_size(); // add the tensor's size in bytes

    // Initialize bucket size limit iterator if necessary.
    if (bucket_size_limit_iterators.count(key) == 0) {
      bucket_size_limit_iterators[key] = bucket_size_limits.begin();
    }

    // bucket_size_limits holds the size caps, i.e.
    // [_DEFAULT_FIRST_BUCKET_BYTES, int(bucket_cap_mb * 1024 * 1024)]
    auto& bucket_size_limit_iterator = bucket_size_limit_iterators[key];
    const auto bucket_size_limit = *bucket_size_limit_iterator; // the limit currently in effect for this key
    if (bucket.size >= bucket_size_limit) {
      // The bucket has reached its current size limit, so it is time to move
      // to a new bucket. Logically this is a new bucket, but the same map
      // entry keeps being used because type and device are unchanged; its old
      // indices are moved into result, which effectively empties it.
      result.emplace_back(std::move(bucket.indices)); // flush the full bucket into result
      bucket = BucketAccumulator(); // bucket is a reference, so this resets the map entry in place

      // Advance to the next bucket size limit for this type/device.
      auto next = bucket_size_limit_iterator + 1;
      if (next != bucket_size_limits.end()) {
        bucket_size_limit_iterator = next;
      }
    }
  }

  // Add remaining buckets. Full buckets were already moved into result above,
  // so only the leftover indices are appended here.
  for (auto& it : buckets) {
    auto& bucket = it.second;
    if (!bucket.indices.empty()) {
      result.emplace_back(std::move(bucket.indices));
    }
  }

  // If tensor_indices is not empty, the order of the tensors is in the gradient
  // ready order, so no need to sort.
  // If tensor_indices is empty, sort resulting buckets by the minimum tensor
  // index they include. We assume that the order of the tensors is the order in
  // which they are used (or the reverse order in which their gradients are
  // produced). This sorting step ensures that the buckets are ready in
  // consecutive order.
  // Note: the sort here is ascending. The order is reversed only when the
  // Reducer is created: list(reversed(bucket_indices))
  if (tensor_indices.empty()) {
    std::sort(
        result.begin(),
        result.end(),
        [](const std::vector<size_t>& a, const std::vector<size_t>& b) {
          // Compare two buckets by the smallest tensor index each contains.
          const auto amin = std::min_element(a.begin(), a.end()); // smallest index in a
          const auto bmin = std::min_element(b.begin(), b.end()); // smallest index in b
          return *amin < *bmin;
        });
  }

  return result; // each vector is one bucket of tensor indices, in ascending order
}
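
The final sort matters because buckets are flushed at different times: a bucket that fills up early is pushed into result before the leftovers of another type/device group, even if those leftovers contain smaller indices. A small Python sketch with made-up bucket contents shows the effect of ordering by the minimum index; the variable names are chosen for this illustration only.

```python
# Hypothetical flush order: [2, 3] filled up first, then [0, 4] and [1]
# were appended as leftovers.
flushed = [[2, 3], [0, 4], [1]]

# Same criterion as the C++ comparator: compare buckets by their smallest index.
ordered = sorted(flushed, key=min)

print(ordered)  # prints [[0, 4], [1], [2, 3]]
```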

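One thing to note: the tensors argument passed in is parameters[0], and parameters[0] follows the order returned by parameters(). The buckets are therefore built in forward order here and reversed only when they are handed to the Reducer, so model parameters end up assigned to buckets in (roughly) the reverse order of Model.parameters() for the given model. The reverse order is used because DDP expects gradients to become ready in approximately that order during the backward pass. In the end, DDP launches AllReduce in the reverse order of model.parameters().

The final result looks like the following. Each vector corresponds to one bucket and holds tensor indices, ordered from smallest to largest.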

+-----------------------------------------------------------------------+
|                                                                       |
|  <tensor index 1, tensor index 2, tensor index 3, tensor index 4>     |
|                                                                       |
|                                                                       |
|  <tensor index 5, tensor index 6, tensor index 7>                     |
|                                                                       |
|                                                                       |
|  ......                                                               |
|                                                                       |
|                                                                       |
|  <tensor index 8, tensor index 9, tensor index 10, tensor index 11>   |
|                                                                       |
+-----------------------------------------------------------------------+

2.5.3 Reducer

The next piece of code creates a Reducer.

    self.reducer = dist.Reducer(
        parameters,
        list(reversed(bucket_indices)), # bucket indices, passed in reverse order
        self.process_group,
        expect_sparse_gradient,
        self.bucket_bytes_cap,
        self.find_unused_parameters,
        self.gradient_as_bucket_view,
        param_to_name_mapping,
    )
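
To see what the reversal does, consider a minimal sketch with made-up values. The bucket_indices list below is hypothetical and stands in for the output of the bucketing step.

```python
# Hypothetical output of the bucketing step: buckets in forward (ascending) order.
bucket_indices = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

# Only the order of the buckets is reversed; indices inside a bucket keep their order.
reducer_buckets = list(reversed(bucket_indices))

print(reducer_buckets)  # prints [[7, 8, 9, 10], [4, 5, 6], [0, 1, 2, 3]]
```

The bucket holding the last parameters now comes first, which matches the expectation that their gradients become ready earliest in the backward pass.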

We will cover the Reducer in detail in a later article.