[Source Code Analysis] PyTorch Distributed (13) ----- Backward Propagation in DistributedDataParallel

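backward() is called directly on the loss. That is the autograd engine's job and lies outside DDP's control, so DDP uses hooks to achieve its goal.
- DDP registers autograd hooks at construction time.
- The autograd engine computes the gradients.
- When a gradient is ready, the corresponding DDP hook on that gradient accumulator fires.
The all-reduce is driven from autograd_hook. Given a parameter index param_index, the hook uses it to look up the parameter and mark it ready; once every gradient in a bucket is ready, the bucket is ready.
When all gradients in a bucket are ready, the Reducer launches an asynchronous allreduce on that bucket to compute the mean of the gradients across all processes.
When all buckets are ready, the Reducer blocks until every allreduce has completed, and then writes the averaged gradients into the param.grad field of every parameter.
The gradients of all processes are reduced, so every process applies the same update and the model weights stay identical. After backward propagation completes, the grad fields of the same parameter on different DDP processes should therefore be equal.
After the gradients are reduced, they are handed back to the autograd engine.
Unlike DP, there is no need to broadcast parameters after every iteration. Buffers, however, still have to be broadcast from the rank 0 process to the other processes in every iteration.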
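
To make the effect of the reduction concrete, here is a minimal sketch that simulates several ranks inside one process. The function allreduce_mean is a hypothetical stand-in, not a PyTorch API: it sums the per-rank gradients and divides by the number of ranks, which is mathematically what an allreduce plus averaging produces. It assumes every rank holds a gradient of the same length.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for allreduce(SUM) followed by a division by the
// world size. grads[r] is the flattened gradient held by simulated rank r.
// Assumption: all ranks hold gradients of identical length.
void allreduce_mean(std::vector<std::vector<double>>& grads) {
  if (grads.empty()) return;
  const std::size_t world_size = grads.size();
  const std::size_t n = grads[0].size();
  std::vector<double> mean(n, 0.0);
  for (const auto& g : grads) {
    for (std::size_t i = 0; i < n; ++i) mean[i] += g[i];
  }
  for (auto& m : mean) m /= static_cast<double>(world_size);
  // Every rank ends up with the same averaged gradient.
  for (auto& g : grads) g = mean;
}
```

After the call, all simulated ranks hold identical gradients, which is why identical optimizer steps keep the replicas' weights in sync without broadcasting parameters.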

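Next we look at how backward propagation is carried out.

0x02 Starting from the hook

The figure below comes from a paper by Kuaishou (see reference 1; that paper's project will probably be analyzed in a later post). The upper half shows how the native autograd engine works, and the lower half shows how Horovod and Torch-DDP work. It shows that gradient reduction already starts during backward propagation.

Concretely, besides bucketing, the Reducer also registers autograd hooks during construction, one hook per parameter. These hooks fire during the backward pass as gradients become ready and perform the gradient reduction. Once every gradient in a bucket is ready, the bucket is ready, and the Reducer launches an asynchronous allreduce on it to compute the mean of the gradients across all processes. So we start the analysis at the entry point of backward propagation: the hook.

2.1 How hooks are registered

Let us first look at how hooks are registered, which involves AutogradMeta and Node.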

2.1.1 AutogradMeta

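AutogradMeta records the autograd history of a Variable. Its main member variables are:

grad_: stores the gradient of the current Variable instance; it is itself a Variable.
grad_fn_: a Node instance that only non-leaf nodes have. It is accessed through the grad_fn() method. In fact, PyTorch decides whether a Variable is a leaf variable by checking whether grad_fn is null.
grad_accumulator_: also a Node instance, which only leaf nodes have.
- It is accessed through the Variable's grad_accumulator().
- Leaf nodes accumulate gradients, and grad_accumulator_ is the function that performs the accumulation.
- The accumulated gradient is stored in grad_.
output_nr_: a number that says which output of the Node this Variable is. For example, 0 means the Variable is the first output of the Node.
To summarize:
- For a non-leaf node, grad_fn is the gradient computation. The gradient is not accumulated in grad_ but is passed on to the next stop of the backward pass in the computation graph. grad_fn is a Node.
- For a leaf node, PyTorch creates a special virtual operation that outputs the leaf. The same virtual operation serves as the leaf's grad_accumulator_ and accumulates its gradient in grad_, so a leaf's output_nr_ is always 0. grad_accumulator_ is also a Node, namely AccumulateGrad.

Its definition is as follows: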

struct TORCH_API AutogradMeta : public c10::AutogradMetaInterface {
  std::string name_;

  Variable grad_;
  std::shared_ptr<Node> grad_fn_;
  std::weak_ptr<Node> grad_accumulator_;

  // This field is used to store all the forward AD gradients
  // associated with this AutogradMeta (and the Tensor it corresponds to)
  std::shared_ptr<ForwardGrad> fw_grad_;

  std::vector<std::shared_ptr<FunctionPreHook>> hooks_;
  std::shared_ptr<hooks_list> cpp_hooks_list_;

  // Only meaningful on leaf variables (must be false otherwise)
  bool requires_grad_;
  // Only meaningful on non-leaf variables (must be false otherwise)
  bool retains_grad_;
  bool is_view_;

  // The "output number" of this variable; e.g., if this variable
  // was the second output of a function, then output_nr == 1.
  // We use this to make sure we can setup the backwards trace
  // correctly when this variable is passed to another function.
  uint32_t output_nr_;
  mutable std::mutex mutex_;
};

2.1.2 Node

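In the computation graph, each operation is represented by a node (Node), and different Node subclasses implement different operations.

Both grad_fn_ and grad_accumulator_ of AutogradMeta are Nodes.

The member that matters here is post_hooks_: the hooks that run after the node's gradient computation has executed.

add_post_hook appends a hook to post_hooks_.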

struct TORCH_API Node : std::enable_shared_from_this<Node> {
  public:
  std::vector<std::unique_ptr<FunctionPreHook>> pre_hooks_;
  std::vector<std::unique_ptr<FunctionPostHook>> post_hooks_;  
  
  uintptr_t add_post_hook(std::unique_ptr<FunctionPostHook>&& post_hook) {
    post_hooks_.push_back(std::move(post_hook));
    // Use the raw pointer as the unique key to identify this hook. This key
    // can then be used in del_post_hook(key) to remove this hook.
    return reinterpret_cast<std::uintptr_t>(post_hooks_.back().get());
  }
};
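
The key-by-address idea in add_post_hook can be tried out in isolation. The sketch below is a simplified model, not the real Node: PostHook and MiniNode are hypothetical types, the hook takes no arguments, and apply() only runs the hooks instead of computing anything first.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified hook type. The real FunctionPostHook is an
// abstract class that operates on variable_list arguments.
struct PostHook {
  std::function<void()> fn;
};

// Hypothetical miniature of Node that keeps only the post-hook machinery.
struct MiniNode {
  std::vector<std::unique_ptr<PostHook>> post_hooks_;

  // Store the hook and return its address as the unique key.
  std::uintptr_t add_post_hook(std::unique_ptr<PostHook> hook) {
    post_hooks_.push_back(std::move(hook));
    return reinterpret_cast<std::uintptr_t>(post_hooks_.back().get());
  }

  // Remove the hook whose address equals key; returns true if it was found.
  bool del_post_hook(std::uintptr_t key) {
    for (auto it = post_hooks_.begin(); it != post_hooks_.end(); ++it) {
      if (reinterpret_cast<std::uintptr_t>(it->get()) == key) {
        post_hooks_.erase(it);
        return true;
      }
    }
    return false;
  }

  // Stand-in for the engine: the node's own work would run here,
  // followed by every registered post hook.
  void apply() {
    for (auto& hook : post_hooks_) hook->fn();
  }
};
```

Because the hook lives behind a unique_ptr, its address stays stable even when the vector reallocates, so the returned key remains valid until the hook is removed.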

2.1.3 AccumulateGrad

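AccumulateGrad is a subclass of Node.

2.2 Constructor

Let us revisit the Reducer constructor, which does the following:

For every tensor, it obtains the grad_accumulator_ of its Variable::AutogradMeta, that is, the gradient accumulator of the leaf Variable.
It attaches an autograd_hook to every gradient accumulator. The hook hangs on the autograd graph and is responsible for gradient synchronization during backward.
When find_unused_parameters_ is set, it fills gradAccToVariableMap_ with the mapping from grad_accumulator to index (from function pointer to parameter tensor), which later makes it easy to walk the autograd graph and find unused parameters.
It stores the gradient accumulators in grad_accumulators_.

The code is as follows: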

Reducer::Reducer(
    std::vector<std::vector<at::Tensor>> replicas, // tensors
    std::vector<std::vector<size_t>> bucket_indices, // bucket layout
    ......) {

    for (size_t replica_index = 0; replica_index < replica_count; // iterate over replicas
         replica_index++) {

      for (size_t variable_index = 0; variable_index < variable_count; // iterate over tensors
           variable_index++) {
        auto& variable = replicas_[replica_index][variable_index]; // the concrete tensor
        const auto index = VariableIndex(replica_index, variable_index); // one index per tensor
        // Get the grad_accumulator_ of Variable::AutogradMeta, i.e. the gradient accumulator of the leaf Variable
        auto grad_accumulator = torch::autograd::impl::grad_accumulator(variable);

        hooks_.emplace_back(
            // Add a hook to the accumulator. It hangs on the autograd graph and synchronizes gradients during backward.
            // autograd_hook runs right after grad_accumulator has executed.
            grad_accumulator->add_post_hook(
                torch::make_unique<torch::autograd::utils::LambdaPostHook>(
                    [=](const torch::autograd::variable_list& outputs,
                        const torch::autograd::variable_list& ) {
#ifndef _WIN32
                      this->rpc_context_.set(
                          ThreadLocalDistAutogradContext::getContextPtr());
#endif
                      this->autograd_hook(index); // the hook body calls the Reducer's autograd_hook
                      return outputs;
                    })),
            grad_accumulator);
          
        // gradAccToVariableMap_ stores the mapping from grad_accumulator to index (function pointer to parameter tensor), which makes it easy to walk the autograd graph later and find unused parameters
        if (find_unused_parameters_) {
          gradAccToVariableMap_[grad_accumulator.get()] = index;
        }

        grad_accumulators_[replica_index][variable_index] =
            std::move(grad_accumulator);
      }
    }
  }
}

2.2.1 grad_accumulator

The code of grad_accumulator is shown below. It fetches the tensor's autograd_meta->grad_accumulator_ and returns it, creating it on first use. For a leaf node, grad_accumulator_ is an AccumulateGrad.

std::shared_ptr<Node> grad_accumulator(const Variable& self) {
  auto autograd_meta = get_autograd_meta(self); // fetch autograd_meta
  if (!autograd_meta) {
    return nullptr;
  }
  if (autograd_meta->grad_fn_) {
    throw std::logic_error(
        "grad_accumulator() should be only called on leaf Variables");
  }
  if (!autograd_meta->requires_grad_) {
    return nullptr;
  }

  std::lock_guard<std::mutex> lock(autograd_meta->mutex_);

  // Try to get the cached autograd_meta->grad_accumulator_
  auto result = autograd_meta->grad_accumulator_.lock(); 
  if (result) 
    return result;

  c10::raw::intrusive_ptr::incref(self.unsafeGetTensorImpl());
  auto intrusive_from_this = c10::intrusive_ptr<at::TensorImpl>::reclaim(self.unsafeGetTensorImpl());
  result = std::make_shared<AccumulateGrad>(Variable(std::move(intrusive_from_this)));
  autograd_meta->grad_accumulator_ = result; // cache the newly created accumulator in autograd_meta->grad_accumulator_
  return result;
}
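
The lock-or-create pattern around the weak_ptr is worth isolating. The sketch below uses hypothetical stand-ins (Accumulator for AccumulateGrad, Meta for AutogradMeta, get_accumulator for grad_accumulator) and leaves out the leaf and requires_grad checks.

```cpp
#include <memory>
#include <mutex>

struct Accumulator {};  // hypothetical stand-in for AccumulateGrad

struct Meta {  // hypothetical stand-in for AutogradMeta
  std::weak_ptr<Accumulator> grad_accumulator_;
  std::mutex mutex_;
};

// Return the cached accumulator if it is still alive; otherwise create a
// new one and cache a weak (non-owning) reference to it.
std::shared_ptr<Accumulator> get_accumulator(Meta& meta) {
  std::lock_guard<std::mutex> lock(meta.mutex_);
  if (auto existing = meta.grad_accumulator_.lock()) {
    return existing;
  }
  auto created = std::make_shared<Accumulator>();
  meta.grad_accumulator_ = created;
  return created;
}
```

Since the meta object holds only a weak reference, the accumulator disappears once the last shared_ptr is dropped. This is presumably why the Reducer keeps the returned pointers in grad_accumulators_: as long as it owns them, the accumulators and the hooks attached to them stay alive.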

2.2.2 Diagram

Suppose a tensor variable1 whose VariableIndex is index1. The setup is shown below. After AccumulateGrad has finished computing the gradient in apply, the hooks in post_hooks are called.

+-----------------------------------------+
| Reducer                                 |
|                                         |
|                                         |
|  +------------------------------------+ |   +------------------+    +----------------+
|  | grad_accumulators_                 | |   |  variable1       |    | AccumulateGrad |
|  |                                    | |   |                  |    |                |
|  |                                    | |   |                  |    |                |
|  |  [replica_index][variable_index]+------> |   autograd_meta_+---> |    post_hooks  |
|  |                                    | |   |                  |    |        +       |
|  |                                    | |   |                  |    |        |       |
|  +------------------------------------+ |   +------------------+    +----------------+
|                                         |                                    |
|  +-------------------------------+      |                                    |
|  | gradAccToVariableMap_         |      |                                    v
|  |                               |      |
|  |                               |      |                    +-----------------------+
|  |        [variable1 : index1]   |      |                    |  autograd_hook(index1)|
|  |                               |      |                    +-----------------------+
|  +-------------------------------+      |
|                                         |
+-----------------------------------------+


                                               +---------------------------------------+
                                  index1 +-->  |VariableIndex                          |
                                               |                                       |
                                               |          replica_index of Variable1   |
                                               |                                       |
                                               |          variable_index of Variable1  |
                                               |                                       |
                                               +---------------------------------------+

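2.3 The hook function

When a gradient is ready, the engine calls back into the hook, which is the autograd_hook method below. Based on several conditions it decides whether this variable is ready. The logic is:

If this is a dynamic graph with unused-parameter detection, or the first iteration of a static graph, set the variable's slot in local_used_maps_ to 1.
- local_used_maps_ holds CPU tensors that record which parameters were used locally.
- A dynamic graph may differ from iteration to iteration, so buckets and variables may change each time and local_used_maps_ has to be updated in every iteration.
- A static graph is the same in every iteration, so setting it in the callback during the first iteration is enough.
If this is the first iteration of a static graph, increment the variable's entry in numGradHooksTriggeredMap_ and return.
If unused parameters have not been marked yet, iterate over the unused variables and mark each one ready by calling mark_variable_ready.
If this is a static graph after the first iteration, decrement the variable's entry in numGradHooksTriggeredMapPerIteration_; when it reaches 0, mark the variable ready by calling mark_variable_ready.
Otherwise this is a dynamic graph, where the variable is marked ready every time by calling mark_variable_ready.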

// The function `autograd_hook` is called after the gradient for a
// model parameter has been accumulated into its gradient tensor.
// This function is only to be called from the autograd thread.
void Reducer::autograd_hook(VariableIndex index) {
  std::lock_guard<std::mutex> lock(this->mutex_);

  // Carry over thread local state from main thread. This allows for
  // thread-local flags such as profiler enabled to be configure correctly.
  at::ThreadLocalStateGuard g(thread_local_state_);

  // Ignore if we don't expect to be called.
  // This may be the case if the user wants to accumulate gradients
  // for number of iterations before reducing them.
  if (!expect_autograd_hooks_) {
    return;
  }

// Note [Skip allreducing local_used_maps_dev]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~
// If find_unused_parameters_ is set to false, there is no need to allreduce
// local_used_maps_dev_, because all parameters will be reduced anyway.
// Therefore, we can avoid allocating memory for local_used_maps and
// local_used_maps_dev_ if find_unused_parameters_ is false. 
        
  // See Note [Skip allreducing local_used_maps_dev]
  // Dynamic graph with unused-parameter detection, or first iteration of a static graph
  if (dynamic_graph_find_unused() || static_graph_first_iteration()) {
    // Since it gets here, this param has been used for this iteration. We want
    // to mark it in local_used_maps_. During no_sync session, the same var can
    // be set multiple times, which is OK as does not affect correctness. As
    // long as it is used once during no_sync session, it is marked as used.
    // local_used_maps_ holds CPU tensors recording which parameters were used locally.
    // A dynamic graph may differ per iteration (buckets and variables may change),
    // so local_used_maps_ must be updated in every iteration.
    // A static graph is identical in every iteration, so setting it in the
    // callback during the first iteration is enough.
    local_used_maps_[index.replica_index][index.variable_index] = 1;
  }

  if (static_graph_first_iteration()) { // first iteration of a static graph
    numGradHooksTriggeredMap_[index] += 1; // incremented only during the first iteration of a static graph
    return;
  }

  // If `find_unused_parameters_` is true there may be model parameters that
  // went unused when computing the model output, they won't be part of the
  // autograd graph, and won't receive gradients. These parameters are
  // discovered in the `prepare_for_backward` function and their indexes stored
  // in the `unused_parameters_` vector.
  if (!has_marked_unused_parameters_) {
    has_marked_unused_parameters_ = true;
    for (const auto& unused_index : unused_parameters_) { // iterate over the unused variables
      mark_variable_ready(unused_index); // an unused variable counts as ready right away
    }
  }

  // If it is static graph, after 1st iteration, check a variable
  // is ready for communication based on numGradHooksTriggeredMap_.
  if (static_graph_after_first_iteration()) { // static graph, second iteration onwards
    // Why only from the second iteration on? In the first iteration the hook
    // returns early above after counting the trigger; the gradient is not yet
    // considered ready because the Reducer has not processed it.
    // For a static graph, numGradHooksTriggeredMapPerIteration_ = numGradHooksTriggeredMap_;
    if (--numGradHooksTriggeredMapPerIteration_[index] == 0) {
      // Finally mark variable for which this function was originally called.
      mark_variable_ready(index); // the count reached 0, so the variable is ready
    }
  } else {
    // Finally mark variable for which this function was originally called.
    mark_variable_ready(index); // a dynamic graph marks the variable ready every time
  }
  }
}
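
The static-graph bookkeeping is easier to follow in a toy model. In the sketch below, HookCounter and its members are hypothetical names; start_next_iteration stands in for whatever resets the per-iteration map in the real Reducer, which this excerpt does not show. The first iteration only counts how often each parameter's hook fires, and later iterations count down from that number.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical miniature of the static-graph logic in autograd_hook.
struct HookCounter {
  bool first_iteration = true;
  // cf. numGradHooksTriggeredMap_: hook firings per parameter, iteration 1.
  std::unordered_map<std::size_t, int> triggered;
  // cf. numGradHooksTriggeredMapPerIteration_: countdown for later iterations.
  std::unordered_map<std::size_t, int> per_iteration;
  // Parameters in the order they became ready in the current iteration.
  std::vector<std::size_t> ready;

  void on_hook(std::size_t index) {
    if (first_iteration) {
      ++triggered[index];  // only count, never mark ready
      return;
    }
    if (--per_iteration[index] == 0) {
      ready.push_back(index);  // last expected firing: now ready
    }
  }

  // Assumed reset step between iterations (hypothetical helper).
  void start_next_iteration() {
    first_iteration = false;
    per_iteration = triggered;
    ready.clear();
  }
};
```

A parameter whose hook fires twice per backward pass (for example a shared weight) is therefore marked ready only after the second firing.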

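0x03 Ready

During backward propagation, when a parameter's hook finds that the variable is ready, it calls mark_variable_ready(index). Let us see what happens next.

The rough order is: handle the ready variable, handle the ready bucket, handle the usage information, and copy the gradients from DDP back into autograd.

3.1 Variable ready

3.1.1 Marking a variable ready

mark_variable_ready marks a variable as ready. The logic is:

If buckets need to be rebuilt, push the index onto the rebuild list.
- Buckets are rebuilt only when all of the following hold: 1) this is the first time buckets are rebuilt; 2) static graph is true or find-unused-parameters is false; 3) this backward pass needs to run allreduce.
- Here the tensors and their parameter indices are simply dumped into rebuilt_params_ and rebuilt_param_indices_ in gradient arrival order. At the end of finalize_backward(), the buckets are rebuilt from rebuilt_params_ and rebuilt_param_indices_, then broadcast and initialized. Only the tensors and parameter indices of one replica need to be dumped.
Find the replica index of this variable and its position within the replica.
The variable has been used, so record it by inserting it into perIterationReadyParams_.
Whenever a variable is marked ready, set require_finalize_ so that a call to finalize is required.
Check whether all gradients in the bucket are ready; if nothing is pending, the bucket is ready too.
Decrement the pending count of this model replica, because one more tensor is ready.
If the replica's pending count is 0, decrement the bucket's pending count.
- A replica pending count of 0 means the bucket has one fewer replica to wait for.
- If the bucket's pending count is 0, mark the bucket ready with mark_bucket_ready.
If all buckets are ready:
- Call all_reduce_local_used_map (only for a dynamic graph with unused-parameter detection).
- Call Engine::get_default_engine().queue_callback to register a callback. The engine invokes it after the whole backward pass has finished; the callback calls finalize_backward, which then handles the reduction results for the variables that were used.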

void Reducer::mark_variable_ready(VariableIndex index) {
  // Rebuild bucket only if 1) it is the first time to rebuild bucket 2)
  // static_graph_ is true or find_unused_parameters_ is false,
  // 3) this backward pass needs to run allreduce.
  // Here, we just dump tensors and their parameter indices into
  // rebuilt_params_ and rebuilt_param_indices_ based on gradient arriving
  // order, and then at the end of finalize_backward(), buckets will be
  // rebuilt based on rebuilt_params_ and rebuilt_param_indices_, and then
  // will be broadcasted and initialized. Also we only need to dump tensors
  // and parameter indices of one replica.
 
  if (should_rebuild_buckets()) {
    push_rebuilt_params(index); // if a rebuild is needed, push index onto the rebuild list
  }

  const auto replica_index = index.replica_index; // replica index of the variable
  const auto variable_index = index.variable_index; // position within the replica

  if (replica_index == 0) {
    checkAndRaiseMarkedTwiceError(variable_index);
    perIterationReadyParams_.insert(variable_index); // this variable was used; record it
  }
  backward_stats_[replica_index][variable_index] =
      current_time_in_nanos() - cpu_timer_.backward_compute_start_time;

  // Any time we mark a variable ready (be it in line due to unused parameters,
  // or via an autograd hook), we require a call to the finalize function. If
  // this doesn't happen before the next iteration (or call to
  // `prepare_for_backwards`), we know something is wrong.
  require_finalize_ = true;  // every time a variable is marked ready, finalize must be called

  const auto& bucket_index = variable_locators_[variable_index]; // locator of the variable
  auto& bucket = buckets_[bucket_index.bucket_index]; // the bucket that holds the variable
  auto& replica = bucket.replicas[replica_index]; // the replica within that bucket


  set_divide_factor();

  if (bucket.expect_sparse_gradient) {
    mark_variable_ready_sparse(index); // sparse variable
  } else {
    mark_variable_ready_dense(index); // dense variable
  }

  // TODO(@pietern): Make this work for both CPU/CUDA tensors.
  // When using CPU tensors we don't need to do this.
  // // Record event so that we can wait for all of them.
  // auto& event = replica.events[bucket_index.intra_bucket_index];
  // event.record();

  // Check if this was the final gradient for this bucket.
  // If nothing is pending any more, the bucket is ready as well.
  if (--replica.pending == 0) { // one more tensor of this replica is ready
    // Kick off reduction if all replicas for this bucket are ready.
    if (--bucket.pending == 0) { // this replica is done, so the bucket waits for one replica fewer
      mark_bucket_ready(bucket_index.bucket_index); // all replicas are done: mark the bucket ready
    }
  }

  // Run finalizer function and kick off reduction for local_used_maps once the
  // final bucket was marked ready.
  if (next_bucket_ == buckets_.size()) { // all buckets are ready

    if (dynamic_graph_find_unused()) {
      all_reduce_local_used_map(); // allreduce the map of locally used variables
    }

    // The autograd engine uses the default stream when running callbacks, so we
    // pass in the current CUDA stream in case it is not the default.
    const c10::Stream currentStream = get_current_stream();
    // Register finalize_backward with the engine
    torch::autograd::Engine::get_default_engine().queue_callback([=] {
      
      std::lock_guard<std::mutex> lock(this->mutex_);
      // Run callback with the current stream
      c10::OptionalStreamGuard currentStreamGuard{currentStream};
      if (should_collect_runtime_stats()) {
        record_backward_compute_end_time();
      }
      // Check that all buckets were completed and had their work kicked off.
      TORCH_INTERNAL_ASSERT(next_bucket_ == buckets_.size());
      this->finalize_backward(); 
    });
  }
}

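The logic is as follows:

1. The Reducer registers autograd_hook on the post_hooks of AccumulateGrad.
2. During backward propagation, when the autograd engine finds that a parameter is ready, it calls autograd_hook.
3. autograd_hook continues the processing.
4. A finalize_backward callback is registered with the engine.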

Engine        AccumulateGrad                Reducer

  +                  +                         +
  |                  |                         |
  |                  |           1             |
  |                  | <-----------------------+
  |                  |
  |                  |
  |                  |
  |                  v           2
  |             post_hooks  +-------->  autograd_hook
  |                                            +
  |                                            |
  |                                            | 3
  |                                            v
  |                         +------------------+---------------------------+
  |                         |    mark_variable_ready                       |
  |                         |                                              |
  |                         |                                              |
  |                         |     All variable in replica are ready?       |
  |                         |                   +                          |
  |                         |                   | YES                      |
  |                         |                   v                          |
  |                         |     All replica in bucket are ready?         |
  |                         |                   +                          |
  |                         |                   | YES                      |
  |                         |                   v                          |
  |                         |            mark_bucket_ready                 |
  |                         |                                              |
  |                         |                                              |
  |                         |                                              |
  |                         |                   +                          |
  |                         |                   |                          |
  |                         |                   |                          |
  |                         |                   v                          |
  |                         |          All buckets are ready?              |
  |                         |                   +                          |
  |                         |                   | YES                      |
  |                         |                   v                          |
  |   queue_back   4        |          all_reduce_local_used_map           |
  | <----------------------------+  queue_callback(finalize_backward)      |
  |                         |                                              |
  |                         |                                              |
  v                         +----------------------------------------------+
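
The two nested pending counters in mark_variable_ready can be modeled in a few lines. This is a sketch with hypothetical types (ReplicaState, BucketState) and helpers (make_bucket, mark_ready); it keeps only the counting and drops everything else.

```cpp
#include <cstddef>
#include <vector>

struct ReplicaState {
  std::size_t pending;  // gradients of this replica not yet ready
};

struct BucketState {
  std::vector<ReplicaState> replicas;
  std::size_t pending;  // replicas that still have pending gradients
};

// Hypothetical helper: a bucket with num_replicas replicas, each waiting
// for vars_per_replica gradients.
BucketState make_bucket(std::size_t num_replicas, std::size_t vars_per_replica) {
  BucketState bucket;
  bucket.replicas.assign(num_replicas, ReplicaState{vars_per_replica});
  bucket.pending = num_replicas;
  return bucket;
}

// Record one ready gradient for the given replica. Returns true exactly
// when this was the last gradient the whole bucket was waiting for.
bool mark_ready(BucketState& bucket, std::size_t replica_index) {
  ReplicaState& replica = bucket.replicas[replica_index];
  if (--replica.pending == 0) {
    if (--bucket.pending == 0) {
      return true;  // the real code calls mark_bucket_ready here
    }
  }
  return false;
}
```

With a single replica per bucket, the bucket becomes ready as soon as its last gradient arrives; with several replicas, every replica has to finish first.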

3.1.2 Registering the callback

The code above uses torch::autograd::Engine::get_default_engine().queue_callback to register a callback. Let us analyze it.

It is defined in the engine and simply appends the callback to final_callbacks_:

void Engine::queue_callback(std::function<void()> callback) {
  std::lock_guard<std::mutex> lock(current_graph_task->final_callbacks_lock_);
  current_graph_task->final_callbacks_.emplace_back(std::move(callback));
}

final_callbacks_ is processed in exec_post_processing: the callbacks are invoked once the engine has finished the entire backward pass.

void GraphTask::exec_post_processing() {
  if (!not_ready_.empty()) {
    throw std::runtime_error("could not compute gradients for some functions");
  }

  // set the thread_local current_graph_task_ as more callbacks can be installed
  // by existing final callbacks.
  GraphTaskGuard guard(shared_from_this());
  // Lock mutex during each iteration for accessing final_callbacks.size()
  // Unlocking is necessary, because the callback can register
  // more callbacks (or they can be registered from other threads
  // while it's waiting.
  std::unique_lock<std::mutex> cb_lock(final_callbacks_lock_);
  // WARNING: Don't use a range-for loop here because more callbacks may be
  // added in between callback calls, so iterators may become invalidated.
  // NOLINTNEXTLINE(modernize-loop-convert)
  for (size_t i = 0; i < final_callbacks_.size(); ++i) {
    cb_lock.unlock();
    final_callbacks_[i](); // invoke the callback
    cb_lock.lock();
  }

  // Syncs leaf streams with default streams (if necessary)
  // See note "Streaming backwards"
  for (const auto& leaf_stream : leaf_streams) {
    const auto guard = c10::impl::VirtualGuardImpl{c10::DeviceType::CUDA};
    const auto default_stream = guard.getDefaultStream(leaf_stream.device());
    if (leaf_stream != default_stream) {
      auto event = c10::Event{c10::DeviceType::CUDA};
      event.record(leaf_stream);
      default_stream.wait(event);
    }
  }
}
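
The index-based loop matters because a callback may queue further callbacks while the list is being processed. The sketch below reproduces that behavior with a hypothetical MiniGraphTask. It deviates from the excerpt in one deliberate way: each callback is copied out under the lock before it runs, so a reallocation of the vector cannot affect the callback that is currently executing.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical miniature of GraphTask's final-callback handling.
struct MiniGraphTask {
  std::mutex final_callbacks_lock_;
  std::vector<std::function<void()>> final_callbacks_;

  void queue_callback(std::function<void()> callback) {
    std::lock_guard<std::mutex> lock(final_callbacks_lock_);
    final_callbacks_.emplace_back(std::move(callback));
  }

  void exec_post_processing() {
    std::unique_lock<std::mutex> cb_lock(final_callbacks_lock_);
    // Index-based loop: size() is re-read on every pass, so callbacks
    // appended during processing are picked up as well.
    for (std::size_t i = 0; i < final_callbacks_.size(); ++i) {
      // Copy under the lock (a choice made for this sketch).
      std::function<void()> callback = final_callbacks_[i];
      cb_lock.unlock();  // let the callback call queue_callback itself
      callback();
      cb_lock.lock();
    }
  }
};
```

Callbacks run in the order they were queued, including those added while the loop is already running.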

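The logic therefore extends to:

1. The Reducer registers autograd_hook on the post_hooks of AccumulateGrad.
2. During backward propagation, when the autograd engine finds that a parameter is ready, it calls autograd_hook.
3. autograd_hook continues the processing.
4. A finalize_backward callback is registered with the engine.
5. finalize_backward is called from GraphTask::exec_post_processing.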

          Engine        AccumulateGrad                Reducer

            +                  +                         +
            |                  |                         |
            |                  |           1             |
            |                  | <-----------------------+
            |                  |
            |                  |
            |                  |
            |                  v
            |                              2
            |             post_hooks  +-------->  autograd_hook
            |                                            +
            |                                            |
            |                                            |  3
            |                                            v
            |                         +------------------+---------------------------+
            |                         | mark_variable_ready                          |
            |                         |                                              |
            |                         |                                              |
            |                         |     All variable in replica are ready?       |
            |                         |                   +                          |
            |                         |                   | YES                      |
            |                         |                   v                          |
            |                         |     All replica in bucket are ready?         |
            |                         |                   +                          |
            |                         |                   | YES                      |
            |                         |                   v                          |
            |                         |            mark_bucket_ready                 |
            |                         |                                              |
            |                         |                                              |
            |                         |                                              |
            |                         |                   +                          |
            |                         |                   |                          |
            |                         |                   |                          |
            |                         |                   v                          |
            |                         |          All buckets are ready?              |
            |                         |                   +                          |
            |                         |                   | YES                      |
            |                         |                   v                          |
            |   queue_back    4       |          all_reduce_local_used_map           |
            | <----------------------------+  queue_callback(finalize_backward)      |
            |                         |                                              |
            |                         |                                              |
            |                         +-------------------+--------------------------+
            v                                             |
                                                          |
GraphTask::exec_post_processing                           |
            +                                             |
            |                                             |
            |                 5                           v
            +--------------------------------->   finalize_backward
            |                                             +
            |                                             |
            |                                             |
            v                                             v

3.1.3 mark_variable_ready_sparse

mark_variable_ready_sparse handles sparse variables. In essence it copies the gradient into the Reducer.

void Reducer::mark_variable_ready_sparse(VariableIndex index) {
  const auto replica_index = index.replica_index;
  const auto variable_index = index.variable_index;
  const auto& bucket_index = variable_locators_[variable_index];
  auto& bucket = buckets_[bucket_index.bucket_index]; // which bucket
  auto& replica = bucket.replicas[replica_index]; // which replica of the bucket
  auto& variable = replica.variables[bucket_index.intra_bucket_index]; // which variable in the replica

  runGradCallbackForVariable(variable, [&](auto& grad) {
    TORCH_CHECK(grad.defined(), "Expected sparse gradient to be defined.");
    TORCH_CHECK(
        grad.options().layout() == c10::kSparse,
        "Expected variable to have sparse gradient.");

    // Sparse tensors cannot be grouped together with other sparse tensors
    // in a single reduction operation like we can for dense tensors.
    // Therefore, the `offsets` and `lengths` vectors in the bucket replica
    // struct are empty, and there is no pre-existing accumulation tensor.
    // Directly assign the sparse tensor to the `contents` field.
    replica.contents = grad; // assign directly
    // See Note [DDP Communication Hook]
    if (comm_hook_ == nullptr) {
      replica.contents.div_(divFactor_);
    }
    // The grad is modified in place and needs to be written back.
    return true;
  });
}

3.1.4 mark_variable_ready_dense

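mark_variable_ready_dense handles dense tensors. In essence it copies the gradient into the Reducer.

First consider the member variable gradient_as_bucket_view_:

If it is false, the bucket has to be copied back to the grads after the bucket has been allreduced.
If it is true, the gradients are views pointing to different offsets of the allreduce communication buckets. This reduces peak memory usage; the memory saved equals the total size of the gradients. It also avoids the overhead of copying between the gradients and the allreduce communication buckets. When gradients are views, detach_() cannot be called on them.

The logic of mark_variable_ready_dense is:

Use index to find the bucket and the replica the variable belongs to, get the tensor variable in that replica, and from there its offset and size. Finally obtain the bucket_view that corresponds to the tensor.
Process the tensor with runGradCallbackForVariable. runGradCallbackForVariable actually runs the callback through DistAutogradContext and finally writes the result back to DistAutogradContext.
The callback works as follows:
- grad and bucket_view point to different storage when gradient_as_bucket_view_ is false, or, in rare cases, even when it is true, because users may set grad to None after every iteration.
- In these cases grad has to be copied into bucket_view.
- If gradient_as_bucket_view_ is true, grad is then made to point to bucket_view.
- If grad was already set to a view of the bucket in an earlier iteration, no copy is needed.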

void Reducer::mark_variable_ready_dense(VariableIndex index) {
  const auto replica_index = index.replica_index;
  const auto variable_index = index.variable_index;
  const auto& bucket_index = variable_locators_[variable_index];
  auto& bucket = buckets_[bucket_index.bucket_index]; // which bucket
  auto& replica = bucket.replicas[replica_index]; // which replica of the bucket
  auto& variable = replica.variables[bucket_index.intra_bucket_index]; // the variable in the replica
  const auto offset = replica.offsets[bucket_index.intra_bucket_index]; // offset of the variable
  const auto length = replica.lengths[bucket_index.intra_bucket_index]; // size of the variable
  auto& bucket_view = replica.bucket_views_in[bucket_index.intra_bucket_index]; // the variable's view into the bucket

  // Copy contents of gradient tensor to bucket tensor.
  // If the gradient is not set, we assume it wasn't computed
  // as part of the current backwards pass, and zero the part
  // of the bucket it would otherwise hold.
  runGradCallbackForVariable(variable, [&](auto& grad) {
    // grad is the gradient of this tensor
    if (grad.defined()) {
      this->check_grad_layout(grad, bucket_view);
      // When gradient_as_bucket_view_ is false, or even when
      // gradient_as_bucket_view_ is true, in rare cases users may set grad to
      // be None after every iteration. In these cases, grad and bucket_view are
      // pointing to different storages and thus need to copy grads to
      // bucket_view. If gradient_as_bucket_view_ is set as true, let grad point
      // to bucket_view. If grad has already been set as views of buckets in
      // previous iterations, no copy is needed.
      if (!grad.is_alias_of(bucket_view)) {
        this->copy_grad_to_bucket(grad, bucket_view); // copy the gradient into contents
        if (gradient_as_bucket_view_) {
          // Let grad point to bucket_view buffer.
          grad = bucket_view; // to save memory, grad now points to bucket_view
          // The grad is modified and need to be written back.
          return true;
        }
      } else {
        // If grad and bucket view point to the same storage, no need to copy
        if (comm_hook_ == nullptr) {
          bucket_view.div_(divFactor_);
        }
      }
    } else {
      bucket_view.zero_(); // fill with zeros
    }
    // The grad is not modified and doesn't need to be written back.
    return false;
  });
}

copy_grad_to_bucket copies the gradient into contents.

void Reducer::copy_grad_to_bucket(
    const at::Tensor& grad,
    at::Tensor& bucket_view) {
  // See Note [DDP Communication Hook]
  if (comm_hook_ == nullptr) {
    auto wrapped = at::native::wrapped_scalar_tensor(double(1.) / divFactor_);
    // Divides while copying into the bucket view.
    at::mul_out(bucket_view, grad, wrapped);
  } else {
    bucket_view.copy_(grad); // copy the gradient into the bucket replica's contents through bucket_view
  }
}
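
How offsets, lengths and the flat contents buffer fit together can be shown without tensors. In the sketch below, FlatBucket and make_flat_bucket are hypothetical, and plain vectors of double stand in for tensors. The copy multiplies by 1 / div_factor, mirroring the case where no communication hook is registered.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical miniature of a bucket replica.
struct FlatBucket {
  std::vector<double> contents;      // one flat buffer for all gradients
  std::vector<std::size_t> offsets;  // start of each variable in contents
  std::vector<std::size_t> lengths;  // element count of each variable
};

// Lay the variables out back to back; sizes[i] is the element count of
// variable i.
FlatBucket make_flat_bucket(const std::vector<std::size_t>& sizes) {
  FlatBucket bucket;
  std::size_t offset = 0;
  for (std::size_t n : sizes) {
    bucket.offsets.push_back(offset);
    bucket.lengths.push_back(n);
    offset += n;
  }
  bucket.contents.assign(offset, 0.0);
  return bucket;
}

// Copy grad into the slice that belongs to variable i, dividing while
// copying. Assumption: grad has at least lengths[i] elements.
void copy_grad_to_bucket(FlatBucket& bucket, std::size_t i,
                         const std::vector<double>& grad, double div_factor) {
  const double scale = 1.0 / div_factor;
  for (std::size_t k = 0; k < bucket.lengths[i]; ++k) {
    bucket.contents[bucket.offsets[i] + k] = grad[k] * scale;
  }
}
```

Dividing before the allreduce and then summing across processes yields the mean gradient, so no extra division pass is needed afterwards.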

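3.2 Bucket ready

As seen in the code above, the Reducer checks whether all gradients in a bucket are ready; when nothing is pending the bucket is ready too, and mark_bucket_ready is called.

mark_bucket_ready walks over the buckets and reduces those that are ready.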

// Called when the bucket at the specified index is ready to be reduced.
void Reducer::mark_bucket_ready(size_t bucket_index) {
  TORCH_INTERNAL_ASSERT(bucket_index >= next_bucket_);

  // Buckets are reduced in sequence. Ignore this bucket if
  // it's not its turn to be reduced.
  if (bucket_index > next_bucket_) {
    return;
  }

  // Keep going, until we either:
  // - have kicked off reduction for all buckets, or
  // - found a bucket that's not yet ready for reduction.
  for (; next_bucket_ < buckets_.size() && buckets_[next_bucket_].pending == 0;
       next_bucket_++) {
    num_buckets_ready_++; // one more bucket is ready
    if (num_buckets_ready_ == 1 && should_collect_runtime_stats()) {
      record_backward_comm_start_time();
    }
    auto& bucket = buckets_[next_bucket_];
    all_reduce_bucket(bucket); // reduce the ready bucket
  }
}
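
The in-order launch rule is easy to miss, so here is a minimal sketch of it using only the standard library. BucketSketch, ReducerSketch, mark_gradient_ready and launched are hypothetical names for illustration; the real Reducer starts an allreduce where this sketch only records the bucket index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct BucketSketch {
  std::size_t pending;  // number of gradients in this bucket not yet ready
};

struct ReducerSketch {
  std::vector<BucketSketch> buckets;
  std::size_t next_bucket = 0;        // first bucket whose reduction has not started
  std::vector<std::size_t> launched;  // order in which reductions were started

  // One gradient of the given bucket became ready.
  void mark_gradient_ready(std::size_t bucket_index) {
    assert(buckets[bucket_index].pending > 0);
    if (--buckets[bucket_index].pending == 0) {
      mark_bucket_ready(bucket_index);
    }
  }

  // Same control flow as Reducer::mark_bucket_ready: buckets are reduced
  // strictly in index order, so a bucket that becomes ready early waits
  // until all lower-indexed buckets have been launched.
  void mark_bucket_ready(std::size_t bucket_index) {
    assert(bucket_index >= next_bucket);
    if (bucket_index > next_bucket) {
      return;  // not this bucket's turn yet
    }
    for (; next_bucket < buckets.size() && buckets[next_bucket].pending == 0;
         ++next_bucket) {
      launched.push_back(next_bucket);  // stands in for all_reduce_bucket
    }
  }
};
```

If buckets 1 and 2 become ready before bucket 0, nothing is launched until bucket 0 is ready; at that point the loop launches 0, 1 and 2 in one go. Keeping the same launch order on every process is what lets the collective calls match up across workers.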

3.2.1 all_reduce_bucket

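all_reduce_bucket synchronizes contents across processes.

Iterate over the bucket's replicas and push each replica's contents tensor into tensors.
If no comm_hook is registered, allreduce these tensors directly.
If a comm_hook is registered, run the reduction through the hook. Note that comm_hook is only the low-level hook that handles communication; to clip each gradient separately before the reduction, you still need to attach a hook on the autograd side.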

void Reducer::all_reduce_bucket(Bucket& bucket) {
  std::vector<at::Tensor> tensors;
  tensors.reserve(bucket.replicas.size());
  for (const auto& replica : bucket.replicas) {
    // TODO(@pietern): Ensure proper synchronization with the CUDA events
    // that recorded copies into this contents tensor. If these copies are
    // executed on non-default streams, the current stream for the device
    // that holds the contents tensor must wait on these events.
    //
    // As long as autograd uses the default stream for every device,
    // these operations are implicitly sequenced, and we don't need to
    // do any extra synchronization here.
    //
    // Work on the CUDA default stream is already ordered in time
    tensors.push_back(replica.contents);
  }
  // See Note [DDP Communication Hook]
  if (comm_hook_ == nullptr) {
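    // No comm_hook registered: allreduce directly
    bucket.work = process_group_->allreduce(tensors);
  } else {
    // A comm_hook is registered: let the hook perform the allreduce.
    // Note that comm_hook is only the low-level communication hook; to clip
    // each gradient separately before the reduction, a hook still has to be
    // attached on the autograd side.
    GradBucket grad_bucket(
        next_bucket_,
        tensors[0], // per the comment below, a bucket has only one replica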
        // Since currently we do not support single-process multiple-device
        // mode, we can assume only one replica in the bucket.
        bucket.replicas[0].offsets,
        bucket.replicas[0].lengths,
        bucket.replicas[0].sizes_vec);
    bucket.future_work = comm_hook_->runHook(grad_bucket);
  }
}

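The logic so far expands as follows:

Reducer registers autograd_hook on the post_hooks of AccumulateGrad.
During the backward pass, when the autograd engine finds that a parameter's gradient is ready, it calls autograd_hook.
autograd_hook continues the processing.
all_reduce_bucket is called to synchronize the gradients.
A finalize_backward callback is registered with the engine.
finalize_backward is called inside GraphTask::exec_post_processing.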

                                                                             +
                                                                  Worker 1   |   Worker 2
                                                                             |
  Engine    AccumulateGrad                Reducer                            |    Reducer
                                                                             |
    +              +                         +                               |        +
    |              |                         |                               |        |
    |              |          1              |                               |        |
    |              | <-----------------------+                               |        |
    |              |                                                         |        |
    |              |                                                         |        |
    |              v                                                         |        |
    |                         2                                              |        |
    |         post_hooks  +-------->  autograd_hook                          |        |
    |                                        +                               |        |
    |                                        |                               |        |
    |                                        |  3                            |        |
    |                                        v                               |        |
    |                     +------------------+---------------------------+   |        |
    |                     | mark_variable_ready                          |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |     All variable in replica are ready?       |   |        |
    |                     |                   +                          |   |        |
    |                     |                   | YES                      |   |        |
    |                     |                   v                          |   |        |
    |                     |     All replica in bucket are ready?         |   |        |
    |                     |                   +                          +   +        |
    |                     |                   | YES                                   |
    |                     |                   v               4   all_reduce_bucket   |
    |                     |            mark_bucket_ready  <--------------+---+----->  |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |                   +                          |   |        |
    |                     |                   |                          |   |        |
    |                     |                   |                          |   |        |
    |                     |                   v                          |   |        |
    |                     |          All buckets are ready?              |   |        |
    |                     |                   +                          |   |        |
    |                     |                   | YES                      |   |        |
    |                     |                   v                          |   |        |
    |      queue_back 5   |          all_reduce_local_used_map           |   |        |
    | <------------------------+  queue_callback(finalize_backward)      |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     +-------------------+--------------------------+   |        |
    v                                         |                              |        |
                                              |                              |        |
GraphTask::exec_post_processing               |                              |        |
    +                                         |                              |        |
    |                                         |                              |        |
    |                                         v                              |        |
    +----------------------------->   finalize_backward                      |        |
    |                 6                       +                              |        |
    |                                         |                              |        |
    |                                         |                              |        |
    v                                         v                              +        v

3.2.2 PythonCommHook

PythonCommHook serves users' custom communication needs. We mentioned it in an earlier article; here are two hook examples.

PythonCommHook example:

c10::intrusive_ptr<c10::ivalue::Future> PythonCommHook::runHook(
    GradBucket& bucket) {
  py::gil_scoped_acquire acquire;

  py::object py_fut = hook_(state_, bucket);

  try {
    return py_fut.cast<std::shared_ptr<torch::jit::PythonFutureWrapper>>()->fut;
  } catch (const py::cast_error& e) {
    auto type = py_fut.get_type();
    auto errMsg = c10::str(
        e.what(),
        ". DDP communication hook's callback must return a "
        "torch.futures.Future or torch._C.Future object, but got ",
        type.attr("__module__").cast<std::string>(),
        ".",
        type.attr("__qualname__").cast<std::string>());
    throw std::runtime_error(errMsg);
  }
}

Or the built-in C++ hook, AllReduceCommHook:

c10::intrusive_ptr<c10::ivalue::Future> AllReduceCommHook::runHook(
    GradBucket& bucket) {
  std::vector<at::Tensor> tensors = {bucket.getTensorRef()};
  auto allreduce_work = state_->allreduce(tensors);

  // FIXME Access the result through the Future passed as argument, instead of
  // capturing the Work.
  auto div_by_process_group_size = [allreduce_work,
                                    this](c10::ivalue::Future& /* unused */) {
    auto tensor = allreduce_work->result()[0] / state_->getSize();
    return c10::IValue(tensor);
  };

  auto fut = allreduce_work->getFuture();
  return fut->then(div_by_process_group_size, fut->elementType());
}

3.2.3 GradBucket

GradBucket is the class that hands a bucket's information to the communication hook.

// This class passes bucket contents tensor to DDP communication hook.
class GradBucket {
 public:
  explicit GradBucket(
      size_t index,
      const at::Tensor& tensor,
      const std::vector<size_t>& offsets,
      const std::vector<size_t>& lengths,
      const std::vector<c10::IntArrayRef>& sizes_vec)
      : index_(index),
        tensor_(tensor),
        offsets_(offsets),
        lengths_(lengths),
        sizes_vec_(sizes_vec) {}

  // Returns the index of the bucket, which is unique across all the buckets.
  size_t getIndex() const {
    return index_;
  }

  const at::Tensor& getTensor() const {
    return tensor_;
  }

  // Returns a mutable tensor compared with the above method.
  at::Tensor& getTensorRef() {
    return tensor_;
  }

  // Overwrites tensors at a specific index.
  void setTensor(at::Tensor& tensor) {
    tensor_ = tensor;
  }

  // Each tensor in the list that getPerParameterTensors corresponds to a
  // parameter.
  std::vector<at::Tensor> getPerParameterTensors() const;

  // Returns whether this bucket is the last bucket to allreduce in an iteration.
  bool isTheLastBucketToAllreduce() const {
    return index_ == 0;
  }

 private:
  size_t index_;
  at::Tensor tensor_;

  // Per-variable info in tensors_[0].
  std::vector<size_t> offsets_;
  std::vector<size_t> lengths_;
  std::vector<c10::IntArrayRef> sizes_vec_;
};

3.3 all_reduce_local_used_map

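Note that what is reduced here is local_used_maps_, the record of which tensors were used, not the gradients of the tensors.

3.3.1 Definition

Let us recall the definition.

The following two variables record which parameters were used locally, that is, whether each parameter was used on this process during the current iteration, or during the current no_sync session when synchronization is disabled (no_sync is on).

Each model replica has one tensor in the map, and each tensor is a one-dimensional int32 tensor whose length is the number of parameters.

These tensors are marked in autograd_hook to indicate that the corresponding parameter has been used. They are allreduced at the end of the backward pass of the current iteration or no_sync session, in order to work out which parameters are globally unused.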

// Locally used parameter maps indicating if parameters are used locally
// during the current iteration or no_sync session if no_sync is on. One
// tensor for each model replica and each tensor is one-dim int32 tensor of
// number of parameters. These tensors are marked in autograd_hook to indicate
// the corresponding param has been used, and get allreduced in the end of
// backward of current iteration or no_sync session for figuring out the
// globally unused parameters.
//
// local_used_maps_:     CPU tensors for bookkeeping locally used params
// local_used_maps_dev_: dev tensors for reducing globally unused params
std::vector<at::Tensor> local_used_maps_; // set in autograd_hook; has a counterpart in the paper
std::vector<at::Tensor> local_used_maps_dev_; // GPU
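
How reducing these maps reveals globally unused parameters can be sketched with plain vectors. The function names below are hypothetical, and the element-wise sum is an assumed stand-in for the allreduce: a parameter whose summed entry is still 0 was not marked by any worker.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One int32 usage map per worker; entry p is nonzero if parameter p was
// used locally. The element-wise sum stands in for the allreduce.
std::vector<std::int32_t> allreduce_used_maps_sketch(
    const std::vector<std::vector<std::int32_t>>& local_used_maps) {
  std::vector<std::int32_t> reduced(local_used_maps.at(0).size(), 0);
  for (const auto& map : local_used_maps) {
    for (std::size_t p = 0; p < reduced.size(); ++p) {
      reduced[p] += map[p];
    }
  }
  return reduced;
}

// A parameter is globally unused when no worker marked it as used.
std::vector<std::size_t> globally_unused_sketch(
    const std::vector<std::int32_t>& reduced) {
  std::vector<std::size_t> unused;
  for (std::size_t p = 0; p < reduced.size(); ++p) {
    if (reduced[p] == 0) {
      unused.push_back(p);
    }
  }
  return unused;
}
```

For example, with worker maps {1, 0, 0} and {1, 1, 0}, parameter 1 is unused locally on the first worker but used globally, while parameter 2 is unused everywhere.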

3.3.2 Synchronization

all_reduce_local_used_map uses an asynchronous H2D copy to avoid blocking overhead: it copies local_used_maps_ into local_used_maps_dev_ and then reduces local_used_maps_dev_.

void Reducer::all_reduce_local_used_map() {
  // See Note [Skip allreducing local_used_maps_dev]
    // H2D from local_used_maps_ to local_used_maps_dev_
    for (size_t i = 0; i < local_used_maps_.size(); i++) {
      if (local_used_maps_dev_[i].is_cuda()) {
        // Note [local_used_maps_ -> local_used_maps_dev copying]
        // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        // We do async H2D to avoid the blocking overhead. The async copy and
        // allreduce respect the current stream, so will be sequenced
        // correctly.
        //
        // Correct sequencing with respect to host operations is also
        // essential. The H2D copy_ is stream ordered, while the host's
        // changes to local_used_maps_ are host ordered. If a large backlog of
        // cuda-stream work pushes the copy_ far into the future, and if no
        // blocking calls occur between now and finalize_backward()** such
        // that finalize_backward() re-zeroes local_used_maps_ on the host
        // before the stream executes the copy_, copy_ will read those zeros
        // instead of the values we thought we told it to read here. Copying
        // local_used_maps_[i] to a pinned temporary (which the pinned caching
        // allocator should supply asynchronously) avoids this nasty, rare
        // race condition.
        //
        // ** In the hoped-for case where all params are used, DDP itself
        // won't do any blocking work between now and the re-zeroing, so the
        // danger is real.
        //
        // Defensively ensures local_used_maps_tmp is distinct from
        // local_used_maps_[i]
        auto local_used_maps_tmp = at::native::empty_like(
            local_used_maps_[i],
            optTypeMetaToScalarType(local_used_maps_[i].options().dtype_opt()),
            local_used_maps_[i].options().layout_opt(),
            local_used_maps_[i].options().device_opt(),
            true /* pinned_memory */);
        // Paranoid asserts here because in some workloads, the pinned
        // allocator behaves in a way we don't understand, and may be bugged.
        // See https://github.com/pytorch/pytorch/pull/54474
        TORCH_INTERNAL_ASSERT(local_used_maps_tmp.is_pinned());
        TORCH_INTERNAL_ASSERT(
            local_used_maps_tmp.data_ptr() != local_used_maps_[i].data_ptr());
        local_used_maps_tmp.copy_(local_used_maps_[i]);
        local_used_maps_dev_[i].copy_(local_used_maps_tmp, true);
      } else {
        local_used_maps_dev_[i].copy_(local_used_maps_[i], true);
      }
    }
    local_used_work_ = process_group_->allreduce(local_used_maps_dev_);
}

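The flow now expands as follows:

Reducer registers autograd_hook on the post_hooks of AccumulateGrad.
During the backward pass, when the autograd engine finds that a parameter's gradient is ready, it calls autograd_hook.
autograd_hook continues the processing.
all_reduce_bucket is called to synchronize the gradients.
allreduce is called to reduce the local_used_maps_ variable.
A finalize_backward callback is registered with the engine.
finalize_backward is called inside GraphTask::exec_post_processing.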

                                                                             +
                                                                  Worker 1   |   Worker 2
                                                                             |
  Engine    AccumulateGrad                Reducer                            |    Reducer
                                                                             |
    +              +                         +                               |        +
    |              |                         |                               |        |
    |              |          1              |                               |        |
    |              | <-----------------------+                               |        |
    |              |                                                         |        |
    |              |                                                         |        |
    |              |                                                         |        |
    |              |                                                         |        |
    |              v                                                         |        |
    |                         2                                              |        |
    |         post_hooks  +-------->  autograd_hook                          |        |
    |                                        +                               |        |
    |                                        |                               |        |
    |                                        |  3                            |        |
    |                                        v                               |        |
    |                     +------------------+---------------------------+   |        |
    |                     | mark_variable_ready                          |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |     All variable in replica are ready?       |   |        |
    |                     |                   +                          |   |        |
    |                     |                   | YES                      |   |        |
    |                     |                   v                          |   |        |
    |                     |     All replica in bucket are ready?         |   |        |
    |                     |                   +                          +   +        |
    |                     |                   | YES            4  all_reduce_bucket   |
    |                     |                   v                                       |
    |                     |            mark_bucket_ready  <--------------+---+----->  |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |                   +                          |   |        |
    |                     |                   |                          |   |        |
    |                     |                   |                          |   |        |
    |                     |                   v                          |   |        |
    |                     |          All buckets are ready?              |   |        |
    |                     |                   +                          |   |        |
    |                     |                   | YES                      +   +        |
    |                     |                   v                     5  allreduce      |
    |   6  queue_back     |          all_reduce_local_used_map  <--------+---+----->  |
    | <------------------------+  queue_callback(finalize_backward)      |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     +-------------------+--------------------------+   |        |
    v                                         |                              |        |
                                              |                              |        |
GraphTask::exec_post_processing               |                              |        |
    +                                         |                              |        |
    |                                         |                              |        |
    |                                         v                              |        |
    +----------------------------->   finalize_backward                      |        |
    |             7                           +                              |        |
    |                                         |                              |        |
    |                                         |                              |        |
    v                                         v                              +        v

3.4 finalize_backward

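finalize_backward does the wrap-up work. Its logic is:

Iterate over the buckets; for each bucket:
- Wait for the reduction of its tensors to complete.
- If a comm hook is registered, copy the future's result back into contents (sparse case) or rebuild bucket_views_out from it (dense case).
Wait for the reduction of local_used_maps_dev_ to complete.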

void Reducer::finalize_backward() {
  // No longer expect autograd hooks to fire after this function returns.
  expect_autograd_hooks_ = false;
  // No longer require call to finalize after this function returns.
  require_finalize_ = false;

  // Unset allreduce division factor, as it may change in next backwards pass
  // when running with DDP join mode.
  divFactor_ = kUnsetDivFactor;

  // Wait for asynchronous reduction to complete and unflatten contents.
  for (auto& bucket : buckets_) { // iterate over the buckets
    // See Note [DDP Communication Hook]
    if (comm_hook_ == nullptr) {
      bucket.work->wait(); // wait for the allreduce to complete
    } else {
      bucket.future_work->wait(); // wait for the hook's future to complete

      auto future_result =
          comm_hook_->parseHookResult(bucket.future_work->value());

      for (size_t i = 0; i < future_result.size(); i++) {
        auto& replica = bucket.replicas[i];
        if (bucket.expect_sparse_gradient) {
          replica.contents.copy_(future_result[i]); // copy the future's result back into contents
        } else {
          // Reinitialize only `bucket_views_out` with the future_result by
          // following the same logic in `initialize_buckets`.
          // Rebuild bucket_views_out as views into future_result[i]
          populate_bucket_views_out(replica, future_result[i]);
        }
      }
    }
    if (!bucket.expect_sparse_gradient) {
      // We don't need to finalize the sparse bucket since the sparse grad and
      // the bucket essentially point to the same storage. As a result, once
      // the allreduce is done, the sparse grads are automatically updated.
      finalize_bucket_dense(bucket);
    }
  }

  // See Note [Skip allreducing local_used_maps_dev]
  if (dynamic_graph_find_unused() || static_graph_first_iteration()) {
    // Due to the lazy wait, it is possible that reduction of the current
    // iteration is still going when the one for next iteration gets kicked off.
    // For such case, we want to wait explicitly to make sure the reduction does
    // complete before kicking off next one. Otherwise the previous one may
    // interfere, write to the device-side memory and clobber the content of
    // local_unused_maps_dev_.
    if (!local_used_maps_reduced_) {
      local_used_work_->wait(); // wait for the reduction of local_used_maps_dev_ to complete
    }
  }

  if (dynamic_graph_find_unused()) {
    // Reset unused parameter accounting.
    // See Note [local_used_maps_ -> local_used_maps_dev copying]
    for (auto& local_used : local_used_maps_) {
      local_used.fill_(0);
    }
    local_used_maps_reduced_ = false;
  }

  if (should_collect_runtime_stats()) {
    record_backward_comm_end_time();
  }
}

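This process uses the following functions.

3.4.1 populate_bucket_views_out

populate_bucket_views_out builds the output views (bucket_views_out) from a flat tensor such as contents or the hook's result.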

// (see Note:  "Gradient Layout Contract" in initialize_buckets).
void Reducer::populate_bucket_views_out(
    Reducer::BucketReplica& replica,
    at::Tensor& tensor) { // rebuild bucket_views_out as views into tensor
  replica.bucket_views_out.clear(); // clear the old views
  for (size_t i = 0; i < replica.variables.size(); i++) { // re-initialize bucket_views_out
    const auto& v = replica.variables[i]; // iterate over the replica's tensors
    const auto offset = replica.offsets[i];
    const auto length = replica.lengths[i];
    if (v.is_non_overlapping_and_dense()) {
      // If the param's memory is dense, match its layout, anticipating
      // the autograd engine (AccumulateGrad) will also create gradients
      // matching its layout.
      replica.bucket_views_out.push_back( // add a view into tensor that matches the parameter's layout
          tensor.as_strided(v.sizes(), v.strides(), offset));
    } else {
      // Fall back to a C-style contiguous view, again anticipating
      // AccumulateGrad will do the same when stashing grads for non-dense
      // params.
      replica.bucket_views_out.push_back( // add a contiguous view into tensor
          tensor.narrow(0, offset, length).view(v.sizes()));
    }
  }
}
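
The idea of per-parameter views into one flat buffer can be sketched without tensors. FlatViewSketch and populate_views_sketch are hypothetical names, and the sketch covers only the contiguous case (the narrow(0, offset, length) branch); it ignores shapes and strides. Each view is a pointer plus a length, so writing through a view changes the flat buffer and no copy is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A non-owning window into the flat bucket buffer.
struct FlatViewSketch {
  double* data;
  std::size_t length;
};

// Build one view per parameter from its offset and length in the flat buffer.
// The views stay valid only while `flat` is not resized or destroyed.
std::vector<FlatViewSketch> populate_views_sketch(
    std::vector<double>& flat,
    const std::vector<std::size_t>& offsets,
    const std::vector<std::size_t>& lengths) {
  assert(offsets.size() == lengths.size());
  std::vector<FlatViewSketch> views;
  views.reserve(offsets.size());
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    assert(offsets[i] + lengths[i] <= flat.size());
    views.push_back(FlatViewSketch{flat.data() + offsets[i], lengths[i]});
  }
  return views;
}
```

Because the views alias the buffer, rebuilding them is enough when the hook returns a new flat tensor: the views are simply pointed at the new storage.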

3.4.2 finalize_bucket_dense

finalize_bucket_dense calls runGradCallbackForVariable or copy_bucket_to_grad to copy the reduced gradients back to the engine.

// A bucket with one or more dense tensors needs to be unflattened.
void Reducer::finalize_bucket_dense(Bucket& bucket) {
  for (size_t replica_index = 0; replica_index < bucket.replicas.size();
       replica_index++) {
    auto& replica = bucket.replicas[replica_index];
    for (size_t intra_bucket_index = 0;
         intra_bucket_index < replica.variables.size();
         intra_bucket_index++) {
      auto& variable = replica.variables[intra_bucket_index];
      const auto offset = replica.offsets[intra_bucket_index];
      const auto length = replica.lengths[intra_bucket_index];

      bool global_unused = false;
      // See Note [Skip allreducing local_used_maps_dev]
      if (static_graph_ || find_unused_parameters_) {
        // Determine if this param has been used globally or not.
        //
        // If the variable was used locally, it is also used globally and then
        // we don't need to wait for the reduction. Otherwise we lazily wait for
        // the reduction to complete, only when we see a variable that was
        // unused locally. Then we end up delaying the synchronization point
        // that local_used_work_->wait() implies. If we don't have any unused
        // parameters at all, we can skip waiting for the work to complete
        // altogether, and cause negligible performance overhead for models
        // where all parameters are used. Such lazily waiting means minimizing
        // performance impact for the big majority of models where all
        // parameters are always used. Then we only pay the overhead cost if
        // there is indeed a parameter that is locally unused, because we need
        // to check if it's also globally unused.
        size_t variable_index = bucket.variable_indices[intra_bucket_index];
        // Note: global_unused might not be global yet. As we lazily wait for
        // the reduction to complete, it becomes really global only if we get to
        // the point as below where we wait for the reduction work, make D2H
        // copy, and update global_unused with the real global consensus, i.e.
        // local_used_maps_reduced_ is true.
        global_unused =
            local_used_maps_[replica_index][variable_index].item<int>() == 0;
        if (global_unused && !local_used_maps_reduced_) {
          // Wait for local_used_maps reduction to complete.
          local_used_work_->wait();
          // D2H from local_used_maps_dev_ to local_used_maps_
          for (size_t i = 0; i < local_used_maps_.size(); i++) {
            // Blocking copy, if local_used_maps_dev_ is cuda
            local_used_maps_[i].copy_(local_used_maps_dev_[i]);
          }
          global_unused =
              local_used_maps_[replica_index][variable_index].item<int>() == 0;
          local_used_maps_reduced_ = true;
        }
      }

      if (!gradient_as_bucket_view_) {
        copy_bucket_to_grad( // copy back into grad (held in the dist autograd context when one is active)
            variable, replica, intra_bucket_index, global_unused);
      } else {
        const auto& bucket_view_out =
            replica.bucket_views_out[intra_bucket_index];
        auto& bucket_view_in = replica.bucket_views_in[intra_bucket_index];
        // If communication_hook is registered, bucket_view_out stores
        // allreduced results in a newly allocated tensor, copy bucket_view_out
        // back to bucket_view_in that referring to replica.content tensor and
        // grad.
        if (!bucket_view_in.is_alias_of(bucket_view_out)) {
          bucket_view_in.copy_(bucket_view_out); // copy from the out view back to the in view
        }
        runGradCallbackForVariable(variable, [&](auto& grad) {
          // If a parameter is globally unused, we keep its grad untouched.
          if (!global_unused) {
            // If grad is globally used but locally unused, let grad point to
            // bucket_view_in
            if (!grad.defined()) {
              grad = bucket_view_in;
            } else {
              if (!grad.is_alias_of(bucket_view_in)) {
                TORCH_CHECK(
                    false,
                    "Detected at least one parameter gradient is not the "
                    "expected DDP bucket view with gradient_as_bucket_view=True. "
                    "This may happen (for example) if multiple allreduce hooks "
                    "were registered onto the same parameter. If you hit this error, "
                    "please file an issue with a minimal repro.");
              }
            }
            // The grad is modified and needs to be written back.
            return true;
          }
          // The grad is not modified.
          return false;
        });
      }
    }
  }
}
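
The lazy wait described in the long comment above can be condensed into a sketch. UsedMapSketch and its members are hypothetical; wait_calls only counts how often the blocking path is taken, where the real code calls local_used_work_->wait() and performs a D2H copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct UsedMapSketch {
  std::vector<int> local_used;   // host-side map, like local_used_maps_
  std::vector<int> reduced_dev;  // allreduce result, like local_used_maps_dev_
  bool reduced = false;          // like local_used_maps_reduced_
  int wait_calls = 0;            // how many times we had to block

  // Returns true if the parameter is unused on every worker.
  bool is_globally_unused(std::size_t index) {
    bool unused = local_used[index] == 0;
    if (unused && !reduced) {
      // Only a locally unused parameter forces us to wait for the global
      // result; locally used parameters are globally used by definition.
      ++wait_calls;              // stands in for local_used_work_->wait()
      local_used = reduced_dev;  // stands in for the blocking D2H copy
      unused = local_used[index] == 0;
      reduced = true;
    }
    return unused;
  }
};
```

When every parameter is used locally, the blocking path is never taken, which is the common case the original comment is optimizing for. The wait happens at most once per backward pass, on the first locally unused parameter.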

3.4.3 copy_bucket_to_grad

This copies from the bucket back into the corresponding gradient held by the autograd engine.

void Reducer::copy_bucket_to_grad(
    at::Tensor& variable,
    Reducer::BucketReplica& replica,
    size_t intra_bucket_index,
    bool global_unused) {
  const auto& bucket_view = replica.bucket_views_out[intra_bucket_index]; // get the output view
  runGradCallbackForVariable(variable, [&](auto& grad) {
    // If a parameter is globally unused, we keep its grad untouched.
    if (!global_unused) {
      if (!grad.defined()) {
        // Creates grad according to the "Gradient Layout Contract"
        // (see torch/csrc/grad/AccumulateGrad.h)
        grad =
            torch::autograd::utils::clone_obey_contract(bucket_view, variable);
      } else {
        grad.copy_(bucket_view); // copy from the bucket back into the gradient
      }
      // The grad is modified and needs to be written back.
      return true;
    }
    // The grad is not modified.
    return false;
  });
}
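
The three outcomes of the callback can be summarized in a sketch on plain vectors. GradSketch and copy_bucket_to_grad_sketch are hypothetical names; the sketch does not model the gradient layout contract, it only shows when the gradient is touched and what the return value signals.

```cpp
#include <cassert>
#include <vector>

// A gradient that may be undefined, like an undefined at::Tensor.
struct GradSketch {
  bool defined = false;
  std::vector<double> values;
};

// Returns true when grad was modified and must be written back.
bool copy_bucket_to_grad_sketch(GradSketch& grad,
                                const std::vector<double>& bucket_view,
                                bool global_unused) {
  if (global_unused) {
    return false;  // globally unused parameter: leave grad untouched
  }
  // Undefined grad: create it from the bucket view.
  // Defined grad: overwrite it with the reduced values.
  grad.values = bucket_view;
  grad.defined = true;
  return true;
}
```

A globally unused parameter therefore keeps an undefined gradient, so nothing downstream sees a gradient for a parameter that no worker touched in this iteration.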

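At this point, the flow expands as follows:

Reducer registers autograd_hook on the post_hooks of AccumulateGrad.
During the backward pass, when the autograd engine finds that a parameter's gradient is ready, it calls autograd_hook.
autograd_hook continues the processing.
all_reduce_bucket is called to synchronize the gradients.
allreduce is called to reduce the local_used_maps_ variable.
A finalize_backward callback is registered with the engine.
finalize_backward is called inside GraphTask::exec_post_processing.
wait is called to synchronize with the other workers.
copy_bucket_to_grad is called to copy from the bucket back into the corresponding gradient in the autograd engine.

We now have the complete picture of how the autograd engine interacts with DDP during the backward pass: the engine computes gradients while DDP reduces them at the same time.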

                                                                             +
                                                                  Worker 1   |   Worker 2
                                                                             |
  Engine    AccumulateGrad                Reducer                            |    Reducer
                                                                             |
    +              +                         +                               |        +
    |              |                         |                               |        |
    |              |          1              |                               |        |
    |              |  <----------------------+                               |        |
    |              |                                                         |        |
    |              |                                                         |        |
    |              v                                                         |        |
    |                         2                                              |        |
    |         post_hooks  +-------->  autograd_hook                          |        |
    |                                        +                               |        |
    |                                        |                               |        |
    |                                        |  3                            |        |
    |                                        v                               |        |
    |                     +------------------+---------------------------+   |        |
    |                     | mark_variable_ready                          |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |     All variables in replica are ready?      |   |        |
    |                     |                   +                          |   |        |
    |                     |                   | YES                      |   |        |
    |                     |                   v                          |   |        |
    |                     |     All replicas in bucket are ready?        |   |        |
    |                     |                   +                          +   +        |
    |                     |                   | YES           4   all_reduce_bucket   |
    |                     |                   v                                       |
    |                     |            mark_bucket_ready  <--------------+---+----->  |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     |                   +                          |   |        |
    |                     |                   |                          |   |        |
    |                     |                   |                          |   |        |
    |                     |                   v                          |   |        |
    |                     |          All buckets are ready?              |   |        |
    |                     |                   +                          |   |        |
    |                     |                   | YES                      +   +        |
    |                     |                   v                    5   allreduce      |
    |   6  queue_callback |          all_reduce_local_used_map  <--------+---+----->  |
    | <------------------------+  queue_callback(finalize_backward)      |   |        |
    |                     |                                              |   |        |
    |                     |                                              |   |        |
    |                     +-------------------+--------------------------+   |        |
    v                                         |                              |        |
                                              |                              |        |
GraphTask::exec_post_processing               |                              |        |
    +                                         |                              |        |
    |                                         |                              |        |
    |              7                          v                              |        |
    +----------------------------->   finalize_backward                      |        |
    |                                         +                 8       wait |        |
    |                                         |  <--------------------------------->  |
    | <-------------------------------------+ |                              |        |
    v         copy_bucket_to_grad     9       v                              +        v
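
The control flow in the diagram can be mimicked with a small pure-Python sketch. This is only an illustration under stated assumptions, not the Reducer's actual code: `ToyReducer` and its fields are hypothetical names, everything runs in one process, and the "allreduce" simply averages the local gradient with peer gradients that are handed in up front. It shows the bookkeeping only: each hook marks one variable ready, buckets are launched in index order once their gradients have all arrived, and the finalize step waits for the launched work and copies the averaged values back.

```python
from typing import Dict, List, Tuple


class ToyReducer:
    """Hypothetical single-process stand-in for the Reducer's bookkeeping."""

    def __init__(self, buckets: List[List[str]], peer_grads: List[Dict[str, float]]):
        self.buckets = buckets
        # Assumption: gradients the other workers would contribute are known up front.
        self.peer_grads = peer_grads
        self.param_to_bucket = {p: i for i, b in enumerate(buckets) for p in b}
        self.pending = [len(b) for b in buckets]       # gradients still missing per bucket
        self.bucket_data: List[Dict[str, float]] = [dict() for _ in buckets]
        self.next_bucket = 0                           # buckets are launched in index order
        self.launched: List[Tuple[int, Dict[str, float]]] = []  # stand-in for async work handles
        self.events: List[str] = []
        self.grads: Dict[str, float] = {}              # stand-in for param.grad

    def autograd_hook(self, name: str, grad: float) -> None:
        # Steps 2-3: the post hook fires once this parameter's gradient is ready.
        self.events.append(f"hook:{name}")
        self.mark_variable_ready(name, grad)

    def mark_variable_ready(self, name: str, grad: float) -> None:
        idx = self.param_to_bucket[name]
        self.bucket_data[idx][name] = grad             # copy the gradient into the bucket
        self.pending[idx] -= 1
        if self.pending[idx] == 0:
            self.mark_bucket_ready()

    def mark_bucket_ready(self) -> None:
        # Launch every ready bucket whose turn has come, so that all workers
        # would issue their collectives in the same order.
        while self.next_bucket < len(self.buckets) and self.pending[self.next_bucket] == 0:
            self.all_reduce_bucket(self.next_bucket)
            self.next_bucket += 1
        if self.next_bucket == len(self.buckets):
            # Step 6: ask the engine to run finalize_backward after the graph task.
            self.events.append("queue_callback")

    def all_reduce_bucket(self, idx: int) -> None:
        # Step 4: simulated allreduce, averaging local and peer gradients.
        world_size = len(self.peer_grads) + 1
        averaged = {
            name: (g + sum(peer[name] for peer in self.peer_grads)) / world_size
            for name, g in self.bucket_data[idx].items()
        }
        self.launched.append((idx, averaged))
        self.events.append(f"allreduce:bucket{idx}")

    def finalize_backward(self) -> None:
        # Steps 7-9: "wait" on each launched bucket, then copy bucket to grad.
        for _, averaged in self.launched:
            self.grads.update(averaged)
        self.events.append("finalize_backward")


# Two buckets; the second bucket's gradients happen to arrive first.
local = {"w1": 1.0, "b1": 2.0, "w2": 3.0, "b2": 4.0}
peer = {"w1": 3.0, "b1": 4.0, "w2": 5.0, "b2": 6.0}
reducer = ToyReducer(buckets=[["w2", "b2"], ["w1", "b1"]], peer_grads=[peer])
for name in ["w1", "b1", "w2", "b2"]:
    reducer.autograd_hook(name, local[name])
reducer.finalize_backward()
print(reducer.events)
print(reducer.grads)
```

In this run bucket 1 becomes ready before bucket 0, yet its simulated allreduce is only launched after bucket 0's, which is the ordering constraint the sketch is meant to make visible.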

This concludes the analysis of backward propagation, and with it the analysis of DDP as a whole. Next we turn to distributed autograd.