[原始碼解析] 深度學習分散式訓練框架 horovod (6) --- 後臺執行緒架構

羅西的思考發表於2021-06-24

原始碼深度學習分散式框架執行緒架構

[原始碼解析] 深度學習分散式訓練框架 horovod (6) --- 後臺執行緒架構

[原始碼解析] 深度學習分散式訓練框架 horovod (6) --- 後臺執行緒架構

0x00 摘要

Horovod 是Uber於2017年釋出的一個易於使用的高效能的分散式訓練框架，在業界得到了廣泛應用。

本系列將通過原始碼分析來帶領大家瞭解 Horovod。本文是系列第六篇，看看 Horovod 後臺執行緒架構。

前面幾篇連結如下：

[原始碼解析] 深度學習分散式訓練框架 Horovod (1) --- 基礎知識

[原始碼解析] 深度學習分散式訓練框架 horovod (2) --- 從使用者角度切入

[原始碼解析] 深度學習分散式訓練框架 horovod (3) --- Horovodrun背後做了什麼

[原始碼解析] 深度學習分散式訓練框架 horovod (4) --- 網路基礎 & Driver

[原始碼解析] 深度學習分散式訓練框架 horovod (5) --- 融合框架

0x01 引子

在前文我們看到，當訓練時，Execution Thread 會通過一系列操作，把 Tensor & Operation 傳遞給後臺執行緒，其流程大致如下：

   IndexedSlices
        +
        |
        |
        v
     allreduce
        +
        |
        |
        v
     allgather
        +
        |
        |
        v
  HorovodAllgather
        +
        |
        v
  HorovodAllgatherOp
        +
        |
        |
        v
 EnqueueTensorAllgather
        +
        |
        |
        v
+-------+-------------+
| HorovodGlobalState  |
|                     |
|      message_queue  |
|                     |
|       tensor_table  |
|                     |
+---------------------+

或者如下圖，左面是執行執行緒，就是訓練執行緒，右面是後臺執行緒，用來做 ring-allreduce：

我們下面繼續看看後臺是如何運作的。

0x02 設計要點

2.1 問題

因為計算框架往往採用多執行緒執行訓練的計算圖，所以在多節點情況下，拿allreduce操作來舉例，我們不能保證每個節點上的 allreduce 請求是有序的。因此MPI_Allreduce並不能直接用。

2.2 方案

為了解決這個問題，hvd 設計了一個主從模式，rank 0 為 master 節點，rank 1 ~ rank n 為 worker 節點。

master 節點進行同步協調，保證對於某些 tensor 的 allreduce 請求最終有序 & 完備，可以繼續處理。
在決定了哪些 tensor 以後，master又會將可以進行通訊的tensor 名字和順序發還給各個節點。
當所有的節點都得到了即將進行的MPI的tensor和順序，MPI通訊得以進行。

首先回顧下同步梯度更新這個概念，其表示的是等待所有Rank的梯度都計算完畢後，再統一做全域性梯度累加，這就涉及到在叢集中做訊息通訊，為此HVD做了兩個方面的工作。

在Horovod中，每張卡都對應一個訓練程式，稱之為rank。如4張卡，對應的各個程式的rank則為 [0,1,2,3]。
協調工作：HVD裡面將 Rank0 作為coordinator（master），其餘的程式為worker。由Rank0來協調所有Rank的進度。
後臺執行緒：為了不block正常OP的計算，HVD裡面建立 background communication 執行緒，專門用來Rank間的訊息同步和AllReduce操作。

在 Horovod 中，訓練程式是平等的參與者，每個程式既負責梯度的分發，也負責具體的梯度計算。如下圖所示，三個 Worker 中的梯度被均衡地劃分為三份，通過 4 次通訊，能夠完成叢集梯度的計算和同步。

2.3 協調

2.3.1 設計

對於協調的過程，文件中也有非常詳細的講述，我也一起翻譯。

coordinator 目前採用master-worker paradigm。Rank 0 作為master（即 "coordinator"），其他的rank是 worker。每個 rank 在自己的後臺執行緒中執行，時間片迴圈排程處理。在每個時間片中會進行如下操作：

Workers 會傳送 MPIRequests 給 coordinator。MPIRequests 顯式註明 worker 希望做什麼（比如在哪個 tensor 上做什麼操作，是 gather 還是 reduce，以及 tensor 的形狀和型別）。在 tensor 的 collective op 已經執行完 ComputeAsync 之後，worker 就會對於每個 tensor 傳送MPIRequest。

當沒有更多處理的 tensors 之後，workers 會向 coordinator 傳送一個空的 "DONE" 訊息；

coordinator 從 worker 收到 MPIRequests 以及 coordinator本身的 TensorFlow ops 之後，將它們儲存在請求表中（request table）。協調器繼續接收MPIRequest，直到收到了MPI_SIZE 個 "DONE" 訊息；

Coordinator 收集所有準備縮減，gather 的張量，或所有導致錯誤的操作。對於每一個向量或者操作。Coordinator 向所有工作人員傳送MPIResponse。當沒有更多的MPIResponse時，Coordinator將向工人傳送“完成”響應。如果程式正在關閉，它將傳送一個“shutdown”響應。

Workers 監聽MPIResponse訊息，逐個做所要求的reduce或gather操作，直到他們收到"DONE" resposne。此時，時間片結束。如果接收到的不是“DONE”，而是“SHUTDOWN”，則退出background loop

簡單來講就是：

Coordinator 收集所有 worker（包括Coordinator自己，因為自己也在進行訓練）的MPIRequests，把他們放入request table。
當收集到 MPI_SIZE 個 "DONE" 訊息之後，Coordinator 會找出就緒的 tensor （在 message_table 裡面查詢）構造出一個 read_to_reduce 的列表，然後發出 size 個 MPIResponse 告知程式進行計算。
worker 接受到 response 開始真正的計算過程(通過 op_manager 具體執行)。
這是整體同步的過程，如果開啟 horovod 的 trace log(HOROVOD_LOG_LEVEL=trace) 就能看到同步的過程。

2.3.2 實現

我們再具體看看實現。

在Horovod中，每張卡都對應一個訓練程式，稱之為rank。如4張卡，對應的各個程式的rank則為[0,1,2,3]。

hvd 設計了一個主從模式，將 Rank0 作為coordinator（master），其餘的程式為worker，由Rank0來協調所有Rank的進度。每個worker節點上都有一個訊息佇列，而在master節點上除了一個訊息佇列，還有一個訊息map。

每當計算框架發來通訊請求時，hvd並不直接執行MPI，而是封裝了這個訊息並推入自己的訊息佇列。

整體採用訊息的 Request 和 Response 機制；
當某個 OP 的 gradient 計算完成並且等待全域性的 AllReduce，該 Rank 就會包裝一個 Request 請求，呼叫 ComputeResponseList 將 Request （就是說，這是個 ready tensor）放入這個 rank 的 message_queue 中，每個 Rank 的後臺執行緒定期輪訓自己的 message_queue，然後把 queue 裡面的 request 傳送到 Rank 0。因為是同步MPI，所以每個節點會阻塞等待MPI完成。
Rank 0 擁有 message_table，用來儲存其他 rank 的 request 資訊，rank 0 會處理 message_table 裡面所有的 request。
當 rank 0 收到所有 rank 對於某個 op allreduce 的 request 之後，就說明這個 tensor 在所有的rank中都已經ready。說明所有的節點都已經發出了對該tensor的通訊請求，那這個tensor就需要且能夠進行通訊。
決定了tensor以後，master又會將可以進行通訊的tensor 名字和順序發還給各個節點。
- Rank 0 節點會挑選出所有符合要求的tensor進行MPI通訊：
- 不符合要求的tensor繼續留在訊息map中，等待條件符合。
- 當有符合要求的 tensor，Rank 0 然後就會傳送 Response 給其他 rank，表明當前 op & tensor 的所有區域性梯度已經 Ready，可以對這個tensor執行collective操作，比如可以執行 allReduce 操作。
至此，所有的節點都得到了即將進行的MPI的tensor和順序，MPI通訊得以進行。

大致邏輯如下：

           Rank 0                          Rank 1         Rank 2
             +                               +              +
             |                               |              |
             |                               |              |
             |                               |              |
             +            Tensor 1 request   |              |
        message_table  <---------------------+              |
             +                               |              |
             |                               |              |
             |                               |              |
             v                               |              |
                                             |              |
 message_table[tensor 1]                     |              |
             +                               |              |
             |                               |              |
             |           Tensor 1 request    |              |
             | <--------------------------------------------+
             +                               |              |
 message_table[tensor 1, tensor 1]           |              |
             +                               |              |
             |                               |              |
             |           Tensor 1 request    |              |
             +-------------------------+     |              |
             |                         |     |              |
             |                         |     |              |
             | <-----------------------+     |              |
             |                               |              |
             v                               |              |
message_table[tensor 1, tensor 1, tensor 1]  |              |
             +                               |              |
             |                               |              |
             |                               |              |
             |          Tensor 1 response    |              |
             +-----------------------------> |              |
             |                               |              |
             |          Tensor 1 response    |              |
             +--------------------------------------------> |
             |                               |              |
             |  Tensor 1 response            |              |
             +-------------------------v     |              |
             |                         |     |              |
             |                         |     |              |
             | <-----------------------+     |              |
             |                               |              |
             |                               |              |
             v                               v              v

2.4 Background Thread

每個rank有兩個thread，我們通常在python檔案中使用hvd.init()來初始化hvd，實際上是開了一個後臺執行緒和一個MPI執行緒。

Execution thread（MPI執行緒）是用來做機器學習計算的。
background thread 是 rank 之間同步通訊和做allreduce操作的。百度在設計時候，就有了一個MPI background thread，Horovod沿用了這個設計，名字就是BackgroundThreadLoop。

2.4.1 設計

關於設計的思考，百度在原始碼註釋（tensorflow-allreduce-master/tensorflow/contrib/mpi_collectives/mpi_ops.cc）裡面寫的非常清楚，我大致翻譯出來。

MPI background thread 是為了協調所有的 MPI 程式和tensor reduction。這個設計是處於幾個考慮：

一些MPI實現要求所有的MPI呼叫必須在一個單獨執行緒中。因為 Tensorflow 在處理圖的時候可能會用到幾個執行緒，所以我們必須使用自己的特定的執行緒來處理MPI；

對於某些錯誤（比如不匹配的types），MPI 有時候會沒有一個確定的處理方式，但是我們還想優雅的處理這些錯誤。為了做到優雅處理，就要求 MPI 程式需要知道其他程式上tensor的形狀和型別；

MPI reductions and gathers 也許會和其他操作一起並行處理。因為 MPI 使用一個與TF GPUDevice streams分離的內部（inaccessible）的GPU stream，我們不能顯式進行同步memcpys或者kernels。因此，MPIAllreduce and MPIAllgather 必須是 AsyncOpKernels 型別以便確保memcpys或者kernels的合理順序；

注意：我們無法確保所有的MPI程式以同樣的順序reduce他們的tensors。因此，必須有一個辦法來確保可以同時跨越所有的ranks來做reduction memcpys and kernels。我們使用 rank ID 0 作為 coordinator 來協調那些已經準備好的，可以執行的操作（gather and trigger the reduction operations）；

精簡下：

一些MPI的實現機制要求所有的MPI呼叫必須在一個單獨執行緒中。
為了處理錯誤，MPI 程式需要知道其他程式上tensor的形狀和型別。
MPIAllreduce and MPIAllgather 必須是 AsyncOpKernels 型別以便確保memcpys或者kernels的合理順序。

因此，一個後臺執行緒是有必要的。horovod_global.message_queue 以及 horovod_global.tensor_table 都是在Horovod的後臺執行緒BackgroundThreadLoop 中被處理的。

2.4.2 實現

在底層，AllReduce 被註冊為 Op，在 ComputeAsync 中，計算請求被入隊到一個佇列中。這一佇列會被一個統一的後臺執行緒處理。

在這個後臺執行緒的初始化過程中，它會利用程式內共享的全域性狀態在自己的記憶體裡建立一些物件，以及一些邏輯判斷。比如要不要進行 Hierarchical AllReduce，要不要 AutoTune等。這裡是初始化階段的日誌。

在初始化的過程中，有一些比較重要的物件會被構造出來，比如各種 Controller。

我們接下來就具體分析後臺執行緒。

0x03 輔助功能

我們首先介紹一些輔助功能。

3.1 如何判斷是 coordinator

因為後臺執行緒程式碼是所有worker公用，所以需要區分 rank0 還是其他 worker，從而執行不同的程式碼流程。

這裡採用 is_coordinator 用來判斷是否是 Rank0。

is_coordinator_ 的賦值如下：

void MPIController::DoInitialization() {
  ......

  // Get MPI rank to determine if we are rank zero.
  MPI_Comm_rank(mpi_ctx_.mpi_comm, &rank_);
  is_coordinator_ = rank_ == 0;

is_coordinator_ 的使用方式示例如下，可以看出來，在同步引數的時候，是從 rank 0 獲取引數，然後廣播給其他 rank，即 workers：

void Controller::SynchronizeParameters() {
  ParameterManager::Params param;
  if (is_coordinator_) { // rank 0 執行操作
    param = parameter_manager_.GetParams();
  }

  void* buffer = (void*)(&param);
  size_t param_size = sizeof(param);
  Bcast(buffer, param_size, 0, Communicator::GLOBAL);

  if (!is_coordinator_) { // worker 執行操作
    parameter_manager_.SetParams(param);
  }
}

3.2 協調快取&資訊

在 ComputeResponseList 函式中，會使用以下程式碼來協調快取，作用就是整理出來所有 rank 共有的 tensor。

CoordinateCacheAndState(cache_coordinator);

主要還是用到了cache_coordinator 操作。

void Controller::CoordinateCacheAndState(CacheCoordinator& cache_coordinator) {
  // Sync cache and state information across workers.
  cache_coordinator.sync(shared_from_this(), timeline_enabled_);
}

3.2.1 計算共有 tensor

CoordinateCacheAndState 函式如下：

每個worker都整理自己的bitvector;
使用 CrossRankBitwiseAnd 整理出來共有的 tensor；
使用 CrossRankBitwiseOr 整理出來共有的無效 tensor；

void CacheCoordinator::sync(std::shared_ptr<Controller> controller,
                            bool timeline_enabled) {

  // Resize and initialize bit vector.
  int nbits = num_active_bits_ + NUM_STATUS_BITS;
  int count = (nbits + sizeof(long long) * CHAR_BIT - 1) /
              (sizeof(long long) * CHAR_BIT);

  ......

  // 每個worker都整理自己的bitvector
  // For each cache hit on this worker, flip associated bit in bit vector.
  for (auto bit : cache_hits_) {
    int shifted_bit = bit + NUM_STATUS_BITS;
    int shift = shifted_bit / (sizeof(long long) * CHAR_BIT);
    bitvector_[shift] |=
        (1ull << (shifted_bit % (sizeof(long long) * CHAR_BIT)));
    if (timeline_enabled) {
      // Set corresponding bit in extended section for timeline if needed.
      bitvector_[count + shift] ^=
          (1ull << (shifted_bit % (sizeof(long long) * CHAR_BIT)));
    }
  }

  // 整理出來共有的 tensor
  // Global AND operation to get intersected bit array.
  controller->CrossRankBitwiseAnd(bitvector_, fullcount);

  // Search for flipped bits to populate common cache hit set. There will never
  // be invalid bits in this set.
  cache_hits_.clear();
  for (int i = 0; i < count; ++i) {
    int shift = i * sizeof(long long) * CHAR_BIT;
    long long ll = bitvector_[i];
    while (ll) {
      int idx = __builtin_ffsll(ll);
      int shifted_bit = shift + idx - 1;
      cache_hits_.insert(shifted_bit - NUM_STATUS_BITS);
      ll &= ~(1ull << (idx - 1));
    }
  }

  ......

  // If any worker has invalid cache entries, communicate invalid bits across
  // workers using a second bit-wise allreduce operation.
  if (invalid_in_queue_) {
    std::memset(&bitvector_[0], 0, count * sizeof(long long));
    for (auto bit : invalid_bits_) {
      int shift = bit / (sizeof(long long) * CHAR_BIT);
      bitvector_[shift] |= (1ull << (bit % (sizeof(long long) * CHAR_BIT)));
    }

    // Global OR operation to get common invalid bits.
    controller->CrossRankBitwiseOr(bitvector_, count);
    // Search for flipped bits to populate common invalid bit set.
    invalid_bits_.clear();
    for (int i = 0; i < count; ++i) {
      int shift = i * sizeof(long long) * CHAR_BIT;
      long long ll = bitvector_[i];
      while (ll) {
        int idx = __builtin_ffsll(ll);
        int bit = shift + idx - 1;
        invalid_bits_.insert(bit);
        ll &= ~(1ull << (idx - 1));
      }
    }
  }

  synced_ = true;
}

3.2.2 MPI操作

CrossRankBitwiseAnd 作用是呼叫 MPI 歸併共有的 bitvector。

void MPIController::CrossRankBitwiseAnd(std::vector<long long>& bitvector,
                                        int count) {
  int ret_code = MPI_Allreduce(MPI_IN_PLACE, bitvector.data(), count,
                               MPI_LONG_LONG_INT, MPI_BAND, mpi_ctx_.mpi_comm);
}

3.3 MPIContext

mpi_context 是在載入 C++ 的程式碼時候就已經建立了，同時建立的還有其他 context（ nccl_context, gpu_context），主要是維護一些節點上 mpi 通訊的必要環境資訊和設定，如：

3 個 MPI communicator，mpi_comm，local_comm，cross_comm 分別負責 horovod mpi 傳輸，節點內傳輸，和節點間分層傳輸（主要用於 hierarchical allreduce）。
mpi_float16_t ：horovod 主要以 float16 傳輸。
mpi_float16_sum： float16 對應的sum 操作。

在 horovod 使用 mpi 的時候，都會使用上面的 communicator 進行資料傳輸。

3.4 Parameter_manager

Parameter_manager 主要是 GlobalState 的一個用於管理一些調節 horovod 效能的引數的管理器，在 BackgroundThreadLoop 中跟其他的 GlobalState 的元素一同初始化，然後會讀取下面這些對應的環境變數，然後進行設定。

HOROVOD_FUSION_THRESHOLD ：指傳輸資料切片的大小，預設是64M，如果切片太大，傳輸的時候就不能很好地 pipeline 傳輸，如果太小，一個 tensor 需要傳輸多次，增加 IO 的 overhead。
HOROVOD_CYCLE_TIME ：指 RunLoopOnce 的睡眠時長，預設是 5ms，比較理想的睡眠時間應該是 RunLoopOnce 其餘邏輯處理的時間 + HOROVOD_CYCLE_TIME 剛好等於一次前向傳播和後向傳播所用的時間，因為睡太久前端會在等 RunLoopOnce 睡醒；如果睡太短，不斷地跑一次 RunLoopOnce，tensor_queue 也不會有新的元素，只是白跑。
HOROVOD_CACHE_CAPACITY：指 cache 的大小，這個可能跟 model 層數引數量相關了。
HOROVOD_HIERARCHICAL_ALLGATHER：是否使用分層的 allgather 的方式等

Parameter_manager 也提供了對這些引數自動調節的功能。通過 Parameter_manager.SetAutoTuning 進行設定，設定後會在初始的幾個 batch 嘗試不同的引數組合進行通訊，後面會收斂到一組最優的引數值。

0x04 總體程式碼

4.1 後臺執行緒

BackgroundThreadLoop 是訓練過程中的後臺執行緒，主要負責跟其他節點的通訊，和處理前端過來的通訊需求（request），會輪詢呼叫 RunLoopOnce，不斷檢視 tensor_queue 中有沒有需要通訊的tensor，如果有跟其他節點同步更新，然後執行通訊操作。

在 BackgroundThreadLoop 函式可以看到基本邏輯：

依據編譯配置，決定如何初始化，比如 mpi_context.Initialize 只有在 MPI 編譯時候才初始化。
初始化 controller，會根據載入的集合通訊庫（mpi 或者 gloo）為 globalstate 建立對應的 controller；
得到各種配置，比如 local_rank；
設定 background thread affinity；
設定 GPU stream；
設定 timeline 配置；
設定 Tensor Fusion threshold，cycle time，response cache capacity，flag for hierarchical allreduce.....；
設定 auto-tuning， chunk size；
重置 operation manager；
進入關鍵程式碼 RunLoopOnce；

縮減版程式碼如下：

BackgroundThreadLoop(HorovodGlobalState& state) {
 ......

#if HAVE_MPI
  // Initialize mpi context
#if HAVE_DDL
  // If DDL is enabled, let DDL ops manage MPI environment.
  auto mpi_ctx_manager = DDL_MPIContextManager(ddl_context, gpu_context);
#else
  // Otherwise, let MPI ops be in charge.
  auto mpi_ctx_manager = MPIContextManager();
#endif
  // mpi_context 會根據前端和環境變數傳過來的資訊，建立 mpi 執行緒，和一些 mpiOps 
  mpi_context.Initialize(state.controller->GetRanks(), mpi_ctx_manager);
#endif

  ......
    
  // 會同步不同 node 的 global_size, local_size, rank, is_coordinator 等資訊  
  // Initialize controller
  state.controller->Initialize();

  int local_size = state.controller->GetLocalSize();
  int local_rank = state.controller->GetLocalRank();

  ......
    
  // 設定op_manager，這裡主要是註冊不同的集合通訊庫的 ops  
  op_manager.reset(CreateOperationManager(state));

  // Signal that initialization is completed.
  state.initialization_done = true;

  // Iterate until shutdown.
  try {
    while (RunLoopOnce(state));
  } catch (const std::exception& ex) {
    LOG(ERROR) << "Horovod background loop uncaught exception: " << ex.what();
  }
}

4.2 哪裡建立環

也許大家會有疑問，既然 Horovod 是 ring Allreduce，但是究竟是在哪裡建立了環？我們選幾種實現來大致看看。因為如果細緻研究就需要深入MPI，gloo等，這已經超出了本文範疇，所以我們只是大致瞭解。

4.2.1 NCCL 呼叫

我們首先看看 NCCL。

4.2.1.1 NCCL

NCCL是Nvidia Collective multi-GPU Communication Library的簡稱，它是一個實現多GPU的collective communication通訊（all-gather, reduce, broadcast）庫，Nvidia做了很多優化，以在PCIe、Nvlink、InfiniBand上實現較高的通訊速度。

4.2.1.2 Horovod

在 NCCLAllreduce::Execute 我們可以看到，呼叫了ncclAllReduce，這是 nccl 的 API，因此我們可以推斷，其引數 *nccl_op_context_.nccl_comm_應該是關鍵。

Status NCCLAllreduce::Execute(std::vector<TensorTableEntry>& entries,
                              const Response& response) {

  // Do allreduce.
  auto nccl_result = ncclAllReduce(fused_input_data, buffer_data,
                                   (size_t) num_elements,
                                   GetNCCLDataType(first_entry.tensor), ncclSum,
                                   *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream);
}

nccl_op_context_ 是 NCCLOpContext 型別，NCCLOpContext 簡化版定義如下：

class NCCLOpContext {
public:
  void InitNCCLComm(const std::vector<TensorTableEntry>& entries,
                    const std::vector<int32_t>& nccl_device_map);

  ncclComm_t* nccl_comm_;
};

所以我們來看其引數 nccl_comm_是如何初始化的，可以看到其呼叫了 ncclCommInitRank 進行初始化。

void NCCLOpContext::InitNCCLComm(const std::vector<TensorTableEntry>& entries,
                                 const std::vector<int32_t>& nccl_device_map) {
  // Ensure NCCL communicator is in the map before executing operation.
  ncclComm_t& nccl_comm = nccl_context_->nccl_comms[global_state_->current_nccl_stream][nccl_device_map];
  if (nccl_comm == nullptr) {
    auto& timeline = global_state_->timeline;
    timeline.ActivityStartAll(entries, INIT_NCCL);

    int nccl_rank, nccl_size;
    Communicator nccl_id_bcast_comm;
    // 獲取rank相關資訊
    PopulateNCCLCommStrategy(nccl_rank, nccl_size, nccl_id_bcast_comm);

    ncclUniqueId nccl_id;
    global_state_->controller->Bcast((void*)&nccl_id, sizeof(nccl_id), 0,
                                         nccl_id_bcast_comm);

    ncclComm_t new_nccl_comm;
    // 這裡呼叫了nccl，傳遞了rank資訊
    auto nccl_result = ncclCommInitRank(&new_nccl_comm, nccl_size, nccl_id, nccl_rank);
    nccl_context_->ErrorCheck("ncclCommInitRank", nccl_result, nccl_comm);
    nccl_comm = new_nccl_comm;

    // Barrier helps NCCL to synchronize after initialization and avoid
    // deadlock that we've been seeing without it.
    global_state_->controller->Barrier(Communicator::GLOBAL);
    timeline.ActivityEndAll(entries);
  }

  nccl_comm_ = &nccl_comm;
}

PopulateNCCLCommStrategy就是從全域性狀態中獲取rank資訊。

void NCCLOpContext::PopulateNCCLCommStrategy(int& nccl_rank, int& nccl_size,
                                             Communicator& nccl_id_bcast_comm) {
  if (communicator_type_ == Communicator::GLOBAL) {
    nccl_rank = global_state_->controller->GetRank();
    nccl_size = global_state_->controller->GetSize();
  } else if (communicator_type_ == Communicator::LOCAL) {
    nccl_rank = global_state_->controller->GetLocalRank();
    nccl_size = global_state_->controller->GetLocalSize();
  } else {
    throw std::logic_error("Communicator type " + std::to_string(communicator_type_) +
                            " is not supported in NCCL mode.");
  }
  nccl_id_bcast_comm = communicator_type_;
}

於是我們得去 NCCL 原始碼中看看。

4.2.1.3 In NCCL

在 init.cc 中可以看到

NCCL_API(ncclResult_t, ncclCommInitRank, ncclComm_t* newcomm, int nranks, ncclUniqueId commId, int myrank);
ncclResult_t ncclCommInitRank(ncclComm_t* newcomm, int nranks, ncclUniqueId commId, int myrank) {
  NVTX3_FUNC_RANGE_IN(nccl_domain);
  int cudaDev;
  CUDACHECK(cudaGetDevice(&cudaDev));
  // 這裡初始化
  NCCLCHECK(ncclCommInitRankDev(newcomm, nranks, commId, myrank, cudaDev));
  return ncclSuccess;
}

繼續看，呼叫了 ncclAsyncInit 來完成最後初始化，傳入了總體rank數目，程式自身的myrank。

static ncclResult_t ncclCommInitRankDev(ncclComm_t* newcomm, int nranks, ncclUniqueId commId, int myrank, int cudaDev) {
  ncclResult_t res;
  char* env = getenv("NCCL_COMM_ID");

  NCCLCHECKGOTO(ncclInit(), res, end);
  // Make sure the CUDA runtime is initialized.
  CUDACHECKGOTO(cudaFree(NULL), res, end);
  NCCLCHECKGOTO(PtrCheck(newcomm, "CommInitRank", "newcomm"), res, end);

  if (ncclAsyncMode()) {
    // 呼叫了 ncclAsyncInit 來完成最後初始化，傳入了總體rank數目，程式自身的myrank
    NCCLCHECKGOTO(ncclAsyncInit(ncclCommInitRankSync, newcomm, nranks, commId, myrank, cudaDev), res, end);
  } else {
    NCCLCHECKGOTO(ncclCommInitRankSync(newcomm, nranks, commId, myrank, cudaDev), res, end);
  }

end:
  if (ncclAsyncMode()) return ncclAsyncErrCheck(res);
  else return res;
}

ncclComm_t 實際是 ncclComm 的typedef，因此我們看看ncclComm定義，其中就包括了總體rank數目，程式自身的myrank。

struct ncclComm {
  struct ncclChannel channels[MAXCHANNELS];
  ... 
  // Bitmasks for ncclTransportP2pSetup
  int connect;
  uint32_t* connectSend;
  uint32_t* connectRecv;

  int rank;    // my rank in the communicator
  int nRanks;  // number of GPUs in communicator
  int cudaDev; // my cuda device index
  int64_t busId;   // my PCI bus ID in int format

  int node;
  int nNodes;
  int localRanks;

  // Intra-process sync
  int intraRank;
  int intraRanks;
  int* intraBarrier;
  int intraPhase;
  ....
};

因此，我們大致可以瞭解，horovod 把 rank 資訊傳進來，NCCL 會據此組環。

4.2.2 GLOO

在 GlooContext::Initialize 之中可以看到，Horovod 通過 Rendezvous 把 rank 資訊發給了 Rendezvous Server。

Gloo 內部會進行組環。

其中，cross_rank 是hierarchical allreduce所需要的。

void GlooContext::Initialize(const std::string& gloo_iface) {

  attr device_attr;
  device_attr.iface = gloo_iface;

  device_attr.ai_family = AF_UNSPEC;
  auto dev = CreateDevice(device_attr);
  auto timeout = GetTimeoutFromEnv();

  auto host_env = std::getenv(HOROVOD_HOSTNAME);
  std::string hostname = host_env != nullptr ? std::string(host_env) : std::string("localhost");

  int rank = GetIntEnvOrDefault(HOROVOD_RANK, 0);
  int size = GetIntEnvOrDefault(HOROVOD_SIZE, 1);
  int local_rank = GetIntEnvOrDefault(HOROVOD_LOCAL_RANK, 0);
  int local_size = GetIntEnvOrDefault(HOROVOD_LOCAL_SIZE, 1);
  int cross_rank = GetIntEnvOrDefault(HOROVOD_CROSS_RANK, 0);
  int cross_size = GetIntEnvOrDefault(HOROVOD_CROSS_SIZE, 1);

  auto rendezvous_addr_env = std::getenv(HOROVOD_GLOO_RENDEZVOUS_ADDR);
  auto rendezvous_port = GetIntEnvOrDefault(HOROVOD_GLOO_RENDEZVOUS_PORT, -1);

  bool elastic = GetBoolEnvOrDefault(HOROVOD_ELASTIC, false);
  if (elastic && reset_) {
    std::string server_addr = rendezvous_addr_env;
    std::string scope = HOROVOD_GLOO_GET_RANK_AND_SIZE;
    HTTPStore init_store(server_addr, rendezvous_port, scope, rank);

    auto key = hostname + ":" + std::to_string(local_rank);
    std::vector<char> result = init_store.get(key);
    std::string s(result.begin(), result.end());
    std::stringstream ss(s);

    int last_rank = rank;
    int last_size = size;
    int last_local_rank = local_rank;
    int last_local_size = local_size;
    int last_cross_rank = cross_rank;
    int last_cross_size = cross_size;

    rank = ParseNextInt(ss);
    size = ParseNextInt(ss);
    local_rank = ParseNextInt(ss);
    local_size = ParseNextInt(ss);
    cross_rank = ParseNextInt(ss);
    cross_size = ParseNextInt(ss);

    SetEnv(HOROVOD_RANK, std::to_string(rank).c_str());
    SetEnv(HOROVOD_SIZE, std::to_string(size).c_str());
    SetEnv(HOROVOD_LOCAL_RANK, std::to_string(local_rank).c_str());
    SetEnv(HOROVOD_LOCAL_SIZE, std::to_string(local_size).c_str());
    SetEnv(HOROVOD_CROSS_RANK, std::to_string(cross_rank).c_str());
    SetEnv(HOROVOD_CROSS_SIZE, std::to_string(cross_size).c_str());
  }

  // 設定了不同的 Rendezvous server
  ctx = Rendezvous(HOROVOD_GLOO_GLOBAL_PREFIX,
                   rendezvous_addr_env, rendezvous_port,
                   rank, size, dev, timeout);

  local_ctx = Rendezvous(HOROVOD_GLOO_LOCAL_PREFIX + hostname,
                         rendezvous_addr_env, rendezvous_port,
                         local_rank, local_size, dev, timeout);

  cross_ctx = Rendezvous(HOROVOD_GLOO_CROSS_PREFIX + std::to_string(local_rank),
                         rendezvous_addr_env, rendezvous_port,
                         cross_rank, cross_size, dev, timeout);
}

4.2.3 MPI

MPIContext::Initialize 中可以看到，在這會設定各種 rank。

void MPIContext::Initialize(const std::vector<int>& ranks,
                            MPIContextManager& ctx_manager) {

  auto mpi_threads_disable = std::getenv(HOROVOD_MPI_THREADS_DISABLE);
  int required = MPI_THREAD_MULTIPLE;
  if (mpi_threads_disable != nullptr &&
      std::strtol(mpi_threads_disable, nullptr, 10) > 0) {
    required = MPI_THREAD_SINGLE;
  }
  int is_mpi_initialized = 0;
  MPI_Initialized(&is_mpi_initialized);
  if (is_mpi_initialized) {
    int provided;
    MPI_Query_thread(&provided);
  } else {
    // MPI environment has not been created, using manager to initialize.
    ctx_manager.EnvInitialize(required);
    should_finalize = true;
  }

  if (!ranks.empty()) {
    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group work_group;
    MPI_Group_incl(world_group, ranks.size(), ranks.data(), &work_group);
    MPI_Comm_create_group(MPI_COMM_WORLD, work_group, 0, &(mpi_comm));
    if (mpi_comm == MPI_COMM_NULL) {
      mpi_comm = MPI_COMM_WORLD;
    }
    MPI_Group_free(&world_group);
    MPI_Group_free(&work_group);
  } else if (!mpi_comm) {
    // No ranks were given and no communicator provided to horovod_init() so use
    // MPI_COMM_WORLD
    MPI_Comm_dup(MPI_COMM_WORLD, &mpi_comm);
  }

  // Create local comm, Determine local rank by querying the local communicator.
  MPI_Comm_split_type(mpi_comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &local_comm);

  // Get local rank and world rank for cross comm establishment.
  int local_rank, world_rank;
  MPI_Comm_rank(mpi_comm, &world_rank);
  MPI_Comm_rank(local_comm, &local_rank);

  // Create cross node communicator.
  MPI_Comm_split(mpi_comm, local_rank, world_rank, &cross_comm);

  // Create custom MPI float16 data type.
  MPI_Type_contiguous(2, MPI_BYTE, &mpi_float16_t);
  MPI_Type_commit(&mpi_float16_t);

  // Create custom MPI float16 summation op.
  MPI_Op_create(&float16_sum, 1, &mpi_float16_sum);
}

0x05 業務邏輯

我們具體看看業務邏輯。

5.1 RunLoopOnce 總體業務

RunLoopOnce 負責總體業務邏輯，其功能如下：

計算是否還需要sleep，即檢查從上一個cycle開始到現在，是否已經超過一個cycle時間；
利用 ComputeResponseList 來讓 rank 0 與 worker 協調，獲取 Request，計算 response；

rank 0 會遍歷 response_list，對於 response 逐一執行操作。

response_list 是 rank 0 處理，response cache 是其他 rank 處理。
利用 PerformOperation 對於每個response，做collective的操作
如果需要 auto tune，就同步引數；

我們可以看到Horovod的工作流程大致如之前所說的，是一個生產者和消費者的模式。controller在這裡是做協調的工作：會互通各個 rank 有哪些 request 已經就緒，對於就緒的 request，執行collective的操作。

縮減版程式碼如下：

bool RunLoopOnce(HorovodGlobalState& state) {
  // This delay determines thread frequency and communication message latency
  .....
    
  // 讓 rank 0 與 worker 協調，獲取 Request，計算 response  
  auto response_list =
      state.controller->ComputeResponseList(horovod_global.shut_down, state);

  // Get tensor name and size data for autotuning.
  .....

  // Perform the collective operation. All nodes should end up performing
  // the same operation.
  // 對於每個response，做collective的操作
  int rank = state.controller->GetRank();
  for (auto& response : response_list.responses()) {
    PerformOperation(response, horovod_global);
  }

  // 如果需要 auto tune，就同步引數
  if (state.parameter_manager.IsAutoTuning()) {
    bool should_sync =
        state.parameter_manager.Update(tensor_names, total_tensor_size);

    if (should_sync) {
      state.controller->SynchronizeParameters();
    }
  }

  return !response_list.shutdown();
}

流程如下：

+---------------------------------+
|                                 |             +-----------------------------+
|  BackgroundThreadLoop           |             |                             |
|                                 |             | OperationManager            |
|   +--------------------------+  |             |                             |
|   |  RunLoopOnce             |  |             |                             |
|   |                          |  |             |                             |
|   |                          |  |             |                             |
|   |     ComputeResponseList  |  |    +----------> ExecuteOperation          |
|   |             +            |  |    |        |                             |
|   |             |            |  |    |        |                             |
|   |             |            |  |    |        |                             |
|   |             |            |  |    | 1      |                             |
|   |             v            |  |    |        |                             |
|   |                          |  |    |        |                             |
|   |      PerformOperation +----------+        |                             |
|   |                          |  |             |                             |
|   +--------------------------+  |             |                             |
|                                 |             |                             |
+---------------------------------+             +-----------------------------+

5.2 ComputeResponseList 計算 response

在後臺執行緒裡，最重要的一個函式呼叫是 ComputeResponseList。ComputeResponseList 實現了協調過程，即來讓 rank 0 與 worker 協調，獲取 Request，計算 response。

Horovod 也遵循著 Coordinator 的設計，與百度類似。無論是百度還是 Horovod 中的 Coordinator 都類似是 Actor 模式，主要起來協調多個程式工作的作用。在真正執行計算的時候，Horovod 同樣引入了一個新的抽象 op_manager。從某種程度來說，我們可以把 controller 看做是對通訊和協調管理能力的抽象，而 op_manager 是對實際計算的抽象。

5.2.1 總體思路

Controller::ComputeResponseList 的功能就是：worker 傳送請求給 rank 0，然後coordinator 處理所有 worker 的請求，找到 ready 的，進行融合，最後結果傳送給其他 rank：

利用 PopMessagesFromQueue 從從自己程式的 GlobalState 的 Tensor Quene 中把目前的 Request 都取出來，進行處理，具體處理時使用了快取，然後經過一系列處理快取到 message_queue_tmp 中；
彼此同步cache資訊，目的是得到每個worker 共同儲存的 response列表；
判斷是否需要進一步同步，比如是否response全都在cache之中；
如果不需要同步，則
- 說明佇列中所有訊息都在快取之中，不需要其他的協調。於是直接把快取的response進行融合，放入response_list，下一輪時間片會繼續處理；
如果需要同步，則
- 如果是rank 0，
  - 因為rank 0 也會參與機器學習的訓練，所以需要把rank 0的request也加入到message table之中。接受其他 rank 的 Request，把其他 rank 的 Request 加入到 message_table_ 之中。此處就同步阻塞了。
  - Rank 0 利用 RecvReadyTensors 接受其他 rank 的 Request，把其他 rank 的 Request 加入到 ready_to_reduce。此處就同步阻塞了。coordinator 會持續接收這些資訊，直到獲取的 Done 的數目等於 global_size。
  - 然後遍歷 rank 0+1 ~ rank n，逐一處理每個 rank 的 response；
  - 最後，message table 之中已經有了所有的可以reduce的列表，responses 的來源是以下三部分：
    - 來源1，response_cache_ in rank 0；
    - 來源2，逐一處理 ready_to_reduce；
    - 來源3，join_response
  - 利用 FuseResponses 對tensor做fusion：即將一些tensor合併成一個大的tensor，再做collective的操作。
  - coordinator 會找到所有準備好 reduce 的 tensors，通過 SendFinalTensors(response_list) 返回一個 response 給所有的 worker，如果資訊有誤會返回一個 error，傳送完成也會傳送一個 Done。
- 如果是其他 rank，則：
  - 當 worker 到達了前端 all_reduce 這句的時候，會用 message_queue_tmp 整理成一個 message_list通過 SendReadyTensors 函式往主節點( coordinator，Rank 0 ) 傳送一個請求表明我打算reduce 的 Request，然後會把準備 reduce 的 tensor 資訊通過 message_list 迭代地送過去，最後有一個 Done 的請求，然後同步阻塞。
  - Worker 利用 RecvFinalTensors(response_list) 監聽 response 的資訊，從 Rank 0 接受 ready response list，同步阻塞。當收到 Done，會嘗試呼叫 performation 去進行 reduce 。
- coordinator 和 worker 都會把同步的資訊整理成一個 responses 的陣列給到後面的 PerformOperation 操作。

這裡說一下mpi是怎麼實現的，就是 coordinator 和對應的 worker 會阻塞地到同一條指令：

SendReadyTensors 和 RecvReadyTensors 阻塞到 MPI_Gather；
SendFinalTensors 和 RecvFinalTensors 到 MPI_Bcast ；

可以這樣分辨：如果是 coordinator 傳送的就是 MPI_Bcast，如果是worker 傳送的是 MPI_Gather。通訊都是先同步需要通訊message的大小 length，再同步message。

具體如下圖：

                                                      +
                                                      |
                    ComputeResponseList in rank 0     |     ComputeResponseList in worker(rank n)
                                                      |
                                                      |
                       message_queue_tmp              |          message_queue_tmp
                                                      |
                               +                      |                +
                               |                      |                |
                               |PopMessagesFromQueue  |                | PopMessagesFromQueue
                               |                      |                |
                               |                      |                |
                               |           CoordinateCacheAndState     |
                               |                      |                |
                               |    <--------------------------------> |
                               |                      |                |
                               v                      |                v
                                                      |
RecvReadyTensors(ready_to_reduce, ready_list)  <------------->  SendReadyTensors(message_list)
                               +                      |                +
                               |                      |                |
                               |                      |                |
                               |                      |                |
                               |                      |                |
                               v                      |                |
                        message_table_                |                |
                               +                      |                |
                               |                      |                |
                               |                      |                |
                               v                      |                |
                         FuseResponses                |                |
                               +                      |                |
                               |                      |                |
                               |                      |                |
                               v                      |                v
           SendFinalTensors(response_list)  <---------------->  RecvFinalTensors(response_list)
                               +                      |                +
                               |                      |                |
                               |                      |                |
                               |                      |                |
                               v                      |                v
                        PerformOperation              |           PerformOperation
                                                      |
                                                      +

手機如圖：

5.2.2 詳細分析

下面是比較詳細的分析，參考了網上的資料，自己也做了解讀。

ResponseList Controller::ComputeResponseList(std::atomic_bool& shut_down,
                                             HorovodGlobalState& state) {
  // Update cache capacity if autotuning is active.
  if (parameter_manager_.IsAutoTuning()) {
    response_cache_.set_capacity((int)parameter_manager_.CacheEnabled() *
                                 cache_capacity_);
  }

  // Copy the data structures out from parameters.
  // However, don't keep the lock for the rest of the loop, so that
  // enqueued stream callbacks can continue.

  CacheCoordinator cache_coordinator(response_cache_.num_active_bits());

  // 從 Tensor Quene 中把目前的 Request 都取出來，進行處理
  // message queue used only in this cycle
  std::deque<Request> message_queue_tmp;
  tensor_queue_.PopMessagesFromQueue(message_queue_tmp);
  for (auto& message : message_queue_tmp) {
    if (message.request_type() == Request::JOIN) {
      state.joined = true;
      // set_uncached_in_queue 記錄沒有cache的
      cache_coordinator.set_uncached_in_queue(true);
      continue;
    }

    // 這裡使用了快取，就是為了快取本rank已經得到了多少response。
    // Keep track of cache hits
    if (response_cache_.capacity() > 0) {
      // 需要看看這個tensor是否已經得到了對應的response。為啥要快取呢？不是都 ready 之後，就立刻進行 all reduce 了嘛。
      // cached 函式比較複雜，不但要看是否已經快取，還要看新 tensor 是否和已經快取的同名 tensor 的各種引數一致，比如device，dtype，shape等等。如果不一致，則標識快取的是 INVALID。難道深度學習訓練中，這些會變更？
      auto cache_ = response_cache_.cached(message);
      if (cache_ == ResponseCache::CacheState::HIT) {
        uint32_t cache_bit = response_cache_.peek_cache_bit(message);
        cache_coordinator.record_hit(cache_bit);

        // Record initial time cached tensor is encountered in queue.
        stall_inspector_.RecordCachedTensorStart(message.tensor_name());

      } else {
        // 如果沒有快取
        if (cache_ == ResponseCache::CacheState::INVALID) {
          // 處理無效快取記錄
          uint32_t cache_bit = response_cache_.peek_cache_bit(message);
          cache_coordinator.record_invalid_bit(cache_bit);
        }
        // 如果沒有快取，則新增到 set_uncached_in_queue
        cache_coordinator.set_uncached_in_queue(true);

        // 從stall 移除
        // Remove timing entry if uncached or marked invalid.
        stall_inspector_.RemoveCachedTensor(message.tensor_name());
      }
    }
  }

  if (state.joined && response_cache_.capacity() > 0) {
    for (uint32_t bit : response_cache_.list_all_bits()) {
      cache_coordinator.record_hit(bit);
    }
  }

  // Flag indicating that the background thread should shut down.
  bool should_shut_down = shut_down;

  // 處理 stalled
  // Check for stalled tensors.
  if (stall_inspector_.ShouldPerformCheck()) {
    if (is_coordinator_) {
      should_shut_down |= stall_inspector_.CheckForStalledTensors(size_);
    }

    if (response_cache_.capacity() > 0) {
      stall_inspector_.InvalidateStalledCachedTensors(cache_coordinator);
    }
    stall_inspector_.UpdateCheckTime();
  }

  cache_coordinator.set_should_shut_down(should_shut_down);

  if (response_cache_.capacity() > 0) {
    // 為什麼要彼此同步cache資訊？
    // Obtain common cache hits and cache invalidations across workers. Also,
    // determine if any worker has uncached messages in queue or requests
    // a shutdown. This function removes any invalid cache entries, if they
    // exist.
    // 這裡會同步，也會從 response_cache_ 之中移除 invalid 的。
    // 目的是得到每個worker 共同儲存的 response列表
    CoordinateCacheAndState(cache_coordinator);
      
    // Remove uncommon cached tensors from queue and replace to state
    // queue for next cycle. Skip adding common cached tensors to
    // queue as they are handled separately.
      
    // 此時 cache_coordinator 已經是所有worker 共有的response 列表了。需要移除那些 不在共有response 列表中的 response。
    // 為什麼有的worker會沒有某種response?
    // 會從 tensor request messages 之中看看是否已經有cache的了，然後相應更新 tensor_queue_。
    std::deque<Request> messages_to_replace;
    size_t num_messages = message_queue_tmp.size();
    for (size_t i = 0; i < num_messages; ++i) {
      auto& message = message_queue_tmp.front();
      if (response_cache_.cached(message) == ResponseCache::CacheState::HIT) {
        uint32_t cache_bit = response_cache_.peek_cache_bit(message);
        if (cache_coordinator.cache_hits().find(cache_bit) ==
            cache_coordinator.cache_hits().end()) {
          // Try to process again in next cycle.
          messages_to_replace.push_back(std::move(message));
        } else {
          // Remove timing entry for messages being handled this cycle.
          stall_inspector_.RemoveCachedTensor(message.tensor_name());
        }
      } else {
        // Remove timing entry for messages being handled this cycle.
        stall_inspector_.RemoveCachedTensor(message.tensor_name());
        message_queue_tmp.push_back(std::move(message));
      }
      message_queue_tmp.pop_front();
    }
    tensor_queue_.PushMessagesToQueue(messages_to_replace);
  } 
  // End of response_cache_.capacity()

  
  ResponseList response_list;
  response_list.set_shutdown(cache_coordinator.should_shut_down());

  bool need_communication = true;
  // 判斷是否需要進一步同步，比如response全都在cache之中。
  if (response_cache_.capacity() > 0 &&
      !cache_coordinator.uncached_in_queue()) {
    // if cache is enabled and no uncached new message coming in, no need for
    // additional communications
    need_communication = false;

    // If no messages to send, we can simply return an empty response list;
    if (cache_coordinator.cache_hits().empty()) {
      return response_list;
    }
    // otherwise we need to add cached messages to response list.
  }

  if (!need_communication) {
    // 佇列中所有訊息都在快取之中，不需要其他的協調。於是直接把快取的response進行融合，放入response_list
    // If all messages in queue have responses in cache, use fast path with
    // no additional coordination.

    // If group fusion is disabled, fuse tensors in groups separately
    if (state.disable_group_fusion && !group_table_.empty()) {
      // Note: need group order to be based on position in cache for global consistency
      std::vector<int> common_ready_groups;
      std::unordered_set<int> processed;
      for (auto bit : cache_coordinator.cache_hits()) {
        const auto& tensor_name = response_cache_.peek_response(bit).tensor_names()[0];
        int group_id = group_table_.GetGroupIDFromTensorName(tensor_name);
        if (group_id != NULL_GROUP_ID && processed.find(group_id) == processed.end()) {
          common_ready_groups.push_back(group_id);
          processed.insert(group_id);
        }
      }

      for (auto id : common_ready_groups) {
        std::deque<Response> responses;
        for (const auto &tensor_name : group_table_.GetGroupTensorNames(id)) {
          auto bit = response_cache_.peek_cache_bit(tensor_name);
          responses.push_back(response_cache_.get_response(bit));
          // Erase cache hit to avoid processing a second time.
          cache_coordinator.erase_hit(bit);
        }

        FuseResponses(responses, state, response_list);
      }
    }

    std::deque<Response> responses;
    // Convert cache hits to responses. Populate so that least
    // recently used responses get priority. All workers call the code
    // here so we use the get method here to consistently update the cache
    // order.
    for (auto bit : cache_coordinator.cache_hits()) {
      responses.push_back(response_cache_.get_response(bit));
    }

    // Fuse responses as normal.
    FuseResponses(responses, state, response_list);
    response_list.set_shutdown(cache_coordinator.should_shut_down());
  } else {
    // 有沒有快取的訊息進入，需要找出來這些是不是可以reduce的。
    // There are uncached messages coming in, need communication to figure out
    // whether those are ready to be reduced.

    // Collect all tensors that are ready to be reduced. Record them in the
    // tensor count table (rank zero) or send them to rank zero to be
    // recorded (everyone else).
    std::vector<std::string> ready_to_reduce;

    if (is_coordinator_) {
      // 我是 rank 0，對於master程式，記錄已經ready的tensor。
      // rank 0 也會參與機器學習的訓練，所以需要把rank 0的request也加入到message table之中。
      while (!message_queue_tmp.empty()) { // 注意此時message_queue_tmp中的request是來自master程式
        // Pop the first available message
        Request message = message_queue_tmp.front();
        message_queue_tmp.pop_front();

        if (message.request_type() == Request::JOIN) {
          state.joined_size++;
          continue;
        }

        bool reduce = IncrementTensorCount(message, state.joined_size);
        stall_inspector_.RecordUncachedTensorStart(
            message.tensor_name(), message.request_rank(), size_);
        if (reduce) {
          ready_to_reduce.push_back(message.tensor_name());
        }
      }

      // 接受其他 rank 的 Request，把其他 rank 的 ready Request 加入到 message_table_ 之中。
      // 此處就同步阻塞了
      // Receive ready tensors from other ranks
      std::vector<RequestList> ready_list;
      RecvReadyTensors(ready_to_reduce, ready_list);

      // 處理所有 rank 的 Request。
      // Process messages.
      // 遍歷 rank 0+1 ~ rank n，逐一處理每個 rank 的 response
      for (int i = 1; i < size_; ++i) { // size_是指有多少個rank
        
        // 每一個 rank 的 response list。
        auto received_message_list = ready_list[i];
        for (auto& received_message : received_message_list.requests()) {
          auto& received_name = received_message.tensor_name();

          // Join型別訊息是指有新的rank加入，Horovod支援彈性
          if (received_message.request_type() == Request::JOIN) {
            state.joined_size++;  // 增加該tensor已經ready的rank的個數，如果所有rank都ready，則發給其他rank
            continue;
          }

          bool reduce = IncrementTensorCount(received_message, state.joined_size);
          stall_inspector_.RecordUncachedTensorStart(
              received_message.tensor_name(), received_message.request_rank(),
              size_);
            
          // 如果已經達到了最大數值，則可以 reduce 了，加入到 ready_to_reduce。
          if (reduce) {
            ready_to_reduce.push_back(received_name);
          }
        }
        if (received_message_list.shutdown()) {
          // Received SHUTDOWN request from one of the workers.
          should_shut_down = true;
        }
      }

      // Check if tensors from previous ticks are ready to reduce after Joins.
      // 遍歷 message_table_，目的是看看上一輪處理的 response 在本輪是否可以 reduce
      if (state.joined_size > 0) {
        for (auto& table_iter : message_table_) {
          int count = (int)table_iter.second.size();
          if (count == (size_ - state.joined_size) &&
              std::find(ready_to_reduce.begin(), ready_to_reduce.end(),
                        table_iter.first) == ready_to_reduce.end()) {
            state.timeline.NegotiateEnd(table_iter.first);
            ready_to_reduce.push_back(table_iter.first);
          }
        }
      }

      // Fuse tensors in groups before processing others.
      if (state.disable_group_fusion && !group_table_.empty()) {

        // Extract set of common groups from coordinator tensor list and cache hits.
        std::vector<int> common_ready_groups;
        std::unordered_set<int> processed;

        for (const auto& tensor_name : ready_to_reduce) {
          int group_id = group_table_.GetGroupIDFromTensorName(tensor_name);
          if (group_id != NULL_GROUP_ID && processed.find(group_id) == processed.end()) {
            common_ready_groups.push_back(group_id);
            processed.insert(group_id);
            // Leaving name in list, to be skipped later.
          }
        }

        if (response_cache_.capacity() > 0) {
          for (auto bit : cache_coordinator.cache_hits()) {
            const auto& tensor_name = response_cache_.peek_response(bit).tensor_names()[0];
            int group_id = group_table_.GetGroupIDFromTensorName(tensor_name);
            if (group_id != NULL_GROUP_ID && processed.find(group_id) == processed.end()) {
              common_ready_groups.push_back(group_id);
              processed.insert(group_id);
            }
          }
        }

        // For each ready group, form and fuse response lists independently
        for (auto id : common_ready_groups) {
          std::deque<Response> responses;
          for (const auto &tensor_name : group_table_.GetGroupTensorNames(id)) {
            if (message_table_.find(tensor_name) != message_table_.end()) {
              // Uncached message
              Response response = ConstructResponse(tensor_name, state.joined_size);
              responses.push_back(std::move(response));

            } else {
              // Cached message
              auto bit = response_cache_.peek_cache_bit(tensor_name);
              responses.push_back(response_cache_.get_response(bit));
              // Erase cache hit to avoid processing a second time.
              cache_coordinator.erase_hit(bit);
            }
          }

          FuseResponses(responses, state, response_list);
        }
      }

	  // 此時，message table 之中已經有了所有的可以reduce的列表
        
      // At this point, rank zero should have a fully updated tensor count
      // table and should know all the tensors that need to be reduced or
      // gathered, and everyone else should have sent all their information
      // to rank zero. We can now do reductions and gathers; rank zero will
      // choose which ones and in what order, and will notify the other ranks
      // before doing each reduction.
      std::deque<Response> responses;

      // responses 的來源是以下三部分
        
      // 來源1，response_cache_ in rank 0
      if (response_cache_.capacity() > 0) {
        // Prepopulate response list with cached responses. Populate so that
        // least recently used responses get priority. Since only the
        // coordinator rank calls this code, use peek instead of get here to
        // preserve cache order across workers.
        // No need to do this when all ranks did Join.
        if (state.joined_size < size_) {
          for (auto bit : cache_coordinator.cache_hits()) {
            responses.push_back(response_cache_.peek_response(bit));
          }
        }
      }

      // 來源2，逐一處理 ready_to_reduce
      for (auto& tensor_name : ready_to_reduce) {
        // Skip tensors in group that were handled earlier.
        if (state.disable_group_fusion &&
            !group_table_.empty() &&
            group_table_.GetGroupIDFromTensorName(tensor_name) != NULL_GROUP_ID) {
          continue;
        }

        Response response = ConstructResponse(tensor_name, state.joined_size);
        responses.push_back(std::move(response));
      }
        
      // 來源3，join_response  
      if (state.joined_size == size_) {
        // All ranks did Join(). Send the response, reset joined size.
        Response join_response;
        join_response.set_response_type(Response::JOIN);
        join_response.add_tensor_name(JOIN_TENSOR_NAME);
        responses.push_back(std::move(join_response));
        state.joined_size = 0;
      }
        
      // 進行融合
      FuseResponses(responses, state, response_list);
      response_list.set_shutdown(should_shut_down);

      // Broadcast final results to other ranks.
      SendFinalTensors(response_list);

    } else {
      // 我是其他的 rank，非master，則傳送自己已經ready的tensor給master，再接收已經ready的tensor列表
      RequestList message_list;
      message_list.set_shutdown(should_shut_down);
      while (!message_queue_tmp.empty()) {
        message_list.add_request(message_queue_tmp.front());
        message_queue_tmp.pop_front();
      }

      // 給 Rank 0 傳送 Request，同步阻塞
      // Send ready tensors to rank zero
      SendReadyTensors(message_list);

      // 從 Rank 0 接受 ready response list，同步阻塞
      // Receive final tensors to be processed from rank zero
      RecvFinalTensors(response_list);
    }
  }
  
  if (!response_list.responses().empty()) {
    std::string tensors_ready;
    for (const auto& r : response_list.responses()) {
      tensors_ready += r.tensor_names_string() + "; ";
    }
  }

  // If need_communication is false, meaning no uncached message coming in,
  // thus no need to update cache.
  if (need_communication && response_cache_.capacity() > 0) {
    // All workers add supported responses to cache. This updates the cache
    // order consistently across workers.
    for (auto& response : response_list.responses()) {
      if ((response.response_type() == Response::ResponseType::ALLREDUCE ||
           response.response_type() == Response::ResponseType::ADASUM ||
           response.response_type() == Response::ResponseType::ALLTOALL) &&
          (int)response.devices().size() == size_) {
        response_cache_.put(response, tensor_queue_, state.joined);
      }
    }
  }

  // Reassign cache bits based on current cache order.
  response_cache_.update_cache_bits();

  return response_list;
}

我們接下來重點看幾個函式。

5.2.3 IncrementTensorCount

IncrementTensorCount 的作用是計算是否所有的 tensor 都已經準備好。

如果 bool ready_to_reduce = count == (size_ - joined_size) ， 就會知道這個是可以 allreduce 的。

bool Controller::IncrementTensorCount(const Request& msg, int joined_size) {
  auto& name = msg.tensor_name();
  auto table_iter = message_table_.find(name);
  if (table_iter == message_table_.end()) {
    std::vector<Request> messages = {msg};
    messages.reserve(static_cast<unsigned long>(size_));
    message_table_.emplace(name, std::move(messages));
    table_iter = message_table_.find(name);
  } else {
    std::vector<Request>& messages = table_iter->second;
    messages.push_back(msg);
  }

  std::vector<Request>& messages = table_iter->second;
  int count = (int)messages.size();
  bool ready_to_reduce = count == (size_ - joined_size); // 判斷是否可以 allreduce

  return ready_to_reduce;
}

具體呼叫就是 rank 0 來負責，看看是不是 allreduce了。

即如果 IncrementTensorCount 了，就說明齊全了，可以把 Request 加入到 message_table_ 之中。

    if (is_coordinator_) {

      while (!message_queue_tmp.empty()) {
        // Pop the first available message
        Request message = message_queue_tmp.front();
        message_queue_tmp.pop_front();

        if (message.request_type() == Request::JOIN) {
          state.joined_size++;
          continue;
        }

        // 這裡呼叫
        bool reduce = IncrementTensorCount(message, state.joined_size);
        stall_inspector_.RecordUncachedTensorStart(
            message.tensor_name(), message.request_rank(), size_);
        if (reduce) {
          ready_to_reduce.push_back(message.tensor_name());
        }
      }

5.2.4 RecvReadyTensors

該函式的作用是收集其他 rank 的 Request。

使用 MPI_Gather 確定訊息長度；
使用 MPI_Gatherv 收集訊息；
因為 rank 0 已經被處理了，所以這裡不處理 rank 0；

void MPIController::RecvReadyTensors(std::vector<std::string>& ready_to_reduce,
                                     std::vector<RequestList>& ready_list) {
  // Rank zero has put all its own tensors in the tensor count table.
  // Now, it should count all the tensors that are coming from other
  // ranks at this tick.

  // 1. Get message lengths from every rank.
  auto recvcounts = new int[size_];
  recvcounts[0] = 0;
  MPI_Gather(MPI_IN_PLACE, 1, MPI_INT, recvcounts, 1, MPI_INT, RANK_ZERO,
             mpi_ctx_.mpi_comm);

  // 2. Compute displacements.
  auto displcmnts = new int[size_];
  size_t total_size = 0;
  for (int i = 0; i < size_; ++i) {
    if (i == 0) {
      displcmnts[i] = 0;
    } else {
      displcmnts[i] = recvcounts[i - 1] + displcmnts[i - 1];
    }
    total_size += recvcounts[i];
  }

  // 3. Collect messages from every rank.
  auto buffer = new uint8_t[total_size];
  MPI_Gatherv(nullptr, 0, MPI_BYTE, buffer, recvcounts, displcmnts, MPI_BYTE,
              RANK_ZERO, mpi_ctx_.mpi_comm);

  // 4. Process messages.
  // create a dummy list for rank 0
  ready_list.emplace_back();
  for (int i = 1; i < size_; ++i) {
    auto rank_buffer_ptr = buffer + displcmnts[i];
    RequestList received_message_list;
    RequestList::ParseFromBytes(received_message_list, rank_buffer_ptr);
    ready_list.push_back(std::move(received_message_list));
  }

  // 5. Free buffers.
  delete[] recvcounts;
  delete[] displcmnts;
  delete[] buffer;
}

5.2.5 SendReadyTensors

該函式是其他 rank 同步 Request 給 rank 0。

使用 MPI_Gather 確定訊息長度；
使用 MPI_Gatherv 收集訊息；

void MPIController::SendReadyTensors(RequestList& message_list) {
  std::string encoded_message;
  RequestList::SerializeToString(message_list, encoded_message);
  int encoded_message_length = (int)encoded_message.length() + 1;
  int ret_code = MPI_Gather(&encoded_message_length, 1, MPI_INT, nullptr, 1,
                            MPI_INT, RANK_ZERO, mpi_ctx_.mpi_comm);

  ret_code = MPI_Gatherv((void*)encoded_message.c_str(), encoded_message_length,
                         MPI_BYTE, nullptr, nullptr, nullptr, MPI_BYTE,
                         RANK_ZERO, mpi_ctx_.mpi_comm);
}

5.2.6 SendFinalTensors

該函式作用是 rank 0 把最後結果傳送給其他 rank；

void MPIController::SendFinalTensors(ResponseList& response_list) {
  // Notify all nodes which tensors we'd like to reduce at this step.
  std::string encoded_response;
  ResponseList::SerializeToString(response_list, encoded_response);
  int encoded_response_length = (int)encoded_response.length() + 1;
  MPI_Bcast(&encoded_response_length, 1, MPI_INT, RANK_ZERO, mpi_ctx_.mpi_comm);

  MPI_Bcast((void*)encoded_response.c_str(), encoded_response_length, MPI_BYTE,
            RANK_ZERO, mpi_ctx_.mpi_comm);
}

5.2.7 RecvFinalTensors

該函式作用是 worker 從 Rank 0 接受 ready response list，同步阻塞

void MPIController::RecvFinalTensors(ResponseList& response_list) {
  int msg_length;
  int ret_code =
      MPI_Bcast(&msg_length, 1, MPI_INT, RANK_ZERO, mpi_ctx_.mpi_comm);

  auto buffer = new uint8_t[msg_length];
  ret_code =
      MPI_Bcast(buffer, msg_length, MPI_BYTE, RANK_ZERO, mpi_ctx_.mpi_comm);
    
  ResponseList::ParseFromBytes(response_list, buffer);
  delete[] buffer;
}

5.3 根據 response 執行操作

我們接下來要看看另一個重要操作 PerformOperation，就是根據 response 執行操作。

其呼叫順序是：

BackgroundThreadLoop 呼叫 RunLoopOnce；
RunLoopOnce 如果是 rank 0，則處理 response_list，然後呼叫 PerformOperation；
PerformOperation 進而呼叫 op_manager -> ExecuteOperation------ ExecuteAllreduce；

我們可以看到，ComputeResponseList 返回了 response_list，就是這些 response 對應的 tensor 可以做 allreduce了。然後會遍歷每一個 response，進行 PerformOperation。

 auto response_list =
   state.controller->ComputeResponseList(horovod_global.shut_down, state);
   
  int rank = state.controller->GetRank();
  for (auto& response : response_list.responses()) {
    PerformOperation(response, horovod_global);
  }

5.3.1 PerformOperation

從 ComputeResponseList 繼續跑 RunLoopOnce， worker node 會根據前面 ComputeResponseList 返回的 response_list 對每個 response 輪詢呼叫 PerformOperation 完成對應的 reduce 工作。

主要呼叫 status = op_manager->ExecuteOperation(entries, response); 具體如下：

PerformOperation 會從 horovod_global.tensor_queue 通過函式 GetTensorEntriesFromResponse 取出對應的 TensorEntry；
如果還沒初始化buffer，呼叫 horovod_global.fusion_buffer.InitializeBuffer 初始化；
然後 status = op_manager->ExecuteOperation(entries, response) 會呼叫不同的 op->Execute(entries, response) 執行reduce 運算；
然後呼叫不同 entries 的 callback，這裡 callback 一般是前端作相應的操作；

// Process a Response by doing a reduction, a gather, a broadcast, or
// raising an error.
void PerformOperation(Response response, HorovodGlobalState& state) {
  std::vector<TensorTableEntry> entries;
  auto& timeline = horovod_global.timeline;
  if (response.response_type() != Response::JOIN) {
    horovod_global.tensor_queue.GetTensorEntriesFromResponse(response, entries,
                                                             state.joined);

    if (entries.size() > 1) { // 如果多於1個，則可以進行fuse，以提高throughput
      auto first_entry = entries[0];
      Status status = horovod_global.fusion_buffer.InitializeBuffer(
          horovod_global.controller->TensorFusionThresholdBytes(),
          first_entry.device, first_entry.context,
          horovod_global.current_nccl_stream,
          [&]() { timeline.ActivityStartAll(entries, INIT_FUSION_BUFFER); },
          [&]() { timeline.ActivityEndAll(entries); });
      if (!status.ok()) {
        for (auto& e : entries) {
          timeline.End(e.tensor_name, nullptr);
          // Callback can be null if the rank sent Join request.
          if (e.callback != nullptr) {
            e.callback(status);
          }
        }
        return;
      }
    }

    // On GPU data readiness is signalled by ready_event.
    // 即使tensor可以進行操作了，但需要等待資料同步到視訊記憶體
    std::vector<TensorTableEntry> waiting_tensors;
    for (auto& e : entries) {
      if (e.ready_event != nullptr) {
        timeline.ActivityStart(e.tensor_name, WAIT_FOR_DATA);
        waiting_tensors.push_back(e);
      }
    }
    while (!waiting_tensors.empty()) {
      for (auto it = waiting_tensors.begin(); it != waiting_tensors.end();) {
        if (it->ready_event->Ready()) {
          timeline.ActivityEnd(it->tensor_name);
          timeline.ActivityStart(it->tensor_name, WAIT_FOR_OTHER_TENSOR_DATA);
          it = waiting_tensors.erase(it);
        } else {
          ++it;
        }
      }
      std::this_thread::sleep_for(std::chrono::nanoseconds(100));
    }
  }

  Status status;
  try {
    // 進行collective的操作
    status = op_manager->ExecuteOperation(entries, response);
  } catch (const std::exception& ex) {
    status = Status::UnknownError(ex.what());
  }
  
  ... // 呼叫 callback 函式
}

5.3.2 ExecuteOperation

然後 status = op_manager->ExecuteOperation(entries, response) 會呼叫不同的 op->Execute(entries, response) 執行reduce 運算。

這裡來到了 OperationManager。

Status OperationManager::ExecuteOperation(std::vector<TensorTableEntry>& entries,
                                          const Response& response) const {
  if (response.response_type() == Response::ALLREDUCE) {
    return ExecuteAllreduce(entries, response);
  } else if (response.response_type() == Response::ALLGATHER) {
    return ExecuteAllgather(entries, response);
  } else if (response.response_type() == Response::BROADCAST) {
    return ExecuteBroadcast(entries, response);
  } else if (response.response_type() == Response::ALLTOALL) {
    return ExecuteAlltoall(entries, response);
  } else if (response.response_type() == Response::JOIN) {
    return ExecuteJoin(entries, response);
  } else if (response.response_type() == Response::ADASUM) {
    return ExecuteAdasum(entries, response);
  } else if (response.response_type() == Response::ERROR) {
    return ExecuteError(entries, response);
  } else {
    throw std::logic_error("No operation found for response type provided");
  }
}

5.3.3 ExecuteAllreduce

op->Execute(entries, response); 就是呼叫了類似 MPIAllreduce . Execute。

Status OperationManager::ExecuteAllreduce(std::vector<TensorTableEntry>& entries,
                                          const Response& response) const {
  for (auto& op : allreduce_ops_) {
    if (op->Enabled(*param_manager_, entries, response)) {
      return op->Execute(entries, response);
    }
  }
}

allreduce_ops_ 是從哪裡來的？在 OperationManager 構建函式中有。

allreduce_ops_(std::move(allreduce_ops)),

所以我們看看allreduce_ops。

5.3.4 allreduce_ops

在 CreateOperationManager 之中對 allreduce_ops 進行新增。

可以看到，新增的型別大致如下：

MPI_GPUAllreduce
NCCLHierarchicalAllreduce
NCCLAllreduce
DDLAllreduce
GlooAllreduce
CCLAllreduce
MPIAllreduce
......

OperationManager* CreateOperationManager(HorovodGlobalState& state) {
  // Order of these operations is very important. Operations will be checked
  // sequentially from the first to the last. The first 'Enabled' operation will
  // be executed.
  std::vector<std::shared_ptr<AllreduceOp>> allreduce_ops;
  std::vector<std::shared_ptr<AllgatherOp>> allgather_ops;
  std::vector<std::shared_ptr<BroadcastOp>> broadcast_ops;
  std::vector<std::shared_ptr<AllreduceOp>> adasum_ops;
  std::vector<std::shared_ptr<AlltoallOp>> alltoall_ops;

#if HAVE_MPI && HAVE_GPU // 如果構建了 MPI，就新增對應MPI_GPUAllreduce
  if (mpi_context.IsEnabled()) {
#if HOROVOD_GPU_ALLREDUCE == 'M'
    allreduce_ops.push_back(std::shared_ptr<AllreduceOp>(
        new MPI_GPUAllreduce(&mpi_context, &gpu_context, &state)));

#elif HAVE_NCCL && HOROVOD_GPU_ALLREDUCE == 'N' // 如果編譯了NCCL，就新增 AdasumGpuAllreduceOp
    adasum_ops.push_back(std::shared_ptr<AllreduceOp>(new AdasumGpuAllreduceOp(&mpi_context, &nccl_context, &gpu_context, &state)));

    allreduce_ops.push_back(
        std::shared_ptr<AllreduceOp>(new NCCLHierarchicalAllreduce(
            &nccl_context, &mpi_context, &gpu_context, &state)));

#elif HAVE_DDL && HOROVOD_GPU_ALLREDUCE == 'D'// 如果編譯了DDL，就新增DDLAllreduce
    allreduce_ops.push_back(std::shared_ptr<AllreduceOp>(
        new DDLAllreduce(&ddl_context, &gpu_context, &state)));
#endif

#if HAVE_NCCL && HOROVOD_GPU_ALLREDUCE == 'N'// 如果編譯了NCCL，就新增NCCLAllreduce
  allreduce_ops.push_back(std::shared_ptr<AllreduceOp>(
      new NCCLAllreduce(&nccl_context, &gpu_context, &state)));
#endif

5.3.5 MPIAllreduce

因為 allreduce_ops 型別很多，所以我們以 MPIAllreduce 舉例如下：

class MPIAllreduce : public AllreduceOp {
public:
  MPIAllreduce(MPIContext* mpi_context, HorovodGlobalState* global_state);

  virtual ~MPIAllreduce() = default;

  Status Execute(std::vector<TensorTableEntry>& entries, const Response& response) override;

  bool Enabled(const ParameterManager& param_manager,
               const std::vector<TensorTableEntry>& entries,
               const Response& response) const override;

protected:
  MPIContext* mpi_context_;
};

MPIAllreduce::Execute 這裡使用到了 MPI_Allreduce，也處理了 fusion，比如 MemcpyOutFusionBuffer。

#include "mpi_operations.h"

Status MPIAllreduce::Execute(std::vector<TensorTableEntry>& entries, const Response& response) {
  auto& first_entry = entries[0];

  const void* fused_input_data;
  void* buffer_data;
  size_t buffer_len;
  int64_t num_elements = NumElements(entries);

  // Copy memory into the fusion buffer.
  auto& timeline = global_state_->timeline;
  if (entries.size() > 1) {
    timeline.ActivityStartAll(entries, MEMCPY_IN_FUSION_BUFFER);
    MemcpyInFusionBuffer(entries, fused_input_data, buffer_data, buffer_len);
    timeline.ActivityEndAll(entries);
  } else {
    fused_input_data = first_entry.tensor->data();
    buffer_data = (void*) first_entry.output->data();
    buffer_len = (size_t) first_entry.output->size();
  }

  if (response.prescale_factor() != 1.0) {
    // Execute prescaling op
    ScaleBuffer(response.prescale_factor(), entries, fused_input_data, buffer_data, num_elements);
    fused_input_data = buffer_data; // for unfused, scale is done out of place
  }

  // Do allreduce.
  timeline.ActivityStartAll(entries, MPI_ALLREDUCE);
  const void* sendbuf = entries.size() > 1 || fused_input_data == buffer_data
                        ? MPI_IN_PLACE : fused_input_data;
  int op = MPI_Allreduce(sendbuf, buffer_data,
                         (int) num_elements,
                         mpi_context_->GetMPIDataType(first_entry.tensor),
                         mpi_context_->GetMPISumOp(first_entry.tensor->dtype()),
                         mpi_context_->GetMPICommunicator(Communicator::GLOBAL));
  timeline.ActivityEndAll(entries);

  if (response.postscale_factor() != 1.0) {
    // Execute postscaling op
    ScaleBuffer(response.postscale_factor(), entries, buffer_data, buffer_data, num_elements);
  }

  // Copy memory out of the fusion buffer.
  if (entries.size() > 1) {
    timeline.ActivityStartAll(entries, MEMCPY_OUT_FUSION_BUFFER);
    MemcpyOutFusionBuffer(buffer_data, entries);
    timeline.ActivityEndAll(entries);
  }

  return Status::OK();
}

此時具體邏輯如下：

+---------------------------------+
|                                 |             +-----------------------+
|  BackgroundThreadLoop           |             |                       |
|                                 |             | OperationManager      |
|   +--------------------------+  |             |                       |
|   |  RunLoopOnce             |  |             |                       |
|   |                          |  |             |                       |
|   |                          |  |             |                       |      +-->  GPUAllreduce
|   |     ComputeResponseList  |  |    +----------> ExecuteOperation    |      |
|   |             +            |  |    |        |           +           |      |
|   |             |            |  |    |        |           |           |      +-->  NCCLHierarchicalAllreduce
|   |             |            |  |    |        |           |           |      |
|   |             |            |  |    | 1      |           |  2        |      |
|   |             v            |  |    |        |           |           |      +-->  NCCLAllreduce
|   |                          |  |    |        |           |           |      |
|   |      PerformOperation +----------+        |           v           |      |
|   |                          |  |             |   ExecuteAllreduce    |      +-->  DDLAllreduce
|   +--------------------------+  |             |    +                  |      |
|                                 |             |    |                  |      |
+---------------------------------+             |    |                  |      +-->  GlooAllreduce
                                                |    |  allreduce_ops----------+
                                                |    |                  |      |    +----------------+
                                                |    |                  |      +--> | MPIAllreduce   |
                                                +-----------------------+           |                |
                                                     |                              |                |
                                                     +---------------------------------->  Execute   |
                                                                       3            |                |
                                                                                    +----------------+

手機如下：