[Source code analysis] How PyTorch implements backward propagation (2) ---- static structure of the engine

Posted by 羅西的思考 on 2021-10-27


0x00 Summary

At the end of the previous post we reached the following code, which calls the engine to run backward propagation. In that call:

  • roots is a vector (an edge_list) containing the gradient_edge() of the forward-pass output nodes, i.e. each output node's (grad_fn_, 0).
  • inputs holds the initial gradients for those roots; if the user does not supply them, they default to (tensor(1.),).
  • outputs are the backward output edges built from the forward-pass input nodes; each edge is a (Function, input number) pair.
 Engine::execute(roots, inputs, keep_graph, create_graph, accumulate_grad, outputs);

Matching this against the definition of Engine, we can map each of these inputs to the parameters of execute one by one.

auto Engine::execute(const edge_list& roots, // root nodes of the backward pass
                     const variable_list& inputs, // gradients of the root nodes
                     bool keep_graph, // whether to keep the computation graph after this pass
                     bool create_graph, // whether to build a differentiable graph for higher-order derivatives
                     bool accumulate_grad,
                     const edge_list& outputs // nodes whose gradients need to be output
                    ) 
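
For intuition, here is a minimal libtorch sketch (my own example and variable names, not taken from the original post) showing which user-level objects end up in these parameters when backward runs; the mapping in the comments is approximate:

#include <torch/torch.h>
#include <iostream>

int main() {
  // A tiny graph: z = (x * y).sum()
  auto x = torch::ones({2, 2}, torch::requires_grad());
  auto y = torch::ones({2, 2}, torch::requires_grad());
  auto z = (x * y).sum();

  // z.backward() eventually reaches Engine::execute with, roughly:
  //   roots   = { z's gradient_edge() }  -> (SumBackward0, 0)
  //   inputs  = { tensor(1.) }           -> the default initial gradient for a scalar
  //   outputs = {}                       -> empty, since no `inputs=` argument was given
  z.backward();

  // torch::autograd::grad() instead passes a non-empty `outputs` (the edges of x here)
  // and returns the captured gradients rather than accumulating into x.grad().
  auto z2 = (x * y).sum();
  auto grads = torch::autograd::grad(/*outputs=*/{z2}, /*inputs=*/{x});
  std::cout << grads[0] << std::endl;
  return 0;
}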

So in this post we first look at the engine from a static point of view, i.e. at its data structures and static properties.

Links to the previous posts in this series:

Automatic Differentiation, a Deep Learning Tool (1)

Automatic Differentiation, a Deep Learning Tool (2)

[Source code analysis] Automatic Differentiation, a Deep Learning Tool (3) --- walking through an example

[Source code analysis] How PyTorch implements forward propagation (1) --- basic classes (part 1)

[Source code analysis] How PyTorch implements forward propagation (2) --- basic classes (part 2)

[Source code analysis] How PyTorch implements forward propagation (3) --- the concrete implementation

[Source code analysis] How PyTorch implements backward propagation (1) ---- invoking the engine

0x01 Engine

Engine is the core of autograd and implements backward propagation. Backward propagation runs from the root nodes (the outputs of the forward pass) towards the outputs (the inputs of the forward pass); during the backward pass the engine traverses the dynamically built computation graph according to the dependencies recorded during the forward pass.

The entry point of Engine is the execute function, whose logic is:

  • Build a GraphRoot from the root nodes roots.
  • Build the computation graph from the metadata of the Node instances in roots and the relationships between layers.
    • Keep following next_edge to reach the next Edge, until the whole graph has been computed.
    • Use queues so that the backward computation is carried out by multiple threads.

The engine is declared in torch/csrc/autograd/engine.h and implemented in torch/csrc/autograd/engine.cpp. Only its member variables are shown here; the most important ones are:

  • device_ready_queues_ : a list of ReadyQueue. A worker thread is started for every ReadyQueue in device_ready_queues_, and the threads interact with each other through these queues. Note that, because a CPU thread handles the CPU-related work of the backward call that invoked it, every GraphTask also owns its own cpu_ready_queue_, to which callers send the CPU work that needs to be done.
  • thread_pool_shared_ : a thread pool, used to process reentrant backward calls on multiple threads.

The code is:

// A single instance of this struct should be created through the whole process lifetime.
// The worker thread creation logic and Engine's destructor rely on this.
struct TORCH_API Engine {

  // Ensures device_ready_queues_ are initialized only once
  std::once_flag start_device_threads_flag_;
  // Safe to read device_ready_queues_ without synchronization after initialization
  std::vector<std::shared_ptr<ReadyQueue>> device_ready_queues_;

  std::vector<std::function<void()>> final_callbacks_;
  // To protect reads and writes to final_callbacks_
  std::mutex post_callbacks_lock_;

  // How many nested reentrant calls are allowed until a new thread is used
  int max_recursion_depth_;

  struct ThreadPoolShared {
    // Data structures used by the threads for executing reentrant backwards
    // tasks. See Note [Reentrant backwards]
    // Number of available threads for processing new GraphTasks.
    unsigned int num_workers_;
    // The threads will wait on work_ to be notified of GraphTasks
    std::condition_variable work_;
    // To protect reads and writes to graphtask_queue_ and num_workers_
    // and for synchronizing creating new threads when needed
    std::mutex mutex_;
    // Workers will process the GraphTasks added to this queue. A GraphTask is
    // allocated inside Engine::execute and lives for the duration of execute
    std::queue<std::weak_ptr<GraphTask>> graphtasks_queue_;

    ThreadPoolShared() : num_workers_(0) {}
 };

 // Temporary workaround until shutting down threads is done
 // We need shared ownership of all these objects because the threads are leaked
 // when Engine shuts down, so there may be threads waiting on work_
 // for the graphtasks_queue_ to be nonempty.
 std::shared_ptr<ThreadPoolShared> thread_pool_shared_;

private:
  // Number of non-reentrant threads
  std::atomic<uint32_t> non_reentrant_device_thread_count_;
  // Destructor will wait for non-reentrant threads to finish
  std::condition_variable non_reentrant_device_thread_condvar_;
  std::mutex non_reentrant_device_thread_mutex_;
  // stop() must be called before the destruction path goes down to the base
  // class, in order to avoid a data-race-on-vptr. Use this boolean to guard
  // whether stop() has already been called, so we can call this in every
  // destructor of the class hierarchy.
  bool stopped_{false};
};

Next we introduce the basic classes one by one; for each class we try to analyze it together with the code that uses it.

0x02 GraphRoot

GraphRoot is a Node subclass; Node is what used to be the Function class.

struct TORCH_API GraphRoot : public Node {
    
  GraphRoot(edge_list functions, variable_list inputs)
      : Node(std::move(functions)),
      outputs(std::move(inputs)) { // assign the incoming inputs to the outputs member variable
    // Ensures calls to stream() on a GraphRoot instance reflect current stream(s)
    // on devices of root grad tensors at the time the instance is constructed.
    for (const auto& t : outputs) {
      add_input_metadata(t);
    }
  }

  variable_list apply(variable_list&& inputs) override {
    return outputs; // apply just returns the stored outputs, i.e. the gradients, ignoring its arguments. Other Node subclasses have their own implementations.
  }

  variable_list outputs; // the gradients; only used through apply(), which returns this outputs.
};

struct TORCH_API Identity : public Node {
  variable_list apply(variable_list&& inputs) override;
};

2.1 Construction

Inside the engine, GraphRoot is built with the following code. Given how execute is called, we know that GraphRoot is constructed from the roots of the backward pass (the starting points) and the gradients of those roots, inputs.

  // If we receive a single root, skip creating extra root node
  bool skip_dummy_node = roots.size() == 1;
  auto graph_root = skip_dummy_node ?
    roots.at(0).function :
    std::make_shared<GraphRoot>(roots, inputs); 

Let's recall how the Node base class inside GraphRoot is constructed. GraphRoot builds its Node base from the edge list: the backward roots become the edges associated with GraphRoot (as a Node), and GraphRoot itself adds the member variable variable_list outputs (which is the inputs argument).

  explicit Node(edge_list&& next_edges = edge_list())
    : Node(/*sequence_nr=*/at::sequence_number::get_and_increment(),
    std::move(next_edges)) {}

Concretely:

+------------------------------------+
| GraphRoot                          |
|                                    |
|   variable_list outputs +--------------->  inputs  (gradients, passed through to downstream)
|                                    |
|                                    |
|   +----------------------------+   |
|   | Node                       |   |
|   |                            |   |
|   |                            |   |
|   |   edge_list next_edges_ +----------->  roots   (starting points of backward)
|   |                            |   |
|   +----------------------------+   |
|                                    |
+------------------------------------+

2.2 Role

The role of GraphRoot is:

  • GraphRoot is the input of the backward pass, i.e. its root node.
  • When graph_root is constructed:
    • if there is only one root node, that root is used directly as the graph root.
    • if there are multiple roots, a GraphRoot (think of it as a virtual root node) is constructed with those roots as arguments, and this GraphRoot becomes the real root; the roots become the edges of this Node.
  • As the constructor shows, the engine's inputs (the incoming backward gradients) become GraphRoot's outputs member.
  • The soul of a Function is its apply method; for GraphRoot, apply simply returns its stored outputs, so the original inputs are passed straight through GraphRoot to the next stage of backward propagation.
  • Later, compute_dependencies uses this GraphRoot, i.e. GraphRoot's next_edges_, to derive the dependency relationships of the graph, as the following code shows.
  // If we receive a single root, skip creating extra root node
  bool skip_dummy_node = roots.size() == 1;
  auto graph_root = skip_dummy_node ?
    roots.at(0).function : // if there is only one root, use it directly as the graph root
    std::make_shared<GraphRoot>(roots, inputs); // with multiple roots, construct a GraphRoot

  auto min_topo_nr = compute_min_topological_nr(outputs);
  // Now compute the dependencies for all executable functions
  compute_dependencies(graph_root.get(), *graph_task, min_topo_nr);
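
To isolate this pass-through behaviour from the rest of autograd, here is a small self-contained sketch (toy types, plain doubles instead of tensors; not the real autograd classes) of what the virtual root does: it stores the initial gradients, and its apply ignores its arguments and simply hands those gradients to whatever the next edges point at:

#include <iostream>
#include <string>
#include <utility>
#include <vector>

using variable_list = std::vector<double>;  // toy stand-in for a list of gradients

// Toy Node: apply() produces the gradients to send along next_edges_.
struct Node {
  std::vector<Node*> next_edges_;
  virtual variable_list apply(variable_list&& inputs) = 0;
  virtual ~Node() = default;
};

// Toy GraphRoot: stores the initial gradients and just returns them,
// mirroring how the real GraphRoot::apply ignores its arguments.
struct GraphRoot : Node {
  variable_list outputs;
  GraphRoot(std::vector<Node*> roots, variable_list inputs) : outputs(std::move(inputs)) {
    next_edges_ = std::move(roots);
  }
  variable_list apply(variable_list&& /*inputs*/) override { return outputs; }
};

// Stands in for a real root such as MulBackward0 or SumBackward0.
struct PrintingNode : Node {
  std::string name;
  explicit PrintingNode(std::string n) : name(std::move(n)) {}
  variable_list apply(variable_list&& inputs) override {
    std::cout << name << " received grad " << inputs[0] << "\n";
    return inputs;
  }
};

int main() {
  PrintingNode r1{"root1"}, r2{"root2"};
  GraphRoot graph_root({&r1, &r2}, {1.0});      // two real roots -> one virtual root
  variable_list grads = graph_root.apply({});   // pass-through of the initial gradient
  for (Node* next : graph_root.next_edges_) {
    next->apply(variable_list(grads));  // in the real engine each edge's input_nr selects its slot
  }
  return 0;
}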

0x03 GraphTask

Let's start with the basic concept. A GraphTask instance is a resource-management object at the level of one dynamic-graph execution: it holds all the metadata needed for a single backward run, such as the dependency counts of all Nodes in the graph, the waiting buffers for Nodes that are not yet ready, and so on. If reentrant backward is allowed, several GraphTasks can be in flight at the same time.

3.1 Definition

The main member variables of GraphTask are:

  • outstanding_tasks_ : counts the NodeTasks still pending; when it reaches 0 the task is finished. As long as it is non-zero, this GraphTask still has work to do.
  • dependencies_ : used to decide whether a downstream node is ready to be executed.
  • not_ready_ : stores functions that are not yet ready together with their (partial) inputs.
  • grad_mode_ : whether gradient computation is enabled. Code running during the backward pass relies on AutoGradMode::is_enabled() to decide whether gradients should currently be computed.
  • owner : the device index of the thread that owns this GraphTask; whichever thread the GraphTask was created in, this is the value of worker_device in that thread.
  • cpu_ready_queue_ :
    • CPU threads are dedicated to the CPU-related work of the backward call they invoked, so every GraphTask maintains its own cpu_ready_queue_, and CPU work should be sent to that queue.
    • Keeping a cpu_ready_queue_ per GraphTask means that, when we are on a device thread (e.g. a GPU thread) and the next NodeTask should run on the CPU, we know which ready queue to push that NodeTask to.
  • mutex_ : protects the following members: not_ready_, dependencies_, captured_vars_, has_error_, future_result_, cpu_ready_queue_, and leaf_streams.
  • keep_graph : specifies whether resources are released after one backward computation.

The definition is below (only member variables are shown):

// GraphTask holds metadata needed for a single execution of backward()
struct GraphTask: std::enable_shared_from_this<GraphTask> {
  std::atomic<uint64_t> outstanding_tasks_{0};
  // Indicates if an error occurred while executing any task.  When this is
  // true, it signals all threads to stop executing.
  std::atomic_bool has_error_{false};
  std::atomic_bool future_completed_{false};
  // It is safe to read grad_mode_ and keep_graph_ without synchronization
  bool keep_graph_;
  bool grad_mode_;

  // To protect reads/writes to not_ready_, dependencies_, captured_vars_,
  // has_error_, future_result_, cpu_ready_queue_, and leaf_streams.
  std::mutex mutex_;
  std::unordered_map<Node*, InputBuffer> not_ready_;
  std::unordered_map<Node*, int> dependencies_;

  struct ExecInfo {
    struct Capture {
      Capture(const Capture&) = delete;
      Capture(Capture&&) = default;

      Capture(int input_idx, int output_idx)
          : input_idx_(input_idx), output_idx_(output_idx) {}
      int input_idx_; // within Node inputs
      int output_idx_; // within the output vector of a GraphTask

      // This hook will be executed after a grad is captured. The captured
      // grad will be replaced by the return value of the hook.
      struct GradCaptureHook {
        virtual ~GradCaptureHook() = default;
        virtual at::Tensor operator()(const at::Tensor& grad) = 0;
      };
      // The hooks will be called one by one in the order as they were added.
      // The input grad of a hook will be the output of its preceding hook. The
      // first hook will take the captured grad as the input. The output of the
      // last hook will replace the captured grad.
      std::vector<std::unique_ptr<GradCaptureHook>> hooks_;
    };

    bool should_execute() const {
      return needed_ || captures_;
    }

    bool needed_ = false;
    std::unique_ptr<std::vector<Capture>> captures_;
  };
  // Exec info has a bit complicated semantics. If it's empty, it means the task
  // is run in a "default" mode, which means that all next_edges we encounter
  // should get executed. If it's not empty, only functions that have an entry
  // and this entry has needed == True should be executed. exec_info is only empty
  // when the graph is executed via .backward() and the inputs parameter is not passed.
  // Otherwise, when executed through .grad(), or when inputs arg is specified for
  // .backward(), exec_info will be non-empty.
  //
  // exec_info_ is safe to read without synchronization
  std::unordered_map<Node*, ExecInfo> exec_info_;
  // Captures variables are grads captured that we return to the user. After
  // execution of the GraphTask is completed, the captured_vars_ are moved
  // out of the GraphTask and are no longer valid.
  std::vector<Variable> captured_vars_;

  at::ThreadLocalState thread_locals_ =
      at::ThreadLocalState(/* keep_grad_mode */ false);

  std::unordered_set<c10::Stream> leaf_streams;

  // The value of worker_device in the thread that created this task.
  // See Note [Reentrant backwards]
  // Safe to read owner_ and reentrant_depth_ without synchronizaton
  int owner_;
  // The number of parent graph tasks for this graph task
  const int reentrant_depth_;

  // Whether or not to stop execution for this GraphTask when an error is
  // encountered. When set to true, this would cause Engine::execute() to throw
  // an exception as soon as the autograd engine receives an exception.
  bool exit_on_error_;

  // CPU threads are dedicated to processing CPU work for the backward they invoked.
  // So any given graph task maintains its own cpu_ready_queue_ where you should send
  // work for it to be done. We memoize the cpu_ready_queue_ per GraphTask so that
  // we know which ready queue we should push to if we are on device thread (i.e. GPU)
  // and but next NodeTask should be run on CPU.
  std::shared_ptr<ReadyQueue> cpu_ready_queue_;

  // Future representing the completion of the graph task. Notified when all
  // tasks are done.
  std::shared_ptr<at::ivalue::Future> future_result_;

  // Final callbacks installed during execution of this GraphTask
  std::vector<std::function<void()>> final_callbacks_;
  // To protect reads and writes to final_callbacks_. Intentionally no reusing
  // mutex_ as the two are protecting different data structures.
  std::mutex final_callbacks_lock_;
};

Let's now walk through some of the important member variables.

3.2 outstanding_tasks_

This is the number of NodeTasks still to be processed and is used to decide whether this GraphTask still has work to do. The counter is always incremented before it is decremented; when it reaches 0, the task is finished.

  • When the GraphTask is created, the value is 0.
  • Whenever a NodeTask is pushed into a ReadyQueue, outstanding_tasks_ is incremented by 1.
  • After a worker thread runs evaluate_function(task) once, outstanding_tasks_ is decremented by 1.
  • As long as the value is non-zero, this GraphTask still needs to run.

3.2.1 Task completion

The following code decides whether a GraphTask has completed.

bool GraphTask::completed() {
  return outstanding_tasks_.load() == 0 ||
      (exit_on_error_ && has_error_.load());
}

3.2.2 Increment

outstanding_tasks_ is incremented whenever a NodeTask is added: when a NodeTask is pushed into a ReadyQueue, the GraphTask that the NodeTask belongs to increments its outstanding_tasks_ by one.

auto ReadyQueue::push(NodeTask item, bool incrementOutstandingTasks) -> void {
  {
    // Lock mutex for writing to heap_
    std::lock_guard<std::mutex> lock(mutex_);
    if (incrementOutstandingTasks) {
      std::shared_ptr<GraphTask> graph_task = item.base_.lock();
      ++graph_task->outstanding_tasks_; // increment
    }
    heap_.push(std::move(item));
  }
  not_empty_.notify_one();
}

3.2.3 Decrement

It is decremented by one when a NodeTask finishes. Here is the simplified code.

auto Engine::thread_main(const std::shared_ptr<GraphTask>& graph_task) -> void {

  while (graph_task == nullptr || !graph_task->future_result_->completed()) { // run the GraphTask

    std::shared_ptr<GraphTask> local_graph_task;
    {
      NodeTask task = local_ready_queue->pop();

      if (task.fn_ && !local_graph_task->has_error_.load()) {
        // run the NodeTask
        evaluate_function(local_graph_task, task.fn_.get(), task.inputs_, local_graph_task->cpu_ready_queue_);
      }
    }

    // Decrement the outstanding tasks.
    --local_graph_task->outstanding_tasks_; // the NodeTask has finished; decrement here

    // Check if we've completed execution.
    if (local_graph_task->completed()) { // check whether the GraphTask has completed
      // do the completion handling
    }
  }
}

3.3 keep_graph

keep_graph specifies whether resources are released after one backward computation; the resources in question are those created during the forward pass. If keep_graph is false, call_function calls fn's will_release_variables method so that the resources saved by fn can be released.

During backward propagation, Engine::evaluate_function calls

auto outputs = call_function(graph_task, func, inputs);

Inside call_function, if the graph does not need to be kept, the resources are released.

static variable_list call_function(
    std::shared_ptr<GraphTask>& graph_task,
    Node* func,
    InputBuffer& inputBuffer) {
  CheckpointValidGuard cpvguard(graph_task);
  auto& fn = *func;
  auto inputs =
      call_pre_hooks(fn, InputBuffer::variables(std::move(inputBuffer)));

  if (!graph_task->keep_graph_) {
    fn.will_release_variables(); // if the graph does not need to be kept, signal that the saved variables can be released
  }
  // rest omitted
}
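
At the user level keep_graph corresponds to the retain_graph argument of backward. A minimal libtorch sketch (my own example, not from the original post) of the difference:

#include <torch/torch.h>

int main() {
  auto x = torch::ones({2, 2}, torch::requires_grad());
  auto y = (x * x).sum();

  // keep_graph == true: the buffers saved during forward are kept,
  // so a second backward over the same graph is allowed.
  y.backward(/*gradient=*/{}, /*retain_graph=*/true);
  y.backward();

  // keep_graph == false (the default): buffers are released during the first call,
  // so running backward again over the same graph would raise an error.
  auto z = (x * x).sum();
  z.backward();
  // z.backward();  // would throw: the saved buffers have already been freed
  return 0;
}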

3.4 dependencies_

dependencies_ is used to decide whether a downstream node can already be executed. Its type is:

std::unordered_map<Node*, int> dependencies_;

The dependencies_ member is initialized in the call to compute_dependencies: each time a grad_fn appears in some other node's next_edges(), dependencies[this_grad_fn] is incremented by 1. If dependencies[this_grad_fn] is greater than 0, this_grad_fn still has backward dependencies, i.e. this_grad_fn has to wait until all the nodes whose next_edges point to it have finished before it can run its own backward computation.

For example, for the following computation graph:

# MulBackward0 is referenced once by SubBackward0's next_edges, i.e. MulBackward0 must wait until SubBackward0 has finished its backward computation before running its own
dependencies[MulBackward0] = 1

# PowBackward0-1 is referenced once by MulBackward0's next_edges
dependencies[PowBackward0-1] = 1

# PowBackward0-2 is referenced once by MulBackward0's next_edges
dependencies[PowBackward0-2] = 1
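
To make the counting rule concrete, here is a small self-contained sketch (a toy Node type, not the real engine code) of compute_dependencies-style counting: a breadth-first walk from the root increments a counter every time a node appears in someone's next_edges. The engine code that later decrements those counters is shown next.

#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Minimal stand-in for an autograd Node: just a name and its next edges.
struct Node {
  std::string name;
  std::vector<Node*> next_edges;
};

// Breadth-first walk from the root: every time a node shows up in someone's
// next_edges, its dependency count is incremented.
std::unordered_map<Node*, int> compute_dependencies(Node* root) {
  std::unordered_map<Node*, int> dependencies;
  std::unordered_set<Node*> seen{root};
  std::queue<Node*> queue;
  queue.push(root);
  while (!queue.empty()) {
    Node* fn = queue.front(); queue.pop();
    for (Node* next : fn->next_edges) {
      dependencies[next] += 1;          // one more producer feeds `next`
      if (seen.insert(next).second) {
        queue.push(next);
      }
    }
  }
  return dependencies;
}

int main() {
  Node pow1{"PowBackward0-1", {}}, pow2{"PowBackward0-2", {}};
  Node mul{"MulBackward0", {&pow1, &pow2}};
  Node sub{"SubBackward0", {&mul}};

  auto deps = compute_dependencies(&sub);
  for (auto& [node, count] : deps) {
    std::cout << node->name << " : " << count << "\n";
  }
  // Prints each of the three nodes with count 1 (iteration order is unspecified),
  // matching the table above. During execution, evaluate_function decrements the
  // count of each next node; when it hits zero the node is ready.
  return 0;
}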

Let's look at the actual code (irrelevant parts removed).

void Engine::evaluate_function(
    std::shared_ptr<GraphTask>& graph_task,
    Node* func,
    InputBuffer& inputs,
    const std::shared_ptr<ReadyQueue>& cpu_ready_queue) {

  // run the backward computation
  auto outputs = call_function(graph_task, func, inputs);

  std::lock_guard<std::mutex> lock(graph_task->mutex_);
  for (int i = 0; i < num_outputs; ++i) { // iterate over this node's outputs
    auto& output = outputs[i];
    const auto& next = fn.next_edge(i); // the edge for the i-th output

    // Check if the next function is ready to be computed
    bool is_ready = false;
      
    // get the dependency map
    auto& dependencies = graph_task->dependencies_;
    auto it = dependencies.find(next.function.get()); // find the dependency count of the node targeted by the i-th output

    if (it == dependencies.end()) {
      auto name = next.function->name();
      throw std::runtime_error(std::string("dependency not found for ") + name);
    } else if (--it->second == 0) { // this node's backward is done, so decrement the target's dependency count
      dependencies.erase(it); // if it reaches 0 there are no dependencies left; remove it from the map
      is_ready = true; // true means no dependencies remain, so a NodeTask can be built for the next backward step
    }
  }
}

3.5 not_ready_

Used to stage functions that are not yet ready together with their inputs. Its type is:

std::unordered_map<Node*, InputBuffer> not_ready_;

not_ready_ is a map from not-yet-ready nodes to their inputs. Suppose node A has two inputs along the backward path: when the first input has been produced, the second one has not finished its backward computation yet, so A and its first input have to be staged somewhere for later processing. not_ready_ is what does this.

The keys of not_ready_ are the not-yet-ready nodes, and the values are the inputs that have arrived so far for each node.

  • The first time one of node A's inputs is encountered, (node A, A's input info) is put into not_ready_, giving (node A, [A's input 1]).

  • When other inputs of A arrive later, they are added to "A's input info", e.g. giving (node A, [A's input 1, A's input 2]).

  • If A is now ready, A and its inputs are pushed into a ReadyQueue and node A is removed from not_ready_.

  • If A is still not ready (it is waiting for further inputs), the inputs gathered so far simply stay in not_ready_.

Let's look at the code.

    auto& not_ready = graph_task->not_ready_;
    auto not_ready_it = not_ready.find(next.function.get());
    if (not_ready_it == not_ready.end()) { // the next node is not in the not-ready map yet
      // Skip functions that aren't supposed to be executed
      if (!exec_info_.empty()) {
        auto it = exec_info_.find(next.function.get());
        if (it == exec_info_.end() || !it->second.should_execute()) {
          continue;
        }
      }
      // No buffers have been allocated for the function
      InputBuffer input_buffer(next.function->num_inputs()); // allocate the input buffer for the next node

      // Accumulates into buffer
      const auto opt_next_stream = next.function->stream(c10::DeviceType::CUDA);
      input_buffer.add(next.input_nr, // accumulate this output into the next node's input buffer
                       std::move(output),
                       opt_parent_stream,
                       opt_next_stream);

      if (is_ready) { // is_ready was computed from the dependency counts above; true means backward can proceed
        auto queue = ready_queue(cpu_ready_queue, input_buffer.device());
        queue->push(
            NodeTask(graph_task, next.function, std::move(input_buffer)));
      } else {
        // dependencies remain, backward cannot proceed yet; stage it in not_ready_
        not_ready.emplace(next.function.get(), std::move(input_buffer));
      }
    } else { // the next node is already in the not-ready map
        
      // The function already has a buffer
      auto &input_buffer = not_ready_it->second;

      // Accumulates into buffer
      const auto opt_next_stream = next.function->stream(c10::DeviceType::CUDA);
      input_buffer.add(next.input_nr, // add the freshly computed gradient into the existing input_buffer
                       std::move(output),
                       opt_parent_stream,
                       opt_next_stream);
      if (is_ready) { // now ready: push it onto a ready queue
        auto queue = ready_queue(cpu_ready_queue, input_buffer.device());
        queue->push(
            NodeTask(graph_task, next.function, std::move(input_buffer)));
        not_ready.erase(not_ready_it); // and remove it from the not-ready map
      }
    }
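
Putting dependencies_ and not_ready_ together, the dispatch decision above boils down to: decrement the target's dependency count, accumulate the incoming gradient into its staged buffer, and only move it to a ready queue once the count reaches zero. A minimal self-contained sketch (plain doubles and toy types, not the real engine code):

#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Node { std::string name; int num_inputs; };
using InputBuffer = std::vector<double>;  // stand-in: one slot per input position

std::unordered_map<Node*, int> dependencies;            // how many producers are still pending
std::unordered_map<Node*, InputBuffer> not_ready;       // staged partial inputs
std::queue<std::pair<Node*, InputBuffer>> ready_queue;  // stand-in "NodeTask" queue

// Called when a producer has computed `grad` for input slot `input_nr` of `next`.
void dispatch(Node* next, int input_nr, double grad) {
  bool is_ready = (--dependencies[next] == 0);
  auto it = not_ready.find(next);
  if (it == not_ready.end()) {
    InputBuffer buf(next->num_inputs, 0.0);
    buf[input_nr] += grad;
    if (is_ready) ready_queue.push({next, std::move(buf)});
    else          not_ready.emplace(next, std::move(buf));
  } else {
    it->second[input_nr] += grad;
    if (is_ready) {
      ready_queue.push({next, std::move(it->second)});
      not_ready.erase(it);
    }
  }
}

int main() {
  Node a{"A", 2};
  dependencies[&a] = 2;            // A waits for two producers
  dispatch(&a, 0, 1.5);            // first input arrives -> staged in not_ready
  dispatch(&a, 1, 2.5);            // second arrives -> A moves to ready_queue
  std::cout << "ready tasks: " << ready_queue.size() << "\n";  // prints 1
  return 0;
}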

3.6 exec_info_

The main purpose of ExecInfo is to decide whether a node needs to be executed; it can also register hooks that are invoked when a gradient is captured.

3.6.1 Definition

It is defined as follows:

struct ExecInfo {
  struct Capture {
    Capture(const Capture&) = delete;
    Capture(Capture&&) = default;

    Capture(int input_idx, int output_idx)
        : input_idx_(input_idx), output_idx_(output_idx) {}
    int input_idx_; // within Node inputs
    int output_idx_; // within the output vector of a GraphTask

    // This hook will be executed after a grad is captured. The captured
    // grad will be replaced by the return value of the hook.
    struct GradCaptureHook {
      virtual ~GradCaptureHook() = default;
      virtual at::Tensor operator()(const at::Tensor& grad) = 0;
    };
    // The hooks will be called one by one in the order as they were added.
    // The input grad of a hook will be the output of its preceding hook. The
    // first hook will take the captured grad as the input. The output of the
    // last hook will replace the captured grad.
    std::vector<std::unique_ptr<GradCaptureHook>> hooks_;
  };

  bool should_execute() const {
    return needed_ || captures_;
  }

  bool needed_ = false;
  std::unique_ptr<std::vector<Capture>> captures_;
};

The corresponding member variable lives in GraphTask:

// Exec info has a bit complicated semantics. If it's empty, it means the task
// is run in a "default" mode, which means that all next_edges we encounter
// should get executed. If it's not empty, only functions that have an entry
// and this entry has needed == True should be executed. exec_info is only empty
// when the graph is executed via .backward() and the inputs parameter is not passed.
// Otherwise, when executed through .grad(), or when inputs arg is specified for
// .backward(), exec_info will be non-empty.
//
// exec_info_ is safe to read without synchronization
std::unordered_map<Node*, ExecInfo> exec_info_;

3.6.2 Role

exec_info_ gives every Node of the GraphTask an ExecInfo entry, i.e. its execution information.

  • If exec_info_ is empty, the task runs in "default" mode, i.e. all next_edges we encounter should be executed.

  • If exec_info_ is non-empty, only specific functions are executed, namely those that have an entry and whose entry has needed_ == true (or has gradients to capture).

When is exec_info_ empty, and when is it non-empty?

  • When the graph is run via .backward() and the inputs argument is not passed, exec_info is empty, and everything is executed.
  • When the graph is run via .grad(), or via .backward() with the inputs argument given, exec_info_ is non-empty.

So exec_info_ and captured_vars_ exist for grad() and for backward() with an inputs argument: they mark which gradients need to be computed in that case. Only certain nodes then need to run, namely those from which there is a path to the requested outputs.
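
A minimal libtorch sketch (my own example) of these two cases:

#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::ones({2, 2}, torch::requires_grad());
  auto w = torch::ones({2, 2}, torch::requires_grad());
  auto y = (x * w).sum();

  // Case 1: plain backward() without an `inputs` argument.
  // exec_info_ stays empty, every reachable node runs, and gradients are
  // accumulated into both x.grad() and w.grad().
  y.backward();

  // Case 2: autograd::grad() with explicit inputs.
  // exec_info_ is populated so that only nodes on a path to x are executed;
  // w's AccumulateGrad is skipped, and the result is captured into the return
  // value (captured_vars_) instead of being written to .grad().
  auto y2 = (x * w).sum();
  auto gx = torch::autograd::grad(/*outputs=*/{y2}, /*inputs=*/{x});
  std::cout << gx[0] << std::endl;
  return 0;
}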

3.6.3 Population

Inside Engine::execute, init_to_execute is called to populate the ExecInfo entries.

if (!outputs.empty()) {
  graph_task->init_to_execute(*graph_root, outputs, accumulate_grad, min_topo_nr);
}

Its logic, quoting the source comments, is:

Populates exec_info so nodes that should be executed have `exec_info[node].needed_ = true`. Only nodes that have a path to any edge in `outputs` should be executed. The code below populates exec_info using recursion, but the actual code does this iteratively. Refer to the numbering to see how the actual code corresponds. A difference to note is that in the iterative version, when you are working with the current Node, you are responsible for updating your parent's is_needed after all your children have been updated.

As the comment says, its job is to populate exec_info so that the nodes that should be executed have exec_info[node].needed_ = true.

Only specific nodes should be executed, namely nodes that have a path to any of the edges in outputs.

The pseudocode in the comment populates exec_info recursively, but the real code does it iteratively. The key line is where the ExecInfo entry is inserted: exec_info_.emplace(stack.back().fn_, ExecInfo());. A trimmed version of the code follows:

void GraphTask::init_to_execute(Node& graph_root, const edge_list& outputs, bool accumulate_grad, uint64_t min_topo_nr) {
  // Populates exec_info so nodes that should be executed have `exec_info[node].needed_ = true`
  // Only nodes that have a path to any edge in `outputs` should be executed.
  // The code below populates exec_info using recursion, but the actual code does this
  // iteratively. Refer to the numbering to see how the actual code corresponds.
  // A difference to note is that in the iterative version, when you are working with
  // the current Node, you are reponsible to update your parent's is_needed after all your
  // children have been updated.
  //
  // is_needed = {fn: True for fn in outputs}             # (0)
  // seen = {}
  // def compute_is_needed(fn):
  //   for next_edge in fn.next_edges:
  //     child_fn = next_edge.fn
  //     if child_fn in seen and is_needed[child_fn]:     # (1)
  //       is_needed[fn] = true
  //     else:
  //       seen.add(child_fn)
  //       if compute_is_needed(child_fn):
  //         is_needed[fn] = true                         # (2)
  //                                                      # (3) exit for-loop
  //   return is_needed[fn]
  // compute_is_needed(graph_root)
  //
  // NB: you might be wondering why we don't populate `seen` with outputs. We cannot
  // because in the case where two outputs lie on the same path, we still need to explore past
  // the first output or we would miss the nodes that are required to compute the second output.
  
  // this block handles the grad()-style API: only tensors that lie on a path to the requested gradients will have gradients computed
  int output_idx = 0;
  for (auto & output_edge : outputs) { // iterate over the output edges
    // (0) `is_needed` above corresponds to `exec_info_[fn].needed_`
    Node *output = output_edge.function.get();
    auto & info = exec_info_[output];
    if (accumulate_grad) {
      // if called through `.backward()` we directly set `needed_` for all the outputs to true
      info.needed_ = true;
    } else {
      if (!info.captures_) {
        info.captures_ = make_unique<std::vector<ExecInfo::Capture>>();
      }
      // capture: (input slot within the Node, slot within the GraphTask's output vector)
      info.captures_->emplace_back(output_edge.input_nr, output_idx++);
    }
  }
  captured_vars_.resize(output_idx);

  auto nodeShouldExecute = [this](Node *fn) {
    auto it = exec_info_.find(fn);
    return it != exec_info_.end() && it->second.should_execute();
  };

  std::vector<Frame> stack;
  std::unordered_set<Node*> seen;
  stack.emplace_back(&graph_root);
  exec_info_.emplace(stack.back().fn_, ExecInfo()); // this initializes exec_info_; one entry is created per node

  while (!stack.empty()) {
    auto &frame = stack.back();
    const auto fn = frame.fn_;

    Node *child_fn = nullptr;
    while((child_fn = frame.get_next_fn()) && !seen.emplace(child_fn).second) {
      // (1) next child exists AND has already been seen
      if (nodeShouldExecute(child_fn)) {
        exec_info_[fn].needed_ = true;
      }
    }

    if (child_fn) {
      // (2) next child exists but has not been seen
      if (child_fn->topological_nr() < min_topo_nr) { 
        // child created before the first output means this child cannot have
        // an edge to output
        continue;
      }
      stack.emplace_back(child_fn);
    } else {
      // (3) no next child exists for `fn` means its `needed` has already been
      // finalized. pop stack and update parent
      stack.pop_back();
      if (nodeShouldExecute(fn) && !stack.empty()) {
        exec_info_[stack.back().fn_].needed_ = true;
      }
    }
  }
}

3.6.4 GradCaptureHook

Within this, ExecInfo::Capture::GradCaptureHook is for post-processing a gradient after it has been captured.

However, this is mainly used in the distributed setting, because the distributed engine needs to accumulate gradients, and that has to happen as a post-processing step after the normal gradient operation.

DistEngine::computeDependencies adds such hooks:

    // Create a dummy GraphRoot and run init_to_execute with it.
    GraphRoot dummyRoot(edges, {});
    graphTask->init_to_execute(dummyRoot, outputEdges, /*accumulate_grad=*/false, /*min_topo_nr=*/0);
    for (auto& mapEntry : graphTask->exec_info_) {
      auto& execInfo = mapEntry.second;
      if (!execInfo.captures_) {
        continue;
      }
      auto fn = mapEntry.first;
      // There may be nodes other than 'AccumulateGrad', e.g. RecvRPCBackward,
      // to be captured.
      if (auto accumulateGradFn = dynamic_cast<AccumulateGrad*>(fn)) {
        for (auto& capture : *execInfo.captures_) {
          capture.hooks_.push_back( // the hook is added here
              std::make_unique<DistAccumulateGradCaptureHook>(
                  std::dynamic_pointer_cast<AccumulateGrad>(
                      accumulateGradFn->shared_from_this()),
                  autogradContext));
        }
      }
    }

They are used inside Engine::evaluate_function:

  auto& exec_info_ = graph_task->exec_info_;
  if (!exec_info_.empty()) {
    auto& fn_info = exec_info_.at(func);
    if (auto* capture_vec = fn_info.captures_.get()) {
      // Lock mutex for writing to graph_task->captured_vars_.
      std::lock_guard<std::mutex> lock(graph_task->mutex_);
      for (const auto& capture : *capture_vec) {
        // get a reference into captured_vars_ so it can be post-processed
        auto& captured_grad = graph_task->captured_vars_[capture.output_idx_];
        // captured_grad is a reference, so assigning to it writes into graph_task->captured_vars_
        captured_grad = inputs[capture.input_idx_];
        for (auto& hook : capture.hooks_) {
          captured_grad = (*hook)(captured_grad); // the hooks post-process the captured gradient here
        }
      }
    }
    if (!fn_info.needed_) {
      // Skip execution if we don't need to execute the function.
      return;
    }
  }

3.7 captured_vars_

captured_vars_ was mentioned above, so let's analyze it as well.

Captured variables are the captured gradients that we return to the user. After execution of the GraphTask is completed, captured_vars_ is moved out of the GraphTask and is no longer valid.

// Captures variables are grads captured that we return to the user. After
// execution of the GraphTask is completed, the captured_vars_ are moved
// out of the GraphTask and are no longer valid.
std::vector<Variable> captured_vars_;

captured_vars_ can be post-processed, namely with the GradCaptureHook mentioned above inside evaluate_function; the assignment also happens in evaluate_function (see the comments in the code above). The function itself will be analyzed in detail in a later post.

// This hook will be executed after a grad is captured. The captured
// grad will be replaced by the return value of the hook.

When the engine finishes the backward pass, the output it finally returns to the caller (for example Python code) is captured_vars_.

void GraphTask::mark_as_completed_and_run_post_processing() {
  // Allow only one thread one attempt to process this logic.
  if (future_completed_.exchange(true)) {
    // Future is already marked complete, or being marked as such.
    // In case the marking complete is only in progress, we add a
    // wait() to guarantee the future is marked complete on exit.
    future_result_->wait();
    return;
  }

  try {
    // Run post processing, before marking the future as complete.
    // Drop lock prior to completing, to avoid holding across callbacks.
    std::unique_lock<std::mutex> lock(mutex_);

    exec_post_processing();
    std::vector<Variable> vars = std::move(captured_vars_); // the output that is finally returned

    // Need to unlock before we call markCompleted to avoid holding locks
    // when the callbacks are called.
    lock.unlock();
    // NOLINTNEXTLINE(performance-move-const-arg)
    future_result_->markCompleted(std::move(vars)); // the final result returned by the backward pass
  } catch (std::exception& e) {
    future_result_->setErrorIfNeeded(std::current_exception());
  }
}

0x04 NodeTask

4.1 Motivation

One question about NodeTask: why add yet another type instead of simply continuing to use GraphTask?

Because GraphTask only holds the overall information for this backward graph; how to compute the gradient at one particular node is something GraphTask knows nothing about, so the new type NodeTask was introduced to handle it. NodeTask objects are exactly what travels through the queues: each one is a derivative function ready to be executed. As the definition below shows, a NodeTask instance is built from a GraphTask, a Node and an InputBuffer. You can think of it as producers constantly pushing NodeTasks into a ReadyQueue while consumers pull NodeTasks out of the ReadyQueue and process them.

4.2 Definition

NodeTask is defined as follows:

struct NodeTask {
  std::weak_ptr<GraphTask> base_; // the GraphTask this task belongs to
  std::shared_ptr<Node> fn_; // the Node to run, e.g. PowBackward0
  // This buffer serves as an implicit "addition" node for all of the
  // gradients flowing here.  Once all the dependencies are finished, we
  // use the contents of this buffer to run the function.
  InputBuffer inputs_; // the inputs of fn_
  // When worker receives a task with isShutdownTask = true, it will immediately
  // exit. The engine sends a shutdown task to every queue upon its destruction.
  bool isShutdownTask_;

  int getReentrantDepth() const;

  NodeTask(
      std::weak_ptr<GraphTask> base,
      std::shared_ptr<Node> fn,
      InputBuffer inputs,
      bool isShutdownTask = false)
      : base_(base),
        fn_(std::move(fn)),
        inputs_(std::move(inputs)),
        isShutdownTask_(isShutdownTask) {}
};

NodeTasks are inserted both from the main thread and from worker threads; let's go through each case.

4.3 Production from the main thread

There are two situations in which the main thread produces a NodeTask.

  • Right at startup, inside execute_with_graph_task, the main thread pushes a NodeTask onto the CPU ready queue (the CPU worker, device index -1).
// Now that all the non-thread safe fields of the graph_task have been populated,
// we can enqueue it.
// in the main thread
queue->push(NodeTask(graph_task, std::move(graph_root), std::move(input_buffer)));
  • Also inside execute_with_graph_task, when there is a reentrant backward call, a NodeTask is inserted as well:
    // We set the worker_device to CPU_DEVICE only if worker_device was previously
    // NO_DEVICE. Setting it to CPU afterwards allow us to detect whether this is
    // a re-entrant call or not.
    set_device(CPU_DEVICE);

    // set the graph_task owner to the current device
    graph_task->owner_ = worker_device;

    // Now that all the non-thread safe fields of the graph_task have been populated,
    // we can enqueue it.
    queue->push(NodeTask(graph_task, std::move(graph_root), std::move(input_buffer)));

Recall how graph_root is initialized:

  auto graph_root = skip_dummy_node ?
    roots.at(0).function : // if there is only one root, use it directly as the graph root
    std::make_shared<GraphRoot>(roots, inputs); // with multiple roots, construct a GraphRoot

graph_root is built from roots and inputs. roots is the gradient_edge() of the final output nodes, for example [ (MulBackward0 instance, 0), (PowBackward0, 0) ]. If the user does not specify inputs, it is the default tensor(1.); if specified, it is the initial gradient.

4.4 Production from worker threads

Inside a worker thread's thread_main, new NodeTask instances are built and added to a queue in the following ways.

4.4.1 The next computable node

In evaluate_function, after one node's backward computation finishes, the engine looks for the next node that can be computed. If one is found, the current node's next edge is taken, a NodeTask is built from it, and that NodeTask is pushed onto the ReadyQueue of the corresponding worker thread (chosen from the device of the next edge's input buffer, and so on).

for (int i = 0; i < num_outputs; ++i) { // iterate over this node's outputs (its next edges)
    
      const auto& next = fn.next_edge(i); // the edge leading to the next node to compute

      if (not_ready_it == not_ready.end()) {
      // Skip functions that aren't supposed to be executed

      // No buffers have been allocated for the function
      InputBuffer input_buffer(next.function->num_inputs());

      // Accumulates into buffer
      const auto opt_next_stream = next.function->stream(c10::DeviceType::CUDA);
      input_buffer.add(next.input_nr,
                       std::move(output),
                       opt_parent_stream,
                       opt_next_stream);

      if (is_ready) {
        auto queue = ready_queue(cpu_ready_queue, input_buffer.device());
        // push the NodeTask for the next node to compute
        queue->push(
            NodeTask(graph_task, next.function, std::move(input_buffer)));
      } else {
        not_ready.emplace(next.function.get(), std::move(input_buffer));
      }
    } else {
      // The function already has a buffer
      auto &input_buffer = not_ready_it->second;

      // Accumulates into buffer
      const auto opt_next_stream = next.function->stream(c10::DeviceType::CUDA);
      input_buffer.add(next.input_nr,
                       std::move(output),
                       opt_parent_stream,
                       opt_next_stream);
      if (is_ready) {
        auto queue = ready_queue(cpu_ready_queue, input_buffer.device());
        // push the NodeTask for the next node to compute
        queue->push(
            NodeTask(graph_task, next.function, std::move(input_buffer)));
        not_ready.erase(not_ready_it);
      }
    }
}    

Here, const auto& next = fn.next_edge(i); is what looks up the next node.

The code of next_edge is:

const Edge& next_edge(size_t index) const noexcept {
  return next_edges_[index];
}

next_edges_ points to the nodes that were the inputs of this Node in the forward graph; in backward propagation they are therefore the nodes this node outputs to.

4.4.2 Wake-up

There is a workaround in thread_main: the current worker thread may finish the graph_task while the thread that owns the graph_task is asleep waiting in pop(). In that case we need to send the owning thread a dummy function task to wake it up, so that it can exit thread_main.

    // Check if we've completed execution.
    if (local_graph_task->completed()) {
      local_graph_task->mark_as_completed_and_run_post_processing();

      auto base_owner = local_graph_task->owner_;
      // The current worker thread finish the graph_task, but the owning thread
      // of the graph_task might be sleeping on pop() if it does not have work.
      // So we need to send a dummy function task to the owning thread just to
      // ensure that it's not sleeping, so that we can exit the thread_main.
      // If it has work, it might see that graph_task->outstanding_tasks_ == 0
      // before it gets to the task, but it's a no-op anyway.
      //
      // NB: This is not necessary if the current thread is the owning thread.
      if (worker_device != base_owner) {
        // Synchronize outstanding_tasks_ with queue mutex
        std::atomic_thread_fence(std::memory_order_release);
        ready_queue_by_index(local_graph_task->cpu_ready_queue_, base_owner)
            ->push(NodeTask(local_graph_task, nullptr, InputBuffer(0)));
      }
    }

4.5 Consumption in worker threads

First, recall again how graph_root is initialized: graph_root is built from roots and inputs, where roots is the gradient_edge() of the final output nodes, for example [ (MulBackward0 instance, 0), (PowBackward0, 0) ], and inputs defaults to tensor(1.) if the user does not specify it.

  auto graph_root = skip_dummy_node ?
    roots.at(0).function : // if there is only one root, use it directly as the graph root
    std::make_shared<GraphRoot>(roots, inputs); // with multiple roots, construct a GraphRoot

Next, let's look at how NodeTasks are consumed.

When a worker thread has just been created, it blocks in queue->pop(), waiting for a producer to insert a task into the queue. Once the main thread sends a NodeTask instance to the ReadyQueue, the consuming worker thread is woken from the blocking pop in thread_main.

The worker thread thus obtains a NodeTask. It then:

  • accesses the GraphTask instance through task.base_;
  • accesses the Node to run through task.fn_, i.e. the backward function this NodeTask must execute, for example MulBackward0 (for the very first task this is graph_root);
  • accesses the InputBuffer instance through task.inputs_, i.e. the inputs of MulBackward0;
  • then passes the NodeTask's fn_ and inputs_ to evaluate_function, which performs the backward computation.

The code is:

  // how a worker thread consumes a NodeTask
  NodeTask task = local_ready_queue->pop();
  if (task.fn_ && !local_graph_task->has_error_.load()) {
    AutoGradMode grad_mode(local_graph_task->grad_mode_);
    try {
      GraphTaskGuard guard(local_graph_task);
      NodeGuard ndguard(task.fn_);
      // backward computation
      evaluate_function(local_graph_task, task.fn_.get(), task.inputs_, local_graph_task->cpu_ready_queue_);
    } catch (std::exception& e) {
      thread_on_exception(local_graph_task, task.fn_, e);
    }
  }
}

Below is a diagram of the producers and consumers.

  • 1) The main thread pushes a NodeTask into the CPU ReadyQueue.
  • 2) Worker thread 1 pops the NodeTask from the CPU ReadyQueue and starts executing it.
  • 3) When worker thread 1 is done, it pushes a NodeTask into one of the ReadyQueues in device_ready_queues_.
  • 4) Worker thread 2, which owns that ReadyQueue, pops the NodeTask and starts executing it.
+--------------+                                                     +-----------------+
| Main Thread  |                                                     | Worker Thread 1 |
|              |       1         +-----------------+       2         |                 |
|              | push(NodeTask)  |                 |  pop(NodeTask)  |                 |
|          +-------------------> | CPU ReadyQueue  +----------------------->           |
|              |                 |                 |                 |                 |
|              |                 +-----------------+                 |                 |
|              |              +----------------------+               |                 |
|              |              | device_ready_queues_ |               |                 |
|              |              |                      |               |                 |
|              |              |                      |    3          |                 |
|              |              |    +-------------+   | push(NodeTask)|                 |
|              |              |    | ReadyQueue  | <------------------------           |
|              |              |    +------+------+   |               |                 |
|              |              |           |          |               |                 |
+--------------+              |           |          |               +-----------------+
                              |           +------------------+
                              |                      |       |       +-----------------+
                              |                      |       |       | Worker Thread 2 |
                              |                      |       |       |                 |
                              |                      |       |       |                 |
                              |                      |       |       |                 |
                              |    +-------------+   |       +------------->           |
                              |    | ReadyQueue  |   | pop(NodeTask) |                 |
                              |    +-------------+   |     4         |                 |
                              |                      |               |                 |
                              |                      |               |                 |
                              |    +-------------+   |               |                 |
                              |    | ReadyQueue  |   |               |                 |
                              |    +-------------+   |               |                 |
                              |                      |               |                 |
                              +----------------------+               +-----------------+

0x05 InputBuffer

Because some nodes have multiple inputs during the backward computation, the inputs of a grad_fn may be accumulated along many paths when gradients are computed. InputBuffer is what accumulates the inputs of a grad_fn.

struct InputBuffer {
    // size is the number of inputs
  explicit InputBuffer(size_t size)
    : buffer(size) {}
  InputBuffer(const InputBuffer& other) = delete;
  InputBuffer(InputBuffer&& other) = default;
  explicit InputBuffer(variable_list&& inputs): buffer(std::move(inputs)) {};
  InputBuffer& operator=(InputBuffer&& other) = default;

  // Accumulates the variable at a specified index.
  // The optional CUDA streams determine which stream the accumulation
  // is run on and how the addition is synchronized.
  void add(size_t pos,
           Variable&& var,
           const c10::optional<c10::Stream>& opt_producer_stream,
           const c10::optional<c10::Stream>& opt_consumer_stream);

  at::Device device() const;

  Variable operator[](size_t pos) { return buffer[pos]; }

  // Returns the inputs as a list of variables. Destroys given InputBuffer.
  static std::vector<Variable> variables(InputBuffer&& g);

private:
  // one Variable slot per input position
  std::vector<Variable> buffer;
};

How does input_buffer.device() determine the corresponding device? It iterates over the variables in input_buffer; the device of the first variable whose device is not the CPU becomes the device of the input_buffer, otherwise the device is the CPU.

auto InputBuffer::device() const -> at::Device {
  // Since we pick the first non-CPU tensor, this won't work with
  // mixed device-type operations (e.g., an op that is both CUDA
  // and XLA).  This is *incredibly* unlikely, so we don't worry
  // about it.
  // iterate over the buffer, take the first non-CPU tensor and return its device
  for (auto& var : buffer) {
    if (var.defined()) {
      auto device = var.device();
      if (device.type() != at::kCPU) {
        return device;
      }
    }
  }
  // Only report to the CPU thread if there really were no tensors
  // from other devices.
  return at::kCPU;
}

Some of InputBuffer's methods are shown below; they insert new gradients and accumulate into existing slots.

  static void accumulate(std::vector<Variable>& buffer,
                         const size_t pos,
                         Variable&& var) {
    TORCH_INTERNAL_ASSERT(pos < buffer.size());
    auto& old_var = buffer[pos];
    // ATen doesn't route sparse additions correctly...
    // do dense + sparse in-place if possible
    if (old_var.is_sparse()) {
      //storage use_count is a big hammer, but for anything lighter there's an adversarial example with unexpected inplace modification
      if (!var.is_sparse() && var.is_contiguous() && var.storage().use_count() == 1) {
          buffer[pos] = var.add_(old_var);
      } else {
          buffer[pos] = var + old_var;
      }
    } else {
      if (var.is_sparse() && !old_var.is_sparse() && old_var.is_contiguous() && old_var.storage().use_count() == 1) {
          buffer[pos] = old_var.add_(var);
      } else {
          buffer[pos] = old_var + var;
      }
    }
  }

  void InputBuffer::add(size_t pos,
                        Variable&& var,
                        const c10::optional<c10::Stream>& opt_producer_stream,
                        const c10::optional<c10::Stream>& opt_consumer_stream) {
  TORCH_INTERNAL_ASSERT(pos < buffer.size());
  if (!var.defined()) {
    return;
  }

  // Switches to accumulate device
  // The device (and stream) chosen for accumulation is:
  //  (1) var is not a CUDA variable. Accumulation happens on var's device.
  //  (2) var is a CUDA variable and it, the consumer, and the producer share the same device:
  //       (2a) Uses the consumer's stream as the accumulation stream
  //       (2b) Syncs the accumulation stream with the producer's stream (if different)
  //       (2c) Accumulates.
  //  (3) var is a CUDA variable and it shares a device with the consumer but not the producer:
  //       (3a) Uses the consumer's stream as the accumulation stream
  //       (3b) Syncs the accumulation stream with the consumer device's default stream
  //       (3c) Accumulates.
  //  (4) var is a CUDA variable and it shares a device with the producer but not the consumer:
  //       (4a) Uses the producer device's default stream as the accumulation stream
  //       (4b) Syncs the accumulation stream with the the producer's stream
  //       (4c) Accumulates.
  //  (5) var is a CUDA variable and it does not share a device with the consumer or producer.
  //      Accumulation happens on the var device's default stream.

  c10::optional<c10::Stream> opt_accumulate_stream = c10::nullopt;
  if (device_of(var)->is_cuda()) {
    const auto on_producer = opt_producer_stream
                        && device_of(var) == opt_producer_stream->device();
    const auto on_consumer = opt_consumer_stream
                        && device_of(var) == opt_consumer_stream->device();
    if (on_producer && on_consumer) {
      // (2a)
      opt_accumulate_stream = opt_consumer_stream;
      if (opt_accumulate_stream != opt_producer_stream) {
        // (2b)
        auto event = c10::Event{c10::DeviceType::CUDA};
        event.record(*opt_producer_stream);
        opt_accumulate_stream->wait(event);
      }
    } else {
      c10::optional<c10::Stream> opt_sync_stream = c10::nullopt;
      const auto guard = c10::impl::VirtualGuardImpl{c10::DeviceType::CUDA};
      if (on_consumer && !on_producer) {
        // (3a)
        opt_accumulate_stream = opt_consumer_stream;
        opt_sync_stream = guard.getDefaultStream(opt_consumer_stream->device());
      } else if (on_producer && !on_consumer) {
        // (4a)
        opt_accumulate_stream = guard.getDefaultStream(opt_producer_stream->device());
        opt_sync_stream = opt_producer_stream;
      } else {
        // (5)
        opt_accumulate_stream = guard.getDefaultStream(*device_of(var));
      }
      if (opt_sync_stream && (opt_accumulate_stream != opt_sync_stream)) {
        // (3b), (4b)
        c10::OptionalDeviceGuard device_guard{opt_sync_stream->device()};
        auto event = c10::Event{c10::DeviceType::CUDA};
        event.record(*opt_sync_stream);
        opt_accumulate_stream->wait(event);
      }
    }
  }

  auto& old_var = buffer[pos];
  if (!old_var.defined()) {
    buffer[pos] = std::move(var);
  } else {
    if (opt_accumulate_stream) {
      c10::OptionalStreamGuard stream_guard{opt_accumulate_stream};
      accumulate(buffer, pos, std::move(var));
    } else {
      // (1) non-CUDA variable
      //     Accumulation happens on variable's device
      c10::OptionalDeviceGuard device_guard{device_of(var)};
      accumulate(buffer, pos, std::move(var));
    }
  }
}

auto InputBuffer::variables(InputBuffer&& g) -> std::vector<Variable> {
  std::vector<Variable> result = std::move(g.buffer);
  return result;
}
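
Stripped of the sparse-tensor and CUDA-stream handling, the accumulation rule is simply: the first gradient for a slot is moved in, later gradients arriving for the same slot are added to it. A minimal self-contained sketch (plain doubles instead of tensors, not the real InputBuffer):

#include <iostream>
#include <optional>
#include <vector>

// Stand-in InputBuffer: one optional slot per input position of the consumer node.
struct SimpleInputBuffer {
  explicit SimpleInputBuffer(size_t size) : buffer(size) {}

  void add(size_t pos, double var) {
    auto& old_var = buffer.at(pos);
    if (!old_var.has_value()) {
      old_var = var;            // first gradient for this slot: just store it
    } else {
      *old_var += var;          // later gradients: accumulate (sum)
    }
  }

  std::vector<std::optional<double>> buffer;
};

int main() {
  SimpleInputBuffer buf(2);     // the consumer node has two inputs
  buf.add(0, 1.0);              // gradient arriving along one path
  buf.add(0, 0.5);              // gradient for the same slot from another path
  buf.add(1, 2.0);
  std::cout << *buf.buffer[0] << " " << *buf.buffer[1] << "\n";  // prints: 1.5 2
  return 0;
}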

0x06 ReadyQueue

6.1 Definition

ReadyQueue transports tasks (NodeTask objects) between the main thread and worker threads, and between worker threads themselves. Why pass NodeTasks around? Because a NodeTask contains a derivative function: executing NodeTasks one by one means executing the derivative functions along the backward graph, ultimately emitting the final gradients at the output nodes. A ReadyQueue thus specifies the stream of work its worker thread will do.

It is defined as follows:

struct ReadyQueue {
 private:
  // Returns true when t2 should be (weakly) BEFORE t1 in the queue.
  // Shutdown tasks are first and then empty NodeTask are next.
  struct CompareNodeTaskTime {
    bool operator()(NodeTask const & t1, NodeTask const & t2) {
      // NOLINTNEXTLINE(bugprone-branch-clone)
      if (t2.isShutdownTask_) {
        return true;
      } else if (!t1.fn_ || t1.isShutdownTask_) {
        return false;
      } else if (!t2.fn_) {
        return true;
      } else if (t1.getReentrantDepth() == t2.getReentrantDepth()) {
        return t1.fn_->sequence_nr() < t2.fn_->sequence_nr();
      } else {
        return t1.getReentrantDepth() < t2.getReentrantDepth();
      }
    }
  };

  // To notify threads waiting on the ReadyQueue of available tasks on the heap_
  std::condition_variable not_empty_;
  // To protect read and writes to heap_
  mutable std::mutex mutex_;

  std::priority_queue<NodeTask, std::vector<NodeTask>, CompareNodeTaskTime> heap_;

 public:
  // incrementOutstandingTasks indicates whether or not we should increment
  // 'outstanding_tasks_' for the associated GraphTask. This should mostly
  // always be true and is only set false in certain cases (see docs for
  // DistEngine.execute_graph_task_until_ready_queue_empty)
  void push(NodeTask item, bool incrementOutstandingTasks = true);
  void pushShutdownTask();
  NodeTask pop();
  bool empty() const;
  size_t size() const;
};

The main member functions and member variables of ReadyQueue are:

  • std::condition_variable not_empty_ is used for synchronization between threads.
  • push is the producer action; it calls not_empty_.notify_one() so that one waiting consumer is unblocked.
  • pop is the consumer action; it blocks in not_empty_.wait(lock, [this]{ return !heap_.empty(); }) until something has been produced.
  • std::priority_queue heap_, ordered with CompareNodeTaskTime.
    • Every pop takes the NodeTask that CompareNodeTaskTime ranks highest.
    • CompareNodeTaskTime compares by reentrant depth and sequence_nr: tasks from deeper reentrant calls come first, and among tasks of equal depth the one with the larger sequence_nr (whose node was created later during the forward pass) comes first. The order of consumption is therefore not the same as the order of production, where "production" means pushing a NodeTask into the queue. The code is as follows.
auto ReadyQueue::push(NodeTask item, bool incrementOutstandingTasks) -> void {
  {
    // Lock mutex for writing to heap_
    std::lock_guard<std::mutex> lock(mutex_);
    if (incrementOutstandingTasks) {
      std::shared_ptr<GraphTask> graph_task = item.base_.lock();
      ++graph_task->outstanding_tasks_;
    }
    heap_.push(std::move(item));
  }
  not_empty_.notify_one();
}

auto ReadyQueue::pushShutdownTask() -> void {
  {
    std::lock_guard<std::mutex> lock(mutex_);
    heap_.push(NodeTask({}, nullptr, InputBuffer(0), true));
  }
  not_empty_.notify_one();
}

size_t ReadyQueue::size() const {
  // Lock mutex for accesses to heap_
  std::unique_lock<std::mutex> lock(mutex_);
  return heap_.size();
}

auto ReadyQueue::pop() -> NodeTask {
  // Lock mutex for accesses to heap_
  std::unique_lock<std::mutex> lock(mutex_);
  not_empty_.wait(lock, [this]{ return !heap_.empty(); });
  auto task = std::move(const_cast<NodeTask&>(heap_.top())); heap_.pop();
  return task;
}

bool ReadyQueue::empty() const {
  // Lock mutex for accesses to heap_
  std::unique_lock<std::mutex> lock(mutex_);
  return heap_.empty();
}
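
To see why consumption order can differ from production order, here is a small self-contained demo (plain structs standing in for NodeTask) of a priority queue whose comparator, like CompareNodeTaskTime, pops the task with the greater reentrant depth first and, within a depth, the larger sequence_nr first:

#include <iostream>
#include <queue>
#include <vector>

struct Task { int reentrant_depth; int sequence_nr; };

// Mirrors the shape of CompareNodeTaskTime: returns true when t2 should be popped before t1.
struct CompareTask {
  bool operator()(const Task& t1, const Task& t2) const {
    if (t1.reentrant_depth == t2.reentrant_depth) {
      return t1.sequence_nr < t2.sequence_nr;      // larger sequence_nr pops first
    }
    return t1.reentrant_depth < t2.reentrant_depth; // deeper reentrant call pops first
  }
};

int main() {
  std::priority_queue<Task, std::vector<Task>, CompareTask> heap;
  heap.push({0, 3});   // pushed first
  heap.push({0, 7});   // pushed second, but will be popped first
  heap.push({0, 5});
  while (!heap.empty()) {
    std::cout << heap.top().sequence_nr << " ";  // prints: 7 5 3
    heap.pop();
  }
  std::cout << "\n";
  return 0;
}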

6.2 Number of device queues

In the engine, the number of device worker threads and ReadyQueues is determined by the number of devices: one worker thread is started per device, and a ReadyQueue is created for each of those threads, one to one.

So the engine has the following member variable, a vector that manages these queues in one place.

// Safe to read device_ready_queues_ without synchronization after initialization
std::vector<std::shared_ptr<ReadyQueue>> device_ready_queues_;

The queues are created by the following code:

auto Engine::start_device_threads() -> void {
  // See Note [Allocating GPUs to autograd threads]
  c10::DeviceIndex num_devices = 0;
  // get the number of devices
  for (const auto& impl_atomic : c10::impl::device_guard_impl_registry) {
    auto* impl = impl_atomic.load();
    if (impl) {
      num_devices = std::max(num_devices, impl->deviceCount());
    }
  }

  // determine the number of queues and create them
  // allocate one thread for every GPU device (but colocate GPUs of different
  // types), and pre-allocate the device_ready_queues_ to ensure safe reading on it.
  device_ready_queues_ = std::vector<std::shared_ptr<ReadyQueue>>(num_devices);
  for (auto& queue : device_ready_queues_)    {
    // NOLINTNEXTLINE(modernize-make-shared)
    queue.reset(new ReadyQueue());
  }
    
  // start the worker threads
  thread_pool_shared_ = std::make_shared<ThreadPoolShared>();
  for (int i = 0; i < num_devices; ++i) {
    std::thread t(&Engine::thread_init, this, i, device_ready_queues_[i], true);
    t.detach();
  }
  // Wait for the threads to start
  {
    std::unique_lock<std::mutex> lk(non_reentrant_device_thread_mutex_);
    while(non_reentrant_device_thread_count_.load() != static_cast<uint32_t>(num_devices)) {
      non_reentrant_device_thread_condvar_.wait(lk);
    }
  }
}

Because the queues are managed in a vector, the device index can be used to fetch each device's dedicated ReadyQueue from the vector.

auto Engine::ready_queue_by_index(std::shared_ptr<ReadyQueue> cpu_ready_queue, int device_index) -> std::shared_ptr<ReadyQueue> {
  if (device_index == CPU_DEVICE) {
    // return the cpu ready queue passed in
    TORCH_INTERNAL_ASSERT(cpu_ready_queue);
    return cpu_ready_queue;
  } else {
    // Static cast is ok here as the number of device should never overflow an int.
    TORCH_INTERNAL_ASSERT(0 <= device_index && device_index < static_cast<int>(device_ready_queues_.size()));
    // See Note [Allocating GPUs to autograd threads]
    // NB: This function would become obsolete if we truly allocated a CPU thread
    // per device, rather than colocate.
    return device_ready_queues_.at(device_index);
  }
}

6.3 ReadyQueue from the threads' point of view

Now let's look at ReadyQueue from the perspective of the threads.

6.3.1 Worker threads

Every autograd worker thread is associated with a ready queue, which specifies the stream of work that thread will do. The queue is defined as follows.

// Every autograd worker thread is associated with a ready queue, which specifies
// the stream of work of this thread to do. This shared_ptr is a thread_local
// pointer to each thread's ready_queue, and it should be initialized via the
// Engine::init_local_ready_queue() call in each corresponding thread before execution.
//
// The CUDA, XLA threads are shared among all invocations of backwards via
// device_ready_queues_, while CPU threads are dedicated to processing CPU work for
// the backward they invoked. So any given graph task maintains its own cpu_ready_queue_
// where you should send work for it to be done
//
// For reentrant backward calls, if we spawn new thread from the current thread
// because we reached the maximum depth, the new thread will just reuse the same
// ReadyQueue with the parent thread for performance improvement.
// see Note [Reentrant backwards] for more details.

static thread_local std::shared_ptr<ReadyQueue> local_ready_queue = nullptr;

This shared_ptr is a thread_local pointer to each thread's ready_queue; it should be initialized via the Engine::init_local_ready_queue() call in the corresponding thread before execution.

void Engine::init_local_ready_queue(std::shared_ptr<ReadyQueue> ready_queue) {
  if (ready_queue) {
    // if ready_queue provided in the caller, use the caller's ready_queue to initialize local_ready_queue
    local_ready_queue = std::move(ready_queue);
  } else if (!local_ready_queue){
    // otherwise if local_ready_queue not allocated, allocate a new ready_queue
    local_ready_queue = std::make_shared<ReadyQueue>();
  }
}

For reentrant backward calls, if a new thread is spawned from the current thread because the maximum depth has been reached, the new thread simply reuses the parent thread's ReadyQueue for better performance.

For a device worker thread, the associated ReadyQueue is the corresponding entry in device_ready_queues_; for example, below the thread is started with std::thread t(&Engine::thread_init, this, i, device_ready_queues_[i], true).

auto Engine::start_device_threads() -> void {
  // See Note [Allocating GPUs to autograd threads]
  c10::DeviceIndex num_devices = 0;
  for (const auto& impl_atomic : c10::impl::device_guard_impl_registry) {
    auto* impl = impl_atomic.load();
    if (impl) {
      num_devices = std::max(num_devices, impl->deviceCount());
    }
  }

  // allocate one thread for every GPU device (but colocate GPUs of different
  // types), and pre-allocate the device_ready_queues_ to ensure safe reading on it.
  device_ready_queues_ = std::vector<std::shared_ptr<ReadyQueue>>(num_devices);
  for (auto& queue : device_ready_queues_)    {
    // NOLINTNEXTLINE(modernize-make-shared)
    queue.reset(new ReadyQueue());
  }

  thread_pool_shared_ = std::make_shared<ThreadPoolShared>();

  for (int i = 0; i < num_devices; ++i) {
    std::thread t(&Engine::thread_init, this, i, device_ready_queues_[i], true);
    t.detach();
  }
  // Wait for the threads to start
  {
    std::unique_lock<std::mutex> lk(non_reentrant_device_thread_mutex_);
    while(non_reentrant_device_thread_count_.load() != static_cast<uint32_t>(num_devices)) {
      non_reentrant_device_thread_condvar_.wait(lk);
    }
  }
}

6.3.2 The main thread

The main thread, on the other hand, calls init_local_ready_queue() to initialize its local ready_queue.

Because init_local_ready_queue is called without an argument here, a brand-new queue is allocated.

void Engine::init_local_ready_queue(std::shared_ptr<ReadyQueue> ready_queue) {
  if (ready_queue) {
    // if ready_queue provided in the caller, use the caller's ready_queue to initialize local_ready_queue
    local_ready_queue = std::move(ready_queue);
  } else if (!local_ready_queue){
    // otherwise if local_ready_queue not allocated, allocate a new ready_queue
    local_ready_queue = std::make_shared<ReadyQueue>();
  }
}

This is the CPU queue. Let's compare the CPU queue with the worker threads' queues.

  • The number of device ReadyQueues equals the number of device worker threads: every worker has its own ReadyQueue. The CUDA and XLA threads are shared across all backward invocations through device_ready_queues_.
  • CPU threads, in contrast, are dedicated to the CPU work of the backward call that invoked them. Therefore every graph task maintains its own cpu_ready_queue_, to which the caller should send the work to be done.

The CPU queue is the GraphTask member variable cpu_ready_queue_.

  // CPU threads are dedicated to processing CPU work for the backward they invoked.
  // So any given graph task maintains its own cpu_ready_queue_ where you should send
  // work for it to be done. We memoize the cpu_ready_queue_ per GraphTask so that
  // we know which ready queue we should push to if we are on device thread (i.e. GPU)
  // and but next NodeTask should be run on CPU.
  std::shared_ptr<ReadyQueue> cpu_ready_queue_;

Note that the CPU ready queue is private to each GraphTask, while the CUDA device ready queues are shared across all GraphTasks.

So the number of ready queues in the engine is: number of devices + number of GraphTasks.

Let's complete the earlier diagram by adding the GraphTask and Engine information:

  • 1) The main thread pushes a NodeTask into the CPU ReadyQueue.
  • 2) Worker thread 1 pops the NodeTask from the CPU ReadyQueue and starts executing it.
  • 3) When worker thread 1 is done, it pushes a NodeTask into one of the ReadyQueues in device_ready_queues_.
  • 4) Worker thread 2, which owns that ReadyQueue, pops the NodeTask and starts executing it.
                            +-------------------------+
                            | GraphTask               |
                            |                         |
                            |        cpu_ready_queue_ |
                            |            +            |
                            |            |            |
                            +-------------------------+
                                         |
+--------------+                         |                           +-----------------+
| Main Thread  |                         v                           | Worker Thread 1 |
|              |       1         +-------+---------+       2         |                 |
|              | push(NodeTask)  |                 |  pop(NodeTask)  |                 |
|          +-------------------> | CPU ReadyQueue  +----------------------->           |
|              |                 |                 |                 |                 |
|              |                 +-----------------+                 |                 |
|              |              +----------------------+               |                 |
|              |              | Device ReadyQueues   |               |                 |
|              |              |                      |               |                 |
|              |              |                      |    3          |                 |
|              |              |    +-------------+   | push(NodeTask)|                 |
|              |              |    | ReadyQueue 1| <-----------------------+           |
|              |              |    +------+------+   |               |                 |
|              |              |           |          |               |                 |
+--------------+              |           |          |               +-----------------+
                              |           +------------------+
                              |                      |       |       +-----------------+
+------------------------+    |          .           |       |       | Worker Thread 2 |
| Engine                 |    |          .           |       |       |                 |
|                        |    |          .           |       |       |                 |
|                        |    |                      |       |       |                 |
|  device_ready_queues_ +---> |    +-------------+   |       +------------->           |
|                        |    |    | ReadyQueue 2|   | pop(NodeTask) |                 |
|                        |    |    +-------------+   |     4         |                 |
+------------------------+    |                      |               |                 |
                              |                      |               |                 |
                              |    +-------------+   |               |                 |
                              |    | ReadyQueue 3|   |               |                 |
                              |    +-------------+   |               |                 |
                              |                      |               |                 |
                              +----------------------+               +-----------------+

This concludes the introduction to the static structure and the basic classes; the next post covers the dynamic logic.

