[Source code analysis] How PyTorch implements forward propagation (2) --- Basic classes (part 2)

Posted by 羅西的思考 on 2021-10-20

0x00 Abstract

This series uses roughly ten articles to analyze how PyTorch's automatic differentiation is implemented. This is the second article on forward propagation; it introduces some of the PyTorch base classes involved in automatic differentiation (gradient computation). Because the material is long (about twelve thousand characters in the original), it has been split into two parts.

Links to the previous articles in this series:

Automatic Differentiation, a Deep Learning Tool (1)

Automatic Differentiation, a Deep Learning Tool (2)

Automatic Differentiation, a Deep Learning Tool (3) --- A Worked Example

[Source code analysis] How PyTorch implements forward propagation (1) --- Basic classes (part 1)

0x01 Recap of the previous article

The previous article introduced some of the base classes, such as Variable, Function and Tensor. In this article we continue analyzing the remaining base classes. For completeness, the overall logical relationship extracted from the previous article is reproduced below; SubBackward0, PowBackward0 and MulBackward0 are all subclasses of Node. We will refine this diagram over the course of this article.

+---------------------+              +----------------------+
| SubBackward0        |              | PowBackward0         |
|                     |      Edge    |                      |  Edge
|   next_functions  +-----+--------> |     next_functions +----------> ...
|                     |   |          |                      |
+---------------------+   |          +----------------------+
                          |
                          |
                          |          +----------------------+
                          |  Edge    | MulBackward0         |
                          +--------> |                      |  Edge
                                     |     next_functions +----------> ...
                                     |                      |
                                     +----------------------+

0x02 TensorImpl

2.1 Delegation

PyTorch makes heavy use of the bridge design pattern: at::Tensor uses the bridge pattern to hand its concrete implementation over to TensorImpl.

class TORCH_API Tensor {
 private:
  struct unsafe_borrow_t { explicit unsafe_borrow_t() = default; };

  explicit Tensor(unsafe_borrow_t, const Tensor& rhs)
      : impl_(c10::intrusive_ptr<at::TensorImpl, UndefinedTensorImpl>::reclaim(rhs.impl_.get())) {}
    
  friend MaybeOwnedTraits<Tensor>;
  protected:
  friend class ::caffe2::Tensor;

  void enforce_invariants();
  c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl> impl_; // the implementation is delegated to TensorImpl
};

Concretely:

+------------------------------------------------+          +---------------------------+
|Tensor                                          |          |TensorImpl                 |
|                                                |          |                           |
|                                                |  bridge  |                           |
|      <TensorImpl, UndefinedTensorImpl> impl_+-----------> |       autograd_meta_      |
|                                                |          |                           |
|                                                |          |       named_tensor_meta_  |
+------------------------------------------------+          |                           |
                                                            |       pyobj_              |
                                                            |                           |
                                                            |       sizes_and_strides_  |
                                                            |                           |
                                                            |       storage_offset_     |
                                                            |                           |
                                                            |       data_type_          |
                                                            |                           |
                                                            |       device_opt_         |
                                                            |                           |
                                                            |                           |
                                                            +---------------------------+
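
These TensorImpl fields surface in Python through the ordinary tensor accessors. A minimal sketch (the values in the comments are illustrative):

import torch

t = torch.randn(2, 3)

# Each accessor below reads a field stored on the underlying TensorImpl.
print(t.dtype)               # data_type_          -> torch.float32
print(t.device)              # device_opt_         -> cpu
print(t.size(), t.stride())  # sizes_and_strides_  -> torch.Size([2, 3]) (3, 1)
print(t.storage_offset())    # storage_offset_     -> 0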

2.2 Definition

TensorImpl is defined below. Since this article is about automatic differentiation and forward propagation, we focus on the member relevant to that functionality, namely autograd_meta_. Besides autograd_meta_, the members are mostly metadata describing the tensor, such as the element type (dtype), the device the tensor lives on, its strides, and so on.

struct C10_API TensorImpl : public c10::intrusive_ptr_target {
  Storage storage_;

 private:
  // This pointer points to an AutogradMeta struct that stores autograd-specific
  // fields (such as grad_ / grad_fn_ / grad_accumulator_). This pointer always
  // has unique ownership (meaning only one TensorImpl can own it at a time).
  //
  // autograd_meta_ can be nullptr, as an optimization.  When this occurs, it is
  // equivalent to having an autograd_meta_ pointing to a default constructed
  // AutogradMeta; intuitively, tensors which don't require grad will have this
  // field set to null.
  //
  // This means accessors on autograd_meta_ have to be careful to test if they
  // got a nullptr, and handle default behavior appropriately in that case.
  //
  // Note that we don't enforce the invariant that if the AutogradMeta is
  // default constructed, it is nullptr (to do this, we'd have to continuously
  // check if an AutogradMeta became, by mutation, equal to the default
  // constructed form.  (This might be useful, but it seems rare enough that
  // a requires_grad=True variable will turn back into the requires_grad=False
  // version.)  So there are three representable states:
  //
  //    1. autograd_meta_ == nullptr
  //    2. autograd_meta_ is default constructed (semantically, same as (1))
  //    3. autograd_meta_ has nontrivial information content
  //
  std::unique_ptr<c10::AutogradMetaInterface> autograd_meta_ = nullptr; // the member we focus on here

 protected:
  std::unique_ptr<c10::NamedTensorMetaInterface> named_tensor_meta_ = nullptr;
  c10::VariableVersion version_counter_;
  PyObject* pyobj_ = nullptr;
  c10::impl::SizesAndStrides sizes_and_strides_;
  int64_t storage_offset_ = 0;
  int64_t numel_ = 1;
  caffe2::TypeMeta data_type_;
  c10::optional<c10::Device> device_opt_;
  bool is_contiguous_ : 1;
  /* HasContiguityPolicy */ uint8_t has_contiguity_ : 2;
  bool storage_access_should_throw_ = false;
  bool is_channels_last_ : 1;
  bool is_channels_last_contiguous_ : 1;
  bool is_channels_last_3d_ : 1;
  bool is_channels_last_3d_contiguous_ : 1;
  bool is_non_overlapping_and_dense_ : 1;
  bool is_wrapped_number_ : 1;
  bool allow_tensor_metadata_change_ : 1;
  bool reserved_ : 1;
  DispatchKeySet key_set_;
};

For automatic differentiation, the key member is std::unique_ptr<c10::AutogradMetaInterface> autograd_meta_ = nullptr;.

This member stores the autograd-specific fields, such as grad_ / grad_fn_ / grad_accumulator_. It always has unique ownership: each TensorImpl owns at most one AutogradMeta at any given time.

autograd_meta_ is what distinguishes a plain tensor from a Variable with autograd capability:

  • For tensors that do not require gradients, autograd_meta_ is null.
  • As an optimization, a null autograd_meta_ is semantically equivalent to a default-constructed AutogradMeta, so there are three representable states (null, default constructed, non-trivial content), and accessors must check for null and fall back to default behavior in that case.
  • When gradients are required, autograd_meta_ is normally initialized to an AutogradMeta or a DifferentiableViewMeta. A small Python sketch of the user-visible difference follows this list.
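
From the Python side, this difference is visible through requires_grad and grad_fn. A minimal sketch:

import torch

x = torch.randn(3)                      # plain tensor: autograd_meta_ is effectively null
print(x.requires_grad, x.grad_fn)       # False None

w = torch.randn(3, requires_grad=True)  # leaf that requires grad: AutogradMeta is populated
y = (w * 2).sum()                       # intermediate result: grad_fn records its producer
print(w.requires_grad, w.grad_fn)       # True None  (leaves have no grad_fn)
print(y.requires_grad, y.grad_fn)       # True <SumBackward0 ...>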

AutogradMetaInterface is defined below. It is an abstract interface whose concrete functionality is implemented by derived classes.

struct C10_API AutogradMetaInterface {
  virtual void set_requires_grad(
      bool requires_grad,
      at::TensorImpl* self_impl) = 0;
  virtual bool requires_grad() const = 0;
  virtual at::Tensor& mutable_grad() = 0;
  virtual const at::Tensor& grad() const = 0;
  virtual const at::Tensor& fw_grad(uint64_t level, const at::Tensor& self)
      const = 0;
  virtual void set_fw_grad(
      const at::Tensor& new_grad,
      const at::Tensor& self,
      uint64_t level,
      bool is_inplace_op) = 0;
  virtual ~AutogradMetaInterface();
};
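
fw_grad and set_fw_grad back the forward-mode AD feature. Assuming a PyTorch build where the (beta) torch.autograd.forward_ad API is available, a minimal sketch of what these hooks support:

import torch
import torch.autograd.forward_ad as fwAD  # beta API; availability depends on the PyTorch version

primal = torch.tensor([1.0, 2.0])
tangent = torch.ones(2)

with fwAD.dual_level():                     # open a forward-AD level
    dual = fwAD.make_dual(primal, tangent)  # roughly corresponds to set_fw_grad
    out = dual * 3
    _, jvp = fwAD.unpack_dual(out)          # roughly corresponds to fw_grad
    print(jvp)                              # tensor([3., 3.])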

0x03 Autograd-related classes

The following classes are related to automatic differentiation.

3.1 AutogradMeta

AutogradMeta derives from AutogradMetaInterface and stores the autograd-related state of a variable, such as its gradient value and its gradient-computing function. Its definition is as follows:

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                            AutogradMeta
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

/// Each `Variable` has one unique `AutogradMeta` struct, which stores autograd
/// metadata fields that are necessary for tracking the Variable's autograd history.
/// As an optimization, a Variable may store a nullptr, in lieu of a default
/// constructed AutogradMeta.

/// 1. A `grad_fn`, if the variable is in the interior of the graph. This is the
///    gradient of the function that produced the variable.
/// 2. A `grad_accumulator`, if the variable is a leaf, which accumulates a
///    scalar gradient value into its `grad` variable.

struct TORCH_API AutogradMeta : public c10::AutogradMetaInterface {
  std::string name_;

  Variable grad_; // the gradient of this Variable; itself a Variable
  std::shared_ptr<Node> grad_fn_; // only meaningful for non-leaf variables: the node that computes this variable's gradient. PyTorch decides whether a Variable is a leaf by checking whether grad_fn_ is null; accessed via the grad_fn() method.
  std::weak_ptr<Node> grad_accumulator_; // a Node that only leaf variables have; leaves accumulate incoming gradients, grad_accumulator_ is the accumulation function, and the result is stored in grad_

  // This field is used to store all the forward AD gradients
  // associated with this AutogradMeta (and the Tensor it corresponds to)
  // There is a semantic 1:1 correspondence between AutogradMeta and
  // ForwardGrad but:
  //   - This field is lazily populated.
  //   - This field is a shared_ptr but it must never be
  //     shared by multiple Tensors. See Note [ Using ForwardGrad ]
  // Any transition from not_initialized to initialized
  // must be protected by mutex_
  std::shared_ptr<ForwardGrad> fw_grad_; // forward AD gradients

  std::vector<std::shared_ptr<FunctionPreHook>> hooks_;
  std::shared_ptr<hooks_list> cpp_hooks_list_;

  // Only meaningful on leaf variables (must be false otherwise)
  bool requires_grad_; // whether this Variable requires grad

  // Only meaningful on non-leaf variables (must be false otherwise)
  bool retains_grad_; // only meaningful for non-leaf variables: whether this non-leaf keeps its .grad (set via retain_grad())

  bool is_view_; // whether this Variable is a view (it has no storage of its own and is based on a base Variable)

  // The "output number" of this variable; e.g., if this variable
  // was the second output of a function, then output_nr == 1.
  // We use this to make sure we can setup the backwards trace
  // correctly when this variable is passed to another function.
  uint32_t output_nr_; // records which output of its producing function this Variable is; e.g. 0 means it is the function's first output

  // Mutex to ensure that concurrent read operations that modify internal
  // state are still thread-safe. Used by grad_fn(), grad_accumulator(),
  // fw_grad() and set_fw_grad()
  // This is mutable because we need to be able to acquire this from const
  // version of this class for the functions above
  mutable std::mutex mutex_;
};

The main member variables of AutogradMeta are:

  • grad_: stores the gradient of this Variable instance; itself a Variable.
  • grad_fn_: a Node instance that only non-leaf variables have, accessed through the grad_fn() method. In fact, PyTorch decides whether a Variable is a leaf variable by checking whether grad_fn is null.
  • grad_accumulator_: also a Node instance, present only on leaf variables.
    • Accessed through the Variable's grad_accumulator().
    • Leaf variables accumulate gradients; grad_accumulator_ is the accumulation function.
    • The accumulated gradient is stored in the grad_ member.
  • requires_grad_: whether this Variable instance requires a gradient.
  • retains_grad_: only meaningful for non-leaf variables; whether this non-leaf keeps its .grad after backward (set by calling retain_grad()).
  • is_view_: a flag indicating that this Variable instance is a view (it has no storage of its own and is based on a base variable).
  • version_counter_: the version number (this one is stored on TensorImpl, as shown earlier).
  • output_nr_: a number recording which output of its Node this Variable is; for example 0 means this Variable is the Node's first output.
  • base_: the base variable of a view (in the code shown below it lives inside the ViewInfo held by DifferentiableViewMeta).

Concretely:

+----------------------------------------------+          +-------------------------+
|Tensor                                        |          |TensorImpl               |
|                                              |          |                         |
|                                              |  bridge  |                         |
|   <TensorImpl, UndefinedTensorImpl> impl_ +-----------> |    autograd_meta_ +---------+
|                                              |          |                         |   |
|                                              |          |    named_tensor_meta_   |   |
+----------------------------------------------+          |                         |   |
                                                          |    pyobj_               |   |
                                                          |                         |   |
                                                          |    sizes_and_strides_   |   |
                                                          |                         |   |
                                                          |    storage_offset_      |   |
                                                          |                         |   |
                                                          |    data_type_           |   |
                                                          |                         |   |
                                                          |    device_opt_          |   |
                                                          |                         |   |
                                                          |                         |   |
                                                          +-------------------------+   |
                                                                                        |
                                                          +-------------------------+   |
                                                          | AutogradMeta            |   |
                                                          |                         +<--+
                                                          |                         |
                                                          |      grad_fn_           |
                                                          |                         |
                                                          |      grad_accumulator_  |
                                                          |                         |
                                                          |      hooks_             |
                                                          |                         |
                                                          |      retains_grad_      |
                                                          |                         |
                                                          |      output_nr_         |
                                                          |                         |
                                                          |      fw_grad_           |
                                                          |                         |
                                                          +-------------------------+
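
These fields map onto attributes visible from Python. A minimal sketch:

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf: grad_fn_ is null, grad_accumulator_ fills grad_
mid = x * x                                        # non-leaf: grad_fn_ is MulBackward0
mid.retain_grad()                                  # sets retains_grad_ so this non-leaf keeps its .grad
out = mid.sum()

print(x.is_leaf, x.grad_fn)      # True None
print(mid.is_leaf, mid.grad_fn)  # False <MulBackward0 ...>

out.backward()
print(x.grad)                    # grad_ of the leaf: tensor([2., 4.])
print(mid.grad)                  # retained grad of the non-leaf: tensor([1., 1.])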

3.2 DifferentiableViewMeta

For many operations, the returned variable shares storage with the input variable; such a returned variable is called a view of the base variable. PyTorch has two kinds of views: differentiable views and non-differentiable views. To support proper version checking, the base and the view must share the same version counter (version_counter), whichever kind it is.

DifferentiableViewMeta is the class that handles differentiable views.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                     DifferentiableViewMeta
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/// DifferentiableViewMeta is created to support gradient tracking of
/// such **in-place** operations. In particular,
///   + if an in-place op is done on base, the grad_fn field of the view may
///     become stale. So accesses should always go through grad_fn(), which
///     reconstructs an updated grad_fn if the version_counter has incremented.
///     All other fields are always valid.
///   + if an in-place op is done on view, in rebase_history() of view, which is
///     called after every in-place op in VariableType.cpp, the grad_fn of base
///     is updated.
///   + if a single autograd Node returns multiple differentiable views, if any
///     output is modified by an inplace operation, the autograd engine will make
///     an equivalent graph (corresponding to the view operations) without using
///     equivalent graph, where each output is treated as if it were produced by a
///     distinct view operation. This discards the original (e.g., user provided)
///     grad_fn. If the provided grad_fn does more than the backward of the view,
///     then the DifferentiableViewMeta must be created with creation_meta=
///     CreationMeta::MULTI_OUTPUT_NODE to prevent the engine from ignoring the
///     provided grad_fn.

enum class CreationMeta: uint8_t { DEFAULT, IN_CUSTOM_FUNCTION, MULTI_OUTPUT_NODE,
                                   NO_GRAD_MODE, MULTI_OUTPUT_SAFE, INFERENCE_MODE};

struct TORCH_API DifferentiableViewMeta : public AutogradMeta {
private:
  /// Informations about the views
  c10::optional<ViewInfo> backward_info_;
  c10::optional<ViewInfo> forward_info_;

  // Optimization to reduce the number of ViewInfo we create.
  // In the (very common) case where backward_info_ == forward_info_, we only
  // populate backward_info_ (that should be used as both the forward and backward
  // view information) and set shared_view_info_ = true.
  // Invariants:
  //   - If shared_view_info_ is false, there is no special constraints on
  //     backward_info_ and forward_info_
  //   - If shared_view_info_ is true, we must have:
  //      - backward_info_.has_value() == true
  //      - forward_info_.has_value() == false
  bool shared_view_info_;

  /// The two following fields are extra information that we track to ensure that
  /// any operation on this backward view is valid.

  /// The value of the version_counter at the time grad_fn was created. The
  /// grad_fn field is stale if attr_version_ != version_counter.current_version().
  uint32_t attr_version_;
  CreationMeta creation_meta_;
};
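
The shared version counter and the lazily refreshed grad_fn can be observed from Python; ._base and ._version are internal attributes, used here only for illustration. A minimal sketch:

import torch

x = torch.randn(4, requires_grad=True)
base = x * 2               # a non-leaf tensor
view = base[:2]            # a differentiable view of base

print(view._base is base)            # True: the view records its base
print(base._version, view._version)  # 0 0: base and view share one version counter

base.add_(1.0)                       # an in-place op on base bumps the shared counter
print(base._version, view._version)  # 1 1

# The view's stored grad_fn is now stale; accessing it rebuilds an up-to-date node
# (typically an AsStridedBackward0 in recent versions).
print(view.grad_fn)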

3.3 AutogradContext

AutogradContext is the context object for custom autograd operations. It stores information produced during the forward pass so that it can be accessed during the backward pass.

/// Context to save information during `forward` that can be accessed in `backward`
/// in custom autograd operations (see `torch::autograd::Function` for details).
struct TORCH_API AutogradContext {
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init)
  AutogradContext() : materialize_grads_(true) {}
  AutogradContext(const AutogradContext &other) = delete;
  AutogradContext& operator=(const AutogradContext& other) = delete;

  /// Can be used to save non-variable data for `backward`.
  // NOLINTNEXTLINE(cppcoreguidelines-non-private-member-variables-in-classes)
  ska::flat_hash_map<std::string, at::IValue> saved_data;

  /// Saves the list of variables for a future call to `backward`. This
  /// should be called at most once from inside of `forward`.
  void save_for_backward(variable_list to_save);
  /// Marks variables in the list as modified in an in-place operation. This
  /// should be called at most once from inside of `forward` and all arguments
  /// should be inputs.
  void mark_dirty(const variable_list &inputs);
  /// Marks outputs in the list as not requiring gradients. This should be called
  /// at most once from inside of `forward` and all arguments should be outputs.
  void mark_non_differentiable(const variable_list &outputs);
  // Sets whether undefined output grad tensors should be expanded to tensors
  // full of zeros before calling backward function. Default value is true.
  void set_materialize_grads(bool value);

  /// Get the list of variables that were saved in `forward` using
  /// `save_for_backward()`. Before returning them to the user, a check is made to
  /// ensure that they were not modified by any in-place operations.
  variable_list get_saved_variables() const;
  const std::unordered_set<at::TensorImpl*>& get_and_bump_dirty() const;
  const std::unordered_set<at::TensorImpl*>& get_non_differentiable() const;

private:
  std::unordered_set<at::TensorImpl*> non_differentiable_;
  std::unordered_set<at::TensorImpl*> dirty_inputs_;
  std::vector<torch::autograd::SavedVariable> saved_variables_;
  variable_list to_save_;
  bool materialize_grads_;

  // The CppNode in the autograd graph that owns this AutogradContext. We need a
  // weak_ptr to avoid a refcycle. Since grad_fn_ owns this AutogradContext, it
  // will always be alive when we want to use it.
  std::weak_ptr<Node> grad_fn_;
  bool has_freed_buffers_;

  void save_variables();

  template <class T> friend struct CppNode;
};

For users, AutogradContext mainly shows up when defining custom autograd Functions. The following example comes from the source comments.

/// ```
/// class MyFunction : public Function<MyFunction> {
///   public:
///   static variable_list forward(AutogradContext *ctx, int n, Variable var) {
///      // Save data for backward in context
///      ctx->saved_data["n"] = n;
///      var.mul_(2);
///      // Mark var as modified by inplace operation
///      ctx->mark_dirty({var});
///      return {var};
///   }
///
///   static variable_list backward(AutogradContext *ctx, variable_list
///   grad_output) {
///      // Use data saved in forward
///      auto n = ctx->saved_data["n"].toInt();
///      return {grad_output[0]*n};
///   }
/// };
/// ```
///
/// To use `MyFunction`:
/// ```
/// Variable x;
/// auto y = MyFunction::apply(6, x);
/// // Example backward call
/// y[0].sum().backward();

This brings us to the autograd Function.

3.4 Auto Function

Autograd uses Function to compute results and gradients and to encode the operation history. Every operation performed on a Tensor creates a new Function object, which performs the computation and records what happened. The history is retained in the form of a DAG of functions, with edges denoting data dependencies (input <- output).

Usually, the only way users interact with Function is by subclassing it to define a new operation; this is the recommended way to extend torch.autograd. For more details on how to use this class, see the notes on extending the autograd engine: https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd

To define a custom autograd operation, implement a Function subclass with static forward and backward functions.

  • forward can take as many arguments as you want and should return either a list of variables or a single variable.
    • Use of any Variable arguments will be registered in the graph, but vectors/sets or other data structures will not be traversed.
    • You can use c10::optional<Tensor> as one of the arguments, and it will be registered as a variable in the graph if the argument has a value.
    • forward should take a pointer to torch::autograd::AutogradContext as its first argument. Variables can be saved in the context with ctx->save_for_backward, while other data can be saved in the ctx->saved_data map in the form of <std::string, at::IValue> pairs.
  • backward should take a pointer to torch::autograd::AutogradContext and a list of variables as arguments.
    • That list contains as many variables as there were outputs of forward.
    • backward should return as many variables as there were inputs, with each of them containing the gradient with respect to its corresponding input.
    • Variables saved in forward can be accessed through ctx->get_saved_variables, and other saved data through ctx->saved_data.
    • When backward is invoked, the computation graph is processed in topological order, by calling each Function object's backward and passing the returned gradients on to the next Function.

Here is a concrete example of a Function subclass (this one in Python):

import torch
from torch.autograd import Function

class Exp(Function):

    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# Use it by calling the apply method:
input = torch.randn(3, requires_grad=True)
output = Exp.apply(input)
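
A quick way to check that a custom Function's backward is consistent with its forward is torch.autograd.gradcheck, which compares the analytical gradients against numerically computed ones (it requires double-precision inputs). Reusing the Exp class defined above:

inp = torch.randn(4, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(Exp.apply, (inp,)))  # True if backward matches the numerical gradients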

As mentioned earlier, Function has been replaced by Node, so we now move on to Node.

0x04 Node

In early versions the class was named Function; it was later renamed Node, presumably to correspond better to the notion of a graph node.

Node is an abstract class that represents an operation taking zero or more Variables as input and producing zero or more Variables as output. The input nodes of a Node in the forward graph become that Node's output nodes in the backward graph. In PyTorch's autograd machinery, all functions derive from this class and override its apply method; instances of the subclasses can then be invoked through the call operator.

When the autograd system is viewed as a computation graph, Nodes are the vertices, connected to one another by (directed) Edges, each of which is represented by a (Node, input_nr) pair. Variables are the inputs and outputs of Nodes and move along these edges during graph execution. When two or more Edges (from different sources) point to the same input of a Node, the values produced along all of those edges are implicitly summed before being forwarded to the target Node.

Subclasses are usually used to represent differentiable functions and their gradient operators. Note, however, that a Node is defined very broadly: it accepts zero or more inputs and produces zero or more outputs, and its uses are very flexible, going beyond pure mathematical operations. For example, the AccumulateGrad function is a sink: it takes one input but produces no output, accumulating the input as a side effect. At the other end, the GraphRoot function receives no inputs from other functions and produces multiple outputs. See the comments in torch/csrc/autograd/function.h for details.

4.1 Definition

Let us look at the definition of the Node class. For clarity, only the member variables (and operator()) are kept here; the other member functions are removed.

using edge_list = std::vector<Edge>;

struct TORCH_API Node : std::enable_shared_from_this<Node> {

 protected:
  /// Performs the `Node`'s actual operation.
  virtual variable_list apply(variable_list&& inputs) = 0;

  /// Calls `apply()`, but instruments it with tracing machinery.
  variable_list traced_apply(variable_list inputs);

  /// NOTE [ Sequence Number]
  ///
  /// The sequence_nr has two main usages in autograd:
  ///
  /// 1) Helps determine the node's execution priority in the engine.
  ///    All else being equal, nodes with higher priority numbers are executed first.
  ///    Thus, nodes corresponding to ops executed later are the first to be executed in
  ///    the backward pass. One caveat is that we prioritize AccumulateGrad nodes by
  ///    explicitly setting its sequence_nr to be UINT64_MAX.
  /// 2) The sequence number of this `Node` is paired with with thread_id it was created in
  ///    as a unique identifier by the profiler to annotate recorded events.
  ///    The purpose of this is to help users (and possibly programs) interpreting the profiler's
  ///    output to correlate backward nodes with its forward ops.
  ///    We need both sequence_nr and thread_id to identify a node because sequence_nr is
  ///    thread_local, i.e., starts counting up from zero in a new thread    
    
  // Sequence number used to correlate backward nodes with forward ops in the
  // profiler and provide determinisim in the engine.
  const uint64_t sequence_nr_;


  // NOTE [ Topological Number ]
  //
  // topological_nr is used to prune branches in the DAG during autograd discovery as
  // maintaining topological_nr helps us check in O(1) if there does NOT exist
  // a directed path between two nodes.
  //
  // The topological order number of this `Node` representing the length of the
  // longest possible path from this Node to any leaf node. If you are leaf node,
  // aka AccumulateGrad, this will be zero. This value has the property that
  // For every pair of nodes X, Y in G, existence of a directed path from X to Y
  // implies topo_nr(X) > topo_nr(Y). The converse is not true, however, so we
  // cannot prove existence of a path from X to Y, only non-existence.
  //
  // One assumption we make when using topo_nr is that once a node
  // has been used, i.e., has a parent node, its own topo_nr does not change
  // we have added some checks with the `has_parent_` field to enforce this.
  //
  // What NOT to do:
  //
  //   1) 2 -> 1 -> 0               In this diagram we label nodes with their topo_nr.
  //      2 -> 1 -> 0               We have two simple graphs that can each arise from
  //                                `t.exp().exp()`, for example.
  //   2)        2 -> 1 -> 0
  //            /
  //      2 -> 1 -> 0               We add 2 as a next edge to 1 even though 1 already
  //                                has a parent.
  //   3)        2 -> 1 -> 0
  //            /
  //      2 -> 3 -> 0               2 < 3, yet there exists a path from 2 to 3!
  //
  uint64_t topological_nr_ = 0;

  // Tracks whether this node has been added as the next_edge of another node
  // via set_next_edge(s), which always calls topological_nr() of all its children
  // See NOTE [ Topological Number ] for why we need this.
  mutable bool has_parent_ = false;

  // Id of the thread that created the instance
  uint64_t thread_id_ = 0;

  std::mutex mutex_;

  // the edges associated with this operator's input variables in the forward pass
  edge_list next_edges_;
  PyObject* pyobj_ = nullptr; // weak reference
  std::unique_ptr<AnomalyMetadata> anomaly_metadata_ = nullptr;
  std::vector<std::unique_ptr<FunctionPreHook>> pre_hooks_;
  std::vector<std::unique_ptr<FunctionPostHook>> post_hooks_;
  at::SmallVector<InputMetadata, 2> input_metadata_;
    
  // operator() is overloaded here; at its core it simply calls apply()
  variable_list operator()(variable_list&& inputs) {
    // In the first iteration of named tensors, autograd ignores names and
    // operates on unnamed tensors. In the long term, autograd should
    // probably operate with names.
    at::NoNamesGuard no_names_guard;

    bool pre_sampled = false;
    if (at::shouldRunRecordFunction(&pre_sampled)) {
      // Using RecordFunction to trigger observers in the backward pass
      at::RecordFunction guard(at::RecordScope::BACKWARD_FUNCTION, pre_sampled);
      if (guard.isActive()) {
        // Using sequence number and thread id to correlate with
        // the forward pass function
        guard.setForwardThreadId(thread_id_);
        if (guard.needsInputs()) {
          guard.before(
            name(),
            std::vector<c10::IValue>(inputs.begin(), inputs.end()),
            sequence_nr());
        } else {
          guard.before(name(), sequence_nr());
        }
      }
      // keeping stack guard object alive during the call
      return apply(std::move(inputs));
    } else {
      return apply(std::move(inputs));
    }
  }    
};

Its constructor is:

  explicit Node(
      uint64_t sequence_nr,
      edge_list&& next_edges = edge_list())
      : sequence_nr_(sequence_nr),
      next_edges_(std::move(next_edges)) {

    for (const Edge& edge: next_edges_) {
      update_topological_nr(edge);
    }

    if (AnomalyMode::is_enabled()) {
      metadata()->store_stack();

      // If anomaly mode is enabled and graph is constructed, then assign the
      // currently evaluating node as the parent of this node.
      // A parent is a Node where this Node is created.
      // We are tracking the parents to track multiple backward operations.
      assign_parent();
    }

    // Store the thread_id of the forward operator.
    // See NOTE [ Sequence Numbers ]
    thread_id_ = at::RecordFunction::currentThreadId();
  }

4.2 Important member variables

Let us walk through some of the important member variables.

4.2.1 input_metadata_

input_metadata_ holds the metadata of the input data; it defines a Function's input parameters.

4.2.2 next_edges_

These are the edges associated with this operator in the forward pass.

If we regard PyTorch's autograd system as a graph, each Node instance is a graph node, and Node instances are connected to one another by Edges. Edge is a struct that represents an edge of the graph as a (Function, input_nr) pair. A Node's next_edges_ member is precisely a set of such Edge instances; they identify the (other) Nodes that this Node's outputs should be sent to. In other words, next_edges_ are the links between Nodes.

The inputs and outputs of a Node are Variable instances, so when a graph is executed, Variables flow along these edges. When two or more Edges point to the same input of a Node (that input's in-degree is greater than one), the outputs along all of those edges are implicitly summed before being sent to the target Node.

You can add an edge to a Node with add_next_edge(), fetch a particular edge with next_edge(index), and obtain an iterator over the edges with next_edges(). A small Python sketch of how these edges look from the user side follows.
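
From Python, next_edges_ shows up as grad_fn.next_functions, a tuple of (Node, input_nr) pairs. When the same leaf is used twice, both edges target its single AccumulateGrad node and the gradients flowing along them are summed. A minimal sketch:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * x                          # both inputs of MulBackward0 are the same leaf

print(y.grad_fn)                   # <MulBackward0 ...>
print(y.grad_fn.next_functions)    # ((<AccumulateGrad ...>, 0), (<AccumulateGrad ...>, 0))

y.backward()
print(x.grad)                      # contributions from both edges are summed: 2*x = tensor(4.)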

4.2.3 sequence_nr_

This variable is used to correlate backward nodes with forward operations and to provide determinism in the engine. sequence_nr_ grows monotonically as Function instances are constructed, and it has two uses:

  • It helps determine a node's execution priority in the engine. All else being equal, nodes with higher priority are executed first, so the operations executed later in the forward pass are the first to be executed in the backward pass. One caveat is that AccumulateGrad nodes are prioritized by explicitly setting their sequence_nr to UINT64_MAX. In PyTorch's backward graph, AccumulateGrad is the type that represents leaf nodes, i.e. the terminal nodes of the computation graph; the AccumulateGrad class has a .variable attribute that points to the leaf variable.

  • The sequence_nr_ of a Node, paired with the thread_id it was created on, serves as a unique identifier that the profiler uses to annotate recorded events. The goal is to help users (and possibly programs) interpreting the profiler's output correlate backward nodes with their forward operations. Both values are needed because sequence_nr is thread_local, i.e. it starts counting from zero in each new thread.

4.2.4 topological_nr_

This variable is the topological order number of the Node, representing the length of the longest possible path from this node to any leaf node. If the node is itself a leaf, i.e. an AccumulateGrad, topological_nr_ is zero.

topological_nr_ is used to prune branches of the DAG during autograd discovery: maintaining it lets us check in O(1) that there does NOT exist a directed path between two nodes.

topological_nr_ has the following properties:

  • For every pair of nodes X, Y in G, the existence of a directed path from X to Y implies topo_nr(X) > topo_nr(Y). The converse does not hold, however, so we can only prove the non-existence of a path from X to Y, never its existence.
  • One assumption made when using topological_nr_ is that once a node has been used, i.e. it has a parent node, its own topological_nr_ no longer changes; some checks based on the has_parent_ field enforce this.

4.2.5 operator()

variable_list operator()(variable_list&& inputs) is Node's main method. It receives multiple Variable instances wrapped in a vector, outputs multiple Variable instances wrapped in a vector, and then calls apply(), the actual business logic. Through C++ polymorphism, a call to operator() is turned into a call to the subclass's own apply() method.

All functions used for backward computation in PyTorch inherit from this class (formerly named Function) and override its pure virtual apply() function.
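
The objects exposed to Python as grad_fn wrap these Nodes and are themselves callable: invoking one goes through operator(), which dispatches to the subclass's apply(). A small sketch (the exact return shape, single tensor or tuple, depends on the node):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2                         # grad_fn is a PowBackward0 node

# Feed in a gradient for y; apply() computes the gradient w.r.t. x, i.e. 2*x = 6
print(y.grad_fn(torch.tensor(1.0)))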

0x05 Edge

As the name suggests, Edge is an edge of the computation graph. Its main members are:

  • std::shared_ptr<Node> function: the target Node this edge points to.
  • uint32_t input_nr: specifies which input of function this Edge corresponds to.

using tensor_list = std::vector<at::Tensor>;
using variable_list = std::vector<Variable>;
using edge_list = std::vector<Edge>;
using saved_variable_list = std::vector<SavedVariable>;
using IndexRange = std::pair<size_t, size_t>;

/// Represents a particular input of a function.
struct Edge {
  Edge() noexcept : function(nullptr), input_nr(0) {}

  Edge(std::shared_ptr<Node> function_, uint32_t input_nr_) noexcept
      : function(std::move(function_)), input_nr(input_nr_) {}

  /// Convenience method to test if an edge is valid.
  bool is_valid() const noexcept {
    return function != nullptr;
  }

  // Required for use in associative containers.
  bool operator==(const Edge& other) const noexcept {
    return this->function == other.function && this->input_nr == other.input_nr;
  }

  bool operator!=(const Edge& other) const noexcept {
    return !(*this == other);
  }

  /// The function this `Edge` points to.
  std::shared_ptr<Node> function; // the Node this edge points to

  /// The identifier of a particular input to the function.
  uint32_t input_nr; // specifies which input of function this Edge corresponds to
};
}} // namespace torch::autograd

0x06 Logical diagram

We now refine the diagram from the beginning of this article as follows; the upper half is the Python world, the lower half the C++ world:

+--------------------------------------------+         +------------------------------+
| SubBackward0                               |         | PowBackward0                 |
|                                            |         |                              |  Edge
|                                            |         |            next_functions  +----------> ...
|   next_functions[0] = (PowBackward0, 0) +----------> |                              |
|                                            |         +------------------------------+
|                                            |
|                                            |         +-------------------------------+
|   next_functions[1] = (MulBackward0, 0) +----------> | MulBackward0                  |
|                                            |         |                               |  Edge
|                                            |         |             next_functions  +----------> ...
+--------------------------------------------+         |                               |
                                                       +-------------------------------+
                      ^
                      |
                      |
                      |                                                                            Python
+--------------------------------------------------------------------------------------------------------+
                      |                                                                            C++
                      |
                      v

+---------------------------------------------+       +----------------------+        +------------------+
| SubBackward0                                |       | Edge 1               |        | PowBackward0     |
|                         +-------------------------> |                      |        |                  |
|                         |                   |       |         function +----------> |                  |
|                         +                   |       |                      |        |                  |
|        next_edges_ = [Edge 1, Edge 2]       |       |         input_nr = 0 |        |                  |
|                                  +          |       +----------------------+        +------------------+
|                                  |          |
|                                  |          |
+---------------------------------------------+       +----------------------+        +------------------+
                                   |                  | Edge 2               |        | MulBackward0     |
                                   |                  |                      |        |                  |
                                   +----------------> |         function +----------> |                  |
                                                      |                      |        |                  |
                                                      |         input_nr = 0 |        |                  |
                                                      |                      |        |                  |
                                                      +----------------------+        +------------------+
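
The C++ structures in the lower half are what Python exposes as grad_fn and next_functions. The following snippet reproduces the graph in the diagram:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x ** 2 - x * y                 # sub(pow(x, 2), mul(x, y)) -> SubBackward0

print(z.grad_fn)                   # <SubBackward0 ...>
# Each entry is the Python view of an Edge: (target Node, input_nr)
print(z.grad_fn.next_functions)    # ((<PowBackward0 ...>, 0), (<MulBackward0 ...>, 0))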

At this point, the basic classes involved in forward propagation have all been analyzed. In the next article we will see how these classes are used to carry out the forward pass.
