[Source Code Analysis] PyTorch Distributed Optimizer (1) ---- Foundations

import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

net = ToyModel()
optimizer = optim.SGD(params=net.parameters(), lr = 1)
optimizer.zero_grad()
input = torch.randn(10,10)
outputs = net(input)
outputs.backward(outputs)
optimizer.step()

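A rough sketch of the backward computation graph is given below.

[Figure: rough backward computation graph of ToyModel]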
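The backward graph can also be inspected from code. The following is a minimal sketch, assuming a recent PyTorch release; the helper name walk is hypothetical, and the exact node names printed (for example AddmmBackward versus AddmmBackward0) vary between versions.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def walk(fn, depth=0, seen=None, names=None):
    """Depth-first walk over backward nodes, printing and collecting their names."""
    seen = set() if seen is None else seen
    names = [] if names is None else names
    if fn is None or fn in seen:      # None marks an input that needs no gradient
        return names
    seen.add(fn)
    names.append(fn.name())
    print("  " * depth + fn.name())
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, seen, names)
    return names

net = ToyModel()
outputs = net(torch.randn(10, 10))
node_names = walk(outputs.grad_fn)
```

The leaves of the printed tree are AccumulateGrad nodes, one per parameter that takes part in the computation.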

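1.2 Open questions

Having already analyzed the autograd engine and related components, we first use that knowledge to list a few questions that will guide the analysis. We follow this order: build the optimizer from the model parameters ---> the engine computes gradients ---> the optimizer optimizes the parameters ---> the optimizer updates the model. We know that the autograd engine computes the gradients, which raises the following questions:

1. Build the optimizer from the model parameters
   - The optimizer is constructed with optimizer = optim.SGD(params=net.parameters(), lr = 1), so params appears to be assigned to an internal member variable of the optimizer (for now we assume it is called parameters).
   - The model contains two Linear layers. How do these layers get their parameters updated?
2. The engine computes gradients
   - How is it guaranteed that gradients are computed for Linear?
   - For the model, how are the computed gradients matched to the Linear parameters? Where does the engine accumulate them?
3. The optimizer optimizes the parameters
   - step is called to optimize; the optimization target is the optimizer's internal member variable self.parameters.
4. The optimizer updates the model
   - How is an update to the optimization target (self.parameters) reflected in the model parameters (for example, those of Linear)?

The numbers and question marks in the figure below correspond to the four questions above.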

      +-------------------------------------------+                    +------------------+
      |ToyModel                                   |                    | Engine           |
      |                                           | forward / backward |                  |
      | Linear(10, 10)+--> ReLU +--> Linear(10, 5)| +----------------> | Compute gradient |
      |                                           |                    |        +         |
      +-------------------+-----------------------+                    |        |         |
                          |                                            |        |         |
                    1 ??? | parameters()                               +------------------+
                          |                                                     |
                          |                                                     | gradient
                          |   ^                                                 |
                          |   |                                                 v
                          |   | 4 ???                                        2 ???
                          |   |
      +------------------------------------------+
      |SGD                |   |                  |
      |                   |   |                  |
      |                   v   +                  |
      |                                          |
^ +---------------> self.parameters  +---------------->
|     |                                          |    |
|     |                                          |    |
|     +------------------------------------------+    |
|                                                     |
<---------------------------------------------------+ v
                     3 step()

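We need to work through this step by step.

0x01 Model construction

Because the optimizer updates the model's parameters, we first introduce the relevant parts of the model.

1.1 Module

In PyTorch, a model is usually defined by subclassing nn.Module.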

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

Module is defined as follows:

class Module:
    r"""Base class for all neural network modules.

    Your models should also subclass this class.

    Modules can also contain other Modules, allowing to nest them in
    a tree structure. You can assign the submodules as regular attributes::

        import torch.nn as nn
        import torch.nn.functional as F

        class Model(nn.Module):
            def __init__(self):
                super(Model, self).__init__()
                self.conv1 = nn.Conv2d(1, 20, 5)
                self.conv2 = nn.Conv2d(20, 20, 5)

            def forward(self, x):
                x = F.relu(self.conv1(x))
                return F.relu(self.conv2(x))

    Submodules assigned in this way will be registered, and will have their
    parameters converted too when you call :meth:`to`, etc.

    :ivar training: Boolean represents whether this module is in training or
                    evaluation mode.
    :vartype training: bool
    """

    dump_patches: bool = False
    _version: int = 1
    training: bool
    _is_full_backward_hook: Optional[bool]

    def __init__(self):
        """
        Initializes internal Module state, shared by both nn.Module and ScriptModule.
        """
        torch._C._log_api_usage_once("python.nn_module")

        self.training = True
        self._parameters = OrderedDict()
        self._buffers = OrderedDict()
        self._non_persistent_buffers_set = set()
        self._backward_hooks = OrderedDict()
        self._is_full_backward_hook = None
        self._forward_hooks = OrderedDict()
        self._forward_pre_hooks = OrderedDict()
        self._state_dict_hooks = OrderedDict()
        self._load_state_dict_pre_hooks = OrderedDict()
        self._modules = OrderedDict()

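1.2 Member variables

Module has the following important member variables, which fall roughly into three categories.

Basic types:

_parameters: weight parameters of tensor type, used in forward and backward propagation. Saving a model mainly means saving these parameters. The parameters() function recursively collects all parameters of a model; note that it returns an iterator.
_buffers: stores variables that need to be persisted but are not network parameters, such as the running_mean of BN.
_modules: stores members of type Module. When the parameters of a model are fetched, PyTorch collects them by recursively traversing all _modules.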
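To make the three containers concrete, here is a small sketch using only public accessors. The two-layer Sequential model is an assumption chosen for illustration, because BatchNorm1d has buffers while Linear does not.

```python
import torch.nn as nn

block = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))

param_names = [name for name, _ in block.named_parameters()]
buffer_names = [name for name, _ in block.named_buffers()]
child_names = [name for name, _ in block.named_children()]

print(param_names)   # learnable tensors, collected recursively through _modules
print(buffer_names)  # persistent non-learnable state such as running_mean
print(child_names)   # direct submodules stored in _modules
```

The parameter and buffer names carry the submodule name as a prefix, which reflects the recursive traversal of _modules.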

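Computation-related types:

When a model runs, the hooks fire in the following order:

 _forward_pre_hooks  ----> forward ----> _forward_hooks ----> _backward_hooks

In detail:

_forward_pre_hooks: run before forward. A hook can inspect the inputs and may return a modified input to replace them.
_forward_hooks: run after forward. A hook can inspect the inputs and the output and may return a modified output.
_backward_hooks: run during backward, once the gradients with respect to the module have been computed. A hook should not modify its arguments in place, but it may return a new gradient with respect to the input.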
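The firing order can be observed with a short experiment. This is a sketch, assuming a PyTorch version that provides register_full_backward_hook; the Traced subclass and the calls list are hypothetical names used only to record the order.

```python
import torch
import torch.nn as nn

calls = []  # records the order in which hooks and forward fire

class Traced(nn.Linear):
    def forward(self, x):
        calls.append("forward")
        return super().forward(x)

layer = Traced(3, 2)
# Each lambda returns None, so the hooks only observe and change nothing.
layer.register_forward_pre_hook(lambda mod, inp: calls.append("forward_pre_hook"))
layer.register_forward_hook(lambda mod, inp, out: calls.append("forward_hook"))
layer.register_full_backward_hook(lambda mod, grad_in, grad_out: calls.append("backward_hook"))

x = torch.randn(4, 3, requires_grad=True)
layer(x).sum().backward()
print(calls)
```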

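Save/load related:

The following members relate to persistence. PyTorch saves a model with torch.save(cn.state_dict()...) and loads it with load_state_dict(state_dict).

_load_state_dict_pre_hooks: operations to run when _load_from_state_dict is called to load the model.
_state_dict_hooks: operations to run when the state_dict method is called.

At runtime the model looks like this: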
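A minimal save/load round trip might look like the sketch below. It assumes an in-memory io.BytesIO buffer as a stand-in for a checkpoint file, and the names src and dst are illustrative.

```python
import io
import torch
import torch.nn as nn

src = nn.Linear(3, 2)
dst = nn.Linear(3, 2)           # same architecture, different random init

buffer = io.BytesIO()           # in-memory stand-in for a checkpoint file
torch.save(src.state_dict(), buffer)
buffer.seek(0)

dst.load_state_dict(torch.load(buffer))
same = all(torch.equal(a, b) for a, b in zip(src.parameters(), dst.parameters()))
keys = list(src.state_dict().keys())
print(keys, same)
```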

net = {ToyModel} 
 T_destination = {TypeVar} ~T_destination
 dump_patches = {bool} False
 net1 = {Linear} Linear(in_features=10, out_features=10, bias=True)
 net2 = {Linear} Linear(in_features=10, out_features=5, bias=True)
 relu = {ReLU} ReLU()
 training = {bool} True
  _backward_hooks = {OrderedDict: 0} OrderedDict()
  _buffers = {OrderedDict: 0} OrderedDict()
  _forward_hooks = {OrderedDict: 0} OrderedDict()
  _forward_pre_hooks = {OrderedDict: 0} OrderedDict()
  _is_full_backward_hook = {NoneType} None
  _load_state_dict_pre_hooks = {OrderedDict: 0} OrderedDict()
  _modules = {OrderedDict: 3} OrderedDict([('net1', Linear(in_features=10, out_features=10, bias=True)), ('relu', ReLU()), ('net2', Linear(in_features=10, out_features=5, bias=True))])
  _non_persistent_buffers_set = {set: 0} set()
  _parameters = {OrderedDict: 0} OrderedDict()
  _state_dict_hooks = {OrderedDict: 0} OrderedDict()
  _version = {int} 1

1.3 _parameters

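The optimizer optimizes _parameters, so it deserves a closer look.

1.3.1 Construction

First, look at how a Parameter is created: requires_grad=True. With this setting, a Parameter requires gradient computation by default.

A tensor does not require gradients by default: its requires_grad attribute defaults to False. If requires_grad is set to True on a node, gradients must be computed for it, and every node that depends on it also has requires_grad set to True.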

class Parameter(torch.Tensor):
    r"""A kind of Tensor that is to be considered a module parameter.

    Parameters are :class:`~torch.Tensor` subclasses, that have a
    very special property when used with :class:`Module` s - when they're
    assigned as Module attributes they are automatically added to the list of
    its parameters, and will appear e.g. in :meth:`~Module.parameters` iterator.
    Assigning a Tensor doesn't have such effect. This is because one might
    want to cache some temporary state, like last hidden state of the RNN, in
    the model. If there was no such class as :class:`Parameter`, these
    temporaries would get registered too.

    Args:
        data (Tensor): parameter tensor.
        requires_grad (bool, optional): if the parameter requires gradient. See
            :ref:`locally-disable-grad-doc` for more details. Default: `True`
    """
    def __new__(cls, data=None, requires_grad=True): # gradients are required by default
        if data is None:
            data = torch.tensor([])
        return torch.Tensor._make_subclass(cls, data, requires_grad)
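The propagation of requires_grad described above can be checked directly. A small sketch, with illustrative variable names:

```python
import torch
from torch.nn import Parameter

plain = torch.ones(3)                 # ordinary tensor, requires_grad defaults to False
weight = Parameter(torch.ones(3))     # Parameter defaults to requires_grad=True

derived = plain * 2 + weight          # depends on weight, so it requires grad too
print(plain.requires_grad, weight.requires_grad, derived.requires_grad)
print(weight.is_leaf, derived.is_leaf)
```

The Parameter is a leaf of the graph, while the derived tensor is not; this is the same leaf check that register_parameter performs through grad_fn.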

1.3.2 Registration

If a member of a class is an instance of Parameter, nn.Module uses the __setattr__ mechanism to file it under _parameters. The weight and bias of Linear are examples.

def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:
    
    # omitted .....
    
    params = self.__dict__.get('_parameters')
    if isinstance(value, Parameter):
        remove_from(self.__dict__, self._buffers, self._modules, self._non_persistent_buffers_set)
        self.register_parameter(name, value) # registers the value in _parameters
        

    def register_parameter(self, name: str, param: Optional[Parameter]) -> None:
        r"""Adds a parameter to the module.

        The parameter can be accessed as an attribute using given name.

        Args:
            name (string): name of the parameter. The parameter can be accessed
                from this module using the given name
            param (Parameter): parameter to be added to the module.
        """
        
        # various checks omitted

        if param is None:
            self._parameters[name] = None
        elif not isinstance(param, Parameter):
            raise TypeError("cannot assign '{}' object to parameter '{}' "
                            "(torch.nn.Parameter or None required)"
                            .format(torch.typename(param), name))
        elif param.grad_fn:
            raise ValueError(
                "Cannot assign non-leaf Tensor to parameter '{0}'. Model "
                "parameters must be created explicitly. To express '{0}' "
                "as a function of another Tensor, compute the value in "
                "the forward() method.".format(name))
        else:
            self._parameters[name] = param # the parameter is added here
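The effect of this registration can be seen on a tiny module. The class Tiny and its attribute names are hypothetical, and the sketch peeks at _parameters only to illustrate where the values end up.

```python
import torch
import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(2))  # goes through register_parameter
        self.cache = torch.zeros(2)               # plain Tensor: ordinary attribute

m = Tiny()
registered = list(m._parameters.keys())
print(registered)
print("cache" in m.__dict__, "scale" in m.__dict__)
```

Only the Parameter is registered; the plain tensor stays an ordinary instance attribute and would not be seen by an optimizer built from m.parameters().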

1.3.3 Access

_parameters is an internal member that is not meant to be accessed directly; instead we use the parameters method, which returns an Iterator.

For example:

for param in net.parameters():
    print(type(param), param.size())

Output:

<class 'torch.nn.parameter.Parameter'> torch.Size([10, 10])
<class 'torch.nn.parameter.Parameter'> torch.Size([10])
<class 'torch.nn.parameter.Parameter'> torch.Size([5, 10])
<class 'torch.nn.parameter.Parameter'> torch.Size([5])

The code of parameters is as follows.

def parameters(self, recurse: bool = True) -> Iterator[Parameter]:
    r"""Returns an iterator over module parameters.

    This is typically passed to an optimizer.

    Args:
        recurse (bool): if True, then yields parameters of this module
            and all submodules. Otherwise, yields only parameters that
            are direct members of this module.

    Yields:
        Parameter: module parameter

    Example::

        >>> for param in model.parameters():
        >>>     print(type(param), param.size())
        <class 'torch.Tensor'> (20L,)
        <class 'torch.Tensor'> (20L, 1L, 5L, 5L)

    """
    for name, param in self.named_parameters(recurse=recurse):
        yield param

Next, look at named_parameters. Its core is module._parameters.items(), which yields the (name, parameter) pairs of each module.

def named_parameters(self, prefix: str = '', recurse: bool = True) -> Iterator[Tuple[str, Parameter]]:
    r"""Returns an iterator over module parameters, yielding both the
    name of the parameter as well as the parameter itself.

    Args:
        prefix (str): prefix to prepend to all parameter names.
        recurse (bool): if True, then yields parameters of this module
            and all submodules. Otherwise, yields only parameters that
            are direct members of this module.

    Yields:
        (string, Parameter): Tuple containing the name and parameter

    Example::

        >>> for name, param in self.named_parameters():
        >>>    if name in ['bias']:
        >>>        print(param.size())

    """
    gen = self._named_members(
        lambda module: module._parameters.items(),
        prefix=prefix, recurse=recurse)
    for elem in gen:
        yield elem
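The recurse argument is easiest to understand by comparing both settings. A short sketch on a Sequential model that mirrors ToyModel's layers (the model itself is an assumption for illustration):

```python
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 5))

names = [name for name, _ in net.named_parameters()]
own = [name for name, _ in net.named_parameters(recurse=False)]
print(names)  # parameters of all submodules, prefixed with the submodule name
print(own)    # the container itself owns no parameters directly
```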

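Note that we now have two key facts:

Parameter's constructor has requires_grad=True, so a Parameter requires gradient computation by default.
Parameters are obtained through the parameters method, which returns an Iterator.

So the earlier figure can be extended: SGD's parameters is now an iterator pointing to ToyModel._parameters, which means the optimizer directly optimizes ToyModel's _parameters. We can therefore remove the question mark for 4) in the original figure.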

      +-------------------------------------------+                    +------------------+
      |ToyModel                                   |                    | Engine           |
      |                                           | forward / backward |                  |
      | Linear(10, 10)+--> ReLU +--> Linear(10, 5)| +----------------> | Compute gradient |
      |                                           |                    |        +         |
      |         para_iterator = parameters()      |                    |        |         |
      |                   +          ^            |                    |        |         |
      |                   |          |            |                    +------------------+
      +-------------------------------------------+                             |
                          |          |                                          | gradient
                          |          |                                          |
                  1 ???   |          | 4 update                                 v
                          |          |                                       2 ???
                          |          |
      +----------------------------------------------------------------+
      |SGD                |          |                                 |
      |                   |          |                                 |
      |                   v          |                                 |
      |                              +                                 |
^ +--------> self.parameters = para_iterator(ToyModel._parameters) --------->
|     |                                                                |    |
|     |                                                                |    |
|     +----------------------------------------------------------------+    |
|                                                                           |
<-------------------------------------------------------------------------+ v
                     3 step()
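The "4 update" arrow in the figure can be verified in code: the optimizer holds the very same tensor objects as the model, so an in-place update by step() is immediately visible in the model. This is a sketch on a single Linear layer; it reads the tensors from optimizer.param_groups, the attribute the article arrives at later.

```python
import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(3, 2)
optimizer = optim.SGD(net.parameters(), lr=0.1)

held = optimizer.param_groups[0]["params"]
shared = all(a is b for a, b in zip(held, net.parameters()))

before = net.weight.detach().clone()
net(torch.ones(1, 3)).sum().backward()   # every entry of weight.grad becomes 1
optimizer.step()                         # updates the shared tensors in place
changed = not torch.equal(before, net.weight)
print(shared, changed)
```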

1.4 Linear

torch.nn.Linear applies a linear transformation to its input and is typically used for fully connected layers.

1.4.1 Usage

An example of using torch.nn.Linear in PyTorch:

input = torch.randn(2,3)
linear = nn.Linear(3,4)
out = linear(input)
print(out)

# Output:
tensor([[-0.6938,  0.0543, -1.4393, -0.3554],
        [-0.4653, -0.2421, -0.8236, -0.1872]], grad_fn=<AddmmBackward>)

1.4.2 Definition

Linear is defined as follows. Its main parameters are:

self.weight = Parameter()
self.bias = Parameter()

As shown earlier, a Parameter is created with requires_grad=True, so gradients need to be computed for weight and bias.

class Linear(Module):
    r"""Applies a linear transformation to the incoming data: :math:`y = xA^T + b`

    This module supports :ref:`TensorFloat32<tf32_on_ampere>`.

    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        bias: If set to ``False``, the layer will not learn an additive bias.
            Default: ``True``

    Shape:
        - Input: :math:`(N, *, H_{in})` where :math:`*` means any number of
          additional dimensions and :math:`H_{in} = \text{in\_features}`
        - Output: :math:`(N, *, H_{out})` where all but the last dimension
          are the same shape as the input and :math:`H_{out} = \text{out\_features}`.

    Attributes:
        weight: the learnable weights of the module of shape
            :math:`(\text{out\_features}, \text{in\_features})`. The values are
            initialized from :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})`, where
            :math:`k = \frac{1}{\text{in\_features}}`
        bias:   the learnable bias of the module of shape :math:`(\text{out\_features})`.
                If :attr:`bias` is ``True``, the values are initialized from
                :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})` where
                :math:`k = \frac{1}{\text{in\_features}}`

    Examples::

        >>> m = nn.Linear(20, 30)
        >>> input = torch.randn(128, 20)
        >>> output = m(input)
        >>> print(output.size())
        torch.Size([128, 30])
    """
    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
        if bias:
            self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self) -> None:
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            init.uniform_(self.bias, -bound, bound)

    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias) 

    def extra_repr(self) -> str:
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )
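The formula in the docstring, y = xA^T + b, can be checked against the layer's output. A minimal sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(3, 4)
x = torch.randn(2, 3)

manual = x @ linear.weight.t() + linear.bias   # y = x A^T + b
out = linear(x)
matches = torch.allclose(out, manual, atol=1e-6)
shapes = [tuple(p.shape) for p in linear.parameters()]
print(out.shape, matches)
print(shapes)   # weight is (out_features, in_features), bias is (out_features,)
```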

1.4.3 Explanation

From the rough computation graph earlier, we know that the backward function of torch.nn.Linear is AddmmBackward.

struct TORCH_API AddmmBackward : public TraceableFunction {
  using TraceableFunction::TraceableFunction;
  variable_list apply(variable_list&& grads) override;
  std::string name() const override { return "AddmmBackward"; }
  
  void release_variables() override {
    std::lock_guard<std::mutex> lock(mutex_);
    mat2_.reset_data();
    mat1_.reset_data();
  }

  std::vector<int64_t> mat1_sizes;
  std::vector<int64_t> mat1_strides;
  SavedVariable mat2_;
  at::Scalar alpha;
  SavedVariable mat1_;
  std::vector<int64_t> mat2_sizes;
  std::vector<int64_t> mat2_strides;
  at::Scalar beta;
};

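In the code base we can find a definition of addmm (the one shown below is the sparse variant). Its docstring says that it does the same thing as torch.addmm in the forward pass, that is, a matrix multiplication followed by an addition.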

def addmm(mat: Tensor, mat1: Tensor, mat2: Tensor,
          beta: float = 1., alpha: float = 1.) -> Tensor:
    r"""
    This function does exact same thing as :func:`torch.addmm` in the forward,
    except that it supports backward for sparse matrix :attr:`mat1`. :attr:`mat1`
    need to have `sparse_dim = 2`. Note that the gradients of :attr:`mat1` is a
    coalesced sparse tensor.

    Args:
        mat (Tensor): a dense matrix to be added
        mat1 (Tensor): a sparse matrix to be multiplied
        mat2 (Tensor): a dense matrix to be multiplied
        beta (Number, optional): multiplier for :attr:`mat` (:math:`\beta`)
        alpha (Number, optional): multiplier for :math:`mat1 @ mat2` (:math:`\alpha`)
    """
    return torch._sparse_addmm(mat, mat1, mat2, beta=beta, alpha=alpha)
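For the dense case, the semantics of torch.addmm are beta * mat + alpha * (mat1 @ mat2). The sketch below checks this on random tensors; the tensor names are illustrative, with bias playing the role of mat so that the call mirrors what a Linear layer computes.

```python
import torch

torch.manual_seed(0)
bias = torch.randn(4)
x = torch.randn(2, 3)
weight = torch.randn(4, 3)

fused = torch.addmm(bias, x, weight.t(), beta=1.0, alpha=1.0)
manual = 1.0 * bias + 1.0 * (x @ weight.t())   # bias is broadcast over the rows
agree = torch.allclose(fused, manual, atol=1e-6)
print(fused.shape, agree)
```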

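We can now extend the picture further.

Both weight and bias in Linear are of type Parameter.
- Parameter's constructor has requires_grad=True, so a Parameter requires gradient computation by default.
- Therefore the engine needs to compute gradients for Linear's weight and bias.
ToyModel's _parameters member is obtained through the parameters method, which returns an Iterator.
- This iterator is passed as the argument used to build the SGD optimizer.
- SGD's parameters is now an iterator pointing to ToyModel._parameters. This means the optimizer directly optimizes ToyModel's _parameters, which in this example are the parameters of the fully connected layers; in the figure this corresponds to the arrows from the two Linear layers to parameters().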

+--------------------------------------------------+                   +------------------+
| ToyModel                                         |                   | Engine           |
| +-------------------+             +------------+ |forward / backward |                  |
| | Linear(10, 10)    +--> ReLU +-->+Linear(10,5)| +-----------------> | Compute gradient |
| |                   |             |            | |                   |        +         |
| |  weight=Parameter |             |    weight  | |                   |        |         |
| |                   +----------+  |            | |                   |        |         |
| |  bias=Parameter   |          |  |    bias    | |                   +------------------+
| |                   |          |  |            | |                            |
| +-------------------+          |  +--+---------+ |                          2 | gradient
|                                |     |           |                            |
|                                |     |           |                            v
|                                v     v           |                           ???
|               para_iterator = parameters()       |
|                         +          ^             |
|                         |          |             |
|                         |          |             |
+--------------------------------------------------+
                          |          |
                   1 ???  |          | 4 update
                          |          |
                          |          |
      +----------------------------------------------------------------+
      |SGD                |          |                                 |
      |                   |          |                                 |
      |                   v          |                                 |
      |                              +                                 |
^ +--------> self.parameters = para_iterator(ToyModel._parameters) +-------->
|     |                                                                |    |
|     |                                                                |    |
|     +----------------------------------------------------------------+    |
|                                                                           |
<-------------------------------------------------------------------------+ v
                     3 step()

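0x02 The Optimizer base class

Optimizer is the base class of all optimizers. Its main public methods are:

add_param_group: adds a group of learnable parameters.
step: performs a single parameter update.
zero_grad: clears the gradients left from the previous iteration, before backpropagation computes new ones.
state_dict: returns the parameters and state as a dict.
load_state_dict: loads parameters and state from a dict.

2.1 Initialization

The Optimizer initializer does the following:

The constructor arguments are the learnable parameters (params) and the hyperparameters (defaults).
self.defaults stores global settings (hyperparameters) such as lr and momentum.
self.state stores the optimizer's current state.
self.param_groups stores all variables to be optimized.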

class Optimizer(object):

    def __init__(self, params, defaults): 
        torch._C._log_api_usage_once("python.optimizer")
        self.defaults = defaults # stores global settings such as lr and momentum

        self._hook_for_profile()

        if isinstance(params, torch.Tensor): # params must be an iterable of Tensors or dicts, not a single Tensor
            raise TypeError("params argument given to the optimizer should be "
                            "an iterable of Tensors or dicts, but got " +
                            torch.typename(params))

        self.state = defaultdict(dict) # stores the optimizer's current state
        self.param_groups = [] # everything to optimize; each item is a dict holding one group of parameters and its related options

        param_groups = list(params) # the variables to optimize, as passed to __init__
        if len(param_groups) == 0:
            raise ValueError("optimizer got an empty parameter list")
        if not isinstance(param_groups[0], dict):
            # wrap the parameters in a dict
            param_groups = [{'params': param_groups}] # param_groups is a list whose single item is a dict holding the variables to optimize

        for param_group in param_groups:
            self.add_param_group(param_group) # add every item of param_groups to self.param_groups
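The behaviour of this initializer can be probed from the outside. The sketch below uses SGD as the concrete subclass and a single illustrative tensor w; it also triggers the two error paths shown in the code above.

```python
import torch
import torch.optim as optim

w = torch.zeros(2, requires_grad=True)

opt = optim.SGD([w], lr=0.1, momentum=0.9)
print(opt.defaults["lr"], opt.defaults["momentum"])
print(len(opt.param_groups), len(opt.state))   # one group, no state before the first step

errors = []
for bad in (w, []):          # a bare Tensor, then an empty list
    try:
        optim.SGD(bad, lr=0.1)
    except (TypeError, ValueError) as exc:
        errors.append(type(exc).__name__)
print(errors)
```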

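2.2 Adding variables to optimize

The code above uses add_param_group, so let us look at that function next.

add_param_group adds learnable parameters in separate groups. The code is shown below (most of the validation is omitted). The purpose of param_groups is to let the variables to optimize be accessed in key-value form together with their per-group options, which is especially useful for fine tuning.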

def add_param_group(self, param_group):
    r"""Add a param group to the :class:`Optimizer` s `param_groups`.

    This can be useful when fine tuning a pre-trained network as frozen layers can be made
    trainable and added to the :class:`Optimizer` as training progresses.

    Args:
        param_group (dict): Specifies what Tensors should be optimized along with group
        specific optimization options.
    """
    assert isinstance(param_group, dict), "param group must be a dict"

    params = param_group['params'] # the variables to optimize
    if isinstance(params, torch.Tensor):
        param_group['params'] = [params] # wrap a single tensor in a list
    elif isinstance(params, set):
        raise TypeError('optimizer parameters need to be organized in ordered collections, but '
                        'the ordering of tensors in sets will change between runs. Please use a list instead.')
    else:
        param_group['params'] = list(params)
        
    # validation omitted, e.g. each item must be a tensor and a leaf node

    for name, default in self.defaults.items(): # the defaults are filled into param_group as well
        if default is required and name not in param_group:
            raise ValueError("parameter group didn't specify a value of required optimization parameter " +
                             name)
        else:
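            param_group.setdefault(name, default) # options a group does not set fall back to the defaults (hyperparameters)

    # collect the parameters already registered, to detect duplicates across groups (the check itself is omitted here)
    params = param_group['params']
    param_set = set()
    for group in self.param_groups:
        param_set.update(set(group['params']))

    # append to this optimizer's own param_groups
    self.param_groups.append(param_group) # add to param_groups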
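A fine-tuning style use of parameter groups could look like the following sketch. The names backbone and head are hypothetical; the point is that a group may override lr, that a group added later inherits the defaults, and that adding the same parameters twice is rejected by the duplicate check omitted from the listing above.

```python
import torch.nn as nn
import torch.optim as optim

backbone = nn.Linear(4, 4)   # hypothetical pre-trained part
head = nn.Linear(4, 2)       # hypothetical new part

opt = optim.SGD([{"params": backbone.parameters(), "lr": 0.001}], lr=0.1)
opt.add_param_group({"params": head.parameters()})   # inherits lr=0.1 from defaults

lrs = [group["lr"] for group in opt.param_groups]
print(lrs)

try:
    opt.add_param_group({"params": head.parameters()})   # same tensors again
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)
print(duplicate_error)
```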

2.3 Example of variables to optimize

We print param_groups with the following code.

net = nn.Linear(3, 3)
nn.init.constant_(net.weight, val=10)
nn.init.constant_(net.bias, val=5)
optimizer = optim.SGD(net.parameters(), lr=0.025)
print(optimizer.param_groups)

The result is shown below. The first entry, 3 x 3, is the weight matrix of net; the second, with 3 elements, is the bias vector.

[
  {'params': 
    [
      Parameter containing: # weight matrix
        tensor([[10., 10., 10.],
              [10., 10., 10.],
              [10., 10., 10.]], requires_grad=True), 
      Parameter containing: # bias vector
        tensor([5., 5., 5.], requires_grad=True)
    ], 
  'lr': 0.025, 
  'momentum': 0, 
  'dampening': 0, 
  'weight_decay': 0, 
  'nesterov': False
  }
]

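2.4 Optimizer state

2.4.1 Definition

A state_dict in PyTorch is a Python dictionary object.

For a model, state_dict maps each layer to the parameters it learns during training (such as weights and biases). Only layers with trainable parameters (or persistent buffers), such as convolutional and linear layers, are stored in the model's state_dict.
For an optimizer, state_dict is its state information, which consists of two parts:
- state: a dict holding the optimizer's current state (that is, the latest cached variables computed while updating the parameters).
  - The key is the index of the parameter the cache belongs to.
  - The value is itself a dict whose keys are the names of the cached variables and whose values are the corresponding tensors.
- param_groups: a list containing all param groups, each of which is a dict.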

def state_dict(self):
    r"""Returns the state of the optimizer as a :class:`dict`.

    It contains two entries:

    * state - a dict holding current optimization state. Its content
        differs between optimizer classes.
    * param_groups - a dict containing all parameter groups
    """
    # Save order indices instead of Tensors
    param_mappings = {}
    start_index = 0

    def pack_group(group):
        nonlocal start_index
        # 'params' is handled by a different rule
        packed = {k: v for k, v in group.items() if k != 'params'}
        param_mappings.update({id(p): i for i, p in enumerate(group['params'], start_index)
                               if id(p) not in param_mappings})
        # the indices of the parameters are stored, not their values
        packed['params'] = [param_mappings[id(p)] for p in group['params']]
        start_index += len(packed['params'])
        return packed

    # iterate over self.param_groups and pack each group
    param_groups = [pack_group(g) for g in self.param_groups]
    
    # Remap state to use order indices as keys: every Tensor key is replaced by its index
    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
                    for k, v in self.state.items()}
    
    return { # returned as a dict
        'state': packed_state, # state
        'param_groups': param_groups, # parameters to optimize
    }
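The index-based packing becomes visible once the optimizer has some state. In this sketch (illustrative tensors w and w2, SGD with momentum so that a buffer exists), the state_dict of one optimizer is loaded into another one that wraps a different tensor.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = optim.SGD([w], lr=0.1, momentum=0.9)

(w ** 2).sum().backward()         # gradient is 2 * w
opt.step()                        # the first step creates the momentum buffer

sd = opt.state_dict()
print(sd["param_groups"][0]["params"])   # indices, not tensors
print(list(sd["state"].keys()))

w2 = torch.tensor([1.0, 2.0], requires_grad=True)
opt2 = optim.SGD([w2], lr=0.1, momentum=0.9)
opt2.load_state_dict(sd)          # indices are mapped back onto opt2's own tensors
restored = opt2.state[w2]["momentum_buffer"]
print(restored)
```

Because only indices are stored, the order of parameters must be the same when the optimizer is rebuilt before loading.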

2.4.2 Example 1

We add the following print statements to Example 1 to inspect the optimizer's internal variables:

# print model's state_dict
print('Model.state_dict:')
for param_tensor in model.state_dict():
    print(param_tensor, '\t', model.state_dict()[param_tensor].size())

# print optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, '\t', optimizer.state_dict()[var_name])

The result is as follows:

Model.state_dict:
net1.weight  torch.Size([10, 10])
net1.bias 	 torch.Size([10])
net2.weight  torch.Size([5, 10])
net2.bias 	 torch.Size([5])

Optimizer's state_dict:
state 	 {}
param_groups 	 [{'lr': 0.001, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0, 1, 2, 3]}]

2.4.3 Example 2

Example 2 uses SGD to optimize a function.

from math import pi
import torch.optim

x = torch.tensor([pi/2,pi/3],requires_grad=True)
optimizer = torch.optim.SGD([x,],lr=0.2,momentum=0.5)

for step in range(11):
    if step:
        optimizer.zero_grad()
        f.backward()
        optimizer.step()

        for var_name in optimizer.state_dict():
            print(var_name, '\t', optimizer.state_dict()[var_name])
    f=-((x.sin()**3).sum())**3

The output is shown below; the optimization process is visible in it.

state 	 {0: {'momentum_buffer': tensor([ 1.0704e-06, -9.1831e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([-1.2757e-06, -4.0070e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([-3.4580e-07, -4.7366e-01])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([7.3855e-07, 1.3584e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([7.2726e-07, 1.6619e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([-3.1580e-07,  8.4152e-01])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([2.3738e-07, 5.8072e-01])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([5.2412e-07, 8.4104e-01])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([-5.1160e-07,  1.9660e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]

state 	 {0: {'momentum_buffer': tensor([4.9517e-07, 7.2053e+00])}}
param_groups 	 [{'lr': 0.2, 'momentum': 0.5, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}]
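The momentum_buffer values follow the SGD momentum rule: with dampening 0, the buffer starts as the first gradient and afterwards becomes momentum * buffer + gradient, and the parameter moves by lr * buffer. The sketch below reproduces this by hand on a simpler function (the sum of squares, an assumption chosen so the gradient is easy to read) and compares it with what the optimizer stores.

```python
import torch

lr, momentum = 0.2, 0.5
x = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([x], lr=lr, momentum=momentum)

expected_x = x.detach().clone()   # hand-computed copy of the parameter
buf = None                        # hand-computed momentum buffer

for _ in range(3):
    optimizer.zero_grad()
    (x ** 2).sum().backward()     # gradient is 2 * x
    grad = x.grad.clone()
    optimizer.step()

    buf = grad if buf is None else momentum * buf + grad
    expected_x = expected_x - lr * buf

torch_buf = optimizer.state[x]["momentum_buffer"]
x_ok = torch.allclose(x.detach(), expected_x)
buf_ok = torch.allclose(torch_buf, buf)
print(x_ok, buf_ok)
```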

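To update the picture: we have established that the member variable inside SGD is named param_groups. It is the optimizer's optimization target, and it holds the Parameter objects produced by the iterator over ToyModel._parameters.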

 +-------------------------------------------------+                   +------------------+
 |ToyModel                                         |                   | Engine           |
 | +------------------+             +------------+ |forward / backward |                  |
 | |Linear(10, 10)    +--> ReLU +-->+Linear(10,5)| +-----------------> | Compute gradient |
 | |                  |             |            | |                   |        +         |
 | |  weight=Parameter|             |    weight  | |                   |        |         |
 | |                  +-----------+ |    bias    | |                   |        |         |
 | |  bias=Parameter  |           | +--+---------+ |                   +------------------+
 | |                  |           |    |           |                            |
 | +------------------+           |    |           |                          2 | gradient
 |                                v    v           |                            |
 |                         self._parameters        |                            v
 |                                  +              |                           ???
 |                                  |              |
 |                                  |              |
 |                                  v              |
 |              para_iterator = parameters()       |
 |                        +          ^             |
 |                        |          |             |
 |                        |          |             |
 +-------------------------------------------------+
                          |          |
                    1 ??? |          | 4 update
                          |          |
      +----------------------------------------------------------------+
      |SGD                |          |                                 |
      |                   |          |                                 |
      |                   v          |                                 |
      |                              +                                 |
^ +-------> self.param_groups = para_iterator(ToyModel._parameters) -------->
|     |                                                                |    |
|     |                                                                |    |
|     +----------------------------------------------------------------+    |
|                                                                           |
<-------------------------------------------------------------------------+ v
                     3 step()
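The relationship drawn above can be checked directly. The following is a minimal sketch that reuses the ToyModel from this article and inspects optimizer.param_groups; the helper names group, model_params and same_objects are only introduced here for illustration.

```python
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

net = ToyModel()
optimizer = optim.SGD(params=net.parameters(), lr=1)

# param_groups is a list of dicts; each dict carries hyperparameters plus 'params'.
group = optimizer.param_groups[0]
model_params = list(net.parameters())

# Compare by identity: the optimizer stores the model's Parameter objects, not copies.
same_objects = all(p is q for p, q in zip(group["params"], model_params))
print(len(optimizer.param_groups), len(group["params"]), same_objects)
```

Because the entries are the same objects, any in-place update the optimizer performs is visible through the model without a separate copy-back step.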

0x03 SGD

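We use SGD to take a closer look at the optimizer. SGD (stochastic gradient descent) is, as used here, the mini-batch version of gradient descent: the training set is split into n batches of m samples each, and every update uses one batch rather than the whole training set.

3.1 Definition

SGD is defined as follows. The constructor mainly validates its arguments and sets default values.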

class SGD(Optimizer):
    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay=0, nesterov=False):
        if lr is not required and lr < 0.0:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if momentum < 0.0:
            raise ValueError("Invalid momentum value: {}".format(momentum))
        if weight_decay < 0.0:
            raise ValueError("Invalid weight_decay value: {}".format(weight_decay))

        defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                        weight_decay=weight_decay, nesterov=nesterov)
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")
        super(SGD, self).__init__(params, defaults)
        
    def __setstate__(self, state):
        super(SGD, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('nesterov', False)
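The validation in the constructor can be exercised with a small sketch. The helper try_build and the tensor w are hypothetical names used only for this illustration; the only library call is the optim.SGD constructor.

```python
import torch
import torch.optim as optim

w = torch.zeros(3, requires_grad=True)

def try_build(**kwargs):
    """Return the error message if construction fails, otherwise None."""
    try:
        optim.SGD([w], **kwargs)
    except ValueError as exc:
        return str(exc)
    return None

negative_lr = try_build(lr=-0.1)                              # rejected: lr < 0
nesterov_without_momentum = try_build(lr=0.1, nesterov=True)  # rejected: momentum is 0
valid = try_build(lr=0.1, momentum=0.9, nesterov=True)        # accepted

print(negative_lr)
print(nesterov_without_momentum)
print(valid)
```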

3.2 Analysis

As the docstring says, SGD implements stochastic gradient descent (optionally with momentum). Nesterov momentum is based on the formula from [On the importance of initialization and momentum in deep learning](http://www.cs.toronto.edu/%7Ehinton/absps/momentum.pdf).

A usage example:

Example:
    >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    >>> optimizer.zero_grad()
    >>> loss_fn(model(input), target).backward()
    >>> optimizer.step()

PyTorch's implementation of SGD with Momentum/Nesterov differs from Sutskever et al. and from the implementations in some other frameworks.

For example, PyTorch implements the Momentum case as follows:

\[\begin{aligned} v_{t+1} & = \mu * v_{t} + g_{t+1}, \\ p_{t+1} & = p_{t} - \text{lr} * v_{t+1}, \end{aligned} \]

Other frameworks use:

\[\begin{aligned} v_{t+1} & = \mu * v_{t} + \text{lr} * g_{t+1}, \\ p_{t+1} & = p_{t} - v_{t+1}. \end{aligned} \]
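To see what the difference between the two formulations means in practice, here is a small pure-Python sketch. It assumes a toy loss f(p) = p^2 / 2 (so the gradient is simply p); the function names and the learning-rate schedules are made up for this illustration. With a constant learning rate the two rules trace the same trajectory, because one velocity is just lr times the other; once the learning rate changes mid-run they diverge.

```python
def grad(p):
    # Gradient of the toy loss f(p) = p**2 / 2 (an assumption of this sketch).
    return p

def run_pytorch_style(p, lrs, mu):
    v = 0.0
    for lr in lrs:
        v = mu * v + grad(p)       # v_{t+1} = mu * v_t + g_{t+1}
        p = p - lr * v             # p_{t+1} = p_t - lr * v_{t+1}
    return p

def run_other_style(p, lrs, mu):
    v = 0.0
    for lr in lrs:
        v = mu * v + lr * grad(p)  # v_{t+1} = mu * v_t + lr * g_{t+1}
        p = p - v                  # p_{t+1} = p_t - v_{t+1}
    return p

constant = [0.1] * 5
decayed = [0.1, 0.1, 0.01, 0.01, 0.01]

gap_constant = abs(run_pytorch_style(1.0, constant, 0.9) - run_other_style(1.0, constant, 0.9))
gap_decayed = abs(run_pytorch_style(1.0, decayed, 0.9) - run_other_style(1.0, decayed, 0.9))
print(gap_constant, gap_decayed)
```

In the PyTorch form a learning-rate change rescales the whole accumulated velocity at once, while in the other form the history stored in the velocity keeps the learning rates that were in effect when it was accumulated.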

3.3 step

The step method optimizes the variables according to the chosen algorithm. One call performs one update of the model parameters.

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step.

        Args:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        # recompute the loss with closure
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

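        # update the variables using the computed gradients
        # self.param_groups holds the parameters we passed in
        for group in self.param_groups: # each group is a dict holding the parameters and the settings that group needs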
            params_with_grad = []
            d_p_list = []
            momentum_buffer_list = []
            # settings required to update this group of parameters
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']
            lr = group['lr']

            for p in group['params']: # iterate over every parameter of this group that needs updating
                if p.grad is not None:
                    params_with_grad.append(p)
                    d_p_list.append(p.grad)

                    state = self.state[p]
                    if 'momentum_buffer' not in state:
                        momentum_buffer_list.append(None)
                    else:
                        momentum_buffer_list.append(state['momentum_buffer'])

            F.sgd(params_with_grad,
                  d_p_list,
                  momentum_buffer_list,
                  weight_decay=weight_decay,
                  momentum=momentum,
                  lr=lr,
                  dampening=dampening,
                  nesterov=nesterov)

            # update momentum_buffers in state
            for p, momentum_buffer in zip(params_with_grad, momentum_buffer_list):
                state = self.state[p]
                state['momentum_buffer'] = momentum_buffer

        return loss

The sgd function it calls is as follows:

def sgd(params: List[Tensor],
        d_p_list: List[Tensor],
        momentum_buffer_list: List[Optional[Tensor]],
        *,
        weight_decay: float,
        momentum: float,
        lr: float,
        dampening: float,
        nesterov: bool):
    r"""Functional API that performs SGD algorithm computation.

    See :class:`~torch.optim.SGD` for details.
    """

    for i, param in enumerate(params):

        d_p = d_p_list[i]
        # regularization and momentum accumulation
        if weight_decay != 0:
            d_p = d_p.add(param, alpha=weight_decay)

        if momentum != 0:
            buf = momentum_buffer_list[i]

            if buf is None:
                # the accumulated history of updates
                buf = torch.clone(d_p).detach()
                momentum_buffer_list[i] = buf
            else:
                # the in-place update of buf also updates self.state
                buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

            if nesterov:
                d_p = d_p.add(buf, alpha=momentum)
            else:
                d_p = buf

        # update this learnable parameter: param -= lr * d_p
        param.add_(d_p, alpha=-lr) # add_ modifies the tensor in place
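To make the control flow of sgd easier to follow, here is a scalar re-statement in plain Python. It is only a sketch that mirrors the branches of the listing above for a single float parameter; the function name sgd_scalar_step and the sample numbers are assumptions made for this illustration, not part of PyTorch.

```python
def sgd_scalar_step(p, g, buf, *, lr, momentum=0.0, dampening=0.0,
                    weight_decay=0.0, nesterov=False):
    """One SGD update for a single scalar parameter; returns (new_p, new_buf)."""
    d_p = g
    if weight_decay != 0:
        d_p = d_p + weight_decay * p       # L2 penalty folded into the gradient
    if momentum != 0:
        if buf is None:
            buf = d_p                      # first step: the buffer starts as the gradient
        else:
            buf = momentum * buf + (1 - dampening) * d_p
        d_p = d_p + momentum * buf if nesterov else buf
    return p - lr * d_p, buf

# Two consecutive momentum steps with a constant gradient of 0.5.
p, buf = 1.0, None
p, buf = sgd_scalar_step(p, 0.5, buf, lr=0.1, momentum=0.9)
first = (p, buf)
p, buf = sgd_scalar_step(p, 0.5, buf, lr=0.1, momentum=0.9)
second = (p, buf)

# Weight decay alone moves the parameter even when the gradient is zero.
decayed_p, _ = sgd_scalar_step(2.0, 0.0, None, lr=0.5, weight_decay=0.1)
print(first, second, decayed_p)
```

Note that dampening is not applied on the first step, because the buffer is initialized directly from the gradient.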

3.4 Variables

Next we look at each hyperparameter in detail.

3.4.1 lr

This is the learning rate, a concept everyone is familiar with.

3.4.2 dampening

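dampening is applied to the gradient. In momentum SGD it scales the weight given to the current gradient.

The corresponding formula is:

\[v_t = v_{t-1} * momentum + g_t * (1 - dampening) \]

The corresponding code is: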

buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

3.4.3 weight_decay

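weight_decay is the L2 penalty coefficient. It modifies the gradient using the current value of the learnable parameter p.

The gradient of the learnable parameter p that is about to be updated becomes:

\[g_t = g_t + ( p * weight\_decay) \]

The corresponding code is: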

if weight_decay != 0:
	d_p = d_p.add(param, alpha=weight_decay)

3.4.4 nesterov

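nesterov controls whether Nesterov momentum is enabled. Judging from the PyTorch source, when nesterov is True, momentum and the updated velocity obtained above are applied once more, so the step direction becomes:

\[\nabla_{w}J(w) + m * v_{t+1} \]

if nesterov:
    d_p = d_p.add(buf, alpha=momentum)
else:
    d_p = buf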

3.4.5 Momentum

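Momentum: the term comes from physics. It combines the previous update with the current gradient when updating the weights.

It was introduced because unsuitable initial weights can leave training stuck in a local minimum without finding the global optimum.

Momentum alleviates this to some extent. It mimics the inertia of a moving object, that is, the accumulated effect of a force over time. Each update partly keeps the direction of the previous update while using the current gradient to adjust it. The larger the momentum, the more energy can be converted into potential energy, which can improve stability and speed up learning, making it more likely to escape a local basin and reach the global one.

The plain weight update formula is: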

\[w = w - Lr * dw \]

Here w is the weight, Lr is the learning rate, and dw is the derivative of the loss with respect to w.

With momentum, the weight update becomes:

\[\begin{aligned} v & = momentum * v - Lr * dw, \\ w & = w + v. \end{aligned} \]

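Here momentum is the momentum coefficient and v is the velocity. The formula adds the previous update v, scaled by momentum, to the current step. When the current gradient step -Lr * dw points in the same direction as the previous update v, v accelerates the step; when it points in the opposite direction, v slows it down. Note that this textbook form folds Lr into v, like the second formulation in section 3.2, whereas PyTorch keeps lr out of the momentum buffer.

The corresponding code is: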

if momentum != 0:
    buf = momentum_buffer_list[i]

    if buf is None:
        buf = torch.clone(d_p).detach()
        momentum_buffer_list[i] = buf
    else:
        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

    if nesterov:
        d_p = d_p.add(buf, alpha=momentum)
    else:
        d_p = buf

0x04 Visualization

4.1 Open questions

So far a few questions are still unresolved. They are the items marked with ??? below.

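Build the optimizer from the model parameters
- The optimizer is constructed with optimizer = optim.SGD(params=net.parameters(), lr = 1), so params is stored in the optimizer's internal member variable (which we have now confirmed is param_groups).
- The model contains two fully connected Linear layers. How do these layers update their parameters ???
- The weight and bias inside Linear are both of type Parameter.
  - The requires_grad argument of the Parameter constructor defaults to True, which means a Parameter needs its gradient computed by default.
  - So the engine has to compute gradients for the weight and bias of Linear.
- ToyModel's _parameters member is accessed through the parameters method, which returns an Iterator.
  - This iterator is the argument used to build the SGD optimizer.
  - The SGD optimizer's param_groups now refers to the Parameter objects behind that iterator, which means the optimizer directly optimizes ToyModel's _parameters.
The engine computes the gradients
- How is it guaranteed that gradients can be computed for Linear?
  - weight and bias are both of type Parameter, which requires gradients by default.
- For the model, how are the computed gradients matched to the Linear parameters? Where do the gradients computed by the engine accumulate ???
The optimizer optimizes the parameters:
- step is called to optimize, and the optimization target is the optimizer's member variable self.param_groups.
- self.param_groups refers to the Parameter objects in ToyModel._parameters, which means the optimizer directly optimizes ToyModel's _parameters.
The optimizer updates the model:
- An update to the optimization target (self.param_groups) is applied directly to the model parameters (for example those of Linear).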

Printing outputs shows that its grad_fn actually has three next_functions, which means the earlier diagram was a simplification and we need a more detailed visualization.

outputs = {Tensor: 10} 
 T = {Tensor: 5} 
 data = {Tensor: 10} 
 device = {device} cpu
 dtype = {dtype} torch.float32
 grad = {NoneType} None
 grad_fn = {AddmmBackward} 
  metadata = {dict: 0} {}
  next_functions = {tuple: 3} 
   0 = {tuple: 2} (<AccumulateGrad object at 0x7f9c3e3bd588>, 0)
   1 = {tuple: 2} (<ReluBackward0 object at 0x7f9c3e5178d0>, 0)
   2 = {tuple: 2} (<TBackward object at 0x7f9c3e517908>, 0)
   __len__ = {int} 3
  requires_grad = {bool} True
 is_cuda = {bool} False
 is_leaf = {bool} False
 is_meta = {bool} False
 is_mkldnn = {bool} False
 is_mlc = {bool} False
 is_quantized = {bool} False
 is_sparse = {bool} False
 is_sparse_csr = {bool} False
 is_vulkan = {bool} False
 is_xpu = {bool} False
 layout = {layout} torch.strided
 name = {NoneType} None
 names = {tuple: 2} (None, None)
 ndim = {int} 2
 output_nr = {int} 0
 requires_grad = {bool} True
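The same information can be collected programmatically instead of through a debugger. The sketch below walks grad_fn.next_functions from the output and gathers every AccumulateGrad node. It assumes an nn.Sequential stand-in with the same layers as ToyModel; collect_nodes is a helper written for this illustration. Node class names such as AddmmBackward vary between PyTorch versions, so the sketch only relies on the AccumulateGrad name and its variable attribute.

```python
import torch
import torch.nn as nn

# Stand-in for ToyModel with the same layers (an assumption of this sketch).
net = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 5))
outputs = net(torch.randn(10, 10))

def collect_nodes(root):
    """Depth-first walk over grad_fn.next_functions, returning every node once."""
    seen, stack, order = set(), [root], []
    while stack:
        node = stack.pop()
        if node is None or node in seen:   # None marks inputs that need no gradient
            continue
        seen.add(node)
        order.append(node)
        stack.extend(fn for fn, _ in node.next_functions)
    return order

nodes = collect_nodes(outputs.grad_fn)
leaves = [n for n in nodes if type(n).__name__ == "AccumulateGrad"]
leaf_params = [n.variable for n in leaves]   # the leaf tensor each node accumulates into

print([type(n).__name__ for n in nodes])
print(len(leaves))
```

Each AccumulateGrad node points back at one weight or bias of the model, which is the link between the backward graph and the parameters that the following sections examine.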

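4.2 Visualizing the network with PyTorchViz

We use PyTorchViz to display the network.

First install the library:

 pip install torchviz

Then add the visualization code. We call make_dot() to obtain the graph object. After it runs, the data folder next to the script contains a .gv file and an image file: the .gv file is the Graphviz source for the graph, and the image is rendered from it in the format set by NetVis.format (bmp in the code below). By default, view() opens the rendered image automatically.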

import torch
import torch.nn as nn
import torch.optim as optim

from torchviz import make_dot

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

net = ToyModel()
print(net) # print the model structure as well
optimizer = optim.SGD(params=net.parameters(), lr = 1)
optimizer.zero_grad()
input = torch.randn(10,10)
outputs = net(input)
outputs.backward(outputs)
optimizer.step()

NetVis = make_dot(outputs, params=dict(list(net.named_parameters()) + [('x', input)]))
NetVis.format = "bmp" # output file format
NetVis.directory = "data" # folder where the files are written
NetVis.view() # render the files and open the image

Output:

ToyModel(
  (net1): Linear(in_features=10, out_features=10, bias=True)
  (relu): ReLU()
  (net2): Linear(in_features=10, out_features=5, bias=True)
)

The rendered graph is as follows:

We find that the earlier simplified diagram left out AccumulateGrad, which is a key step. We analyze it next.

0x05 AccumulateGrad

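5.1 Principle

Let us first summarize the relevant PyTorch background.

Conceptually, autograd records a computation graph. The nodes in the graph are of two kinds: leaf nodes and non-leaf nodes.

Nodes created by the user are called leaf nodes, for example:

a=torch.tensor([1.0])

Its runtime variables are: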
a = {Tensor: 1} tensor([1.])
 T = {Tensor: 1} tensor([1.])
 data = {Tensor: 1} tensor([1.])
 device = {device} cpu
 dtype = {dtype} torch.float32
 grad = {NoneType} None
 grad_fn = {NoneType} None
 is_cuda = {bool} False
 is_leaf = {bool} True
 requires_grad = {bool} False

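At this point, however, a cannot be differentiated. Only if requires_grad is set to True when the tensor is created does PyTorch know that it has to compute gradients for that tensor automatically.

a=torch.tensor([1.0], requires_grad = True)

Its runtime variables are: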
a = {Tensor: 1} tensor([1.], requires_grad=True)
 T = {Tensor: 1} tensor([1.], grad_fn=<PermuteBackward>)
 data = {Tensor: 1} tensor([1.])
 device = {device} cpu
 dtype = {dtype} torch.float32
 grad = {NoneType} None
 grad_fn = {NoneType} None
 is_cuda = {bool} False
 is_leaf = {bool} True
 requires_grad = {bool} True
 shape = {Size: 1} 1

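PyTorch records every operation applied to such a tensor and thereby builds a conceptual directed acyclic graph whose leaves are the input tensors of the model and whose roots are its output tensors. The user does not have to encode all execution paths of the graph, because what the user runs is exactly what the user later wants to differentiate. By tracing this graph from the roots to the leaves, gradients can be computed automatically with the chain rule.

Internally, autograd represents this graph as a graph of "Function" (or "Node") objects, which are the actual expressions and can be evaluated with their apply method.

During back-propagation, the autograd engine traces this graph starting from the root nodes (the output nodes of the forward pass), so that the chain rule can be used to compute the gradients of all leaf nodes. Every forward operation has a corresponding backward function, which computes the gradient of each variable.

In the backward graph, the backward function that corresponds to a leaf tensor requiring gradients is AccumulateGrad. Its gradient is accumulated: repeated backward passes keep adding to the gradient stored on that tensor. For example: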

a=torch.tensor([5.0], requires_grad = True)
b = torch.tensor([3.0], requires_grad = True)
c = a + b

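The corresponding graph is:

In our example, the Linear instances are defined explicitly by the user, so their weight and bias are all leaf nodes.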
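The a + b example can also be inspected in code. The sketch below is a minimal check, using only standard tensor attributes, that both inputs of the addition lead to AccumulateGrad nodes and that a second backward pass adds to the stored gradient instead of overwriting it.

```python
import torch

a = torch.tensor([5.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)
c = a + b

# The backward node of c points at one AccumulateGrad node per leaf input.
node_names = [type(fn).__name__ for fn, _ in c.grad_fn.next_functions]

c.backward(torch.ones_like(c), retain_graph=True)
first = a.grad.clone()      # gradient after one backward pass

c.backward(torch.ones_like(c))
second = a.grad.clone()     # the second pass is accumulated on top of the first

print(node_names, first, second)
```

This accumulation is the reason training loops call optimizer.zero_grad() before each backward pass.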

5.2 AccumulateGrad

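5.2.1 Definition

The definition is as follows. accumulateGrad essentially does two things:

- First it accumulates the gradient.
- Then it calls the update_grad function that was passed in to update the gradient.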

struct TORCH_API AccumulateGrad : public Node {
  explicit AccumulateGrad(Variable variable_);

  variable_list apply(variable_list&& grads) override;

  static at::Tensor callHooks(
      const Variable& variable,
      at::Tensor new_grad) {
    for (auto& hook : impl::hooks(variable)) {
      new_grad = (*hook)({new_grad})[0];
    }
    return new_grad;
  }

  template <typename T>
  static void accumulateGrad(
      const Variable& variable,
      at::Tensor& variable_grad,
      const at::Tensor& new_grad,
      size_t num_expected_refs,
      const T& update_grad) { // the gradient-update function passed in
    
    if (!variable_grad.defined()) {
      // omitted
    } else if (!GradMode::is_enabled()) {
      if (variable_grad.is_sparse() && !new_grad.is_sparse()) {
        auto result = new_grad + variable_grad;
        update_grad(std::move(result));
      } else if (!at::inplaceIsVmapCompatible(variable_grad, new_grad)) {
        auto result = variable_grad + new_grad;
        update_grad(std::move(result));
      } else {
        variable_grad += new_grad; // accumulate in place
      }
    } else {
      at::Tensor result;
      if (variable_grad.is_sparse() && !new_grad.is_sparse()) {
        // CPU backend throws an error on sparse + dense, so prefer dense + sparse here.
        result = new_grad + variable_grad; // accumulate
      } else {
        // Assumes operator+ result typically matches strides of first arg,
        // and hopes variable_grad was originally created obeying layout contract.
        result = variable_grad + new_grad; // accumulate
      }
      update_grad(std::move(result));
    }
  }

  Variable variable;
};

5.2.2 apply

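When apply is called, two points are worth noting:

- The update function passed in is { grad = std::move(grad_update); }, which updates the gradient.
- mutable_grad returns the gradient member variable of the tensor.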

Tensor& mutable_grad() const {
  return impl_->mutable_grad();
}

/// Accesses the gradient `Variable` of this `Variable`.
Variable& mutable_grad() override {
  return grad_;
}

The code is as follows:

auto AccumulateGrad::apply(variable_list&& grads) -> variable_list {
  check_input_variables("AccumulateGrad", grads, 1, 0);

  if (!grads[0].defined())
    return {};
  if (variable.grad_fn())
    throw std::logic_error(
        "leaf variable has been moved into the graph interior");
  if (!variable.requires_grad())
    return {};

  at::Tensor new_grad = callHooks(variable, std::move(grads[0]));
  std::lock_guard<std::mutex> lock(mutex_);
  
  at::Tensor& grad = variable.mutable_grad(); // get the variable's mutable_grad

  accumulateGrad(
      variable,
      grad,
      new_grad,
      1 + !post_hooks().empty() /* num_expected_refs */,
      [&grad](at::Tensor&& grad_update) { grad = std::move(grad_update); });

  return variable_list();
}

The flow is as follows:

AccumulateGrad                                 Tensor           AutogradMeta
     +                                           +                   +
     |                                           |                   |
     |                                           |                   |
     |                                           |                   |
     v                                           |                   |
   apply(update_grad)                            |                   |
     +                                           |                   |
     |                                           |                   |
     |                                           |                   |
     |                                           |                   |
     v                                           |                   |
accumulateGrad                                   |                   |
     +                                           |                   |
     |                                           |                   |
     | result = variable_grad + new_grad         |                   |
     |                                           |                   |
     v                result                     v                   v
 update_grad +---------------------------->  mutable_grad +--->    grad_

Or, put differently: for a leaf tensor, the backward pass calls AccumulateGrad to accumulate the gradient and then writes it into the grad_ of the leaf tensor:

+----------------------------------------------+          +-------------------------+
|Tensor                                        |          |TensorImpl               |
|                                              |          |                         |
|                                              |  bridge  |                         |
|   <TensorImpl, UndefinedTensorImpl> impl_ +-----------> |    autograd_meta_ +---------+
|                                              |          |                         |   |
|                                              |          |                         |   |
+----------------------------------------------+          +-------------------------+   |
                                                                                        |
                                                                                        |
                                                                                        |
+-------------------------+                                                             |
| AutogradMeta            | <-----------------------------------------------------------+
|                         |
|                         |
|                         |            +------------------------------------------------+
|                         |            | AccumulateGrad                                 |
|      grad_fn_ +--------------------> |                                                |
|                         |            |                                                |
|                         |            |      apply(grads) {                            |
|                         |            |                                                |
|      grad_accumulator_  |            |         accumulateGrad(new_grad) {             |
|                         |            |                                                |
|                         |            |           result = variable_grad + new_grad    |
|                         |   update   |                                                |
|      grad_    <--------------------------------+ update_grad(result)                  |
|                         |            |                                                |
|                         |            |         }                                      |
|                         |            |      }                                         |
|                         |            |                                                |
|                         |            |                                                |
+-------------------------+            +------------------------------------------------+
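The cooperation between apply, accumulateGrad and the update_grad callback can be mimicked with a toy Python analogue. This is only a sketch of the pattern, not the C++ implementation: ToyLeaf and ToyAccumulateGrad are hypothetical classes, gradients are plain floats, and the branch for an undefined gradient (omitted in the listing above) is assumed here to simply store the incoming gradient.

```python
class ToyLeaf:
    """Stand-in for a leaf tensor: holds a value and an optional gradient."""
    def __init__(self, value):
        self.value = value
        self.grad = None   # plays the role of AutogradMeta::grad_

class ToyAccumulateGrad:
    """Stand-in for the AccumulateGrad node attached to one leaf."""
    def __init__(self, variable):
        self.variable = variable

    @staticmethod
    def accumulate_grad(variable_grad, new_grad, update_grad):
        if variable_grad is None:
            update_grad(new_grad)                  # assumed: first gradient is stored as is
        else:
            update_grad(variable_grad + new_grad)  # later gradients are accumulated

    def apply(self, new_grad):
        def update_grad(result):
            self.variable.grad = result            # mirrors grad = std::move(grad_update)
        self.accumulate_grad(self.variable.grad, new_grad, update_grad)

leaf = ToyLeaf(5.0)
node = ToyAccumulateGrad(leaf)
node.apply(1.0)
node.apply(0.5)
print(leaf.grad)
```

Passing the update as a callback keeps the accumulation logic independent of where the result is stored, which is the role update_grad plays in the template above.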

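We now know that gradients are accumulated in the grad_ of the leaf nodes. But how do these gradients update the model parameters?

5.3 Combining with the optimizer

Let us return to the step function of SGD and keep only the key parts. It reads the gradients of the model parameters and then updates the parameters.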

@torch.no_grad()
def step(self, closure=None):

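    # recompute the loss with closure (omitted)

    # update the variables using the computed gradients
    # self.param_groups holds the parameters we passed in
    for group in self.param_groups: # each group is a dict holding the parameters and the settings that group needs

        for p in group['params']: # iterate over every parameter of this group that needs updating
            if p.grad is not None: # the gradient of the model parameter is available
                params_with_grad.append(p) # this parameter will be optimized with its gradient
                d_p_list.append(p.grad)

                # momentum handling (omitted)

        F.sgd(params_with_grad, # updates the parameters of this group with param.add_(d_p, alpha=-lr), i.e. w.data -= w.grad*lr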
              d_p_list,
              momentum_buffer_list,
              weight_decay=weight_decay,
              momentum=momentum,
              lr=lr,
              dampening=dampening,
              nesterov=nesterov) 

        # update momentum_buffers in state

    return loss
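The whole chain (backward fills p.grad, step reads it and updates the parameter in place) can be verified end to end. The sketch below uses a single Linear layer and plain SGD without momentum; the layer, the random input and the variable names are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(10, 5)
optimizer = optim.SGD(layer.parameters(), lr=0.1)

weight_object = layer.weight                  # keep a handle to check identity later
weight_before = layer.weight.detach().clone()

optimizer.zero_grad()
layer(torch.randn(4, 10)).sum().backward()    # AccumulateGrad fills layer.weight.grad
grad = layer.weight.grad.clone()

optimizer.step()                              # param.add_(d_p, alpha=-lr) on the same tensor

expected = weight_before - 0.1 * grad
matches = torch.allclose(layer.weight.detach(), expected)
same_tensor = layer.weight is weight_object
print(matches, same_tensor)
```

The parameter object is never replaced; only its values change, which is why the model sees the update immediately.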

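0x06 Summary

We summarize in the order: build the optimizer from the model parameters ---> the engine computes the gradients ---> the optimizer optimizes the parameters ---> the optimizer updates the model.

Build the optimizer from the model parameters
- The optimizer is constructed with optimizer = optim.SGD(params=net.parameters(), lr = 1), so params is stored in the optimizer's internal member variable param_groups.
- The model contains two Linear layers. How do these layers update their parameters?
  - The weight and bias inside Linear are both of type Parameter.
    - The requires_grad argument of the Parameter constructor defaults to True, which means a Parameter needs its gradient computed by default.
    - So the engine has to compute gradients for the weight and bias of Linear.
    - weight and bias are added to ToyModel's _parameters member variable.
  - ToyModel's _parameters member is accessed through the parameters method, which returns an Iterator.
    - This iterator is the argument used to build the SGD optimizer.
    - The SGD optimizer's param_groups now refers to the Parameter objects in ToyModel._parameters, which means the optimizer directly optimizes ToyModel's _parameters.
  - So the optimizer directly optimizes and updates the weight and bias of Linear. An optimizer is just a piece of code; what it optimizes has to be specified when it is built. It can optimize the parameters of a model, or any other variables the user specifies.
The engine computes the gradients
- How is it guaranteed that gradients can be computed for Linear?
  - weight and bias are both of type Parameter, which requires gradients by default.
  - So the gradients of weight and bias are computed.
- For the model, how are the computed gradients matched to the Linear parameters? Where do the gradients computed by the engine accumulate?
  - In our example, the Linear instances are defined explicitly by the user, so their weight and bias are all leaf nodes.
  - Leaf nodes accumulate their gradients into autograd_meta_.grad_ of the model parameter tensor through AccumulateGrad.
The optimizer optimizes the parameters:
- step is called to optimize, and the optimization target is the optimizer's member variable self.param_groups.
- self.param_groups refers to the Parameter objects in ToyModel._parameters, which means the optimizer directly optimizes ToyModel's _parameters.
The optimizer updates the model:
- An update to the optimization target (self.param_groups) is applied directly to the model parameters (for example the weight and bias of Linear).

The overall picture is: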

+---------------------------------------------------------------------+
| ToyModel                                                            |
|  +---------------------------------+                 +------------+ |                   +------------------+
|  | Linear(10, 10)                  +------> ReLU +-->+Linear(10,5)| |                   | Engine           |
|  |                                 |                 |            | |forward / backward |                  |
|  |  weight=Parameter               |                 |    weight  | +-----------------> | Compute gradient |
|  |                                 +---------------+ |    bias    | |                   |        +         |
|  |  +----------------------------+ |               | +--+---------+ |                   |        |         |
|  |  | bias=Parameter             | |               |    |           |                   |        |         |
|  |  |                            | |               |    |           |                   +------------------+
|  |  |                            | |               |    |           |  3 accumulate              |
|  |  |    autograd_meta_.grad_ <----------------------------------------------------+           2 | gradient
|  |  |                            | |               |    |           |              |             |
|  |  |    data                    | |               |    |           |              |             v
|  |  |                            | |               v    v           |              |
|  |  |                            | |        self._parameters        |              |    +------------------+
|  |  +----------------------------+ |                 +              |              |    | AccumulateGrad   |
|  +---------------------------------+                 |              |              |    |                  |
|                                                      |              |              |    |                  |
|                                                      v              |  5 update    -----------+ apply()    |
|                                  para_iterator = parameters()  <----------------+       |                  |
|                                            +                        |           |       |                  |
|                                            |                        |           |       +------------------+
|                                            |                        |           |
+---------------------------------------------------------------------+           |
                                           1 |                                    |
                                             |                                    |
              +---------------------------------------------------------------------------+
              | SGD                          |                                    |       |
              |                              |                                    |       |
              |                              v                                    +       |
              |                                                                 4 step()  |
      ^-------------> self.param_groups = para_iterator(ToyModel._parameters) +---------------->
      |       |                                                                           |    |
      |       |                                                                           |    |
      |       +---------------------------------------------------------------------------+    |
      |                                                                                        |
      <--------------------------------------------------------------------------------------+ v
