[Source Code Analysis] PyTorch Distributed (2) ----- DataParallel (Part 1)

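Transfer the minibatch from page-locked memory to GPU 0 (the master). The master GPU also holds the model; the other GPUs hold stale copies of it.
Scatter the minibatch across the GPUs: the input minibatch is split into roughly equal parts, and each part is sent to its GPU for computation.
Replicate the model across the GPUs. All data associated with the Module is copied as well.
Run the forward pass on every GPU and compute the outputs. PyTorch uses multiple threads here: each GPU runs forward on its own input in a separate thread, independently and in parallel.
Gather the outputs on the master GPU and compute the loss, i.e. compare the network output with the ground-truth label of each element in the batch.
Scatter the loss across the GPUs and run the backward pass on every GPU to compute the parameter gradients.
Reduce the gradients on GPU 0.
Update the model parameters.
- Run gradient descent and update the model parameters on the master GPU.
- Because the parameters are updated only on the master GPU, the other GPUs are out of date at this point, so the updated parameters must be copied to them before the next parallel step.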
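
The eight steps above can be imitated without any GPU. The following is a minimal pure-Python sketch, assuming a one-parameter model y = w * x with a summed squared-error loss. The helper names `scatter` and `shard_gradient` are hypothetical and only mimic the roles of the real PyTorch primitives. It illustrates why reducing the per-shard gradients on the master gives the same result as a full-batch gradient.

```python
def scatter(batch, num_devices):
    """Split a batch (a list) into at most num_devices contiguous shards."""
    shard_size = -(-len(batch) // num_devices)  # ceiling division
    return [batch[i:i + shard_size] for i in range(0, len(batch), shard_size)]

def shard_gradient(w, shard):
    """d/dw of sum((w*x - y)^2) over one shard, i.e. sum(2*x*(w*x - y))."""
    return sum(2 * x * (w * x - y) for x, y in shard)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
w_master = 0.5   # the parameter held on the "master"
lr = 0.01

shards = scatter(batch, num_devices=2)     # scatter the minibatch
replicas = [w_master for _ in shards]      # replicate the model
grads = [shard_gradient(w, s)              # forward and backward per "device"
         for w, s in zip(replicas, shards)]
total_grad = sum(grads)                    # reduce the gradients on the master
full_grad = shard_gradient(w_master, batch)  # reference: no sharding at all
w_master -= lr * total_grad                # update only the master copy

print(total_grad, full_grad)
```

Note that the equality relies on the loss being a sum over samples; with a mean-reduced loss the per-shard gradients would have to be weighted by shard size.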

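1.2 From the pattern perspective

First, a technical summary from the pattern perspective:

DP can be seen as an application of the parameter-server pattern.
DDP can be seen as an application of collective communication.

A parameter server roughly consists of a master and workers. Because DP is single-machine multi-GPU, the roles map as follows:

worker: every GPU (including GPU 0) is a worker and is responsible for computation and for training the network.
master: GPU 0 (not the physical GPU index, but the first entry of the device_ids argument) additionally aggregates the gradients and updates the parameters.

So we focus on GPU 0.

By default, DataParallel places the network on GPU 0 and copies the model from GPU 0 to the other GPUs. The GPUs then train in parallel, GPU 0 acts as the master that aggregates the gradients and updates the model, and finally it hands the next round of computation to the other GPUs. This closely resembles the parameter-server mechanism.

The official diagram conveys the same information.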

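1.3 From the operating-system perspective

From the operating-system perspective, DP and DDP differ as follows (a spoiler for what comes later):

DataParallel is single-process, multi-threaded parallel training, and it can only run on a single machine.
DistributedDataParallel is multi-process and works for both single-machine and multi-machine training. It also replicates the model up front instead of on every iteration, and it avoids contention on the global interpreter lock (GIL).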
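
To make "single process, multiple threads" concrete, here is a small sketch using only the standard library. It assumes one worker thread per device and uses `ThreadPoolExecutor` for brevity; PyTorch's own parallel_apply manages its threads itself, so this only illustrates the shape of the idea. The function `forward_on_device` and the integer "model" are hypothetical stand-ins.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def forward_on_device(device_id, replica_weight, shard):
    """Stand-in for one replica's forward pass on one device."""
    outputs = [replica_weight * x for x in shard]
    return device_id, threading.current_thread().name, outputs

shards = [[1, 2, 3], [4, 5, 6]]   # one shard per hypothetical device
weight = 10                        # the "model", replicated to each worker

# One process, one worker thread per device.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    futures = [pool.submit(forward_on_device, dev, weight, shard)
               for dev, shard in enumerate(shards)]
    results = [f.result() for f in futures]   # collected in device order

# Gather: concatenate the per-device outputs in device order.
gathered = [y for _, _, outputs in results for y in outputs]
print(gathered)  # [10, 20, 30, 40, 50, 60]
```

Because all threads live in one Python process, any Python-level work in them contends for the GIL, which is the cost DDP avoids by using separate processes.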

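1.4 Inefficiencies

DP has the following drawbacks:

Redundant data copies.
- Data is first copied from the host to the master GPU, and then the sub-minibatches are scattered to the other GPUs.
The model must be replicated across the GPUs before every forward pass.
- Because the model parameters are updated on the master GPU, the model has to be re-synchronized at the start of every forward pass.
Thread creation/destruction overhead for every batch.
- The parallel forward pass is implemented with multiple threads (this may just be a PyTorch issue).
A missed opportunity to pipeline gradient reduction.
- In the PyTorch 1.0.1 data-parallel implementation, gradient reduction happens at the end of the backward pass; it could be pipelined with the backward pass.
Model outputs are gathered on the master GPU unnecessarily.
Uneven GPU utilization and load imbalance. The master GPU's memory use and utilization are higher than the other cards' because:
- The loss is computed on the master GPU.
- Gradient reduction and the parameter update both happen on the master GPU.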

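0x02 Overview

2.1 Example

We walk through an example. The logic is:

Set the GPUs visible to this program.
- The code uses args.gpu_id="2,7" and os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id to configure the GPU indices. The effect is os.environ['CUDA_VISIBLE_DEVICES'] = "2,7", so device_ids[0] corresponds to physical card 2 and device_ids[1] corresponds to physical card 7.
- It can also be specified at launch time, for example: CUDA_VISIBLE_DEVICES='2,7' python train.py.
Put the model parameters and buffers on device_ids[0]. Before a DataParallel module runs, the parallelized module must have its parameters and buffers on device_ids[0].
- The code is model=model.cuda().
Build the DP model. The advantage of DP is ease of use: wrapping the original single-GPU module with DP is enough to make it multi-GPU.
- The code is model=torch.nn.DataParallel(model).
- DP is itself a PyTorch nn.Module that wraps the original model, so the original model has to be reached through .module (for example model.module).
Load the data onto the master GPU.
- data,label= data.cuda(),label.cuda()
Run the forward pass.
- DP replicates the module onto every device.
- DP splits the input into smaller chunks and distributes them to the GPUs; each replica only processes the data assigned to it.
Run the backward pass.
- DP accumulates the gradients computed on each GPU into GPU 0.

The code is as follows: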

args.gpu_id = "2,7"  # physical GPU ids to use
# Configure the environment before the first CUDA call. It can also be set at launch
# time, for example: CUDA_VISIBLE_DEVICES='2,7' python train.py
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id  # the value must be a string
args.cuda = not args.no_cuda and torch.cuda.is_available()  # whether to use CUDA
device_ids = list(range(torch.cuda.device_count()))  # torch.cuda.device_count() == 2
# device_ids = [0, 1] also works. Here 0 is physical GPU 2 (the master) and 1 is
# physical GPU 7; the master distributes the model and the data

if args.cuda:
    model = model.cuda()  # move the model to the default device cuda:0, i.e. physical GPU 2
if len(device_ids) > 1:
    model = torch.nn.DataParallel(model)  # build DP; the model must already be on the GPU

optimizer = torch.optim.SGD(model.parameters(), args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay)

# The input data also needs cuda(), which copies it to the master GPU
for batch_idx, (data, label) in pbar:
    if args.cuda:
        data, label = data.cuda(), label.cuda()  # data is placed on the default GPU
    data_v = Variable(data)
    target_var = Variable(label)
    prediction = model(data_v, target_var, args)
    # prediction has been gathered from the two GPUs; the explicit parallel computation happens only in forward
    # In forward, each GPU processes batch_size/len(device_ids) samples; the results are then gathered on the master GPU
    # len(prediction) equals batch_size
    criterion = nn.CrossEntropyLoss()
    loss = criterion(prediction, target_var)  # the loss is computed on the default GPU
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
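
The relationship between the logical indices in device_ids and the physical cards can be spelled out with a tiny helper. This is a sketch under the assumption that CUDA_VISIBLE_DEVICES holds a comma-separated list of integer ids (the variable can also hold other identifiers, which are not handled here). `logical_to_physical` is a hypothetical helper, not part of PyTorch.

```python
def logical_to_physical(cuda_visible_devices):
    """Map logical device indices (what the process sees) to physical GPU ids."""
    physical_ids = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return dict(enumerate(physical_ids))

mapping = logical_to_physical("2,7")
print(mapping)  # {0: 2, 1: 7}
```

With this mapping, device_ids[0] == 0 refers to physical card 2, which is why the master in the example is card 2 and not card 0.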

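2.2 Background

Before each forward pass, DP broadcasts the parameters and buffers on the master to the other devices to keep their state consistent. The relevant background is mainly how a model is copied to the GPU and how GPU kernels are invoked; see the earlier article [Source Code Analysis] How PyTorch Uses the GPU.

0x03 Definition

3.1 Definition

We look at the structure of DataParallel through its initializer.

Three input arguments of __init__ are defined as follows:

module: the model.
device_ids: the devices to train on.
output_device: the device that stores the output. The default is device_ids[0], i.e. the first card.

The code is as follows: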

import operator
import torch
import warnings
from itertools import chain
from ..modules import Module
from .scatter_gather import scatter_kwargs, gather
from .replicate import replicate
from .parallel_apply import parallel_apply
from torch._utils import (
    _get_all_device_indices,
    _get_available_device_type,
    _get_device_index,
    _get_devices_properties
)

class DataParallel(Module):

    # TODO: update notes/cuda.rst when this class handles 8+ GPUs well

    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super(DataParallel, self).__init__()

        # Get the available device type (None if there is no GPU)
        device_type = _get_available_device_type()
        if device_type is None:
            self.module = module
            self.device_ids = []
            return

        # If device_ids is not given, use all visible GPUs
        if device_ids is None:
            device_ids = _get_all_device_indices()

        # Use the first GPU in the list as the output device; it also acts as the master
        if output_device is None:
            output_device = device_ids[0]

        self.dim = dim
        self.module = module
        self.device_ids = [_get_device_index(x, True) for x in device_ids]
        self.output_device = _get_device_index(output_device, True)
        self.src_device_obj = torch.device(device_type, self.device_ids[0])

        # Check that the devices are balanced
        _check_balance(self.device_ids)

        # With a single device, move the module there directly
        if len(self.device_ids) == 1:
            self.module.to(self.src_device_obj)

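3.2 Load balancing

Although the input is split evenly and distributed in parallel, the outputs are gathered and the loss is computed on the first GPU every time, so the first GPU's memory load and utilization are higher than the other cards'.

_check_balance checks whether the GPUs are balanced: it warns if, for either total memory or multiprocessor count, min/max < 0.75.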

def _check_balance(device_ids):
    imbalance_warn = """
    There is an imbalance between your GPUs. You may want to exclude GPU {} which
    has less than 75% of the memory or cores of GPU {}. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable."""
    device_ids = [_get_device_index(x, True) for x in device_ids]
    dev_props = _get_devices_properties(device_ids)

    def warn_imbalance(get_prop):
        values = [get_prop(props) for props in dev_props]
        min_pos, min_val = min(enumerate(values), key=operator.itemgetter(1))
        max_pos, max_val = max(enumerate(values), key=operator.itemgetter(1))
        if min_val / max_val < 0.75:
            warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
            return True
        return False

    if warn_imbalance(lambda props: props.total_memory):
        return
    if warn_imbalance(lambda props: props.multi_processor_count):
        return
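
The comparison inside warn_imbalance can be tried on plain numbers. The sketch below assumes the property values are already available as a list (here hypothetical memory sizes in GiB) and uses a hypothetical helper name, `find_imbalance`; it reproduces only the min/max < 0.75 test, not the warning itself.

```python
def find_imbalance(values, threshold=0.75):
    """Return (min_pos, max_pos) if min/max < threshold, else None."""
    min_pos = min(range(len(values)), key=values.__getitem__)
    max_pos = max(range(len(values)), key=values.__getitem__)
    if values[min_pos] / values[max_pos] < threshold:
        return min_pos, max_pos
    return None

# Hypothetical memory sizes in GiB for three GPUs.
print(find_imbalance([24, 24, 11]))   # (2, 0): 11/24 is below 0.75
print(find_imbalance([24, 24, 20]))   # None: 20/24 is about 0.83
```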

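0x04 Forward propagation

In DataParallel, the explicit parallel computation exists only in the forward pass.

4.1 Overview

In the earlier example, cuda() already placed the model on GPU[0], so GPU[0] already holds the model's parameters and buffers.

model=model.cuda()

So forward does not need to repeat this step. It starts from distributing the model and the data. Note that the model is distributed on every forward pass. The steps are:

Validate: iterate over the module's parameters and buffers and check that they are all on GPU[0]; raise an error otherwise.
Scatter the input: split the input along dimension dim (0 by default, usually the batch dimension) and send the pieces to the GPUs.
Replicate the model: copy the model to each GPU.
Parallel apply (parallel_apply): run the forward pass on the replicas in parallel. Because the replica on device_ids[0] shares storage with the base parallelized module, in-place updates on device[0] are preserved; those on the other GPUs are not.
Gather: collect the results sent back from the GPUs.

The code is as follows: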

    def forward(self, *inputs, **kwargs):
        
        with torch.autograd.profiler.record_function("DataParallel.forward"):
            # Without any GPU, run the module directly on the CPU
            if not self.device_ids:
                return self.module(*inputs, **kwargs)
        
            # Check that all parameters and buffers are on GPU[0]; raise an error otherwise
            for t in chain(self.module.parameters(), self.module.buffers()):
                if t.device != self.src_device_obj:
                    raise RuntimeError("module must have its parameters and buffers "
                                       "on device {} (device_ids[0]) but found one of "
                                       "them on device: {}".format(self.src_device_obj, t.device))

            # The model is on GPU[0], so the forward pass can start

            # First scatter the inputs
            inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
            # for forward function without any inputs, empty list and dict will be created
            # so the module can be executed on one device which is the first one in device_ids
            if not inputs and not kwargs:
                inputs = ((),)
                kwargs = ({},)

            # With a single device, call the module directly
            if len(self.device_ids) == 1:
                return self.module(*inputs[0], **kwargs[0])
            
            # Replicate the model
            replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
            # Run the replicas in parallel
            outputs = self.parallel_apply(replicas, inputs, kwargs)
            # Gather the forward results on the master (output_device)
            return self.gather(outputs, self.output_device)

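4.2 Scatter (input)

In the code above, the following statement scatters the data.

inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

In the propagation diagram, this corresponds to the scatter step.

So we first look at how scattering works.

scatter is just a wrapper around scatter_kwargs, so we go straight to scatter_kwargs.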

    def scatter(self, inputs, kwargs, device_ids):
        return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)

4.2.1 scatter_kwargs

scatter_kwargs calls scatter on inputs and kwargs separately.

def scatter_kwargs(inputs, kwargs, target_gpus, dim=0):
    r"""Scatter with support for kwargs dictionary"""
    # Scatter the positional inputs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
    # Scatter kwargs
    kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []

    # Pad with empty items so that inputs and kwargs have the same length
    if len(inputs) < len(kwargs):
        inputs.extend([() for _ in range(len(kwargs) - len(inputs))])
    elif len(kwargs) < len(inputs):
        kwargs.extend([{} for _ in range(len(inputs) - len(kwargs))])
    # Return tuples
    inputs = tuple(inputs)
    kwargs = tuple(kwargs)
    return inputs, kwargs
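
The padding step matters when a module is called with only positional inputs or only keyword arguments. The following sketch isolates that step on plain Python lists; `pad_to_same_length` is a hypothetical helper that mirrors the two `extend` branches above, and the string placeholders stand in for per-device tensors.

```python
def pad_to_same_length(inputs, kwargs):
    """Pad the shorter of two per-device lists with empty tuples / dicts."""
    inputs, kwargs = list(inputs), list(kwargs)
    if len(inputs) < len(kwargs):
        inputs.extend(() for _ in range(len(kwargs) - len(inputs)))
    elif len(kwargs) < len(inputs):
        kwargs.extend({} for _ in range(len(inputs) - len(kwargs)))
    return tuple(inputs), tuple(kwargs)

# Positional inputs were scattered to two devices, but no kwargs were passed.
inputs, kwargs = pad_to_same_length([("x0",), ("x1",)], [])
print(inputs)  # (('x0',), ('x1',))
print(kwargs)  # ({}, {})
```

After padding, entry i of each tuple can be passed to the replica on device i as `module(*inputs[i], **kwargs[i])`.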

4.2.2 scatter

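As the docstring says, tensors are sliced into approximately equal chunks and distributed across the given GPUs, i.e. a batch is split into smaller batches of roughly equal size. Other types are handled according to their type; for containers, scatter_map recurses into their elements.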

def scatter(inputs, target_gpus, dim=0):
    r"""
    Slices tensors into approximately equal chunks and
    distributes them across given GPUs. Duplicates
    references to objects that are not tensors.
    """
    def scatter_map(obj):
        if isinstance(obj, torch.Tensor):
            # Tensors are handled by Scatter.apply
            return Scatter.apply(target_gpus, None, dim, obj)
        if is_namedtuple(obj):
            # Recurse into the fields with scatter_map
            return [type(obj)(*args) for args in zip(*map(scatter_map, obj))]
        if isinstance(obj, tuple) and len(obj) > 0:
            # Recurse into the elements with scatter_map
            return list(zip(*map(scatter_map, obj)))
        if isinstance(obj, list) and len(obj) > 0:
            # Recurse into the elements with scatter_map
            return [list(i) for i in zip(*map(scatter_map, obj))]
        if isinstance(obj, dict) and len(obj) > 0:
            # Recurse into the (key, value) items with scatter_map
            return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
        return [obj for targets in target_gpus]

    # After scatter_map is called, a scatter_map cell will exist. This cell
    # has a reference to the actual function scatter_map, which has references
    # to a closure that has a reference to the scatter_map cell (because the
    # fn is recursive). To avoid this reference cycle, we set the function to
    # None, clearing the cell
    try:
        res = scatter_map(inputs)
    finally:
        scatter_map = None
    return res
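
The recursion in scatter_map is easier to follow on a toy input. The sketch below assumes a hypothetical `FakeTensor` class (a list of rows) in place of torch.Tensor and replaces Scatter.apply with simple list slicing; the branches for tuple, list, dict and "everything else" follow the code above, while the namedtuple branch is left out for brevity.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: just wraps a list of rows."""
    def __init__(self, rows):
        self.rows = list(rows)

    def __repr__(self):
        return f"FakeTensor({self.rows})"

def scatter(obj, num_devices):
    def split(t):
        size = -(-len(t.rows) // num_devices)  # ceiling division
        return [FakeTensor(t.rows[i:i + size]) for i in range(0, len(t.rows), size)]

    def scatter_map(o):
        if isinstance(o, FakeTensor):
            return split(o)
        if isinstance(o, tuple) and len(o) > 0:
            return list(zip(*map(scatter_map, o)))
        if isinstance(o, list) and len(o) > 0:
            return [list(i) for i in zip(*map(scatter_map, o))]
        if isinstance(o, dict) and len(o) > 0:
            return [type(o)(i) for i in zip(*map(scatter_map, o.items()))]
        return [o for _ in range(num_devices)]  # non-tensors are duplicated

    return scatter_map(obj)

inputs = (FakeTensor([1, 2, 3, 4]), {"mask": FakeTensor([5, 6, 7, 8]), "flag": True})
per_device = scatter(inputs, num_devices=2)
for device, args in enumerate(per_device):
    print(device, args)
```

Each entry of `per_device` has the same nested shape as the original input: tensors are replaced by their slice, and non-tensor values such as the flag are repeated for every device.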

4.2.3 Scatter

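We mentioned that Scatter.apply handles tensors, so let us look at it next. Scatter subclasses Function, and its logic is:

If CUDA is available and the input is on the CPU, get a list of streams so that the CPU-to-GPU copies can run on background streams.
Call comm.scatter to distribute the data.
Call wait_stream and record_stream to synchronize with the copy streams.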

class Scatter(Function):

    @staticmethod
    def forward(ctx, target_gpus, chunk_sizes, dim, input):
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.dim = dim
        ctx.input_device = input.get_device() if input.device.type != "cpu" else -1
        streams = None
        # Only when CUDA is available and the input is on the CPU
        if torch.cuda.is_available() and ctx.input_device == -1:
            # Perform CPU to GPU copies in a background stream
            streams = [_get_stream(device) for device in target_gpus]
            
        # Call into C++ to do the scatter
        outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
        # Synchronize with the copy stream
        if streams is not None:
            for i, output in enumerate(outputs):
                with torch.cuda.device(target_gpus[i]):
                    main_stream = torch.cuda.current_stream()
                    main_stream.wait_stream(streams[i]) # wait for the copy stream
                    output.record_stream(main_stream) # mark output as used on main_stream
        return outputs

    @staticmethod
    def backward(ctx, *grad_output):
        return None, None, None, Gather.apply(ctx.input_device, ctx.dim, *grad_output)

4.2.4 comm.scatter

This function mainly calls torch._C._scatter, which takes us into C++.

def scatter(tensor, devices=None, chunk_sizes=None, dim=0, streams=None, *, out=None):
    """Scatters tensor across multiple GPUs. """
    tensor = _handle_complex(tensor)
    if out is None:
        devices = [_get_device_index(d) for d in devices]
        return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
    else:
        return tuple(torch._C._scatter_out(tensor, out, dim, streams))

4.2.5 C++

In the Python binding file, we can see that scatter is the function we want to analyze.

      .def(
          "_scatter",
          [](at::Tensor& tensor,
             std::vector<int64_t>& devices,
             c10::optional<std::vector<int64_t>> chunk_sizes,
             int64_t dim,
             c10::optional<py::object> py_streams) {
            c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>>> streams;
            if (py_streams) {
              py::handle handle = *py_streams;
              streams = THPUtils_PySequence_to_CUDAStreamList(handle.ptr());
            }
            // Note: We're holding the GIL up to here.
            pybind11::gil_scoped_release no_gil;
            // This is the call we care about
            return scatter(tensor, devices, chunk_sizes, dim, streams);
          },
          py::arg("tensor"),
          py::arg("devices"),
          py::arg("chunk_sizes"),
          py::arg("dim"),
          py::arg("streams"))

Inside scatter, the data is distributed to the GPUs as follows:

First, call split_with_sizes or chunk to split the tensor into chunks.
Then distribute the chunks to the GPUs, using to() for the copy.

std::vector<at::Tensor> scatter(
    const at::Tensor& tensor,
    at::IntArrayRef devices,
    const c10::optional<std::vector<int64_t>>& chunk_sizes,
    int64_t dim,
    const c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>>>&
        streams) {

  dim = at::maybe_wrap_dim(dim, tensor);
    
  // First split the tensor into chunks
  std::vector<at::Tensor> chunks = chunk_sizes
      ? tensor.split_with_sizes(/*split_sizes=*/*chunk_sizes, /*dim=*/dim)
      : tensor.chunk(/*chunks=*/devices.size(), /*dim=*/dim);
  at::cuda::OptionalCUDAStreamGuard cuda_guard;
    
  // Then distribute the chunks to the GPUs
  for (size_t i = 0; i < chunks.size(); ++i) {
    const auto device_index = static_cast<int16_t>(devices[i]);
    if (device_index != tensor.get_device()) {
      if (i < (streams ? streams->size() : 0U) && (*streams)[i]) {
        cuda_guard.reset_stream(*(*streams)[i]);
      }
      chunks[i] = chunks[i].to( // copy to the target device
          {DeviceType::CUDA, device_index},
          /*non_blocking=*/true,
          /*copy=*/false,
          /*memory_format=*/at::MemoryFormat::Preserve);
    }
  }
  return chunks; // return the result
}
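
The chunk call above decides how many rows each GPU receives. Here is a small sketch in plain Python, assuming the usual chunk rule that every chunk has ceil(batch/devices) rows except possibly the last one. `chunk_sizes` is a hypothetical helper, not a PyTorch function, and the empty-batch case is deliberately not modelled.

```python
def chunk_sizes(batch_size, num_devices):
    """Sizes produced by chunk-style splitting: ceil(batch/devices) per chunk."""
    if batch_size == 0:
        return []  # simplification: the empty-batch case is not modelled here
    split = -(-batch_size // num_devices)  # ceiling division
    full, rest = divmod(batch_size, split)
    return [split] * full + ([rest] if rest else [])

print(chunk_sizes(8, 2))   # [4, 4]
print(chunk_sizes(5, 3))   # [2, 2, 1]
print(chunk_sizes(6, 4))   # [2, 2, 2], so only three devices receive data
```

The last case shows why forward passes `self.device_ids[:len(inputs)]` to replicate: when fewer chunks than devices are produced, the model only needs to be replicated onto the devices that actually got input.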

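4.3 Replicate (model)

So far, the Scatter function has split the data and copied it from device[0] to the different cards. Next, the Replicate function copies the model from device[0] to the different cards.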

        # Replicate the model
        replicas = self.replicate(self.module, self.device_ids[:len(inputs)])

In the propagation diagram, this corresponds to the replicate step.

replicate only forwards the call, so we keep following it.

def replicate(self, module, device_ids):
    return replicate(module, device_ids, not torch.is_grad_enabled())

4.3.1 replicate

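The logic of replicate is:

Use _replicatable_module to check whether the model can be replicated safely.
Count the GPUs to determine how many replicas are needed.
Copy.
- Copy the parameters.
  - Use _broadcast_coalesced_reshape to copy the parameters to each GPU.
- Copy the buffers.
  - First collect the buffers.
  - Record the indices of the buffers that require gradients.
  - Record the indices of the buffers that do not require gradients.
  - Copy both kinds of buffers to each GPU with _broadcast_coalesced_reshape.
- Copy the model.
  - modules() returns an iterator over all modules in the model. Converting it to a list effectively flattens the model.
  - Iterate over modules and append a shallow copy of each one to every module_copies[j].
  - In the end, module_copies[j] contains every layer of the model, i.e. module_copies[j][i] is the i-th layer.
Configure.
- Configure the model network: assign references to the data on the GPU to each module in the modules array, so that these modules become complete models.
- The nested model network was flattened before being copied, and the buffers and parameters were copied to the GPUs separately. They now have to be wired back into the shallow copies, which completes the model logic.
- Iterate over every submodule and configure only the parts that are needed.
  - Handle its _modules.
  - Handle its _parameters.
  - Handle its _buffers.
In the later parallel step, each worker receives one module from the returned array and runs on that module.

The code is as follows: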

def replicate(network, devices, detach=False):
    if not _replicatable_module(network):
        raise RuntimeError("Cannot replicate network where python modules are "
                           "childrens of ScriptModule")

    if not devices:
        return []

    # Count the GPUs to determine how many replicas are needed
    devices = [_get_device_index(x, True) for x in devices]
    num_replicas = len(devices) # number of replicas

    # 1) Copy

    # Copy the parameters
    params = list(network.parameters())
    param_indices = {param: idx for idx, param in enumerate(params)}
    # Copy to each GPU; _broadcast_coalesced_reshape is explained later
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)

    # Copy the buffers
    # First collect the buffers
    buffers = list(network.buffers())
    buffers_rg = [] # buffers that require gradients
    buffers_not_rg = [] # buffers that do not require gradients
    for buf in buffers:
        if buf.requires_grad and not detach:
            buffers_rg.append(buf)
        else:
            buffers_not_rg.append(buf)

    # Record the indices of the buffers that require gradients
    buffer_indices_rg = {buf: idx for idx, buf in enumerate(buffers_rg)}
    # Record the indices of the buffers that do not require gradients
    buffer_indices_not_rg = {buf: idx for idx, buf in enumerate(buffers_not_rg)}

    # Copy both kinds of buffers to each GPU
    buffer_copies_rg = _broadcast_coalesced_reshape(buffers_rg, devices, detach=detach)
    buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True)

    # Prepare to copy the model network
    modules = list(network.modules()) # modules() iterates over all modules; the list is the flattened model
    module_copies = [[] for device in devices] # one empty list per GPU
    module_indices = {}

    # Build the lists of shallow copies
    for i, module in enumerate(modules):  # iterate over the flattened model
        module_indices[module] = i
        for j in range(num_replicas):
            replica = module._replicate_for_data_parallel() # get a shallow copy
            # This is a temporary fix for DDP. DDP needs to access the
            # replicated model parameters. It used to do so through
            # `mode.parameters()`. The fix added in #33907 for DP stops the
            # `parameters()` API from exposing the replicated parameters.
            # Hence, we add a `_former_parameters` dict here to support DDP.
            replica._former_parameters = OrderedDict()
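            module_copies[j].append(replica) # append this layer to every module_copies[j]
    # In the end, module_copies[j] contains every layer, i.e. module_copies[j][i] is the i-th layer

    # 2) Configure

    # Goal: assign references to the GPU data to the shallow copies so that they become complete models. The nested network was flattened before copying, and the buffers and parameters were copied to the GPUs separately; here they are wired back into the shallow copies.

    for i, module in enumerate(modules): # iterate over every submodule and assign only what is needed

        # Handle its _modules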
        for key, child in module._modules.items():
            if child is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i] # module_copies[j] is the j-th replica
                    replica._modules[key] = None
            else:
                module_idx = module_indices[child]
                for j in range(num_replicas):
                    replica = module_copies[j][i] # module_copies[j] is the j-th replica
                    setattr(replica, key, module_copies[j][module_idx]) # set the matching part of the j-th replica; same below

        # Handle _parameters
        for key, param in module._parameters.items():
            if param is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    replica._parameters[key] = None
            else:
                param_idx = param_indices[param]
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    param = param_copies[j][param_idx]
                    # parameters in replicas are no longer leaves,
                    # so setattr them as non-parameter attributes
                    setattr(replica, key, param)
                    # expose the parameter for DDP
                    replica._former_parameters[key] = param
                    
        # Handle _buffers
        for key, buf in module._buffers.items():
            if buf is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    replica._buffers[key] = None
            else:
                if buf.requires_grad and not detach:
                    buffer_copies = buffer_copies_rg
                    buffer_idx = buffer_indices_rg[buf]
                else:
                    buffer_copies = buffer_copies_not_rg
                    buffer_idx = buffer_indices_not_rg[buf]
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    setattr(replica, key, buffer_copies[j][buffer_idx])

    return [module_copies[j][0] for j in range(num_replicas)]
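
The two phases of replicate (flatten and copy, then rewire) can be reproduced on a toy module tree. The sketch below assumes a hypothetical `ToyModule` class with only named children and parameters, represents "a parameter copied to a device" as a `(device, value)` tuple, and ignores buffers and autograd entirely; `toy_replicate` is not PyTorch code.

```python
import copy

class ToyModule:
    """Hypothetical stand-in for nn.Module with only child modules and params."""
    def __init__(self, name, params=None, children=None):
        self.name = name
        self.params = dict(params or {})
        self.children = dict(children or {})

    def modules(self):
        """Yield this module and all descendants, parents first."""
        yield self
        for child in self.children.values():
            yield from child.modules()

def toy_replicate(network, devices):
    modules = list(network.modules())                   # flatten the tree
    index_of = {id(m): i for i, m in enumerate(modules)}
    # 1) copy: one shallow copy of every module per device
    copies = [[copy.copy(m) for m in modules] for _ in devices]
    # 2) configure: point each copy at per-device children and parameters
    for i, module in enumerate(modules):
        for j, device in enumerate(devices):
            replica = copies[j][i]
            replica.children = {k: copies[j][index_of[id(c)]]
                                for k, c in module.children.items()}
            replica.params = {k: (device, v) for k, v in module.params.items()}
    return [copies[j][0] for j in range(len(devices))]  # the root of each replica

net = ToyModule("net", children={"fc": ToyModule("fc", params={"w": 1.5})})
replicas = toy_replicate(net, devices=["gpu0", "gpu1"])
print(replicas[1].children["fc"].params)   # {'w': ('gpu1', 1.5)}
print(net.children["fc"].params)           # {'w': 1.5}
```

As in the real function, only the root of each per-device list is returned; the rest of the tree is reachable from it because the children were rewired to the copies of the same device.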

4.3.2 Checking replicability

_replicatable_module checks whether the model can be replicated safely.

# Check if we can safely replicate the module.
# there are two types of module:
# 1. python modules
# 2. ScriptModule
#
# currently a module cannot be replicated properly if the descendants of
# any ScriptModule contains python module (type 1 above)
def _replicatable_module(module, memo=None):

    # module.modules() contains module itself as the first element
    def descendant_modules(module):
        gen = module.modules()
        next(gen)
        return gen

    if not _is_jit_enabled():
        return True
    if memo is None:
        memo = set()

    # memoize visited modules
    memo.add(module)
    if _is_script_module(module):
        memo.update(descendant_modules(module))
        return all(_is_script_module(descendant) for
                   descendant in descendant_modules(module))

    for child in module.children():
        # since any unreplicatable module will cause the check to return
        # False early, visited modules here can be safely ignored.
        if child in memo:
            continue
        if not _replicatable_module(child, memo):
            return False

    return True

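4.3.3 Shared copies

In PyTorch, there is a distinction between shallow copies and deep copies.

Suppose the model consists of a series of parameter matrices; the model object actually points to those matrices.

A shallow copy copies only the outermost values and pointers, not the deeper objects; only the parent object is copied. model.state_dict() is also shallow: if you set param = model.state_dict() and modify param in place, the model's parameters change accordingly.
By contrast, a deep copy (deepcopy) copies the values, the pointers, and the memory that the pointers refer to, i.e. the parent object and its children.

For example: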

import torch
import copy

# a refers to a block of memory
a = torch.nn.Linear(in_features=5, out_features=1, bias=True)
# A shallow copy copies the references, so both objects point to the same storage
b = copy.copy(a)

# state_dict is a shallow copy
p = a.state_dict()
print(id(a.state_dict()) == id(p)) # False: each call returns a new dict object

# Modify the storage through p
print(a.weight)
p['weight'][0][0] = 8.8888

# The storage that a points to has changed as well
print(a.weight)

The output is:

False
Parameter containing:
tensor([[-0.2253,  0.0802,  0.3984, -0.1208,  0.3796]], requires_grad=True)
Parameter containing:
tensor([[ 8.8888,  0.0802,  0.3984, -0.1208,  0.3796]], requires_grad=True)

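Back to our analysis: the Module class has a _replicate_for_data_parallel method that returns a replica. The replicas share storage with the original model, i.e. they are shallow copies.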

    def _replicate_for_data_parallel(self):
        replica = self.__new__(type(self))
        replica.__dict__ = self.__dict__.copy()

        # replicas do not have parameters themselves, the replicas reference the original
        # module.
        replica._parameters = OrderedDict()
        replica._buffers = replica._buffers.copy() # shallow copy
        replica._modules = replica._modules.copy() # shallow copy of the child-module table
        replica._is_replica = True

        return replica
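
The `__new__` plus `__dict__.copy()` idiom used above can be tried on an ordinary Python class. This is a minimal sketch with a hypothetical `Layer` class; the list stands in for parameter storage and the dict for the buffer table, so it only shows which objects end up shared and which do not.

```python
class Layer:
    def __init__(self):
        self.weights = [1.0, 2.0]             # stands in for parameter storage
        self.buffers = {"running_mean": 0.0}  # stands in for the buffer table

    def shallow_replica(self):
        replica = self.__new__(type(self))        # no __init__ call
        replica.__dict__ = self.__dict__.copy()   # copy the attribute table only
        replica.buffers = replica.buffers.copy()  # new dict, same values
        return replica

original = Layer()
replica = original.shallow_replica()

replica.weights[0] = 9.0                # mutates the shared list
replica.buffers["running_mean"] = 5.0   # only touches the replica's dict

print(original.weights)   # [9.0, 2.0]
print(original.buffers)   # {'running_mean': 0.0}
```

The replica gets its own attribute table and its own buffer dict, so rebinding entries on the replica does not disturb the original, while objects that were not copied (the list here) remain shared.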

Before the configuration step, the copies look like this:

+---------------------------------------------------------------+
|                               +----------------------+        |
| CPU                           | Module               |        |
|                               |                      |        |
|                               |     _parameters      |        |
|                               |                      |        |
|                    +--------------> _buffers  <-------------+ |
|                    |          |                      |      | |
|                    |     +------->  _modules  <----------+  | |
|                    |     |    |                      |   |  | |
|                    |     |    +----------------------+   |  | |
| +---------------------+  |    +----------------------+   |  | |
| | module_copies[0] |  |  |    | module_copies[1]     |   |  | |
| |                  |  |  |    |                      |   |  | |
| |    _parameters   |  |  |    |     _parameters      |   |  | |
| |                  |  |  |    |                      |   |  | |
| |    _buffers +----+  |  |    |     _buffers +--------------+ |
| |                     |  |    |                      |   |    |
| |    _modules  +-------->+    |     _modules  +--------->+    |
| |                     |       |                      |        |
| +---------------------+       +----------------------+        |
+---------------------------------------------------------------+

  +---------------------+       +----------------------+
  | GPU 0               |       | GPU 1                |
  |                     |       |                      |
  |     _parameters     |       |      _parameters     |
  |                     |       |                      |
  |     _buffers        |       |      _buffers        |
  |                     |       |                      |
  |                     |       |                      |
  |                     |       |                      |
  +---------------------+       +----------------------+

After the configuration step, they look like this:

   +-----------------------------------------------------------------+
   | CPU                             +----------------------+        |
   |                                 | Module               |        |
   |                                 |                      |        |
   |                                 |     _parameters      |        |
   |                                 |                      |        |
   |                                 |     _buffers         |        |
   |                                 |                      |        |
   |                                 |     _modules         |        |
   |                                 |                      |        |
   |                                 +----------------------+        |
   |   +---------------------+       +----------------------+        |
   |   | module_copies[0]    |       | module_copies[1]     |        |
   |   |                     |       |                      |        |
+---------+ _parameters      |       |     _parameters +-----------+ |
|  |   |                     |       |                      |      | |
|  |   |    _buffers +------------+  |     _buffers +-----------+  | |
|  |   |                     |    |  |                      |   |  | |
|  |   |    _modules         |    |  |     _modules         |   |  | |
|  |   |                     |    |  |                      |   |  | |
|  |   +---------------------+    |  +----------------------+   |  | |
|  +-----------------------------------------------------------------+
|                                 |                             |  |
|      +---------------------+    |  +----------------------+   |  |
|      | GPU 0               |    |  | GPU 1                |   |  |
|      |                     |    |  |                      |   |  |
+--------->  _parameters     |    |  |      _parameters <----------+
       |                     |    |  |                      |   |
       |     _buffers  <----------+  |      _buffers   <--------+
       |                     |       |                      |
       |                     |       |                      |
       |                     |       |                      |
       +---------------------+       +----------------------+

4.3.4 Copy operations

4.3.4.1 _broadcast_coalesced_reshape

Both the parameters and the buffers are copied with _broadcast_coalesced_reshape.

def _broadcast_coalesced_reshape(tensors, devices, detach=False):
    from ._functions import Broadcast
    if detach:
        # If detach is set, call broadcast_coalesced directly
        return comm.broadcast_coalesced(tensors, devices)
    else:
        # Use the autograd function to broadcast if not detach
        if len(tensors) > 0:
            # Otherwise go through Broadcast, which also ends up calling broadcast_coalesced
            tensor_copies = Broadcast.apply(devices, *tensors)
            return [tensor_copies[i:i + len(tensors)]
                    for i in range(0, len(tensor_copies), len(tensors))]
        else:
            return []
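
The list comprehension in the non-detach branch regroups a flat tuple into one group per device. The sketch below replays that slicing on placeholder strings, assuming the flat result is ordered device by device, which matches how Broadcast.forward flattens its outputs. `regroup` is a hypothetical name for the comprehension.

```python
def regroup(flat_copies, num_tensors):
    """Split a flat tuple of copies into one group per device."""
    return [flat_copies[i:i + num_tensors]
            for i in range(0, len(flat_copies), num_tensors)]

# Two tensors ("w", "b") broadcast to three hypothetical devices,
# returned flat in device-major order.
flat = ("w@0", "b@0", "w@1", "b@1", "w@2", "b@2")
print(regroup(flat, num_tensors=2))
# [('w@0', 'b@0'), ('w@1', 'b@1'), ('w@2', 'b@2')]
```

After regrouping, `result[j][i]` is the copy of tensor i on device j, which is exactly how replicate indexes `param_copies[j][param_idx]`.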

4.3.4.2 Broadcast

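The reason for going through Broadcast is that the tensors are not detached, so besides broadcasting, the context also has to record which outputs do not need gradients. In some cases, user-defined Functions may need to know this.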

class Broadcast(Function):

    @staticmethod
    def forward(ctx, target_gpus, *inputs):
        assert all(i.device.type != 'cpu' for i in inputs), (
            'Broadcast function not implemented for CPU tensors'
        )
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.target_gpus = target_gpus
        if len(inputs) == 0:
            return tuple()
        ctx.num_inputs = len(inputs)
        # the inputs are on device[0]
        ctx.input_device = inputs[0].get_device()
        # same call as in the detach case
        outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
        non_differentiables = []
        
        # Record in the context which outputs do not need gradients
        for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):
            if not input_requires_grad:
                for output in outputs:
                    non_differentiables.append(output[idx])
        ctx.mark_non_differentiable(*non_differentiables)
        return tuple([t for tensors in outputs for t in tensors])

    @staticmethod
    def backward(ctx, *grad_outputs):
        return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)

mark_non_differentiable is defined in torch/csrc/autograd/custom_function.cpp; it records the non-differentiable variables in the AutogradContext.

void AutogradContext::mark_non_differentiable(const variable_list &outputs) {
  non_differentiable_.clear();
  non_differentiable_.reserve(outputs.size());
  for(auto& var : outputs) {
    non_differentiable_.insert(var.unsafeGetTensorImpl());
  }
}

4.3.4.3 broadcast_coalesced

broadcast_coalesced jumps into C++.

def broadcast_coalesced(tensors, devices, buffer_size=10485760):
    """Broadcasts a sequence tensors to the specified GPUs.
    Small tensors are first coalesced into a buffer to reduce the number
    of synchronizations.

    Args:
        tensors (sequence): tensors to broadcast. Must be on the same device,
          either CPU or GPU.
        devices (Iterable[torch.device, str or int]): an iterable of GPU
          devices, among which to broadcast.
        buffer_size (int): maximum size of the buffer used for coalescing

    Returns:
        A tuple containing copies of :attr:`tensor`, placed on :attr:`devices`.
    """
    devices = [_get_device_index(d) for d in devices]
    tensors = [_handle_complex(t) for t in tensors]
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)

4.3.4.4 C++

From the initialization code, we can see that the work is done in broadcast_coalesced.

  auto m = py::cast<py::module>(module);
  m.def(
       "_broadcast_coalesced",
       [](std::vector<at::Tensor>& tensors,
          std::vector<int64_t> devices,
          size_t buffer_size) {
         return broadcast_coalesced(tensors, devices, buffer_size);
       },
       py::arg("tensors"),
       py::arg("devices"),
       py::arg("buffer_size"),
       py::call_guard<py::gil_scoped_release>())

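The code lives in torch/csrc/cuda/comm.cpp. Let us study its comment.

broadcast_coalesced distributes variables to all GPUs. In broadcast_coalesced, multiple variables may be coalesced into one large variable, broadcast to the other devices, and then split according to the original shapes.
When splitting, the view operations make all variables that were broadcast together share a single version counter, because they are all views of the large variable. However, the large variable is discarded immediately, and these variables do not share storage at all.
For example, when two buffers are broadcast together in `DataParallel` and one of them is modified in place during `forward` while the other is needed in backward, the autograd engine complains. The variables are therefore re-wrapped after broadcasting and given individual version counters.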

// broadcast_coalesced
// ~~~~~~~~~~~~~~~~~~~
//
// In broadcast_coalesced, multiple variables may be coalesced into a single
// large one, broadcast to other devices, and the get split according to the
// original shapes.
//
// When splitting, the view operations will make all Variables broadcast
// together to share a single version counter, because they are all views of the
// large Variable. However, that large Variable is immediately discarded and all
// these Variables do not share storage at all.
//
// For example, when two buffers are broadcast together in `DataParallel` and
// one of them is modified in-place during `forward` but the other is needed in
// backward, autograd engine will complain.
//
// We thus re-wrap these Variables after broadcasting (i.e., effectively doing
// what is equivalent to .data in Python), and give them individual version
// counters.
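
The coalesce, broadcast, split sequence described in this comment can be illustrated with a minimal pure-Python sketch. It assumes plain lists stand in for tensors and dictionary keys stand in for devices; the names flatten, unflatten and broadcast_coalesced_sketch are hypothetical helpers for illustration, not PyTorch APIs.

```python
from itertools import accumulate


def flatten(tensors):
    """Concatenate several 1-D 'tensors' (plain lists) into one flat buffer."""
    return [x for t in tensors for x in t]


def unflatten(flat, like):
    """Split a flat buffer back into pieces with the lengths of `like`."""
    ends = list(accumulate(len(t) for t in like))
    starts = [0] + ends[:-1]
    # Slicing a list copies it, so every piece owns its storage. This plays
    # the role of the re-wrapping step: no piece is tied to the big buffer.
    return [flat[s:e] for s, e in zip(starts, ends)]


def broadcast_coalesced_sketch(tensors, devices):
    """Return {device: [copy of each tensor]} using one flat copy per device."""
    flat = flatten(tensors)
    outputs = {}
    for device in devices:
        device_copy = list(flat)  # a single "transfer" instead of one per tensor
        outputs[device] = unflatten(device_copy, tensors)
    return outputs


weights = [1.0, 2.0, 3.0]
bias = [0.5]
running_mean = [0.0, 0.1]
result = broadcast_coalesced_sketch([weights, bias, running_mean], devices=[0, 1])
```

In this sketch, modifying one piece on one device leaves every other piece untouched, which is the independence the real code restores by giving each output its own version counter.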

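The parameters of broadcast_coalesced are as follows:

tensors: the tensors to broadcast. They must all be on the same device, and the C++ code below checks that this device is devices[0].
devices: the devices to copy to.
buffer_size: the maximum buffer size. Small tensors are coalesced into a buffer of at most this size to reduce the number of synchronizations.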

tensor_list2d broadcast_coalesced(
    TensorList tensors,
    IntArrayRef devices,
    size_t buffer_size) {
  TORCH_CHECK(
      std::all_of(
          tensors.begin(),
          tensors.end(),
          [&](const at::Tensor& t) { return t.get_device() == devices[0]; }),
      "All tensors must be on devices[0]: ",
      devices[0]);
#ifdef USE_NCCL
  buffer_size = std::min(torch::cuda::nccl::get_max_count(), buffer_size);
#endif

  tensor_list2d outputs(devices.size());
  outputs[0] = tensors.vec();
  for (auto& o : outputs)
    o.reserve(tensors.size());

  unique_type_checker type_checker;
  at::cuda::CUDAGuard device_guard(devices[0]);
  for (auto& chunk : utils::take_tensors(tensors, buffer_size)) {
    auto type_id = chunk.type_id();
    type_checker.show(type_id);
    std::vector<at::Tensor> results;
    if (chunk.options().is_sparse()) {
      auto flat_tuple = utils::flatten_sparse_tensors(chunk.tensors);
      auto broadcast_indices = broadcast(flat_tuple.first, devices); // broadcast happens here
      auto broadcast_values = broadcast(flat_tuple.second, devices); // broadcast happens here
      results.reserve(devices.size());
      for (size_t i = 1, num_devices = devices.size(); i < num_devices; ++i) {
        device_guard.set_index(devices[i]);
        auto& device_outputs = outputs[i];
        auto& inds = broadcast_indices[i];
        auto& vals = broadcast_values[i];
        for (auto& t :
             utils::unflatten_sparse_tensors(inds, vals, chunk.tensors)) {
          Variable var = t;
          device_outputs.push_back(make_variable(var.tensor_data(), false));
        }
      }
    } else {
      auto results = // broadcast happens here
          broadcast(utils::flatten_dense_tensors(chunk.tensors), devices);
      for (size_t i = 1, num_devices = devices.size(); i < num_devices; ++i) {
        device_guard.set_index(devices[i]);
        auto& device_outputs = outputs[i];
        for (auto& t :
             utils::unflatten_dense_tensors(results[i], chunk.tensors)) {
          Variable var = t;
          device_outputs.push_back(make_variable(var.tensor_data(), false));
        }
      }
    }
  }

  // If we only saw a single tensor type, then we can skip expensive reordering
  if (!type_checker.unique) {
    for (auto& o : outputs)
      utils::reorder_tensors_like(o, tensors);
  }
  return outputs;
}
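
The loop over utils::take_tensors above processes the input in chunks bounded by buffer_size. The following is a simplified sketch of that idea, assuming a greedy packing rule chosen for illustration; the real utils::take_tensors also groups tensors by type, and its exact cut-off rule may differ. take_chunks is a hypothetical name.

```python
def take_chunks(sizes_in_bytes, buffer_size):
    """Greedily pack items into chunks whose total size stays within buffer_size.

    An item larger than buffer_size gets a chunk of its own.
    Returns a list of chunks, each a list of indices into sizes_in_bytes.
    """
    chunks, current, current_bytes = [], [], 0
    for index, size in enumerate(sizes_in_bytes):
        # Close the current chunk if adding this item would overflow it.
        if current and current_bytes + size > buffer_size:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


sizes = [4, 4, 8, 20, 2, 2]
chunks = take_chunks(sizes, buffer_size=16)
print(chunks)  # [[0, 1, 2], [3], [4, 5]]
```

Each chunk then needs only one flatten and one broadcast, so six tensors cost three transfers here instead of six.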

The broadcast method is as follows:

std::vector<Tensor> broadcast(const Tensor& tensor, IntArrayRef devices) {
  std::vector<Tensor> diff_device_dst_tensors;
  diff_device_dst_tensors.reserve(devices.size());
  for (auto device : devices) {
    if (device != tensor.get_device()) {
      diff_device_dst_tensors.push_back(at::empty(
          tensor.sizes(),
          tensor.options().device(
              at::Device(DeviceType::CUDA, device)))); // preserve memory format
    }
  }
  // delegate the actual transfer to _broadcast_out_impl
  _broadcast_out_impl(tensor, diff_device_dst_tensors);
  std::vector<Tensor> dst_tensors;
  dst_tensors.reserve(devices.size());
  auto it = diff_device_dst_tensors.begin();
  for (auto device : devices) {
    if (device != tensor.get_device()) {
      dst_tensors.push_back(*it++);
    } else {
      dst_tensors.push_back(tensor);
    }
  }
  TORCH_INTERNAL_ASSERT(it == diff_device_dst_tensors.end());
  return dst_tensors;
}
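
The bookkeeping in broadcast (allocate destinations only for the other devices, then return results in the order of devices while reusing the source itself) can be sketched in pure Python. This is an illustrative model only: dictionaries with "device" and "data" keys stand in for tensors, and broadcast_sketch is a hypothetical name.

```python
def broadcast_sketch(tensor, devices):
    """Model of broadcast(): `tensor` is {"device": int, "data": list}."""
    # Step 1: allocate an empty destination for every device that differs
    # from the device the source already lives on.
    destinations = [
        {"device": d, "data": [None] * len(tensor["data"])}
        for d in devices
        if d != tensor["device"]
    ]
    # Step 2: fill the destinations (stands in for _broadcast_out_impl).
    for dst in destinations:
        dst["data"][:] = tensor["data"]
    # Step 3: build the result in the order of `devices`, reusing the source
    # object itself wherever the target device is the source device.
    remaining = iter(destinations)
    return [tensor if d == tensor["device"] else next(remaining) for d in devices]


source = {"device": 0, "data": [1, 2, 3]}
copies = broadcast_sketch(source, devices=[0, 1, 2])
```

The entry for the source device is the original object rather than a copy, which is why broadcast_coalesced starts its inner loops at index 1 and fills outputs[0] directly from the input tensors.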

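The call finally reaches _broadcast_out_impl, which broadcasts the source tensor (CPU or CUDA) to a list of CUDA tensors. When PyTorch is built with NCCL and NCCL is available for these tensors, it calls nccl::broadcast(nccl_list); otherwise it falls back to one copy_ per output tensor.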

static inline std::vector<Tensor>& _broadcast_out_impl(
    const Tensor& tensor,
    std::vector<Tensor>& out_tensors) {
#ifdef USE_NCCL
  std::vector<Tensor> nccl_list;
  nccl_list.reserve(out_tensors.size() + 1);
  nccl_list.push_back(tensor);
  for (auto& out_tensor : out_tensors) {
    nccl_list.push_back(out_tensor);
  }
  if (nccl::is_available(nccl_list)) {
    nccl::broadcast(nccl_list); // the NCCL operation is invoked here
  } else {
#else
  {
#endif
    for (auto& out_tensor : out_tensors) {
      out_tensor.copy_(tensor, /*non_blocking=*/true);
    }
  }
  return out_tensors;
}
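
The preprocessor layout above makes the copy loop serve both as the `else` branch when NCCL is compiled in and as the only path when it is not. A rough Python analogue of that dispatch, assuming a hypothetical `collective` callable in place of nccl::broadcast and plain lists in place of tensors:

```python
def broadcast_out_sketch(source, outputs, collective=None):
    """Fill every list in `outputs` with the contents of `source`.

    `collective` is a hypothetical stand-in for nccl::broadcast. It receives
    [source, *outputs], mirroring how nccl_list is assembled in the C++ code.
    Returns the name of the path taken, for illustration only.
    """
    if collective is not None:
        collective([source] + outputs)
        return "collective"
    # Fallback path: one independent copy per output.
    for out in outputs:
        out[:] = source
    return "copy"


def fake_collective(tensor_list):
    """Toy collective: the first entry is the root, the rest receive its data."""
    root, *receivers = tensor_list
    for receiver in receivers:
        receiver[:] = root


outs_a = [[0, 0], [0, 0]]
path_a = broadcast_out_sketch([1, 2], outs_a)
outs_b = [[0, 0], [0, 0]]
path_b = broadcast_out_sketch([1, 2], outs_b, collective=fake_collective)
```

Both paths leave the outputs with the same contents; only the transport differs.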

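At this point both the data and the model have been distributed to the other GPUs. Let's draw the forward graph built so far to get a clear picture: replicate called Broadcast.forward, which saved input_device and num_inputs in its context. Forward propagation can come next.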

+----------------------------------------------------------------------------------------+
| DataParallel.forward                                                                   |
|                                                                                        |
|                                                                                        |
|              replicate +--------------->   parallel_apply             gather           |
|                                                                                        |
+----------------------------------------------------------------------------------------+

     +---------------------------+
     | Broadcast                 |
     |                           |
     |                           |
     |                           |
     |          forward()  +----------->
     |                           |
     |                           |
     |  +---------------------+  |
     |  | ctx                 |  |
     |  |       input_device  |  |
     |  |                     |  |
     |  |       num_inputs    |  |
     |  |                     |  |
     |  +---------------------+  |
     |                           |
     |                           |
     |                           |
     |                           |
     |                           |
     |                           |
     +---------------------------+

Because of length limits, the next article continues the analysis from the parallel operation (forward propagation).