[原始碼解析] PyTorch 分散式(1)------歷史和概述

使用 torch.multiprocessing 封裝了 Python 原生 multiprocessing模組，這樣就可以利用多個CPU核。
匯入了 THD (distributed pytorch)，這就有了用於分散式計算的底層庫。
引入了torch.distributed包，它允許在多臺機器之間交換張量。使用這個包可以在多臺機器之上使用更大的batch進行訓練。
釋出了 c10d 庫，這成為 torch.distributed package和torch.nn.parallel.DistributedDataParallel 包的基礎後端，同時 THD 被廢棄。
提供了一個分散式RPC框架來支援分散式模型並行訓練。它允許遠端執行函式和引用遠端物件，而無需複製周圍的真實資料，並提供autograd和optimizer API以透明地進行後向傳播和跨RPC邊界更新引數。
引入了彈性訓練，Torchelastic提供了“torch.distributed.launch”CLI的一個嚴格超集，並新增了容錯和彈性功能。
引入了流水線並行，就是 torchgpipe。

其歷史演進圖如下：

                                           v1.0

                                           v1.1

                                           v1.2

               v0.1.8                      v1.3                  v1.7

                THD                        C10D               TorchElastic
                 +                          +                      +
                 |                          |                      |
                 |                          |                      |
                 |                          |                      |
                 |                          |                      |
                 |                          |                      |
                 |                          |                      |
+-------+--------+------------+-------------+-----------+----------+------------+----------> Time
        |                     |                         |                       |
        |                     |                         |                       |
        |                     |                         |                       |
        |                     |                         |                       |
        |                     |                         |                       |
        +                     +                         +                       +

  Multiprocessing       torch.distributed              RPC                   Pipeline

     v0.1.2                  v0.2                      v1.4                   v1.8

     v0.1.6                  v0.4                      v1.5                   v1.9

                                                       v1.6

具體歷史如下，有興趣的朋友可以研究一下，看看一個機器學習系統如何一步一步進入分散式世界，沒有興趣的朋友可以直接跳過到後續概述部分。

1.1 Multiprocessing

PyTorch 0.1.2

使用 torch.multiprocessing 封裝了 Python 原生 multiprocessing模組，這樣就可以利用多個CPU核。

具體原因是，在Python 之中，使用執行緒是有技術問題的，主要就是 Global Interpreter Lock，因此應該使用多程式。

With Python, one cannot use threads because of a few technical issues.
Python has what is called Global Interpreter Lock, which does not allow threads to concurrently execute python code.

Hence, the most pythonic way to use multiple CPU cores is multiprocessing

We made PyTorch to seamlessly integrate with python multiprocessing.
This involved solving some complex technical problems to make this an air-tight solution, and more can be read in this in-depth technical discussion.

PyTorch 0.1.6

Multiprocessing 支援 CUDA。

Uptil now, Tensor sharing using multiprocessing only worked for CPU Tensors.
We've now enabled Tensor sharing for CUDA tensors when using python-3.
You can read more notes here: http://pytorch.org/docs/notes/multiprocessing.html

1.2 THD 底層庫

PyTorch 0.1.8

匯入了 THD (distributed pytorch)，這就有了用於分散式計算的底層庫。

Merged an initial version of THD (distributed pytorch)

1.3 torch.distributed 庫

PyTorch 0.2

We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger mini-batches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.

The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv, all_reduce that will exchange Tensors among nodes (machines).

For each of the machines to first identify each other and assign unique numbers to each other (ranks), we provide simple initialization methods:

shared file system (requires that all processes can access a single file system)

IP multicast (requires that all processes are in the same network)

environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)

這個版本引入了torch.distributed包，它允許在多臺機器之間交換張量。使用這個包可以在多臺機器之上使用更大的batch進行訓練。

該distributed包遵循 MPI 風格的程式設計模型，即distributed包提供了比如send, recv, all_reduce 這樣的方法來在不同的節點（機器）之間交換張量。

因為需要多臺機器之間彼此識別，所以需要有一個機制來唯一標示每臺機器，這就是rank。distributed包提供了幾種簡單的初始化方法：

共享檔案系統（所有機器上的所有程式都可以訪問這個檔案系統）
IP組播（要求所有程式在同一個網路中）
環境變數（需要使用者手動指定rank，並且提供一個所有程式可訪問的節點地址）

World size是將參與訓練的程式數。每個程式都將被分配一個rank，該rank是一個介於 0 和 world_size - 1 之間的數字，在此作業中是唯一的。它將用作程式識別符號，並將用於代替地址，例如，指定應將張量傳送到哪個 rank（程式）。

分散式計算中的原語包括同步模式的send, recv 和非同步模式的 isend，irecv。因為某些通訊模式出現的太頻繁了，所以 PyTorch 開發了高階函式，比如all_reduce，這些集合通訊原語會用於整個程式組，並且更加高效。

但是分散式包還是太底層，所以基本還是基於它來實現更高階的演算法或者定製特殊演算法，因為資料並行訓練是如此常見，PyTorch 為此建立了高階幫助程式DistributedDataParallel，它幾乎是 nn.DataParallel 的替代品。

PyTorch 0.4

這個版本有了幾處相關。

增加了DistributedDataParallelCPU，這個類和DistributedDataParallel很相似，但是主要支援在CPU之上訓練（DistributedDataParallel 目標是 GPU）。這個類支援 mpi, gloo and tcp 這些後端（tcp後端後來被廢除）。

Add DistributedDataParallelCPU. This is similar to DistributedDataParallel, but with specific support for models running on the CPU (contrary to DistributedDataParallel, which targets GPU), and supports mpi, gloo and tcp backends #5919.

增加了新工具指令碼。此指令碼可以在單個機器或者多個機器之上使用 DistributedDataParallel。

Helper utility for launching Distributed Training jobs

We have added an utility function to help launch jobs on a distributed setup.
In order to launch a script that leverages DistributedDataParallel on either single-node multiple-nodes, we can make use of torch.distributed launch as follows
python -m torch.distributed.launch my_script.py --arg1 --arg2 --arg3

增加了基於 NCCL 2.0 的新分散式後端，這樣速度得到很大提升，也可以基於多個GPU提供集合通訊的新API。

A new distributed backend based on NCCL 2.0

PyTorch now has a new distributed backend, which leverages NCCL 2.0 for maximum speed.
It also provides new APIs for collective operations on multiple GPUs.
You can enable the new backend via
torch.distributed.init_process_group("nccl")

其他改進如下，比如聚合多個小廣播操作，混合精度，Infiniband 支援等。

Coalesce many small broadcasts to improve performance #4978

Add mixed-precision support for distributed training #4891

Release NCCL distributed backend. Previously it was marked as experimental. #4921

Enable Infiniband support for Gloo data channel with automatic IB device detection #4795

1.4 c10d庫

PyTorch 1.0

torch.distributed new "C10D" library

The torch.distributed package and torch.nn.parallel.DistributedDataParallel module are backed by the new "C10D" library. The main highlights of the new library are:

C10D is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.

Significant Distributed Data Parallel performance improvements especially for slower network like ethernet-based hosts

Adds async support for all distributed collective operations in the torch.distributed package.

Adds send and recv support in the Gloo backend

這個版本釋出了 c10d 庫，這成為 torch.distributed package和torch.nn.parallel.DistributedDataParallel 包的基礎後端，這個庫的主要亮點是：

因為C10D完全是非同步操作，所以對於所有後端（Gloo, NCCL, 和 MPI）的效能都提升很大。
對於類似基於乙太網的慢速網路，分散式資料並行得到了巨大提升。
對torch.distributed 包中的所有分散式集合操作都新增了非同步支援。
對Gloo後端新增了send、recv支援。

另外還有幾點修改。

TCP後端被移除，Gloo和 MPI 後端被推薦用於CPU集合通訊，NCCL被推薦用於GPU集合通訊。
舊的（基於THD）torch.distributed 包被廢棄。
舊的（基於THD）torch.nn.parallel.DistributedDataParallel包被廢棄。

torch.distributed: the TCP backend is removed, we recommend to use Gloo and MPI backends for CPU collectives and NCCL backend for GPU collectives.

the old (THD-backed) torch.distributed package is deprecated but still available at torch.distributed.deprecated.

The old (THD-backed) torch.nn.parallel.DistributedDataParallel is deprecated but still available at torch.nn.parallel.deprecated.DistributedDataParallel.

PyTorch 1.1

nn.parallel.DistributedDataParallel 可以支援多GPU模型，這樣模型並行和資料並行可以跨server進行協作。

DistributedDataParallel new functionality and tutorials

nn.parallel.DistributedDataParallel: can now wrap multi-GPU modules, which enables use cases such as model parallel (tutorial) on one server and data parallel (tutorial) across servers.

c10d ProcessGroup::getGroupRank 被移除。

PyTorch 1.2

此版本做了如下改進：

Distributed Package 可以支援CPU modules，稀疏張量，本地梯度累積。

Distributed Package

DistributedDataParallel: support CPU modules. (20236)

DistributedDataParallel: support sparse tensors. (19146)

DistributedDataParallel: support local gradient accumulation. (21736)

另外也有一些其他小改進，比如對於MPI操作加入了device guard 。

PyTorch 1.3

新增了torch.distributed對macOS的支援，但是隻能使用Gloo後端，使用者只需要修改一行程式碼就可以複用其他平臺的程式碼。也做了一些其他改進。

This release adds macOS support for torch.distributed with the Gloo backend. You can more easily switch from development (e.g. on macOS) to deployment (e.g. on Linux) without having to change a single line of code. The prebuilt binaries for macOS (stable and nightly) include support out of the box.

torch.distributed.all_reduce_coalesced Support allreduce of a list of same-device tensors (24949, 25470, 24876)

torch.distributed.all_reduce Add bitwise reduction ops (BAND, BOR, BXOR) (26824)

1.5 RPC框架

PyTorch 1.4.0

此版本開始試驗分散式模型訓練。

隨著RoBERTa等模型的規模不斷擴大直到數十億個引數，模型並行訓練變得越來越重要，因為其可以幫助研究人員突破極限。1.4.0 版本提供了一個分散式RPC框架來支援分散式模型並行訓練。它允許遠端執行函式和引用遠端物件，而無需複製相關真實資料，並提供autograd和optimizer API以透明地進行後向傳播和跨RPC邊界更新引數。

Distributed Model Parallel Training [Experimental]

With the scale of models, such as RoBERTa, continuing to increase into the billions of parameters, model parallel training has become ever more important to help researchers push the limits. This release provides a distributed RPC framework to support distributed model parallel training. It allows for running functions remotely and referencing remote objects without copying the real data around, and provides autograd and optimizer APIs to transparently run backwards and update parameters across RPC boundaries.

torch.distributed.rpc是一個新引入的包。它的基本構建塊可以在模型訓練和推理中遠端執行函式，這對於分散式模型並行或實現引數伺服器框架等場景非常有用。更具體地說，它包含四個支柱：RPC、遠端引用、分散式autograd和分散式優化器。請參閱檔案和教程更多細節。

RPC [Experimental]

torch.distributed.rpc is a newly introduced package. It contains basic building blocks to run functions remotely in model training and inference, which will be useful for scenarios like distributed model parallel or implementing parameter server frameworks. More specifically, it contains four pillars: RPC, Remote Reference, Distributed Autograd, and Distributed Optimizer. Please refer to the documentation and the tutorial for more details.

PyTorch 1.5

正式釋出了 torch.distributed.rpc 。

“torch.distributed.rpc”包旨在支援不適合 “DistributedDataParallel”的各種分散式訓練正規化。示例包括引數伺服器訓練、分散式模型並行和分散式管道並行。torch.distributed.rpc包中的功能可以分為四組主要的API。

**RPC **API允許在指定目標工作程式上使用給定的引數來執行函式，並且可以獲取返回值或建立對返回值的分散式引用。
RRef（遠端引用）是另一個worker上物件的引用。持有RRef的工作者可以顯式地請求物件的副本，並且還可以與其他worker共享輕量級RRef，而不必擔心引用計數。當多個worker需要重複訪問同一遠端物件的不同版本時，這尤其有用。
使用分散式自動載入，應用程式可以自動計算梯度，即使模型已經使用RPC在多個worker上拆分過。PyTorch 在向前傳播過程中將RPC邊界處的區域性autograd圖縫合在一起，並向在後向傳播之中穿越邊界讓參與者啟動本地autograd。
分散式優化器使用分散式autograd計算的梯度來更新模型引數。它的建構函式接受一個本地優化器（例如SGD，Adagrad等）和一個引數RRef列表，它的step函式在所有不同的 RRef 所有者（worker）之上自動使用本地優化器來更新引數。

Distributed RPC framework APIs [Now Stable]

The torch.distributed.rpc package aims at supporting a wide range of distributed training paradigms that do not fit into DistributedDataParallel. Examples include parameter server training, distributed model parallelism, and distributed pipeline parallelism. Features in the torch.distributed.rpc package can be categorized into four main sets of APIs.

The RPC API allows running a function on a specified destination worker with given arguments and fetches the return value or creates a distributed reference to the return value.

The RRef (Remote REFerence) serves as a reference to an object on another worker. A worker holding an RRef can explicitly request copies of the object, and it can also share the light-weight RRef with other workers without worrying about reference counting. This is especially useful when multiple workers need to repeatedly access different versions of the same remote object.

With Distributed Autograd, applications can automatically compute gradients even if a model is split on multiple workers using RPC. This is achieved by stitching together local autograd graphs at RPC boundaries in the forward pass and reaching out to participants to transparently launch local autograd in the backward pass.

The Distributed Optimizer uses gradients computed by Distributed Autograd to update model parameters. Its constructor takes a local optimizer (e.g., SGD, Adagrad, etc.) and a list of parameter RRefs, and its step() function automatically uses the local optimizer to update parameters on all distinct RRef owner workers.

PyTorch 1.6

此版本對 DDP 和 RPC 進行了大量的改進，也增加了新特性，幾個大特性包括：

Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedural call (RPC) packages.

RPC的TensorPipe後端

PyTorch 1.6為RPC模組引入了一個新的後端，它利用了TensorPipe庫。TensorPipe庫是一個面向機器學習的張量感知的點對點通訊原語，旨在對PyTorch中分散式培訓的當前原語（Gloo、MPI等）進行補足，這些原語是集合通訊和分塊的。TensorPipe的成對性和非同步性使其有助於構建超越資料並行的新的網路模式：客戶機-伺服器方式（例如，嵌入的引數伺服器、actor-learner separation in Impala-style RL等）和模型與管道並行訓練（比如GPipe），gossip SGD等。

TensorPipe backend for RPC

PyTorch 1.6 introduces a new backend for the RPC module which leverages the TensorPipe library, a tensor-aware point-to-point communication primitive targeted at machine learning, intended to complement the current primitives for distributed training in PyTorch (Gloo, MPI, ...) which are collective and blocking. The pairwise and asynchronous nature of TensorPipe lends itself to new networking paradigms that go beyond data parallel: client-server approaches (e.g., parameter server for embeddings, actor-learner separation in Impala-style RL, ...) and model and pipeline parallel training (think GPipe), gossip SGD, etc.

[Beta] DDP+RPC

PyTorch分散式支援兩種強大的正規化：DDP用於完全同步的資料並行訓練，RPC框架允許分散式模型並行。

目前，這兩個特性是獨立工作的，使用者不能混合和匹配這兩個特性來嘗試混合並行模式。從PyTorch 1.6開始，我們已經使DDP和RPC能夠無縫地協同工作，這樣使用者就可以將這兩種技術結合起來，實現資料並行和模型並行。例如，使用者希望在引數伺服器上放置大型嵌入表，並使用RPC框架進行嵌入查詢，但在培訓器上儲存較小的dense引數，並使用DDP同步dense引數。

[Beta] DDP+RPC

PyTorch Distributed supports two powerful paradigms: DDP for full sync data parallel training of models and the RPC framework which allows for distributed model parallelism. Currently, these two features work independently and users can’t mix and match these to try out hybrid parallelism paradigms.

Starting PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. An example is where users would like to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, but store smaller dense parameters on trainers and use DDP to synchronize the dense parameters. Below is a simple code snippet.

[Beta] RPC - Asynchronous User Functions

RPC非同步使用者函式支援在執行使用者定義的函式時在伺服器端進行yield 和resume。在此功能之前，當被呼叫方處理請求時，一個RPC執行緒將等待使用者函式返回。如果使用者函式包含IO（例如，巢狀RPC）或信令（例如，等待另一個請求解除阻止），則相應的RPC執行緒將處於空閒狀態，等待這些事件。因此，一些應用程式必須使用大量執行緒並且傳送額外的RPC請求，這可能會導致效能下降。要使使用者函式在此類事件中yield，應用程式需要：1）使用@rpc.functions.async_executiondecorator封裝函式；2）讓函式返回'torch.futures.Future'，並將恢復邏輯作為回撥安裝到'Future'物件上。

[Beta] RPC - Asynchronous User Functions

RPC Asynchronous User Functions supports the ability to yield and resume on the server side when executing a user-defined function. Prior to this feature, when an callee processes a request, one RPC thread waits until the user function returns. If the user function contains IO (e.g., nested RPC) or signaling (e.g., waiting for another request to unblock), the corresponding RPC thread would sit idle waiting for these events. As a result, some applications have to use a very large number of threads and send additional RPC requests, which can potentially lead to performance degradation. To make a user function yield on such events, applications need to: 1) Decorate the function with the @rpc.functions.async_execution decorator; and 2) Let the function return a torch.futures.Future and install the resume logic as callbacks on the Future object.

[Beta] Fork/Join Parallelism

此版本增加了對語言級構造的支援，以及對TorchScript程式碼中粗粒度並行性的執行時支援。這種支援對於並行執行整合中的模型或並行執行遞迴網路中的雙向元件等情況非常有用，併為任務級並行解鎖了並行體系結構（例如許多核心CPU）的計算能力。

TorchScript程式的並行執行通過兩個原語：“torch.jit.fork”和“torch.jit.wait” 完成支援。

[Beta] Fork/Join Parallelism

This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel, or running bidirectional components of recurrent nets in parallel, and allows the ability to unlock the computational power of parallel architectures (e.g. many-core CPUs) for task level parallelism.

Parallel execution of TorchScript programs is enabled through two primitives: torch.jit.fork and torch.jit.wait.

1.6 彈性訓練

PyTorch 1.7

此版本對 DDP 和 RPC 進行了一些的改進，也增加了新特性，幾個大特性包括：

[Stable] TorchElastic now bundled into PyTorch docker image

Torchelastic提供了“torch.distributed.launch”CLI的一個嚴格超集，並新增了容錯和彈性功能。如果使用者對容錯不感興趣，他們可以通過設定“max_restarts=0”來獲得準確的功能/行為，並增加自動分配“RANK”和“MASTER_ADDR”埠的便利性（而不是在“torch.distributed.launch”中手動指定）。

通過將“torchelastic”與PyTorch捆綁在同一docker映像中，使用者可以立即開始試用torchelastic，而無需單獨安裝“torchelastic”。除了方便之外，在現有Kubeflow的分散式PyTorch操作符中新增對彈性引數的支援也是一個很好的選擇。

[Stable] TorchElastic now bundled into PyTorch docker image

Torchelastic offers a strict superset of the current torch.distributed.launch CLI with the added features for fault-tolerance and elasticity. If the user is not be interested in fault-tolerance, they can get the exact functionality/behavior parity by setting max_restarts=0 with the added convenience of auto-assigned RANK and MASTER_ADDR|PORT (versus manually specified in torch.distributed.launch).

By bundling torchelastic in the same docker image as PyTorch, users can start experimenting with torchelastic right-away without having to separately install torchelastic. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow’s distributed PyTorch operators.

[Beta] Support for uneven dataset inputs in DDP

PyTorch 1.7引入了一個新的上下文管理器，與使用“torch.nn.parallel.DistributedDataParallel”進行訓練的模型結合使用，以支援使用跨不同程式的大小不均勻的資料集進行訓練。此功能在使用DDP時提供了更大的靈活性，並防止使用者必須手動確保不同程式中的資料集大小相同。使用此上下文管理器，DDP將自動處理不均勻的資料集大小，這可以防止在訓練結束時出現錯誤或掛起。

[Beta] Support for uneven dataset inputs in DDP

PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using torch.nn.parallel.DistributedDataParallel to enable training with uneven dataset size across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different process. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training.

其他特性包括：

[Beta] NCCL Reliability - Async Error/Timeout Handling
[Beta] TorchScript remote and rpc_sync
[Beta] Distributed optimizer with TorchScript support
[Beta] Enhancements to RPC-based Profiling
[Prototype] Windows support for Distributed Training

1.7 流水線訓練

PyTorch 1.8

此版本加入了一些重大改進，比如：提高NCCL可靠性；流水線並行支撐；RPC profiling；並支援新增梯度壓縮的通訊hook。

其中流水線並行是把 fairscale.nn.Pipe引入進來，其實就是 torchgpipe。

Significant updates and improvements to distributed training including: Improved NCCL reliability; Pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression.

Upstream fairscale.nn.Pipe into PyTorch as torch.distributed.pipeline (#44090)

PyTorch 1.9

主要是

Distributed / TorchElastic 的一些bug fix。
RPC 的重大改進以支援大規模GPU分散式訓練。
在PyTorch Profiler中支援分散式培訓、GPU利用率和SM效率。

研究完歷史之後，我們再看看分散式概述。

0x02 分散式概述

以下主要是基於https://pytorch.org/tutorials/beginner/dist_overview.html 官方文件為基礎，加上自己的理解。

2.1 引論

2.1.1 torch.distributed 包

PyTorch 中的 torch.distributed包對於多程式並行提供了通訊原語，使得這些程式可以在一個或多個計算機上執行的幾個計算節點之間進行通訊。 torch.distributed包的並行方式與multiprocessing ( torch.multiprocessing) 包不同，torch.distributed包支援多個通過網路連線的機器，並且使用者必須為每個程式顯式啟動主訓練指令碼的單獨副本。

在單機且同步模型的情況下，torch.distributed或著 torch.nn.parallel.DistributedDataParallel()包裝器可能仍然比其他資料並行方法（比如torch.nn.DataParallel）具有優勢：

每個程式維護自己的優化器，並在每次迭代中執行一個完整的優化步驟。雖然這可能看起來是多餘的，但由於梯度已經聚集在一起並且是跨程式平均，因此對於每個程式都是相同的，這意味著不需要引數廣播步驟，減少了在節點之間傳輸張量所花費的時間。
每個程式都包含一個獨立的 Python 直譯器，消除了額外的直譯器開銷和“GIL 顛簸”，這些開銷來自單個 Python 程式驅動多個執行執行緒，多個模型副本或多個GPU 的開銷。這對於嚴重依賴 Python 執行時的模型尤其重要，這樣的模型通常具有遞迴層或許多小元件。

從 PyTorch v1.6.0 開始，功能torch.distributed可以分為三個主要元件：

分散式資料並行訓練 (DDP) 是一種廣泛採用的單程式多資料訓練正規化。使用 DDP，模型會在每個程式上覆制，並且每個模型副本都將被提供一組不同的輸入資料樣本。DDP 負責梯度通訊以保持模型副本同步並將其與梯度計算重疊以加速訓練。
基於 RPC 的分散式訓練 (RPC) 旨在支援無法適應資料並行訓練的通用訓練結構，例如分散式管道並行、引數伺服器正規化以及 DDP 與其他訓練正規化的組合。它有助於管理遠端物件生命週期並將 autograd 引擎擴充套件到機器邊界之外。
集體通訊 (c10d) 庫支援跨組內的程式傳送張量。它提供集體通訊 API（例如all_reduce 和 all_gather）和 P2P 通訊 API（例如send 和 isend）。DDP 和 RPC（程式組後端）) 在 v1.6.0 的 c10d 上構建，其中前者使用集體通訊，後者使用 P2P 通訊。通常，開發者不需要直接使用這個原始通訊 API，因為上面的 DDP 和 RPC 特性可以服務於許多分散式訓練場景。但是，在某些用例中，此 API 仍然有用。一個例子是分散式引數平均，應用程式希望在反向傳遞後計算所有模型引數的平均值，而不是使用 DDP 來傳達梯度。這可以將通訊與計算分離，並允許對通訊內容進行更細粒度的控制，但另一方面，它也放棄了 DDP 提供的效能優化。在用PyTorch編寫分散式應用程式顯示了使用 c10d 通訊 API 的示例。
- 通訊方式：torch.distributed 的底層通訊主要使用 Collective Communication (c10d) library 來支援跨組內的程式傳送張量，並主要支援兩種型別的通訊 API：
  - collective communication APIs: Distributed Data-Parallel Training (DDP)
  - P2P communication APIs: RPC-Based Distributed Training (RPC)
這兩種通訊 API 在 PyTorch 中分別對應了兩種分散式訓練方式：Distributed Data-Parallel Training (DDP) 和 RPC-Based Distributed Training (RPC)。

大多數現有文件是為 DDP 或 RPC 編寫的，本文的其餘部分將詳細說明這兩個元件的材料。

2.1.2 知識連結

PyTorch的multiprocessing模組封裝了python原生的multiprocessing模組，在API上百分之百的相容，它也註冊了定製的reducers, 可以使用IPC機制（共享記憶體）來讓不同的程式對同一份資料進行讀寫。但是其工作方式在CUDA上有很多弱點，比如必須規定各種程式的生命週期如何如何，導致CUDA上的multiprocessing經常行為超出預期。

2.2 資料並行訓練

在官方文件中，可以瞭解到，在掌握 torch.distributed 的基礎的前提下，我們可以根據自身機器和任務的具體情況使用不同的分散式或並行訓練方式。PyTorch 為資料並行訓練提供了多種選項。一般來說，應用會從簡單到複雜，從原型到量產。這些應用共同的發展軌跡是：

如果資料和模型可以放在一個 GPU 中，並且不關心訓練速度，就使用單裝置（single-device）訓練。
如果伺服器上有多個 GPU，並且您希望以最少的程式碼更改來加速訓練，那麼可以使用單機多 GPU DataParallel。
如果您想進一步加快訓練速度並願意編寫更多程式碼來設定它，可以使用單機多 GPU DistributedDataParallel。
如果應用程式需要跨機器邊界進行擴充套件，請使用多機 DistributedDataParallel 和啟動指令碼。
如果預期會出現錯誤（例如，OOM）或者資源可以在訓練期間動態加入和離開，則使用torchelastic啟動分散式訓練。

2.3 `torch.nn.DataParallel`

DataParallel 包使用最低程式碼量就可以利用單機多GPU達到並行性。它只需要對應用程式程式碼進行一行更改。教程 Optional: Data Parallelism 展示了一個示例。需要注意的是，雖然DataParallel非常易於使用，但通常不能提供最佳效能。這是因為DataParallel的實現在每個前向傳遞中都會複製模型，並且其單程式多執行緒並行性會受到 GIL 競爭的影響。為了獲得更好的效能，請考慮使用 DistributedDataParallel。

2.4 `torch.nn.parallel.DistributedDataParallel`

與DataParallel相比， DistributedDataParallel 需要多一步設定，即呼叫 init_process_group。DDP 使用多程式並行，因此模型副本之間不存在 GIL 競爭。此外，模型在 DDP 構建時廣播，而不是在每次前向傳播時廣播，這也有助於加快訓練速度。DDP 附帶了多種效能優化技術。如需更深入的解釋，請參閱這篇 DDP 論文(VLDB'20)。

DDP材料如下：

DDP 筆記提供了一個入門示例及其設計和實現的一些簡要說明。如果這是您第一次使用 DDP，請從本文件開始。
Getting Started with Distributed Data Parallel 解釋了 DDP 訓練的一些常見問題，包括不平衡的工作負載、檢查點和多裝置模型。請注意，DDP 可以輕鬆地與單機模型並行最佳實踐教程中描述的單機多裝置模型並行性相結合。
在啟動並配置分散式資料並行應用程式檔案顯示如何使用DDP啟動指令碼。
該 Shard Optimizer States With ZeroRedundancyOptimizer 配方演示了ZeroRedundancyOptimizer 如何有助於減少優化記憶體佔用分散式資料並行訓練。

2.5 TorchElastic

隨著應用程式複雜性和規模的增長，故障恢復成為一項迫切的要求。

有時，在使用 DDP 時不可避免地會遇到 OOM 之類的錯誤，但 DDP 本身無法從這些錯誤中恢復，基本try-except塊也無法工作。這是因為 DDP 要求所有程式以緊密同步的方式執行，並且在不同程式中啟動的所有AllReduce通訊必須匹配。

如果組中的某個程式丟擲 OOM 異常，則很可能會導致不同步（不匹配的 AllReduce操作），從而導致崩潰或掛起。如果您預計訓練期間會發生故障，或者資源可能會動態離開和加入，請使用torchelastic啟動分散式資料並行訓練。

2.6 通用分散式訓練

許多訓練正規化不適合資料並行，例如引數伺服器正規化，分散式管道並行，具有多個觀察者或代理的強化學習應用等。 torch.distributed.rpc目標是支援通用分散式訓練場景。

torch.distributed.rpc包有四大支柱：

RPC支援在遠端工作者上執行給定的函式。
RRef有助於管理遠端物件的生命週期。RRef 註釋中介紹了引用計數協議。
分散式 Autograd 將 autograd 引擎擴充套件到機器邊界之外。有關更多詳細資訊，請參閱分散式 Autograd 設計。
分散式優化器可以自動聯絡所有參與的workers，以使用分散式 autograd 引擎計算的梯度來更新引數。

RPC 教程如下（後續會選擇部分文章進行分析）：

在開始使用分散式RPC框架教程首先使用一個簡單的強化學習（RL）為例來說明RPC和器RRef。然後，它將基本的分散式模型並行應用於 RNN 示例，以展示如何使用分散式 autograd 和分散式優化器。
在使用分散式RPC框架實現一個引數伺服器教程借用 HogWild！訓練的精華，將其應用於非同步引數伺服器 (PS) 訓練應用程式。
使用 RPC的分散式管道並行教程將單機管道並行示例（在單機模型並行最佳實踐中介紹）擴充套件到分散式環境，並展示瞭如何使用 RPC 實現它。
使用非同步執行來實施批量RPC處理教程演示如何使用 @ rpc.functions.async_execution 裝飾器來實現RPC批處理，它可以幫助加快推理和培訓。它使用了上述教程 1 和 2 中使用的類似 RL 和 PS 示例。
將分散式RPC框架相與分散式資料並行結合教程演示瞭如何將DDP與RPC結合起來，這樣可以將分散式資料並行與分散式模型並行相結合訓練模型。