[原始碼解析] 深度學習流水線並行Gpipe(1)---流水線基本實現

羅西的思考發表於2021-08-23

[原始碼解析] 深度學習流水線並行Gpipe(1)---流水線基本實現

[原始碼解析] 深度學習流水線並行Gpipe(1)---流水線基本實現

0x00 摘要

GPipe是一個基於 Lingvo （Lingvo 是 Google 基於 TensorFlow 二次開發的重點針對序列模型的框架）開發的，支援超大規模模型的神經網路訓練並行庫，本文介紹其基本功能和流水線機制。

0x01 概述

1.1 什麼是GPipe

GPipe是一個基於 Lingvo （Lingvo 是 Google 基於 TensorFlow 二次開發的重點針對序列模型的框架 https://github.com/tensorflow/lingvo）開發的，支援超大規模模型的神經網路訓練並行庫，其特點如下：

GPipe 把一個L層的網路，切分成 K個 composite layers。每個composite layer 執行在單獨的TPU core上。
這 K個 core composite layers只能順序執行，但是GPipe 引入了流水並行策略來緩解這個順序執行的效能問題，把 mini-batch細分為多個更小的macro-batch，提高並行程度。
GPipe 還用recomputation這個簡單有效的技巧來降低記憶體，進一步允許訓練更大的模型。

1.2 挑戰

深度學習框架本質上是一個基於張量（Tensor）之間的計算（Operator）表示式所組成的計算圖（Graph）編譯執行引擎，提供了一系列張量的定義、一元操作、二元操作等數學原語，並根據反向傳播演算法（Back Propagation）進行梯度自動求導以及模型更新。在大量資料分批次流入計算圖進行模型訓練之後，使得模型學習到資料中的內在關聯關係，從而獲得對應場景中的“智慧”感知與判斷能力。

DNN訓練的目標是在儘可能短的時間內獲得一個高精度的模型。這一目標可以通過兩個指標來實現：

統計效率，達到預期準確度所需的歷元數；
硬體效率，完成單個曆元所需的時間。達到期望精度水平的總訓練時間只是這兩個指標的乘積；

GPU最主要提供的是兩種資源：計算資源 和 視訊記憶體頻寬資源。所以訓練大型模型有兩個基本挑戰：視訊記憶體效率和計算效率。

深度學習框架效能優化的最終目標是深度學習模型訓練最快，從而使得完成訓練的時間最短，節省模型訓練開發週期和使用者的時間成本。

0x02 並行機制

業界採用一些並行機制達到優化的目的。

2.1 機制分類與權衡

本節以下主要參考如下文章：

Efficient Large-Scale Language Model Training on GPU Clusters

DeepSpeed: Extreme-scale model training for everyone

[譯] DeepSpeed：所有人都能用的超大規模模型訓練工具

PipeDream: Fast and Efficient Pipeline Parallel DNN Training。

在 "Efficient Large-Scale Language Model Training on GPU Clusters" 論文中， NVIDIA 介紹了分散式訓練超大規模模型的三種必須的並行技術：

資料並行（Data Parallelism）
模型並行（Tensor Model Parallelism）
流水並行（Pipeline Model Parallelism）

2.1.1 資料並行

資料並行（Data Parallelism）是最常見的方法。其特點如下：

模型在多個worker機器上覆制，每個GPU都維護模型的完整副本。
輸入資料集可以跨多個gpu進行分割槽。每批輸入的訓練資料都在資料並行的 worker 之間劃分。每個worker處理訓練資料的一個子集。
使用集合通訊原語或引數伺服器定期與其他GPU同步權重。
反向傳播後需要通訊並規約梯度，以保證優化器在各個 worker 上進行相同的更新。即，對單個worker計算的權重更新進行聚合，以獲得反映所有輸入更新的最終權重更新。
每個聚合傳遞的資料量與模型的大小成比例。

資料並行性具有幾個明顯的優勢，包括計算效率高和實現起來工作量小，這使得資料並行訓練在一些流行的具有高計算通訊比的模型上執行良好，但有幾個重要的趨勢威脅著它的有效性：

視訊記憶體效率：資料並行會在所有 worker 之間進行模型和優化器的複製，因此視訊記憶體效率不高。
計算效率：隨著我們提高並行度，每個 worker 執行的計算量是恆定的。資料並行可以在小規模上實現近乎線性擴充套件。但是，在 worker 之間規約梯度的通訊開銷跟模型大小成正相關，所以當模型很大或通訊頻寬很低時，計算效率會受限。而GPU計算能力的快速增長進一步將訓練的瓶頸轉移到跨模型的通訊上。另外，引數同步的頻率影響統計和硬體效率。
伸縮性：不斷增長的模型大小增加了每次聚合的通訊量。事實上，一些廣泛使用的模型足夠大，使得通訊超過了計算時間，限制了伸縮性並支配了總的訓練時間。而且，資料並行的 batch 大小隨 worker 數量提高，而我們往往無法在不影響收斂性的情況下一直增加 batch 大小。

2.1.2 模型並行

模型並行在傳統上用於訓練過程中太大而無法儲存在工作者記憶體或快取中的模型。其特點如下：

模型並行涉及到將模型在worker之間進行劃分，以便每個worker僅對模型引數的一個子集進行評估和更新。這樣就可以分為層間並行和層內模型並行。
層間模型並行會在多個 worker 之間劃分模型的各個層。
層內模型並行把每層的模型引數切分到多個裝置。層內模型並行在有的論文裡被叫做 "Tensor 級別的模型並行" ，是對某一層（如 Linear/Dense Layer 裡的 Variable ）的模型 Tensor 切分，從而將大的模型 Tensor 分成多個相對較小的 Tensor 進行平行計算；
層間值（啟用和梯度）往往是需要跨機器通訊的唯一引數。

就其本質而言，模型並行性的計算和通訊因模型結構而異，因此在實現上有很大的工作量。

然而，即使模型並行能夠訓練非常大的模型，傳統的模型並行也會導致計算資源的嚴重利用率不足，因為它一次只主動使用一個worker（如果每個層被分配給一個worker），或者不能重疊計算和通訊（如果每個層被分割槽）。

視訊記憶體效率：模型並行DNN訓練導致GPU資源的嚴重利用不足。模型並行通過在模型並行 worker 之間劃分啟用視訊記憶體，會根據 worker 數量成比例地減少視訊記憶體使用量。至關重要的是，這是減少單個網路層的啟用視訊記憶體的唯一方法。
計算效率：由於每次前向和反向傳播中都需要額外通訊啟用值，模型並行的計算效率很低。模型並行需要高通訊頻寬，並且不能很好地擴充套件到通訊頻寬受限的節點。此外，每個模型並行worker 都會減少每個通訊階段之間執行的計算量，從而影響計算效率。模型並行性通常與資料並行性結合使用，以在記憶體和計算效率之間進行權衡。
開發效率：跨多個GPU劃分模型的負擔留給了程式設計師，即使對於最有經驗的機器學習實踐者來說，確定如何在工作者中最好地劃分DNN模型也是一項具有挑戰性的任務，這往往會導致額外的效率低下。最近有些工作探索瞭如何使用增強學習來自動確定模型並行性的裝置位置。不幸的是，這樣的線上決策技術是時間和資源密集型的；它們也不能無縫地結合流水線、資料並行和模型並行。

2.1.3 流水線並行

流水並行（Pipeline Model Parallelism）在有的論文裡叫做流水線級別的模型並行，其特點是：

將整個網路分段（stage），不同段在不同的裝置上，前後階段流水分批工作，通過一種“接力”的方式並行。
流水線並行將模型的各層劃分為可以並行處理的階段。當一個階段完成一個 micro-batch 的正向傳遞時，啟用記憶體將被通訊至流水線的下一個階段。類似地，當下一階段完成反向傳播時，將通過管道反向通訊梯度。必須同時計算多個 micro-batch 以確保流水線的各個階段能平行計算。
流水線並行訓練有可能在資料並行性困難時提供較高的DNN訓練效能。特別是，工作人員之間的通訊可以限制在分配給不同工作人員的相鄰層之間的啟用（在前向通道上）和梯度（後向）上。

但是流水線並行依然有一些問題：

視訊記憶體效率：流水線並行減少的視訊記憶體與流水線的階段數成正比，使模型的大小可以隨 worker 的數量線性擴充套件。但是，流水線並行不會減少每一層的啟用函式的視訊記憶體佔用量。此外，每個 worker 必須儲存同時執行的各個 micro-batch 的啟用值。這導致流水線第一階段的啟用記憶體與單個 mirco batch 的總啟用記憶體大致相同。
計算效率：流水線並行具有最低的通訊量，因為它的通訊量只和在各階段邊界的各層的啟用值大小成正比。但是，它不能無限擴充套件。像模型並行一樣，增加流水線大小會減少每個流水線階段的計算量，這會降低計算與通訊的比率。如果要實現好的計算效率，流水線並行還要求其每個階段的計算負載完美的均衡。此外，流水線並行性會在每個 batch 的開始和結束時因為需要重新填充或排空流水線而產生 bubble overhead。
開發效率：DNN的雙向性（正向傳遞後反向傳遞相同的層）使得流水線具有挑戰性，更重要的是，一個簡單的流水線機制引入了過時權重的最新計算，導致最終模型的精度低於資料並行訓練。

2.2 如何使用

給定一個特定的神經網路模型和一批計算資源，從任務到裝置之間的對映有多種方式，但不同的對映方案執行效率不同。哪種方案最優既取決於作業本身的特性，也取決於底層硬體的拓撲。

神經網路由很多區域性計算搭建組成，一般來說，同一個神經網路的不同運算元可能適合不同的並行模式。一個區域性計算是採用資料並行，還是模型並行取決於這個區域性任務的計算傳輸比。比如：

某些運算元（比如卷積）運算引數量很小，但中間結果量大，為了較少傳輸量，所以最划算的方法是資料並行。即對資料進行切分，不同的裝置處理不同的資料，在裝置之間偶爾進行引數同步；
某些運算元，中間計算結果相對於引數量更少，就適合模型並行。
還有一些運算元，網路引數量/中間計算結果都很大，可能採用流水並行（也就是接力的形式）是最優的。
相比於一個運算元只使用單一的並行模式，一個運算元也可以同時使用多樣的並行模式可能進一步地減少傳輸量，譬如在隱藏層比較大的地方，就可能同時對資料矩陣切割以及對模型矩陣切割。

因此，對於每一個任務選擇最優的並行模式是一個非常複雜的問題，需要具體情況具體分析。

0x03 Pytorch 手動指定並行方式

目前已有的深度學習框架，大多數提供了對資料並行的原生支援，但是對模型並行支援的還不完善。如果使用者想要將模型引數分配到不同裝置上，往往會遇到需要人工指定模型切分方式、手工編寫資料通訊邏輯程式碼等問題。

我們就看看 Pytorch 如何手動指定，主要摘錄（翻譯）：

https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

3.1 基礎知識

PyTorch以Tensor為基本單元，更符合演算法工程師寫Python指令碼的直覺，以物件導向的方式進行模型搭建和訓練。對Tensor進行賦值、切片，就像numpy一樣易用。

PyTorch 是單卡視角，一個裝置上的 Tensor、模型指令碼跟另一個裝置上的 Tensor、模型指令碼並無直接關係，對於每個裝置上的模型指令碼都完全對稱的（Mirror）最簡單的資料並行來說，PyTorch 這樣的設計沒有什麼明顯的缺陷。每個裝置上的指令碼執行到相同 batch 的模型更新部分（Optimizer），統一做一次模型同步（AllReduce 操作）就完成了資料並行，這就是 PyTorch 的 DDP（DistributedDataParallel）模組。

但在分散式情況下想把一個Tensor切分到不同機器上，需要手動構建傳輸過程，相當於直接對物理程式設計，所以對分散式使用的門檻更高。

3.2 特點

PyTorch 模型並行將單個模型拆分到不同的GPU上，而不是在每個GPU上覆制整個模型（具體來說，假設模型 m包含10層。如果使用 DataParallel，則每個GPU都具有這10層中每個層的副本，而如果在兩個GPU上使用模型並行時，每個GPU可以託管5層）。

模型並行的高階思想是將模型的不同子網路放置在不同的裝置上，並相應地實現該forward方法以跨裝置移動中間輸出。由於模型的一部分只在任何單個裝置上執行，因此一組裝置可以共同服務於一個更大的模型。

3.3 基本用法

讓我們從包含兩個線性層的玩具模型（toy model）開始。要在兩個GPU上執行此模型，只需將每個線性層放在不同的GPU上，然後移動輸入（input）和中間輸出（intermediate outputs）以匹配層裝置（layer devices）。

import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
  def __init__(self):
    super(ToyModel, self).__init__()
    self.net1 = torch.nn.Linear(10, 10).to('cuda:0')  # 將net1放置在第1個GPU上
    self.relu = torch.nn.ReLU()
    self.net2 = torch.nn.Linear(10, 5).to('cuda:1')   # 將net2放置在第2個GPU上

  def forward(self, x):
    x = self.relu(self.net1(x.to('cuda:0')))
    return self.net2(x.to('cuda:1'))

請注意對於 ToyModel ，除了五個用於將線性層（linear layers）和張量（tensors）放置在適當的裝置上的to(device)呼叫之外，以上內容與在單個GPU上實現該功能非常相似。這是模型中唯一需要更改地方（即to(device) ）。 backward()和 torch.optim 會自動關注梯度（gradients），就好像模型是一個GPU一樣。呼叫損失函式時，只需確保標籤（label）與輸出（output）在同一裝置（on the same device）上。

model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.paraeters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to('cuda:1') # ToyMode 的 output 是在 'cuda:1' 上，此處的 label 也應該置於 'cuda:1' 上
loss_fn(outputs,labels).backward()
optimizer.step()

3.4 將模型並行化應用於現有模組

只需更改幾行，就可以在多個GPU上執行現有的單GPU模組。以下程式碼顯示瞭如何分解 torchvision.models.reset50() 為兩個GPU。思想是從現有 ResNet模組繼承，並在構建過程中將層拆分為兩個GPU。然後，覆蓋 forward方法來縫合兩個子網，通過相應地移動中間輸出。

from torchvision.models.resnet import ResNet, Bottleneck

num_classes = 1000

class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super(ModelParallelResNet50, self).__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)

        self.seq1 = nn.Sequential(
            self.conv1,
            self.bn1,
            self.relu,
            self.maxpool,

            self.layer1,
            self.layer2
        ).to('cuda:0')  # 放置在第1個GPU上

        self.seq2 = nn.Sequential(
            self.layer3,
            self.layer4,
            self.avgpool,
        ).to('cuda:1')  # 放置在第2個GPU上

        self.fc.to('cuda:1')

    def forward(self, x):
        x = self.seq2(self.seq1(x).to('cuda:1'))
        return self.fc(x.view(x.size(0), -1))

對於模型太大而無法放入單個GPU的情況，上述實現解決了該問題。但是，你可能已經注意到，如果模型合適，它（model parallel）將比在單個GPU上執行要慢。這是因為在任何時間點，兩個GPU中只有一個在工作，而另一個在那兒什麼也沒做。在 layer2 和 layer3之間，中間輸出需要從 cuda:0 複製到 cuda:1，這使得效能進一步惡化。

在整個執行過程中，兩個GPU中的一個會處於空閒狀態。為了解決這個問題，有一種選擇是將每個批次進一步劃分為拆分流水線，以便當一個拆分到達第二子網時，可以將下一個拆分饋入第一子網。這樣，兩個連續的拆分可以在兩個GPU上同時執行。

3.5 通過流水線輸入（Pipelining Inputs）加速

在以下實驗中，我們將每批次 120-image 進一步劃分為 20-image 。當PyTorch非同步啟動CUDA操作時，該實現無需生成多個執行緒即可實現併發。

class PipelineParallelResNet50(ModelParallelResNet50):
    def __init__(self, split_size=20, *args, **kwargs):
        super(PipelineParallelResNet50, self).__init__(*args, **kwargs)
        self.split_size = split_size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        s_prev = self.seq1(s_next).to('cuda:1')
        ret = []

        for s_next in splits:
            # A. s_prev runs on cuda:1
            s_prev = self.seq2(s_prev)
            ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

            # B. s_next runs on cuda:0, which can run concurrently with A
            s_prev = self.seq1(s_next).to('cuda:1')

        s_prev = self.seq2(s_prev)
        ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

        return torch.cat(ret)


setup = "model = PipelineParallelResNet50()"
pp_run_times = timeit.repeat(
    stmt, setup, number=1, repeat=num_repeat, globals=globals())
pp_mean, pp_std = np.mean(pp_run_times), np.std(pp_run_times)

plot([mp_mean, rn_mean, pp_mean],
     [mp_std, rn_std, pp_std],
     ['Model Parallel', 'Single GPU', 'Pipelining Model Parallel'],
     'mp_vs_rn_vs_pp.png')

請注意，裝置到裝置的張量複製操作在源裝置和目標裝置上的當前流（current streams）上同步。如果建立多個流，則必須確保複製操作正確同步。在完成複製操作之前寫入源張量或讀取/寫入目標張量可能導致不確定的行為。上面的實現僅在源裝置和目標裝置上都使用預設流，因此沒有必要強制執行其他同步。

0x04 關鍵技術

因為每個模型的並行策略候選集合是指數級的，純手工從中挑出一種合適的並行策略，需要耗費演算法工程師大量的時間以及計算資源，而且演算法工程師需要考慮的相關事宜太多，比如：如何分配記憶體，層之間如何互動，如何減少通訊代價，分割的張量不能破壞原有數學模型，如何確定確定張量shape，如何確定輸入輸出等等。

所以自動並行技術（如何從框架層次自動解決並行策略選擇問題）成為一個研究熱點。

自動並行通過建立代價模型來預測並挑選一個較優的並行策略（暫時無法保證是最優的策略，因為挑出最優的策略是個NP-Hard的問題），有希望將演算法工程師從並行策略的選擇和配置中解放出來。

因此，目前分散式模型訓練有幾個必要並行技術：

流水並行，尤其是如何自動設定流水；
梯度累加；
後向重計算；
1F1B 策略（我們將採用PipeDream分析）；

下面我們結合 Gpipe程式碼看看這些技術如何使用。

0x05 基礎知識 & 支撐系統

5.1 Lingvo框架

5.1.1 核心元件

Lingvo 的核心元件如下：

Models : 一個model是一個抽象collection，包含一個或者多個tasks。Model相當於對於Task的一層wrapper。對於multi-tasks模型，Model將控制哪些variable將在Task之間共享，以及訓練時每個task如何取樣。
Tasks ：一個task就是關於一個完整優化問題的描述，比如圖片分類或者語音識別。包含input generator。
Layers ：一個Layer代表一個包含有可訓練引數的隨機函式。一個Layer可以包含其他的Layer作為孩子。Softmax, LSTM, Attension甚至一個task都是Layer的例子。
Params ：此物件包含了模型的超引數。Layers，Tasks以及models都是通過Params中的specifications來構建的。 Params是層級的，一個物件的params中可以包含其child物件的params。
NestedMap ：一個dictionary結構，用於傳遞資料。程式碼中大部分程式碼中的Python物件要麼是Tensor的例項，要麼就是BaseLayer或者NestedMap的一個子類。

5.1.2 模型定義

在Lingvo中，網路是一個層的巢狀結構。Lingvo中的大多數類都是[Lingvo/core/base_layer.py] BaseLayer 的子類。

Params ：用來配置類，定義了配置所需要的keys，這些keys在物件建立時都應該被定義。Params物件還可以包含用於配置子層的Params物件。每個layer類都會有一個params的classmethod，這個方法將會建立一個新的params物件，並且通過定義的keys來配置這個layer，同時為這些keys給出一些合理的預設值。

Params物件中屬性包括：
- cls: tParams物件關聯的python類。這可以用來構造類的例項；
- name: 該層的名稱；
- dtype: 建立變數時使用的預設資料型別。
__init__ constructor ：所有子層和變數都應該在這裡建立。
CreateVariable ：建立變數的方法。每個Layer負責建立和管理它自己的variable。
CreateChild ：建立子層的方法。
FProp ：所有的layers都有一個FProp() 函式，實現該層的前向傳播，在計算的前向step時將會被呼叫。因為可以在分散式訓練時在不同的裝置上執行，出於效能的考慮，Lingvo通過theta引數訪問variable，而不是通過self.vars或者self.theta。
FPropMeta : 返回該層關於FProp計算的後設資料。其中 meta.flops在得到一個輸入張量時給出估計的floating point operations數目。

對於模型演算法的落地，有兩個指標特別重要：

前向傳播時所需的計算力，它反應了對硬體如GPU效能要求的高低。
引數個數，它反應所佔記憶體大小。

接下來，我們需要看看如何計算模型訓練的記憶體大小，以及如何計算算力（後續流水線並行需要）。

5.2 計算記憶體

5.2.1 總體分析

我們主要參考了 ZeRO: Memory Optimization Towards Training A Trillion Parameter Models 這篇論文的思路。

在模型訓練期間，大部分記憶體被以下三種情況之一消耗：

i）啟用。
ii）OGP狀態，即由優化器狀態、引數梯度和引數本身組成的張量。
iii）臨時緩衝區。

輸入資料所佔用的視訊記憶體其實並不大，這是因為我們往往採用迭代器的方式讀取資料，這意味著我們其實並不是一次性的將所有資料讀入視訊記憶體，而這保證每次輸入所佔用的視訊記憶體與整個網路引數來比是微不足道的。

我們逐一分析。

5.2.2 啟用函式

對於啟用函式，有如下特點：

啟用函式額外消耗的視訊記憶體隨 batch 大小而增加，batch 設定為1的情況下，訓練萬億引數模型就會產生超過 1 TB 的啟用函式用的視訊記憶體。
業界已有方案如論文 Training deep nets with sublinear memory cost ，可以以33%的重新計算開銷為代價，幾乎可以消除啟用所需的所有記憶體。這個技巧叫做 gradient checkpointing, 有時還叫做 rematerialization, re-forward。

5.2.3 OGP狀態

5.2.3.1 模型自身引數

模型自身的引數指的就是各個網路層的 Weight 和Bias，這部分視訊記憶體在模型載入完成之後就會被佔用。另外需要注意到的是，有些層是有引數的，如CNN， RNN；而有些層是無引數的，如啟用層，池化層等。

5.2.3.2 優化器引數

優化器引數指的是模型在優化過程即反向傳播中所產生的引數，這部分引數主要指的就是 dw，即梯度，在SGD中，其大小與引數一樣，因此在優化期間，模型引數所佔用的視訊記憶體會翻倍。

值得注意的是，不同的優化器其所需儲存的優化引數不同，對於 Adam，由於其還需要儲存其餘引數，模型的引數量會在優化區間翻 4 倍。

5.2.3.3 樣例

對於OGP狀態，讓我們以ADAM為例。使用ADAM對具有Ψ個引數的模型進行混合精度訓練。

模型引數：引數的fp16副本，記憶體要求2Ψ位元組。
梯度：梯度的fp16副本，記憶體要求2Ψ位元組。
優化器狀態：引數、Momentum和Variance的fp32副本，記憶體需求分別為4Ψ、4Ψ和4Ψ位元組。讓我們用K來表示優化器狀態的記憶體乘數，也就是說，儲存它們所需的額外記憶體是KΨ位元組。

總的來說，OGP狀態有2Ψ+2Ψ+KΨ=16Ψ位元組（混合精度ADAM的K=12）的記憶體需求。

具體如下：

藍色是引數，橙色是梯度，綠色是優化器狀態。

在記憶體消耗公式中，Ψ表示模型大小（引數個數），K表示優化器狀態的記憶體乘數，Nd表示資料並行度。在這個例子中，我們假設模型大小為Ψ=75億，基於Adam優化器的混合精度訓練，資料並行度為Nd=64（即64個GPU），K=12。

對於一個擁有15億個引數的GPT-2這樣的模型，這導致了至少24gb的記憶體需求，遠遠高於單獨儲存fp16引數所需的3gb記憶體。

5.2.4 臨時緩衝區

臨時緩衝區是用於儲存臨時結果的緩衝區，例如，對於引數為15億的GPT-2模型， fp32緩衝區將需要6GB的記憶體。

5.3 計算算力

5.3.1 背景知識

FLOPS：注意全大寫，是floating point operations per second的縮寫，意指每秒浮點運算次數，理解為計算速度。是一個衡量硬體效能的指標。
FLOPs：注意s小寫，是floating point operations的縮寫（s表複數），意指浮點運算數，理解為計算量。可以用來衡量演算法/模型的複雜度。

前向傳播時所需的計算力就是由FLOPs體現，那麼FLOPs該怎麼計算呢？

我們知道，在一個模型進行前向傳播的時候，會進行卷積、池化、BatchNorm、Relu、Upsample等操作。這些操作的進行都會有其對應的計算力消耗產生，其中，卷積所對應的計算力消耗是所佔比重最高的。所以，我們以卷積操作為例，看看卷積所對應的計算力。

推導過程：卷積層 wx + b 需要計算兩部分，首先考慮前半部分 wx 的計算量：

令 :

k 表示卷積核大小;
c 表示輸入 feature map 的數量;

則對於輸出 feature map 上的單個 Unit 有：

k * k * c 次乘法，以及 k * k * c - 1 次加法

如果輸出 feature map 的解析度是 H * W ，且輸出 o 個 feature map，則輸出 feature map 包含 Unit的總數就是 H * W * o。

因此，該卷積層在計算 wx 時有:

k * k * c * H * W * o 次乘法          --（1）
(k * k * c - 1) * H * W * o 次加法    --（2）

再考慮偏置項 b 包含的計算量：

由於 b 只存在加法運算，輸出 feature map 上的每個 Unit 做一次偏置項加法。因此，該卷積層在計算偏置項時總共包含：

H * W * o 次加法      --（3）

將該卷積層的 wx 和 b 兩部分的計算次數累計起來就有：

式(1) 次乘法:

k * k * c * H * W * o 次乘法

式(2) + 式(3) 次加法:

(k * k * c - 1) * H * W * o  + H * W * o  = k * k * c * H * W * o

可見，式(2) + 式(3) = 式 (1)

因此，對於帶偏置項的卷積層，該層的計算力消耗為：

k * k * c * H * W * o

5.3.2 lingvo中實現

在 Lingvo 之中，具體計算算力，就是通過每個類的 FPropMeta完成，這些方法都是每個類根據自己的特點實現。我們具體找幾個例子看看如何計算FLOPS。

Conv2DLayerNoPadding如下計算：

  @classmethod
  def FPropMeta(cls, p, inputs):
    py_utils.CheckShapes((inputs,))
    b, h, w, c = inputs
    fh, fw, ic, oc = p.filter_shape
    assert ic == c
    sh, sw = p.filter_stride
    if p.padding == 'SAME':
      oh = sympy.ceiling(h / sh)
      ow = sympy.ceiling(w / sw)
    else:
      oh = sympy.ceiling((h - fh + 1) / sh)
      ow = sympy.ceiling((w - fw + 1) / sw)
    flops = b * oh * ow * fh * fw * ic * oc * 2  # mul/add counts as 2 flop.
    outputs = tshape.Shape([b, oh, ow, oc])
    return py_utils.NestedMap(flops=flops, out_shapes=(outputs,))

DropoutLayer 如下計算：

  @classmethod
  def FPropMeta(cls, p, inputs, *args):
    py_utils.CheckShapes((inputs,))
    flops_per_element = 10  # Approximately 10 flops per element.
    return py_utils.NestedMap(
        flops=inputs.num_elements() * flops_per_element, out_shapes=(inputs,))

BatchNormLayer 的 FLOPS 如下計算。

  @classmethod
  def FPropMeta(cls, p, inputs, padding=None):
    py_utils.CheckShapes((inputs,))
    return py_utils.NestedMap(
        flops=inputs.num_elements() * _BN_FLOPS_PER_ELEMENT,
        out_shapes=(inputs,))

ActivationLayer 如下計算：

  @classmethod
  def FPropMeta(cls, p, inputs):
    py_utils.CheckShapes((inputs,))
    return py_utils.NestedMap(
        flops=inputs.num_elements() * GetFlops(p.activation),
        out_shapes=(inputs,))

0x06 流水線

6.1 背景知識

6.1.1 問題點

通訊問題

無論是資料並行還是模型並行，都會在相應的機器之間進行全連線的通訊，當機器數量增大時，通訊開銷和時延會大到難以忍受。

比如引數伺服器使用中，三段式流程如下：Pull weight ---> Compute new weight ---> Push new weight。

如果三段式流程序列的進行通訊和計算，無論這個通訊是快是慢，這個時間開銷都會導致在分散式環境下每個iteration的時間比單機版要長（Ethernet的頻寬高低或者latency大小隻會影響這個時間的長短，但並不能把這個時間降到零）。所以，把通訊和計算重疊（overlap）起來以便 “掩蓋” 通訊時間幾乎是一個必須的步驟。如何重疊計算和傳輸從而提高裝置利用率就非常有挑戰。

無法充分利用

原則上我們可以通過平行計算在GPU或者TPU上訓練更大的DNN模型。但是由於DNN的順序性，這種方法可能導致在計算期間只有一個加速器處於活動狀態，不能充分利用裝置的計算能力。

6.1.2 如何設計系統

回到神經網路的訓練過程上，怎麼設計系統來重疊計算和通訊？

在後向傳播之中有兩個特點可以利用：

首先，神經網路的計算是一層接著一層完成的，不管是前向還是後向傳播，算完本層才能算下一層；
另一方面，在後向傳播的過程中，一旦後一層拿到前一層的輸入，這一層的計算就不再依賴於前一層了。

因此，根據這個特點，人們引入了流水線並行。

資料並行與模型並行都是讓裝置執行同一個層次的計算，流水並行則是把任務劃分為幾個有明確先後順序的階段，把不同的階段分給不同的計算裝置，使得單裝置只負責網路中部分層的計算。模型網路分佈在各個裝置上是非對稱的，各個裝置“接力”執行網路的一部分。每個階段（stage）和下一個階段之間僅有相鄰的某一個 Tensor 資料需要傳輸，每臺機器的資料傳輸量跟總的網路大小、機器總數、並行規模無關。

在這種多裝置接力完成一個網路計算的模式下，可以支援更大的模型或者支援更大的 Batch Size。如果通訊和計算重疊（overlap）好，又可以解決了機器之間的通訊開銷的問題，

總的來說，流水並行在通常大模型訓練情況下具有優勢。流水並行的資料傳輸量少，僅為階段之間需要傳輸的資料量之和，不像資料並行與模型並行那樣大，傳輸量與整個計算圖都有關，因此對於頻寬較小的機器，會趨於使用流水並行。但某些情況下，流水並行與模型並行的結合則會優於單一的模型並行與流水並行。同時，在資料並行與模型並行中也存在計算時間掩蓋傳輸時間的優化。

6.2 Gpipe流水線綜述

6.2.1 關鍵要點

從概念上講，GPipe是一個分散式機器學習庫，它使用同步隨機梯度下降和流水線並行的方式進行訓練，適用於任何由多個有序的層組成的深度神經網路(Deep Neural Networks, DNN)。 Gpipe通過跨不同的加速器來分割模型，並自動將一小批訓練示例劃分成更小的批量。該模型允許GPipe的加速器進行並行操作，最大限度地提高了訓練過程中的可擴充套件性。

GPipe 有幾個關鍵要點:

Network partition（網路分片）：將一個N層的網路劃分成K個partition, 每個partition在單獨的TPU上執行，partition之間需要插入一些網路通訊操作。
Pipeline parallelism（流水線並行）：把CPU裡的流水線併發技術用在了深度學習上，主要是把計算和網路通訊兩種操作，更好地重排列。即自動將mini-batch的訓練樣本分成更小的micro-batch，並在pipeline中執行，使TPU核心能夠並行操作。
Gradient Accumulation（梯度累積）：梯度一直在micro-batch中累積，因此分割槽數量不會影響模型質量。
Re-Materialization（重計算) ：Re-Materialization具體是指在前向計算過程中，GPipe只記錄stage劃分處的輸出，在計算梯度時，GPipe會重新執行前向計算邏輯，從而得到各個運算元的前向結果，然後再計算梯度結果。跟OpenAI開源的gradient-checkpointing一樣，只不過GPipe是在TPU上實現的，OpenAI的只能執行在GPU上。

6.2.2 圖示

下圖左端是原始模型。
右端說明了具有多個有序層的神經網路的GPipe模型被劃分到了四個加速器上。 Fk是kth分割槽的複合正向計算函式。 Bk是其相對應的反向傳播函式。 Bk依賴於來自上層的Bk+1和Fk的中間啟用函式。
- 上面的模型中，我們可以看到網路的順序性是如何導致資源利用不足的。
- 下面的模型則顯示了GPipe方法，在該方法中，輸入的小批量示例被劃分為更小的微批量，這些微批量可以由加速器同時處理。

6.2.3 問題

我們根據論文思路，提出了幾個問題，爭取以後按圖索驥。

如何劃分 stage？
- 將模型劃分為連續的幾個stage，每個stage各自對應一個裝置。這樣就使得模型的大小可以突破單個裝置記憶體的大小，因為一臺裝置只需要能夠容納部分模型的引數和計算；
- 因為劃分了stage，所以整個系統中，處理最慢的stage會成為瓶頸。所以應該平均分配算力。
依據什麼分割做流水？
- 如何將mini-batch進一步劃分成更小的micro-batch，同時利用pipipline方案，每次處理一個micro-batch的資料，得到結果後，將該micro-batch的結果傳送給下游裝置，同時開始處理後一個 micro-batch的資料，通過這套方案減小裝置中的Bubble（裝置空閒的時間稱為 Bubble）。
如何做Re-Materialization？

6.3 依據算力分割Stage

神經網路有一個特點：對不同的輸入，其執行時間相差不大，因此可以預估其算力，時間，引數大小等等。Gpipe就是依據算力對圖進行了分割，從而把不同層分配到不同的裝置上。

6.3.1 PartitionSequentialLayers

PartitionSequentialLayers 把一個包括sequential layers的層分解，目的是讓每一個partition都大致擁有同樣的 flops。最終目的是讓每個 GPU 都擁有儘量同樣的算力。

輸入是：一個layer param 或者一個 layer param 列表；
輸出是：一個 FeatureExtractionLayer params 列表；

邏輯是：

如果params只是一個layer，那麼就把這個layer構建成一個包含sub-layers的列表 subs；
利用 FPropMeta 計算出來這個 subs 列表的shapes和總flops，賦值給了 histo；
利用 histo 計算出來一個層代價（layer's cost）的歸一化累積直方圖；
構建一個parts變數；
- 該變數是一個num_partitions大小的陣列；陣列每個item也是一個陣列；
- 依據直方圖把subs分到parts中的每個item之中，這樣每個parts[i]都擁有部分layers，一些算力小的運算元被合併到一個 part 之中，目的是讓最終 parts 每個item的算力盡量相同；
把parts轉換成一個 FeatureExtractionLayer param 列表；

def PartitionSequentialLayers(params, num_partitions, *shapes):
  r"""Partition a layer composed of sequential layers.

  This routine strives to partition layers so that each partition costs roughly
  the same flops given the input shapes.

  Args:
    params: A layer param or a list of layer param.
    num_partitions: The desired number of partitions.
    *shapes: A tuple of tshape.Shape representing input tensors to the first
      layer.

  Returns:
    A list of FeatureExtractionLayer params.
  """

  # Recursively concatenate SequentialLayer into a list.
  # SequentialLayer 是一個層，其作用是把若干層按順序連線起來
  def FlattenSeq(p):
    if isinstance(p, list): # 已經是列表則返回
      return p
    if p.cls not in [builder_layers.SequentialLayer, FeatureExtractionLayer]:
      return [p.Copy()]
    subs = []
    for _ in range(p.repeat): # 把p包含的所有層都組裝成一個層列表
      for s in p.sub:
        subs += FlattenSeq(s)
    return subs

  # 如果params是一個layer，那麼就依據這個layer，構建一個包含sub-layers的新列表subs，如果是列表則直接返回
  subs = FlattenSeq(params)

  assert len(shapes) == 1
  tf.logging.info('num_partitions: {} input_shape: {}'.format(
      num_partitions, shapes[0]))

  # 利用 FPropMeta 計算出來這個 subs 列表的shapes和總flops，賦值給了 histo
  # Computes the estimate cost for each sub layer.
  # 假設有7個sub-layers，其flops分別是 10，40，30，10，20，50,10
  total, histo, output_shapes = 0, [], []
  for i, s in enumerate(subs):
    s.name = 'cell_%03d' % i
    meta = s.cls.FPropMeta(s, *shapes) # 
    total += meta.flops
    histo.append(total)
    output_shapes.append(meta.out_shapes)
    shapes = meta.out_shapes
  tf.logging.vlog(1, 'len %d histogram = %s', len(subs), histo)
  # 則對應的histo 為：[10，50，80，90，110，160, 170]，total為170

  # 利用 histo 計算出來一個層代價（layer's cost）的歸一化累積直方圖
  # Computes the normalized cumulative histogram of the layer's cost.
  histo_pct = [float(x / total) for x in histo]
  tf.logging.vlog(1, 'cost pct = %s', histo_pct)
  # histo_pct 為 [1/17,5/17,8/17,9/17,11/17,16/17, 1], 
  # 假設 num_partitions = 3

  # 構建一個parts變數，該變數是一個num_partitions大小的陣列，陣列每個item也是一個陣列
  # 依據直方圖把subs分到parts中的每個item之中，這樣每個parts[i]都擁有部分layers，目的是讓最終 parts 每個item的算力盡量相同
  # i-th sub layer is put into partition j, where j is roughly i-th cumulative
  # histogram times num_partitions.

  parts = [[] for _ in range(num_partitions)]
  parts_cost = [0] * num_partitions
  pre_hist_cost = 0
  for i, s in enumerate(subs):
    # 從histogram陣列中找出s對應cost的index，j也就是s對應的partition
    # 對於i,s，則 histo_pct[i] * num_partitions 分別為: [3/17, 15/17, 24/17, 27/17, 33/17, 48/17，3]，j分別為[0,0,1,1,1,2,2]
    j = min(int(histo_pct[i] * num_partitions), num_partitions - 1)
    # The boundary at parts[j] where j > 0
    if j > 0 and not parts[j]:
      parts_cost[j - 1] = histo_pct[i - 1] - pre_hist_cost
      pre_hist_cost = histo_pct[i - 1]
    parts[j].append(s) # 把s加入到對應的partition
    # 三個桶內容分別為:[1,2],[3,4,5],[6,7]
    # 對應每個桶的flops為: [60,280,330]
    
  # 把parts轉換成一個 FeatureExtractionLayer 列表
  parts_cost[num_partitions - 1] = 1.0 - pre_hist_cost
  seqs = []
  for i, pa in enumerate(parts):
    tf.logging.info('Partition %d #subs %d #cost %.3f', i, len(pa),
                         parts_cost[i])
    seqs.append(FeatureExtractionLayer.Params().Set(name='d%d' % i, sub=pa))
  return seqs

6.3.2 FeatureExtractionLayer

上面程式碼中使用了FeatureExtractionLayer，其功能就是返回一個層的序列。

FeatureExtractionLayer 從一個層序列中提取特徵，具體特點是：

把一些層連線成一個序列；
可以得到&傳遞啟用點；

class FeatureExtractionLayer(base_layer.BaseLayer):
  """A layer that extrac features from a sequence of layers.

  FeatureExtractionLayer is a layer which connects a few layers in a sequence.
  It is also capable of fetching and forwarding activation endpoints.
  # TODO(huangyp): Make it a sublayer of builder_layers.SequentialLayer
  """

  @classmethod
  def Params(cls):
    p = super().Params()
    p.Define('variable_name_prefix', '',
             'Prefix for variable names in sub layers')
    p.Define('sub', [], 'A list of layers\' params.')
    p.Define('num_act_inputs', 0, 'Number of activation inputs.')
    p.Define('num_act_outputs', 0, 'Number of activation outputs.')
    p.Define('act_fetch_layers', [],
             'Names of fetch layers that cached extra activations')
    return p

  def __init__(self, params):
    super().__init__(params)
    p = self.params
    assert p.num_act_inputs >= 0
    assert p.num_act_outputs >= 0
    p.act_fetch_layers = p.act_fetch_layers or []
    assert p.num_act_outputs == p.num_act_inputs + len(p.act_fetch_layers)
    self._seq = []
    for sub in p.sub:
      assert sub.name
      sub.name = p.variable_name_prefix + sub.name
      self.CreateChild(sub.name, sub)
      self._seq.append((sub.name, self.children[sub.name])) # 把一些層連線成一個序列

  def FProp(self, theta, *args): # 實現該層的前向傳播，在計算的前向step時將會被呼叫
    p = self.params
    assert len(args) > p.num_act_inputs
    out_args = args[:-p.num_act_inputs] if p.num_act_inputs > 0 else args
    extra_args = args[-p.num_act_inputs:] if p.num_act_inputs > 0 else ()
    for (name, ch) in self._seq:
      th = theta[name]
      out_args = _ToTuple(out_args)
      out_args = ch.FProp(th, *out_args)
    # Append fetched activations to fprop outputs.
    for fetch_layer in p.act_fetch_layers:
      assert fetch_layer in self.children
      activation = self.children[fetch_layer].activation # 子層啟用點
      if isinstance(activation, (tuple, list)):
        activation = activation[0] # 如果是list，得到相應啟用點
      extra_args += (activation,) # 把啟用點新增進來
    if extra_args:
      out_args = _ToTuple(out_args) + extra_args # 最終返回所有啟用點
    return out_args

  @classmethod
  def FPropMeta(cls, p, *args): # 返回該層關於`FProp`計算的後設資料
    assert len(args) > p.num_act_inputs
    seq_args = args[:-p.num_act_inputs] if p.num_act_inputs > 0 else args
    extra_args = args[-p.num_act_inputs:] if p.num_act_inputs > 0 else ()
    total = 0
    act_fetch_metas = {}
    for sub in p.sub:
      meta = sub.cls.FPropMeta(sub, *seq_args)
      if sub.name in p.act_fetch_layers:
        act_fetch_metas[sub.name] = meta.out_shapes[0]
      total += meta.flops
      seq_args = meta.out_shapes
    for fetch_layer in p.act_fetch_layers:
      extra_args += (act_fetch_metas[fetch_layer],)
    return py_utils.NestedMap(flops=total, out_shapes=seq_args + extra_args)

計算過程如下圖，裡面具體數值請參見上面幾段程式碼之中的舉例：

  +--------------+   +--------------+   +---------------+
  |              |   |              |   |               |
  |  sub-layer 1 |   |  sub-layer 2 |   |  sub-layer n  |
  |              |   |              |   |               |
  +-------+------+   +-------+------+   +--------+------+
          |                  |                   |
          |FPropMeta         |FPropMeta          |FPropMeta
          |                  |                   |
          v                  v                   v
       flops 1            flops 2             flops n
          +                  +                   +
          |                  |                   |
          |                  |                   |
          +--------------------------------------+
                             |
                             |
                             v
                  for i, s in enumerate(subs):
                     total += meta.flops
                     histo.append(total)
                  histo=[10,50,80,90,110,160,170]
                             +
                             |
                             |
                             v
Computes the normalized cumulative histogram of the layer's cost
        histo_pct = [float(x / total) for x in histo]
       histo_pct=[1/17,5/17,8/17,9/17,11/17,16/17,1]
                             +
                             |
                             |
                             +
           Assign layers to partition based on histogram
                   [1,2],[3,4,5],[6,7]
                             +
                             |
                             |
                             v
      +----------------------+----------------------------+
      | parts                                             |
      |                                                   |
      | +--------------+  +------------+  +-------------+ |
      | | sub-layer 1  |  |sub-layer 3 |  | sub-layer 6 | |
      | |              |  |            |  |             | |
      | | sub-layer 2  |  |sub-layer 4 |  | sub-layer 7 | |
      | |              |  |            |  |             | |
      | |              |  |sub-layer 5 |  |             | |
      | +--------------+  +------------+  +-------------+ |
      +---------------------------------------------------+

6.4 流水線分配

6.4.1 基礎類 SeqLayer

該層的目的是：用 Round-robin 策略把 cell_tpl之中的每個children cell 部署到工作裝置之上。

Params 包括：

before_tpl ：配置在流水線之前執行的CNN層；
cell_tpl ：FeatureExtractionLayer 列表；

6.4.1.1 初始化

初始化函式的邏輯是：

遍歷before_tpl，對於每個item呼叫CreateChild構建其子層，把item新增到 _before_layers 之中；
遍歷cell_tpl，對於每個item呼叫CreateChild構建其子層，把item新增到 _cells 之中；

  def __init__(self, params):
    super().__init__(params)
    p = self.params
    self._before_layers = []
    self._cells = []
    # 遍歷before_tpl，對於每個item呼叫CreateChild構建其子層，把item新增到 _before_layers 之中
    for l in p.before_tpl:
      self.CreateChild(l.name, l)
      self._before_layers.append((l.name, self.children[l.name]))
    # 遍歷cell_tpl，對於每個item呼叫CreateChild構建其子層，把item新增到 _cells 之中  
    for l in p.cell_tpl:
      self.CreateChild(l.name, l)
      self._cells.append((l.name, self.children[l.name]))

6.4.1.2 _CreateChildrenVariables

構建變數。邏輯如下：

如果使用 tpu，則
- 利用 cluster.WorkerDeviceInModelSplit(0) 來構建 before_tpl_device，即用叢集的第一個裝置作為 before_tpl_device；
- 遍歷叢集的其他裝置，分配給cell_devices；
遍歷 _before_layers，把其中每個變數部署在 before_tpl_device；
遍歷 _cells，把其中每個變數部署在 cell_devices；

  def _CreateChildrenVariables(self):
    p = self.params

    num_cells = len(p.cell_tpl)
    before_tpl_device = ''
    cell_devices = [''] * num_cells
    if py_utils.use_tpu(): # 如果使用 tpu
      # 利用 `cluster.WorkerDeviceInModelSplit(0)` 來構建 before_tpl_device，即用叢集的第一個裝置作為 before_tpl_device
      cluster = self.cluster
      before_tpl_device = cluster.WorkerDeviceInModelSplit(0)
      # 遍歷叢集的其他裝置，分配給cell_devices
      cell_devices = [
          cluster.WorkerDeviceInModelSplit(i) for i in range(num_cells)
      ]

    # 遍歷 _before_layers，把其中每個變數部署在 before_tpl_device
    for unused_name, l in self._before_layers:
      with tf.device(before_tpl_device):
        l.InstantiateVariables()

    # 遍歷 _cells，把其中每個變數部署在 cell_devices
    for i, (unused_name, l) in enumerate(self._cells):
      with tf.device(cell_devices[i]):
        l.InstantiateVariables()

    super()._CreateChildrenVariables()

6.4.1.3 FProp

前向傳播程式碼，具體邏輯如下：

遍歷 _before_layers，對於其中每層呼叫其FProp；
遍歷 _cells，對於其中每層，在cluster.WorkerDeviceInModelSplit(i)之上呼叫其FProp；

  def FProp(self, theta, *args):
    """Round-robin every children cells in cell_tpl among worker devices.

    Args:
      theta: A NestedMap object containing weights' values of this layer and its
        children layers.
      *args: Input args

    Returns:
      A list contains one tensor of [batch_size, feature_height, feature_width,
        channel].
    """
    num_layers = len(self.params.cell_tpl)
    cluster = self.cluster

    # 遍歷 _before_layers，對於其中每層呼叫其FProp	
    for (name, l) in self._before_layers:
      l_theta = theta[name]
      args = _ToTuple(args)
      args = l.FProp(l_theta, *args)
    # 遍歷 _cells，對於其中每層，在`cluster.WorkerDeviceInModelSplit(i)`之上呼叫其FProp  
    for i in range(num_layers):
      with tf.device(cluster.WorkerDeviceInModelSplit(i)):
        cell_name, cell = self._cells[i]
        args = _ToTuple(args)
        args = cell.FProp(theta[cell_name], *args)

    return args

6.4.1.4 具體實現

SeqLayer 全部程式碼如下：

class SeqLayer(base_layer.BaseLayer):
  """Round-robin every children cells in cell_tpl among worker devices."""

  @classmethod
  def Params(cls):
    p = super().Params()
    p.Define('before_tpl', [],
             'Config for the CNN layers that runs before pipelining.')
    p.Define('cell_tpl', [], 'A list of FeatureExtractionLayer layers.')
    return p

  def __init__(self, params):
    super().__init__(params)
    p = self.params
    self._before_layers = []
    self._cells = []
    for l in p.before_tpl:
      self.CreateChild(l.name, l)
      self._before_layers.append((l.name, self.children[l.name]))
    for l in p.cell_tpl:
      self.CreateChild(l.name, l)
      self._cells.append((l.name, self.children[l.name]))

  def _CreateChildrenVariables(self):
    p = self.params

    num_cells = len(p.cell_tpl)
    before_tpl_device = ''
    cell_devices = [''] * num_cells
    if py_utils.use_tpu():
      cluster = self.cluster
      before_tpl_device = cluster.WorkerDeviceInModelSplit(0)
      cell_devices = [
          cluster.WorkerDeviceInModelSplit(i) for i in range(num_cells)
      ]

    for unused_name, l in self._before_layers:
      with tf.device(before_tpl_device):
        l.InstantiateVariables()

    for i, (unused_name, l) in enumerate(self._cells):
      with tf.device(cell_devices[i]):
        l.InstantiateVariables()

    super()._CreateChildrenVariables()

  def FProp(self, theta, *args):
    """Round-robin every children cells in cell_tpl among worker devices.

    Args:
      theta: A NestedMap object containing weights' values of this layer and its
        children layers.
      *args: Input args

    Returns:
      A list contains one tensor of [batch_size, feature_height, feature_width,
        channel].
    """
    num_layers = len(self.params.cell_tpl)
    cluster = self.cluster

    for (name, l) in self._before_layers:
      l_theta = theta[name]
      args = _ToTuple(args)
      args = l.FProp(l_theta, *args)
    for i in range(num_layers):
      with tf.device(cluster.WorkerDeviceInModelSplit(i)):
        cell_name, cell = self._cells[i]
        args = _ToTuple(args)
        args = cell.FProp(theta[cell_name], *args)

    return args

6.4.2 具體分配 PipeliningLayer

PipeliningLayer 是 SeqLayer 的派生類。

在流水線最前面是device[0]，負責處理前置條件。
流水線中間是一系列 device，負責處理具體的 micro batches。
流水線最後是 device[-1]，負責整理形狀，最後輸出一個最終張量。

6.4.2.1 得到中間層輸出形狀

_CalculateOutputShapes 計算出中間層的output shape。具體邏輯如下：

遍歷 _before_layers，對其中每層呼叫其FPropMeta，得到 output shapes，插入 state_shapes 陣列之中；
遍歷 _cells，對其中每層呼叫其FPropMeta，得到 output shapes，插入 state_shapes 陣列之中；

  def _CalculateOutputShapes(self, input_shapes):
    """Calcuate the output shape of intermediate layers.

    Given the FPropMeta function in each FeatureExtractionLayer, calcuates
    the shapes of outputs of that layer. This is used to recover the shape
    information in StackedRecurrent.

    Args:
      input_shapes: NestedMap or tuple of input TensorShapes.

    Returns:
      Return a list of K + 1 NestedMaps or lists of tShape where K is
      the number of partitions.
    """
    p = self.params
    shapes = []

    # Converts TensorShape to tshape.Shape.
    def _ToTShape(x):
      if x is None:
        return None
      return tshape.Shape(x.as_list())

    shapes = py_utils.Transform(_ToTShape, input_shapes)
    shapes = _ToTuple(shapes)

    state_shapes = []
    # 遍歷_before_layers，對其中每層呼叫其FPropMeta，得到 output shapes，插入 state_shapes 陣列之中
    for (_, cell) in self._before_layers:
      shapes = cell.FPropMeta(cell.params, *shapes).out_shapes

    state_shapes.append(shapes[0] if p.nested_map_fprop else shapes)

    # 遍歷 _cells，對其中每層呼叫其FPropMeta，得到 output shapes，插入 state_shapes 陣列之中
    for (_, cell) in self._cells:
      shapes = cell.FPropMeta(cell.params, *shapes).out_shapes
      state_shapes.append(shapes[0] if p.nested_map_fprop else shapes)

    return state_shapes

6.4.2.2 得到資料型別

_get_state_dtype 的作用是得到資料型別。

  def _get_state_dtype(self, *args):
    if self.params.state_dtype:
      return self.params.state_dtype
    if self.params.nested_map_fprop:
      inputs = args[0].Filter(lambda x: x is not None)
      return py_utils.Flatten(inputs)[0].dtype
    return args[0].dtype

6.4.2.3 得到輸入形狀

Gpipe 會首先將一個小批量的訓練樣本（mini-batch）分割成更小的小批量（micro-batches），然後將每組小批量的執行通過管道傳送到單元上。

_get_input_shapes作用是得到輸入的shapes，具體邏輯如下：

從 args 得到輸入 input_tensors；
遍歷 input_tensors，找出第一個不為空的張量，獲取這個張量的 batch size，賦給 mini_batch_size；
從引數中得到 micro_batch_size，設定到 micro_batch_size；
如果 micro_batch_size 沒有意義，則：
- 如果 p.num_micro_batches 大於 mini_batch_size，則 p.num_micro_batches 為 mini_batch_size；
- 把 micro_batch_size 設定為 mini_batch_size // p.num_micro_batches；
建立一個 input_shapes 集合，遍歷 input_tensors，對於每個張量，得到其shapes列表 input_shape，並且設定 input_shape 的 batch_dim 為 micro_batch_size；
如果設定了 p.nested_map_fprop，則把 input_shapes 構建成一個遞迴巢狀的結構；
返回 input_shapes；

  def _get_input_shapes(self, *args):
    p = self.params
    if p.nested_map_fprop:
      assert len(args) == 1
      assert isinstance(args[0], py_utils.NestedMap)
      input_tensors = py_utils.Flatten(args[0])
    else:
      input_tensors = _ToTuple(args)
    
    # 遍歷 input_tensors，找出第一個不為空的張量，獲取這個張量的 batch size，賦給 mini_batch_size
    # Get batch size from the first tensor which is not None.
    mini_batch_size = None
    for input_tensor in input_tensors:
      if input_tensor is not None:
        mini_batch_size = input_tensor.get_shape().as_list()[p.batch_dim]
    assert mini_batch_size is not None
    micro_batch_size = p.micro_batch_size
    
    if not micro_batch_size: # 如果 micro_batch_size 沒有意義
      # 如果 p.num_micro_batches 大於 mini_batch_size，則 p.num_micro_batches 為 mini_batch_size
      if p.num_micro_batches > mini_batch_size:
        p.num_micro_batches = mini_batch_size
      # 把 micro_batch_size 設定為 mini_batch_size // p.num_micro_batches  
      micro_batch_size = mini_batch_size // p.num_micro_batches
    if mini_batch_size is not None:
      if micro_batch_size * p.num_micro_batches != mini_batch_size:
        raise ValueError('micro_batch_size * num_micro_batches != batch_size.')

    # 遍歷 input_tensors，對於每個張量，得到其shapes列表 input_shape，並且設定 input_shape 的 batch_dim 為 micro_batch_size
    input_shapes = ()
    for input_tensor in input_tensors:
      if input_tensor is not None:
        input_shape = input_tensor.get_shape().as_list()
        input_shape[p.batch_dim] = micro_batch_size
        input_shapes += (tf.TensorShape(input_shape),)
      else:
        input_shapes += (None,)

    # 如果設定了 p.nested_map_fprop，則把 input_shapes 構建成一個遞迴巢狀的結構    
    if p.nested_map_fprop:
      input_shapes = py_utils.Pack(args[0], input_shapes)
    return input_shapes

6.4.2.4 FProp

前向傳播函式，用流水線方式在多個裝置上執行多個 children cells。具體邏輯如下：

做一些準備工作，比如：
- 得到 children cell個數；
- 得到叢集；
- 得到輸入shapes，dtypes；
- 利用輸入shapes 計算出輸出shapes；
遍歷處理中間層：
- 對於具體cell，把cell加入到累積層中，對於每個cell，得到對應的function；
- 為後續的 StackedRecurrent 執行設定其初始狀態；
- 把cell_idx對應的裝置加入到devices列表；
為流水線中間（去除頭尾）的各個裝置設定一些變數；
在第一個裝置上執行如下操作：
- 遍歷_before_layers，執行每層的FProp，最終得到 previous;
- 對於 previous 繼續操作，構建出 inputs，即利用_StackAndSplit分割張量;
- 為流水線後續裝置設定其輸入;
在中間裝置上執行recurrent.StackedRecurrent操作 ;
在最後一個裝置上把micro_batches的形狀聚合，最終得到輸出張量:
- 如果巢狀，則返回最後一個形狀；
- 否則遍歷輸出，聚合各個輸出的形狀；

  def FProp(self, theta, *args):
    """Run multiple cells in different devices in a pipelining manner.

    Args:
      theta: A NestedMap object containing weights' values of this layer and its
        children layers.
      *args: Non-keyworded variable length argument list of input tensors.

    Returns:
      A list of output tensors
    """
    # TODO(huangyp): handle optional None inputs.
    p = self.params
    if self.do_eval and self.cluster.num_devices_per_split == 1: # 如果設定了 do_eval 並且叢集的 num_devices_per_split 為 1
      outputs = copy.copy(args)
      # 就直接序列執行
      for (name, l) in self._before_layers + self._cells:
        outputs = _ToTuple(outputs)
        outputs = l.FProp(theta[name], *outputs)
      return outputs

    num_cells = len(p.cell_tpl) # 得到 children cell個數
    cluster = self.cluster # 得到叢集

    # Compute shapes of input and output tensors.
    # 得到 輸入shapes，dtypes
    input_shapes = self._get_input_shapes(*args)
    state_dtype = self._get_state_dtype(*args)
    # 利用 輸入shapes 計算出 輸出shapes
    state_shapes = self._CalculateOutputShapes(input_shapes)
    tf.logging.info('state_shapes={}'.format(state_shapes))

    def GetCellFn(i): # 對於第 i 個層，返回一個對應的函式，這個函式將在 StackedRecurrent 內部執行
      """Get the ith feature extraction layer."""

      def CellFn(theta, state0, inputs):
        """A cell fn is exectued inside of StackedRecurrent."""
        # 沒有深入研究StackedRecurrent，只從此函式看，作用是利用cell.FProp計算輸出，並且得到一個state，其中包括輸出和micro batch tensor
        del state0

        def _FPropInputSetShape(name, t_shape): # 給輸入設定shape
          if t_shape is None:
            return None
          inputs[name].set_shape(t_shape.ToTensorShape().as_list())
          return inputs[name]

        if p.nested_map_fprop:
          # pylint: disable=protected-access
          fprop_inputs = state_shapes[i]._RecursiveMap(_FPropInputSetShape)
          # pylint: enable=protected-access
        else:
          fprop_inputs = []
          for input_idx, input_shape in enumerate(state_shapes[i]):
            name = 's{}'.format(input_idx)
            fprop_inputs.append(_FPropInputSetShape(name, input_shape))

        with py_utils.RemoveAssertContext(remove=True):
          with CellFnFPropOpReplacementWrapper():
            tf.logging.info('cell {} input {}'.format(i, fprop_inputs))
            mb_tensor = inputs[_MICRO_BATCH_STATE_NAME] # 得到輸入的 micro batch tensor
            SetOverWriteGlobalStep(mb_tensor)
            _, cell = self._cells[i]
            fprop_inputs = _ToTuple(fprop_inputs)
            outputs = cell.FProp(theta, *fprop_inputs) # 計算輸出

        if p.nested_map_fprop:
          assert py_utils.IsCompatible(outputs, state_shapes[i + 1])
          state1 = outputs.Filter(lambda x: x is not None)
        else:
          state1 = py_utils.NestedMap()
          outputs = _ToTuple(outputs)
          assert len(outputs) == len(state_shapes[i + 1])
          for output_idx in range(len(outputs)):
            if outputs[output_idx] is not None:
              name = 's{}'.format(output_idx)
              state1[name] = outputs[output_idx]
        state1[_MICRO_BATCH_STATE_NAME] = mb_tensor
        return state1, py_utils.NestedMap()

      return CellFn

    cell_fns = []
    accumulator_layers = [] # 為了梯度累積
    thetas = []
    init_states = []
    devices = []
    # 遍歷，把cell_idx對應的裝置加入到devices列表
    for cell_idx in range(num_cells): # 遍歷 children cell
      cell_name, cell = self._cells[cell_idx] # 得到具體一個 cell
      accumulator_layers.append(cell) # 把cell加入到累積層中
      cell_fns.append(GetCellFn(cell_idx)) # 對於每個cell，得到對應的function
      thetas.append(theta[cell_name]) # 新增 theta

      # 返回一個帶有形狀t_shape的，型別為state_dtype的張量,並且所有元素都設為零.
      def _TfZeros(t_shape):
        if t_shape is None:
          return None
        return tf.zeros(t_shape.ToTensorShape().as_list(), dtype=state_dtype)

      # 為後續的 StackedRecurrent 執行設定其初始狀態
      if p.nested_map_fprop:
        init_state = py_utils.Transform(_TfZeros, state_shapes[cell_idx + 1])
        init_state = init_state.Filter(lambda x: x is not None)
      else:
        init_state = py_utils.NestedMap()
        for output_idx, state in enumerate(state_shapes[cell_idx + 1]):
          state = _TfZeros(state)
          if state is not None:
            name = 's{}'.format(output_idx)
            init_state[name] = state
      init_state[_MICRO_BATCH_STATE_NAME] = tf.cast(0, dtype=state_dtype)
      init_states.append(init_state)

      # 把cell_idx對應的裝置加入到devices列表
      devices.append(cluster.WorkerDeviceInModelSplit(cell_idx))

    # 為流水線中間（去除頭尾）的各個裝置設定一些變數
    cell_grads = [None] * num_cells
    cell_outs = [lambda x: x] * num_cells
    cell_out_grads = [lambda x: x] * num_cells

    # 在第一個裝置上執行如下操作
    with tf.device(devices[0]): 
      previous = _ToTuple(args)
      for (name, l) in self._before_layers: # 遍歷_before_layers，執行每層的FProp，最終得到 previous
        previous = l.FProp(theta[name], *previous)
        previous = _ToTuple(previous)

      def _StackAndSplit(x): # 把張量分割成
        # Split tensors into microbatches.
        if x is None:
          return None
        # tf.split按照行或者列分割一個矩陣
        return tf.stack(tf.split(x, p.num_micro_batches, axis=p.batch_dim))

      # 對於 previous 繼續操作，構建出 inputs，即利用_StackAndSplit分割張量
      if p.nested_map_fprop: # 巢狀情況，只選取previous[0]做處理
        inputs = py_utils.Transform(_StackAndSplit, previous[0]) #利用_StackAndSplit分割張量
        inputs = inputs.Filter(lambda x: x is not None)
      else: # 非巢狀
        inputs = py_utils.NestedMap()
        for output_idx, output_tensor in enumerate(previous): # 遍歷第一層的輸出
          output_tensor = _StackAndSplit(output_tensor) # 利用_StackAndSplit分割張量
          if output_tensor is not None:
            name = 's{}'.format(output_idx)
            inputs[name] = output_tensor
      gs_tensor = py_utils.GetGlobalStep()
      # 為流水線後續裝置設定其輸入
      inputs[_MICRO_BATCH_STATE_NAME] = tf.stack([
          tf.cast(gs_tensor * p.num_micro_batches + t, dtype=state_dtype)
          for t in range(p.num_micro_batches)
      ])
      
    # 在中間裝置上執行操作    
    tf.logging.info('pipeline input = {}'.format(inputs))
    output_state, _ = recurrent.StackedRecurrent( 
        devices=devices,
        cell_fns=cell_fns,
        cell_grads=cell_grads,
        cell_outs=cell_outs,
        cell_out_grads=cell_out_grads,
        thetas=thetas,
        init_states=init_states,
        inputs=inputs,
        accumulator_layers=accumulator_layers,
        unused_acc_state=True)

    # 在最後一個裝置上執行如下操作，最終得到輸出張量
    with tf.device(devices[-1]):
      def _ReshapeRetVal(name, t_shape): # 把micro_batches的形狀聚合，得到最終輸出
        """Restore shape for tensors in microbatches."""
        if t_shape is None:
          return None
        output_tensor = output_state[name]
        if p.batch_dim != 0:
          perm = list(range(1, p.batch_dim + 1)) + [0]
          perm += list(range(p.batch_dim + 1, t_shape.rank + 1))
          output_tensor = tf.transpose(output_tensor, perm=perm)
        output_shape = t_shape.ToTensorShape().as_list()
        output_shape[p.batch_dim] *= p.num_micro_batches
        output_tensor = tf.reshape(output_tensor, output_shape)
        return output_tensor

      # Construct the final return values from output_state.
      if p.nested_map_fprop: # 如果巢狀，則返回最後一個形狀
        # pylint: disable=protected-access
        output_tensors = state_shapes[-1]._RecursiveMap(_ReshapeRetVal) # 聚合形狀
        # pylint: enable=protected-access
      else:
        output_tensors = []
        # 遍歷輸出，聚合各個輸出的形狀
        for output_idx, state_shape in enumerate(state_shapes[-1]): 
          output_name = 's{}'.format(output_idx)
          output_tensor = _ReshapeRetVal(output_name, state_shape) # 聚合形狀
          output_tensors.append(output_tensor)
        if len(output_tensors) == 1:
          output_tensors = output_tensors[0]
        else:
          output_tensors = tuple(output_tensors)
        
      tf.logging.info('pipeline output = {}'.format(output_tensors))
      return output_tensors

6.4.2.5 類定義

具體程式碼如下：

class PipeliningLayer(SeqLayer):
  """Pipelining a sequence of layers on multiple devices."""

  @classmethod
  def Params(cls):
    p = super().Params()
    p.Define('num_micro_batches', 1, 'Number of micro batches.')
    p.Define('micro_batch_size', None, 'Size of a micro batch.')
    p.Define('batch_dim', 0, 'The batch dimension.')
    p.Define('state_dtype', None, 'Externally specify dtype for states.')
    p.Define(
        'nested_map_fprop', False, 'Whether arguments and returns of '
        'cell fprop functions are nested maps')
    return p

具體FProp函式邏輯如下圖：

+--------------------------------------------------------------+
| FProp             _CalculateOutputShapes                     |
|                             +                                |
|                             |                                |
|                             |                                |
|                             v                                |
|                        state_shapes                          |
|                             +                                |
|                             |                                |
|                             |                                |
|                             |                                |
|                             v                                |
|                for cell_idx in range(num_cells):             |
|                             +                                |
|                             |                                |
|                             |                                |
|                             v                                |
|       devices.append(WorkerDeviceInModelSplit(cell_idx))     |
|                             +                                |
|                             |                                |
|                             |                                |
|                             v                                |
|                  with tf.device(devices[0])                  |
|                             +                                |
|                             |                                |
|                             |                                |
|                             v                                |
|             recurrent.StackedRecurrent(cell_outs)            |
|                             +                                |
|                             |                                |
|                             |                                |
|                             v                                |
|                 with tf.device(devices[-1])                  |
|                             +                                |
|                             |                                |
|                             |                                |
|                             v                                |
|                       output_tensors                         |
|                                                              |
+--------------------------------------------------------------+

Device流水線邏輯如下：

                   devices[0]
                       +
                       |
                       |
                       |
                       v
+----------------------+-------------------------+
|Pipeline                                        |
|                         devices[1]             |
|                             +                  |
|                             |                  |
|                             |                  |
|                             v                  |
|  cell_grads[1~n]        devices[2]             |
|                             +                  |
|  cell_outs[1~n]             |                  |
|                             |                  |
|  cell_out_grads[1~n]        v                  |
|                         devices[3]             |
|                             +                  |
|                             |                  |
|                             |                  |
|                             v                  |
|                         devices[4]             |
|                                                |
+----------------------+-------------------------+
                       |
                       |
                       |
                       v
                   devices[-1]

6.4.2.6 使用

原始碼中給出的例子是 GPipeBatchMajorTransformerStack，目前看來，繼承PipeliningLayer即可。

class GPipeBatchMajorTransformerStack(PipeliningLayer):
  """Stacked self- multi-head attention and fully connected layers.

  With optional layer normalization applied to the final output.

  See 'Attention Is All You Need' https://arxiv.org/abs/1706.03762
  for details. 

  Implements a gipe stack for the batch major transformer variant.
  """

GPipeBatchMajorTransformerStack 的 FProp 返回一個輸出張量的列表，其中下面程式碼呼叫了PipeliningLayer的功能。

logits = super().FProp(theta, source_input, source_paddings, target_input,
                       target_paddings, encoder_self_atten_segment_mask,
                       decoder_self_atten_segment_mask,
                       decoder_cross_atten_segment_mask, source_segment_pos,
                       target_segment_pos)

具體程式碼如下：

  def FProp(self,
            theta,
            source_input,
            source_paddings,
            target_input=None,
            target_paddings=None,
            source_segment_id=None,
            target_segment_id=None,
            labels=None,
            label_weights=None,
            source_segment_pos=None,
            target_segment_pos=None):

    p = self.params
    if p.num_decoder_layers > 0:
      assert target_input is not None
      assert target_paddings is not None
      target_time = tf.shape(target_input)[1]
      batch = tf.shape(target_input)[0]
    encoder_self_atten_segment_mask = None
    decoder_self_atten_segment_mask = None
    decoder_cross_atten_segment_mask = None

    # Prepare segment masks from segment ids.
    if p.packed_input:
      dtype = py_utils.FPropDtype(p)
      assert source_segment_id is not None, (
          'Need to specify src_segment_id if packed input is supported.')
      assert source_segment_pos is not None, (
          'Need to specify src_segment_pos for packed input and embeddings.')
      encoder_self_atten_segment_mask = batch_major_attention.SegmentMask(
          source_segment_id, source_segment_id, dtype, False)
      if target_segment_id is not None:
        decoder_self_atten_segment_mask = batch_major_attention.SegmentMask(
            target_segment_id, target_segment_id, dtype, False)
        causal_padding = tf.expand_dims(
            tf.tile(
                tf.expand_dims(
                    batch_major_attention.CausalPadding(
                        target_time, dtype=dtype), 0), [batch, 1, 1]), 1)
        decoder_self_atten_segment_mask = tf.math.maximum(
            causal_padding, decoder_self_atten_segment_mask)
        decoder_cross_atten_segment_mask = batch_major_attention.SegmentMask(
            target_segment_id, source_segment_id, dtype, False)

    # FProp through the gpipe pipeline.
    # 這裡呼叫了基類的PipeliningLayer，完成流水線操作。
    logits = super().FProp(theta, source_input, source_paddings, target_input,
                           target_paddings, encoder_self_atten_segment_mask,
                           decoder_self_atten_segment_mask,
                           decoder_cross_atten_segment_mask, source_segment_pos,
                           target_segment_pos)
            
    label_weights = tf.reshape(label_weights, [-1])
    target_probs = None
    if p.label_smoothing:
      target_probs = self.smoother.FProp(
          theta.smoother, target_paddings, labels, target_ids=None)
      target_probs = tf.reshape(target_probs, [-1, p.softmax_tpl.num_classes])
    reshaped_logits = tf.reshape(logits, [-1, p.softmax_tpl.num_classes])
    tgt_labels = tf.reshape(labels, [-1])
    num_splits = len(p.splits)
    softmax = self.children['cell_{}'.format(num_splits - 1)].softmax
    softmax_theta = theta['cell_{}'.format(num_splits - 1)].softmax
    per_example_xent, _ = softmax.XentLossFromLogits(
        softmax_theta,
        reshaped_logits,
        class_weights=tf.reshape(label_weights, [-1]),
        class_ids=tgt_labels,
        class_probabilities=target_probs)
    xent_shape = tf.shape(logits)[:2]
    per_example_xent = tf.reshape(per_example_xent, xent_shape)
    return per_example_xent, logits