Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Overview
MoE
- Training

Shazeer N., Mirhoseini A., Maziarz K., Davis A., Le Q., Hinton G. and Dean J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR, 2017.

Overview

Mixture-of-Experts (MoE).

MoE

A gating network \(G\) selects which experts to use and how to weight them:

\[y = \sum_{i=1}^n G(x)_i E_i(x), \]
If \(G(x)_i = 0\), we do not need to compute \(E_i(x)\).
\(E_i(x)\) can be an arbitrary network, so the main question is how to design \(G\). If we want to select \(k\) experts, we can use:
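The saving from sparsity can be sketched in a few lines of plain Python. This is only an illustration of the sum above, not the paper's implementation: `moe_output`, `make_expert`, and the toy experts and gate values are hypothetical names and numbers chosen for the example.

```python
def moe_output(x, gates, experts):
    """Compute y = sum_i G(x)_i * E_i(x), skipping experts whose gate is zero."""
    y = None
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # G(x)_i = 0, so E_i(x) is never evaluated
        out = [g * v for v in expert(x)]
        y = out if y is None else [a + b for a, b in zip(y, out)]
    return y  # None if every gate is zero


calls = []  # records which experts actually ran


def make_expert(scale, name):
    # Hypothetical toy expert: scales its input vector by a constant.
    def expert(x):
        calls.append(name)
        return [scale * v for v in x]
    return expert


experts = [make_expert(1.0, "E1"), make_expert(2.0, "E2"), make_expert(3.0, "E3")]
gates = [0.25, 0.0, 0.75]  # a sparse gate vector G(x), made up for illustration
y = moe_output([1.0, 2.0], gates, experts)
print(y, calls)
```

Only `E1` and `E3` are called here; with thousands of experts and a small number of nonzero gates, this skipping is where the computational saving comes from.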

\[G(x) = \text{Softmax}(\text{KeepTopK}(H(x), k)), \\ H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i), \\ \text{KeepTopK}(v, k)_i = \left \{ \begin{array}{ll} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v, \\ -\infty & \text{otherwise}. \end{array} \right . \]
Notably, Gaussian noise is added before the top-\(k\) selection, and the trainable matrix \(W_{noise}\) controls how much noise each component receives. The paper states that this noise term helps with load balancing, though how it does so is not obvious (?).
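A minimal sketch of this gating function, using only the Python standard library. The helper names (`softplus`, `matvec`, `noisy_top_k_gating`) and the toy weights are assumptions made for the example; setting an entry to \(-\infty\) before the softmax is the same as taking the softmax over the top-\(k\) entries only, which is what the code does.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def matvec(x, W):
    # x has length d, W is a d x n nested list; returns x . W (length n).
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]


def noisy_top_k_gating(x, W_g, W_noise, k, rng):
    clean = matvec(x, W_g)
    noise_std = [softplus(z) for z in matvec(x, W_noise)]
    # H(x)_i = clean logit + standard normal noise scaled by Softplus(...)
    h = [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, noise_std)]
    # KeepTopK: indices of the k largest entries of H(x).
    top = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    # Softmax restricted to the kept entries (the rest would be exp(-inf) = 0).
    m = max(h[i] for i in top)
    exps = {i: math.exp(h[i] - m) for i in top}
    total = sum(exps.values())
    gates = [exps[i] / total if i in exps else 0.0 for i in range(len(h))]
    return gates, h


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [1.0, -0.5]
W_g = [[0.2, -0.1, 0.4, 0.0], [0.3, 0.5, -0.2, 0.1]]  # toy weights, 4 experts
W_noise = [[0.0] * 4, [0.0] * 4]                       # Softplus(0) = ln 2 noise scale
gates, h = noisy_top_k_gating(x, W_g, W_noise, k=2, rng=rng)
print(gates)
```

Exactly \(k\) gates are nonzero and they sum to one, so the output can be fed directly into the weighted sum over experts.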

Training

Without any additional constraint on \(G\), a few experts tend to keep receiving large weights, so the paper introduces a soft constraint:

\[L_{importance}(X) = w_{importance} \cdot CV(Importance (X))^2, \\ Importance(X) = \sum_{x \in X} G(x) \]
Here CV is the coefficient of variation, i.e. the standard deviation divided by the mean (not the variance).
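The importance loss can be sketched as follows. The paper does not pin down every detail used here, so treat these as assumptions: the population variance (dividing by \(n\)), the small `eps` guarding against a zero mean, and the value `w_importance = 0.1` are all choices made for the example.

```python
def cv_squared(values, eps=1e-10):
    # Squared coefficient of variation: variance / mean^2.
    # Assumption: population variance and a small eps for numerical safety.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return var / (mean ** 2 + eps)


def importance(gate_batch):
    # Importance(X) = sum over the batch of G(x), one entry per expert.
    n_experts = len(gate_batch[0])
    return [sum(g[i] for g in gate_batch) for i in range(n_experts)]


def importance_loss(gate_batch, w_importance):
    return w_importance * cv_squared(importance(gate_batch))


w_importance = 0.1  # hypothetical weight
balanced = [[1.0, 0.0], [0.0, 1.0]]    # each expert gets the same total gate mass
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # one expert takes everything
loss_balanced = importance_loss(balanced, w_importance)
loss_collapsed = importance_loss(collapsed, w_importance)
print(loss_balanced, loss_collapsed)
```

The loss is zero when all experts have equal importance and grows as the gate mass concentrates on a few experts.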
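Even with this soft constraint, the number of samples each expert receives can still differ widely: one expert \(i\) may receive few samples, each with a large \(G(x)_i\), while another may receive many samples, each with a small \(G(x)_i\). The authors therefore add a second constraint on the selection probabilities.
For a sample \(x\), the probability that expert \(i\) is selected is defined as follows (the definition looks questionable at first, since the threshold uses the already-sampled \(H(x)\) and only the noise on component \(i\) is treated as random):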

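\[P(x, i) = Pr\bigg( (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i) > \mathrm{kth\_excluding}(H(x), k, i) \bigg), \]
where \(\mathrm{kth\_excluding}(v, k, i)\) denotes the \(k\)-th largest component of \(v\), excluding component \(i\).
Since the noise is standard normal and the Softplus scale is positive, moving \((x \cdot W_g)_i\) to the right-hand side and dividing by the scale gives

\[P(x, i) = \Phi\left( \frac{(x \cdot W_g)_i - \mathrm{kth\_excluding}(H(x), k, i)}{ \text{Softplus}((x \cdot W_{noise})_i) } \right), \]
where \(\Phi\) is the CDF of the standard normal distribution.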
Define

\[Load(X)_i = \sum_{x \in X} P(x, i), \]
then

\[L_{load}(X) = w_{load} \cdot \text{CV}(Load(X))^2. \]
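The load loss can be sketched along the same lines. As before, this is a hedged illustration rather than the authors' code: the function names, the population-variance form of CV, the `eps` term, `w_load = 0.1`, and the toy batches (where the sampled \(H(x)\) happens to equal the clean logits) are all assumptions for the example. It requires \(k\) to be smaller than the number of experts.

```python
import math


def normal_cdf(z):
    # Phi(z), the standard normal CDF, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kth_excluding(v, k, i):
    # k-th largest component of v, not counting component i.
    others = [v[j] for j in range(len(v)) if j != i]
    return sorted(others, reverse=True)[k - 1]


def selection_prob(clean, noise_std, h, k, i):
    # P(x, i) = Phi((clean_i - kth_excluding(H(x), k, i)) / noise_std_i)
    return normal_cdf((clean[i] - kth_excluding(h, k, i)) / noise_std[i])


def load_loss(batch, k, w_load, eps=1e-10):
    # batch: list of (clean logits, noise scales, sampled H(x)) triples.
    n = len(batch[0][0])
    load = [sum(selection_prob(c, s, h, k, i) for c, s, h in batch) for i in range(n)]
    mean = sum(load) / n
    var = sum((v - mean) ** 2 for v in load) / n  # assumption: population variance
    return w_load * var / (mean ** 2 + eps), load


w_load = 0.1  # hypothetical weight
even = [([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])] * 2
skewed = [([5.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 0.0, 0.0])] * 2
loss_even, load_even = load_loss(even, k=1, w_load=w_load)
loss_skewed, load_skewed = load_loss(skewed, k=1, w_load=w_load)
print(loss_even, loss_skewed)
```

Unlike a hard count of assigned samples, \(Load(X)\) is a smooth function of \(W_g\) and \(W_{noise}\), which is what makes it usable as a training loss.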
