Alink Notes (15): Iterative Optimization of the Multilayer Perceptron

Posted by 羅西的思考 on 2020-07-29

Original URL: https://www.cnblogs.com/rossiXYZ/p/13399527.html

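0x00 Abstract

Alink is a new-generation machine learning algorithm platform that Alibaba built on the real-time computing engine Flink, and the first machine learning platform in the industry to support both batch and streaming algorithms. This article and the previous one walk through how Alink implements the multilayer perceptron.

Because public material on Alink is scarce, what follows is my own inference from reading the code and will certainly contain omissions and mistakes. Please point them out, and I will keep updating the article.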

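0x01 Recap of the Previous Article

The previous article, Alink Notes (14): Overall Architecture of the Multilayer Perceptron, introduced the concept of the multilayer perceptron and its overall architecture in Alink. This article turns to how the network is optimized.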

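1.1 Basic Concepts

Let us review the basic concepts once more:

Neuron input: similar to linear regression, z = w1x1 + w2x2 + ⋯ + wnxn = wT・x (a linear threshold unit, LTU).
Neuron output: an activation function, similar to binary classification, which models the fact that a biological neuron has only two states, firing and inhibited. A bias is added, and the output-layer node with the largest value decides which class is output.
Weights are adjusted following the Hebb rule: the next adjustment depends on the current weights and on how well training is going.
How the weights are learned automatically: backpropagation. First run a forward pass and compute an error from the current output. Then use that error as the guiding signal to adjust the weights feeding the previous layer so that they fall into a reasonable range, and repeat this back to the first layer. Each round of adjustment is scaled by a learning rate, and as the rounds accumulate the network fits better and better.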

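1.2 The Error Backpropagation Algorithm

Training a feedforward neural network with error backpropagation (BP) can be divided into three steps:

Forward pass: compute the net input z(l) and the activation a(l) of every layer, up to the last layer;
Backward pass: take the difference between the network response and the target output for the training input to obtain the response errors of the hidden and output layers, then propagate the error backwards to compute the error term δ(l) of every layer;
Weight update: for the weight on each synapse, multiply the input activation by the response error to obtain the gradient of that weight, then multiply the gradient by a ratio, negate it, and add it to the weight. In other words, compute the partial derivatives of each layer's parameters and update the parameters.

This ratio affects both the speed and the quality of training, so it is called the "training factor" (the learning rate). The gradient points in the direction in which the error grows, so it must be negated when updating a weight in order to reduce the error that the weight causes.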
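
The update rule of the third step can be sketched as follows. This is a minimal illustration for a single linear neuron and is not Alink code; the class and method names (WeightUpdateSketch, update) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of the update rule; not part of Alink.
class WeightUpdateSketch {
    // gradient_i = input_i * error; each weight moves against its gradient,
    // scaled by the learning rate (the "training factor").
    static double[] update(double[] weights, double[] input, double error, double learningRate) {
        double[] next = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            double gradient = input[i] * error;
            next[i] = weights[i] - learningRate * gradient;
        }
        return next;
    }

    public static void main(String[] args) {
        double[] next = update(new double[]{0.5, 0.5}, new double[]{1.0, 2.0}, 0.5, 0.5);
        System.out.println(Arrays.toString(next));
    }
}
```

The input with the larger activation receives the larger correction, which is exactly what "multiply the input activation by the response error" expresses.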

1.3 Overall Logic

Let us review the overall logic.

[Figure: overall logic (image not included)]

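0x02 Training the Neural Network

This part is the centerpiece, so it is pulled out of the overall logic and discussed on its own.

The function FeedForwardTrainer.train carries out the training of the neural network: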

DataSet<DenseVector> weights = trainer.train(trainData, getParams());

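Its logic is roughly as follows:

1) Initialize the model: initCoef = initModel(data, this.topology);
2) Stack the training data into vectors: trainData = stack(). This is presumably done to reduce the amount of data transferred. Note that it converts the input from the 2-tuple <label index, vector> into the 3-tuple Tuple3.of((double) count, 0., stacked);
3) Create the optimization objective function: new AnnObjFunc(topology...);
4) Build the trainer FeedForwardTrainer;
5) The trainer builds an optimizer from the objective function, Optimizer optimizer = new Lbfgs, that is, it uses L-BFGS to train the neural network.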

2.1 Initializing the Model

The code related to model initialization is:

DataSet<DenseVector> initCoef = initModel(data, this.topology); // randomly initialize the weight coefficients
optimizer.initCoefWith(initCoef);

public void initCoefWith(DataSet<DenseVector> initCoef) {
	this.coefVec = initCoef; // simply stored in an internal field
}

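This is worth comparing with linear regression:

In linear regression with two features, the initial coefficient is coef = {DenseVector} "0.001 0.0 0.0 0.0", where the four values are "weight, b, w1, w2".
In the multilayer perceptron, the initial coefficient is a single DenseVector that holds the coefficients of all layers.

initModel mainly initializes the weight coefficients at random. The key part is the call to topology.getWeightSize().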

private DataSet<DenseVector> initModel(DataSet<?> inputRel, final Topology topology) {
        if (initialWeights != null) {
          ......
        } else {
            return BatchOperator.getExecutionEnvironmentFromDataSets(inputRel).fromElements(0)
                .map(new RichMapFunction<Integer, DenseVector>() {
                    ......
                    @Override
                    public DenseVector map(Integer value) throws Exception {
                        // The key step: obtain the total number of coefficients
                        DenseVector weights = DenseVector.zeros(topology.getWeightSize());
                        
                        for (int i = 0; i < weights.size(); i++) {
                            weights.set(i, random.nextGaussian() * initStdev); // random initialization
                        }
                        return weights;
                    }
                })
                .name("init_weights");
        }
}

topology.getWeightSize() resolves to the FeedForwardTopology implementation, which iterates over the layers and sums their coefficient counts.

@Override
public int getWeightSize() {
        int s = 0;
        for (Layer layer : layers) { // iterate over the layers
            s += layer.getWeightSize();
        }
        return s;
}

The coefficient count of AffineLayer is:

@Override
public int getWeightSize() {
		return numIn * numOut + numOut;
}

FuntionalLayer has no coefficients:

@Override
public int getWeightSize() {
	return 0;
}

SoftmaxLayerWithCrossEntropyLoss has no coefficients:

@Override
public int getWeightSize() {
	return 0;
}

Recall the topology used in this example:

this = {FeedForwardTopology@4951} 
 layers = {ArrayList@4944}  size = 4
      0 = {AffineLayer@4947} // affine layer
       numIn = 4
       numOut = 5
      1 = {FuntionalLayer@4948} 
       activationFunction = {SigmoidFunction@4953}  // activation function
      2 = {AffineLayer@4949} // affine layer
       numIn = 5
       numOut = 3
      3 = {SoftmaxLayerWithCrossEntropyLoss@4950}  // softmax output layer with cross-entropy loss

So the final DenseVector has the combined size of the two affine layers, each contributing numIn * numOut + numOut: (4 * 5 + 5) + (5 * 3 + 3) = 25 + 18 = 43.
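
The arithmetic can be checked with a short sketch. It reuses the formula from AffineLayer.getWeightSize shown above; the class WeightSizeSketch and its method names are hypothetical and not part of Alink.

```java
// Hypothetical helper that reproduces the weight-size arithmetic; not Alink code.
class WeightSizeSketch {
    // Same formula as AffineLayer.getWeightSize: a numIn x numOut matrix plus numOut biases.
    static int affineWeightSize(int numIn, int numOut) {
        return numIn * numOut + numOut;
    }

    // layerSizes is the array passed to setLayers, e.g. {4, 5, 3}.
    // Only the affine layers between consecutive sizes carry coefficients.
    static int totalWeightSize(int[] layerSizes) {
        int total = 0;
        for (int i = 0; i + 1 < layerSizes.length; i++) {
            total += affineWeightSize(layerSizes[i], layerSizes[i + 1]);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalWeightSize(new int[]{4, 5, 3})); // prints 43
    }
}
```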

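2.2 Stacking the Data

Here the input 2-tuples <label index, vector> are converted and stacked into a DenseVector. The purpose is presumably to reduce the amount of data transferred.

If the input is: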

batchData = {ArrayList@11527}  size = 64
 0 = {Tuple2@11536} "(2.0,6.5 2.8 4.6 1.5)"
  f0 = {Double@11572} 2.0
  f1 = {DenseVector@11573} "6.5 2.8 4.6 1.5"
 1 = {Tuple2@11537} "(1.0,6.1 3.0 4.9 1.8)"
  f0 = {Double@11575} 1.0
  f1 = {DenseVector@11576} "6.1 3.0 4.9 1.8"
.....

then the resulting stacked data is:

batch = {Tuple3@11532} 
 f0 = {Double@11578} 18.0
 f1 = {Double@11579} 0.0
 f2 = {DenseVector@11580} "6.5 2.8 4.6 1.5 6.1 3.0 4.9 1.8 7.3 2.9 6.3 1.8 5.7 2.8 4.5 1.3 6.4 2.8 5.6 2.1 6.7 2.5 5.8 1.8 6.3 3.3 4.7 1.6 7.2 3.6 6.1 2.5 7.2 3.0 5.8 1.6 4.9 2.4 3.3 1.0 7.4 2.8 6.1 1.9 6.5 3.2 5.1 2.0 6.6 2.9 4.6 1.3 7.9 3.8 6.4 2.0 5.2 2.7 3.9 1.4 6.4 2.7 5.3 1.9 6.8 3.0 5.5 2.1 5.7 2.5 5.0 2.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 2.0 1.0 1.0 1.0"

The code is:

static DataSet<Tuple3<Double, Double, Vector>>
    stack(DataSet<Tuple2<Double, DenseVector>> data ...) {
    
        return data
            .mapPartition(new MapPartitionFunction<Tuple2<Double, DenseVector>, Tuple3<Double, Double, Vector>>() {
                @Override
                public void mapPartition(Iterable<Tuple2<Double, DenseVector>> samples,
                                         Collector<Tuple3<Double, Double, Vector>> out) throws Exception {
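                    List<Tuple2<Double, DenseVector>> batchData = new ArrayList<>(batchSize);
                    // Pre-fill the list so that set(cnt, ...) below has slots to overwrite
                    for (int i = 0; i < batchSize; i++) {
                        batchData.add(null);
                    }
                    int cnt = 0;
                    Stacker stacker = new Stacker(inputSize, outputSize, onehot);
                    for (Tuple2<Double, DenseVector> sample : samples) {
                        batchData.set(cnt, sample);
                        cnt++;
                        if (cnt >= batchSize) { // the batch is full, so emit it right away
                            // stack the feature vectors and labels of batchData into one DenseVector
                            Tuple3<Double, Double, Vector> batch = stacker.stack(batchData, cnt);
                            out.collect(batch);
                            cnt = 0;
                        }
                    }

                    // Emit the last, partially filled batch if there is one.
                    // cnt is its size; in this example it is 19, which is where the
                    // matrix dimension 19 seen later comes from.
                    if (cnt > 0) {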
                        Tuple3<Double, Double, Vector> batch = stacker.stack(batchData, cnt);
                        out.collect(batch); 
                    }
                }
            })
            .name("stack_data");
}
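
The layout visible in the debugger dump above (all feature values sample by sample, followed by all labels) can be sketched with plain arrays. This is an assumption-laden illustration rather than Alink's Stacker: it ignores the onehot option, and the names StackSketch and stack are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of the stacked layout seen in the dump; not Alink's Stacker.
class StackSketch {
    // Result layout: features of sample 0, features of sample 1, ..., then all labels.
    static double[] stack(double[][] features, double[] labels) {
        int n = labels.length;
        int dim = n == 0 ? 0 : features[0].length;
        double[] stacked = new double[n * dim + n];
        for (int i = 0; i < n; i++) {
            System.arraycopy(features[i], 0, stacked, i * dim, dim);
        }
        System.arraycopy(labels, 0, stacked, n * dim, n);
        return stacked;
    }

    public static void main(String[] args) {
        double[][] features = {{6.5, 2.8, 4.6, 1.5}, {6.1, 3.0, 4.9, 1.8}};
        double[] labels = {2.0, 1.0};
        System.out.println(Arrays.toString(stack(features, labels)));
    }
}
```

One flat vector per batch means one record per batch travels through Flink instead of one record per sample.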

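2.3 Creating the Optimization Objective Function

Recall the distinction between loss function and objective function:

Loss function: the error on a single sample;
Cost function: the average of the errors over all samples in the training set; the term is often used interchangeably with loss function;
Objective function: cost function + regularization term.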
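
The three terms can be told apart in a few lines. The sketch below assumes an L2 penalty of the form lambda * sum(w^2) as the regularization term; that choice and the names ObjectiveSketch, cost and objective are illustrative assumptions, not what Alink necessarily uses.

```java
// Hypothetical illustration of loss vs. cost vs. objective; not Alink code.
class ObjectiveSketch {
    // Cost: the mean of the per-sample losses.
    static double cost(double[] sampleLosses) {
        double sum = 0.0;
        for (double loss : sampleLosses) {
            sum += loss;
        }
        return sum / sampleLosses.length;
    }

    // Objective: cost plus a regularization term (assumed here: lambda * sum of squared weights).
    static double objective(double[] sampleLosses, double[] weights, double lambda) {
        double penalty = 0.0;
        for (double w : weights) {
            penalty += w * w;
        }
        return cost(sampleLosses) + lambda * penalty;
    }

    public static void main(String[] args) {
        double[] losses = {1.0, 3.0};   // loss of each sample
        double[] weights = {1.0, 2.0};
        System.out.println(cost(losses));
        System.out.println(objective(losses, weights, 0.5));
    }
}
```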

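The multilayer perceptron creates its objective function as follows:

final AnnObjFunc annObjFunc = new AnnObjFunc(topology, inputSize, outputSize, onehotLabel, optimizationParams);

AnnObjFunc is the objective function used to optimize the multilayer perceptron. Its fields are:

topology is the topology of the neural network;
stacker stacks and unstacks data (L-BFGS operates on vectors, so data has to be converted back and forth between matrices and vectors);
topologyModel is the computation model.

As the code shows, each AnnObjFunc API call checks whether AnnObjFunc.topologyModel has been set and creates it if not.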

public class AnnObjFunc extends OptimObjFunc {
    private Topology topology;
    private Stacker stacker;
    private transient TopologyModel topologyModel = null;
    
    @Override
    protected double calcLoss(Tuple3<Double, Double, Vector> labledVector, DenseVector coefVector) {
        // create topologyModel if it does not exist yet, otherwise reset it with the new coefficients
        if (topologyModel == null) {
            topologyModel = topology.getModel(coefVector);
        } else {
            topologyModel.resetModel(coefVector);
        }
        Tuple2<DenseMatrix, DenseMatrix> unstacked = stacker.unstack(labledVector);
        return topologyModel.computeGradient(unstacked.f0, unstacked.f1, null);
    }

    @Override
    protected void updateGradient(Tuple3<Double, Double, Vector> labledVector, DenseVector coefVector, DenseVector updateGrad) {
        // create topologyModel if it does not exist yet, otherwise reset it with the new coefficients
        if (topologyModel == null) {
            topologyModel = topology.getModel(coefVector);
        } else {
            topologyModel.resetModel(coefVector);
        }
        Tuple2<DenseMatrix, DenseMatrix> unstacked = stacker.unstack(labledVector);
        topologyModel.computeGradient(unstacked.f0, unstacked.f1, updateGrad);
    }
}

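2.4 Creating the Topology Model inside the Objective Function

As noted above, the topology model is created lazily: when an AnnObjFunc API is called, it checks whether AnnObjFunc.topologyModel has been set and creates it if not. The model is built from the layers of FeedForwardTopology.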

public TopologyModel getModel(DenseVector weights) {
	FeedForwardModel feedForwardModel = new FeedForwardModel(this.layers);
	feedForwardModel.resetModel(weights);
	return feedForwardModel;
}

The topology model is defined as follows; it holds both the layers and the layer models.

public class FeedForwardModel extends TopologyModel {
    private List<Layer> layers; // the layers
    private List<LayerModel> layerModels; // the layer models
    /**
     * Buffers of outputs of each layers.
     */
    private transient List<DenseMatrix> outputs = null;
    /**
     * Buffers of deltas of each layers.
     */
    private transient List<DenseMatrix> deltas = null;
    
    public FeedForwardModel(List<Layer> layers) {
        this.layers = layers;
        this.layerModels = new ArrayList<>(layers.size());
        for (int i = 0; i < layers.size(); i++) {
            layerModels.add(layers.get(i).createModel());
        }
    }    
}

A model is created for each layer; for example, the affine layer does:

public LayerModel createModel() {
	return new AffineLayerModel(this);
}

Let us look at the model of each layer.

2.4.1 AffineLayerModel

It is defined as follows. Its functions such as eval, computePrevDelta and grad are covered later.

public class AffineLayerModel extends LayerModel {
    private DenseMatrix w;
    private DenseVector b;

    // buffer for holding gradw and gradb
    private DenseMatrix gradw;
    private DenseVector gradb;

    private transient DenseVector ones = null;
}

2.4.2 FuntionalLayerModel

It is defined as follows. Its functions such as eval, computePrevDelta and grad are covered later.

public class FuntionalLayerModel extends LayerModel {
    private FuntionalLayer layer;
}

2.4.3 SoftmaxLayerModelWithCrossEntropyLoss

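This one is for the final output layer.

The SoftmaxWithLoss layer consists of two parts: softmax and cross-entropy.

For an input vector z, the k-th component of the softmax output is:

\[a_k = \mathrm{Softmax}(z)_k = \frac{e^{z_k}}{\sum_i e^{z_i}} \]

The cross-entropy loss function is:

\[Loss = - \sum_i y_i \ln(a_i) \]

Differentiating the loss with respect to the softmax input z gives:

\[\delta_i^{(L)} = \frac{\partial Loss}{\partial z_i} = a_i - y_i \]

The code below is essentially a direct implementation of these formulas.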

public class SoftmaxLayerModelWithCrossEntropyLoss extends LayerModel
    implements AnnLossFunction {		
    @Override
    public double loss(DenseMatrix output, DenseMatrix target, DenseMatrix delta) {
        int batchSize = output.numRows();
        MatVecOp.apply(output, target, delta, (o, t) -> t * Math.log(o));
        double loss = -(1.0 / batchSize) * delta.sum();
        MatVecOp.apply(output, target, delta, (o, t) -> o - t);
        return loss;
    }
}
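
To see the two passes of loss at work without Alink's DenseMatrix and MatVecOp, here is a minimal sketch on plain arrays. It assumes one-hot targets and already-computed softmax outputs; the class SoftmaxLossSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical plain-array version of the loss computation; not Alink code.
class SoftmaxLossSketch {
    // output: softmax probabilities [batch][classes]; target: one-hot labels.
    // Writes (output - target) into delta and returns the mean cross-entropy.
    static double loss(double[][] output, double[][] target, double[][] delta) {
        int batchSize = output.length;
        double sum = 0.0;
        for (int i = 0; i < batchSize; i++) {
            for (int j = 0; j < output[i].length; j++) {
                sum += target[i][j] * Math.log(output[i][j]);
                delta[i][j] = output[i][j] - target[i][j];
            }
        }
        return -sum / batchSize;
    }

    public static void main(String[] args) {
        double[][] output = {{0.5, 0.25, 0.25}};
        double[][] target = {{1.0, 0.0, 0.0}};
        double[][] delta = new double[1][3];
        System.out.println(loss(output, target, delta)); // -ln(0.5), about 0.693
        System.out.println(Arrays.toString(delta[0]));
    }
}
```

The delta written here is the a - y of the last formula, which is why the output layer needs no separate computePrevDelta step.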

2.4.4 The Final Model

The final model is:

this = {FeedForwardModel@10575} 

     layers = {ArrayList@10576}  size = 4
          0 = {AffineLayer@10601} 
          1 = {FuntionalLayer@10335} 
          2 = {AffineLayer@10602} 
          3 = {SoftmaxLayerWithCrossEntropyLoss@10603} 

     layerModels = {ArrayList@10579}  size = 4
          0 = {AffineLayerModel@10581} 
          1 = {FuntionalLayerModel@10433} 
          2 = {AffineLayerModel@10582} 
          3 = {SoftmaxLayerModelWithCrossEntropyLoss@10583} 

     outputs = null
     deltas = null

Shown schematically (the layers inside FeedForwardModel are omitted):

[Figure: schematic of the FeedForwardModel (image not included)]

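2.5 Creating the Optimizer

First, a recap of the concepts.

The most naive way to optimize an objective function is exhaustive search: iterate over all possible weight values and pick the set that minimizes the loss. This ends up in deeply nested loops and is far too slow, which is why optimizers exist. The purpose of an optimizer is to find the optimum of the objective function quickly.

The goal of training is to find the most suitable set of weights, that is, to make the loss function as low as possible. The optimizer (optimization algorithm) minimizes the loss function by updating the weights and biases after every training epoch or iteration until the loss converges, which for a neural network generally means a local minimum rather than a guaranteed global one.

To find the minimum of the loss function, we start from some initial weights and the loss computed for them. To reach the lowest point, two questions have to be answered: in which direction to move, and how far.

Because we want the loss to decrease as fast as possible, we look at the slope of the loss function at the current point. This is even clearer for a loss surface in three dimensions: at a point on the surface there are infinitely many directions to move in, and the gradient is the one along which the function changes fastest, that is, the direction of steepest slope. The gradient has a direction, and the negative gradient is the direction of steepest descent.

How does gradient descent proceed? We have chosen the negative gradient as the direction, so only the question of distance remains. We introduce a learning rate, and the distance moved is the gradient magnitude times the learning rate. Because the gradient shrinks as optimization proceeds, the steps become shorter and shorter. The learning rate must not be too large, or the update may overshoot the minimum and oscillate or even diverge; nor too small, or gradient descent may be very slow.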
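
The effect of the learning rate is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2 (gradient 2w); the function, the class LearningRateSketch and the chosen rates are illustrative assumptions and have nothing to do with Alink's optimizer.

```java
// Hypothetical toy example of gradient descent on f(w) = w^2; not Alink code.
class LearningRateSketch {
    static double descend(double start, double learningRate, int steps) {
        double w = start;
        for (int i = 0; i < steps; i++) {
            double gradient = 2.0 * w;     // derivative of w^2
            w -= learningRate * gradient;  // move against the gradient
        }
        return w;
    }

    public static void main(String[] args) {
        // Each step multiplies w by (1 - 2 * learningRate).
        System.out.println(descend(1.0, 0.1, 50));   // factor 0.8: shrinks towards 0
        System.out.println(descend(1.0, 0.001, 50)); // factor 0.998: barely moves
        System.out.println(descend(1.0, 1.1, 50));   // factor -1.2: overshoots and diverges
    }
}
```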

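There are two kinds of optimization algorithms:

First-order optimization algorithms. These use the gradient with respect to the parameters to minimize or maximize the cost function. The first derivative tells us whether the function is decreasing or increasing at a point; in short, it gives a line tangent to the error surface.
Second-order optimization algorithms. These use second derivatives, collected in the Hessian matrix, to minimize the cost function. Because second derivatives are expensive to compute, they are not used often. The second derivative tells us whether the first derivative is increasing or decreasing, which indicates the curvature of the function; it provides a quadratic surface that touches the curvature of the error surface.

The multilayer perceptron here uses the L-BFGS optimizer, a quasi-Newton method that approximates the second-order information from recent gradients, to train the neural network.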

Optimizer optimizer = new Lbfgs(
            data.getExecutionEnvironment().fromElements(annObjFunc),
            trainData,
            BatchOperator
                .getExecutionEnvironmentFromDataSets(data)
                .fromElements(inputSize),
            optimizationParams
        );
optimizer.initCoefWith(initCoef);

0x03 L-BFGS Training

This part is also fairly involved and deserves its own section: training with the optimizer.

optimizer.optimize()
  .map(new MapFunction<Tuple2<DenseVector, double[]>, DenseVector>() {
    @Override
    public DenseVector map(Tuple2<DenseVector, double[]> value) throws Exception {
        return value.f0;
    }
});

For details on L-BFGS, see the earlier article Alink Notes (11): L-BFGS Optimization for Linear Regression.

Here is an outline of the Lbfgs class:

public class Lbfgs extends Optimizer {
    public DataSet <Tuple2 <DenseVector, double[]>> optimize() {
       DataSet <Row> model = new IterativeComQueue()
           ......
          .add(new CalcGradient())
           ......
          .add(new CalDirection(...))
          .add(new CalcLosses(...))
           ......
          .add(new UpdateModel(...))
           ......
          .exec();
    }
}

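Several key steps stand out:

CalcGradient(): compute the gradient
CalDirection(...): compute the direction
CalcLosses(...): compute the losses
UpdateModel(...): update the model

The algorithm framework is essentially fixed; what differs is the specific objective function and loss function. For example, linear regression uses UnaryLossObjFunc with SquareLossFunc as its loss function.

Filling the multilayer-perceptron specifics into the key steps shows the differences from linear regression.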

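CalcGradient(): compute the gradient
- 1) calls AnnObjFunc.updateGradient;
  - 1.1) which calls topologyModel.computeGradient on the objective function's topology model
    - 1.1.1) compute the output of every layer: forward(data, true)
    - 1.1.2) compute the loss of the output layer: labelWithError.loss
    - 1.1.3) compute the delta of every layer: layerModels.get(i).computePrevDelta
    - 1.1.4) compute the gradient of every layer: layerModels.get(i).grad
CalDirection(...): compute the direction
- This implementation does not use the topology model of the objective function.
CalcLosses(...): compute the losses
- 1) calls AnnObjFunc.calcLoss;
  - 1.1) which calls topologyModel.computeGradient on the objective function's topology model
    - 1.1.1) compute the output of every layer: forward(data, true)
    - 1.1.2) compute the loss of the output layer: labelWithError.loss
UpdateModel(...): update the model
- This implementation does not use the topology model of the objective function.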

3.1 CalcGradient: Computing the Gradient

CalcGradient.calc calls the gradient computation of the objective function.

// calculate local gradient
Double weightSum = objFunc.calcGradient(labledVectors, coef, grad.f0);

objFunc.calcGradient is implemented in the base class OptimObjFunc; it is there that the concrete AnnObjFunc.updateGradient gets called.

updateGradient(labelVector, coefVector, grad);

3.1.1 The Objective Function

Recall the definition of the objective function:

public class AnnObjFunc extends OptimObjFunc {
    
    protected void updateGradient(Tuple3<Double, Double, Vector> labledVector, DenseVector coefVector, DenseVector updateGrad) {
        if (topologyModel == null) {
            topologyModel = topology.getModel(coefVector);
        } else {
            topologyModel.resetModel(coefVector);
        }
        Tuple2<DenseMatrix, DenseMatrix> unstacked = stacker.unstack(labledVector);
        topologyModel.computeGradient(unstacked.f0, unstacked.f1, updateGrad);
    }    
}

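First the topology model is created; this step was described earlier.
Then unstacked = stacker.unstack(labledVector) unpacks the data and returns Tuple2.of(features, labels).
Finally the gradient is computed, which is implemented by the FeedForwardModel class.

3.1.2 Computing the Gradient

The gradient computation (the same function also computes the loss) lives in FeedForwardModel.computeGradient. The logic is roughly: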

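CalcGradient.calc calls objFunc.calcGradient (implemented in OptimObjFunc), which:

1) calls AnnObjFunc.updateGradient;
- 1.1) which calls topologyModel.computeGradient on the objective function's topology model
  - 1.1.1) compute the output of every layer: forward(data, true)
  - 1.1.2) compute the loss of the output layer: labelWithError.loss
  - 1.1.3) compute the delta of every layer: layerModels.get(i).computePrevDelta
  - 1.1.4) compute the gradient of every layer: layerModels.get(i).grad

The code is: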

public double computeGradient(DenseMatrix data, DenseMatrix target, DenseVector cumGrad) {
        // data is x, target is y
        // compute the output of every layer
        outputs = forward(data, true); 
    
        int currentBatchSize = data.numRows();
        if (deltas == null || deltas.get(0).numRows() != currentBatchSize) {
            deltas = new ArrayList<>(layers.size() - 1);
            int inputSize = data.numCols();
            for (int i = 0; i < layers.size() - 1; i++) {
                int outputSize = layers.get(i).getOutputSize(inputSize);
                deltas.add(new DenseMatrix(currentBatchSize, outputSize));
                inputSize = outputSize;
            }
        }
        int L = layerModels.size() - 1;
        AnnLossFunction labelWithError = (AnnLossFunction) this.layerModels.get(L);
        // compute the loss
        double loss = labelWithError.loss(outputs.get(L), target, deltas.get(L - 1));
        if (cumGrad == null) {
            return loss; // return right away if only the loss is needed
        }
        // compute the deltas
        for (int i = L - 1; i >= 1; i--) {
            layerModels.get(i).computePrevDelta(deltas.get(i), outputs.get(i), deltas.get(i - 1));
        }
        int offset = 0;
        // compute the gradients
        for (int i = 0; i < layerModels.size(); i++) {
            DenseMatrix input = i == 0 ? data : outputs.get(i - 1);
            if (i == layerModels.size() - 1) {
                layerModels.get(i).grad(null, input, cumGrad, offset);
            } else {
                layerModels.get(i).grad(deltas.get(i), input, cumGrad, offset);
            }
            offset += layers.get(i).getWeightSize();
        }
        return loss;
}

3.1.2.1 Computing the Output of Every Layer

Recall that the example code configures the neural network as:

.setLayers(new int[]{4, 5, 3})

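Concretely, forward calls the eval function of each layer model. The first layer takes the raw data as input rather than a previous output, so it does not fit into the loop and is written out separately.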

public class FeedForwardModel extends TopologyModel {
    public List<DenseMatrix> forward(DenseMatrix data, boolean includeLastLayer) {
        .....
        layerModels.get(0).eval(data, outputs.get(0));
        int end = includeLastLayer ? layers.size() : layers.size() - 1;
        for (int i = 1; i < end; i++) {
            layerModels.get(i).eval(outputs.get(i - 1), outputs.get(i));
        }
        return outputs;
	}
}

Each layer is evaluated as follows:

AffineLayerModel.eval

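This is a plain affine transformation, XW + b, whose result is written to output: every row of output is first filled with b, and gemm then adds data * w to it.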

@Override
public void eval(DenseMatrix data, DenseMatrix output) {
        int batchSize = data.numRows();
        for (int i = 0; i < batchSize; i++) {
            for (int j = 0; j < this.b.size(); j++) {
                output.set(i, j, this.b.get(j));
            }
        }
        BLAS.gemm(1., data, false, this.w, false, 1., output);
}

Here w and b have already been set before eval is called:

this = {AffineLayerModel@10581} 
 w = {DenseMatrix@10592} "mat[4,5]:\n  0.07807905200944776,-0.03040913035034301,.....\n"
 b = {DenseVector@10593} "-0.058043439717701296 0.1415366160323592 0.017773419483873353 -0.06802435221045448 0.022751460286303204"
 gradw = {DenseMatrix@10594} "mat[4,5]:\n  0.0,0.0,0.0,0.0,0.0\n  0.0,0.0,0.0,0.0,0.0\n  0.0,0.0,0.0,0.0,0.0\n  0.0,0.0,0.0,0.0,0.0\n"
 gradb = {DenseVector@10595} "0.0 0.0 0.0 0.0 0.0"
 ones = null
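
For readers without the Alink BLAS wrapper at hand, the same computation can be sketched on plain arrays. The shapes follow the dump above (data is [batch, numIn], w is [numIn, numOut], b has numOut entries); the class AffineEvalSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical plain-array version of the affine forward pass; not Alink code.
class AffineEvalSketch {
    // data: [batch][numIn], w: [numIn][numOut], b: [numOut]; returns data * w + b.
    static double[][] eval(double[][] data, double[][] w, double[] b) {
        int batch = data.length;
        int numOut = b.length;
        double[][] output = new double[batch][numOut];
        for (int i = 0; i < batch; i++) {
            for (int j = 0; j < numOut; j++) {
                double z = b[j];                 // start from the bias, as eval does
                for (int k = 0; k < w.length; k++) {
                    z += data[i][k] * w[k][j];   // accumulate the matrix product
                }
                output[i][j] = z;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        double[][] data = {{1.0, 2.0}};
        double[][] w = {{1.0, 0.0, 1.0}, {0.0, 1.0, 1.0}};
        double[] b = {0.5, 0.5, 0.5};
        System.out.println(Arrays.toString(eval(data, w, b)[0])); // [1.5, 2.5, 3.5]
    }
}
```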

FuntionalLayerModel.eval

The implementation is:

public void eval(DenseMatrix data, DenseMatrix output) {
        for (int i = 0; i < data.numRows(); i++) {
            for (int j = 0; j < data.numCols(); j++) {
                output.set(i, j, this.layer.activationFunction.eval(data.get(i, j)));
            }
        }
}

The fields are:

this = {FuntionalLayerModel@10433} 
 layer = {FuntionalLayer@10335} 
  activationFunction = {SigmoidFunction@10755}

The input is:

data = {DenseMatrix@10642} "mat[19,5]:\n  0.09069152145840428,-0.4117319046979133,-0.273491600786707,-0.3638766081567865,-0.17552469317567304\n"
 m = 19
 n = 5
 data = {double[95]@10668}

Here activationFunction resolves to SigmoidFunction.eval.

public class SigmoidFunction implements ActivationFunction {
    @Override
    public double eval(double x) {
        return 1.0 / (1 + Math.exp(-x));
    }
}

SoftmaxLayerModelWithCrossEntropyLoss.eval

This computes the final output.

    public void eval(DenseMatrix data, DenseMatrix output) {
        int batchSize = data.numRows();
        for (int ibatch = 0; ibatch < batchSize; ibatch++) {
            double max = -Double.MAX_VALUE;
            for (int i = 0; i < data.numCols(); i++) {
                double v = data.get(ibatch, i);
                if (v > max) {
                    max = v;
                }
            }
            double sum = 0.;
            for (int i = 0; i < data.numCols(); i++) {
                double res = Math.exp(data.get(ibatch, i) - max);
                output.set(ibatch, i, res);
                sum += res;
            }
            for (int i = 0; i < data.numCols(); i++) {
                double v = output.get(ibatch, i);
                output.set(ibatch, i, v / sum);
            }
        }
    }
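
The eval above subtracts the row maximum before exponentiating. The sketch below shows why, by comparing a naive softmax with the max-subtracted one on large inputs; the class StableSoftmaxSketch and its method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical comparison of naive and max-subtracted softmax; not Alink code.
class StableSoftmaxSketch {
    // Naive version: exp overflows to Infinity for large inputs.
    static double[] naive(double[] z) {
        double[] out = new double[z.length];
        double sum = 0.0;
        for (int i = 0; i < z.length; i++) {
            out[i] = Math.exp(z[i]);
            sum += out[i];
        }
        for (int i = 0; i < z.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    // Stable version: subtracting the maximum leaves the result unchanged
    // mathematically but keeps every exponent at or below zero.
    static double[] stable(double[] z) {
        double max = -Double.MAX_VALUE;
        for (double v : z) {
            max = Math.max(max, v);
        }
        double[] shifted = new double[z.length];
        for (int i = 0; i < z.length; i++) {
            shifted[i] = z[i] - max;
        }
        return naive(shifted);
    }

    public static void main(String[] args) {
        double[] z = {1000.0, 1000.0};
        System.out.println(Arrays.toString(naive(z)));  // NaN entries: Infinity / Infinity
        System.out.println(Arrays.toString(stable(z))); // [0.5, 0.5]
    }
}
```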

3.1.2.2 Computing the Loss

The code is:

AnnLossFunction labelWithError = (AnnLossFunction) this.layerModels.get(L);
double loss = labelWithError.loss(outputs.get(L), target, deltas.get(L - 1));
if (cumGrad == null) {
	return loss; // return early when only the loss is needed
}

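If the gradient is not needed, the function returns here; otherwise it continues. In our case execution continues.

Concretely, this calls the loss function of SoftmaxLayerModelWithCrossEntropyLoss, that is, the loss of the output layer.

output is the output layer and target is the label y. The loss is computed almost exactly as usual, except for the additional division by batchSize.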

public double loss(DenseMatrix output, DenseMatrix target, DenseMatrix delta) {
        int batchSize = output.numRows();
        MatVecOp.apply(output, target, delta, (o, t) -> t * Math.log(o));
        double loss = -(1.0 / batchSize) * delta.sum();
        MatVecOp.apply(output, target, delta, (o, t) -> o - t);
        return loss;
}

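3.1.2.3 Computing delta

Gradient descent needs the partial derivatives of the loss function with respect to the parameters. Applying the chain rule to each parameter one by one involves matrix calculus and is inefficient, so neural networks usually use the backpropagation algorithm to compute gradients efficiently: once the error term of each layer has been computed by backpropagation, the gradient of each layer's parameters follows.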

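We define delta as the degree to which the weighted input of a hidden layer affects the total error: δ(l) denotes the error term of the neurons in layer l. It expresses the influence of the layer-l neurons on the final loss and also reflects how sensitive the final loss is to them:

\[\delta^{(l)} = \frac{\partial Loss}{\partial z^{(l)}} = f'(z^{(l)}) \odot \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \]

This is the error backpropagation formula: the error term of layer l can be computed from the error term of layer l+1. In words, the error term of a neuron in layer l equals the gradient of that neuron's activation function multiplied by the weighted sum of the error terms of all layer-(l+1) neurons connected to it. Here W^{(l+1)} is the weight matrix of layer l+1, taken from W, the collection of all weights and biases.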

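This may be hard to grasp. Here are three explanations that may help:

Gradient descent moves in the direction opposite to the gradient (slope), so when we walk backwards through the network we multiply by the gradient at each step to climb back.
Or view it as the error (delta) being passed backwards: crossing an edge multiplies by the edge's weight, and crossing a node multiplies by the node's partial derivative (the derivative of sigmoid has a conveniently simple form).
Or think of it this way: with the help of the transposed weight matrix, the output-layer error is carried to the hidden layer, so this indirect error can be used to update the weight matrix attached to the hidden layer. During backpropagation the weight matrix again acts as a carrier, only this time it transports the output error rather than the input signal.

The code is: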

for (int i = L - 1; i >= 1; i--) {
	layerModels.get(i).computePrevDelta(deltas.get(i), outputs.get(i), deltas.get(i - 1));
}

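Note that:

The delta of the last layer was already computed by the preceding loss call and stored, through the argument of the loss function, in deltas.get(L - 1);
The loop runs backwards, starting from the second-to-last layer.

In this example, outputs has four layers (0 to 3) and deltas has three (0 to 2).

The values of the 4th outputs layer are used to compute the values of the 3rd deltas layer.

The values of the 3rd outputs layer and the 3rd deltas layer are used to compute the values of the 2nd deltas layer. In detail, for each layer: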

AffineLayerModel

public void computePrevDelta(DenseMatrix delta, DenseMatrix output, DenseMatrix prevDelta) {
	BLAS.gemm(1.0, delta, false, this.w, true, 0., prevDelta);
}

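gemm computes C := alpha * A * B + beta * C. The bias b does not appear here because prevDelta is the derivative with respect to the layer's input: for z = xW + b, the derivative with respect to x involves only W, so b drops out. The gradient of b itself is computed separately in grad.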

public static void gemm(double alpha, DenseMatrix matA, boolean transA, DenseMatrix matB, boolean transB, double beta, DenseMatrix matC)

Here that is 1.0 * delta * this.w^T + 0 * prevDelta: delta is not transposed, this.w is.
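
Written out with plain loops, the gemm call amounts to the following. The shapes are assumptions taken from the earlier dumps (delta is [batch, numOut], w is [numIn, numOut]); the class AffinePrevDeltaSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical plain-array version of prevDelta = delta * w^T; not Alink code.
class AffinePrevDeltaSketch {
    // delta: [batch][numOut], w: [numIn][numOut]; returns [batch][numIn].
    static double[][] prevDelta(double[][] delta, double[][] w) {
        int batch = delta.length;
        int numIn = w.length;
        double[][] prev = new double[batch][numIn];
        for (int i = 0; i < batch; i++) {
            for (int k = 0; k < numIn; k++) {
                double sum = 0.0;
                for (int j = 0; j < delta[i].length; j++) {
                    sum += delta[i][j] * w[k][j]; // w is read transposed
                }
                prev[i][k] = sum;
            }
        }
        return prev;
    }

    public static void main(String[] args) {
        double[][] delta = {{1.0, 2.0, 3.0}};
        double[][] w = {{1.0, 0.0, 1.0}, {0.0, 1.0, 1.0}};
        System.out.println(Arrays.toString(prevDelta(delta, w)[0])); // [4.0, 5.0]
    }
}
```

Each input neuron collects the errors of the output neurons it feeds, weighted by the connecting weights, which is the "weighted sum" of the backpropagation formula.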

FuntionalLayerModel

This is the activation derivative multiplied by delta; the derivative is the rate of change.

public void computePrevDelta(DenseMatrix delta, DenseMatrix output, DenseMatrix prevDelta) {
        for (int i = 0; i < delta.numRows(); i++) {
            for (int j = 0; j < delta.numCols(); j++) {
                double y = output.get(i, j);
                prevDelta.set(i, j, this.layer.activationFunction.derivative(y) * delta.get(i, j));
            }
        }
}

The activation function is the sigmoid function, f(x) = 1 / (1 + exp(-x)).

public class SigmoidFunction implements ActivationFunction {
    @Override
    public double eval(double x) {
        return 1.0 / (1 + Math.exp(-x));
    }

    @Override
    public double derivative(double z) {
        return (1 - z) * z; // z is the sigmoid output f(x), so this is f'(x) = f(x) * (1 - f(x))
    }
}
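
Note that derivative takes the sigmoid output, not its input, which is why computePrevDelta passes output.get(i, j). A quick numerical check of the identity f'(x) = f(x)(1 - f(x)), with a hypothetical class SigmoidDerivativeSketch and a central finite difference as the reference:

```java
// Hypothetical numerical check of the sigmoid derivative identity; not Alink code.
class SigmoidDerivativeSketch {
    static double sigmoid(double x) {
        return 1.0 / (1 + Math.exp(-x));
    }

    // Same form as SigmoidFunction.derivative: expects z = sigmoid(x).
    static double derivativeFromOutput(double z) {
        return (1 - z) * z;
    }

    // Central finite difference of sigmoid at x, used only as a reference value.
    static double numericalDerivative(double x) {
        double h = 1e-6;
        return (sigmoid(x + h) - sigmoid(x - h)) / (2 * h);
    }

    public static void main(String[] args) {
        double x = 0.7;
        System.out.println(derivativeFromOutput(sigmoid(x)));
        System.out.println(numericalDerivative(x));
    }
}
```

The two printed values should agree to many decimal places, and the first one needs no extra call to exp because the forward pass already produced the output.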

3.1.2.4 Computing the Gradient

The gradient is computed from the first layer to the last and accumulated in cumGrad.

int offset = 0;
for (int i = 0; i < layerModels.size(); i++) {
    DenseMatrix input = i == 0 ? data : outputs.get(i - 1);
    if (i == layerModels.size() - 1) {
    	layerModels.get(i).grad(null, input, cumGrad, offset);
    } else {
    	layerModels.get(i).grad(deltas.get(i), input, cumGrad, offset);
    }
    offset += layers.get(i).getWeightSize();
}

AffineLayerModel

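The gradient has two parts, one for w and one for b, so it is computed in two parts. unpack extracts this layer's slice from the flat vector, and pack writes it back, because L-BFGS ultimately has to work on a vector.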

    public void grad(DenseMatrix delta, DenseMatrix input, DenseVector cumGrad, int offset) {
        unpack(cumGrad, offset, this.gradw, this.gradb);
        int batchSize = input.numRows();
        // gradient of w: gradw += input^T * delta
        BLAS.gemm(1.0, input, true, delta, false, 1.0, this.gradw);
        if (ones == null || ones.size() != batchSize) {
            ones = DenseVector.ones(batchSize);
        }
        // gradient of b: gradb += delta^T * ones (column sums of delta)
        BLAS.gemv(1.0, delta, true, this.ones, 1.0, this.gradb);
        pack(cumGrad, offset, this.gradw, this.gradb);
    }
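
The two BLAS calls can be spelled out on plain arrays as below. This sketch leaves out the pack/unpack step and assumes the shapes used earlier (input is [batch, numIn], delta is [batch, numOut]); the class AffineGradSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical plain-array version of the affine gradient accumulation; not Alink code.
class AffineGradSketch {
    // gradW[k][j] += sum_i input[i][k] * delta[i][j]   (input^T * delta)
    // gradB[j]    += sum_i delta[i][j]                 (delta^T * ones)
    static void accumulate(double[][] input, double[][] delta, double[][] gradW, double[] gradB) {
        for (int i = 0; i < input.length; i++) {
            for (int j = 0; j < gradB.length; j++) {
                for (int k = 0; k < gradW.length; k++) {
                    gradW[k][j] += input[i][k] * delta[i][j];
                }
                gradB[j] += delta[i][j];
            }
        }
    }

    public static void main(String[] args) {
        double[][] input = {{1.0, 2.0}, {3.0, 4.0}};
        double[][] delta = {{1.0, 0.0}, {0.0, 1.0}};
        double[][] gradW = new double[2][2];
        double[] gradB = new double[2];
        accumulate(input, delta, gradW, gradB);
        System.out.println(Arrays.deepToString(gradW)); // [[1.0, 3.0], [2.0, 4.0]]
        System.out.println(Arrays.toString(gradB));     // [1.0, 1.0]
    }
}
```

Because both results are added to the existing buffers, gradients from several batches accumulate in the same slice of cumGrad.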

FuntionalLayerModel

No gradient is computed here:

public void grad(DenseMatrix delta, DenseMatrix input, DenseVector cumGrad, int offset) {}

At the end the fields look like this:

this = {FeedForwardModel@10394} 

    layers = {ArrayList@10405}  size = 4
     0 = {AffineLayer@10539} 
     1 = {FuntionalLayer@10377} 
     2 = {AffineLayer@10540} 
     3 = {SoftmaxLayerWithCrossEntropyLoss@10541} 

    layerModels = {ArrayList@10401}  size = 4
     0 = {AffineLayerModel@10543} 
     1 = {FuntionalLayerModel@10376} 
     2 = {AffineLayerModel@10544} 
     3 = {SoftmaxLayerModelWithCrossEntropyLoss@10398} 

     outputs = {ArrayList@10399}  size = 4
      0 = {DenseMatrix@10374} "mat[19,5]:\n  0.5258035858891295,0.40832346939250874,0.4339942803542127,0.4146645474481978,0.45503123177429533..."
      1 = {DenseMatrix@10374} "mat[19,5]:\n  0.5258035858891295,0.40832346939250874,0.4339942803542127,0.4146645474481978,0.45503123177429533..."
      2 = {DenseMatrix@10533} "mat[19,3]:\n  0.31968260294191225,0.3305393733681367,0.3497780236899511..."
      3 = {DenseMatrix@10533} "mat[19,3]:\n  0.31968260294191225,0.3305393733681367,0.3497780236899511\...."
         
     deltas = {ArrayList@10400}  size = 3
      0 = {DenseMatrix@10375} "mat[19,5]:\n  0.0052001689807435756,-0.002841992490130668,0.02414893572802383."
      1 = {DenseMatrix@10379} "mat[19,5]:\n  0.02085622230356763,-0.011763437253154471,0.09830897540282763,-0.005205953747031061."
      2 = {DenseMatrix@10528} "mat[19,3]:\n  -0.6803173970580878,0.3305393733681367,0.3497780236899511\."

3.2 CalDirection: Computing the Direction

This implementation does not use the topology model of the objective function.
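
For orientation, the textbook way L-BFGS turns a gradient into a search direction is the two-loop recursion, sketched below. This is the generic algorithm, not Alink's CalDirection code, and the class LbfgsDirectionSketch with its argument layout (history pairs ordered oldest first) is an assumption made for illustration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the standard L-BFGS two-loop recursion; not Alink's CalDirection.
class LbfgsDirectionSketch {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // s.get(i) = x_{i+1} - x_i and y.get(i) = g_{i+1} - g_i, oldest pair first.
    // Returns -H * grad, where H approximates the inverse Hessian.
    static double[] direction(double[] grad, List<double[]> s, List<double[]> y) {
        int m = s.size();
        double[] q = grad.clone();
        double[] alpha = new double[m];
        double[] rho = new double[m];
        for (int i = m - 1; i >= 0; i--) {          // first loop: newest to oldest
            rho[i] = 1.0 / dot(y.get(i), s.get(i));
            alpha[i] = rho[i] * dot(s.get(i), q);
            for (int j = 0; j < q.length; j++) {
                q[j] -= alpha[i] * y.get(i)[j];
            }
        }
        // Scale by the usual initial inverse-Hessian estimate (identity if no history).
        double gamma = m == 0 ? 1.0 : dot(s.get(m - 1), y.get(m - 1)) / dot(y.get(m - 1), y.get(m - 1));
        for (int j = 0; j < q.length; j++) {
            q[j] *= gamma;
        }
        for (int i = 0; i < m; i++) {               // second loop: oldest to newest
            double beta = rho[i] * dot(y.get(i), q);
            for (int j = 0; j < q.length; j++) {
                q[j] += s.get(i)[j] * (alpha[i] - beta);
            }
        }
        for (int j = 0; j < q.length; j++) {
            q[j] = -q[j];
        }
        return q;
    }

    public static void main(String[] args) {
        // f(x) = 2x^2, gradient 4x. History: step from x=2 to x=3, so s=1 and y=4.
        List<double[]> s = Collections.singletonList(new double[]{1.0});
        List<double[]> y = Collections.singletonList(new double[]{4.0});
        System.out.println(direction(new double[]{12.0}, s, y)[0]); // -3.0, the Newton step at x=3
    }
}
```

Only gradients and parameter differences enter the computation, which matches the observation that this step never touches the network's topology model.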

3.3 CalcLosses: Computing the Losses

Execution goes through objFunc.calcSearchValues straight into the AnnObjFunc class.

The loss computation code is:

for (Tuple3<Double, Double, Vector> labelVector : labelVectors) {
    for (int i = 0; i < numStep + 1; ++i) {
		losses[i] += calcLoss(labelVector, stepVec[i]) * labelVector.f0;
    }
}

AnnObjFunc.calcLoss is shown below; it delegates the computation to its topology model.

protected double calcLoss(Tuple3<Double, Double, Vector> labledVector, DenseVector coefVector) {
        if (topologyModel == null) {
            topologyModel = topology.getModel(coefVector);
        } else {
            topologyModel.resetModel(coefVector);
        }
        Tuple2<DenseMatrix, DenseMatrix> unstacked = stacker.unstack(labledVector);
        return topologyModel.computeGradient(unstacked.f0, unstacked.f1, null);
}

This calls computeGradient to compute the loss, which returns early:

@Override
public double computeGradient(DenseMatrix data, DenseMatrix target, DenseVector cumGrad) {
        outputs = forward(data, true);
        ......
        AnnLossFunction labelWithError = (AnnLossFunction) this.layerModels.get(L);
        double loss = labelWithError.loss(outputs.get(L), target, deltas.get(L - 1));
        if (cumGrad == null) {
            return loss; // returns here once the loss is computed
        }
        ...
}

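3.4 UpdateModel: Updating the Model

This step does not use the topology model of the objective function.

0x04 Outputting the Model

The multilayer perceptron uses more memory than ordinary algorithms; I had to add the following VM options in IDEA for it to run successfully.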

-Xms256m -Xmx640m -XX:PermSize=128m -XX:MaxPermSize=512m

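A small complaint about Alink: when debugging locally there is no way to change the Env parameters, such as the heartbeat interval, which makes debugging inconvenient.

The code that outputs the model is: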

        // output model
        DataSet<Row> modelRows = weights
            .flatMap(new RichFlatMapFunction<DenseVector, Row>() {
                @Override
                public void flatMap(DenseVector value, Collector<Row> out) throws Exception {
                    List<Tuple2<Long, Object>> bcLabels = getRuntimeContext().getBroadcastVariable("labels");
                    Object[] labels = new Object[bcLabels.size()];
                    bcLabels.forEach(t2 -> {
                        labels[t2.f0.intValue()] = t2.f1;
                    });

                    MlpcModelData model = new MlpcModelData(labelType);
                    model.labels = Arrays.asList(labels);
                    model.meta.set(ModelParamName.IS_VECTOR_INPUT, isVectorInput);
                    model.meta.set(MultilayerPerceptronTrainParams.LAYERS, layerSize);
                    model.meta.set(MultilayerPerceptronTrainParams.VECTOR_COL, vectorColName);
                    model.meta.set(MultilayerPerceptronTrainParams.FEATURE_COLS, featureColNames);
                    model.weights = value;
                    new MlpcModelDataConverter(labelType).save(model, out);
                }
            })
            .withBroadcastSet(labels, "labels");

// At run time the value is:
value = {DenseVector@13212} 
     data = {double[43]@13252} 
      0 = -39.6567702949874
      1 = 16.74206842333768
      2 = 64.49084799006972
      3 = -1.6630682281137472
  ......

The model data class is defined as:

public class MlpcModelData {
    public Params meta = new Params();
    public DenseVector weights;
    public TypeInformation labelType;
    public List<Object> labels;
}

The final model data looks roughly like this:

model = {Tuple3@13307} "
     f0 = {Params@13308} "Params {vectorCol=null, isVectorInput=false, layers=[4,5,3], featureCols=["sepal_length","sepal_width","petal_length","petal_width"]}"
      params = {HashMap@13326}  size = 4
       "vectorCol" -> null
       "isVectorInput" -> "false"
       "layers" -> "[4,5,3]"
       "featureCols" -> "["sepal_length","sepal_width","petal_length","petal_width"]"
     f1 = {ArrayList@13309}  size = 1
      0 = "{"data":[-39.65676994108487,16.742068271166456,64.49084741971454,-1.6630682163468897,-66.71571933711216,-75.86297804171262,62.609759182998204,-101.47431688844591,31.546529394499977,17.597934397561986,85.36235323961661,-126.30772079054803,326.2329896163572,-29.720070636859894,-180.1693204840142,47.70255002863321,-63.44460914025362,136.6269589647343,-0.6446457887679123,-81.86976832863223,-16.333532816181705,15.4253068036318,-11.297177263474234,-1.1338164486683862,1.3011810728093451,-261.50388539155716,223.36901758842117,38.01966001651569,231.51463912615586,-152.59659885027318,-79.02863627644948,-48.28342595225583,-63.63975869014504,111.98667709535484,153.39174290331553,-121.04900950767653,-32.47876659498367,137.82909902591624,-43.99785013791728,-93.99354048054636,42.85135076273807,-24.8725999157641,-17.962438639217815]}"
       value = {char[829]@13325} 
       hash = 0
     f2 = {Arrays$ArrayList@13310}  size = 3
      0 = "Iris-setosa"
      1 = "Iris-virginica"
      2 = "Iris-versicolor"