1 Backpropagation
When learning and implementing backpropagation, the computational complexity and the abstractness of what is being computed often push us into mechanically plugging numbers into formula templates. Used that way, a hand-rolled implementation is even less convenient than simply calling an existing framework.
The point of implementing backpropagation ourselves is to understand why the formulas are written the way they are and why the computation proceeds the way it does. That is what matters!
Some tutorials abstract the algorithm's steps into a "backward propagation" picture, turning the computation into diagrams or animations. But until you truly know why it is computed this way, those pictures have no roots.
As for why it is computed this way, our method is to tackle it head-on.
"Head-on" means that all we need to understand is the derivative; every other interpretation is set aside here. We just compute, and use the computation to derive the formulas!
There are no meanings. There are just laws of arithmetic.
Fair warning: most of what follows is in English, and because of the math formulas a large-screen device is recommended.
2 Terminology[1]
- L = total number of layers in the network
- \(s_l\) = number of units (not counting bias unit) in layer l
- K = number of output units/classes
- Binary classification: y = 0 or y = 1, K = 1
- Multi-class classification: K >= 3
\[\begin{align*}& a_i^{(j)} = \text{"activation" of unit $i$ in layer $j$} \newline& \Theta^{(j)} = \text{matrix of weights controlling function mapping from layer $j$ to layer $j+1$}\end{align*}
\]
\[\text{If network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, } \\ \text{ then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.}
\]
\[z^{(j+1)} = \Theta^{(j)}a^{(j)}
\]
Example
\[\begin{align*}
z_1^{(3)}=\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} \newline
a_1^{(3)} = g(z_1^{(3)}) \newline
z_2^{(3)}=\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}a_1^{(2)} + \Theta_{22}^{(2)}a_2^{(2)} \newline
a_2^{(3)} = g(z_2^{(3)}) \newline
\text{add}\; a_0^{(3)} = 1 \newline
\end{align*}\]
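In Octave, that layer-3 step is only a few lines. A sketch, with arbitrary activation values and Theta2 standing in for \(\Theta^{(2)}\):
g = @(z) 1 ./ (1 + exp(-z));   % sigmoid
a2 = [1; 0.6; 0.4];            % a^(2) with bias a_0^(2) = 1 (arbitrary values)
Theta2 = rand(2, 3);           % Theta^(2) is s_3 x (s_2 + 1) = 2x3
z3 = Theta2 * a2;              % z^(3) = Theta^(2) a^(2), a 2x1 vector
a3 = [1; g(z3)];               % apply g elementwise, then add a_0^(3) = 1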
\(\Theta_{10}^{(1)}a_0^{(1)}\)
- (1) : the weight matrix mapping layer 1 to layer 2
- 10 : the subscript, read as two indices:
  - 1 : the target, the first activation unit of layer 2, \(a_1^{(2)}\)
  - 0 : the source, the 0th (bias) unit of layer 1, \(a_0^{(1)}\)
3 Feedforward computation
For the four-layer network above, here is the computation written out in full. To implement backpropagation, we first have to implement forward propagation.
The essence of forward propagation is to compute each layer's outputs and feed them in as the next layer's inputs; a concrete sketch follows the layer-by-layer equations below.
3.1 Layer 1
\[\begin{align*}
a_1^{(1)} = x_1 \newline
a_2^{(1)} = x_2 \newline
\text{add}\; a_0^{(1)} = 1 \newline
\end{align*}\]
3.2 Layer 2
\[\begin{align*}
z_1^{(2)}=\Theta_{10}^{(1)}a_0^{(1)} + \Theta_{11}^{(1)}a_1^{(1)} + \Theta_{12}^{(1)}a_2^{(1)} \newline
a_1^{(2)} = g(z_1^{(2)}) \newline
z_2^{(2)}=\Theta_{20}^{(1)}a_0^{(1)} + \Theta_{21}^{(1)}a_1^{(1)} + \Theta_{22}^{(1)}a_2^{(1)} \newline
a_2^{(2)} = g(z_2^{(2)}) \newline
\text{add}\; a_0^{(2)} = 1 \newline
\end{align*}\]
3.3 Layer 3
\[\begin{align*}
z_1^{(3)}=\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} \newline
a_1^{(3)} = g(z_1^{(3)}) \newline
z_2^{(3)}=\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}a_1^{(2)} + \Theta_{22}^{(2)}a_2^{(2)} \newline
a_2^{(3)} = g(z_2^{(3)}) \newline
\text{add}\; a_0^{(3)} = 1 \newline
\end{align*}\]
3.4 Layer 4
\[\begin{align*}
z_1^{(4)}=\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}a_1^{(3)} + \Theta_{12}^{(3)}a_2^{(3)} \newline
a_1^{(4)} = g(z_1^{(4)}) \newline
h_\Theta(x) = a_1^{(4)}
\end{align*}\]
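Stacking the four layers gives the whole forward pass. A minimal Octave sketch for this 2-2-2-1 toy network (the weights are randomly initialized only so the script runs; the input values are arbitrary):
g = @(z) 1 ./ (1 + exp(-z));              % sigmoid
Theta1 = rand(2,3); Theta2 = rand(2,3); Theta3 = rand(1,3);
x = [0.5; 0.9];                           % an arbitrary input (x1, x2)
a1 = [1; x];                              % layer 1: add a_0^(1) = 1
a2 = [1; g(Theta1 * a1)];                 % layer 2: z2 = Theta1*a1, add bias
a3 = [1; g(Theta2 * a2)];                 % layer 3: z3 = Theta2*a2, add bias
h  = g(Theta3 * a3)                       % layer 4: h_Theta(x) = a_1^(4)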
4 Hypothesis expansion
\[\begin{align*}
h_\Theta(x) & = a_1^{(4)} \newline
&= g(z_1^{(4)}) \newline
&= g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}a_1^{(3)} + \Theta_{12}^{(3)}a_2^{(3)}) \newline
&= g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}g(z_1^{(3)}) + \Theta_{12}^{(3)}g(z_2^{(3)})) \newline
&=g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} ) \\
& \quad + \Theta_{12}^{(3)}g(\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}a_1^{(2)} + \Theta_{22}^{(2)}a_2^{(2)} )) \newline
&=g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}g(z_1^{(2)}) + \Theta_{12}^{(2)}g(z_2^{(2)}) ) \\
& \quad + \Theta_{12}^{(3)}g(\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}g(z_1^{(2)}) + \Theta_{22}^{(2)}g(z_2^{(2)}) )) \newline
&=g(\Theta_{10}^{(3)}a_0^{(3)} + \Theta_{11}^{(3)}g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}g(\Theta_{10}^{(1)}a_0^{(1)} + \Theta_{11}^{(1)}a_1^{(1)} \\
& \quad + \Theta_{12}^{(1)}a_2^{(1)}) + \Theta_{12}^{(2)}g(\Theta_{20}^{(1)}a_0^{(1)} + \Theta_{21}^{(1)}a_1^{(1)} + \Theta_{22}^{(1)}a_2^{(1)}) ) \\
& \quad +\Theta_{12}^{(3)}g(\Theta_{20}^{(2)}a_0^{(2)} + \Theta_{21}^{(2)}g(\Theta_{10}^{(1)}a_0^{(1)} + \Theta_{11}^{(1)}a_1^{(1)} + \Theta_{12}^{(1)}a_2^{(1)}) \\
& \quad + \Theta_{22}^{(2)}g(\Theta_{20}^{(1)}a_0^{(1)}+
\Theta_{21}^{(1)}a_1^{(1)} + \Theta_{22}^{(1)}a_2^{(1)}) ))
\end{align*}\]
5 The three corresponding weight matrices
\[\Theta^{(1)} \; (2 \times 3) =
\begin{pmatrix}
\Theta_{10}^{(1)}& \Theta_{11}^{(1)}&\Theta_{12}^{(1)}\\
\Theta_{20}^{(1)}& \Theta_{21}^{(1)}&\Theta_{22}^{(1)}\\
\end{pmatrix}
\]
\[\Theta^{(2)} \; (2 \times 3) =
\begin{pmatrix}
\Theta_{10}^{(2)}& \Theta_{11}^{(2)}&\Theta_{12}^{(2)}\\
\Theta_{20}^{(2)}& \Theta_{21}^{(2)}&\Theta_{22}^{(2)}\\
\end{pmatrix}
\]
\[\Theta^{(3)} \; (1 \times 3) =
\begin{pmatrix}
\Theta_{10}^{(3)}& \Theta_{11}^{(3)}&\Theta_{12}^{(3)}\\
\end{pmatrix}
\]
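These sizes follow the rule \(s_{j+1} \times (s_j + 1)\) from the terminology section; a quick Octave check with hypothetical initialization:
% s_1 = s_2 = s_3 = 2 (input/hidden units), s_4 = 1 (output unit)
Theta1 = rand(2,3);  % maps layer 1 -> 2: s_2 x (s_1 + 1)
Theta2 = rand(2,3);  % maps layer 2 -> 3: s_3 x (s_2 + 1)
Theta3 = rand(1,3);  % maps layer 3 -> 4: s_4 x (s_3 + 1)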
6 Derivatives
Single-class (binary):
\[J(\Theta) = - \frac{1}{m} \sum_{i=1}^m [ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))]
\]
Multi-class:
\[\begin{gather*}J(\Theta) = - \frac{1}{m} \sum_{t=1}^m\sum_{k=1}^K \left[ y^{(t)}_k \ \log (h_\Theta (x^{(t)}))_k + (1 - y^{(t)}_k)\ \log (1 - h_\Theta(x^{(t)})_k)\right]\end{gather*}
\]
Here we derive the single-class case. For simplicity we assume a single sample, m = 1; multiple samples change nothing essential, being just the vectorized form of the single-sample computation. We also drop regularization to keep the derivation clean.
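With those simplifications, the cost being differentiated is a single line of Octave. A tiny sketch with hypothetical values, where h plays the role of the scalar output \(h_\Theta(x)\):
h = 0.7; y = 1;                           % hypothetical values for illustration
J = -(y * log(h) + (1 - y) * log(1 - h))  % single-sample, single-class cost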
Backpropagation computes the derivative with respect to each \(\Theta\), and each \(\Theta\) lives inside \(J(\Theta)\). So we apply the chain rule directly, tracing each dependency back to its source until we reach the \(\Theta\) we need.
1 Single-class \(\Theta^{(3)}\) (1x3)
With m = 1, \(J(\Theta) = -[\,y \log a_1^{(4)} + (1-y)\log(1 - a_1^{(4)})\,]\), so
\[\dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} = \dfrac{a_{1}^{(4)} - y}{a_{1}^{(4)}(1 - a_{1}^{(4)})}, \qquad \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} = a_{1}^{(4)}(1 - a_{1}^{(4)}),\]
and the two leading factors of every chain below therefore multiply out to \(a_{1}^{(4)} - y\).
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{10}^{(3)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial \Theta_{10}^{(3)}}\\
&=(a_{1}^{(4)} - y)a_{0}^{(3)}
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{11}^{(3)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial \Theta_{11}^{(3)}}\\
&=(a_{1}^{(4)} - y)a_{1}^{(3)}
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{12}^{(3)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial \Theta_{12}^{(3)}}\\
&=(a_{1}^{(4)} - y)a_{2}^{(3)}
\end{align*}\]
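Each of these closed forms can be verified numerically with a centered finite difference. A minimal Octave sketch for \(\dfrac{\partial J(\Theta)}{\partial \Theta_{10}^{(3)}}\), with all weights and inputs hypothetical:
g = @(z) 1 ./ (1 + exp(-z));
Theta1 = rand(2,3); Theta2 = rand(2,3); Theta3 = rand(1,3);
x = [0.5; 0.9]; y = 1; epsl = 1e-6;
a2 = [1; g(Theta1 * [1; x])];                      % forward to layer 2
a3 = [1; g(Theta2 * a2)];                          % forward to layer 3
J  = @(T3) -(y * log(g(T3 * a3)) + (1 - y) * log(1 - g(T3 * a3)));
Tp = Theta3; Tp(1) = Tp(1) + epsl;                 % perturb Theta_10^(3)
Tm = Theta3; Tm(1) = Tm(1) - epsl;
numeric  = (J(Tp) - J(Tm)) / (2 * epsl)
analytic = (g(Theta3 * a3) - y) * a3(1)            % (a_1^(4) - y) * a_0^(3)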
2 Single-class \(\Theta^{(2)}\) (2x3)
\[\begin{align*}
\dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} &= \Theta_{11}^{(3)} \\
\dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} &= a_{1}^{(3)}(1 - a_{1}^{(3)}) \\
\dfrac{\partial z_{1}^{(3)}}{\partial \Theta_{10}^{(2)}} &= a_{0}^{(2)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{10}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial \Theta_{10}^{(2)}}\\
&=(a_{1}^{(4)} - y) \Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)}) a_{0}^{(2)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{11}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial \Theta_{11}^{(2)}}\\
&=(a_{1}^{(4)} - y) \Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)}) a_{1}^{(2)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{12}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial \Theta_{12}^{(2)}}\\
&=(a_{1}^{(4)} - y) \Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)}) a_{2}^{(2)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} &= \Theta_{12}^{(3)} \\
\dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} &= a_{2}^{(3)}(1 - a_{2}^{(3)}) \\
\dfrac{\partial z_{2}^{(3)}}{\partial \Theta_{20}^{(2)}} &= a_{0}^{(2)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{20}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial \Theta_{20}^{(2)}}\\
&=(a_{1}^{(4)} - y) \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)}) a_{0}^{(2)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{21}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial \Theta_{21}^{(2)}}\\
&=(a_{1}^{(4)} - y) \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)}) a_{1}^{(2)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{22}^{(2)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}} \dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial \Theta_{22}^{(2)}}\\
&=(a_{1}^{(4)} - y) \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)}) a_{2}^{(2)}\\
\end{align*}\]
3 Single-class \(\Theta^{(1)}\) (2x3)
Here a new wrinkle appears: \(a_1^{(2)}\) feeds both \(z_1^{(3)}\) and \(z_2^{(3)}\), so the chain rule must sum over both paths, which is the parenthesized sum below (and likewise for \(a_2^{(2)}\)).
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{10}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}}
( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{1}^{(2)}}
+
\dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{1}^{(2)}} )
\dfrac{\partial a_{1}^{(2)}}{\partial z_{1}^{(2)}} \dfrac{\partial z_{1}^{(2)}}{\partial \Theta_{10}^{(1)}}\\
&=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{11}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{21}^{(2)}] a_{1}^{(2)}(1 - a_{1}^{(2)}) a_{0}^{(1)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{11}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}}
( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{1}^{(2)}}
+
\dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{1}^{(2)}} )
\dfrac{\partial a_{1}^{(2)}}{\partial z_{1}^{(2)}} \dfrac{\partial z_{1}^{(2)}}{\partial \Theta_{11}^{(1)}}\\
&=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{11}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{21}^{(2)}] a_{1}^{(2)}(1 - a_{1}^{(2)}) a_{1}^{(1)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{12}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}}
( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{1}^{(2)}}
+
\dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{1}^{(2)}} )
\dfrac{\partial a_{1}^{(2)}}{\partial z_{1}^{(2)}} \dfrac{\partial z_{1}^{(2)}}{\partial \Theta_{12}^{(1)}}\\
&=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{11}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{21}^{(2)}] a_{1}^{(2)}(1 - a_{1}^{(2)}) a_{2}^{(1)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{20}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}}
( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{2}^{(2)}}
+
\dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{2}^{(2)}} )
\dfrac{\partial a_{2}^{(2)}}{\partial z_{2}^{(2)}} \dfrac{\partial z_{2}^{(2)}}{\partial \Theta_{20}^{(1)}}\\
&=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{12}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{22}^{(2)}] a_{2}^{(2)}(1 - a_{2}^{(2)}) a_{0}^{(1)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{21}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}}
( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{2}^{(2)}}
+
\dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{2}^{(2)}} )
\dfrac{\partial a_{2}^{(2)}}{\partial z_{2}^{(2)}} \dfrac{\partial z_{2}^{(2)}}{\partial \Theta_{21}^{(1)}}\\
&=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{12}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{22}^{(2)}] a_{2}^{(2)}(1 - a_{2}^{(2)}) a_{1}^{(1)}\\
\end{align*}\]
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta_{22}^{(1)}} &= \dfrac{\partial J(\Theta)}{\partial a_{1}^{(4)}} \dfrac{\partial a_{1}^{(4)}}{\partial z_{1}^{(4)}}
( \dfrac{\partial z_{1}^{(4)}}{\partial a_{1}^{(3)}} \dfrac{\partial a_{1}^{(3)}}{\partial z_{1}^{(3)}} \dfrac{\partial z_{1}^{(3)}}{\partial a_{2}^{(2)}}
+
\dfrac{\partial z_{1}^{(4)}}{\partial a_{2}^{(3)}} \dfrac{\partial a_{2}^{(3)}}{\partial z_{2}^{(3)}} \dfrac{\partial z_{2}^{(3)}}{\partial a_{2}^{(2)}} )
\dfrac{\partial a_{2}^{(2)}}{\partial z_{2}^{(2)}} \dfrac{\partial z_{2}^{(2)}}{\partial \Theta_{22}^{(1)}}\\
&=(a_{1}^{(4)} - y) [\Theta_{11}^{(3)} a_{1}^{(3)}(1 - a_{1}^{(3)})\Theta_{12}^{(2)} + \Theta_{12}^{(3)} a_{2}^{(3)}(1 - a_{2}^{(3)})\Theta_{22}^{(2)}] a_{2}^{(2)}(1 - a_{2}^{(2)}) a_{2}^{(1)}\\
\end{align*}\]
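The same finite-difference check confirms the two-path sum. A sketch for \(\dfrac{\partial J(\Theta)}{\partial \Theta_{11}^{(1)}}\), again with hypothetical values; note that Theta1(1,2) holds \(\Theta_{11}^{(1)}\) because the bias column comes first:
g = @(z) 1 ./ (1 + exp(-z));
Theta1 = rand(2,3); Theta2 = rand(2,3); Theta3 = rand(1,3);
x = [0.5; 0.9]; y = 1; epsl = 1e-6;
fwd = @(T1) g(Theta3 * [1; g(Theta2 * [1; g(T1 * [1; x])])]);
J   = @(T1) -(y * log(fwd(T1)) + (1 - y) * log(1 - fwd(T1)));
Tp = Theta1; Tp(1,2) = Tp(1,2) + epsl;             % perturb Theta_11^(1)
Tm = Theta1; Tm(1,2) = Tm(1,2) - epsl;
numeric = (J(Tp) - J(Tm)) / (2 * epsl)
a1 = [1; x]; a2 = [1; g(Theta1*a1)]; a3 = [1; g(Theta2*a2)]; a4 = g(Theta3*a3);
analytic = (a4 - y) * (Theta3(2)*a3(2)*(1-a3(2))*Theta2(1,2) ...
         + Theta3(3)*a3(3)*(1-a3(3))*Theta2(2,2)) * a2(2)*(1-a2(2)) * a1(2)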
Tips
We have just seen that computing \(\dfrac{\partial J(\Theta)}{\partial \Theta^{(1)}}\) for the 2x3 matrix \(\Theta^{(1)}\) involves ever deeper function nesting; restarting the chain from scratch for every weight becomes painfully complicated.
\(\dfrac{\partial J(\Theta)}{\partial z^{(j+1)}} = \delta^{(j+1)}\) is an interface. On one side it reaches \(\Theta^{(j)}\), the derivative we ultimately want, and we can stop there. On the other side it reaches \(a^{(j)}\), and we set out again: unpacking \(z^{(j)}\) through \(a^{(j)}\) forms the next layer's interface \(\dfrac{\partial J(\Theta)}{\partial z^{(j)}} = \delta^{(j)}\), which yields the derivatives of the next layer's \(\Theta\).
So \(\delta\) is a saved checkpoint in the formulas. All we do, and only do, is: first unpack \(\dfrac{\partial z^{(j+1)}}{\partial a^{(j)}}\), then unpack \(\dfrac{\partial a^{(j)}}{\partial z^{(j)}}\); these correspond to \(\Theta\) and to the derivative of the sigmoid, respectively. Save the result as the next checkpoint, and repeat.
\[\begin{align*}
\delta^{(4)} &= a^{(4)} - y \\
\delta^{(3)} &= (\Theta^{(3)})^{T} \delta^{(4)} \;.*\; a^{(3)}(1 - a^{(3)}) \quad \text{(then remove $\delta_0^{(3)}$)}\\
\delta^{(2)} &= (\Theta^{(2)})^{T} \delta^{(3)} \;.*\; a^{(2)}(1 - a^{(2)}) \quad \text{(then remove $\delta_0^{(2)}$)}\\
\end{align*}\]
Because a hidden layer's bias unit is the constant 1, it is not an interface into the next layer; the formulas agree, since the factor \(a_0(1 - a_0) = 0\) makes the bias unit's \(\delta\) zero. So we remove those entries, which also makes the next layer's \(\delta\) computation dimensionally consistent.
\[\begin{align*}
\dfrac{\partial J(\Theta)}{\partial \Theta^{(3)}} &= \delta^{(4)} (a^{(3)})^{T} = (a^{(4)} - y)(a^{(3)})^{T} \\
\dfrac{\partial J(\Theta)}{\partial \Theta^{(2)}} &= \delta^{(3)} (a^{(2)})^{T} \\
\dfrac{\partial J(\Theta)}{\partial \Theta^{(1)}} &= \delta^{(2)} (a^{(1)})^{T} \\
\end{align*}\]
For a single sample, \(\delta^{(l+1)}\) and \(a^{(l)}\) are both column vectors, so each product above is an outer product whose shape matches \(\Theta^{(l)}\). (In the vectorized multi-sample code below, \(a^{(l)}\) becomes a matrix with one row per sample.)
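Putting the checkpoints together, the entire single-sample backward pass for the toy network fits in a few lines. A sketch; the weights are random just so it runs:
g = @(z) 1 ./ (1 + exp(-z));
Theta1 = rand(2,3); Theta2 = rand(2,3); Theta3 = rand(1,3);
x = [0.5; 0.9]; y = 1;
a1 = [1; x]; a2 = [1; g(Theta1*a1)]; a3 = [1; g(Theta2*a2)]; a4 = g(Theta3*a3);
d4 = a4 - y;                               % delta^(4), a scalar here
d3 = (Theta3' * d4) .* (a3 .* (1 - a3));   % 3x1; entry 1 (bias) is 0 since a3(1)=1
d3 = d3(2:end);                            % remove delta_0^(3)
d2 = (Theta2' * d3) .* (a2 .* (1 - a2));
d2 = d2(2:end);                            % remove delta_0^(2)
Theta3_grad = d4 * a3';                    % outer products: 1x3
Theta2_grad = d3 * a2';                    % 2x3
Theta1_grad = d2 * a1';                    % 2x3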
How would the multi-class case be computed? Try working it out yourself.
Code example
This is backpropagation in practice on a handwritten-digit dataset: X holds 5000 grayscale images of handwritten digits, 400 pixels each, and there are two weight matrices, Theta1 and Theta2. Forward propagation computes the cost \(J\); backpropagation computes \(\dfrac{\partial J(\Theta)}{\partial \Theta^{(j)}}\). It is a 10-class, 3-layer network.
% ----X = 5000x400; y = 5000x1; Theta1 = 25x401; Theta2 = 10x26----
% ----Feedforward----
a1 = X;
a1 = [ones(m,1) a1]; % 5000x401
z2 = a1 * Theta1'; % 5000x401 401x25
a2 = sigmoid(z2);
a2 = [ones(size(a2,1),1) a2]; %5000x26
z3 = a2 * Theta2'; % 5000x26 26x10
a3 = sigmoid(z3);
% ----Cost By For Loop and Not Regularized----
%for k=1:num_labels,
% y_k = (y==k);
% J = J -(1/m) * (y_k' * log(a3(:,k)) + (1 - y_k') * log(1 - a3(:,k)));
%end
% ----Cost By Matrix and Not Regularized----
% y_K = 5000x10. Notice that " .* ";
y_K = zeros(m, num_labels);
for k=1:num_labels,
y_K(:,k) = (y==k);
end
% First Part, Not Regularized
%J = -(1/m) * sum(sum((y_K .* log(a3) + (1 - y_K) .* log(1 - a3))));
Theta1_fix = [zeros(size(Theta1,1),1) Theta1(:,2:end)];
Theta2_fix = [zeros(size(Theta2,1),1) Theta2(:,2:end)];
Theta_fix = [Theta1_fix(:);Theta2_fix(:)];
% Regularized
J = -(1/m) * sum(sum((y_K .* log(a3) + (1 - y_K) .* log(1 - a3)))) + (lambda/(2*m)) * sum(Theta_fix .^2);
gDz2 = sigmoidGradient(z2); % 5000x25
deltaL3 = (a3 - y_K)'; % (5000x10)' = 10x5000
deltaL2 = Theta2' * deltaL3 .* [zeros(size(gDz2,1),1) gDz2]'; % (26x5000) .* (26x5000); the prepended zero column zeroes out the bias row
% First Part, Not Regularized
%Theta2_grad = (1/m) * deltaL3 * a2 ;
%Theta1_grad = (1/m) * deltaL2(2:end,:) * a1 ;
Theta2_grad = (1/m) * deltaL3 * a2 + (lambda/m) * Theta2_fix;
Theta1_grad = (1/m) * deltaL2(2:end,:) * a1 + (lambda/m) * Theta1_fix;
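If this sits inside a cost function handed to an optimizer (as in the Coursera exercise this code follows), the two gradient matrices are usually unrolled into one vector; a typical final line:
grad = [Theta1_grad(:); Theta2_grad(:)]; % unroll both gradient matrices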
Reference
[1] Andrew Ng. Machine Learning (Coursera). Cost Function and Backpropagation.
This article may be revised at any time; read it on cnblogs for the latest version. Other sites scrape this article, and their copies may differ.
Please credit the source when reposting ( ̄︶ ̄)↗
https://www.cnblogs.com/asmurmur/