Mathematical Foundations for the "Gradient Derivation and Code Verification for Neural Networks" Series: Matrix Differentials and Derivatives

Posted by SumwaiLiu on 2020-09-01

This is the first chapter of the "Gradient Derivation and Code Verification for Neural Networks" series; for more related content, see the series introduction.


1.1 Mathematical Notation

The notation used throughout this series is as follows:

  • Scalars are written as lowercase letters, e.g. $x,y$
  • Vectors are written as bold lowercase letters, e.g. $\boldsymbol{x},\boldsymbol{y}$. Every element of a vector is a scalar, written as a subscripted lowercase letter, e.g. $x_{1},x_{2}$. Note that, by mathematical convention, a vector is a column vector by default; a row vector is written with a transpose, e.g. $\boldsymbol{x}^{T}$ denotes a row vector
  • Matrices are written as bold uppercase letters, e.g. $\boldsymbol{A},\boldsymbol{B}$, and their elements as $a_{ij},b_{ij}$

 

1.2 Definition and Layout of Matrix Derivatives

Depending on whether the independent variable (the one we differentiate with respect to) and the dependent variable (the one being differentiated) is a scalar, a vector, or a matrix, there are 9 possible kinds of matrix derivatives, listed below:

Grouped by the independent variable, with the dependent variable being a scalar $y$, a vector $\boldsymbol{y}$, or a matrix $\boldsymbol{Y}$ in turn:

  • Independent variable a scalar $x$: $\frac{\partial y}{\partial x}$, $\frac{\partial \boldsymbol{y}}{\partial x}$, $\frac{\partial \boldsymbol{Y}}{\partial x}$
  • Independent variable a vector $\boldsymbol{x}$: $\frac{\partial y}{\partial \boldsymbol{x}}$, $\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}$, $\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{x}}$ (never needed in this series)
  • Independent variable a matrix $\boldsymbol{X}$: $\frac{\partial y}{\partial \boldsymbol{X}}$, $\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{X}}$ (used, but not studied in depth), $\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}$ (used, but not studied in depth)

These 9 kinds of derivatives are introduced below in increasing order of difficulty. Note that not all of them are covered, since some never appear in the derivations of this series and I have not studied them closely either.

------------- Easy -------------

  • Scalar-by-scalar derivatives need no special treatment; skipping...
  • Vector/matrix-by-scalar derivatives: each element of the vector/matrix is differentiated with respect to the scalar separately
    • Example:

The derivative of $\boldsymbol{Y}=\begin{bmatrix}x& 2x\\ x^{2} & 2\end{bmatrix}$ with respect to $x$ is $\frac{\partial \boldsymbol{Y}}{\partial x} = \left\lbrack \begin{array}{ll} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} \\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} 1 & 2 \\ {2x} & 0 \\ \end{array} \right\rbrack$

Vector-by-scalar derivatives work in exactly the same way:

The derivative of $\boldsymbol{y} = \left\lbrack \begin{array}{l} x \\ {2x} \\ x^{2} \\ \end{array} \right\rbrack$ with respect to $x$ is $\frac{\partial\boldsymbol{y}}{\partial x} = \left\lbrack \begin{array}{l} \frac{\partial y_{1}}{\partial x} \\ \frac{\partial y_{2}}{\partial x} \\ \frac{\partial y_{3}}{\partial x} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{l} 1 \\ 2 \\ {2x} \\ \end{array} \right\rbrack$

and the derivative of $\boldsymbol{y}^{T} = \left\lbrack {x,~2x,~x^{2}} \right\rbrack$ with respect to $x$ is $\frac{\partial\boldsymbol{y}^{T}}{\partial x} = \left\lbrack {\frac{\partial y_{1}}{\partial x},~\frac{\partial y_{2}}{\partial x},~\frac{\partial y_{3}}{\partial x}} \right\rbrack = \left\lbrack {1,~2,~2x} \right\rbrack$

In short, the result of the derivative has the same shape as the dependent variable; this is the so-called numerator layout.

  • Scalar-by-vector/matrix derivatives: the scalar is differentiated with respect to each element of the vector/matrix
    • Example:

The derivative of $y = x_{1} + {2x}_{2} + x_{3}^{2} + 1$ with respect to $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \\ \end{array} \right\rbrack$ is $\frac{\partial y}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{l} \frac{\partial y}{\partial x_{1}} \\ \frac{\partial y}{\partial x_{2}} \\ \frac{\partial y}{\partial x_{3}} \\ \frac{\partial y}{\partial x_{4}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{l} 1 \\ 2 \\ {2x}_{3} \\ 0 \\ \end{array} \right\rbrack$

The derivative of $y = x_{1} + {2x}_{2} + x_{3}^{2} + 1$ with respect to $\boldsymbol{x}^{T} = \left\lbrack {x_{1},x_{2},x_{3},x_{4}} \right\rbrack$ is $\frac{\partial y}{\partial\boldsymbol{x}^{T}} = \left\lbrack {\frac{\partial y}{\partial x_{1}},\frac{\partial y}{\partial x_{2}},\frac{\partial y}{\partial x_{3}},\frac{\partial y}{\partial x_{4}}} \right\rbrack = \left\lbrack {1,2,{2x}_{3},0} \right\rbrack$

Similarly, if the variables are arranged into a matrix, the derivative of $y = x_{11} + 2x_{12} + x_{21}^{2} + 1$ with respect to $\boldsymbol{X} = \left\lbrack \begin{array}{ll} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right\rbrack$ is $\frac{\partial y}{\partial\boldsymbol{X}} = \left\lbrack \begin{array}{ll} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} \\ \frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} 1 & 2 \\ {2x}_{21} & 0 \\ \end{array} \right\rbrack$

In short, the result of the derivative has the same shape as the independent variable; this is the so-called denominator layout. Matrix differentiation, then, is nothing more than taking scalar-level derivatives element by element and arranging the results into a vector/matrix.
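As a quick sanity check (my own sketch, not part of the original derivation), the scalar-by-vector example above can be verified numerically with finite differences; the test point and step size below are arbitrary choices.

```python
# Minimal finite-difference check of dy/dx for y = x1 + 2*x2 + x3^2 + 1.
import numpy as np

def y(x):
    # y = x1 + 2*x2 + x3^2 + 1, with x4 unused
    return x[0] + 2 * x[1] + x[2] ** 2 + 1

x = np.array([0.3, -1.2, 0.7, 2.0])
eps = 1e-6

# central-difference gradient: perturb each element separately
num_grad = np.array([
    (y(x + eps * np.eye(4)[i]) - y(x - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])

analytic = np.array([1.0, 2.0, 2 * x[2], 0.0])   # [1, 2, 2*x3, 0]
print(np.allclose(num_grad, analytic, atol=1e-5))  # expect True
```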

------------- Slightly harder -------------

  • Next comes the slightly more involved vector-by-vector derivative. In all of the derivatives above, either the independent or the dependent variable is a scalar, so both the computation and the layout of the result are fairly obvious. The vector-by-vector derivative is generally defined as follows:

Let $\boldsymbol{y} = \left\lbrack \begin{array}{l} y_{1} \\ y_{2} \\ y_{3} \\ \end{array} \right\rbrack$ and $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ \end{array} \right\rbrack$; then $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} \\ \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} \\ \frac{\partial y_{3}}{\partial x_{1}} & \frac{\partial y_{3}}{\partial x_{2}} \\ \end{array} \right\rbrack$

The matrix obtained by this derivative is called the Jacobian matrix (important). Its first dimension (rows) follows the numerator and its second dimension (columns) follows the denominator; intuitively, each element of the numerator gets a row of its own, with its partial derivatives laid out horizontally.

    • Example:

The derivative of $\boldsymbol{y} = \left\lbrack \begin{array}{l} {x_{1} + x_{2}} \\ x_{1} \\ {x_{1} + x_{2}^{2}} \\ \end{array} \right\rbrack$ with respect to $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ \end{array} \right\rbrack$ is $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} \\ \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} \\ \frac{\partial y_{3}}{\partial x_{1}} & \frac{\partial y_{3}}{\partial x_{2}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} 1 & 1 \\ 1 & 0 \\ 1 & {2x_{2}} \\ \end{array} \right\rbrack$

 

1.3 Why Bother with Matrix Derivatives

The point of matrix differentiation is not pedantry: it makes it much less error-prone to work with the large number of parameters in a neural network.

Here is an example:

$\boldsymbol{A} = \left( \begin{array}{ll} 1 & 2 \\ 3 & 4 \\ \end{array} \right)$, $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ \end{array} \right\rbrack$; find the derivative of $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x}$ with respect to $\boldsymbol{x}$.

 

Solving this from the definition, we first compute the vector $\boldsymbol{y} = \left\lbrack \begin{array}{l} {x_{1} + 2x_{2}} \\ {{3x}_{1} + {4x}_{2}} \\ \end{array} \right\rbrack$, then apply the vector-by-vector definition from section 1.2 to compute $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} \\ \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} \\ \end{array} \right\rbrack$, which gives $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 3 & 4 \\ \end{array} \right\rbrack = \boldsymbol{A}$.

Another example:

$\boldsymbol{A} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 3 & 4 \\ \end{array} \right\rbrack$, $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ \end{array} \right\rbrack$; find the derivative of $y = \boldsymbol{x}^{T}\boldsymbol{A}\boldsymbol{x}$ with respect to $\boldsymbol{x}$.

Solving from the definition, we first compute the scalar $y = x_{1}^{2} + 5x_{1}x_{2} + {4x}_{2}^{2}$, then apply the scalar-by-vector definition from section 1.2: $\frac{\partial y}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{l} \frac{\partial y}{\partial x_{1}} \\ \frac{\partial y}{\partial x_{2}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{l} {2x_{1} + 5x_{2}} \\ {5x_{1} + 8x_{2}} \\ \end{array} \right\rbrack$

 

In fact, $\left\lbrack \begin{array}{l} {2x_{1} + 5x_{2}} \\ {5x_{1} + 8x_{2}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{l} {x_{1} + 2x_{2}} \\ {3x_{1} + 4x_{2}} \\ \end{array} \right\rbrack + \left\lbrack \begin{array}{l} {x_{1} + 3x_{2}} \\ {2x_{1} + 4x_{2}} \\ \end{array} \right\rbrack = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{A}^{T}\boldsymbol{x}$. The second equality is not a coincidence: it is a result that can be obtained directly from matrix differentiation (we will see how later).

 

So for the first example the definition-based approach may still be quick and reliable, but for the second example, starting from scalar-by-scalar derivatives, writing out $y$ and then differentiating with respect to $\boldsymbol{x}$ is already somewhat tedious and error-prone. If we instead work directly at the level of vectors and matrices, the result can easily be written as a combination of vectors and matrices, which is both efficient and compact.
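Both examples can also be checked numerically. The following NumPy sketch (test point and step size are my own choices) compares finite-difference results with the closed forms $\frac{\partial(\boldsymbol{A}\boldsymbol{x})}{\partial\boldsymbol{x}} = \boldsymbol{A}$ and $\frac{\partial(\boldsymbol{x}^{T}\boldsymbol{A}\boldsymbol{x})}{\partial\boldsymbol{x}} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{A}^{T}\boldsymbol{x}$.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([0.5, -1.5])
eps = 1e-6

# Jacobian of y = A x: rows follow the numerator, columns the denominator
jac = np.zeros((2, 2))
for j in range(2):
    dx = np.zeros(2); dx[j] = eps
    jac[:, j] = (A @ (x + dx) - A @ (x - dx)) / (2 * eps)
print(np.allclose(jac, A))                      # expect True

# Gradient of the quadratic form y = x^T A x
f = lambda v: v @ A @ v
grad = np.array([(f(x + eps * np.eye(2)[j]) - f(x - eps * np.eye(2)[j])) / (2 * eps)
                 for j in range(2)])
print(np.allclose(grad, A @ x + A.T @ x))       # expect True
```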

 

1.4 Matrix Differentials and Matrix Derivatives

In high school we learned that the differential of a univariate function $f\left( x \right)$ and its derivative are related as follows:

$df = f^{'}\left( x \right)dx$

At university, in the calculus course, we further learned the corresponding relation for a multivariate function $f\left( x_{1},x_{2},x_{3} \right)$:

$df = \frac{\partial f}{\partial x_{1}}dx_{1} + \frac{\partial f}{\partial x_{2}}dx_{2} + \frac{\partial f}{\partial x_{3}}dx_{3}$ (1.1)

This is the total differential formula (remember it?).

Looking at this expression, we can see that:

$df = {\sum\limits_{i = 1}^{n}\frac{\partial f}{\partial x_{i}}}dx_{i} = \left( \frac{\partial f}{\partial\boldsymbol{x}} \right)^{T}d\boldsymbol{x}$ (1.2)

The first equality is the total differential formula; the second expresses the link between the gradient (the partial derivatives) and the differential: formally, it is the inner product of $\frac{\partial f}{\partial\boldsymbol{x}}$ and $d\boldsymbol{x}$.

This suggests the extension to matrices:

$~df = {\sum\limits_{i = 1}^{m}{\sum\limits_{j = 1}^{n}\frac{\partial f}{\partial x_{ij}}}}dx_{ij} = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$ (1.3)

 

Here the second equality uses the trace: the trace of a matrix is the sum of its main-diagonal elements, and in particular:

$tr\left( {A^{T}B} \right) = {\sum\limits_{i,j}{a_{ij}b_{ij}}}$

The left-hand side looks a bit unpleasant, but the meaning of the right-hand side is very clear: multiply the two matrices element by element and sum everything up, just like the inner product of two vectors. This is called the inner product of two matrices.

 

Here is an example:

Let $f\left( x_{11},x_{12},x_{21},x_{22} \right)$ be a multivariate function. By the total differential formula, $df = {\sum\limits_{i = 1}^{m}{\sum\limits_{j = 1}^{n}\frac{\partial f}{\partial x_{ij}}}}dx_{ij}$ holds (here $m = n = 2$). Now arrange the four variables into a matrix $\boldsymbol{X} = \left\lbrack \begin{array}{ll} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right\rbrack$; then, by the scalar-by-matrix definition given earlier:

$\frac{\partial f}{\partial\boldsymbol{X}} = \left\lbrack \begin{array}{ll} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{12}} \\ \frac{\partial f}{\partial x_{21}} & \frac{\partial f}{\partial x_{22}} \\ \end{array} \right\rbrack$

$\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X} = \left\lbrack \begin{array}{ll} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{21}} \\ \frac{\partial f}{\partial x_{12}} & \frac{\partial f}{\partial x_{22}} \\ \end{array} \right\rbrack\left\lbrack \begin{array}{ll} {dx}_{11} & {dx_{12}} \\ {dx}_{21} & {dx}_{22} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} {\frac{\partial f}{\partial x_{11}}{dx}_{11} + \frac{\partial f}{\partial x_{21}}{dx}_{21}} & {\frac{\partial f}{\partial x_{11}}dx_{12} + \frac{\partial f}{\partial x_{21}}{dx}_{22}} \\ {\frac{\partial f}{\partial x_{12}}{dx}_{11} + \frac{\partial f}{\partial x_{22}}{dx}_{21}} & {\frac{\partial f}{\partial x_{12}}{dx}_{12} + \frac{\partial f}{\partial x_{22}}{dx}_{22}} \\ \end{array} \right\rbrack$

and since $tr\left( ~ \right)$ sums the diagonal elements of a matrix,

$df = \frac{\partial f}{\partial x_{11}}{dx}_{11} + \frac{\partial f}{\partial x_{21}}{dx}_{21} + \frac{\partial f}{\partial x_{12}}{dx}_{12} + \frac{\partial f}{\partial x_{22}}{dx}_{22} = ~tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$ indeed holds.
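To make equation 1.3 concrete, here is a small numerical illustration (my own sketch; the function $f(\boldsymbol{X})$ below is chosen only for this demo): for a small perturbation $d\boldsymbol{X}$, the change $f(\boldsymbol{X}+d\boldsymbol{X}) - f(\boldsymbol{X})$ should agree with $tr\left(\left(\frac{\partial f}{\partial\boldsymbol{X}}\right)^{T}d\boldsymbol{X}\right)$ to first order.

```python
import numpy as np

def f(X):
    # an arbitrary scalar function of a 2x2 matrix, picked for this demo
    return X[0, 0] ** 2 + X[0, 0] * X[0, 1] + np.sin(X[1, 0]) + X[1, 1]

def grad_f(X):
    # elementwise partial derivatives, arranged in the same shape as X
    return np.array([[2 * X[0, 0] + X[0, 1], X[0, 0]],
                     [np.cos(X[1, 0]),       1.0]])

X = np.array([[0.4, -0.3], [1.1, 0.9]])
dX = 1e-6 * np.array([[0.7, -1.3], [0.2, 0.5]])

lhs = f(X + dX) - f(X)
rhs = np.trace(grad_f(X).T @ dX)
print(lhs, rhs)   # the two numbers agree to first order in dX
```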

 

1.5 Properties of the Matrix Differential

Before discussing how to use matrix differentials to compute derivatives, let us first list the properties of the matrix differential; they are all fairly intuitive:

  • Sum and difference: $d\left( {\boldsymbol{X} \pm \boldsymbol{Y}} \right) = d\boldsymbol{X} \pm d\boldsymbol{Y}$
  • Product: $d\left( \boldsymbol{X}\boldsymbol{Y} \right) = \left( d\boldsymbol{X} \right)\boldsymbol{Y} + \boldsymbol{X}\left( d\boldsymbol{Y} \right)$
  • Transpose: $d\left( \boldsymbol{X}^{\boldsymbol{T}} \right) = \left( d\boldsymbol{X} \right)^{T}$
  • Trace: $dtr\left( \boldsymbol{X} \right) = tr\left( d\boldsymbol{X} \right)$
  • Hadamard (element-wise) product: $d\left( {\boldsymbol{X}\bigodot \boldsymbol{Y}} \right) = \boldsymbol{X}\bigodot d\boldsymbol{Y} + d\boldsymbol{X}\bigodot \boldsymbol{Y}$; note that $\bigodot$ has lower precedence than ordinary matrix multiplication
  • Element-wise function: $d\sigma\left( \boldsymbol{X} \right) = \sigma^{'}\left( \boldsymbol{X} \right)\bigodot d\boldsymbol{X}$
  • Inverse: $d\boldsymbol{X}^{- 1} = - \boldsymbol{X}^{- 1}d\boldsymbol{X}\boldsymbol{X}^{- 1}$
  • Determinant (not used in this series): $d\left| \boldsymbol{X} \right| = \left| \boldsymbol{X} \right|tr\left( \boldsymbol{X}^{- 1}d\boldsymbol{X} \right)$

Here $\sigma\left( \boldsymbol{X} \right)$ means applying the function $\sigma$ to every element of $\boldsymbol{X}$, i.e. $\sigma\left( \boldsymbol{X} \right) = \left\lbrack \begin{array}{lll} {\sigma\left( x_{11} \right)} & \cdots & {\sigma\left( x_{1n} \right)} \\ \vdots & \ddots & \vdots \\ {\sigma\left( x_{m1} \right)} & \cdots & {\sigma\left( x_{mn} \right)} \\ \end{array} \right\rbrack$. This is exactly what happens when data passes through an activation function in a neural network.

 

For example, with $\boldsymbol{X} = \left\lbrack \begin{array}{ll} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right\rbrack$, $dsin\left( \boldsymbol{X} \right) = \left\lbrack \begin{array}{ll} {cosx_{11}dx_{11}} & {{cosx}_{12}dx_{12}} \\ {cosx_{21}dx_{21}} & {{cosx}_{22}dx_{22}} \\ \end{array} \right\rbrack = cos\left( \boldsymbol{X} \right)\bigodot d\boldsymbol{X}$

If you are unsure about any of the other properties, you can check them with small examples yourself, e.g. as sketched below.
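For instance, the $\sin$ example above can be checked like this (a rough numerical sketch with random data):

```python
import numpy as np

X = np.random.randn(2, 2)
dX = 1e-6 * np.random.randn(2, 2)

lhs = np.sin(X + dX) - np.sin(X)          # actual elementwise change
rhs = np.cos(X) * dX                       # cos(X) ⊙ dX
print(np.allclose(lhs, rhs, atol=1e-10))   # expect True (agree to first order)
```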

 

1.6 The Trace Trick: a Recipe for Scalar-by-Matrix/Vector Derivatives

We want to use the link between differentials and the derivative of a scalar (the loss) with respect to a vector (the output of some layer) or a matrix (the parameters of some layer), i.e. equations 1.2 and 1.3, to compute scalar-by-vector and scalar-by-matrix derivatives. If the differential of a scalar can be written in that form, the derivative is exactly the part under the transpose on the right-hand side, i.e. $\frac{\partial f}{\partial\boldsymbol{x}}$ in $df = {\sum\limits_{i = 1}^{n}\frac{\partial f}{\partial x_{i}}}dx_{i} = \left( \frac{\partial f}{\partial\boldsymbol{x}} \right)^{T}d\boldsymbol{x}$ and $\frac{\partial f}{\partial\boldsymbol{X}}$ in $df = {\sum\limits_{i = 1}^{m}{\sum\limits_{j = 1}^{n}\frac{\partial f}{\partial x_{ij}}}}dx_{ij} = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$.

 

Before working through examples, there is one indispensable tool to introduce: the trace trick, which is extremely useful when computing scalar-by-matrix/vector derivatives (the small demos later show why it is needed). Here are the relevant properties of the trace (all of them useful):

  • The trace of a scalar is the scalar itself: $tr\left( x \right) = x$
  • Invariance under transpose: $tr\left( \boldsymbol{A}^{T} \right) = tr\left( \boldsymbol{A} \right)$
  • Cyclic invariance: $tr\left( {\boldsymbol{A}\boldsymbol{B}} \right) = tr\left( {\boldsymbol{B}\boldsymbol{A}} \right)$, where $\boldsymbol{A}$ and $\boldsymbol{B}^{T}$ have the same dimensions (otherwise the products are not even defined). Both sides equal $\sum\limits_{i,j}{\boldsymbol{A}_{ij}\boldsymbol{B}_{ji}}$
  • Sum and difference: $tr\left( {\boldsymbol{A} \pm \boldsymbol{B}} \right) = tr\left( \boldsymbol{A} \right) \pm tr\left( \boldsymbol{B} \right)$
  • Interchanging matrix product and Hadamard product inside a trace: $tr\left( {\left( {\boldsymbol{A}\bigodot\boldsymbol{B}} \right)^{T}\boldsymbol{C}} \right) = tr\left( {\boldsymbol{A}^{T}\left( {\boldsymbol{B}\bigodot\boldsymbol{C}} \right)} \right)$, which requires $\boldsymbol{A},\boldsymbol{B},\boldsymbol{C}$ to have the same dimensions. Both sides equal ${\sum\limits_{i,j}{\boldsymbol{A}_{ij}\boldsymbol{B}_{ij}}}\boldsymbol{C}_{ij}$.

To summarize the recipe for scalar-by-matrix/vector derivatives: if the scalar function $f$ is built from the matrix $\boldsymbol{X}$ through addition, subtraction, multiplication, inversion, determinants, element-wise functions and so on, first differentiate $f$ using the corresponding rules, then use the trace trick to wrap $df$ in a trace and move everything else to the left of $d\boldsymbol{X}$; comparing with the identity $df = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$ then yields the derivative.

 

In particular, if the matrix degenerates to a vector, compare with the identity $df = \left( \frac{\partial f}{\partial\boldsymbol{x}} \right)^{T}d\boldsymbol{x}$ instead to obtain the derivative.

 

Here is an example:

Let $y = \boldsymbol{a}^{T}exp\left( \boldsymbol{X}\boldsymbol{b} \right)$; find $\frac{\partial y}{\partial\boldsymbol{X}}$.

By the first trace property: $dy = tr\left( {dy} \right) = tr\left( {d\left( {\boldsymbol{a}^{T}exp\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right)$

By the second differential property (the product rule): $tr\left( {d\left( {\boldsymbol{a}^{T}exp\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right) = tr\left( {d\boldsymbol{a}^{T}exp\left( {\boldsymbol{X}\boldsymbol{b}} \right) + \boldsymbol{a}^{T}dexp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)$. Since we are differentiating with respect to $\boldsymbol{X}$, the vector $\boldsymbol{a}$ is a constant and $d\boldsymbol{a}^{T} = 0$, so $tr\left( {d\left( {\boldsymbol{a}^{T}exp\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right) = tr\left( {\boldsymbol{a}^{T}dexp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)$

By the sixth differential property (element-wise functions): $tr\left( {\boldsymbol{a}^{T}dexp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right) = tr\left( {\boldsymbol{a}^{T}\left( {exp\left( \boldsymbol{X}\boldsymbol{b} \right)\bigodot d\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right)$

By the fifth trace property, and using $d\left( \boldsymbol{X}\boldsymbol{b} \right) = \left( d\boldsymbol{X} \right)\boldsymbol{b}$: $tr\left( {\boldsymbol{a}^{T}\left( {exp\left( \boldsymbol{X}\boldsymbol{b} \right)\bigodot d\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right) = tr\left( {\left( {\boldsymbol{a}\bigodot exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)^{T}\left( d\boldsymbol{X} \right)\boldsymbol{b}} \right)$

By the third trace property (cyclic invariance): $tr\left( {\left( {\boldsymbol{a}\bigodot exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)^{T}\left( d\boldsymbol{X} \right)\boldsymbol{b}} \right) = tr\left( {\boldsymbol{b}\left( {\boldsymbol{a}\bigodot exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)^{T}d\boldsymbol{X}} \right)$

Therefore $dy = tr\left( {\boldsymbol{b}\left( {\boldsymbol{a}\bigodot exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)^{T}d\boldsymbol{X}} \right) = tr\left( {\left( {\left( {\boldsymbol{a}\bigodot exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)\boldsymbol{b}^{T}} \right)^{T}d\boldsymbol{X}} \right)$; comparing with $df = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$, we can read off $\frac{\partial y}{\partial\boldsymbol{X}} = \left( {\boldsymbol{a}\bigodot exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)\boldsymbol{b}^{T}$

 

To verify this result with a small concrete example, let $\boldsymbol{X} = \left\lbrack \begin{array}{ll} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right\rbrack$, $\boldsymbol{b} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ \end{array} \right\rbrack$, $\boldsymbol{a} = \left\lbrack \begin{array}{l} 2 \\ 3 \\ \end{array} \right\rbrack$. Then $y = \left\lbrack 2,3 \right\rbrack\left\lbrack \begin{array}{l} {exp\left( x_{11} + 2x_{12} \right)} \\ {exp\left( x_{21} + 2x_{22} \right)} \\ \end{array} \right\rbrack = 2{\exp\left( {x_{11} + 2x_{12}} \right)} + 3exp\left( x_{21} + 2x_{22} \right)$, and computing element by element from the definition gives $\frac{\partial y}{\partial\boldsymbol{X}} = \left\lbrack \begin{array}{ll} {2{\exp\left( {x_{11} + 2x_{12}} \right)}} & {4{\exp\left( {x_{11} + 2x_{12}} \right)}} \\ {3exp\left( x_{21} + 2x_{22} \right)} & {6exp\left( x_{21} + 2x_{22} \right)} \\ \end{array} \right\rbrack$

On the other hand, $\left( {\boldsymbol{a}\bigodot exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)\boldsymbol{b}^{T} = \left\lbrack \begin{array}{l} {2exp\left( x_{11} + 2x_{12} \right)} \\ {3exp\left( x_{21} + 2x_{22} \right)} \\ \end{array} \right\rbrack\left\lbrack {1,~2} \right\rbrack = \left\lbrack \begin{array}{ll} {2{\exp\left( {x_{11} + 2x_{12}} \right)}} & {4{\exp\left( {x_{11} + 2x_{12}} \right)}} \\ {3exp\left( x_{21} + 2x_{22} \right)} & {6exp\left( x_{21} + 2x_{22} \right)} \\ \end{array} \right\rbrack$

so the two results agree.
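The same check can also be automated. The sketch below (test values are arbitrary choices of mine) compares a finite-difference gradient over each $x_{ij}$ with the closed form $\left( {\boldsymbol{a}\bigodot exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)\boldsymbol{b}^{T}$.

```python
import numpy as np

a = np.array([2.0, 3.0])
b = np.array([1.0, 2.0])
X = np.array([[0.1, -0.2], [0.3, 0.4]])
eps = 1e-6

y = lambda M: a @ np.exp(M @ b)          # y = a^T exp(X b)

# central differences over each entry of X
num_grad = np.zeros_like(X)
for i in range(2):
    for j in range(2):
        E = np.zeros_like(X); E[i, j] = eps
        num_grad[i, j] = (y(X + E) - y(X - E)) / (2 * eps)

analytic = np.outer(a * np.exp(X @ b), b)   # (a ⊙ exp(Xb)) b^T
print(np.allclose(num_grad, analytic))      # expect True
```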

 

A short summary of the differential method:

With matrix differentials we can compute derivatives without differentiating each element separately and then assembling the results, which is convenient; the prerequisite, of course, is fluency with the matrix differential properties and the trace properties above. There are also situations where the dependent and independent variables are related through a long chain of intermediate quantities, in which case the differential method becomes somewhat cumbersome. If we can combine a few standard simple results with a chain rule, things become much more convenient. The next section therefore discusses the chain rule for vector and matrix derivatives.

 

1.7 Vector Differentials and Vector-by-Vector Derivatives

So far we have related the differential of a scalar to its derivative with respect to a matrix/vector. We now extend this to the relation between the differential of a vector (the output of one layer) and its derivative with respect to another vector (the output of another layer):


$d\boldsymbol{f} = \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}d\boldsymbol{x}$ (1.4)

Compared with equation 1.2, the partial-derivative factor in this formula carries no transpose.

First, an example to verify equation 1.4:

$\boldsymbol{f} = \boldsymbol{A}\boldsymbol{x}$, $\boldsymbol{A} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 0 & {- 1} \\ \end{array} \right\rbrack$; find $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}$

Solution:

Using the definition, we first compute $\boldsymbol{f} = \left\lbrack \begin{array}{l} {x_{1} + 2x_{2}} \\ {- x_{2}} \\ \end{array} \right\rbrack$, and by the vector-by-vector definition in section 1.2, $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 0 & {- 1} \\ \end{array} \right\rbrack$

Using equation 1.4 instead, $d\boldsymbol{f} = d\left( \boldsymbol{A}\boldsymbol{x} \right) = \boldsymbol{A}d\boldsymbol{x}$; comparing with $d\boldsymbol{f} = \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}d\boldsymbol{x}$ gives $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} = \boldsymbol{A}$, so the formula checks out.

 

To understand, at a deeper level, why equations 1.2 and 1.4 differ by a transpose, compare the following two special cases:

1) $f = \boldsymbol{a}^{T}\boldsymbol{x}$, $\boldsymbol{a} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ \end{array} \right\rbrack$; find $\frac{\partial f}{\partial\boldsymbol{x}}$

2) $\boldsymbol{f} = \boldsymbol{a}^{T}\boldsymbol{x}$, $\boldsymbol{a} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ \end{array} \right\rbrack$; find $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}$

In the first case the left-hand side is a scalar $f = x_{1} + 2x_{2}$; in the second it is a length-1 "vector" $\boldsymbol{f} = \left\lbrack {x_{1} + 2x_{2}} \right\rbrack$.

Look at the first case:

From the definition we immediately get $\frac{\partial f}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ \end{array} \right\rbrack$. To obtain the scalar $df$ on the left-hand side of equation 1.2, one of the two column vectors $\frac{\partial f}{\partial\boldsymbol{x}}$ and $d\boldsymbol{x}$ must be transposed, otherwise the dimensions are incompatible; transposing $\frac{\partial f}{\partial\boldsymbol{x}}$ gives $df = d\left( {x_{1} + 2x_{2}} \right) = \left\lbrack \begin{array}{l} 1 \\ 2 \\ \end{array} \right\rbrack^{T}\left\lbrack \begin{array}{l} {dx_{1}} \\ {dx_{2}} \\ \end{array} \right\rbrack = \left( \frac{\partial f}{\partial\boldsymbol{x}} \right)^{T}d\boldsymbol{x}$

 

Now look at the second case:

From the definition we immediately get $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} = \left\lbrack {1,~2} \right\rbrack$, a 1x2 row vector. This is exactly where the two cases differ, and it is what leads to the different forms of the two identities.

 

What are equations (1.2) and (1.3) good for? Something very useful: by writing down a total differential and applying a few simple properties of the matrix differential (discussed above), we can obtain the derivative of a scalar (the loss of a neural network) with respect to a matrix (a parameter matrix).

 

1.8 Chain Rules for Matrix and Vector Derivatives

We finally get to this part.

The chain rules for matrix and vector derivatives often let us obtain results quickly, but they are not exactly the same as the scalar-by-scalar chain rule, so they deserve a separate discussion.

1.8.1 Chain Rule for Vector-by-Vector Derivatives

First, the chain rule for vector-by-vector derivatives. Suppose several vectors depend on one another, say three vectors $\left. \boldsymbol{x}\rightarrow\boldsymbol{y}\rightarrow\boldsymbol{z} \right.$; then the following chain rule holds:

$\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = \frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}$

Note that this chain rule, in this form, is only valid for derivatives between vectors.

An example to get a feel for it:

$\boldsymbol{z} = {\exp\left( \boldsymbol{y} \right)},~\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x},~\boldsymbol{A} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 3 & 0 \\ \end{array} \right\rbrack$; find $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}}$

Solution:

From the definition of $\boldsymbol{z}$ we get $\boldsymbol{z} = \left\lbrack \begin{array}{l} {exp\left( x_{1} + 2x_{2} \right)} \\ {exp\left( 3x_{1} \right)} \\ \end{array} \right\rbrack$, and recalling the vector-by-vector definition from section 1.2, $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} {exp\left( x_{1} + 2x_{2} \right)} & {2exp\left( x_{1} + 2x_{2} \right)} \\ {3exp\left( 3x_{1} \right)} & 0 \\ \end{array} \right\rbrack$

 

Using the chain rule instead, we first find $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}}$: $d\boldsymbol{z} = dexp\left( \boldsymbol{y} \right) = {\exp\left( \boldsymbol{y} \right)}\bigodot d\boldsymbol{y}$. This is not quite in the form $d\boldsymbol{f} = \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}d\boldsymbol{x}$ yet, but note that ${\exp\left( \boldsymbol{y} \right)}\bigodot d\boldsymbol{y} = diag\left( {\exp\left( \boldsymbol{y} \right)} \right)d\boldsymbol{y}$ (easy to verify).

Here $diag\left( {\exp\left( \boldsymbol{y} \right)} \right) = \left\lbrack \begin{array}{ll} {exp\left( y_{1} \right)} & 0 \\ 0 & {exp\left( y_{2} \right)} \\ \end{array} \right\rbrack$ means placing the elements of the vector $\exp\left( \boldsymbol{y} \right)$ on the diagonal of a matrix, with all other entries zero. So $d\boldsymbol{z} = dexp\left( \boldsymbol{y} \right) = {\exp\left( \boldsymbol{y} \right)}\bigodot d\boldsymbol{y} = diag\left( {\exp\left( \boldsymbol{y} \right)} \right)d\boldsymbol{y}$, and $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}} = diag\left( {\exp\left( \boldsymbol{y} \right)} \right)$.

Next we find $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}$: $d\boldsymbol{y} = d\left( \boldsymbol{A}\boldsymbol{x} \right) = \left( d\boldsymbol{A} \right)\boldsymbol{x} + \boldsymbol{A}d\boldsymbol{x} = \boldsymbol{A}d\boldsymbol{x}$, since $\boldsymbol{A}$ is constant and $d\boldsymbol{A} = 0$; hence $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \boldsymbol{A}$.

 

$\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = \frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = diag\left( {\exp\left( \boldsymbol{y} \right)} \right)\boldsymbol{A} = \left\lbrack \begin{array}{ll} {exp\left( x_{1} + 2x_{2} \right)} & 0 \\ 0 & {exp\left( 3x_{1} \right)} \\ \end{array} \right\rbrack\left\lbrack \begin{array}{ll} 1 & 2 \\ 3 & 0 \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} {exp\left( x_{1} + 2x_{2} \right)} & {2exp\left( x_{1} + 2x_{2} \right)} \\ {3exp\left( 3x_{1} \right)} & 0 \\ \end{array} \right\rbrack$

This matches the result obtained from the definition.
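A numerical cross-check of this chain-rule example (my own sketch, test point chosen arbitrarily): the finite-difference Jacobian of $\boldsymbol{z} = \exp\left( \boldsymbol{A}\boldsymbol{x} \right)$ should match $diag\left( {\exp\left( \boldsymbol{y} \right)} \right)\boldsymbol{A}$.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 0.0]])
x = np.array([0.2, -0.5])
eps = 1e-6

z = lambda v: np.exp(A @ v)

# finite-difference Jacobian, one column of x at a time
jac = np.zeros((2, 2))
for j in range(2):
    dx = np.zeros(2); dx[j] = eps
    jac[:, j] = (z(x + dx) - z(x - dx)) / (2 * eps)

analytic = np.diag(np.exp(A @ x)) @ A      # diag(exp(y)) A
print(np.allclose(jac, analytic))          # expect True
```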

 

1.8.2 Chain Rule for a Scalar Through Multiple Vectors

The chain rule for a scalar through multiple vectors can be derived from the two useful conclusions obtained above:

  • Conclusion 1: when $f$ is a scalar, if we regard $\boldsymbol{f} = \left\lbrack f \right\rbrack$ as a special 1x1 vector, then $\frac{\partial f}{\partial\boldsymbol{x}} = \left( \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} \right)^{T}$
  • Conclusion 2: if $\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}$ are vectors, then $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = \frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}$

If $\left. \boldsymbol{x}\rightarrow\boldsymbol{y}\rightarrow f \right.$ (a scalar), then by conclusion 1, $\frac{\partial f}{\partial\boldsymbol{x}} = \left( \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} \right)^{T}$.

The left-hand side is a scalar-by-vector derivative; the right-hand side contains a vector-by-vector derivative, to which we can now apply conclusion 2, the vector-by-vector chain rule: $\frac{\partial f}{\partial\boldsymbol{x}} = \left( \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} \right)^{T} = \left( {\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}} \right)^{T}$.

 

We are not quite done: in practice what we really want is the derivative of the scalar with respect to a vector (that is the case in which the trace trick is useful), so we convert the special vector $\boldsymbol{f}$ back into the scalar $f$ and finally obtain the chain rule for a scalar through multiple vectors:

$\frac{\partial f}{\partial\boldsymbol{x}} = \left( \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} \right)^{T} = \left( {\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}} \right)^{T} = \left( {\left( \frac{\partial f}{\partial\boldsymbol{y}} \right)^{T}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}} \right)^{T} = \left( \frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} \right)^{T}\frac{\partial f}{\partial\boldsymbol{y}}$

Generalizing, if $\left.\boldsymbol{y}_{1}\rightarrow\boldsymbol{y}_{2}\rightarrow\boldsymbol{y}_{3}\rightarrow\ldots\rightarrow\boldsymbol{y}_{\boldsymbol{n}}\rightarrow z \right.$ (a scalar), then:

$\frac{\partial z}{\partial\boldsymbol{y}_{1}} = \left( {\frac{\partial\boldsymbol{y}_{\boldsymbol{n}}}{\partial\boldsymbol{y}_{\boldsymbol{n} - 1}}\frac{\partial\boldsymbol{y}_{\boldsymbol{n} - 1}}{\partial\boldsymbol{y}_{\boldsymbol{n} - 2}}\ldots\frac{\partial\boldsymbol{y}_{2}}{\partial\boldsymbol{y}_{1}}} \right)^{T}\frac{\partial z}{\partial\boldsymbol{y}_{\boldsymbol{n}}}$
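As a small illustration of this rule (with functions chosen ad hoc for the demo), take $\boldsymbol{y}_{1}\rightarrow\boldsymbol{y}_{2} = \boldsymbol{A}\boldsymbol{y}_{1}\rightarrow z = \boldsymbol{1}^{T}\exp\left( \boldsymbol{y}_{2} \right)$; the chain rule predicts $\frac{\partial z}{\partial\boldsymbol{y}_{1}} = \boldsymbol{A}^{T}\exp\left( \boldsymbol{A}\boldsymbol{y}_{1} \right)$, which the sketch below checks against finite differences.

```python
import numpy as np

A = np.array([[1.0, -2.0], [0.5, 3.0]])
y1 = np.array([0.3, 0.7])
eps = 1e-6

z = lambda v: np.sum(np.exp(A @ v))     # z = 1^T exp(A y1)

# finite-difference gradient with respect to y1
num_grad = np.array([(z(y1 + eps * np.eye(2)[j]) - z(y1 - eps * np.eye(2)[j])) / (2 * eps)
                     for j in range(2)])

# chain rule: (dy2/dy1)^T dz/dy2, with dy2/dy1 = A and dz/dy2 = exp(y2)
chain = A.T @ np.exp(A @ y1)
print(np.allclose(num_grad, chain))      # expect True
```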

 

1.8.3 Chain Rule for a Scalar Through Multiple Matrices (proof omitted)

Next, the chain rule for a scalar through multiple matrices. Suppose the dependency is $\left. \boldsymbol{X}\rightarrow\boldsymbol{Y}\rightarrow z \right.$; then:

$\frac{\partial z}{\partial x_{ij}} = {\sum\limits_{k,l}{\frac{\partial z}{\partial Y_{kl}}\frac{\partial Y_{kl}}{\partial X_{ij}} = tr\left( {\left( \frac{\partial z}{\partial Y} \right)^{T}\frac{\partial Y}{\partial X_{ij}}} \right)}}$

Notice that no chain rule is given here in terms of whole matrices. The main reason is that matrix-by-matrix derivatives have a rather involved definition, which we have not touched; we can therefore only give a chain rule for a single scalar entry of the matrix. This form is not practical, since we do not want to fall back on the element-by-element definition and then reassemble the results every time.

 

For practical purposes there is no need to dig deeper; it is enough to remember a few useful special cases, which are also easy to memorize.

These can be called "derivatives of a scalar through a linear map".

In summary (a numerical sketch of the two vector cases follows the list):

  • If $z = f\left( \boldsymbol{Y} \right),~\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{B}$, then $\frac{\partial z}{\partial\boldsymbol{X}} = \boldsymbol{A}^{T}\frac{\partial z}{\partial\boldsymbol{Y}}$
  • The conclusion still holds when $\boldsymbol{X}$ is replaced by a vector $\boldsymbol{x}$: if $z = f\left( \boldsymbol{y} \right),~\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}$, then $\frac{\partial z}{\partial\boldsymbol{x}} = \boldsymbol{A}^{T}\frac{\partial z}{\partial\boldsymbol{y}}$
  • When the matrix itself is the variable and it multiplies a vector on the left, i.e. $z = f\left( \boldsymbol{y} \right),~\boldsymbol{y} = \boldsymbol{X}\boldsymbol{a} + \boldsymbol{b}$, then $\frac{\partial z}{\partial\boldsymbol{X}} = \frac{\partial z}{\partial\boldsymbol{y}}\boldsymbol{a}^{T}$ (this third conclusion is the one used repeatedly below)
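Here is a hedged numerical check of the second and third cases above, taking $f\left( \boldsymbol{y} \right) = \frac{1}{2}\|\boldsymbol{y}\|^{2}$ (chosen only for this demo, so that $\frac{\partial z}{\partial\boldsymbol{y}} = \boldsymbol{y}$):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
a = np.array([1.0, -1.0])
b = np.array([0.5, 0.5])
x = np.array([0.2, 0.9])
X = np.array([[0.3, -0.4], [0.8, 0.1]])
eps = 1e-6

f = lambda y: 0.5 * np.sum(y ** 2)

# Case z = f(Ax + b): expect dz/dx = A^T (Ax + b)
zx = lambda v: f(A @ v + b)
num = np.array([(zx(x + eps * np.eye(2)[j]) - zx(x - eps * np.eye(2)[j])) / (2 * eps)
                for j in range(2)])
print(np.allclose(num, A.T @ (A @ x + b)))        # expect True

# Case z = f(Xa + b): expect dz/dX = (dz/dy) a^T = (Xa + b) a^T
zX = lambda M: f(M @ a + b)
numX = np.zeros_like(X)
for i in range(2):
    for j in range(2):
        E = np.zeros_like(X); E[i, j] = eps
        numX[i, j] = (zX(X + E) - zX(X - E)) / (2 * eps)
print(np.allclose(numX, np.outer(X @ a + b, a)))  # expect True
```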

 

1.9 Using Matrix Derivatives for Parameter Gradients in Machine Learning

The calculus of neural network derivatives is an important result in the history of the field; it even has its own name, the BP (backpropagation) algorithm. Many people find their first derivation of BP quite taxing, yet with matrix differentiation it is actually not complicated. To keep things simple, we derive BP for a two-layer neural network here; later chapters will systematically derive the parameter gradients of FNN, CNN, RNN, and LSTM.

 

Let us now apply everything above to find the gradient of a two-layer network's loss with respect to each layer's parameters. Take the classic MNIST handwritten-digit classification problem: the network takes as input the vector $\boldsymbol{x}$ obtained by flattening an image and produces a probability vector through a softmax output layer; with $\boldsymbol{y}$ denoting the label vector and cross-entropy as the loss, we get:

$l = - \boldsymbol{y}^{T}{\log{softmax\left( {\boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}} \right)}}$

where $\boldsymbol{x}$ is an n x 1 column vector, $\boldsymbol{W}_{1}$ is a p x n matrix, $\boldsymbol{W}_{2}$ is an m x p matrix, $\boldsymbol{y}$ is an m x 1 column vector, $l$ is a scalar, and $\sigma$ is the sigmoid function.

 

We now compute the gradients layer by layer, working backwards from the output:

Note that $softmax\left( \boldsymbol{x} \right) = \frac{exp\left( \boldsymbol{x} \right)}{ \boldsymbol{1}^{T}exp\left( \boldsymbol{x} \right)}$, where $exp\left( \boldsymbol{x} \right)$ is a column vector, $\boldsymbol{1}^{T}$ is a row vector of all ones, and $\boldsymbol{1}^{T}exp\left( \boldsymbol{x} \right)$ is a scalar. As a small example, if $\boldsymbol{x} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ 3 \\ \end{array} \right\rbrack$ then $softmax\left( \boldsymbol{x} \right) = \left\lbrack \begin{array}{l}\frac{exp\left( 1 \right)}{{\exp\left(1 \right)} + {\exp\left( 2 \right)} + {\exp\left( 3 \right)}} \\\frac{exp\left( 2 \right)}{{\exp\left(1 \right)} + {\exp\left( 2 \right)} + {\exp\left( 3 \right)}} \\\frac{exp\left( 3 \right)}{{\exp\left(1 \right)} + {\exp\left( 2 \right)} + {\exp\left( 3 \right)}} \\\end{array} \right\rbrack$

Let $\boldsymbol{a} = \boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}$; we first find $\frac{\partial l}{\partial\boldsymbol{a}}$.

$dl = - \boldsymbol{y}^{T}d\left( {logsoftmax\left( \boldsymbol{a} \right)} \right) = - \boldsymbol{y}^{T}d\left( {log\left( \frac{exp\left( \boldsymbol{a} \right)}{ \boldsymbol{1}^{T}exp\left( \boldsymbol{a} \right)} \right)} \right)$

 

Note that the element-wise log satisfies $log\left( {\boldsymbol{u}/c} \right) = log\left( \boldsymbol{u} \right) - \boldsymbol{1}log\left( c \right)$, where $\boldsymbol{u}$ and $\boldsymbol{1}$ are column vectors of the same shape and $c$ is a scalar.

 

Since $\boldsymbol{1}^{T}exp\left( \boldsymbol{a} \right)$ is a scalar, applying this rule gives:

$log\left( \frac{exp\left( \boldsymbol{a} \right)}{ \boldsymbol{1}^{T}exp\left( \boldsymbol{a} \right)} \right) = log\left( {exp\left( \boldsymbol{a} \right)} \right) -  \boldsymbol{1}log\left( { \boldsymbol{1}^{T}exp\left( \boldsymbol{a} \right)} \right)$

and therefore:

$dl = - \boldsymbol{y}^{T}d\left( {log\left( {\exp\left( \boldsymbol{a} \right)} \right) - \boldsymbol{1}log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = - \boldsymbol{y}^{T}d\left( {\boldsymbol{a} - \boldsymbol{1}log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = - \boldsymbol{y}^{T}d\boldsymbol{a} + d\left( {\boldsymbol{y}^{T}\boldsymbol{1}log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right)$

 

Since the elements of $\boldsymbol{y}$ sum to 1, $\boldsymbol{y}^{T}\boldsymbol{1} = 1$. We then get:

$dl = - \boldsymbol{y}^{T}d\boldsymbol{a} + d\left( {log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right)$

By the sixth differential property (element-wise functions):

$d\left( {log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = {log}^{'}\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right) \odot d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)$

Since ${log}^{'}\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)$ and $d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)$ are both scalars, we have:

 ${log}^{'}\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right) \odot d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right) = \frac{d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}}$

Continuing with the differential properties (linearity and the element-wise rule): $\frac{d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} = \frac{\boldsymbol{1}^{T}d\left( {\exp\left( \boldsymbol{a} \right)} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} = \frac{\boldsymbol{1}^{T}\left( {{\exp\left( \boldsymbol{a} \right)} \odot d\boldsymbol{a}} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}}$

 

Next we use the first trace property:

$d\left( {1^{T}{\exp\left( \boldsymbol{a} \right)}} \right) = tr\left( {d\left( {1^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = tr\left( {1^{T}\left( {{\exp\left( \boldsymbol{a} \right)} \odot d\boldsymbol{a}} \right)} \right)$

and the fifth trace property:

$tr\left( {1^{T}\left( {{\exp\left( \boldsymbol{a} \right)} \odot d\boldsymbol{a}} \right)} \right) = tr\left( \left( {\left( {1{{\odot \exp}\left( \boldsymbol{a} \right)}} \right)^{T}d\boldsymbol{a}} \right) \right) = tr\left( {\left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}d\boldsymbol{a}} \right)$

 

Then we apply the first trace property in reverse:

 $tr\left( {\left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}d\boldsymbol{a}} \right) = \left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}d\boldsymbol{a}$

So:

 $dl = - \boldsymbol{y}^{T}d\boldsymbol{a} + d\left( {log\left( {1^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = - \boldsymbol{y}^{T}d\boldsymbol{a} + \frac{\left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}d\boldsymbol{a}}{1^{T}{\exp\left( \boldsymbol{a} \right)}} = \left( {- \boldsymbol{y}^{T} + \frac{\left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}}{1^{T}{\exp\left( \boldsymbol{a} \right)}}} \right)d\boldsymbol{a}$

 

Comparing with equation 1.2, we obtain $\frac{\partial l}{\partial\boldsymbol{a}} = \frac{\exp\left( \boldsymbol{a} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} - \boldsymbol{y} = softmax\left( \boldsymbol{a} \right) - \boldsymbol{y}$
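This result can be spot-checked numerically. In the sketch below (logits and label are arbitrary choices of mine), the finite-difference gradient of $l = -\boldsymbol{y}^{T}\log softmax\left( \boldsymbol{a} \right)$ is compared with $softmax\left( \boldsymbol{a} \right) - \boldsymbol{y}$.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # shift for numerical stability
    return e / e.sum()

a = np.array([0.2, -1.0, 0.7])
y = np.array([0.0, 1.0, 0.0])        # one-hot label
eps = 1e-6

loss = lambda v: -y @ np.log(softmax(v))

num_grad = np.array([(loss(a + eps * np.eye(3)[j]) - loss(a - eps * np.eye(3)[j])) / (2 * eps)
                     for j in range(3)])
print(np.allclose(num_grad, softmax(a) - y))   # expect True
```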

 

Next we find $\frac{\partial l}{\partial\boldsymbol{W}_{2}}$:

 

We know that $l = - \boldsymbol{y}^{T}{\log{softmax\left( \boldsymbol{a} \right)}}$ and $\boldsymbol{a} = \boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}$.

 

Recalling the third conclusion of section 1.8.3, this has exactly the same form as $l = f\left( \boldsymbol{a} \right),\ \boldsymbol{a} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}$ with the derivative taken with respect to the matrix $\boldsymbol{A}$.

 

Applying that conclusion directly, we get $\frac{\partial l}{\partial\boldsymbol{W}_{2}} = \frac{\partial l}{\partial\boldsymbol{a}}\left( {\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right)^{T}$

 

And since $d\boldsymbol{a} = d\boldsymbol{b}_{2} = \boldsymbol{I}d\boldsymbol{b}_{2}$ (everything except $\boldsymbol{b}_{2}$ held fixed),

 

it is clear that $\frac{\partial l}{\partial\boldsymbol{b}_{2}} = \boldsymbol{I}^{T}\frac{\partial l}{\partial\boldsymbol{a}} = \frac{\partial l}{\partial\boldsymbol{a}}$. This gives the gradients of $\boldsymbol{W}_{2}$ and $\boldsymbol{b}_{2}$ in the second layer.

 

Next we want $\frac{\partial l}{\partial\boldsymbol{W}_{1}}$:

 

Let $\boldsymbol{z} = \boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}$; we first find $\frac{\partial l}{\partial\boldsymbol{z}}$.

 

By the sixth differential property (element-wise functions):

 $d\boldsymbol{a} = d\left( {\boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}} \right) = \boldsymbol{W}_{2}d\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) = \boldsymbol{W}_{2}\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) \odot d\boldsymbol{z}} \right)$

$\boldsymbol{W}_{2}\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) \odot d\boldsymbol{z}} \right) = \boldsymbol{W}_{2}diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right)d\boldsymbol{z}$

We obtain $\frac{\partial\boldsymbol{a}}{\partial\boldsymbol{z}} = \boldsymbol{W}_{2}diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right)$, and hence $\frac{\partial l}{\partial\boldsymbol{z}} = diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right){\boldsymbol{W}_{2}}^{T}\frac{\partial l}{\partial\boldsymbol{a}}$

 

Now, given $\frac{\partial l}{\partial\boldsymbol{z}}$ and $\boldsymbol{z} = \boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}$, we want $\frac{\partial l}{\partial\boldsymbol{W}_{1}}$.

Applying the third conclusion of section 1.8.3 once more, $\frac{\partial l}{\partial\boldsymbol{W}_{1}} = \frac{\partial l}{\partial\boldsymbol{z}}\boldsymbol{x}^{T}$,

and by the same argument as for $\boldsymbol{b}_{2}$, $\frac{\partial l}{\partial\boldsymbol{b}_{1}} = \left( \frac{\partial\boldsymbol{z}}{\partial\boldsymbol{b}_{1}} \right)^{T}\frac{\partial l}{\partial\boldsymbol{z}} = \frac{\partial l}{\partial\boldsymbol{z}}$

At this point we have obtained all the parameter gradients of the two-layer network, $\frac{\partial l}{\partial\boldsymbol{W}_{2}},\frac{\partial l}{\partial\boldsymbol{b}_{2}},~\frac{\partial l}{\partial\boldsymbol{W}_{1}},~\frac{\partial l}{\partial\boldsymbol{b}_{1}}$:

$\frac{\partial l}{\partial\boldsymbol{W}_{2}} = \frac{\partial l}{\partial\boldsymbol{a}}\left( {\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right)^{T}$

$\frac{\partial l}{\partial\boldsymbol{b}_{2}} = \frac{\partial l}{\partial\boldsymbol{a}}$

$\frac{\partial l}{\partial\boldsymbol{W}_{1}} = diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right){\boldsymbol{W}_{2}}^{T}\frac{\partial l}{\partial\boldsymbol{a}}\boldsymbol{x}^{T}$

$\frac{\partial l}{\partial\boldsymbol{b}_{1}} = diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right){\boldsymbol{W}_{2}}^{T}\frac{\partial l}{\partial\boldsymbol{a}}$

where $\frac{\partial l}{\partial\boldsymbol{a}} = softmax\left( \boldsymbol{a} \right) - \boldsymbol{y}$.

 

As we can see, to obtain the parameter gradients of a neural network we really only need the gradient at the output layer, $\frac{\partial l}{\partial\boldsymbol{a}}$; the gradients of the earlier (hidden-layer) parameters are just matrix operations applied to that output-layer gradient.
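In the spirit of this series' "code verification", here is a self-contained NumPy sketch that checks the four closed-form gradients above against finite differences. The sigmoid choice of $\sigma$ follows the setup above; the shapes, random data, and helper names are my own scaffolding for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 4, 3, 5
x = rng.normal(size=(n,))
y = np.eye(m)[2]                                   # one-hot label
W1, b1 = rng.normal(size=(p, n)), rng.normal(size=(p,))
W2, b2 = rng.normal(size=(m, p)), rng.normal(size=(m,))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def loss(W1, b1, W2, b2):
    a = W2 @ sigmoid(W1 @ x + b1) + b2
    return -y @ (a - np.log(np.sum(np.exp(a))))    # -y^T log softmax(a)

# closed-form gradients from the derivation above
h = sigmoid(W1 @ x + b1)
a = W2 @ h + b2
dl_da = np.exp(a) / np.sum(np.exp(a)) - y          # softmax(a) - y
dl_dW2 = np.outer(dl_da, h)                        # dl/da * sigma(W1 x + b1)^T
dl_db2 = dl_da
dl_dz = (h * (1 - h)) * (W2.T @ dl_da)             # diag(sigma') W2^T dl/da
dl_dW1 = np.outer(dl_dz, x)
dl_db1 = dl_dz

def num_grad(param_name):
    # central differences over every entry of the named parameter
    params = {"W1": W1.copy(), "b1": b1.copy(), "W2": W2.copy(), "b2": b2.copy()}
    g = np.zeros_like(params[param_name])
    eps = 1e-6
    for idx in np.ndindex(*g.shape):
        for sign in (+1, -1):
            params[param_name][idx] += sign * eps
            g[idx] += sign * loss(**params) / (2 * eps)
            params[param_name][idx] -= sign * eps
    return g

for name, g in [("W1", dl_dW1), ("b1", dl_db1), ("W2", dl_dW2), ("b2", dl_db2)]:
    print(name, np.allclose(num_grad(name), g, atol=1e-5))   # expect True for all
```

A deep-learning framework's automatic differentiation would serve the same purpose; plain finite differences simply keep the check framework-free.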

 

--------- Generalization ---------

The derivation above is for a single sample. In reality we have n samples $\left( {x_{1},y_{1}} \right),\left( {x_{2},y_{2}} \right),\ldots,\left( {x_{n},y_{n}} \right)$, so the loss function becomes

$l = {\sum\limits_{i = 1}^{n}{- {\boldsymbol{y}_{\boldsymbol{i}}}^{T}{\log{softmax\left( {\boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x}_{\boldsymbol{i}} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}} \right)}}}}$

But this loss is still a scalar, so the spirit of the derivation is unchanged: the summation can be moved to the outside, and each parameter's gradient is simply the sum of the gradients computed from the individual samples.

 


(Reposting is welcome; please credit the source. Comments and exchanges are also welcome: lxwalyw@gmail.com)

 

 

 

 

 

 
