The Matrix Calculus You Need For Deep Learning
Translated from: explained.ai
Original authors: Terence Parr and Jeremy Howard. Only the main sections are translated here; if you find problems with the translation, contact me at lih627@outlook.com
Abstract
This article explains the matrix calculus that appears when training deep neural networks. It is meant to help readers who already understand basic neural networks deepen their grasp of the underlying mathematics. A reference section at the end of the article summarizes the matrix calculus rules discussed here, and the theory can be discussed in the Theory category at forums.fast.ai.
Introduction
There is a large gap between machine learning papers and practical software such as PyTorch, because the latter hides most of the details behind built-in automatic differentiation. To understand modern training techniques and how they are implemented under the hood, you need matrix calculus, which combines linear algebra and multivariate calculus.
For example, a simple neural network unit first computes the dot product of a weight vector $\mathbf{w}$ and an input vector $\mathbf{x}$ and adds a scalar bias: $z(\mathbf{x})=\sum_{i}^{n}w_ix_i+b=\mathbf{w}\cdot\mathbf{x}+b$. This function is usually called an affine function. It is followed by a rectified linear unit, which clips negative values to 0: $\max(0, z(\mathbf{x}))$. This computation defines an "artificial neuron". A neural network consists of many such units; units are grouped into layers, the output of one layer becoming the input of the next, and the output of the last layer is the output of the network.
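As a quick illustration, here is a minimal NumPy sketch of this single-neuron computation; the weights, input, and bias are arbitrary values chosen only for the example:
```python
import numpy as np

def neuron(x, w, b):
    """z(x) = w·x + b followed by a ReLU: max(0, z)."""
    z = np.dot(w, x) + b        # affine function
    return max(0.0, z)          # rectified linear unit clips negatives to 0

w = np.array([0.5, -1.0, 2.0])  # arbitrary weights
x = np.array([1.0, 2.0, 3.0])   # arbitrary input vector
b = 0.5
print(neuron(x, w, b))          # max(0, 4.5 + 0.5) = 5.0
```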
Training a neural network means choosing weights $\mathbf{w}$ and bias $b$ so that the network produces the desired output for every input vector $\mathbf{x}$. To do this, we design a loss function that compares the network output $activation(\mathbf{x})$ with the expected output $target(\mathbf{x})$ over all inputs. The loss is minimized with a gradient descent method such as SGD or Adam, and all of these need the partial derivatives of $activation(\mathbf{x})$ with respect to the model parameters $\mathbf{w}$ and $b$: adjusting $\mathbf{w}$ and $b$ step by step makes the total loss smaller and smaller.
For example, a scalar version of the mean squared error loss is:
$$\frac{1}{N}\sum_{\mathbf{x}}\left(target(\mathbf{x}) - activation(\mathbf{x})\right)^2=\frac{1}{N}\sum_{\mathbf{x}}\left(target(\mathbf{x}) - \max\Big(0, \sum_{i=1}^{|\mathbf{x}|}w_ix_i+b\Big)\right)^2$$
Here $|\mathbf{x}|$ is the number of elements in the vector $\mathbf{x}$. Note that this is only a single neuron; a neural network must train all neurons in all layers simultaneously. With multiple inputs and multiple network outputs, we need rules for differentiating vectors with respect to vectors, and that is the purpose of this article.
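The scalar loss above translates almost literally into NumPy; the toy samples and targets below are made up purely for illustration:
```python
import numpy as np

def activation(x, w, b):
    return max(0.0, np.dot(w, x) + b)

def mse_loss(samples, targets, w, b):
    """1/N * sum over x of (target(x) - activation(x))^2 for a single neuron."""
    N = len(samples)
    return sum((t - activation(x, w, b)) ** 2
               for x, t in zip(samples, targets)) / N

samples = [np.array([1.0, 2.0]), np.array([-1.0, 0.5])]  # toy inputs
targets = [1.0, 0.0]                                     # toy targets
print(mse_loss(samples, targets, np.array([0.1, -0.3]), 0.2))
```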
Review: scalar derivative rules
The rules for differentiating scalar functions are:
| Rule | $f(x)$ | Derivative with respect to $x$ | Example |
|---|---|---|---|
| Constant | $c$ | $0$ | $\frac{d}{dx}99=0$ |
| Multiplication by a constant | $cf$ | $c\frac{df}{dx}$ | $\frac{d}{dx}3x=3$ |
| Power rule | $x^n$ | $nx^{n-1}$ | $\frac{d}{dx}x^3=3x^2$ |
| Sum rule | $f+g$ | $\frac{df}{dx}+\frac{dg}{dx}$ | $\frac{d}{dx}(x^2+3x)=2x+3$ |
| Product rule | $fg$ | $f\frac{dg}{dx}+g\frac{df}{dx}$ | $\frac{d}{dx}x^2x=x^2+x\cdot 2x=3x^2$ |
| Chain rule | $f(g(x))$ | $\frac{df(u)}{du}\frac{du}{dx},\ u=g(x)$ | $\frac{d}{dx}\ln(x^2)=\frac{1}{x^2}2x=\frac{2}{x}$ |
Vector calculus and partial derivatives
A neural network layer is not built from a single parameter and a single equation. Consider first a function of several parameters, e.g. $f(x,y) = 3x^2y$. Now the change in $f(x, y)$ depends on whether we change $x$ or $y$, which leads to partial derivatives. The partial derivative with respect to $x$ is written $\frac{\partial}{\partial x}3yx^2$; treating $y$ as a constant gives $\frac{\partial}{\partial x}3yx^2=3y\frac{\partial}{\partial x}x^2=6yx$.
More generally, we care about the derivative with respect to all parameters at once rather than one at a time. For $f(x, y)$ we need both $\frac{\partial}{\partial x}f(x, y)$ and $\frac{\partial}{\partial y}f(x, y)$, which can be collected into a horizontal vector. The derivative (gradient) of $f(x, y)$ is therefore defined as:
$$\nabla f(x, y) = \left[\frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y}\right] = \left[6yx, 3x^2\right]$$
So the gradient of a multivariate function is a vector of partial derivatives. Such functions map $n$ scalar parameters to a single scalar. The next section discusses how to handle the derivatives of several multivariate functions at once.
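Before moving on, a quick finite-difference check of the gradient just computed (the step size `h` is an arbitrary small number):
```python
def f(x, y):
    return 3 * x**2 * y

def numeric_grad(x, y, h=1e-6):
    """Central-difference approximation of [df/dx, df/dy]."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return [dfdx, dfdy]

x, y = 2.0, 3.0
print(numeric_grad(x, y))     # ~[36, 12]
print([6 * y * x, 3 * x**2])  # analytic gradient [6yx, 3x^2] = [36, 12]
```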
Matrix calculus
Moving from the derivative of one multivariate function to the derivatives of several multivariate functions means moving from vectors to matrices. Consider the partial derivatives of two functions, say $f(x, y) = 3x^2y$ and $g(x, y) = 2x + y^8$. We can compute their gradient vectors separately and stack them. The resulting matrix is called the Jacobian (Jacobian matrix):
$$J =\begin{bmatrix}\nabla f(x, y)\\ \nabla g(x, y)\end{bmatrix} = \begin{bmatrix} \frac{\partial f(x, y)}{\partial x}&\frac{\partial f(x, y)}{\partial y}\\ \frac{\partial g(x, y)}{\partial x}&\frac{\partial g(x, y)}{\partial y} \end{bmatrix}= \begin{bmatrix} 6yx & 3x^2\\2&8y^7 \end{bmatrix}$$
This form is the numerator layout; the alternative, the denominator layout, is its transpose.
Generating the Jacobian matrix
Step one: collect the scalar parameters into a vector, $f(x, y, z) \to f(\mathbf{x})$. By default a vector is a column vector of size $n\times 1$:
$$\mathbf{x}= \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}$$
When several scalar-valued functions are combined, we write $\mathbf{y}=\mathbf{f}(\mathbf{x})$, where $\mathbf{y}$ is a vector representing $m$ scalar-valued functions, each taking a vector of $n=|\mathbf{x}|$ elements as input. Written out:
$$\begin{aligned} y_1 &= f_1(\mathbf{x})\\ y_2 &= f_2(\mathbf{x})\\ &\vdots\\ y_m &=f_m(\mathbf{x}) \end{aligned}$$
For example, the functions from the previous section can be written with $x_1, x_2$ in place of $x, y$:
$$\begin{aligned} y_1 &= f_1(\mathbf{x}) =3x_1^2x_2\\ y_2 &= f_2(\mathbf{x}) = 2x_1 + x_2^8 \end{aligned}$$
In general the Jacobian matrix contains all $m\times n$ partial derivatives; it stacks the gradients of the $m$ scalar-valued functions with respect to $\mathbf{x}$:
$$\frac{\partial\mathbf{y}}{\partial\mathbf{x}}= \begin{bmatrix} \nabla f_1(\mathbf{x})\\ \nabla f_2(\mathbf{x})\\ \vdots \\ \nabla f_m(\mathbf{x}) \end{bmatrix}= \begin{bmatrix} \frac{\partial}{\partial\mathbf{x}}f_1(\mathbf{x})\\ \frac{\partial}{\partial\mathbf{x}}f_2(\mathbf{x})\\ \vdots\\ \frac{\partial}{\partial\mathbf{x}}f_m(\mathbf{x}) \end{bmatrix}= \begin{bmatrix} \frac{\partial}{\partial x_1}f_1(\mathbf{x}) & \frac{\partial}{\partial x_2}f_1(\mathbf{x}) &\cdots &\frac{\partial}{\partial x_n}f_1(\mathbf{x})\\ \frac{\partial}{\partial x_1}f_2(\mathbf{x}) & \frac{\partial}{\partial x_2}f_2(\mathbf{x}) &\cdots & \frac{\partial}{\partial x_n} f_2(\mathbf{x})\\ \vdots &\vdots&&\vdots\\ \frac{\partial}{\partial x_1}f_m(\mathbf{x}) &\frac{\partial}{\partial x_2}f_m(\mathbf{x}) &\cdots &\frac{\partial}{\partial x_n}f_m(\mathbf{x}) \end{bmatrix}$$
Each $\frac{\partial}{\partial \mathbf{x}}f_i(\mathbf{x})$ is a horizontal vector with $n=|\mathbf{x}|$ elements.
Now consider the identity function $\mathbf{f}(\mathbf{x}) = \mathbf{x}$, i.e. $f_i(\mathbf{x})=x_i$. It consists of $n$ functions, each with $n$ parameters, so its Jacobian is a square matrix ($m=n$):
$$\begin{aligned} \frac{\partial\mathbf{y}}{\partial\mathbf{x}}= \begin{bmatrix} \frac{\partial}{\partial\mathbf{x}}f_1(\mathbf{x})\\ \frac{\partial}{\partial\mathbf{x}}f_2(\mathbf{x})\\ \vdots\\ \frac{\partial}{\partial\mathbf{x}}f_m(\mathbf{x}) \end{bmatrix} &= \begin{bmatrix} \frac{\partial}{\partial x_1}f_1(\mathbf{x}) & \frac{\partial}{\partial x_2}f_1(\mathbf{x}) &\cdots &\frac{\partial}{\partial x_n}f_1(\mathbf{x})\\ \frac{\partial}{\partial x_1}f_2(\mathbf{x}) & \frac{\partial}{\partial x_2}f_2(\mathbf{x}) &\cdots & \frac{\partial}{\partial x_n} f_2(\mathbf{x})\\ \vdots &\vdots&&\vdots\\ \frac{\partial}{\partial x_1}f_m(\mathbf{x}) &\frac{\partial}{\partial x_2}f_m(\mathbf{x}) &\cdots &\frac{\partial}{\partial x_n}f_m(\mathbf{x}) \end{bmatrix}\\ &= \begin{bmatrix} \frac{\partial}{\partial x_1}x_1 & \frac{\partial}{\partial x_2}x_1 &\cdots &\frac{\partial}{\partial x_n}x_1\\ \frac{\partial}{\partial x_1}x_2 & \frac{\partial}{\partial x_2}x_2 &\cdots & \frac{\partial}{\partial x_n} x_2 \\ \vdots &\vdots&&\vdots\\ \frac{\partial}{\partial x_1}x_n &\frac{\partial}{\partial x_2}x_n &\cdots &\frac{\partial}{\partial x_n}x_n \end{bmatrix}\\ &=I \end{aligned}$$
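A small sketch that builds a Jacobian numerically, one column of partials at a time, and checks both cases above (the pair $3x_1^2x_2$, $2x_1+x_2^8$ from this section, plus the identity function):
```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """m x n Jacobian of f: R^n -> R^m via central differences."""
    m = len(f(x))
    J = np.zeros((m, len(x)))
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

f = lambda x: np.array([3 * x[0]**2 * x[1], 2 * x[0] + x[1]**8])
x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))            # ~[[12, 3], [2, 1024]]
print(numerical_jacobian(lambda v: v, x))  # identity function -> I
```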
Derivatives of element-wise binary operators on vectors
Many complicated vector computations can be broken down into combinations of element-wise binary vector operations of the form $\mathbf{y}=\mathbf{f(w)}\bigcirc\mathbf{g(x)}$, where $m = n=|\mathbf{y}|=|\mathbf{w}|=|\mathbf{x}|$, for example:
$$\begin{bmatrix} y_1\\ y_2\\ \vdots\\y_n \end{bmatrix}= \begin{bmatrix} f_1(\mathbf{w})\bigcirc g_1(\mathbf{x})\\ f_2(\mathbf{w})\bigcirc g_2(\mathbf{x})\\ \vdots\\ f_n(\mathbf{w})\bigcirc g_n(\mathbf{x}) \end{bmatrix}$$
Consider the Jacobian with respect to $\mathbf{w}$:
$$J_\mathbf{w}=\frac{\partial\mathbf{y}}{\partial\mathbf{w}}= \begin{bmatrix} \frac{\partial}{\partial w_1}\left(f_1(\mathbf{w})\bigcirc g_1(\mathbf{x})\right) &\frac{\partial}{\partial w_2}\left(f_1(\mathbf{w})\bigcirc g_1(\mathbf{x})\right) &\cdots &\frac{\partial}{\partial w_n}\left(f_1(\mathbf{w})\bigcirc g_1(\mathbf{x})\right)\\ \frac{\partial}{\partial w_1}\left(f_2(\mathbf{w})\bigcirc g_2(\mathbf{x})\right) &\frac{\partial}{\partial w_2}\left(f_2(\mathbf{w})\bigcirc g_2(\mathbf{x})\right) &\cdots &\frac{\partial}{\partial w_n}\left(f_2(\mathbf{w})\bigcirc g_2(\mathbf{x})\right)\\ \vdots & \vdots &&\vdots\\ \frac{\partial}{\partial w_1}\left(f_n(\mathbf{w})\bigcirc g_n(\mathbf{x})\right) &\frac{\partial}{\partial w_2}\left(f_n(\mathbf{w})\bigcirc g_n(\mathbf{x})\right) &\cdots &\frac{\partial}{\partial w_n}\left(f_n(\mathbf{w})\bigcirc g_n(\mathbf{x})\right)\\ \end{bmatrix}$$
This looks unwieldy, but in the common case the Jacobian is diagonal: $\frac{\partial}{\partial w_j}\left(f_i(\mathbf{w})\bigcirc g_i(\mathbf{x})\right) = 0$ for $i\neq j$, so all off-diagonal elements are zero. Each $f_i$ is then a function of $w_i$ only, the element-wise binary expression simplifies to $f_i(w_i)\bigcirc g_i(x_i)$, and the Jacobian can be written as:
$$\frac{\partial \mathbf{y}}{\partial\mathbf{w}}=diag\left(\frac{\partial}{\partial w_1}(f_1(w_1)\bigcirc g_1(x_1)), \frac{\partial}{\partial w_2}(f_2(w_2)\bigcirc g_2(x_2)), \cdots, \frac{\partial}{\partial w_n}(f_n(w_n)\bigcirc g_n(x_n))\right)$$
The corresponding partial derivatives can be summarized as:
| Op | Partial with respect to $\mathbf{w}$ |
|---|---|
| $+$ | $\frac{\partial(\mathbf{w} + \mathbf{x})}{\partial\mathbf{w}}=diag(\cdots\frac{\partial(w_i + x_i)}{\partial w_i}\cdots)=I$ |
| $-$ | $\frac{\partial(\mathbf{w} - \mathbf{x})}{\partial\mathbf{w}}=diag(\cdots\frac{\partial(w_i - x_i)}{\partial w_i}\cdots)=I$ |
| $\otimes$ | $\frac{\partial(\mathbf{w}\otimes\mathbf{x})}{\partial\mathbf{w}}=diag\left(\cdots\frac{\partial(w_i\times x_i)}{\partial w_i}\cdots\right)=diag(\mathbf{x})$ |
| $\oslash$ | $\frac{\partial(\mathbf{w}\oslash\mathbf{x})}{\partial\mathbf{w}}=diag\left(\cdots\frac{\partial(w_i/x_i)}{\partial w_i}\cdots\right)=diag(\cdots\frac{1}{x_i}\cdots)$ |
And the partial derivatives with respect to $\mathbf{x}$:
| Op | Partial with respect to $\mathbf{x}$ |
|---|---|
| $+$ | $\frac{\partial(\mathbf{w+x})}{\partial\mathbf{x}}=I$ |
| $-$ | $\frac{\partial(\mathbf{w-x})}{\partial\mathbf{x}}=-I$ |
| $\otimes$ | $\frac{\partial(\mathbf{w\otimes x})}{\partial\mathbf{x}}=diag(\mathbf{w})$ |
| $\oslash$ | $\frac{\partial(\mathbf{w\oslash x})}{\partial\mathbf{x}}=diag\left(\cdots\frac{-w_i}{x_i^2}\cdots\right)$ |
Here $\otimes$ and $\oslash$ denote element-wise multiplication and division.
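Assuming PyTorch is available, `torch.autograd.functional.jacobian` can confirm the $\otimes$ row of the first table, i.e. that $\partial(\mathbf{w}\otimes\mathbf{x})/\partial\mathbf{w}=diag(\mathbf{x})$; the numbers are arbitrary:
```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([2.0, 3.0, 4.0])
w = torch.tensor([0.5, -1.0, 1.5])

# y = w (element-wise *) x, differentiated with respect to w only
J = jacobian(lambda w_: w_ * x, w)
print(J)               # [[2,0,0],[0,3,0],[0,0,4]]
print(torch.diag(x))   # diag(x), matching the table entry
```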
Derivatives involving scalar operations
When an operation such as addition or multiplication combines a scalar with a vector, we can expand the scalar into a vector and express it as an element-wise binary operation between two vectors. For example, adding a scalar $z$ to a vector $\mathbf{x}$ gives $\mathbf{y} = \mathbf{x} + z=\mathbf{f}(\mathbf{x}) + \mathbf{g}(z)$ with $f(\mathbf{x})=\mathbf{x}$ and $g(z) = \vec{1}z$; similarly $\mathbf{y}=\mathbf{x}z=\mathbf{x}\otimes\vec{1}z$. The derivatives then follow from the previous section:
$$\frac{\partial\mathbf{y}}{\partial\mathbf{x}}=diag\left(\cdots \frac{\partial}{\partial x_i}(f_i(x_i)\bigcirc g_i(z))\cdots\right)$$
In particular:
$$\frac{\partial}{\partial\mathbf{x}}(\mathbf{x} + z) = diag(\vec{1})= I\qquad \frac{\partial}{\partial z}(\mathbf{x} + z) = \vec{1}$$
$$\frac{\partial}{\partial\mathbf{x}}(\mathbf{x}z)=diag(\vec{1}z) = Iz\qquad \frac{\partial}{\partial z}(\mathbf{x}z)= \mathbf{x}$$
The last equation can be derived element by element ($\mathbf{x}$ is a column vector):
$$\frac{\partial}{\partial z}(f_i(x_i)\otimes g_i(z) ) = x_i\frac{\partial z}{\partial z} + z\frac{\partial x_i}{\partial z} = x_i + 0 = x_i$$
- Differentiating a vector with respect to a vector yields a matrix.
- Differentiating a vector with respect to a scalar yields a vector.
Vector sum reduction
Deep learning frequently sums the elements of a vector, for example in the network's loss function; the sum collapses a vector to a scalar, much like a dot product or other reduction operations.
Let $y=\sum(\mathbf{f}(\mathbf{x})) = \sum_{i=1}^n f_i(\mathbf{x})$, and note that every $f_i$ takes the whole vector $\mathbf{x}$ as its argument. The Jacobian is a $1\times n$ row vector:
$$\begin{aligned} \frac{\partial y}{\partial\mathbf{x}}&= \begin{bmatrix} \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \cdots,\frac{\partial y}{\partial x_n} \end{bmatrix}\\ &= \begin{bmatrix} \frac{\partial}{\partial x_1}\sum_i f_i(\mathbf{x}), \frac{\partial}{\partial x_2}\sum_if_i(\mathbf{x}),\cdots,\frac{\partial}{\partial x_n}\sum_if_i(\mathbf{x}) \end{bmatrix}\\ &= \begin{bmatrix} \sum_i\frac{\partial f_i(\mathbf{x})}{\partial x_1}, \sum_i\frac{\partial f_i(\mathbf{x})}{\partial x_2},\cdots,\sum_i\frac{\partial f_i(\mathbf{x})}{\partial x_n} \end{bmatrix} \quad (\text{move the derivative inside } \textstyle\sum) \end{aligned}$$
Consider the simplest case, $y = sum(\mathbf{x})$, where $f_i(\mathbf{x})=x_i$:
$$\nabla y = \begin{bmatrix} \sum_i\frac{\partial x_i}{\partial x_1},\sum_i\frac{\partial x_i}{\partial x_2},\cdots,\sum_i\frac{\partial x_i}{\partial x_n} \end{bmatrix} = [1, 1,\cdots,1] = \vec{1}^T$$
The result is a row vector of all ones.
Consider another case, $y= sum(\mathbf{x}z)$ with $f_i(\mathbf{x}, z)=x_iz$. The gradient with respect to $\mathbf{x}$ is
$$\begin{aligned} \frac{\partial y}{\partial \mathbf{x}} &= \begin{bmatrix} \sum_i\frac{\partial}{\partial x_1}x_iz,\sum_i\frac{\partial}{\partial x_2}x_iz, \cdots, \sum_i\frac{\partial}{\partial x_n}x_iz \end{bmatrix}\\ &= \begin{bmatrix} z, z,\cdots, z \end{bmatrix} \end{aligned}$$
Now consider the gradient with respect to the scalar $z$; the result is a $1\times 1$ scalar:
$$\begin{aligned} \frac{\partial y}{\partial z} &= \frac{\partial}{\partial z}\sum_{i=1}^n x_iz\\ &= \sum_i\frac{\partial}{\partial z}x_i z\\ &= \sum_ix_i\\ &=sum(\mathbf{x}) \end{aligned}$$
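Both sum-reduction gradients are easy to confirm with PyTorch autograd (assuming PyTorch; the numbers are arbitrary):
```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)

y = torch.sum(x * z)   # y = sum(x z)
y.backward()

print(x.grad)          # dy/dx = [z, ..., z] = [4, 4, 4]
print(z.grad)          # dy/dz = sum(x) = 6
```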
The chain rules
As shown above, complex functions can be built from basic vector operations, but we usually cannot differentiate a nested expression such as $sum(\mathbf{w + x})$ directly with the rules so far (unless we expand everything into scalars). Instead, we combine the basic matrix differentiation rules using the chain rule. This section first explains the single-variable chain rule (a scalar function of a scalar), then generalizes it to the total derivative and uses that to define the single-variable total-derivative chain rule, which is used extensively in neural networks.
Single-variable chain rule
The chain rule is a divide-and-conquer strategy: decompose a complex expression into sub-expressions whose derivatives are easier to compute. For example, to find $\frac{d}{dx}\sin(x^2)=2x\cos(x^2)$, the outer $\sin$ uses the result of the inner expression: $\frac{d}{dx}x^2 = 2x$ and $\frac{d}{du}\sin(u)=\cos(u)$, so the derivative of the outer function is chained with the derivative of the inner function. In general a composite function is written $y=f(g(x))$ or $(f\circ g)(x)$; with $y = f(u)$ and $u= g(x)$, the chain rule is
$$\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$$
- Introduce an intermediate variable to turn the derivative of a complex function into the derivatives of two simpler functions
- Compute the derivatives of the two simpler functions
- Multiply the two derivatives
- Substitute the intermediate variable back
The chain rule can also be visualized as a data-flow diagram or an abstract syntax tree.
As the figure shows, a change in the parameter $x$ reaches $y$ through the squaring and sine operations. $\frac{du}{dx}$ can be read as how a change in $x$ propagates to $u$, so the chain rule can be written $\frac{dy}{dx}=\frac{du}{dx}\frac{dy}{du}$ (from $x$ to $y$).
When does the single-variable chain rule apply? In the figure above there is only one data-flow path from $x$ to $y$, so a change in $x$ can affect $y$ along only one path. If instead the expression is $y(x) = x + x^2$, written as $y(x, u) = x + u$, the data-flow graph of $y(x, u)$ has multiple paths and we must use the single-variable total-derivative chain rule instead. First, work through $y = f(x)=\ln(\sin(x^3)^2)$ with the basic rule:
- Introduce intermediate variables:
$$\begin{aligned} u_1 &= f_1(x) = x^3\\ u_2 &= f_2(u_1)= \sin(u_1)\\ u_3 &= f_3(u_2) = u_2^2\\ u_4 &= f_4(u_3) =\ln(u_3)\quad(y = u_4) \end{aligned}$$
- Compute the derivatives:
$$\begin{aligned} \frac{d}{dx}u_1 &= 3 x^2\\ \frac{d}{du_1}u_2&= \cos(u_1)\\ \frac{d}{du_2}u_3 &= 2u_2\\ \frac{d}{du_3}u_4 &= \frac{1}{u_3} \end{aligned}$$
- Combine the four intermediate results:
$$\frac{dy}{dx} = \frac{du_4}{dx} = \frac{1}{u_3}2u_2\cos(u_1)3x^2 = \frac{6u_2x^2\cos(u_1)}{u_3}$$
- Substitute the intermediate variables back:
$$\frac{dy}{dx} = \frac{6\sin(u_1)x^2\cos(x^3)}{u_2^2} = \frac{6\sin(x^3)x^2\cos(x^3)}{\sin(x^3)^2} = \frac{6x^2\cos(x^3)}{\sin(x^3)}$$
The figure below visualizes this chain of computations (a single path from $x$ to $y$).
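Before moving on, a SymPy sanity check of the result just derived (purely illustrative, assuming SymPy is installed):
```python
import sympy as sp

x = sp.symbols('x', positive=True)
y = sp.ln(sp.sin(x**3) ** 2)

dydx = sp.simplify(sp.diff(y, x))
expected = 6 * x**2 * sp.cos(x**3) / sp.sin(x**3)
print(dydx)                          # equivalent to 6*x**2*cos(x**3)/sin(x**3)
print(sp.simplify(dydx - expected))  # 0 -> the two expressions agree
```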
Single-variable total-derivative chain rule
The single-variable chain rule has limited applicability because every intermediate variable must be a function of a single variable, but it demonstrates the core idea. To differentiate $y=f(x)=x + x^2$ with a chain rule, the basic rule needs to be strengthened.
Of course we can differentiate directly, $\frac{dy}{dx}=\frac{d}{dx}x + \frac{d}{dx}x^2 = 1 + 2x$, but that uses the sum rule rather than the chain rule. Let us try the chain rule anyway with the intermediate variables
$$\begin{aligned} u_1(x) &= x^2\\ u_2(x, u_1) &=x + u _1\quad(y=f(x)=u_2(x, u_1)) \end{aligned}$$
If we naively take $\frac{du_2}{du_1} = 0 + 1 = 1$ and $\frac{du_1}{dx}=2x$, then $\frac{dy}{dx} = \frac{du_2}{dx}=\frac{du_2}{du_1}\frac{du_1}{dx} = 2x$, which does not match the correct answer. The problem is that $u_2(x, u_1) = x + u_1$ has more than one parameter, so partial derivatives come into play. Let us try:
$$\begin{aligned} \frac{\partial u_1(x)}{\partial x} &= 2x\\ \frac{\partial u_2(x,u_1)}{\partial u_1} &= \frac{\partial}{\partial u_1}(x + u_1) = 0 + 1= 1\\ \frac{\partial u_2(x, u_1)}{\partial x}&\neq \frac{\partial}{\partial x}(x + u_1) = 1 + 0 = 1 \end{aligned}$$
The problem is $\frac{\partial u_2(x, u_1)}{\partial x}$, because $u_1$ itself contains $x$: when computing this partial derivative we cannot treat $u_1$ as a constant. This can be seen from the computation graph.
A change in $x$ affects $y$ both through the addition and through the squaring operation. The following expression shows how $x$ influences $y$:
$$\hat{y} = (x +\Delta x) + (x +\Delta x)^2$$
With $\Delta y = \hat{y} - y$, we need the total derivative, which assumes that every intermediate variable may contain $x$ and may change when $x$ changes:
$$\frac{dy}{dx}=\frac{\partial f(x)}{\partial x} = \frac{\partial u_2(x, u_1)}{\partial x} = \frac{\partial u_2}{\partial x}\frac{\partial x}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}$$
Substituting into the formula:
$$\frac{dy}{dx} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 1\times2x = 1 + 2x$$
The single-variable total-derivative chain rule can be summarized as:
$$\frac{\partial f(x, u_1,\cdots,u_n)}{\partial x}=\frac{\partial f}{\partial x} + \sum_i^n\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
As an example, take $f(x) = \sin(x + x^2)$ with the intermediate variables
$$\begin{aligned} u_1(x) &= x^2\\ u_2(x, u_1) &= x + u_1\\ u_3(u_2) &= \sin(u_2) \end{aligned}$$
The corresponding partial derivatives are
$$\begin{aligned} \frac{\partial u_1}{\partial x} &= 2x\\ \frac{\partial u_2}{\partial x} &=\frac{\partial x}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}= 1 + 2x\\ \frac{\partial f(x)}{\partial x} &= \frac{\partial u_3}{\partial x} +\frac{\partial u_3}{\partial u_2}\frac{\partial u_2}{\partial x} = 0 + \cos(u_2)\frac{\partial u_2}{\partial x} = \cos(x + x^2)(1+2x) \end{aligned}$$
The same rule applied to $f(x) = x^3$:
$$\begin{aligned} u_1(x) &= x^2\\ u_2(x, u_1) &= xu_1\\ \frac{\partial u_1}{\partial x} &= 2x\\ \frac{\partial u_2}{\partial x} &= u_1 + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = x^2 + x\times 2x = 3x^2 \end{aligned}$$
With more intermediate variables the differentiation splits into simpler sub-problems. Introducing $u_{n+1} = x$ makes the chain rule even more uniform:
$$\frac{\partial f(u_1,\cdots, u_{n + 1})}{\partial x} = \sum_{i=1}^{n + 1}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
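A quick numerical check of the result for $f(x)=\sin(x+x^2)$, comparing a finite difference against $\cos(x+x^2)(1+2x)$ (the evaluation point is arbitrary):
```python
import math

def f(x):
    return math.sin(x + x**2)

x, h = 0.7, 1e-6
numeric  = (f(x + h) - f(x - h)) / (2 * h)     # finite-difference derivative
analytic = math.cos(x + x**2) * (1 + 2 * x)    # single-variable total-derivative chain rule
print(numeric, analytic)                        # the two values agree
```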
Vector chain rule
Now extend the scalar expressions to vectors, $\mathbf{y} = \mathbf{f}(x)$, for example
$$\begin{bmatrix} y_1(x)\\ y_2(x)\\ \end{bmatrix}= \begin{bmatrix} f_1(x)\\ f_2(x) \end{bmatrix}= \begin{bmatrix} \ln(x^2)\\ \sin(3x) \end{bmatrix}$$
First introduce two intermediate variables $g_1$ and $g_2$:
$$\begin{aligned} \begin{bmatrix} g_1(x)\\ g_2(x) \end{bmatrix} &= \begin{bmatrix} x^2\\ 3x \end{bmatrix}\\ \begin{bmatrix} f_1(\mathbf{g})\\ f_2(\mathbf{g}) \end{bmatrix} &= \begin{bmatrix} \ln(g_1)\\ \sin(g_2) \end{bmatrix} \end{aligned}$$
The derivative of $\mathbf{y}$ with respect to the scalar $x$ is then a column vector that can be computed with the single-variable total-derivative chain rule:
$$\frac{\partial\mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial f_1(\mathbf{g})}{\partial x}\\ \frac{\partial f_2(\mathbf{g})}{\partial x} \end{bmatrix}= \begin{bmatrix} \frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x}\\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x} \end{bmatrix} = \begin{bmatrix} \frac{1}{g_1}2x+0\\ 0 + \cos(g_2)\cdot 3 \end{bmatrix}= \begin{bmatrix} \frac{2}{x}\\ 3\cos(3x) \end{bmatrix}$$
This shows that we can differentiate with the scalar rules and assemble the results into a vector. More generally, the pattern becomes visible when the expression is factored:
$$\frac{\partial}{\partial x}\mathbf{f}(g(x))= \begin{bmatrix} \frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x}\\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2}\\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x}\\ \frac{\partial g_2}{\partial x} \end{bmatrix}= \frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial x}$$
So the Jacobian can be obtained as the product of two Jacobians. More generally, when the input $\mathbf{x}$ is itself a vector, the second factor simply becomes a matrix:
$$\frac{\partial}{\partial \mathbf{x}}\mathbf{f(g(x))} = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial\mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} &\frac{\partial f_1}{\partial g_2} &\cdots &\frac{\partial f_1}{\partial g_k}\\ \frac{\partial f_2}{\partial g_1} &\frac{\partial f_2}{\partial g_2} &\cdots &\frac{\partial f_2}{\partial g_k}\\ \vdots &\vdots &&\vdots\\ \frac{\partial f_m}{\partial g_1}&\frac{\partial f_m}{\partial g_2} &\cdots&\frac{\partial f_m}{\partial g_k} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x_1} &\frac{\partial g_1}{\partial x_2}&\cdots&\frac{\partial g_1}{\partial x_n}\\ \frac{\partial g_2}{\partial x_1} &\frac{\partial g_2}{\partial x_2}&\cdots&\frac{\partial g_2}{\partial x_n}\\ \vdots&\vdots&&\vdots\\ \frac{\partial g_k}{\partial x_1}& \frac{\partial g_k}{\partial x_2} &\cdots &\frac{\partial g_k}{\partial x_n} \end{bmatrix}$$
A nice consequence of extending the single-variable chain rule to vector form is that the same formula also expresses the total derivative. In the expression above $m = |\mathbf{f}|$, $n = |\mathbf{x}|$, $k = |\mathbf{g}|$, and the resulting Jacobian is an $m\times n$ matrix.
Even with the formula $\frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$, further simplification is often possible: the Jacobians involved are frequently square with zeros off the diagonal, because neural networks mostly deal with functions of vectors rather than general vectors of multivariate functions, such as the affine function $sum(\mathbf{w}\otimes\mathbf{x})$ and the activation function $\max(0, \mathbf{x})$, whose derivatives are worked out in the next section. The figure below shows the possible Jacobian shapes (the rectangles represent scalars, row vectors, column vectors, and matrices).
![Possible shapes of the Jacobians in the chain rule](https://raw.githubusercontent.com/lih627/MyPicGo/master/imgs/20201001001119.png)
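The worked example at the start of this section can be checked numerically: the derivative of $\mathbf{f}(\mathbf{g}(x))=[\ln(x^2), \sin(3x)]^T$ is the product of the two Jacobians (a small sketch with an arbitrary evaluation point):
```python
import numpy as np

x = 1.3
g = np.array([x**2, 3 * x])                      # intermediate vector g(x)

df_dg = np.array([[1 / g[0], 0.0],               # Jacobian of f with respect to g
                  [0.0, np.cos(g[1])]])
dg_dx = np.array([[2 * x],                       # Jacobian of g with respect to x
                  [3.0]])

print(df_dg @ dg_dx)                             # product of Jacobians
print(np.array([[2 / x], [3 * np.cos(3 * x)]]))  # [2/x, 3 cos(3x)] from the text
```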
The gradient of neuron activation
We can now compute the derivatives of a neural network's activation with respect to the parameters $\mathbf{w}$ and $b$ (note that $\mathbf{x}$ and $\mathbf{w}$ are both column vectors):
$$activation(\mathbf{x})=\max(0, \mathbf{w}\cdot\mathbf{x} + b)$$
This expression is a fully connected layer followed by a rectified linear unit as the activation function. Ignore the $\max$ for a moment and compute $\frac{\partial}{\partial\mathbf{w}}(\mathbf{w\cdot x} + b)$ and $\frac{\partial}{\partial b}(\mathbf{w\cdot x} + b)$. Start with $\mathbf{w\cdot x}$, which is just the sum of the element-wise products, $\sum_{i}^n(w_ix_i)=sum(\mathbf{w\otimes x})$, or in linear algebra notation $\mathbf{w\cdot x} = \mathbf{w}^T\mathbf{x}$. The partial derivatives of $sum(\mathbf{x})$ and $\mathbf{w\otimes x}$ were derived in the sections above, so here we apply the chain rule with
$$\begin{aligned} \mathbf{u} &= \mathbf{w\otimes x}\\ y &= sum(\mathbf{u}) \end{aligned}$$
Compute the partial derivatives:
$$\begin{aligned} \frac{\partial\mathbf{u}}{\partial \mathbf{w}} &= \frac{\partial}{\partial \mathbf{w}}(\mathbf{w\otimes x}) = diag(\mathbf{x})\\ \frac{\partial y}{\partial\mathbf{u}} &= \frac{\partial}{\partial \mathbf{u}}sum(\mathbf{u}) =\vec{1}^T \end{aligned}$$
The chain rule then gives:
$$\frac{\partial y}{\partial \mathbf{w}} = \frac{\partial y}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial \mathbf{w}} = \vec{1}^T diag(\mathbf{x}) = \mathbf{x}^T$$
That is:
$$\frac{\partial y}{\partial \mathbf{w}} = [x_1, \cdots, x_n] = \mathbf{x}^T$$
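Assuming PyTorch, a two-line autograd check that the gradient of $y=sum(\mathbf{w}\otimes\mathbf{x})$ with respect to $\mathbf{w}$ is indeed $\mathbf{x}^T$ (arbitrary numbers):
```python
import torch

x = torch.tensor([2.0, -1.0, 3.0])
w = torch.tensor([0.1, 0.2, 0.3], requires_grad=True)

y = torch.sum(w * x)   # y = sum(w ⊗ x) = w · x
y.backward()
print(w.grad)          # equals x: [2, -1, 3]
```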
Now consider $y=\mathbf{w\cdot x} + b$. Two partial derivatives are needed, and no chain rule is required:
$$\begin{aligned} \frac{\partial y}{\partial \mathbf{w}} &= \frac{\partial}{\partial\mathbf{w}}\mathbf{w\cdot x} + \frac{\partial}{\partial \mathbf{w}}b = \mathbf{x}^T + \vec{0}^T = \mathbf{x}^T\\ \frac{\partial y}{\partial b} &= \frac{\partial}{\partial b} \mathbf{w\cdot x} + \frac{\partial}{\partial b} b = 0 + 1 = 1 \end{aligned}$$
Next we need the derivative of the $\max(0, z)$ function, which is clearly
$$\frac{\partial}{\partial z}\max(0, z) = \begin{cases} 0 & z\le 0\\ \frac{dz}{dz} = 1 & z>0 \end{cases}$$
To compute the gradient with the activation function included, use the vector chain rule with
$$\begin{aligned} z(\mathbf{w}, b, \mathbf{x}) &= \mathbf{w\cdot x} + b\\ activation(z) &= \max(0, z) \end{aligned}$$
The chain rule reads:
$$\frac{\partial activation}{\partial \mathbf{w}} = \frac{\partial activation}{\partial z} \frac{\partial z}{\partial \mathbf{w}}$$
Substituting the expressions:
$$\frac{\partial activation}{\partial \mathbf{w}} = \begin{cases} 0\,\frac{\partial z}{\partial \mathbf{w}} = \vec{0}^T & \mathbf{w\cdot x} + b \le 0\\ 1\,\frac{\partial z}{\partial\mathbf{w}}=\mathbf{x}^T & \mathbf{w\cdot x} + b>0 \end{cases}$$
Similarly, for the bias:
$$\frac{\partial activation}{\partial b} = \begin{cases} 0\,\frac{\partial z}{\partial b} = 0 & \mathbf{w\cdot x} + b \le 0\\ 1\,\frac{\partial z}{\partial b} = 1 & \mathbf{w\cdot x} + b>0 \end{cases}$$
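A small sketch showing the two branches of this piecewise gradient via autograd (PyTorch assumed; the concrete numbers are arbitrary):
```python
import torch

def grad_activation_wrt_w(w, x, b):
    """Gradient of max(0, w·x + b) with respect to w, via autograd."""
    w = w.clone().requires_grad_(True)
    out = torch.clamp(torch.dot(w, x) + b, min=0)  # max(0, w·x + b)
    out.backward()
    return w.grad

x = torch.tensor([1.0, 2.0])
w = torch.tensor([0.5, 0.5])
print(grad_activation_wrt_w(w, x, b=1.0))    # w·x + b > 0  -> gradient is x
print(grad_activation_wrt_w(w, x, b=-5.0))   # w·x + b <= 0 -> gradient is 0
```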
Extension: broadcast functions
When $\max$ is used as a broadcasting function, its argument is a vector, and the scalar $\max$ is simply applied to every element of the vector:
$$\max(0,\mathbf{x}) = \begin{bmatrix} \max(0, x_1)\\ \max(0, x_2)\\ \vdots\\ \max(0, x_n) \end{bmatrix}$$
The gradient is then:
$$\frac{\partial}{\partial \mathbf{x}} \max(0, \mathbf{x}) = \begin{bmatrix} \frac{\partial}{\partial x_1}\max(0, x_1)\\ \frac{\partial}{\partial x_2}\max(0, x_2)\\ \vdots\\ \frac{\partial}{\partial x_n}\max(0, x_n) \end{bmatrix}$$
The gradient of the neural network loss function
The loss function produces a scalar. First set up notation: each sample and its label form a pair $(\mathbf{x}_i, target(\mathbf{x}_i))$, and the samples are collected into
$$X=[\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N]^T$$
where $N=|X|$. The labels form the vector
$$\mathbf{y} = [target(\mathbf{x}_1), target(\mathbf{x}_2), \cdots, target(\mathbf{x}_N)]^T$$
where each $y_i$ is a scalar. The loss function is defined as:
$$C(\mathbf{w}, b, X,\mathbf{y})= \frac{1}{N}\sum_{i = 1}^N(y_i - activation(\mathbf{x}_i))^2= \frac{1}{N}\sum_{i = 1}^N(y_i - \max(0, \mathbf{w\cdot x}_i + b))^2$$
Following the chain rule, introduce intermediate variables:
$$\begin{aligned} u(\mathbf{w}, b, \mathbf{x}) &= \max(0, \mathbf{w\cdot x} + b)\\ v(y, u) &= y - u\\ C(v) &= \frac{1}{N}\sum_{i = 1}^N v^2 \end{aligned}$$
The gradient with respect to the weights
From the previous sections we already know:
$$\frac{\partial}{\partial \mathbf{w}}u(\mathbf{w}, b, \mathbf{x}) = \begin{cases} \vec{0}^T & \mathbf{w\cdot x} + b\le0\\ \mathbf{x}^T & \mathbf{w\cdot x} + b> 0 \end{cases}$$
$$\frac{\partial v(y, u)}{\partial \mathbf{w}} = \frac{\partial}{\partial\mathbf{w}}(y - u) = \vec{0}^T - \frac{\partial u}{\partial \mathbf{w}}= -\frac{\partial u}{\partial\mathbf{w}} = \begin{cases} \vec{0}^T & \mathbf{w\cdot x} + b \le 0\\ -\mathbf{x}^T &\mathbf{w\cdot x} + b > 0 \end{cases}$$
The total gradient can then be computed as:
$$\begin{aligned} \frac{\partial C(v)}{\partial \mathbf{w}} &= \frac{\partial}{\partial \mathbf{w}}\frac{1}{N} \sum_{i= 1}^N v^2\\ &= \frac{1}{N}\sum_{i=1}^N\frac{\partial v^2}{\partial \mathbf{w}} \\&= \frac{1}{N}\sum_{i= 1}^N 2v\frac{\partial v}{\partial \mathbf{w}}\\ &= \frac{1}{N}\sum_{i = 1}^N\begin{cases} 2v\vec{0}^T = \vec{0}^T &\mathbf{w\cdot x}_i + b\le0\\ -2v\mathbf{x}_i^T & \mathbf{w\cdot x}_i + b > 0 \end{cases}\\ &= \frac{1}{N}\sum_{i=1}^N\begin{cases} \vec{0}^T & \mathbf{w\cdot x}_i + b\le 0\\ -2(y_i - u)\mathbf{x}_i^T &\mathbf{w\cdot x}_i + b >0 \end{cases}\\ &= \frac{1}{N}\sum_{i=1}^N\begin{cases} \vec{0}^T &\mathbf{w\cdot x}_i + b\le 0\\ -2(y_i - \max(0,\mathbf{w\cdot x}_i + b))\mathbf{x}_i^T & \mathbf{w\cdot x}_i + b> 0 \end{cases}\\ &= \begin{cases} \vec{0}^T & \mathbf{w\cdot x}_i + b \le 0\\ \frac{-2}{N}\sum_{i= 1}^N(y_i - (\mathbf{w\cdot x}_i + b))\mathbf{x}_i^T &\mathbf{w\cdot x}_i + b > 0 \end{cases}\\ &= \begin{cases} \vec{0}^T & \mathbf{w\cdot x}_i + b\le 0\\ \frac{2}{N}\sum_{i = 1}^N(\mathbf{w\cdot x}_i + b - y_i)\mathbf{x}_i^T & \mathbf{w\cdot x}_i + b > 0 \end{cases} \end{aligned}$$
We can define an error term $e_i = \mathbf{w\cdot x}_i + b - y_i$ to simplify the total gradient. Note that this gradient applies only to the case where the activation is nonzero:
$$\frac{\partial C}{\partial\mathbf{w}}=\frac{2}{N}\sum_{i = 1}^Ne_i\mathbf{x}_i^T \quad (\mathbf{w\cdot x}_i + b > 0)$$
The gradient is therefore a weighted average over all samples, with weights given by the error terms; it points most strongly in the direction of the samples with the largest $e_i$. The gradient descent update is written
$$\mathbf{w}_{t + 1} = \mathbf{w}_t - \eta\frac{\partial C}{\partial\mathbf{w}}$$
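A sketch comparing the closed-form gradient $\frac{2}{N}\sum_i e_i\mathbf{x}_i^T$ (restricted to samples with a positive pre-activation) against PyTorch autograd on made-up data:
```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 3)                      # N = 5 toy samples, 3 features
y = torch.randn(5)                         # made-up targets
w = torch.randn(3, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# C = 1/N * sum_i (y_i - max(0, w·x_i + b))^2
pred = torch.clamp(X @ w + b, min=0)
loss = torch.mean((y - pred) ** 2)
loss.backward()

with torch.no_grad():
    z = X @ w + b
    active = z > 0                         # samples where the ReLU is active
    e = z - y                              # error term e_i = w·x_i + b - y_i
    manual = 2 / len(X) * (e[active].unsqueeze(1) * X[active]).sum(dim=0)

print(w.grad)    # autograd gradient
print(manual)    # closed-form gradient; the two agree
```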
The gradient with respect to the bias term
Optimizing the bias $b$ works just like optimizing the weights. Use the intermediate variables
$$\begin{aligned} u(\mathbf{w}, b, \mathbf{x}) &= \max(0, \mathbf{w\cdot x} + b)\\ v(y, u) &= y - u\\ C(v) &= \frac{1}{N}\sum_{i=1}^N v^2 \end{aligned}$$
We already know:
$$\frac{\partial u}{\partial b} = \begin{cases} 0 & \mathbf{w\cdot x} + b \le 0\\ 1 & \mathbf{w\cdot x} + b> 0 \end{cases}$$
Then for $v$ the partial derivative is:
$$\frac{\partial v(y, u)}{\partial b} = -\frac{\partial u}{\partial b} = \begin{cases} 0 & \mathbf{w\cdot x} + b \le 0\\ -1 & \mathbf{w\cdot x} + b > 0 \end{cases}$$
The total derivative is then:
$$\begin{aligned} \frac{\partial C(v)}{\partial b} &= \frac{\partial}{\partial b}\frac{1}{N} \sum_{i = 1}^N v^2\\ &= \frac{1}{N}\sum_{i = 1}^N \frac{\partial}{\partial b} v^2\\ &= \frac{1}{N}\sum_{i = 1}^N 2v\frac{\partial v}{\partial b}\\ &= \frac{1}{N}\sum_{i = 1}^N\begin{cases} 0 & \mathbf{w\cdot x}_i + b\le 0\\ -2v & \mathbf{w\cdot x}_i + b > 0 \end{cases}\\ &= \frac{1}{N}\sum_{i = 1}^N\begin{cases} 0 & \mathbf{w\cdot x}_i + b \le 0\\ -2(y_i - \max(0, \mathbf{w\cdot x}_i + b)) &\mathbf{w \cdot x}_i + b > 0 \end{cases} \\ &= \begin{cases} 0 & \mathbf{w\cdot x}_i + b \le 0\\ \frac{2}{N}\sum_{i = 1}^N(\mathbf{w\cdot x}_i + b - y_i) &\mathbf{w \cdot x}_i + b > 0 \end{cases} \end{aligned}$$
Defining the same error term $e_i = \mathbf{w\cdot x}_i + b - y_i$ as before,
the partial derivative becomes
$$\frac{\partial C}{\partial b} = \frac{2}{N}\sum_{i = 1}^N e_i \quad (\mathbf{w\cdot x}_i + b > 0)$$
The corresponding update rule is
$$b_{t + 1} = b_t - \eta\frac{\partial C}{\partial b}$$
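Putting both update rules together, a minimal NumPy training loop for the single neuron; all data and hyperparameters here are made up for illustration:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                               # toy samples
y = np.maximum(0, X @ np.array([1.0, -2.0, 0.5]) + 0.3)    # made-up targets

w, b, lr = 0.1 * rng.normal(size=3), 0.1, 0.05
for step in range(500):
    z = X @ w + b
    if step % 100 == 0:
        print(step, np.mean((y - np.maximum(0, z)) ** 2))  # the loss shrinks
    active = z > 0                                         # ReLU-active samples
    e = z - y                                              # e_i = w·x_i + b - y_i
    w -= lr * 2 / len(X) * X[active].T @ e[active]         # w <- w - eta * dC/dw
    b -= lr * 2 / len(X) * e[active].sum()                 # b <- b - eta * dC/db
```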
Summary
In practice one usually works with an augmented weight vector, i.e.
$$\begin{aligned} \hat{\mathbf{w}} &= [\mathbf{w}^T, b]^T\\ \hat{\mathbf{x}} & = [\mathbf{x}^T, 1]^T \end{aligned}$$
so that $\mathbf{w\cdot x} + b = \hat{\mathbf{w}}\cdot \hat{\mathbf{x}}$.
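A tiny NumPy check of the augmented-vector identity (arbitrary numbers):
```python
import numpy as np

w, b = np.array([0.5, -1.0, 2.0]), 0.3
x = np.array([1.0, 2.0, 3.0])

w_hat = np.append(w, b)        # augmented weights [w^T, b]^T
x_hat = np.append(x, 1.0)      # augmented input  [x^T, 1]^T
print(np.dot(w, x) + b)        # w·x + b
print(np.dot(w_hat, x_hat))    # augmented dot product, same value
```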