This article uses plain language and intuitive illustrations to introduce the stochastic gradient descent algorithm and three of its classic variants, and provides complete implementation code.
GitHub link: https://github.com/ManuelGonzalezRivero/3dbabove
Several ways to optimize a cost function
The objective function measures how close the predicted values are to the actual values. We usually want a set of parameters that makes the cost as small as possible, which means the algorithm is performing well. The smallest possible cost of the function is called the minimum. A cost function can sometimes have several local minima. Fortunately, in very high-dimensional parameter spaces, local minima that keep the objective function from being optimized well are rare: a true local minimum requires the function to curve upward along every parameter direction at the same point, which becomes increasingly unlikely as the dimensionality grows. Instead, what we usually encounter are saddle points rather than true minima.
An algorithm that finds the set of parameters producing the minimum is called an optimization algorithm. We will see that as the algorithms grow more sophisticated, they tend to approach the minimum more efficiently. In this post we will discuss the following algorithms:
- Stochastic gradient descent (SGD)
- Momentum
- RMSProp
- Adam
Stochastic gradient descent
My earlier post on logistic regression walked through an example of how stochastic gradient descent works. If you look up stochastic gradient descent (SGD), you will usually run into an update equation of the following form:
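Written in the notation used by the code later in this post, the update is

θ ← θ − α · dJ(θ)/dθ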
The references will tell you that θ is the set of parameters you are trying to find to minimize J, where J is called the objective function, and α denotes the learning rate. The equation is applied repeatedly until the cost reaches a value you are happy with.
What does this mean? Imagine sitting on a sled at the top of a mountain, looking out at another hill. If you slide down the hill, you naturally move downhill until you eventually come to rest at the bottom. If the first hill is steep enough, you might even start sliding partway up the side of the next hill. From this analogy you can picture the following:
A higher learning rate means less friction, so the sled slides down the slope as if it were on ice. A lower learning rate means more friction, so the sled struggles to move, as if it were on carpet. How do we model this effect with the equation above?
Stochastic gradient descent:
- Initialize the parameters (θ, learning rate)
- Compute the gradient at each θ
- Update the parameters
- Repeat steps 2 and 3 until the cost stabilizes
Let's see how it works with a simple example!
Here we have an objective function and its derivative (gradient):
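In this example (the same function implemented in the code below) they are

J(θ) = cos(3πθ)/θ
dJ(θ)/dθ = −3π·sin(3πθ)/θ − cos(3πθ)/θ²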
We can plot the function and its gradient (scaled by 1/30 so both fit on one set of axes) with the code below:
import numpy as np
import matplotlib.pyplot as plt

def minimaFunction(theta):
    return np.cos(3*np.pi*theta)/theta

def minimaFunctionDerivative(theta):
    const1 = 3*np.pi
    const2 = const1*theta
    return -(const1*np.sin(const2)/theta)-np.cos(const2)/theta**2

theta = np.arange(.1,2.1,.01)
Jtheta = minimaFunction(theta)
dJtheta = minimaFunctionDerivative(theta)

plt.plot(theta,Jtheta,label = r'$J(\theta)$')
plt.plot(theta,dJtheta/30,label = r'$dJ(\theta)/30$')
plt.legend()
axes = plt.gca()
#axes.set_ylim([-10,10])
plt.ylabel(r'$J(\theta),dJ(\theta)/30$')
plt.xlabel(r'$\theta$')
plt.title(r'$J(\theta),dJ(\theta)/30 $ vs $\theta$')
plt.show()
Two details in the figure above are worth noting. First, the cost function has several minima (roughly at θ ≈ 0.25, 1.0, and 1.7). Second, the derivative passes through zero at each minimum; these zero crossings are exactly the parameter values we want the algorithm to settle on.
The code below implements the four steps above. It also produces a video showing the value of θ and the gradient at each step.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

def optimize(iterations, oF, dOF, params, learningRate):
    """
    Computes the optimal value of params for a given objective function and its derivative
    Arguments:
        - iterations - the number of iterations used to optimize the objective function
        - oF - the objective function
        - dOF - the derivative of the objective function
        - params - the parameters of the function to optimize
        - learningRate - the learning rate
    Return:
        - oParams - the list of optimized parameters at each iteration step
    """
    oParams = [params]
    # The iteration loop
    for i in range(iterations):
        # Compute the derivative at the current parameters
        dParams = dOF(params)
        # Compute the update
        params = params - learningRate*dParams
        # Append the new parameters
        oParams.append(params)
    return np.array(oParams)

def minimaFunction(theta):
    return np.cos(3*np.pi*theta)/theta

def minimaFunctionDerivative(theta):
    const1 = 3*np.pi
    const2 = const1*theta
    return -(const1*np.sin(const2)/theta)-np.cos(const2)/theta**2

theta = .6
iterations = 45
learningRate = .0007
optimizedParameters = optimize(iterations,
                               minimaFunction,
                               minimaFunctionDerivative,
                               theta,
                               learningRate)
This seems to work quite well! Note that if the initial value of θ had been larger, the optimization would have ended up in a different local minimum. As discussed above, though, this is unlikely to be a problem in very high-dimensional spaces, because it would require the function to curve upward in every parameter direction at once.
You might wonder, "What happens if the learning rate is too large?" If the step size is too big, the algorithm may never find the optimum, as the animation below shows. It is important to monitor the cost function and make sure it decreases monotonically; if it does not, lower the learning rate.
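One simple way to do this, sketched below using minimaFunction and optimizedParameters from the code block above, is to evaluate the cost at every step and flag any increase:

costs = [minimaFunction(p) for p in optimizedParameters]
# The cost should never go up between consecutive steps
if any(later > earlier for earlier, later in zip(costs, costs[1:])):
    print("Cost increased at some step - consider lowering the learning rate.")
else:
    print("Cost decreased monotonically.")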
SGD also works in parameter spaces with more than one variable. We can plot a two-dimensional function as a contour plot. Here you can see SGD working just as well on an asymmetric bowl-shaped function.
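The surface used in the code below is a negated, scaled product of two normal densities, i.e. a bowl that is much narrower in y than in x:

J(x, y) = −40 · N(x; μ=0, σ=3.0) · N(y; μ=0, σ=0.5)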
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import scipy.stats
import matplotlib.animation as animation

def minimaFunction(params):
    # Bivariate normal function, negated and scaled so it forms a bowl
    # (note: mlab.bivariate_normal is only available in older matplotlib releases)
    X, Y = params
    sigma11, sigma12, mu11, mu12 = (3.0, .5, 0.0, 0.0)
    Z1 = mlab.bivariate_normal(X, Y, sigma11, sigma12, mu11, mu12)
    Z = Z1
    return -40*Z

def minimaFunctionDerivative(params):
    # Derivative of the bivariate normal function
    X, Y = params
    sigma11, sigma12, mu11, mu12 = (3.0, .5, 0.0, 0.0)
    dZ1X = -scipy.stats.norm.pdf(X, mu11, sigma11)*(mu11 - X)/sigma11**2
    dZ1Y = -scipy.stats.norm.pdf(Y, mu12, sigma12)*(mu12 - Y)/sigma12**2
    return (dZ1X, dZ1Y)

def optimize(iterations, oF, dOF, params, learningRate, beta):
    """
    Computes the optimal value of params for a given objective function and its derivative
    Arguments:
        - iterations - the number of iterations used to optimize the objective function
        - oF - the objective function
        - dOF - the derivative of the objective function
        - params - the parameters of the function to optimize
        - learningRate - the learning rate
        - beta - the weighted moving average parameter (unused by plain SGD; kept for a consistent interface)
    Return:
        - oParams - the list of optimized parameters at each iteration step
    """
    oParams = [params]
    # The iteration loop
    for i in range(iterations):
        # Compute the derivative at the current parameters
        dParams = dOF(params)
        # SGD: each parameter is updated as parameter = parameter - learningRate*dParameter
        params = tuple([par - learningRate*dPar for dPar, par in zip(dParams, params)])
        # Append the new parameters
        oParams.append(params)
    return oParams

iterations = 100
learningRate = 1
beta = .9
x, y = 4.0, 1.0
params = (x, y)
optimizedParameters = optimize(iterations,
                               minimaFunction,
                               minimaFunctionDerivative,
                               params,
                               learningRate,
                               beta)
SGD with momentum
Note that plain SGD does not solve every problem! Often, we want to use a very large learning rate so that the parameters of interest are learned quickly. Unfortunately, when the cost function fluctuates strongly, this can make the optimization unstable; in the previous video you can see this as jitter along the y parameter direction. Momentum tries to fix this by using past gradients to smooth out the updates. SGD with momentum typically updates the parameters with the following formula:
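A common way to write this update, consistent with the code below, is

ν_dw ← γ·ν_dw + ν·dJ(θ)
θ ← θ − α·ν_dw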
The values γ and ν let us weight the previous value of dJ(θ) against the current one when determining the new θ. People typically choose γ and ν so that the result is an exponentially weighted moving average, as follows:
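That is, with γ = β and ν = 1 − β (the form the code below implements):

ν_dw ← β·ν_dw + (1 − β)·dJ(θ)

while the parameter update θ ← θ − α·ν_dw stays the same.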
A β value of 0.9 is a good default choice. Choosing β = 1 − 1/t amounts to averaging over roughly the most recent t values of ν_dw. This simple change can make a dramatic difference in the optimization! We can now use a much larger learning rate and converge in far less time!
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import scipy.stats
import matplotlib.animation as animation

def minimaFunction(params):
    # Bivariate normal function, negated and scaled so it forms a bowl
    X, Y = params
    sigma11, sigma12, mu11, mu12 = (3.0, .5, 0.0, 0.0)
    Z1 = mlab.bivariate_normal(X, Y, sigma11, sigma12, mu11, mu12)
    Z = Z1
    return -40*Z

def minimaFunctionDerivative(params):
    # Derivative of the bivariate normal function
    X, Y = params
    sigma11, sigma12, mu11, mu12 = (3.0, .5, 0.0, 0.0)
    dZ1X = -scipy.stats.norm.pdf(X, mu11, sigma11)*(mu11 - X)/sigma11**2
    dZ1Y = -scipy.stats.norm.pdf(Y, mu12, sigma12)*(mu12 - Y)/sigma12**2
    return (dZ1X, dZ1Y)

def optimize(iterations, oF, dOF, params, learningRate, beta):
    """
    Computes the optimal value of params for a given objective function and its derivative
    Arguments:
        - iterations - the number of iterations used to optimize the objective function
        - oF - the objective function
        - dOF - the derivative of the objective function
        - params - the parameters of the function to optimize
        - learningRate - the learning rate
        - beta - the weighted moving average parameter for momentum
    Return:
        - oParams - the list of optimized parameters at each iteration step
    """
    oParams = [params]
    vdw = (0.0, 0.0)
    # The iteration loop
    for i in range(iterations):
        # Compute the derivative at the current parameters
        dParams = dOF(params)
        # Momentum for each gradient: vdw = vdw*beta + (1.0-beta)*dPar
        vdw = tuple([vDW*beta + (1.0-beta)*dPar for dPar, vDW in zip(dParams, vdw)])
        # Update each parameter with the momentum term: parameter = parameter - learningRate*vdw
        params = tuple([par - learningRate*dPar for dPar, par in zip(vdw, params)])
        # Append the new parameters
        oParams.append(params)
    return oParams

iterations = 100
learningRate = 5.3
beta = .9
x, y = 4.0, 1.0
params = (x, y)
optimizedParameters = optimize(iterations,
                               minimaFunction,
                               minimaFunctionDerivative,
                               params,
                               learningRate,
                               beta)
RMSProp
As with everything else in engineering, we keep trying to do better. RMSProp tries to improve on momentum by looking at the relative size of the gradient with respect to each parameter. To do so, it keeps an exponentially weighted moving average of each squared gradient and uses it to normalize the update. Parameters with larger gradients accumulate a much larger s_dw than parameters with smaller gradients, so the update is scaled down where the surface is steep and the cost descends smoothly toward the minimum. The update can be seen in the following equations:
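In the notation of the code below:

s_dw ← β·s_dw + (1 − β)·(dJ(θ))²
θ ← θ − α·dJ(θ) / (√s_dw + ε)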
Note that ε is added for numerical stability; a value such as 10⁻⁷ works well. Why? If s_dw ever came very close to zero, the division would blow up the update; ε keeps it bounded.
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import scipy.stats
import matplotlib.animation as animation

def minimaFunction(params):
    # Bivariate normal function, negated and scaled so it forms a bowl
    X, Y = params
    sigma11, sigma12, mu11, mu12 = (3.0, .5, 0.0, 0.0)
    Z1 = mlab.bivariate_normal(X, Y, sigma11, sigma12, mu11, mu12)
    Z = Z1
    return -40*Z

def minimaFunctionDerivative(params):
    # Derivative of the bivariate normal function
    X, Y = params
    sigma11, sigma12, mu11, mu12 = (3.0, .5, 0.0, 0.0)
    dZ1X = -scipy.stats.norm.pdf(X, mu11, sigma11)*(mu11 - X)/sigma11**2
    dZ1Y = -scipy.stats.norm.pdf(Y, mu12, sigma12)*(mu12 - Y)/sigma12**2
    return (dZ1X, dZ1Y)

def optimize(iterations, oF, dOF, params, learningRate, beta):
    """
    Computes the optimal value of params for a given objective function and its derivative
    Arguments:
        - iterations - the number of iterations used to optimize the objective function
        - oF - the objective function
        - dOF - the derivative of the objective function
        - params - the parameters of the function to optimize
        - learningRate - the learning rate
        - beta - the weighted moving average parameter for RMSProp
    Return:
        - oParams - the list of optimized parameters at each iteration step
    """
    oParams = [params]
    sdw = (0.0, 0.0)
    eps = 10**(-7)
    # The iteration loop
    for i in range(iterations):
        # Compute the derivative at the current parameters
        dParams = dOF(params)
        # Weighted moving average of each squared gradient: sdw = sdw*beta + (1.0-beta)*dPar**2
        sdw = tuple([sDW*beta + (1.0-beta)*dPar**2 for dPar, sDW in zip(dParams, sdw)])
        # Update each parameter: parameter = parameter - learningRate*dParameter/(sqrt(sdw) + eps)
        params = tuple([par - learningRate*dPar/((sDW**.5) + eps) for sDW, par, dPar in zip(sdw, params, dParams)])
        # Append the new parameters
        oParams.append(params)
    return oParams

iterations = 10
learningRate = .3
beta = .9
x, y = 5.0, 1.0
params = (x, y)
optimizedParameters = optimize(iterations,
                               minimaFunction,
                               minimaFunctionDerivative,
                               params,
                               learningRate,
                               beta)
Adam
Adam combines the ideas of momentum and RMSProp into a single algorithm, aiming to get the best of both worlds. The equations are as follows:
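In the notation of the code below, with t counting iterations starting at 1:

ν_dw ← β₁·ν_dw + (1 − β₁)·dJ(θ)
s_dw ← β₂·s_dw + (1 − β₂)·(dJ(θ))²
ν_dw^corr = ν_dw / (1 − β₁ᵗ)
s_dw^corr = s_dw / (1 − β₂ᵗ)
θ ← θ − α·ν_dw^corr / (√s_dw^corr + ε)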
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import scipy.stats
import matplotlib.animation as animation

def minimaFunction(params):
    # Bivariate normal function, negated and scaled so it forms a bowl
    X, Y = params
    sigma11, sigma12, mu11, mu12 = (3.0, .5, 0.0, 0.0)
    Z1 = mlab.bivariate_normal(X, Y, sigma11, sigma12, mu11, mu12)
    Z = Z1
    return -40*Z

def minimaFunctionDerivative(params):
    # Derivative of the bivariate normal function
    X, Y = params
    sigma11, sigma12, mu11, mu12 = (3.0, .5, 0.0, 0.0)
    dZ1X = -scipy.stats.norm.pdf(X, mu11, sigma11)*(mu11 - X)/sigma11**2
    dZ1Y = -scipy.stats.norm.pdf(Y, mu12, sigma12)*(mu12 - Y)/sigma12**2
    return (dZ1X, dZ1Y)

def optimize(iterations, oF, dOF, params, learningRate, beta1, beta2):
    """
    Computes the optimal value of params for a given objective function and its derivative
    Arguments:
        - iterations - the number of iterations used to optimize the objective function
        - oF - the objective function
        - dOF - the derivative of the objective function
        - params - the parameters of the function to optimize
        - learningRate - the learning rate
        - beta1 - the weighted moving average parameter for the momentum component of Adam
        - beta2 - the weighted moving average parameter for the RMSProp component of Adam
    Return:
        - oParams - the list of optimized parameters at each iteration step
    """
    oParams = [params]
    vdw = (0.0, 0.0)
    sdw = (0.0, 0.0)
    eps = 10**(-7)
    # The iteration loop
    for i in range(iterations):
        # Compute the derivative at the current parameters
        dParams = dOF(params)
        # Momentum for each gradient: vdw = vdw*beta1 + (1.0-beta1)*dPar
        vdw = tuple([vDW*beta1 + (1.0-beta1)*dPar for dPar, vDW in zip(dParams, vdw)])
        # Weighted moving average of each squared gradient: sdw = sdw*beta2 + (1.0-beta2)*dPar**2
        sdw = tuple([sDW*beta2 + (1.0-beta2)*dPar**2.0 for dPar, sDW in zip(dParams, sdw)])
        # Bias correction for vdw and sdw (boosts the averages during the first iterations)
        vdwCorr = tuple([vDW/(1.0 - beta1**(i + 1.0)) for vDW in vdw])
        sdwCorr = tuple([sDW/(1.0 - beta2**(i + 1.0)) for sDW in sdw])
        # Update each parameter: parameter = parameter - learningRate*vdwCorr/(sqrt(sdwCorr) + eps)
        params = tuple([par - learningRate*vdwCORR/((sdwCORR**.5) + eps)
                        for vdwCORR, sdwCORR, par in zip(vdwCorr, sdwCorr, params)])
        # Append the new parameters
        oParams.append(params)
    return oParams

iterations = 100
learningRate = .1
beta1 = .9
beta2 = .999
x, y = 5.0, 1.0
params = (x, y)
optimizedParameters = optimize(iterations,
                               minimaFunction,
                               minimaFunctionDerivative,
                               params,
                               learningRate,
                               beta1,
                               beta2)
Adam is probably the most widely used optimization algorithm in deep learning today and works well across a broad range of applications. Adam computes bias-corrected values ν_dw^corr and s_dw^corr to counteract the slow warm-up of the exponentially weighted moving averages: it boosts them by dividing by a factor that is small in the first iterations and approaches one as the iteration count grows. Good starting values when using Adam are β₁ = 0.9 and β₂ = 0.999.
Summary
There are quite a few algorithms to choose from when optimizing an objective function. In the examples above, each successive method converged faster than the last:
- SGD: 100 iterations
- SGD + Momentum: 50 iterations
- RMSProp: 10 iterations
- Adam: 5 iterations
Original post: https://3dbabove.com/2017/11/14/optimizationalgorithms/