A study of different gradient-descent optimizers (SGD, Momentum, AdaGrad, Adam) compared with an optimizer of my own
Reference: Deep Learning from Scratch (《深度學習入門:基於Python的理論與實現》)
import matplotlib.pyplot as plt
import numpy as np
import latexify
from collections import OrderedDict
from common.optimizer import *
Optimization of a one-dimensional function
Define the function
def J(x):
return x**6/6 - 5.5*x**5/5 + 6.5*x**4/4 + 5.5*x**3/3 - 7.5*x**2/2
def dJ(x):
return x**5 - 5.5*x**4 + 6.5*x**3 + 5.5*x**2 - 7.5*x
def ddJ(x):
return 5*x**4 - 22*x**3 + 19.5*x**2 + 11*x - 7.5
x = np.linspace(-2, 5, 100)
plt.figure(figsize=(8, 6))
plt.plot(x, J(x), label='J(x)')
plt.plot(x, dJ(x), label='dJ(x)')
plt.plot(x, ddJ(x), label='ddJ(x)')
plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.title('J(x) and dJ(x) and ddJ(x)')
plt.ylim(-5, 20)
plt.show()
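Before visualizing the iterations it is worth noting where the stationary points of J lie: dJ factors as x(x + 1)(x - 1)(x - 2.5)(x - 3), so J has local minima at x = 3, x = 1 and x = -1, with J(3) ≈ 1.58 > J(1) ≈ -1.23 > J(-1) ≈ -2.69. The global minimum is therefore at x = -1, and an optimizer starting from x0 = 4 meets the shallow minimum at x = 3 first. A quick numerical check:
# stationary points of J are the roots of dJ
print(np.roots([1, -5.5, 6.5, 5.5, -7.5, 0]))   # roots 3, 2.5, 1, 0, -1 (in some order)
print(J(np.array([3.0, 1.0, -1.0])))             # depths of the three minima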
Visualize the iteration process
# Build an animation that visualizes the gradient-descent iterations, using matplotlib's animation module
from matplotlib import animation
from IPython.display import HTML
def show_animation(x_history):
fig = plt.figure(figsize=(8, 6))
ax = plt.axes(xlim=(-2, 5), ylim=(-5, 30))
line, = ax.plot([], [], 'bo', lw=2, label='point')
x = np.linspace(-2, 5, 100)
ax.plot(x, J(x), 'b-', lw=1, label='J(x)')  # draw the objective curve so the moving point has context
plt.legend()
def init():
line.set_data([], [])
return line,
def animate(frame):
line.set_data(x_history[frame], J(x_history[frame]))
plt.title('iteration = {}'.format(frame))
return line,
anim = animation.FuncAnimation(fig, animate, init_func=init,
frames=len(x_history), # number of frames in the animation
interval=40, # delay between frames, in milliseconds
blit=True # redraw only the parts of the figure that changed
)
return HTML(anim.to_html5_video())
# test
x_history = {
'AdaGrad':[1,2,3],
'SGD':[1,2,3],
'Momentum':[1,2,3],
'Nesterov':[1,2,3],
'Adam': [1,2,3]
}
print(len(x_history.keys()))
print(len(x_history))
5
5
SGD
Simple implementation
alpha = 0.05
x = 4
x_history = [x]
for i in range(200):
x = x - alpha * dJ(x)
x_history.append(x)
print(x_history[-1])
show_animation(x_history)
-1.0
Implementation with the optimizer class
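The loop below relies on the SGD class from common/optimizer.py (the book's code). A minimal sketch of what that class is assumed to look like, i.e. the update rule above wrapped behind an update(params, grads) interface:
class SGD:
    # assumed implementation of common.optimizer.SGD: one step of x <- x - lr * grad
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]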
init_pos = 4
params = {}
params['x'] = init_pos
grads = {}
x_history = []
optimizer = SGD(lr=0.05)
for i in range(200):
x_history.append(params['x'])
grads['x'] = dJ(params['x'])
optimizer.update(params, grads)
print(params['x'])
show_animation(x_history)
-1.0
Momentum
Simple implementation
v = 0
alpha = 0.02
beta = 0.9
x = 4
x_history = [x]
for i in range(200):
v = beta * v - alpha * dJ(x)
x = x + v
x_history.append(x)
print(x_history[-1])
show_animation(x_history)
-0.9999435339008631
Implementation with the optimizer class
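Again, the Momentum class comes from common/optimizer.py. A sketch of the assumed implementation, matching the simple version above (velocity v <- momentum * v - lr * grad, then x <- x + v, with momentum defaulting to 0.9):
class Momentum:
    # assumed implementation of common.optimizer.Momentum
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None
    def update(self, params, grads):
        if self.v is None:
            # lazily create one velocity entry per parameter
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]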
init_pos = 4
params = {}
params['x'] = init_pos
grads = {}
x_history = []
optimizer = Momentum(lr=0.02)
for i in range(200):
x_history.append(params['x'])
grads['x'] = dJ(params['x'])
optimizer.update(params, grads)
print(params['x'])
show_animation(x_history)
-0.9999435339008631
AdaGrad
Simple implementation
alpha = 0.5
x = 4
h = 0
x_history = [x]
for i in range(200):
h = h + dJ(x)**2
x = x - alpha * dJ(x) / (np.sqrt(h)+1e-7)
x_history.append(x)
print(x_history[-1])
show_animation(x_history)
3.0000003436659832
Implementation with the optimizer class
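A sketch of the assumed AdaGrad class from common/optimizer.py, mirroring the simple implementation above (accumulate the squared gradients in h and divide each step by sqrt(h)):
class AdaGrad:
    # assumed implementation of common.optimizer.AdaGrad
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None
    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]   # accumulated squared gradient
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
The lazy np.zeros_like initialization is presumably also why init_pos is created as np.float64(4) below: starting from a plain Python int would give h an integer buffer and break the in-place accumulation.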
init_pos = np.float64(4)
params = {}
params['x'] = init_pos
grads = {}
x_history = []
optimizer = AdaGrad(lr=0.5)
for i in range(200):
x_history.append(params['x'])
grads['x'] = dJ(params['x'])
optimizer.update(params, grads)
print(params['x'])
show_animation(x_history)
3.0000003436659832
Interim summary
We can see that AdaGrad's effective learning rate keeps shrinking as the iterations proceed: AdaGrad accumulates the squared gradient of each parameter and divides the step by the square root of that sum, so the step size decays over time. This lets us start with a fairly large learning rate and make fast progress early on. But it also has a downside, clearly visible in the video: the point races into the nearest local minimum and then can hardly move any further. This raises a natural question: can we keep AdaGrad's advantage (a large learning rate that speeds up early learning) while avoiding this weakness, by also introducing momentum so that the point keeps heading toward the global minimum? That is essentially the idea behind Adam.
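A small numeric illustration of that decay, reusing the simple-implementation variables from above: the effective step size lr / (sqrt(h) + 1e-7) collapses after only a handful of iterations.
h, x, lr = 0, 4, 0.5
for i in range(5):
    h += dJ(x)**2
    step = lr / (np.sqrt(h) + 1e-7)
    print('iter {}: effective step size = {:.5f}'.format(i, step))
    x -= step * dJ(x)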
Adam
Simple implementation
alpha = 0.5
beta1 = 0.9
beta2 = 0.999
m = 0
v = 0
x = 4
x_history = [x]
for i in range(200):
m += (1-beta1) * (dJ(x) - m)
v += (1-beta2) * (dJ(x)**2 - v)
lr_t = alpha * np.sqrt(1-beta2**(i+1)) / (1-beta1**(i+1))
x = x - lr_t * m / (np.sqrt(v)+1e-7)
x_history.append(x)
print(x_history[-1])
show_animation(x_history)
0.9999331921434963
Implementation with the optimizer class
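As before, the Adam class is taken from common/optimizer.py. A sketch of the assumed implementation: it keeps running first- and second-moment estimates m and v and folds the bias correction into a per-iteration step size lr_t, exactly like the simple implementation above:
class Adam:
    # assumed implementation of common.optimizer.Adam
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
    def update(self, params, grads):
        if self.m is None:
            self.m = {key: np.zeros_like(val) for key, val in params.items()}
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        self.iter += 1
        # bias-corrected step size
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])      # first moment
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])   # second moment
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)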
init_pos = np.float64(4)
params = {}
params['x'] = init_pos
grads = {}
x_history = []
optimizer = Adam(lr=0.5)
for i in range(200):
x_history.append(params['x'])
grads['x'] = dJ(params['x'])
optimizer.update(params, grads)
print(params['x'])
show_animation(x_history)
0.9999331921434963
With the same learning rate as AdaGrad (lr = 0.5), Adam gets over the first hill while AdaGrad does not. If we now simply raise the learning rate to 0.8, the gap between Adam and AdaGrad becomes even clearer.
init_pos = np.float64(4)
params = {}
params['x'] = init_pos
grads = {}
x_history = []
optimizer = Adam(lr=0.8)
for i in range(200):
x_history.append(params['x'])
grads['x'] = dJ(params['x'])
optimizer.update(params, grads)
print(params['x'])
show_animation(x_history)
-0.9999825571870972
From the video we can see that Adam converges to the global minimum, while AdaGrad does not.
Overall remarks and comparison
For every method above, the simple implementation and the optimizer-class implementation produce exactly the same result; all we did was wrap the update rule in a class, which makes it easy to plug in different optimizers.
Below we animate the iterations of all four optimizers side by side, so that their behaviour can be compared at a glance.
optimizers = {}
optimizers['SGD'] = SGD(lr=0.05)
optimizers['Momentum'] = Momentum(lr=0.02)
optimizers['AdaGrad'] = AdaGrad(lr=0.5)
optimizers['Adam'] = Adam(lr=0.8)
x_history = {}
for key in optimizers:
params = {}
params['x'] = init_pos
grads = {}
x_history[key] = []
optimizer = optimizers[key]
for i in range(200):
x_history[key].append(params['x'])
grads['x'] = dJ(params['x'])
optimizer.update(params, grads)
print(key, params['x'])
SGD -1.0
Momentum -0.9999435339008631
AdaGrad 3.0000003436659832
Adam -0.9999825571870972
# Build an animation that visualizes the gradient-descent iterations, using matplotlib's animation module
from matplotlib import animation
from IPython.display import HTML
def show_mult_animation(x_history):
# @param x_history: a dict mapping an optimizer name to that optimizer's list of x positions
# @return: an HTML5 animation visualizing the gradient-descent iterations
fig = plt.figure(figsize=(20, 10))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)
axlist = [ax1, ax2, ax3, ax4]
linelist = []
for i in range(4):
ax = axlist[i]
ax.set_xlim(-2, 5)
ax.set_ylim(-5, 30)
ax.set_xlabel('x')
ax.set_ylabel('J(x)')
x = np.linspace(-2, 5, 100)
ax.plot(x, J(x), 'b-', lw=1, label='J(x)')
line, = ax.plot([], [], 'bo', lw=2, label='point')
linelist.append(line)
ax.legend()
ax.set_title(list(x_history.keys())[i] + ' optimizer')
def init():
for line in linelist:
line.set_data([], [])
return linelist
def animate(frame):
line_index = 0
for key in x_history.keys():
x = x_history[key][frame]
y = J(x)
line = linelist[line_index]
line_index += 1
line.set_data(x, y)
fig.suptitle('iterations: ' + str(frame))
return linelist
anim = animation.FuncAnimation(fig, animate, init_func=init,
frames=200, # number of frames in the animation
interval=40, # delay between frames, in milliseconds
blit=True # redraw only the parts of the figure that changed
)
return HTML(anim.to_html5_video())
show_mult_animation(x_history)
My own optimizer: PGD (Physical Gradient Descent)
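The idea, as implemented below, is to treat the current point as a bead sliding along the curve y = J(x) under gravity g with kinetic friction mu. Each iteration first moves the bead horizontally with its current velocity, then updates the squared speed from an energy balance: v2 <- v2 + 2*g*(J(x_prev) - J(x)) - 2*mu*N*|x - x_prev|, where the normal force per unit mass is approximated as N = g + v2*J''(x)/(1 + J'(x)**2), i.e. gravity plus a rough centripetal term. The horizontal velocity is then recovered as vx = ±sqrt(v2)/sqrt(1 + J'(x)**2), with the sign taken from the direction of motion; when the bead stops (v2 <= 0) it gets a small push along the local downhill direction.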
def physisc_grad_descent(J, grad_J, gradgrad_J, x0=4, dt = 0.05, g = 9.8, mu = 0.1, n_iters=75, is_print=False):
x = x0
v_square = 0
vx = 0
x_history = np.array([x])
for i in range(n_iters):
if is_print:
print("該點的水平位置為:", x,end=' ')
print("該點的水平速度為:", vx,end=' ')
print("該點的動能為:", 0.5 * v_square)
prex = x
# move the point horizontally with its current velocity
x = x + dt * vx
# energy balance: kinetic energy changes by the drop in J minus the friction work,
# where the normal force is approximated by gravity plus a centripetal term
v_square = v_square + 2 * g * (J(prex)-J(x)) - 2 * mu * (g + v_square * gradgrad_J(x) / (1 + grad_J(x)**2))*abs(x-prex)
if v_square <= 0:
v_square = 0
vx = g * (-grad_J(x)) / (1 + (grad_J(x))**2) * dt
else:
if x - prex < 0:
vx = -np.sqrt(v_square) / np.sqrt(1 + (grad_J(x))**2 )
else:
vx = np.sqrt(v_square) / np.sqrt(1 + (grad_J(x))**2 )
x_history = np.append(x_history, x)
return x_history
x_history = physisc_grad_descent(J, dJ, ddJ, x0=4, dt = 0.05, g = 9.8, mu = 0.2, n_iters=200, is_print=False)
show_animation(x_history)
To sum up, writing the optimizer this way is still rather limited: it needs second-derivative information about the objective, which is unavailable in most practical learning settings, so the method is not really usable in general.
Optimization of a two-dimensional function
Define the function
def f(x, y):
return x**2 / 20.0 + y**2
def df(x, y):
return x / 10.0, 2.0*y
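This f is the ill-conditioned bowl used in the reference book: its Hessian is constant, diag(0.1, 2.0), so the curvature along y is 20 times larger than along x, which is exactly what makes plain SGD zigzag across the valley. A quick check of that anisotropy:
H = np.array([[0.1, 0.0],
              [0.0, 2.0]])   # constant Hessian of f(x, y) = x**2/20 + y**2
print(np.linalg.cond(H))     # condition number 20 -> a long, narrow valley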
init_pos = (-7.0, 2.0)
params = {}
params['x'], params['y'] = init_pos[0], init_pos[1]
grads = {}
grads['x'], grads['y'] = 0, 0
optimizers = OrderedDict()
optimizers["SGD"] = SGD(lr=0.95)
optimizers["Momentum"] = Momentum(lr=0.1)
optimizers["AdaGrad"] = AdaGrad(lr=1.5)
optimizers["Adam"] = Adam(lr=0.3)
idx = 1
plt.figure(figsize=(20,15))
for key in optimizers:
optimizer = optimizers[key]
x_history = []
y_history = []
params['x'], params['y'] = init_pos[0], init_pos[1]
for i in range(200):
x_history.append(params['x'])
y_history.append(params['y'])
grads['x'], grads['y'] = df(params['x'], params['y'])
optimizer.update(params, grads)
# print the final position
print(key + ":" + str(params['x']) + "," + str(params['y']))
x = np.arange(-10, 10, 0.01)
y = np.arange(-5, 5, 0.01)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
# for simple contour line
mask = Z > 7
Z[mask] = 0
# plot
plt.subplot(2, 2, idx)
idx += 1
plt.plot(x_history, y_history, 'o-', color="red")
plt.contour(X, Y, Z)
plt.ylim(-10, 10)
plt.xlim(-10, 10)
plt.plot(0, 0, '+')
#colorbar()
#spring()
plt.title(key)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
SGD:-1.4955945746380926e-08,1.4110158217310404e-09
Momentum:4.525745158046888e-05,-2.984927273613453e-05
AdaGrad:-0.00021640235526721696,1.0811148417095901e-41
Adam:-0.00011915420080310103,3.3543852840193107e-05
Visualize the iteration process
# Build an animation that visualizes the gradient-descent iterations, using matplotlib's animation module
from matplotlib import animation
from IPython.display import HTML
def show_animation(x_history,y_history,key):
fig = plt.figure(figsize=(8, 6))
ax = plt.axes(xlim=(-10, 10), ylim=(-10, 10))
line, = ax.plot([], [], 'ro', lw=2, label='point')
x = np.arange(-10, 10, 0.01)
y = np.arange(-5, 5, 0.01)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
# for simple contour line
mask = Z > 7
Z[mask] = 0
plt.contour(X, Y, Z)
plt.ylim(-10, 10)
plt.xlim(-10, 10)
plt.plot(0, 0, '+')
plt.xlabel("x")
plt.ylabel("y")
plt.title(key)
plt.legend()
def init():
line.set_data([], [])
return line,
def animate(frame):
line.set_data(x_history[frame], y_history[frame])
ax.set_title("frame: {}".format(frame))
return line,
anim = animation.FuncAnimation(fig, animate, init_func=init,
frames=len(x_history), # number of frames in the animation
interval=40, # delay between frames, in milliseconds
blit=True # redraw only the parts of the figure that changed
)
return anim
anim_list = []
for key in optimizers:
optimizer = optimizers[key]
x_history = []
y_history = []
params['x'], params['y'] = init_pos[0], init_pos[1]
for i in range(200):
x_history.append(params['x'])
y_history.append(params['y'])
grads['x'], grads['y'] = df(params['x'], params['y'])
optimizer.update(params, grads)
# print the final position (note: the optimizer instances created above are reused here, so
# Momentum/AdaGrad/Adam still carry internal state (v, h, m, ...) from the previous run and
# end up at slightly different points; SGD is stateless, so its result matches exactly)
print(key + ":" + str(params['x']) + "," + str(params['y']))
anim_list.append(show_animation(x_history,y_history,key))
SGD:-1.4955945746380926e-08,1.4110158217310404e-09
Momentum:3.0175123211357402e-05,-2.6841550736447172e-05
AdaGrad:-0.02598841542479064,8.887975802048801e-24
Adam:-0.00018076469204204358,5.470947760432499e-05
HTML(anim_list[0].to_html5_video())
HTML(anim_list[1].to_html5_video())
HTML(anim_list[2].to_html5_video())
HTML(anim_list[3].to_html5_video())
# save the animations as GIF files
anim_list[0].save('SGD.gif', writer='imagemagick', fps=30)
anim_list[1].save('Momentum.gif', writer='imagemagick', fps=30)
anim_list[2].save('AdaGrad.gif', writer='imagemagick', fps=30)
anim_list[3].save('Adam.gif', writer='imagemagick', fps=30)
MovieWriter imagemagick unavailable; using Pillow instead.
MovieWriter imagemagick unavailable; using Pillow instead.
MovieWriter imagemagick unavailable; using Pillow instead.
MovieWriter imagemagick unavailable; using Pillow instead.
To conclude: each of these four optimizers has its strengths, and in practice we choose one according to the problem at hand. In deep learning Adam is a common default because it converges quickly and in most cases reaches a good solution; SGD converges more slowly, but in some situations it is still the better choice and can reach an equally good final solution.
%%HTML
<img src="SGD.gif">
<img src="Momentum.gif">
<img src="AdaGrad.gif">
<img src="Adam.gif">