IMPLEMENTING A GRU/LSTM RNN WITH PYTHON AND THEANO - 學習筆記

Andrew.Hann發表於2017-04-28

catalogue

0. 引言
1. LSTM NETWORKS
2. LSTM 的變體
3. GRUs (Gated Recurrent Units)
4. IMPLEMENTATION GRUs

 

0. 引言

In this post we’ll learn about LSTM (Long Short Term Memory) networks and GRUs (Gated Recurrent Units).  LSTMs were first proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, and are among the most widely used models in Deep Learning for NLP today. GRUs, first used in  2014, are a simpler variant of LSTMs that share many of the same properties.  Let’s start by looking at LSTMs, and then we’ll see how GRUs are different.

0x1: 長期依賴(Long-Term Dependencies)問題

RNN 的關鍵點之一就是他們可以用來連線先前的資訊到當前的任務上,例如使用過去的視訊段來推測對當前段的理解。如果 RNN 可以做到這個,他們就變得非常有用。但是真的可以麼?答案是,還有很多依賴因素。
有時候,我們僅僅需要知道先前的資訊來執行當前的任務。例如,我們有一個語言模型用來基於先前的詞來預測下一個詞。如果我們試著預測 “the clouds are in the sky” 最後的詞,我們並不需要任何其他的上下文 —— 因此下一個詞很顯然就應該是 sky。在這樣的場景中,相關的資訊和預測的詞位置之間的間隔是非常小的,RNN 可以學會使用先前的資訊。

但是同樣會有一些更加複雜的場景。假設我們試著去預測“I grew up in France... I speak fluent French”最後的詞。當前的資訊建議下一個詞可能是一種語言的名字,但是如果我們需要弄清楚是什麼語言,我們是需要先前提到的離當前位置很遠的 France 的上下文的。這說明相關資訊和當前預測位置之間的間隔就肯定變得相當的大。
不幸的是,在這個間隔不斷增大時,RNN 會喪失學習到連線如此遠的資訊的能力。

Relevant Link:

http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/

 

1. LSTM NETWORKS

0x1: LSTM如何從模型結構上規避"長週期依賴梯度消失問題"

In previous post we looked at how the vanishing gradient problem prevents standard RNNs from learning long-term dependencies.

LSTMs were designed to combat vanishing gradients through a gating mechanism.  To understand what this means, let’s look at how a LSTM calculates a hidden state s_t (I’m using \circ to mean elementwise multiplication):

\begin{aligned}  i &=\sigma(x_tU^i + s_{t-1} W^i) \\  f &=\sigma(x_t U^f +s_{t-1} W^f) \\  o &=\sigma(x_t U^o + s_{t-1} W^o) \\  g &=\ tanh(x_t U^g + s_{t-1}W^g) \\  c_t &= c_{t-1} \circ f + g \circ i \\  s_t &=\tanh(c_t) \circ o  \end{aligned}

These equations look quite complicated, but actually it’s not that hard. First, notice that a LSTM layer is  just another way to compute a hidden state. Previously, we computed the hidden state as s_t = \tanh(Ux_t + Ws_{t-1}). The inputs to this unit were x_t, the current input at step t, and s_{t-1}, the previous hidden state.  The output was a new hidden state s_t. A LSTM unit does the exact same thing, just in a different way! 

直觀上來說,i可以理解為"是否要讓本次記憶細胞通過",g可以理解為"本次要通過多少記憶",而i * g乘法可以增加複雜性,通過sigmoid的關係動態調整最終的輸入o是否要通過和通過多少

LSTM 通過刻意的設計來避免長期依賴問題。記住長期的資訊在實踐中是 LSTM 的預設行為,而非需要付出很大代價才能獲得的能力,所有 RNN 都具有一種重複神經網路模組的鏈式的形式。在標準的 RNN 中,這個重複的模組只有一個非常簡單的結構,例如一個 tanh 層。我們知道,代價函式的反饋難以讓網路前端的神經元學習到本質是"梯度消失問題",即因為偏導數鏈式求導的關係,導致代價函式C值不斷被稀釋。LSTM 同樣是這樣的結構,但是重複的模組擁有一個不同的結構。不同於 單一神經網路層,這裡是有四個,以一種非常特殊的方式進行互動

圖中圖示示例如下,在上面的圖例中,每一條黑線傳輸著一整個向量,從一個節點的輸出到其他節點的輸入。粉色的圈代表 pointwise 的操作,諸如向量的和,而黃色的矩陣就是學習到的神經網路層。合在一起的線表示向量的連線,分開的線表示內容被複制,然後分發到不同的位置

0x2: LSTM 的核心思想

LSTM 有通過精心設計的稱作為"門"的結構(門是一個LSTM神經元裡面的一個子結構)來去除或者增加資訊到細胞狀態的能力。門是一種讓資訊"選擇式通過"的方法。他們包含一個 sigmoid 神經網路層和一個 pointwise 乘法操作

Sigmoid 層輸出 0 到 1 之間的數值,描述每個部分有多少量可以通過。0 代表“不許任何量通過”,1 就指“允許任意量通過"。LSTM 擁有三個門,來保護和控制細胞狀態。

1. 忘記門

在我們 LSTM 中的第一步是決定我們會從細胞狀態中丟棄什麼資訊。這個決定通過一個稱為忘記門層完成。該門會讀取 h_{t-1} 和 x_t,輸出一個在 0 到 1 之間的數值給每個在細胞狀態 C_{t-1} 中的數字。1 表示“完全保留”,0 表示“完全捨棄”。
讓我們回到語言模型的例子中來基於已經看到的預測下一個詞。在這個問題中,細胞狀態可能包含當前主語的性別,因此正確的代詞可以被選擇出來。當我們看到新的主語,我們希望忘記舊的主語

2. 輸入門層 和 tanh層

下一步是確定什麼樣的新資訊被存放在細胞狀態中。這裡包含兩個部分

1. 第一,sigmoid 層稱 “輸入門層” 決定什麼值我們將要更新
2. 然後,一個 tanh 層建立一個新的候選值向量,C_t,會被加入到狀態中

接著,我們把舊狀態與 f_t 相乘,丟棄掉我們確定需要丟棄的資訊。接著加上 i_t * {C}_t。這就是新的候選值,根據我們決定更新每個狀態的程度進行變化。
在語言模型的例子中,這就是我們實際根據前面確定的目標,丟棄舊代詞的性別資訊並新增新的資訊的地方

3. 輸出門

最終,我們需要確定輸出什麼值。這個輸出將會基於我們的細胞狀態,但是也是一個過濾後的版本

1. 首先,我們執行一個 sigmoid 層來確定細胞狀態的哪個部分將輸出出去
2. 接著,我們把細胞狀態C_t通過 tanh 進行處理(得到一個在 -11 之間的值)並將它和 sigmoid 門的輸出相乘,最終我們僅僅會輸出我們確定輸出的那部分(選擇性輸出)

在語言模型的例子中,因為他就看到了一個 代詞,可能需要輸出與一個 動詞 相關的資訊。例如,可能輸出是否代詞是單數還是負數,這樣如果是動詞的話,我們也知道動詞需要進行的詞形變化

Intuitively, plain RNNs could be considered a special case of LSTMs. If you fix the input gate all 1’s, the forget gate to all 0’s (you always forget the previous memory) and the output gate to all one’s (you expose the whole memory) you almost get standard RNN. There’s just an additional \tanh that squashes the output a bit. The gating mechanism is what allows LSTMs to explicitly model long-term dependencies. By learning the parameters for its gates, the network learns how its memory should behave. 

Relevant Link:

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

 

2. LSTM 的變體

不是所有的 LSTM 都長成一個樣子的。實際上,幾乎所有包含 LSTM 的論文都採用了微小的變體。差異非常小

0x1: peephole connection

一個流形的 LSTM 變體,就是由 Gers & Schmidhuber (2000) 提出的,增加了 “peephole connection”。是說,我們讓 門層 也會接受細胞狀態的輸入

上面的圖例中,我們增加了 peephole 到每個門上

0x2: coupled 忘記和輸入門

通過使用 coupled 忘記和輸入門。不同於之前是分開確定什麼忘記和需要新增什麼新的資訊,這裡是一同做出決定

1. 我們僅僅會當我們將要輸入在當前位置時忘記
2. 我們僅僅輸入新的值到那些我們已經忘記舊的資訊的那些狀態

Relevant Link: 

https://arxiv.org/pdf/1503.04069.pdf

 

3. GRUs (Gated Recurrent Units)

GRUs本質上也是LSTM的一個變體,因為使用的比較多,這裡單獨列一章節討論。它將忘記門和輸入門合成了一個單一的 更新門。同樣還混合了細胞狀態和隱藏狀態,和其他一些改動。最終的模型比標準的 LSTM 模型要簡單,也是非常流行的變體。

The idea behind a GRU layer is quite similar to that of a LSTM layer, as are the equations.

A GRU has two gates

1. a reset gate r
2. an update gate z 

Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep around. If we set the reset to all 1’s and  update gate to all 0’s we again arrive at our plain RNN model(類LSTM模型和傳統RNN最大的區別就在於它使用了門來進行選擇性遺忘和記憶). The basic idea of using a gating mechanism to learn long-term dependencies is the same as in a LSTM, but there are a few key differences:

  • A GRU has two gates, an LSTM has three gates.
  • GRUs don’t possess and internal memory (c_t) that is different from the exposed hidden state. They don’t have the output gate that is present in LSTMs.
  • The input and forget gates are coupled by an update gate z and the reset gate r is applied directly to the previous hidden state. Thus, the responsibility of the reset gate in a LSTM is really split up into both r and z.
  • We don’t apply a second nonlinearity when computing the output.

Relevant Link:

http://www.jianshu.com/p/9dc9f41f0b29

 

4. IMPLEMENTATION GRUs
Remember that a GRU (LSTM) layer is just another way of computing the hidden state. So all we really need to do is change the hidden state computation in our forward propagation function.

def forward_prop_step(x_t, s_t1_prev):
      # This is how we calculated the hidden state in a simple RNN. No longer!
      # s_t = T.tanh(U[:,x_t] + W.dot(s_t1_prev))
       
      # Get the word vector
      x_e = E[:,x_t]
       
      # GRU Layer
      z_t1 = T.nnet.hard_sigmoid(U[0].dot(x_e) + W[0].dot(s_t1_prev) + b[0])
      r_t1 = T.nnet.hard_sigmoid(U[1].dot(x_e) + W[1].dot(s_t1_prev) + b[1])
      c_t1 = T.tanh(U[2].dot(x_e) + W[2].dot(s_t1_prev * r_t1) + b[2])
      s_t1 = (T.ones_like(z_t1) - z_t1) * c_t1 + z_t1 * s_t1_prev
       
      # Final output calculation
      # Theano's softmax returns a matrix with one row, we only need the row
      o_t = T.nnet.softmax(V.dot(s_t1) + c)[0]
 
      return [o_t, s_t1]

0x1: code

train.py

#! /usr/bin/env python

import sys
import os
import time
import numpy as np
from utils import *
from datetime import datetime
from gru_theano import GRUTheano

LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "0.001"))
VOCABULARY_SIZE = int(os.environ.get("VOCABULARY_SIZE", "2000"))
EMBEDDING_DIM = int(os.environ.get("EMBEDDING_DIM", "48"))
HIDDEN_DIM = int(os.environ.get("HIDDEN_DIM", "128"))
NEPOCH = int(os.environ.get("NEPOCH", "20"))
MODEL_OUTPUT_FILE = os.environ.get("MODEL_OUTPUT_FILE")
INPUT_DATA_FILE = os.environ.get("INPUT_DATA_FILE", "./data/reddit-comments-2015.csv")
PRINT_EVERY = int(os.environ.get("PRINT_EVERY", "25000"))

if not MODEL_OUTPUT_FILE:
  ts = datetime.now().strftime("%Y-%m-%d-%H-%M")
  MODEL_OUTPUT_FILE = "GRU-%s-%s-%s-%s.dat" % (ts, VOCABULARY_SIZE, EMBEDDING_DIM, HIDDEN_DIM)

# Load data
x_train, y_train, word_to_index, index_to_word = load_data(INPUT_DATA_FILE, VOCABULARY_SIZE)

# Build model
model = GRUTheano(VOCABULARY_SIZE, hidden_dim=HIDDEN_DIM, bptt_truncate=-1)

# Print SGD step time
t1 = time.time()
model.sgd_step(x_train[10], y_train[10], LEARNING_RATE)
t2 = time.time()
print "SGD Step time: %f milliseconds" % ((t2 - t1) * 1000.)
sys.stdout.flush()

# We do this every few examples to understand what's going on
def sgd_callback(model, num_examples_seen):
  dt = datetime.now().isoformat()
  loss = model.calculate_loss(x_train[:10000], y_train[:10000])
  print("\n%s (%d)" % (dt, num_examples_seen))
  print("--------------------------------------------------")
  print("Loss: %f" % loss)
  generate_sentences(model, 10, index_to_word, word_to_index)
  save_model_parameters_theano(model, MODEL_OUTPUT_FILE)
  print("\n")
  sys.stdout.flush()

for epoch in range(NEPOCH):
  train_with_sgd(model, x_train, y_train, learning_rate=LEARNING_RATE, nepoch=1, decay=0.9, 
    callback_every=PRINT_EVERY, callback=sgd_callback)

gru_theano.py

# -*- coding: utf-8 -*-

import numpy as np
import theano as theano
import theano.tensor as T
from theano.gradient import grad_clip
import time
import operator


class GRUTheano:
    def __init__(self, word_dim, hidden_dim=128, bptt_truncate=-1):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Initialize the network parameters
        E = np.random.uniform(-np.sqrt(1. / word_dim), np.sqrt(1. / word_dim), (hidden_dim, word_dim))
        U = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (6, hidden_dim, hidden_dim))
        W = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (6, hidden_dim, hidden_dim))
        V = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (word_dim, hidden_dim))
        b = np.zeros((6, hidden_dim))
        c = np.zeros(word_dim)
        # Theano: Created shared variables
        self.E = theano.shared(name='E', value=E.astype(theano.config.floatX))
        self.U = theano.shared(name='U', value=U.astype(theano.config.floatX))
        self.W = theano.shared(name='W', value=W.astype(theano.config.floatX))
        self.V = theano.shared(name='V', value=V.astype(theano.config.floatX))
        self.b = theano.shared(name='b', value=b.astype(theano.config.floatX))
        self.c = theano.shared(name='c', value=c.astype(theano.config.floatX))
        # SGD / rmsprop: Initialize parameters
        self.mE = theano.shared(name='mE', value=np.zeros(E.shape).astype(theano.config.floatX))
        self.mU = theano.shared(name='mU', value=np.zeros(U.shape).astype(theano.config.floatX))
        self.mV = theano.shared(name='mV', value=np.zeros(V.shape).astype(theano.config.floatX))
        self.mW = theano.shared(name='mW', value=np.zeros(W.shape).astype(theano.config.floatX))
        self.mb = theano.shared(name='mb', value=np.zeros(b.shape).astype(theano.config.floatX))
        self.mc = theano.shared(name='mc', value=np.zeros(c.shape).astype(theano.config.floatX))
        # We store the Theano graph here
        self.theano = {}
        self.__theano_build__()

    def __theano_build__(self):
        E, V, U, W, b, c = self.E, self.V, self.U, self.W, self.b, self.c

        x = T.ivector('x')
        y = T.ivector('y')

        def forward_prop_step(x_t, s_t1_prev, s_t2_prev):
            # This is how we calculated the hidden state in a simple RNN. No longer!
            # s_t = T.tanh(U[:,x_t] + W.dot(s_t1_prev))

            # Word embedding layer
            x_e = E[:, x_t]

            # GRU Layer 1
            z_t1 = T.nnet.hard_sigmoid(U[0].dot(x_e) + W[0].dot(s_t1_prev) + b[0])
            r_t1 = T.nnet.hard_sigmoid(U[1].dot(x_e) + W[1].dot(s_t1_prev) + b[1])
            c_t1 = T.tanh(U[2].dot(x_e) + W[2].dot(s_t1_prev * r_t1) + b[2])
            s_t1 = (T.ones_like(z_t1) - z_t1) * c_t1 + z_t1 * s_t1_prev

            # GRU Layer 2
            z_t2 = T.nnet.hard_sigmoid(U[3].dot(s_t1) + W[3].dot(s_t2_prev) + b[3])
            r_t2 = T.nnet.hard_sigmoid(U[4].dot(s_t1) + W[4].dot(s_t2_prev) + b[4])
            c_t2 = T.tanh(U[5].dot(s_t1) + W[5].dot(s_t2_prev * r_t2) + b[5])
            s_t2 = (T.ones_like(z_t2) - z_t2) * c_t2 + z_t2 * s_t2_prev

            # Final output calculation
            # Theano's softmax returns a matrix with one row, we only need the row
            o_t = T.nnet.softmax(V.dot(s_t2) + c)[0]

            return [o_t, s_t1, s_t2]

        [o, s, s2], updates = theano.scan(
            forward_prop_step,
            sequences=x,
            truncate_gradient=self.bptt_truncate,
            outputs_info=[None,
                          dict(initial=T.zeros(self.hidden_dim)),
                          dict(initial=T.zeros(self.hidden_dim))])

        prediction = T.argmax(o, axis=1)
        o_error = T.sum(T.nnet.categorical_crossentropy(o, y))

        # Total cost (could add regularization here)
        cost = o_error

        # Gradients
        dE = T.grad(cost, E)
        dU = T.grad(cost, U)
        dW = T.grad(cost, W)
        db = T.grad(cost, b)
        dV = T.grad(cost, V)
        dc = T.grad(cost, c)

        # Assign functions
        self.predict = theano.function([x], o)
        self.predict_class = theano.function([x], prediction)
        self.ce_error = theano.function([x, y], cost)
        self.bptt = theano.function([x, y], [dE, dU, dW, db, dV, dc])

        # SGD parameters
        learning_rate = T.scalar('learning_rate')
        decay = T.scalar('decay')

        # rmsprop cache updates
        mE = decay * self.mE + (1 - decay) * dE ** 2
        mU = decay * self.mU + (1 - decay) * dU ** 2
        mW = decay * self.mW + (1 - decay) * dW ** 2
        mV = decay * self.mV + (1 - decay) * dV ** 2
        mb = decay * self.mb + (1 - decay) * db ** 2
        mc = decay * self.mc + (1 - decay) * dc ** 2

        self.sgd_step = theano.function(
            [x, y, learning_rate, theano.Param(decay, default=0.9)],
            [],
            updates=[(E, E - learning_rate * dE / T.sqrt(mE + 1e-6)),
                     (U, U - learning_rate * dU / T.sqrt(mU + 1e-6)),
                     (W, W - learning_rate * dW / T.sqrt(mW + 1e-6)),
                     (V, V - learning_rate * dV / T.sqrt(mV + 1e-6)),
                     (b, b - learning_rate * db / T.sqrt(mb + 1e-6)),
                     (c, c - learning_rate * dc / T.sqrt(mc + 1e-6)),
                     (self.mE, mE),
                     (self.mU, mU),
                     (self.mW, mW),
                     (self.mV, mV),
                     (self.mb, mb),
                     (self.mc, mc)
                     ])

    def calculate_total_loss(self, X, Y):
        return np.sum([self.ce_error(x, y) for x, y in zip(X, Y)])

    def calculate_loss(self, X, Y):
        # Divide calculate_loss by the number of words
        num_words = np.sum([len(y) for y in Y])
        return self.calculate_total_loss(X, Y) / float(num_words)

Relevant Link:

http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
https://github.com/dennybritz/rnn-tutorial-gru-lstm/blob/master/gru_theano.py 

Copyright (c) 2017 LittleHann All rights reserved

相關文章