Building a Reading-Comprehension Bot with Keras

Posted by Andrew.Hann on 2017-02-27

catalogue

1. Training Set
2. Data Preprocessing
3. Neural Network Model Design (Story Set <-> Question Set)
4. Neural Network Model Design (Question Set <-> Answer Set)
5. RNN
6. Training
7. Validating the Results

 

1. Training Set

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?     bathroom    1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel?     hallway    4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel?     hallway    4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel?     office    11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra?     bathroom    8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra?     bathroom    2
4 Mary went to the bedroom.
5 Daniel moved to the hallway.
6 Where is Sandra?     bathroom    2
7 John went to the garden.
8 John travelled to the office.
9 Where is Sandra?     bathroom    2
10 Daniel journeyed to the bedroom.
11 Daniel travelled to the hallway.
12 Where is John?     office    8

The training set is organized as groups of story sentences [2] + a question [1] + an answer [1]; the trailing number on each question line is the index of the supporting sentence.

Relevant Link:

https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz

 

2. Data Preprocessing

0x1: Vocabulary

The vocabulary is the basis for vectorizing the corpus; it is used to map each word into the vector space.

Here's what a "story" tuple looks like (input, query, answer):
([u'Mary', u'moved', u'to', u'the', u'bathroom', u'.', u'John', u'went', u'to', u'the', u'hallway', u'.'], [u'Where', u'is', u'Mary', u'?'], u'bathroom')
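A minimal sketch of how the vocabulary is built (mirroring the full script in section 6, with a single hand-written story tuple as the assumed input): every distinct token across stories, questions and answers is collected, sorted, and mapped to an integer starting at 1, with index 0 reserved for padding.

# Assumed toy input: one (story, query, answer) tuple like the one printed above.
from functools import reduce

train_stories = [
    (['Mary', 'moved', 'to', 'the', 'bathroom', '.',
      'John', 'went', 'to', 'the', 'hallway', '.'],
     ['Where', 'is', 'Mary', '?'],
     'bathroom'),
]

# Union of all tokens, sorted, then mapped to indices 1..N (0 is kept for padding).
vocab = sorted(reduce(lambda x, y: x | y,
                      (set(story + q + [answer]) for story, q, answer in train_stories)))
word_idx = dict((c, i + 1) for i, c in enumerate(vocab))
print(word_idx)   # e.g. {'.': 1, '?': 2, 'John': 3, 'Mary': 4, 'Where': 5, ...}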

0x2: Encoding the Training Set

Using the vocabulary, the stories, questions and answers are encoded as numeric vectors.

Vectorizing the word sequences...
inputs_train[0]
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  5 16 19 18  9  1  4 21 19 18 12  1]
queries_train[0]
[ 7 13  5  2]
answers_train[0]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.]

// both the training set and the test set need to be encoded
inputs_test[0]
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  4 20 19 18 12  1  5 14 19 18  9  1]
queries_test[0]
[ 7 13  4  2]
answers_test[0]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.
  0.  0.  0.  0.]

0x3: Padding to a Uniform Size

Using the longest story, question and answer found during preprocessing, every story/question/answer vector is pre-padded to a chunk of equal length.

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
    X = []
    Xq = []
    Y = []
    for story, query, answer in data:
        x = [word_idx[w] for w in story]
        xq = [word_idx[w] for w in query]
        y = np.zeros(len(word_idx) + 1)  # let's not forget that index 0 is reserved
        y[word_idx[answer]] = 1
        X.append(x)
        Xq.append(xq)
        Y.append(y)
    return (pad_sequences(X, maxlen=story_maxlen),
            pad_sequences(Xq, maxlen=query_maxlen), np.array(Y))
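A quick check of the padding behaviour assumed above (a standalone snippet, not part of the original script): pad_sequences pads on the left ('pre') by default, which is why the encoded stories above begin with a run of zeros.

from keras.preprocessing.sequence import pad_sequences

# Two toy index sequences of different lengths, padded to a common length of 5.
print(pad_sequences([[5, 16, 19], [4, 21]], maxlen=5))
# [[ 0  0  5 16 19]
#  [ 0  0  0  4 21]]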

Relevant Link:

 

3. Neural Network Model Design (Story Set <-> Question Set)

Overall, the story set and the question set are fed into two separate network branches, computed independently, and finally merged into a single activation (softmax). The reason for splitting them into two branches and then merging them is that stories and questions are intrinsically related: a reading-comprehension task never asks a question that has nothing to do with the story.

0x1: Story-Set Training

1. Embedding layer

The Embedding layer turns positive integers (indices) into dense vectors of a fixed size, e.g. [[4],[20]] -> [[0.25,0.1],[0.6,-0.2]].
An Embedding layer can only be used as the first layer of a model; here it is the first layer and receives the tokenized story vectors.

keras.layers.embeddings.Embedding(
    input_dim, 
    output_dim, 
    init='uniform', 
    input_length=None, 
    W_regularizer=None, 
    activity_regularizer=None, 
    W_constraint=None, 
    mask_zero=False, 
    weights=None, 
    dropout=0.0
)

1. input_dim: integer >= 0, the vocabulary size, i.e. the largest input index + 1
2. output_dim: integer > 0, the dimensionality of the dense embedding
3. init: initialization method, the name of a predefined initializer or a Theano function used to initialize the weights. Only meaningful when the weights argument is not passed.
4. weights: initial weights, a list of numpy arrays. The list should contain a single weight matrix of shape (input_dim, output_dim).
5. W_regularizer: regularizer applied to the weights, a WeightRegularizer object
6. W_constraint: constraint applied to the weights, a Constraints object
7. mask_zero: boolean, whether input value 0 should be treated as a 'padding' value to be ignored. Useful when using recurrent layers on variable-length input. If set to True, every subsequent layer in the model must support masking, otherwise an exception is raised.
8. input_length: the length of the input sequences, when that length is fixed. This argument is required if you want to connect Flatten and then Dense layers downstream, otherwise the output shape of the Dense layer cannot be inferred.
9. dropout: float between 0 and 1, fraction of the embeddings to drop

Input shape
a 2D tensor of shape (samples, sequence_length)

Output shape
a 3D tensor of shape (samples, sequence_length, output_dim)

code slice

# embed the input sequence into a sequence of vectors
input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size,
                              output_dim=64,
                              input_length=story_maxlen))

2. Dropout to prevent overfitting

input_encoder_m.add(Dropout(0.3))

0x2: Question-Set Training

Same model structure as the story-set branch

# embed the question into a sequence of vectors
question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size,
                               output_dim=64,
                               input_length=query_maxlen))
question_encoder.add(Dropout(0.3))

0x3: Merging the Story and Question Branches

Given a merge mode, the Merge layer combines a list of tensors into a single tensor

keras.engine.topology.Merge(
    layers=None, 
    mode='sum', 
    concat_axis=-1, 
    dot_axes=-1, 
    output_shape=None, 
    node_indices=None, 
    tensor_indices=None, 
    name=None
)

1. layers: a list of Keras tensors or Keras layer objects. The list must contain more than one element.
2. mode: merge mode, either the name of a predefined merge mode, or a lambda/plain function. If it is a function, it must take a list of tensors as input and return a single tensor. If it is a string, it must be one of: "sum", "mul", "concat", "ave", "cos", "dot".
3. concat_axis: integer, the axis to concatenate along when mode='concat'
4. dot_axes: integer or tuple of integers, the axes to contract when mode='dot'
5. output_shape: tuple of integers, or a lambda/plain function (when mode is a function). If output_shape is a function, it takes the list of input shapes and returns the shape of the output tensor.
6. node_indices: optional list of integers. If some layers have several output nodes, this specifies the index of the node to merge for each layer. If omitted it defaults to all zeros, i.e. the output of node 0 of each input layer.
7. tensor_indices: optional list of integers. If some layers return several output tensors, this specifies which tensors to merge.

The Embedding layers take 2D input and produce 3D output; the embedding dimension itself is not what we care about here, so the dot merge contracts it (axis 2 of both branches), leaving one match score per (story position, question position) pair.

match = Sequential()
match.add(Merge([input_encoder_m, question_encoder],
                mode='dot',
                dot_axes=[2, 2]))
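The shape arithmetic of this dot merge can be checked with a small numpy sketch (an illustration under assumed shapes taken from the printed data above, not part of the original model): contracting the embedding axis of both branches leaves one similarity score per (story position, question position) pair.

import numpy as np

story_emb = np.random.rand(1, 68, 64)   # (samples, story_maxlen, embedding_dim)
query_emb = np.random.rand(1, 4, 64)    # (samples, query_maxlen, embedding_dim)

# Contract the shared embedding axis (axis 2 of both tensors), as dot_axes=[2, 2] does.
match_scores = np.einsum('bse,bqe->bsq', story_emb, query_emb)
print(match_scores.shape)               # (1, 68, 4) == (samples, story_maxlen, query_maxlen)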

0x4: Activation Layer on the Story-Question Match

match.add(Activation('softmax'))

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/layers/embedding_layer/
https://keras-cn.readthedocs.io/en/latest/layers/core_layer/

 

4. Neural Network Model Design (Question Set <-> Answer Set)

By the same reasoning, the question set and the answer set are also related, so they too are handled as two network branches that are merged at the end

# embed the input into a single vector with size = story_maxlen:
input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size,
                              output_dim=query_maxlen,
                              input_length=story_maxlen))
input_encoder_c.add(Dropout(0.3))
# output: (samples, story_maxlen, query_maxlen)
# sum the match vector with the input vector:
response = Sequential()
response.add(Merge([match, input_encoder_c], mode='sum'))
# output: (samples, story_maxlen, query_maxlen)
response.add(Permute((2, 1)))  # output: (samples, query_maxlen, story_maxlen)

# concatenate the match vector with the question vector,
# and do logistic regression on top
answer = Sequential()
answer.add(Merge([response, question_encoder], mode='concat', concat_axis=-1))
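A small shape sketch (illustrative, under assumed dimensions) of what this concat produces: the permuted response, (samples, query_maxlen, story_maxlen), is glued onto the question embedding, (samples, query_maxlen, 64), along the last axis, and that concatenated sequence is what the LSTM in the next section consumes.

import numpy as np

response_out = np.zeros((1, 4, 68))    # (samples, query_maxlen, story_maxlen)
question_out = np.zeros((1, 4, 64))    # (samples, query_maxlen, embedding_dim)

merged = np.concatenate([response_out, question_out], axis=-1)
print(merged.shape)                    # (1, 4, 132): a length-4 sequence of 132-dim vectors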

 

5. RNN

Many kinds of model could handle this sort of inference; here an LSTM RNN is chosen. An RNN contains loops, which allow information to persist. Traditional backpropagation feeds back only weakly to earlier layers once there are many layers, whereas an RNN can make better use of the training history, which amounts to a kind of long-term memory.

 

keras.layers.recurrent.LSTM(
    output_dim, 
    init='glorot_uniform', 
    inner_init='orthogonal', 
    forget_bias_init='one', 
    activation='tanh', 
    inner_activation='hard_sigmoid', 
    W_regularizer=None, 
    U_regularizer=None, 
    b_regularizer=None, 
    dropout_W=0.0, 
    dropout_U=0.0
)

1. output_dim: dimensionality of the internal projections and of the output
2. init: initialization method, the name of a predefined initializer or a Theano function used to initialize the weights.
3. inner_init: initialization method for the inner cells
4. forget_bias_init: initialization function for the forget-gate bias. Initializing it with all ones is recommended.
5. activation: activation function, the name of a predefined activation
6. inner_activation: activation function for the inner cells
7. W_regularizer: regularizer applied to the input weights, a WeightRegularizer object
8. U_regularizer: regularizer applied to the recurrent weights, a WeightRegularizer object
9. b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
10. dropout_W: float between 0 and 1, fraction of the connections from the input units to the input gates to drop
11. dropout_U: float between 0 and 1, fraction of the connections from the input units to the recurrent connections to drop

code slice

# the original paper uses a matrix multiplication for this reduction step.
# we choose to use a RNN instead.
answer.add(LSTM(32))
# one regularization layer -- more would probably be needed.
answer.add(Dropout(0.3))
answer.add(Dense(vocab_size))
# we output a probability distribution over the vocabulary
answer.add(Activation('softmax'))

Relevant Link:

http://www.jianshu.com/p/9dc9f41f0b29 
https://keras-cn.readthedocs.io/en/latest/layers/recurrent_layer/

 

6. Training

fit(
    self, 
    x, 
    y, 
    batch_size=32, 
    nb_epoch=10, 
    verbose=1, 
    callbacks=[], 
    validation_split=0.0, 
    validation_data=None, 
    shuffle=True, 
    class_weight=None, 
    sample_weight=None
)

1. x: input data. If the model has a single input, x is a numpy array; if it has several inputs, x should be a list of numpy arrays, one per input. If every input of the model is named, a dict mapping input names to data can be passed instead.
2. y: labels, a numpy array. If the model has several outputs, a list of numpy arrays can be passed. If the outputs are named, a dict mapping output names to labels can be passed.
3. batch_size: integer, number of samples per gradient update; one gradient-descent step is taken per batch to optimize the objective.
4. nb_epoch: integer, number of training epochs; the training data is iterated over nb_epoch times. In Keras, variables starting with nb mean "number of".
5. verbose: verbosity, 0 = no logging to stdout, 1 = progress bar, 2 = one line per epoch
6. callbacks: a list of keras.callbacks.Callback objects. These callbacks are invoked at the appropriate moments during training; see the callbacks documentation.
7. validation_split: float between 0 and 1, the fraction of the training data to hold out as a validation set. The validation data does not take part in training; the model's metrics (loss, accuracy, etc.) are evaluated on it at the end of each epoch.
8. validation_data: a tuple (X, y) or (X, y, sample_weights) used as the validation set. This argument overrides validation_split.
9. shuffle: boolean, whether to shuffle the training samples before each epoch.
10. class_weight: dict mapping classes to weights, used to reweight the loss function during training (training only). Useful for imbalanced training data (classes with very few samples), so that the loss pays more attention to under-represented classes.
11. sample_weight: numpy array of weights used to reweight the loss function during training (training only). Either a 1D array with the same length as the samples (one weight per sample), or, for temporal data, a 2D array of shape (samples, sequence_length) giving a different weight to every timestep of every sample. In the latter case make sure to pass sample_weight_mode='temporal' when compiling the model.

fit returns a History object whose History.history attribute records how the loss and the other metrics change from epoch to epoch; if a validation set is supplied, it also records those metrics on the validation set
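A minimal usage sketch (assuming the answer model and the vectorized tensors built in the full script of section 0x4 below): the History object returned by fit exposes the per-epoch curves through its history dict.

history = answer.fit(
    [inputs_train, queries_train, inputs_train], answers_train,
    batch_size=32,
    nb_epoch=2,   # just a couple of epochs for the sketch
    validation_data=([inputs_test, queries_test, inputs_test], answers_test)
)
print(history.history['loss'])      # training loss, one value per epoch
print(history.history['val_acc'])   # validation accuracy, available because validation_data was given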

0x1: Saving the Model Architecture, Trained Weights and Optimizer State

Keras's callbacks argument lets us hook into the right moments of the training loop, here to save the model and its weights as training runs

keras.callbacks.ModelCheckpoint(
    filepath, 
    monitor='val_loss', 
    verbose=0, 
    save_best_only=False, 
    save_weights_only=False, 
    mode='auto', 
    period=1
)

1. filepath: string, path where the model is saved
2. monitor: quantity to monitor
3. verbose: verbosity mode, 0 or 1
4. save_best_only: if True, only the model that performs best on the validation set is saved
5. mode: one of 'auto', 'min', 'max'. When save_best_only=True this decides how the best model is judged: for a monitored value of val_acc the mode should be max, for val_loss it should be min. In auto mode the criterion is inferred from the name of the monitored value.
6. save_weights_only: if True, only the model weights are saved; otherwise the whole model is saved (architecture, configuration, etc.)
7. period: number of epochs between checkpoints
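A short ModelCheckpoint sketch (illustrative file name and settings, not the ones used later in this post): save a checkpoint only when the validation loss improves, with the epoch number and val_loss filled into the file name by Keras.

from keras.callbacks import ModelCheckpoint

checkpointer = ModelCheckpoint(filepath='weights.{epoch:02d}-{val_loss:.2f}.hdf5',
                               monitor='val_loss',
                               save_best_only=True,
                               verbose=1)
# later: model.fit(..., callbacks=[checkpointer])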

0x2: Stopping Training When the Monitored Value Stops Improving

keras.callbacks.EarlyStopping(
    monitor='val_loss', 
    patience=0, 
    verbose=0, 
    mode='auto'
)

1. monitor: quantity to monitor
2. patience: once early stopping has been triggered (e.g. the loss did not decrease compared with the previous epoch), training stops after another patience epochs.
3. verbose: verbosity mode
4. mode: one of 'auto', 'min', 'max'. In min mode training stops when the monitored value stops decreasing; in max mode it stops when the value stops increasing.
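A short EarlyStopping sketch (assumed settings): stop once val_loss has failed to improve for 5 consecutive epochs.

from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')
# later: model.fit(..., callbacks=[early_stopping])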

0x3: Dynamic Learning-Rate Adjustment

keras.callbacks.LearningRateScheduler(schedule) 

schedule: a function that takes the epoch number (an integer counted from 0) and returns a new learning rate (a float)
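A schedule-function sketch (the step size and decay factor are assumptions for illustration): halve the learning rate every 10 epochs.

from keras.callbacks import LearningRateScheduler

def step_decay(epoch, initial_lr=0.001):
    # epoch is 0-based; halve the rate every 10 epochs
    return initial_lr * (0.5 ** (epoch // 10))

lr_schedule = LearningRateScheduler(step_decay)
# later: model.fit(..., callbacks=[lr_schedule])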

Keras can also adjust the learning rate automatically

keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', 
    factor=0.1, 
    patience=10, 
    verbose=0, 
    mode='auto', 
    epsilon=0.0001, 
    cooldown=0, 
    min_lr=0
)

1. monitor: quantity to monitor
2. factor: factor by which the learning rate is reduced each time: lr = lr * factor
3. patience: the reduction is triggered after patience epochs without improvement in model performance
4. mode: one of 'auto', 'min', 'max'. In min mode the reduction is triggered when the monitored value stops decreasing; in max mode when it stops increasing.
5. epsilon: threshold used to decide whether the monitored value has entered a "plateau"
6. cooldown: after a reduction, normal operation resumes only after cooldown epochs
7. min_lr: lower bound on the learning rate

When learning stalls, reducing the learning rate by a factor of 2 or 10 often works well.
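A matching ReduceLROnPlateau sketch (assumed settings, consistent with the rule of thumb above): cut the learning rate to a tenth after 10 stagnant epochs, but never below 1e-5.

from keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=1e-5)
# later: model.fit(..., callbacks=[reduce_lr])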

0x4: code

'''Trains a memory network on the bAbI dataset.

References:
- Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
  "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
  http://arxiv.org/abs/1502.05698

- Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
  "End-To-End Memory Networks",
  http://arxiv.org/abs/1503.08895

Reaches 98.6% accuracy on task 'single_supporting_fact_10k' after 120 epochs.
Time per epoch: 3s on CPU (core i7).
'''

from __future__ import print_function
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Activation, Dense, Merge, Permute, Dropout
from keras.layers import LSTM
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
from functools import reduce
import tarfile
import numpy as np
import re
import os
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau


def tokenize(sent):
    '''Return the tokens of a sentence including punctuation.

    >>> tokenize('Bob dropped the apple. Where is the apple?')
    ['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
    '''
    return [x.strip() for x in re.split('(\W+)?', sent) if x.strip()]


def parse_stories(lines, only_supporting=False):
    '''Parse stories provided in the bAbi tasks format

    If only_supporting is true, only the sentences that support the answer are kept.
    '''
    data = []
    story = []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        if nid == 1:
            story = []
        if '\t' in line:
            q, a, supporting = line.split('\t')
            q = tokenize(q)
            substory = None
            if only_supporting:
                # Only select the related substory
                supporting = map(int, supporting.split())
                substory = [story[i - 1] for i in supporting]
            else:
                # Provide all the substories
                substory = [x for x in story if x]
            data.append((substory, q, a))
            story.append('')
        else:
            sent = tokenize(line)
            story.append(sent)
    return data


def get_stories(f, only_supporting=False, max_length=None):
    '''Given a file name, read the file, retrieve the stories, and then convert the sentences into a single story.

    If max_length is supplied, any stories longer than max_length tokens will be discarded.
    '''
    data = parse_stories(f.readlines(), only_supporting=only_supporting)
    flatten = lambda data: reduce(lambda x, y: x + y, data)
    data = [(flatten(story), q, answer) for story, q, answer in data if not max_length or len(flatten(story)) < max_length]
    return data


def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
    X = []
    Xq = []
    Y = []
    for story, query, answer in data:
        x = [word_idx[w] for w in story]
        xq = [word_idx[w] for w in query]
        y = np.zeros(len(word_idx) + 1)  # let's not forget that index 0 is reserved
        y[word_idx[answer]] = 1
        X.append(x)
        Xq.append(xq)
        Y.append(y)
    return (pad_sequences(X, maxlen=story_maxlen),
            pad_sequences(Xq, maxlen=query_maxlen), np.array(Y))

path = ''
try:
    if not os.path.isfile("babi_tasks_1-20_v1-2.tar.gz"):
        path = get_file('babi_tasks_1-20_v1-2.tar.gz', origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
    else:
        path = 'babi_tasks_1-20_v1-2.tar.gz'
except:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise
if not os.path.isfile(path):
    print("babi_tasks_1-20_v1-2.tar.gz downlaod faild")
    exit()
tar = tarfile.open(path)

challenges = {
    # QA1 with 10,000 samples
    'single_supporting_fact_10k': 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt',
    # QA2 with 10,000 samples
    'two_supporting_facts_10k': 'tasks_1-20_v1-2/en-10k/qa2_two-supporting-facts_{}.txt',
}
challenge_type = 'single_supporting_fact_10k'
challenge = challenges[challenge_type]

print('Extracting stories for the challenge:', challenge_type)
train_stories = get_stories(tar.extractfile(challenge.format('train')))
test_stories = get_stories(tar.extractfile(challenge.format('test')))

vocab = sorted(reduce(lambda x, y: x | y, (set(story + q + [answer]) for story, q, answer in train_stories + test_stories)))
print('vocab')
print(vocab)

# Reserve 0 for masking via pad_sequences
vocab_size = len(vocab) + 1
story_maxlen = max(map(len, (x for x, _, _ in train_stories + test_stories)))
query_maxlen = max(map(len, (x for _, x, _ in train_stories + test_stories)))

print('-')
print('Vocab size:', vocab_size, 'unique words')
print('Story max length:', story_maxlen, 'words')
print('Query max length:', query_maxlen, 'words')
print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
print('-')
print('Here\'s what a "story" tuple looks like (input, query, answer):')
print(train_stories[0])
print('-')
print('Vectorizing the word sequences...')

word_idx = dict((c, i + 1) for i, c in enumerate(vocab))
inputs_train, queries_train, answers_train = vectorize_stories(train_stories, word_idx, story_maxlen, query_maxlen)
inputs_test, queries_test, answers_test = vectorize_stories(test_stories, word_idx, story_maxlen, query_maxlen)

print('inputs_train[0]')
print(inputs_train[0])
print('queries_train[0]')
print(queries_train[0])
print('answers_train[0]')
print(answers_train[0])

print ('inputs_test[0]')
print(inputs_test[0])
print('queries_test[0]')
print(queries_test[0])
print('answers_test[0]')
print(answers_test[0])


print('-')
print('inputs: integer tensor of shape (samples, max_length)')
print('inputs_train shape:', inputs_train.shape)
print('inputs_test shape:', inputs_test.shape)
print('-')
print('queries: integer tensor of shape (samples, max_length)')
print('queries_train shape:', queries_train.shape)
print('queries_test shape:', queries_test.shape)
print('-')
print('answers: binary (1 or 0) tensor of shape (samples, vocab_size)')
print('answers_train shape:', answers_train.shape)
print('answers_test shape:', answers_test.shape)
print('-')
print('Compiling...')


######################################## story - question_encoder ########################################
# embed the input sequence into a sequence of vectors
input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size,
                              output_dim=64,
                              input_length=story_maxlen))
input_encoder_m.add(Dropout(0.3))
# output: (samples, story_maxlen, embedding_dim)


# embed the question into a sequence of vectors
question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size,
                               output_dim=64,
                               input_length=query_maxlen))
question_encoder.add(Dropout(0.3))
# output: (samples, query_maxlen, embedding_dim)
# compute a 'match' between input sequence elements (which are vectors)
# and the question vector sequence
match = Sequential()
match.add(Merge([input_encoder_m, question_encoder],
                mode='dot',
                dot_axes=[2, 2]))
match.add(Activation('softmax'))
# output: (samples, story_maxlen, query_maxlen)
######################################## story - question_encoder ########################################


# embed the input into a single vector with size = story_maxlen:
input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size,
                              output_dim=query_maxlen,
                              input_length=story_maxlen))
input_encoder_c.add(Dropout(0.3))
# output: (samples, story_maxlen, query_maxlen)
# sum the match vector with the input vector:
response = Sequential()
response.add(Merge([match, input_encoder_c], mode='sum'))
# output: (samples, story_maxlen, query_maxlen)
response.add(Permute((2, 1)))  # output: (samples, query_maxlen, story_maxlen)

# concatenate the match vector with the question vector,
# and do logistic regression on top
answer = Sequential()
answer.add(Merge([response, question_encoder], mode='concat', concat_axis=-1))


# the original paper uses a matrix multiplication for this reduction step.
# we choose to use a RNN instead.
answer.add(LSTM(32))
# one regularization layer -- more would probably be needed.
answer.add(Dropout(0.3))
answer.add(Dense(vocab_size))
# we output a probability distribution over the vocabulary
answer.add(Activation('softmax'))

# checkpoint
checkpointer = ModelCheckpoint(filepath="./checkpoint.hdf5", verbose=1)
# learning rate adjust dynamic
lrate = ReduceLROnPlateau(min_lr=0.00001)

answer.compile(optimizer='rmsprop', loss='categorical_crossentropy',
               metrics=['accuracy'])
# Note: you could use a Graph model to avoid repeating the input twice
answer.fit(
    [inputs_train, queries_train, inputs_train], answers_train,
    batch_size=32,
    nb_epoch=5000,
    validation_data=([inputs_test, queries_test, inputs_test], answers_test),
    callbacks=[checkpointer, lrate]
)

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/models/model/

 

7. Validating the Results

The network structure used for prediction must match the one used for training; we can use load_model to load the model and weights saved during training

0x1: The Test Story: target.story

1 little go back to the alibaba.
2 hann join to the jiangnan university.
3 Where is little?      

0x2: code

'''Trains a memory network on the bAbI dataset.

References:
- Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
  "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
  http://arxiv.org/abs/1502.05698

- Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
  "End-To-End Memory Networks",
  http://arxiv.org/abs/1503.08895

Reaches 98.6% accuracy on task 'single_supporting_fact_10k' after 120 epochs.
Time per epoch: 3s on CPU (core i7).
'''

from __future__ import print_function
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Activation, Dense, Merge, Permute, Dropout
from keras.layers import LSTM
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
from functools import reduce
import tarfile
import numpy as np
import re
import os
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau
from keras.models import load_model

def tokenize(sent):
    '''Return the tokens of a sentence including punctuation.

    >>> tokenize('Bob dropped the apple. Where is the apple?')
    ['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
    '''
    return [x.strip() for x in re.split('(\W+)?', sent) if x.strip()]


def parse_stories(lines, only_supporting=False):
    '''Parse stories provided in the bAbi tasks format

    If only_supporting is true, only the sentences that support the answer are kept.
    '''
    data = []
    story = []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        if nid == 1:
            story = []
        if '\t' in line:
            q, a, supporting = line.split('\t')
            q = tokenize(q)
            substory = None
            if only_supporting:
                # Only select the related substory
                supporting = map(int, supporting.split())
                substory = [story[i - 1] for i in supporting]
            else:
                # Provide all the substories
                substory = [x for x in story if x]
            data.append((substory, q, a))
            story.append('')
        else:
            sent = tokenize(line)
            story.append(sent)
    return data


def parse_stories_target(lines, only_supporting=False):
    '''Parse stories provided in the bAbi tasks format

    If only_supporting is true, only the sentences that support the answer are kept.
    '''
    data = []
    story = []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        if nid == 1:
            story = []
        print(line)
        sent = tokenize(line)
        story.append(sent)
    substory = story[:len(story) - 1]
    query = story[-1]
    data.append((substory, query))
    return data



def get_stories(f, only_supporting=False, max_length=None):
    '''Given a file name, read the file, retrieve the stories, and then convert the sentences into a single story.

    If max_length is supplied, any stories longer than max_length tokens will be discarded.
    '''
    data = parse_stories(f.readlines(), only_supporting=only_supporting)
    flatten = lambda data: reduce(lambda x, y: x + y, data)
    data = [(flatten(story), q, answer) for story, q, answer in data if not max_length or len(flatten(story)) < max_length]
    return data


def get_stories_target(f, only_supporting=False, max_length=None):
    '''Given a file name, read the file, retrieve the stories, and then convert the sentences into a single story.

    If max_length is supplied, any stories longer than max_length tokens will be discarded.
    '''
    data = parse_stories_target(f.readlines(), only_supporting=only_supporting)
    print('parse_stories_target')
    print(data)
    flatten = lambda data: reduce(lambda x, y: x + y, data)
    data = [(flatten(story), q) for story, q in data if not max_length or len(flatten(story)) < max_length]
    return data


def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
    X = []
    Xq = []
    for story, query in data:
        x = [word_idx[w] for w in story]
        xq = [word_idx[w] for w in query]
        X.append(x)
        Xq.append(xq)
    return (
            pad_sequences(X, maxlen=story_maxlen),
            pad_sequences(Xq, maxlen=query_maxlen)
        )

path = 'babi_tasks_1-20_v1-2.tar.gz'
if not os.path.isfile(path):
    print("babi_tasks_1-20_v1-2.tar.gz not exist")
    exit()
tar = tarfile.open(path)

challenges = {
    # QA1 with 10,000 samples
    'single_supporting_fact_10k': 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt',
    # QA2 with 10,000 samples
    'two_supporting_facts_10k': 'tasks_1-20_v1-2/en-10k/qa2_two-supporting-facts_{}.txt',
}
challenge_type = 'single_supporting_fact_10k'
challenge = challenges[challenge_type]

print('Extracting stories for the challenge:', challenge_type)
train_stories = get_stories(tar.extractfile(challenge.format('train')))
test_stories = get_stories(tar.extractfile(challenge.format('test')))

# get the vocab, which is same as the train
vocab = sorted(reduce(lambda x, y: x | y, (set(story + q + [answer]) for story, q, answer in train_stories + test_stories)))
print('vocab')
print(vocab)


# Reserve 0 for masking via pad_sequences
vocab_size = len(vocab) + 1
story_maxlen = max(map(len, (x for x, _, _ in train_stories + test_stories)))
query_maxlen = max(map(len, (x for _, x, _ in train_stories + test_stories)))

print('-')
print('Vocab size:', vocab_size, 'unique words')
print('Story max length:', story_maxlen, 'words')
print('Query max length:', query_maxlen, 'words')
print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
print('-')
print('Here\'s what a "story" tuple looks like (input, query, answer):')
print(train_stories[0])
print('-')
print('Vectorizing the word sequences...')

word_idx = dict((c, i + 1) for i, c in enumerate(vocab))
# get the vec of the target story we input
with open("./target.story") as fs:
    target_input = get_stories_target(fs)
    target_storys, target_queries = vectorize_stories(target_input, word_idx, story_maxlen, query_maxlen)



print('target_storys[0]')
print(target_storys[0])
print('target_queries[0]')
print(target_queries[0])


print('-')
print('inputs: integer tensor of shape (samples, max_length)')
print('inputs_train shape:', target_storys.shape)
print('-')
print('queries: integer tensor of shape (samples, max_length)')
print('queries_train shape:', target_queries.shape)
print('-')
print('Compiling...')


# laod the model
answer_model = load_model('./checkpoint.hdf5')
# the model was trained on three inputs (story, query, story), so prediction must supply all three
answer_output = answer_model.predict([target_storys, target_queries, target_storys], batch_size=32, verbose=1)
print(answer_output)
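An illustrative follow-up (not part of the original script): turn the probability vector into a readable word by reversing word_idx and taking the argmax.

# idx_word reverses the vocabulary mapping built above; index 0 is the padding slot.
idx_word = dict((i, w) for w, i in word_idx.items())
predicted = idx_word.get(int(np.argmax(answer_output[0])), '<pad>')
print('Predicted answer:', predicted)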

A few issues came up during implementation:

1. If the corpus used for training is not complete enough, the vocabulary will not be complete either; word vectorization then fails outright and the downstream classification cannot proceed (see the sketch after this list).
2. During gradient descent we can run into two situations:
  1) a small local dip
  2) the true valley floor (which may be steep, the kind that drops away suddenly)
Both can look the same: over several consecutive epochs the loss stops decreasing or decreases only marginally. At that point we should be careful with automatic learning-rate adjustment (which in practice lowers the rate, e.g. by the common factor of 0.1), because it may make us converge too early into a false little dip and never descend to the true optimum. Conversely, if we hit a deep valley that drops away suddenly, a learning rate that is too large may keep us from ever settling on the optimum, leaving us oscillating around it.
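For point 1, a defensive variant of the vectorization loop (a sketch, assuming we are happy to treat unknown words as padding) avoids the hard failure by falling back to index 0 for out-of-vocabulary tokens:

# Inside vectorize_stories: use .get with a default instead of direct indexing.
x = [word_idx.get(w, 0) for w in story]
xq = [word_idx.get(w, 0) for w in query]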

Relevant Link:

Copyright (c) 2017 LittleHann All rights reserved
