How to Train a Chatbot with TensorFlow (GitHub Link Included)

Posted by 超人汪小建 on 2017-09-28

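Preface

Production systems rarely implement an end-to-end chatbot directly with deep learning, but here we look at how a deep-learning seq2seq model can be used to build a simple one. This article uses TensorFlow to train a seq2seq-based chatbot that learns from a corpus to answer questions.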

seq2seq

For how seq2seq works, see the earlier article 《深度學習的seq2seq模型》 (The seq2seq Model in Deep Learning).

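Recurrent neural networks

The seq2seq model is built on recurrent neural networks. The variants in common use today are the plain RNN, LSTM, and GRU. For how each works, see the earlier articles 《迴圈神經網路》 (Recurrent Neural Networks), 《LSTM神經網路》 (LSTM Neural Networks), and 《GRU神經網路》 (GRU Neural Networks).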

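Training set

The training data consists of QA pairs. Plenty of open datasets are available for download; here we simply pick a small set of questions and answers. The file format is: line 1 is a question, line 2 is its answer, line 3 is the next question, line 4 is its answer, and so on.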
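The alternating-line layout can be turned into explicit pairs before any filtering. The following is a minimal sketch; the function name pair_lines and the sample lines are illustrative assumptions, not part of the original project.

```python
def pair_lines(lines):
    """Group alternating question/answer lines into (question, answer) pairs."""
    # Drop blank lines and surrounding whitespace first.
    cleaned = [line.strip() for line in lines if line.strip()]
    # A trailing question without an answer is ignored.
    usable = len(cleaned) - len(cleaned) % 2
    return [(cleaned[i], cleaned[i + 1]) for i in range(0, usable, 2)]


# Hypothetical sample in the layout described above.
sample = [
    "how are you\n",
    "fine thank you\n",
    "what is your name\n",
    "i am a bot\n",
]
pairs = pair_lines(sample)
print(pairs)
```

Pairing up front makes an odd number of lines harmless, since an unanswered final question is simply dropped.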

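Data preprocessing

Training requires turning the text into numbers. The whole vocabulary can be represented by integer values from 0 to n, one value per word; the vocabulary size is defined by VOCAB_SIZE. We also set the minimum and maximum lengths of questions and of answers. In addition, four special symbols are defined: UNK, GO, EOS, and PAD. UNK stands for an unknown word, for example any word that falls outside the VOCAB_SIZE most frequent ones. GO marks the start of the decoder input, EOS marks the end of an answer, and PAD is used for padding: because all QA pairs are fed to the same seq2seq model, every input must have the same length and every output must have the same length, so shorter questions and answers are padded with PAD.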

limit = {
    'maxq': 10,
    'minq': 0,
    'maxa': 8,
    'mina': 3
}

UNK = 'unk'
GO = '<go>'
EOS = '<eos>'
PAD = '<pad>'
VOCAB_SIZE = 1000

Filter the pairs by the QA length limits.

def filter_data(sequences):
    """Keep only the QA pairs whose lengths fall inside `limit`.

    `sequences` alternates question, answer, question, answer, ...
    """
    filtered_q, filtered_a = [], []
    raw_data_len = len(sequences) // 2

    # Step by 2 so that i is a question and i + 1 is its answer.
    for i in range(0, raw_data_len * 2, 2):
        qlen, alen = len(sequences[i].split(' ')), len(sequences[i + 1].split(' '))
        if limit['minq'] <= qlen <= limit['maxq'] and limit['mina'] <= alen <= limit['maxa']:
            filtered_q.append(sequences[i])
            filtered_a.append(sequences[i + 1])

    # Report the share of pairs that were dropped.
    filt_data_len = len(filtered_q)
    filtered = int((raw_data_len - filt_data_len) * 100 / max(raw_data_len, 1))
    print(str(filtered) + '% filtered from original data')

    return filtered_q, filtered_a

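We also need the frequency of every word in the corpus, and we take the n most frequent words as the vocabulary; n is the VOCAB_SIZE defined above. Two lookups are needed as well: one from index to word and one from word to index.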

import itertools
import nltk


def index_(tokenized_sentences, vocab_size):
    # Count every word in the corpus and keep the vocab_size most frequent.
    freq_dist = nltk.FreqDist(itertools.chain(*tokenized_sentences))
    vocab = freq_dist.most_common(vocab_size)
    # The special tokens take indices 0-3: GO=0, EOS=1, UNK=2, PAD=3.
    index2word = [GO, EOS, UNK, PAD] + [x[0] for x in vocab]
    word2index = {w: i for i, w in enumerate(index2word)}
    return index2word, word2index, freq_dist
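To see what the two lookups contain, here is a small standard-library sketch of the same idea. It swaps nltk.FreqDist for collections.Counter, and the name build_vocab and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

GO, EOS, UNK, PAD = '<go>', '<eos>', 'unk', '<pad>'


def build_vocab(tokenized_sentences, vocab_size):
    """Return (index2word, word2index) with the special tokens first."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    most_common = [word for word, _ in counts.most_common(vocab_size)]
    index2word = [GO, EOS, UNK, PAD] + most_common
    word2index = {word: i for i, word in enumerate(index2word)}
    return index2word, word2index


# Toy corpus: 'how' occurs 3 times, 'are' twice, the rest once.
sentences = [['how', 'are', 'you'], ['how', 'are', 'things'], ['how']]
index2word, word2index = build_vocab(sentences, vocab_size=2)

print(index2word)
# A word outside the vocabulary falls back to the index of UNK.
print(word2index.get('you', word2index[UNK]))
```

The resulting list holds vocab_size + 4 entries, which is consistent with VOCAB_SIZE = 1000 and the value 1004 used for num_encoder_symbols in the prediction code.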

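As mentioned above, questions fed to the encoder differ in length, so the shorter ones are padded with PAD. For example, with a fixed length of 10, the question "how are you" becomes "how are you pad pad pad pad pad pad pad". The decoder input must start with GO and end with EOS, and is padded as well: "fine thank you" becomes "go fine thank you eos pad pad pad pad pad". The third sequence is the target. It is the decoder input shifted by one position: drop the leading go and append one more pad, giving "fine thank you eos pad pad pad pad pad pad".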

import numpy as np


def zero_pad(qtokenized, atokenized, w2idx):
    data_len = len(qtokenized)
    idx_q = np.zeros([data_len, limit['maxq']], dtype=np.int32)
    # +2 leaves room for '<go>' and '<eos>' in the decoder input, and for
    # '<eos>' plus one extra '<pad>' in the target.
    idx_a = np.zeros([data_len, limit['maxa'] + 2], dtype=np.int32)
    idx_o = np.zeros([data_len, limit['maxa'] + 2], dtype=np.int32)

    for i in range(data_len):
        # flag 1: encoder input, flag 2: decoder input, flag 3: target
        q_indices = pad_seq(qtokenized[i], w2idx, limit['maxq'], 1)
        a_indices = pad_seq(atokenized[i], w2idx, limit['maxa'], 2)
        o_indices = pad_seq(atokenized[i], w2idx, limit['maxa'], 3)
        idx_q[i] = np.array(q_indices)
        idx_a[i] = np.array(a_indices)
        idx_o[i] = np.array(o_indices)

    return idx_q, idx_a, idx_o


def pad_seq(seq, lookup, maxlen, flag):
    # Only the decoder input (flag 2) starts with '<go>'.
    indices = [lookup[GO]] if flag == 2 else []
    for word in seq:
        # Words outside the vocabulary map to UNK.
        indices.append(lookup.get(word, lookup[UNK]))
    padding = [lookup[PAD]] * (maxlen - len(seq))
    if flag == 1:
        return indices + padding
    elif flag == 2:
        return indices + [lookup[EOS]] + padding
    elif flag == 3:
        # One extra PAD keeps the target as long as the decoder input.
        return indices + [lookup[EOS]] + padding + [lookup[PAD]]
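The relation between the three sequences is easier to check on tokens than on indices. The sketch below rebuilds the "how are you" / "fine thank you" example with plain lists; make_sequences is a hypothetical helper written for this illustration and is not part of the original code.

```python
GO, EOS, PAD = '<go>', '<eos>', '<pad>'


def make_sequences(question, answer, max_q, max_a):
    """Return (encoder input, decoder input, target) as token lists."""
    encoder_input = question + [PAD] * (max_q - len(question))
    pad = [PAD] * (max_a - len(answer))
    decoder_input = [GO] + answer + [EOS] + pad
    # The target is the decoder input shifted left by one position,
    # with one extra PAD appended so both keep the same length.
    target = answer + [EOS] + pad + [PAD]
    return encoder_input, decoder_input, target


enc, dec, tgt = make_sequences(
    ['how', 'are', 'you'], ['fine', 'thank', 'you'], max_q=10, max_a=8)

print(enc)
print(dec)
print(tgt)
```

With max_q = 10 and max_a = 8, all three lists have length 10, which is why a single sequence_length can be used for every placeholder later on.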

The processed structures are then persisted to disk for use during training.

Building the graph

encoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
decoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
targets = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
weights = tf.placeholder(dtype=tf.float32, shape=[batch_size, sequence_length])

Four placeholders are created: the encoder input, the decoder input, the decoder target, and the weights. batch_size is the number of samples in one batch, and sequence_length is the sequence length we defined.

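cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
# Note: since TensorFlow 1.2, [cell] * num_layers makes all layers share one
# set of parameters; build one BasicLSTMCell per layer for separate weights.
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers)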

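This builds the recurrent network. We use LSTM cells; hidden_size is the number of hidden units in each LSTM cell. MultiRNNCell stacks cells into a deeper network, and num_layers is the number of LSTM layers.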

results, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
    tf.unstack(encoder_inputs, axis=1),
    tf.unstack(decoder_inputs, axis=1),
    cell,
    num_encoder_symbols,
    num_decoder_symbols,
    embedding_size,
    feed_previous=False
)

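We build the seq2seq structure with embedding_rnn_seq2seq, which TensorFlow provides ready-made. We could also assemble the encoder and decoder ourselves from LSTM cells, but using embedding_rnn_seq2seq directly is more convenient. tf.unstack splits encoder_inputs and decoder_inputs into lists with one tensor per time step. num_encoder_symbols and num_decoder_symbols are the vocabulary sizes, and embedding_size is the dimension of the embedding vectors. The feed_previous argument is important. False means training: the decoder reads decoder_inputs at every time step. True means prediction: there is no real decoder input, so only the first element of decoder_inputs (the GO symbol) is used, and each later step takes the decoder's output from the previous step as its input.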
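The difference between the two feed_previous modes can be shown without TensorFlow. In the sketch below the decoder is replaced by a hypothetical lookup table (TOY_MODEL) that maps the previous token to the next one; a real decoder also carries hidden state, so this only illustrates where each step's input comes from.

```python
# Hypothetical stand-in for a trained decoder step. It is deliberately
# imperfect: after '<go>' it predicts 'good' instead of 'fine'.
TOY_MODEL = {
    '<go>': 'good', 'fine': 'thank', 'thank': 'you', 'you': '<eos>',
    'good': 'morning', 'morning': '<eos>', '<eos>': '<pad>',
}


def toy_step(previous_token):
    return TOY_MODEL.get(previous_token, '<pad>')


def decode_teacher_forced(step, decoder_inputs):
    """feed_previous=False: every step sees the ground-truth previous token."""
    return [step(token) for token in decoder_inputs]


def decode_feed_previous(step, go_token, max_len):
    """feed_previous=True: every step sees the model's own previous output."""
    outputs, token = [], go_token
    for _ in range(max_len):
        token = step(token)
        outputs.append(token)
    return outputs


forced = decode_teacher_forced(toy_step, ['<go>', 'fine', 'thank', 'you'])
free = decode_feed_previous(toy_step, '<go>', max_len=4)
print(forced)  # ['good', 'thank', 'you', '<eos>']
print(free)    # ['good', 'morning', '<eos>', '<pad>']
```

In the teacher-forced run the first mistake stays local because the next input is still the correct word; in the free-running run the mistake becomes the next input and the rest of the answer follows from it.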

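# Stack the per-step outputs into [batch_size, sequence_length, vocab_size].
logits = tf.stack(results, axis=1)
loss = tf.contrib.seq2seq.sequence_loss(logits, targets=targets, weights=weights)
# The predicted word at each step is the index with the largest logit.
pred = tf.argmax(logits, axis=2)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)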

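Next, sequence_loss builds the loss from the output of embedding_rnn_seq2seq. The same output is used for prediction: at each time step, the index of the largest value is the index of the predicted word in the vocabulary. The optimizer is AdamOptimizer.

Creating the session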
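The role of the weights argument is easiest to see in a small hand computation. The article does not show how the weights are built; one common choice, assumed here, is to give PAD positions weight 0 so that padding does not contribute to the loss. The sketch works on probabilities (after softmax) rather than logits and approximates a weighted average of the cross-entropy over all positions; the names make_weights and weighted_sequence_loss are illustrative.

```python
import math

PAD_ID = 3  # index of '<pad>' when the vocabulary starts with GO, EOS, UNK, PAD


def make_weights(targets, pad_id=PAD_ID):
    """1.0 for real tokens, 0.0 for padding (an assumed masking scheme)."""
    return [[0.0 if t == pad_id else 1.0 for t in row] for row in targets]


def weighted_sequence_loss(probs, targets, weights):
    """Weighted average of -log p(target) over all batch and time positions.

    probs[b][t] is a probability distribution over the vocabulary.
    """
    total, weight_sum = 0.0, 0.0
    for b, row in enumerate(targets):
        for t, target in enumerate(row):
            total += weights[b][t] * -math.log(probs[b][t][target])
            weight_sum += weights[b][t]
    return total / weight_sum


# One sample, three time steps, vocabulary of five words.
targets = [[4, 1, 3]]
probs = [[
    [0.125, 0.125, 0.125, 0.125, 0.5],  # p(target=4) = 0.5
    [0.25, 0.25, 0.25, 0.125, 0.125],   # p(target=1) = 0.25
    [0.3, 0.3, 0.2, 0.1, 0.1],          # PAD position, masked out
]]
weights = make_weights(targets)
loss = weighted_sequence_loss(probs, targets, weights)
print(weights, loss)
```

Only the first two positions count here, so the result is the mean of -log 0.5 and -log 0.25 regardless of what the model predicts at the padded step.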

saver = tf.train.Saver()
# Assumption: train_weights is not defined in the original listing; uniform
# weights are used here. Setting PAD positions to 0 is a common alternative.
train_weights = np.ones([batch_size, sequence_length], dtype=np.float32)
# Load the preprocessed arrays once instead of on every step.
train_x, train_y, train_target = loadQA()

with tf.Session() as sess:
    # Resume from the latest checkpoint if there is one.
    ckpt = tf.train.get_checkpoint_state(model_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())
    epoch = 0
    while epoch < 5000000:
        epoch = epoch + 1
        print("epoch:", epoch)
        # A single step per epoch: the whole data set is one batch.
        for step in range(0, 1):
            print("step:", step)
            start = step * batch_size
            train_encoder_inputs = train_x[start:start + batch_size, :]
            train_decoder_inputs = train_y[start:start + batch_size, :]
            train_targets = train_target[start:start + batch_size, :]
            # Run the update and fetch the loss in one pass.
            _, cost = sess.run([train_op, loss],
                               feed_dict={encoder_inputs: train_encoder_inputs, targets: train_targets,
                                          weights: train_weights, decoder_inputs: train_decoder_inputs})
            print(cost)
        if epoch % 100 == 0:
            saver.save(sess, model_dir + '/model.ckpt', global_step=epoch + 1)

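With the session created, training starts. A tf.train.Saver object saves and restores the model; saving a checkpoint at regular intervals is a sensible precaution, because a restarted run then continues training instead of starting from scratch. Since this is only an example and there are few QA pairs, all of them are fed in as a single batch rather than split into several batches.

Prediction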
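For a larger corpus the data would be split into several batches per epoch. A minimal sketch of such a splitter follows; iter_batches is a hypothetical helper, and it yields only full batches because the placeholders are declared with a fixed batch_size.

```python
def iter_batches(encoder_data, decoder_data, target_data, batch_size):
    """Yield aligned, full-size slices of the three data sets."""
    num_batches = len(encoder_data) // batch_size
    for step in range(num_batches):
        start = step * batch_size
        end = start + batch_size
        yield encoder_data[start:end], decoder_data[start:end], target_data[start:end]


# Toy data: five samples, identified by their row number.
enc_rows = [[i] for i in range(5)]
dec_rows = [[i * 10] for i in range(5)]
tgt_rows = [[i * 100] for i in range(5)]

batches = list(iter_batches(enc_rows, dec_rows, tgt_rows, batch_size=2))
for enc_batch, dec_batch, tgt_batch in batches:
    print(enc_batch, dec_batch, tgt_batch)
```

The slicing works the same way on lists and on NumPy arrays, so the inner training loop could iterate over this generator in place of range(0, 1). The leftover fifth sample is skipped in this sketch.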

import numpy as np
import tensorflow as tf

with tf.device('/cpu:0'):
    # These hyperparameters must match the ones used for training.
    batch_size = 1
    sequence_length = 10
    num_encoder_symbols = 1004  # VOCAB_SIZE + 4 special tokens
    num_decoder_symbols = 1004
    embedding_size = 256
    hidden_size = 256
    num_layers = 2

    encoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
    decoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])

    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers)

    results, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        tf.unstack(encoder_inputs, axis=1),
        tf.unstack(decoder_inputs, axis=1),
        cell,
        num_encoder_symbols,
        num_decoder_symbols,
        embedding_size,
        feed_previous=True,
    )
    logits = tf.stack(results, axis=1)
    pred = tf.argmax(logits, axis=2)

    saver = tf.train.Saver()
    with tf.Session() as sess:
        module_file = tf.train.latest_checkpoint('./model/')
        saver.restore(sess, module_file)
        # Word_Id_Map is the project's word/id lookup helper (not shown here).
        word_map = Word_Id_Map()
        encoder_input = word_map.sentence2ids(['you', 'want', 'to', 'turn', 'twitter', 'followers', 'into', 'blog', 'readers'])

        # Pad the question to sequence_length with 3, the index of PAD.
        encoder_input = encoder_input + [3] * (sequence_length - len(encoder_input))
        encoder_input = np.asarray([encoder_input])
        # All zeros: 0 is the index of GO, and with feed_previous=True only
        # the first decoder input is used.
        decoder_input = np.zeros([1, sequence_length], dtype=np.int32)
        print('encoder_input : ', encoder_input)
        print('decoder_input : ', decoder_input)
        pred_value = sess.run(pred, feed_dict={encoder_inputs: encoder_input, decoder_inputs: decoder_input})
        print(pred_value)
        sentence = word_map.ids2sentence(pred_value[0])
        print(sentence)

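The prediction stage builds the same model, loads the checkpoint saved during training, and then predicts the answer to a question. Running on the CPU is sufficient here; a GPU is not needed. The graph is built much as in training, with the same parameters. The difference is that feed_previous in embedding_rnn_seq2seq is set to True, because there is no real decoder input any more. The loss function and optimizer are not needed either; only the prediction op is required.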

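After the session is created, the model under the model directory is loaded, the test question is converted to a vector of ids, and the prediction is run. The output is:
['how', 'do', 'you', 'do', 'this', '', '', '', '', '']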
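The Word_Id_Map helper used in the prediction code is not shown in the article. The class below is a hypothetical stand-in that illustrates what sentence2ids and ids2sentence could look like; rendering GO, EOS, and PAD as empty strings is an assumption made to resemble the output above, and the tiny vocabulary is invented for the example.

```python
class ToyWordIdMap:
    """Hypothetical word/id lookup, not the project's actual Word_Id_Map."""

    HIDDEN = ('<go>', '<eos>', '<pad>')

    def __init__(self, index2word):
        self.index2word = list(index2word)
        self.word2index = {w: i for i, w in enumerate(self.index2word)}

    def sentence2ids(self, words):
        # Unknown words fall back to the index of 'unk'.
        unk = self.word2index['unk']
        return [self.word2index.get(w, unk) for w in words]

    def ids2sentence(self, ids):
        # Control tokens are rendered as empty strings.
        words = [self.index2word[i] for i in ids]
        return ['' if w in self.HIDDEN else w for w in words]


toy_map = ToyWordIdMap(['<go>', '<eos>', 'unk', '<pad>', 'how', 'do', 'you', 'this'])
ids = toy_map.sentence2ids(['how', 'do', 'you', 'do', 'that'])
sentence = toy_map.ids2sentence([4, 5, 6, 5, 7, 1, 3, 3, 3, 3])
print(ids)
print(sentence)
```

Under these assumptions, the trailing empty strings in a decoded answer correspond to the EOS token and the padding that follows it.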

GitHub

github.com/sea-boat/se…
