Building a Chatbot with TensorFlow

Posted by Andrew.Hann on 2017-02-22

Contents

0. Preface
1. Training corpus
2. Data preprocessing
3. Converting vocabulary to vectors
4. Training
5. Chatbot - verifying the results

 

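0. Preface

Machine learning algorithms are not my specialty. Three months ago I started catching up on a pile of basic concepts: neural networks, convolution, and so on. Honestly, it is fairly complicated, but once the underlying math made sense, I found I could stumble my way through the TensorFlow API and the Python code, and even understand a little of what is going on behind it.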

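0x1: Model categories

1. Retrieval-based models vs. generative models

A retrieval-based model has a predefined "repository" of responses, plus heuristic rules for picking a suitable response given the input question and its context. These heuristics can be as simple as rule-based expression matching or as complex as an ensemble of machine learning classifiers. A retrieval-based model does not generate new text; it can only pick a reasonably suitable response from the predefined repository.
A generative model does not rely on a predefined repository; it generates a new response. The classic generative models build on machine translation techniques, except that instead of translating one language into another, they "translate" a question into a response.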
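To make the retrieval-based idea concrete, here is a minimal sketch, assuming a tiny hand-written repository and a simple word-overlap heuristic. The repository contents and the function names are made up for illustration; a real system would use much better matching.

```python
# Toy retrieval-based responder: it never generates text, it only picks
# the stored reply whose question best matches the input.
# The repository below is invented for illustration.
REPOSITORY = [
    ("how do i reset my password", "Open Settings and choose 'Reset password'."),
    ("what are your opening hours", "We are open 9:00 to 18:00 on weekdays."),
    ("where is my order", "You can track your order on the Orders page."),
]


def overlap_score(query, candidate):
    """Jaccard overlap between the word sets of two sentences."""
    q = set(query.lower().split())
    c = set(candidate.lower().split())
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def retrieve(query):
    """Return the canned response whose stored question scores highest."""
    _, best_response = max(REPOSITORY, key=lambda qa: overlap_score(query, qa[0]))
    return best_response


print(retrieve("I forgot my password, how do I reset it"))
```

A generative model would replace `retrieve` with a model that emits the reply token by token, which is what the seq2seq network later in this post does.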

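2. Long-conversation models vs. short-conversation models

A short conversation is a single-turn exchange: one question, one answer. For example, when the machine receives a question from the user, it returns a suitable answer. A long conversation, by contrast, is a multi-turn back-and-forth, such as two friends exchanging views on some topic. In that setting both parties (one of which may be the chatbot) need to remember what has been discussed so far, which is one of the differences from the short-conversation setting. At present, customer-service bot systems are usually long-conversation models.

3. Open-domain models vs. closed-domain models

In an open-domain setting the user can say anything, and the input does not have to be a query with a specific goal or intent. Conversations on social networks such as Twitter and Reddit are typical open-domain scenarios. Because the number of possible topics is unlimited and some common sense is needed as a basis for the chat, building such a chatbot is relatively hard.
A closed-domain setting, also called goal-driven, is one where the system focuses on problems in a specific field, so the number of possible questions and answers is relatively limited. Technical support systems and shopping assistants are examples of closed-domain models. We do not require these systems to discuss politics; they only need to solve our problem as effectively as possible. Users can still ask them off-topic questions, but the system is equally free to give an off-topic reply ;)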

Relevant Link:

http://naturali.io/deeplearning/chatbot/introduction/2016/04/28/chatbot-part1.html
http://blog.topspeedsnail.com/archives/10735/comment-page-1#comment-1161
http://blog.csdn.net/malefactor/article/details/51901115

 

1. Training corpus

wget https://raw.githubusercontent.com/rustch3n/dgk_lost_conv/master/dgk_shooter_min.conv.zip
Unzip it:
unzip dgk_shooter_min.conv.zip

Relevant Link:

https://github.com/rustch3n/dgk_lost_conv

 

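2. Data preprocessing

Typically the base corpus we get is something like movie dialogue, or the Ubuntu Dialog Corpus, but in nearly all cases we need to go through the following steps:

1. Tokenization (tokenized)
2. Stemming of English words (stemmed)
3. Grouping the inflected forms of English words (lemmatized), for example singular and plural
4. In addition, proper nouns such as person names, place names, organization names, URLs and system paths can be replaced with a uniform type identifier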
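As a rough illustration of steps 1 and 4, the sketch below lowercases a sentence, replaces URLs and Unix-style paths with type tokens, and splits on whitespace. The token names `__URL__` and `__PATH__` and both regular expressions are my own assumptions, not something the corpus or TensorFlow prescribes. Stemming and lemmatization are left out because they normally need an NLP library.

```python
import re

# Hypothetical type tokens and patterns, chosen only for this example.
URL_RE = re.compile(r"https?://\S+")
PATH_RE = re.compile(r"(?<!\S)/(?:[\w.-]+/)*[\w.-]+")


def normalize(text):
    """Lowercase, replace URLs and paths with type tokens, then tokenize."""
    text = text.lower()                  # lowercase first so the tokens stay uppercase
    text = URL_RE.sub("__URL__", text)   # URLs before paths, since URLs contain slashes
    text = PATH_RE.sub("__PATH__", text)
    return text.split()                  # whitespace split as a minimal tokenizer


print(normalize("See https://example.com/docs or edit /etc/hosts first"))
```

For the Chinese subtitle corpus used in this post none of this is needed, because the code below simply treats every character as one token.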

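M marks an utterance and E marks a separator. On an M line, the utterance is appended to a temporary conversation. An E line means there was a break or the speakers changed, so the whole temporary conversation is flushed into the overall list convs, one conversation at a time. You can think of E as the director shouting "cut" on a film set.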

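conv_path = 'dgk_shooter_min.conv'  # assumed name of the unzipped corpus file

convs = []  # all conversations
with open(conv_path, encoding="utf8") as f:
    one_conv = []  # the conversation currently being collected
    for line in f:
        # drop the newline and the '/' separators between characters
        line = line.strip('\n').replace('/', '')
        if line == '':
            continue
        if line[0] == 'E':
            # separator: flush the collected conversation
            if one_conv:
                convs.append(one_conv)
            one_conv = []
        elif line[0] == 'M':
            # utterance: keep the text after the "M " prefix, skip empty M lines
            parts = line.split(' ')
            if len(parts) > 1:
                one_conv.append(parts[1])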

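Since the use case is a chatbot, and film and TV dialogue also alternates one line per speaker, two special cases have to be handled: a conversation with only a single line (a question with no answer), and a conversation with an odd number of lines, where the last speaker's question never got a reply. The first is skipped and the second has its last line dropped.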

# Split every conversation into question/answer pairs
ask = []  # ask
response = []  # answers
for conv in convs:
    if len(conv) == 1:
        continue
    if len(conv) % 2 != 0:
        conv = conv[:-1]
    for i in range(len(conv)):
        if i % 2 == 0:
            ask.append(conv[i])
        else:
            response.append(conv[i])
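To check my understanding of the two steps above, here is a small self-contained sketch that runs the same logic on an in-memory sample. The sample text is made up (it only imitates the M/E layout), and the final flush of a conversation that is not followed by E is my own addition.

```python
import io

# Invented sample in the corpus layout: "M <chars separated by />" and "E".
SAMPLE = "E\nM h/i\nM h/e/l/l/o\nM b/y/e\nE\nM o/k\nE\nM a\nM b\n"


def parse_conversations(lines):
    """Group M lines into conversations, split at E lines."""
    convs, one_conv = [], []
    for line in lines:
        line = line.strip("\n").replace("/", "")
        if not line:
            continue
        if line[0] == "E":
            if one_conv:
                convs.append(one_conv)
            one_conv = []
        elif line[0] == "M":
            parts = line.split(" ")
            if len(parts) > 1:
                one_conv.append(parts[1])
    if one_conv:  # assumption: also keep a trailing conversation with no closing E
        convs.append(one_conv)
    return convs


def to_pairs(convs):
    """Even positions become questions, odd positions become answers."""
    ask, response = [], []
    for conv in convs:
        if len(conv) < 2:
            continue              # a lone line has no answer
        if len(conv) % 2 != 0:
            conv = conv[:-1]      # drop the unanswered last line
        ask.extend(conv[0::2])
        response.extend(conv[1::2])
    return ask, response


convs = parse_conversations(io.StringIO(SAMPLE))
ask, response = to_pairs(convs)
print(convs)
print(ask, response)
```

In this sample the three-line conversation loses its last line and the one-line conversation is skipped, so `ask` and `response` always end up the same length.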


 

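3. Converting vocabulary to vectors

Image recognition and speech recognition were among the first fields where deep learning achieved major success. One reason is that the raw input data in these two fields carries strong correlation between samples: for example, the pixel weight distribution is largely consistent across different images of the same kind of object. This is essentially the same mechanism the human brain uses to recognize objects of the same class, the ability to generalize from one example to many. Likewise, the more text we have studied, the more able we are to master, and even invent, new ways of combining words and write elegant prose.

The input data in NLP or semantic understanding, that is dialogue or corpora, usually lacks this kind of strong correlation. That is why a conceptual model is introduced, called word vectors (word2vec) or, for phrases, sequence models (seq2seq). Put simply, the words in the corpus are mapped into a vector space, and the arrangement of the vectors is determined by grammar and by semantic context. For example, "中國 -> 人" (the character 人, "person", is very likely to follow 中國, "China"), or "你今年幾歲了 -> 我 ** 歲了" ("How old are you this year? -> I am ** years old").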
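A toy picture of what "mapping words into a vector space" buys us: related words end up close together. The three vectors below are made-up numbers for illustration only; real embeddings are learned from the corpus and have far more dimensions.

```python
import numpy as np

# Invented 3-dimensional "embeddings", not learned from any data.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print("king vs queen:", cosine(embeddings["king"], embeddings["queen"]))
print("king vs apple:", cosine(embeddings["king"], embeddings["apple"]))
```

With these numbers the first similarity comes out much higher than the second, which is the kind of structure a trained embedding is supposed to capture.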

0x1: Tokenization and word encoding

Split every dialogue file in the training set into individual characters to form a vocabulary (word table).

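# Special tokens placed at the head of the vocabulary. This definition is missing
# from the listing; the values are assumed, ordered to match PAD_ID=0, GO_ID=1,
# EOS_ID=2, UNK_ID=3 in the training script.
START_VOCABULART = ['__PAD__', '__GO__', '__EOS__', '__UNK__']


def gen_vocabulary_file(input_file, output_file):
    vocabulary = {}  # character -> occurrence count
    with open(input_file, encoding="utf8") as f:
        for line in f:
            # treat every character of the line as one token
            for word in line.strip():
                vocabulary[word] = vocabulary.get(word, 0) + 1
    # special tokens first, then characters from most to least frequent
    vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
    # keep only the first 10000 entries
    if len(vocabulary_list) > 10000:
        vocabulary_list = vocabulary_list[:10000]
    print(input_file + " vocabulary size:", len(vocabulary_list))
    with open(output_file, "w", encoding="utf8") as ff:
        for word in vocabulary_list:
            ff.write(word + "\n")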

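Once tokenization is done, each token needs a numeric code to make the later vector-space processing easier. The core idea is as follows:

The dialogues in our training corpus are strongly related to each other, so the tokens in a vocabulary built from them are logically related as well. We simply number the tokens in the vocabulary's native order (here, descending frequency) and treat the resulting [word, id] table as a vocabulary that carries this relatedness into the vector space.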

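UNK_ID = 3  # id for characters missing from the vocabulary, as in the training script


def convert_conversation_to_vector(input_file, vocabulary_file, output_file):
    # load the vocabulary: one token per line, the line index is the token id
    with open(vocabulary_file, "r", encoding="utf8") as f:
        tmp_vocab = [line.strip() for line in f]
    vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
    # The original listing stops after building vocab (it only printed the entries).
    # The loop below is an assumed completion: one line of ids per input sentence.
    with open(input_file, "r", encoding="utf8") as f, open(output_file, "w") as out:
        for line in f:
            line_vec = [vocab.get(word, UNK_ID) for word in line.strip()]
            out.write(" ".join(str(num) for num in line_vec) + "\n")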

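The vocabulary derived from the training corpus therefore serves as the basis for vectorizing both the training dialogues and the test dialogues. Our goal is to map the questions and answers of every dialogue (training set and test set) into the vector space.

For example, a character at position 968 in the training vocabulary is simply assigned the code 968.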
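The following sketch shows the position-as-code idea end to end on a tiny made-up corpus: build a frequency-ordered vocabulary, encode a sentence to ids, and decode it back. The helper names and the special-token list are assumptions for illustration. Ties in frequency are broken alphabetically here only to make the result deterministic.

```python
# Assumed special tokens; their positions double as PAD/GO/EOS/UNK ids.
START_VOCAB = ["__PAD__", "__GO__", "__EOS__", "__UNK__"]
UNK_ID = 3


def build_vocab(sentences, max_size=10000):
    """Character vocabulary ordered by descending frequency."""
    counts = {}
    for sentence in sentences:
        for ch in sentence:
            counts[ch] = counts.get(ch, 0) + 1
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return (START_VOCAB + ordered)[:max_size]


def encode(sentence, word_to_id):
    """Map each character to its position in the vocabulary (UNK_ID if unseen)."""
    return [word_to_id.get(ch, UNK_ID) for ch in sentence]


def decode(ids, id_to_word):
    """Inverse mapping: ids back to characters."""
    return "".join(id_to_word[i] for i in ids)


vocab_list = build_vocab(["abba", "abc"])
word_to_id = {word: idx for idx, word in enumerate(vocab_list)}
ids = encode("cabz", word_to_id)
print(vocab_list)
print(ids)
print(decode(ids[:3], vocab_list))
```

Note that the unseen character "z" falls back to `UNK_ID`, which is why the chatbot cannot echo characters it never saw during training.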

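0x2: Converting dialogues to vectors

The original author trimmed the vocabulary and kept only the top 5000 tokens. After thinking it over, though, I feel the root of the problem is that the training corpus is not rich enough to cover every conversational scenario.

This step yields a set of ask/answer sentences as id sequences. For the training set, we establish the mapping between each ask and its answer.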
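The training script later reads separate train and test files (train_ask.vec, test_ask.vec and so on), so the ask/answer pairs have to be split at some point. How the split was done is not shown in this post; the sketch below is one plausible way, assuming a random hold-out with a fixed seed. The function name and parameters are hypothetical.

```python
import random


def split_pairs(ask, response, test_size, seed=0):
    """Randomly hold out test_size aligned (ask, response) pairs."""
    assert len(ask) == len(response), "ask and response must stay aligned"
    rng = random.Random(seed)                       # fixed seed for repeatability
    test_idx = set(rng.sample(range(len(ask)), test_size))
    pairs = list(zip(ask, response))
    train = [p for i, p in enumerate(pairs) if i not in test_idx]
    test = [p for i, p in enumerate(pairs) if i in test_idx]
    return train, test


train, test = split_pairs(["a1", "a2", "a3", "a4"], ["r1", "r2", "r3", "r4"], test_size=1)
print(len(train), len(test))
```

The important property is that a question and its answer always land in the same split and at the same line number of their respective files.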


 

4. Training

0x1: Sequence-to-sequence basics

A basic sequence-to-sequence model, as introduced in Cho et al., 2014, consists of two recurrent neural networks (RNNs): an encoder that processes the input and a decoder that generates the output. This basic architecture is depicted below.

Each box in the picture above represents a cell of the RNN, most commonly a GRU cell or an LSTM cell. Encoder and decoder can share weights or, as is more common, use a different set of parameters. Multi-layer cells have been successfully used in sequence-to-sequence models too.
In the basic model depicted above, every input has to be encoded into a fixed-size state vector, as that is the only thing passed to the decoder. To allow the decoder more direct access to the input, an attention mechanism was introduced in Bahdanau et al., 2014; suffice it to say that it allows the decoder to peek into the input at every decoding step. A multi-layer sequence-to-sequence network with LSTM cells and attention mechanism in the decoder looks like this.
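To give a feel for what "peeking into the input" means, here is a small NumPy sketch of one attention step. It uses plain dot-product scoring as a simplification; Bahdanau et al. score with a small learned network instead, and the TensorFlow model has its own implementation. The shapes and random values are arbitrary.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend(decoder_state, encoder_states):
    """One attention step: weight every encoder state by its relevance.

    decoder_state:  shape (hidden,)
    encoder_states: shape (input_length, hidden)
    """
    scores = np.dot(encoder_states, decoder_state)  # one score per input position
    weights = softmax(scores)                       # positive, sums to 1
    context = np.dot(weights, encoder_states)       # weighted average of encoder states
    return weights, context


rng = np.random.RandomState(0)
encoder_states = rng.randn(5, 4)  # 5 input positions, hidden size 4
decoder_state = rng.randn(4)
weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

The context vector is recomputed at every decoding step, so the decoder is no longer limited to the single fixed-size state handed over by the encoder.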

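0x2: Training process

The ask/answer training set is fed into a three-layer network (num_layers = 3), and TensorFlow adjusts the weights automatically through backpropagation (BP). The ask/answer test set is used to evaluate the model every 500 steps. The result is a model that, given an ask, produces the answer ("ask -> ?").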

# -*- coding: utf-8 -*-

import tensorflow as tf  # 0.12
from tensorflow.models.rnn.translate import seq2seq_model
import os
import numpy as np
import math

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

# ask/answer conversation vector file
train_ask_vec_file = 'train_ask.vec'
train_answer_vec_file = 'train_answer.vec'
test_ask_vec_file = 'test_ask.vec'
test_answer_vec_file = 'test_answer.vec'

# word table 6000
vocabulary_ask_size = 6000
vocabulary_answer_size = 6000

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
layer_size = 256
num_layers = 3
batch_size = 64


# read the ask/answer .vec files into memory, grouped by bucket
def read_data(source_path, target_path, max_size=None):
    data_set = [[] for _ in buckets]
    with tf.gfile.GFile(source_path, mode="r") as source_file:
        with tf.gfile.GFile(target_path, mode="r") as target_file:
            source, target = source_file.readline(), target_file.readline()
            counter = 0
            while source and target and (not max_size or counter < max_size):
                counter += 1
                source_ids = [int(x) for x in source.split()]
                target_ids = [int(x) for x in target.split()]
                target_ids.append(EOS_ID)
                for bucket_id, (source_size, target_size) in enumerate(buckets):
                    if len(source_ids) < source_size and len(target_ids) < target_size:
                        data_set[bucket_id].append([source_ids, target_ids])
                        break
                source, target = source_file.readline(), target_file.readline()
    return data_set

if __name__ == '__main__':
    model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_ask_size,
                                       target_vocab_size=vocabulary_answer_size,
                                       buckets=buckets, size=layer_size, num_layers=num_layers, max_gradient_norm=5.0,
                                       batch_size=batch_size, learning_rate=0.5, learning_rate_decay_factor=0.97,
                                       forward_only=False)

    config = tf.ConfigProto()
    config.gpu_options.allocator_type = 'BFC'  # best-fit-with-coalescing GPU allocator, to reduce out-of-memory errors

    with tf.Session(config=config) as sess:
        # restore the previous training run if a checkpoint exists
        ckpt = tf.train.get_checkpoint_state('.')
        if ckpt is not None:
            print(ckpt.model_checkpoint_path)
            model.saver.restore(sess, ckpt.model_checkpoint_path)
        else:
            sess.run(tf.global_variables_initializer())

        train_set = read_data(train_ask_vec_file, train_answer_vec_file)
        test_set = read_data(test_ask_vec_file, test_answer_vec_file)

        train_bucket_sizes = [len(train_set[b]) for b in range(len(buckets))]
        train_total_size = float(sum(train_bucket_sizes))
        train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size for i in range(len(train_bucket_sizes))]

        loss = 0.0
        total_step = 0
        previous_losses = []
        # train indefinitely; the model is saved every 500 steps
        while True:
            random_number_01 = np.random.random_sample()
            bucket_id = min([i for i in range(len(train_buckets_scale)) if train_buckets_scale[i] > random_number_01])

            encoder_inputs, decoder_inputs, target_weights = model.get_batch(train_set, bucket_id)
            _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, False)

            loss += step_loss / 500
            total_step += 1

            print(total_step)
            if total_step % 500 == 0:
                print(model.global_step.eval(), model.learning_rate.eval(), loss)

                # if the loss is worse than the last three recorded losses, decay the learning rate
                if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
                    sess.run(model.learning_rate_decay_op)
                previous_losses.append(loss)
                # save model
                checkpoint_path = "chatbot_seq2seq.ckpt"
                model.saver.save(sess, checkpoint_path, global_step=model.global_step)
                loss = 0.0
                # evaluate the model on the test set, one batch per non-empty bucket
                for bucket_id in range(len(buckets)):
                    if len(test_set[bucket_id]) == 0:
                        continue
                    encoder_inputs, decoder_inputs, target_weights = model.get_batch(test_set, bucket_id)
                    _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
                    eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
                    print(bucket_id, eval_ppx)
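The buckets deserve a closer look. The sketch below repeats the bucket test from read_data and then pads one pair to the bucket size. The padding layout (encoder input padded and reversed, decoder input prefixed with GO) is what I believe the tutorial model's get_batch does; treat it as an assumption rather than a copy of the library code. The helper names are my own.

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]


def choose_bucket(source_ids, target_ids):
    """Smallest bucket that fits the pair (target_ids already ends with EOS_ID)."""
    for bucket_id, (source_size, target_size) in enumerate(buckets):
        if len(source_ids) < source_size and len(target_ids) < target_size:
            return bucket_id
    return None  # too long for every bucket: read_data silently drops such pairs


def pad_pair(source_ids, target_ids, bucket_id):
    """Pad one pair to the fixed lengths of its bucket."""
    source_size, target_size = buckets[bucket_id]
    # encoder input: pad on the right, then reverse the whole sequence
    encoder_input = list(reversed(source_ids + [PAD_ID] * (source_size - len(source_ids))))
    # decoder input: GO symbol, the target, then padding up to target_size
    decoder_input = [GO_ID] + target_ids + [PAD_ID] * (target_size - len(target_ids) - 1)
    return encoder_input, decoder_input


source, target = [4, 5, 6], [7, 8, EOS_ID]
bucket_id = choose_bucket(source, target)
print(bucket_id, pad_pair(source, target, bucket_id))
```

Because every batch comes from a single bucket, all sequences in a batch share the same length, and short sentences are not padded all the way up to 40/50 tokens.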

Relevant Link:

https://www.tensorflow.org/tutorials/seq2seq
http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/

 

5. Chatbot - verifying the results

# -*- coding: utf-8 -*-

import tensorflow as tf  # 0.12
from tensorflow.models.rnn.translate import seq2seq_model
import os
import sys
import locale
import numpy as np

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

train_ask_vocabulary_file = "train_ask_vocabulary.vec"
train_answer_vocabulary_file = "train_answer_vocabulary.vec"


def read_vocabulary(input_file):
    tmp_vocab = []
    with open(input_file, "r") as f:
        tmp_vocab.extend(f.readlines())
    tmp_vocab = [line.strip() for line in tmp_vocab]
    vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
    return vocab, tmp_vocab


if __name__ == '__main__':
    vocab_en, _, = read_vocabulary(train_ask_vocabulary_file)
    _, vocab_de, = read_vocabulary(train_answer_vocabulary_file)

    # word table 6000
    vocabulary_ask_size = 6000
    vocabulary_answer_size = 6000

    buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
    layer_size = 256
    num_layers = 3
    batch_size = 1

    model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_ask_size,
                                       target_vocab_size=vocabulary_answer_size,
                                       buckets=buckets, size=layer_size, num_layers=num_layers, max_gradient_norm=5.0,
                                       batch_size=batch_size, learning_rate=0.5, learning_rate_decay_factor=0.99,
                                       forward_only=True)
    model.batch_size = 1

    with tf.Session() as sess:
        # restore last train
        ckpt = tf.train.get_checkpoint_state('.')
        if ckpt is not None:
            print(ckpt.model_checkpoint_path)
            model.saver.restore(sess, ckpt.model_checkpoint_path)
        else:
            # without a checkpoint the variables are uninitialized, so stop here
            print("model not found")
            sys.exit(1)

        while True:
            # Python 3: input() already returns a decoded str
            input_string = input('me > ').strip()
            # quit
            if input_string == 'quit':
                sys.exit()

            # convert the user's input to vector
            input_string_vec = []
            for words in input_string.strip():
                input_string_vec.append(vocab_en.get(words, UNK_ID))
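            # pick the smallest bucket that fits; skip inputs longer than the largest bucket
            fitting = [b for b in range(len(buckets)) if buckets[b][0] > len(input_string_vec)]
            if not fitting:
                print('AI > (input too long)')
                continue
            bucket_id = min(fitting)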
            encoder_inputs, decoder_inputs, target_weights = model.get_batch({bucket_id: [(input_string_vec, [])]},
                                                                             bucket_id)
            _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
            outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
            if EOS_ID in outputs:
                outputs = outputs[:outputs.index(EOS_ID)]

            response = "".join([tf.compat.as_str(vocab_de[output]) for output in outputs])
            print('AI > ' + response)
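The last few lines of the loop are a greedy decoder: at each output step take the most likely token, stop at the first EOS, and join the characters. Here is the same logic isolated from TensorFlow, assuming each logit has shape (1, vocabulary size) as it does with batch_size = 1. The tiny vocabulary and the one-hot "logits" are fabricated for the demo.

```python
import numpy as np

EOS_ID = 2


def greedy_decode(output_logits, id_to_word):
    """Pick the argmax token at every step and cut the reply at the first EOS."""
    ids = [int(np.argmax(logit, axis=1)[0]) for logit in output_logits]
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return "".join(id_to_word[i] for i in ids)


# Fabricated vocabulary and logits, only to exercise the function.
id_to_word = ["__PAD__", "__GO__", "__EOS__", "__UNK__", "h", "i"]


def one_hot(index):
    """A (1, vocab) array whose largest entry is at index."""
    row = np.zeros((1, len(id_to_word)))
    row[0, index] = 1.0
    return row


logits = [one_hot(4), one_hot(5), one_hot(EOS_ID), one_hot(4)]
print(greedy_decode(logits, id_to_word))  # prints "hi"
```

Greedy decoding is the simplest choice; it never revisits an earlier token, which is one reason the replies of a lightly trained model can look repetitive.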

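Neural networks depend heavily on training over samples. In my experiments, running on a GPU, the model only started to show results after about 20000 steps, and only then did the output begin to resemble a normal human-machine conversation.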


Copyright (c) 2017 LittleHann All rights reserved
