Getting Started with Neural Machine Translation in Python: A Very Concise Hands-On Guide

Published by 機器之心 (Synced) on 2019-03-03

Selected from Medium. Author: Susan Li. Compiled by 機器之心 (Synced).

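Machine translation (MT) is a highly challenging task that studies how to use computers to translate text or speech from one language into another. Using Keras, this article starts from basic text loading and data preprocessing, then discusses how to build an acceptable neural translation system with recurrent neural networks and the encoder-decoder framework. All the code for this tutorial is open-sourced on GitHub.

Traditionally, machine translation has relied on large statistical models built from highly sophisticated linguistic knowledge. Much recent research instead uses deep models to model the translation process directly, learning the necessary linguistic knowledge automatically when given only source-language and target-language data. Translation models based on deep neural networks currently achieve the best results.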

Project repository: github.com/susanli2016…

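Next, we will use deep neural networks to tackle machine translation. We will show how to develop a neural machine translation model that translates English into French. The model takes English text as input and returns the French translation. More precisely, we will build 4 models:

A simple RNN;
An RNN with word embeddings;
A bidirectional RNN;
An encoder-decoder model.

Training and evaluating deep neural networks is computationally intensive. The author ran all the code on an AWS EC2 instance. If you plan to follow along, you will need access to a GPU instance.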

Loading the libraries

import collections
import helper
import numpy as np
import project_tests as tests
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

The author uses helper.py to load the data and project_tests.py to test the functions.

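Data

The dataset has a relatively small vocabulary: the small_vocab_en file contains English sentences, and small_vocab_fr contains the corresponding French translations.

Dataset download: github.com/susanli2016…

Loading the data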

english_sentences = helper.load_data('data/small_vocab_en')
french_sentences = helper.load_data('data/small_vocab_fr')
print('Dataset Loaded')

Sample sentences

Each line of small_vocab_en contains one English sentence, and its French translation is on the corresponding line of small_vocab_fr.

for sample_i in range(2):
 print('small_vocab_en Line {}: {}'.format(sample_i + 1, english_sentences[sample_i]))
 print('small_vocab_fr Line {}: {}'.format(sample_i + 1, french_sentences[sample_i]))

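Vocabulary

The complexity of the problem depends on the complexity of the vocabulary: a more complex vocabulary means a more complex problem. Let's look at the complexity of the dataset we will be working with.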

english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])
print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

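Preprocessing

We will convert the text into sequences of integers using the following preprocessing steps:

1. Convert words to their id representations;

2. Add padding so that every sequence has the same length.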
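
The two steps above can be sketched in plain Python, without Keras. This is only an illustration under stated assumptions: the helper names (build_word_index, to_ids, pad_post) are hypothetical, and ties between equally frequent words are broken alphabetically here, which is a simplification and not necessarily how Keras orders them.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Sort by descending frequency; break ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_ids(sentences, word_index):
    """Step 1: replace every word with its id."""
    return [[word_index[w] for w in s.lower().split()] for s in sentences]

def pad_post(sequences, pad_id=0):
    """Step 2: append pad_id to the end of each sequence up to the longest length."""
    length = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]

sentences = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(sentences)
ids = to_ids(sentences, word_index)
padded = pad_post(ids)
print(word_index)
print(padded)  # [[1, 3, 2, 0, 0, 0], [1, 4, 2, 6, 1, 5]]
```

Id 0 is never assigned to a word, which is what makes it safe to use as the padding value.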

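Tokenize (turning strings into tokens)

Use Keras's Tokenizer class to convert each sentence into a sequence of word ids. We use it to tokenize both the English and the French sentences.

The tokenize function returns the tokenized input together with the fitted Tokenizer object.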

def tokenize(x):
 x_tk = Tokenizer(char_level = False)
 x_tk.fit_on_texts(x)
 return x_tk.texts_to_sequences(x), x_tk
text_sentences = [
 'The quick brown fox jumps over the lazy dog .',
 'By Jove , my quick study of lexicography won a prize .',
 'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
 print('Sequence {} in x'.format(sample_i + 1))
 print(' Input: {}'.format(sent))
 print(' Output: {}'.format(token_sent))

Padding

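Use Keras's pad_sequences function to append zeros to the end of each sequence, so that all English sequences have the same length and all French sequences have the same length.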

def pad(x, length=None):
    # Default to the length of the longest sequence in x
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')
tests.test_pad(pad)
# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
 print('Sequence {} in x'.format(sample_i + 1))
 print(' Input: {}'.format(np.array(token_sent)))
 print(' Output: {}'.format(pad_sent))

Preprocessing pipeline

Implement the preprocessing function:

def preprocess(x, y):
    # Tokenize both languages, then pad each one to its own maximum length
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)
print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)
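
The reshape inside preprocess can look odd at first sight. The small NumPy sketch below, on a made-up array that is not taken from the dataset, shows what it does: it adds a trailing axis of size 1, so each target word id becomes its own one-element label.

```python
import numpy as np

# Two hypothetical padded target sentences of length 4 (0 is the padding id).
padded_targets = np.array([[5, 2, 9, 0],
                           [7, 3, 0, 0]])

# Same call as in preprocess: keep the existing axes and append one of size 1.
targets_3d = padded_targets.reshape(*padded_targets.shape, 1)

print(padded_targets.shape)  # (2, 4)
print(targets_3d.shape)      # (2, 4, 1)
print(targets_3d[0, 2, 0])   # 9
```

The values are unchanged; only the shape goes from (sentences, time steps) to (sentences, time steps, 1).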

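Models

In this section we will try out several neural network architectures. We start by training 4 relatively simple ones:

Model 1 is a simple RNN;
Model 2 is an RNN with word embeddings;
Model 3 is a bidirectional RNN;
Model 4 is an encoder-decoder architecture made of two RNNs.

After trying these 4 simple architectures, we will build a deeper model that outperforms all of them.

Converting ids back to text

The neural network turns its input into word ids, but that is not the final form we want: we want the French translation. The logits_to_text function bridges the gap between the logits output by the network and the French translation, and we will use it to better understand the network's output.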

def logits_to_text(logits, tokenizer):
    # Invert the tokenizer's word -> id mapping; id 0 is reserved for padding
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
print('`logits_to_text` function loaded.')
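
To make the decoding step concrete, here is a minimal plain-Python sketch of the same idea on hand-made numbers. The decode helper, the tiny vocabulary, and the probabilities are all hypothetical: at each time step we take the id with the highest score and look up its word.

```python
def decode(probabilities, index_to_word):
    """Pick the most probable word id at each time step and join the words."""
    words = []
    for step in probabilities:
        best_id = max(range(len(step)), key=lambda i: step[i])  # argmax over the vocabulary
        words.append(index_to_word[best_id])
    return ' '.join(words)

# Hypothetical 4-word vocabulary; id 0 is the padding token.
index_to_word = {0: '<PAD>', 1: 'il', 2: 'a', 3: 'vu'}

# One row per time step, one column per vocabulary id.
probabilities = [
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.2, 0.6, 0.2],
    [0.1, 0.1, 0.1, 0.7],
    [0.9, 0.0, 0.1, 0.0],
]

decoded = decode(probabilities, index_to_word)
print(decoded)  # il a vu <PAD>
```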

Model 1: RNN

We build a basic RNN model, which is a good baseline for translating English sequences into French.

def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
 learning_rate = 1e-3
 input_seq = Input(input_shape[1:])
 rnn = GRU(64, return_sequences = True)(input_seq)
 logits = TimeDistributed(Dense(french_vocab_size))(rnn)
 model = Model(input_seq, Activation('softmax')(logits))
 model.compile(loss = sparse_categorical_crossentropy, 
 optimizer = Adam(learning_rate), 
 metrics = ['accuracy'])

 return model
tests.test_simple_model(simple_model)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))
# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    # Word ids start at 1 and 0 is the padding id, so there are len(word_index) + 1 classes
    english_vocab_size + 1,
    french_vocab_size + 1)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

The basic RNN model reaches a validation accuracy of 0.6039.
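
For readers wondering what that number measures, the sketch below is a rough plain-Python illustration, not the Keras implementation. It assumes that validation_split=0.2 holds out the last 20% of the samples and that accuracy is counted per time step with padding positions included (no masking). Both helper names are hypothetical.

```python
def split_for_validation(samples, validation_split):
    """Hold out the final fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

def token_accuracy(predicted, expected):
    """Fraction of time steps whose predicted id equals the expected id (padding included)."""
    pairs = [(p, e)
             for p_seq, e_seq in zip(predicted, expected)
             for p, e in zip(p_seq, e_seq)]
    return sum(p == e for p, e in pairs) / len(pairs)

train, val = split_for_validation(list(range(10)), 0.2)
print(train, val)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]

# Two made-up predictions against two made-up targets: 6 of 8 positions agree.
acc = token_accuracy([[4, 2, 0, 0], [7, 3, 1, 0]],
                     [[4, 2, 9, 0], [7, 5, 1, 0]])
print(acc)  # 0.75
```

Because padded positions count as correct whenever the model also predicts padding, a per-token accuracy like this can look more flattering than sentence-level quality.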

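Model 2: Word embeddings

A word embedding is a vector representation in n-dimensional space in which words with similar meanings lie close together, where n is the size of the embedding vector. We will build an RNN model that uses word embeddings.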

from keras.models import Sequential
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    rnn = GRU(64, return_sequences=True, activation="tanh")

    # The Embedding layer maps English word ids to vectors, so its input size is the English vocabulary
    embedding = Embedding(english_vocab_size, 64, input_length=input_shape[1])
    logits = TimeDistributed(Dense(french_vocab_size, activation="softmax"))

    model = Sequential()
    # Embedding can only be used as the first layer in a model --> Keras Documentation
    model.add(embedding)
    model.add(rnn)
    model.add(logits)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])

    return model
tests.test_embed_model(embed_model)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))
embeded_model = embed_model(
    tmp_x.shape,
    max_french_sequence_length,
    # Word ids start at 1 and 0 is the padding id, so there are len(word_index) + 1 classes
    english_vocab_size + 1,
    french_vocab_size + 1)
embeded_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
print(logits_to_text(embeded_model.predict(tmp_x[:1])[0], french_tokenizer))

The embedding model reaches a validation accuracy of 0.8401.
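
The idea that related words sit close together in embedding space can be illustrated with a toy lookup table. The vectors below are invented for the example, not learned by the model above, and the helper names are hypothetical. Closeness is measured with cosine similarity.

```python
import math

# Hypothetical 3-dimensional embeddings keyed by word id.
embedding_table = {
    1: [0.9, 0.1, 0.0],  # "truck"
    2: [0.8, 0.2, 0.1],  # "car"
    3: [0.0, 0.1, 0.9],  # "yellow"
}

def embed(ids):
    """An embedding layer is essentially a table lookup from id to vector."""
    return [embedding_table[i] for i in ids]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

truck, car, yellow = embed([1, 2, 3])
sim_truck_car = cosine_similarity(truck, car)
sim_truck_yellow = cosine_similarity(truck, yellow)
print(round(sim_truck_car, 3), round(sim_truck_yellow, 3))
```

In a trained model the table entries are weights learned along with the rest of the network, so words used in similar contexts tend to end up with similar vectors.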

Model 3: Bidirectional RNN

def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

 learning_rate = 1e-3
 model = Sequential()
 model.add(Bidirectional(GRU(128, return_sequences = True, dropout = 0.1), 
 input_shape = input_shape[1:]))
 model.add(TimeDistributed(Dense(french_vocab_size, activation = 'softmax')))
 model.compile(loss = sparse_categorical_crossentropy, 
 optimizer = Adam(learning_rate), 
 metrics = ['accuracy'])
 return model
tests.test_bd_model(bd_model)
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))
bidi_model = bd_model(
 tmp_x.shape,
 preproc_french_sentences.shape[1],
 len(english_tokenizer.word_index)+1,
 len(french_tokenizer.word_index)+1)
bidi_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=20, validation_split=0.2)
# Print prediction(s)
print(logits_to_text(bidi_model.predict(tmp_x[:1])[0], french_tokenizer))

The bidirectional RNN model reaches a validation accuracy of 0.5992.
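
As a rough intuition for what a bidirectional wrapper does, the toy sketch below runs a scalar RNN over a sequence left to right and right to left, then concatenates the two states at every time step. It is not a GRU: the update rule, the weights w and u, and the function names are assumptions made purely for illustration.

```python
import math

def run_rnn(inputs, w=0.5, u=0.5):
    """A scalar toy RNN: h_t = tanh(w * x_t + u * h_{t-1}); returns every h_t."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def run_bidirectional(inputs):
    """Concatenate a forward pass and a backward pass at each time step."""
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]  # read right to left, then realign with the input
    return [[f, b] for f, b in zip(forward, backward)]

outputs = run_bidirectional([1.0, 2.0, 3.0])
for step in outputs:
    print(step)
```

Each time step now carries two features: one summarizing everything to its left and one summarizing everything to its right. That is also why a bidirectional layer with concatenation outputs twice as many features as the wrapped layer has units.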

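Model 4: Encoder-decoder

The encoder builds a matrix representation of a sentence, and the decoder takes that matrix as input and outputs the predicted translation.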

def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

 learning_rate = 1e-3
 model = Sequential()
 model.add(GRU(128, input_shape = input_shape[1:], return_sequences = False))
 model.add(RepeatVector(output_sequence_length))
 model.add(GRU(128, return_sequences = True))
 model.add(TimeDistributed(Dense(french_vocab_size, activation = 'softmax')))

 model.compile(loss = sparse_categorical_crossentropy, 
 optimizer = Adam(learning_rate), 
 metrics = ['accuracy'])
 return model
tests.test_encdec_model(encdec_model)
tmp_x = pad(preproc_english_sentences)
tmp_x = tmp_x.reshape((-1, preproc_english_sentences.shape[1], 1))
encodeco_model = encdec_model(
 tmp_x.shape,
 preproc_french_sentences.shape[1],
 len(english_tokenizer.word_index)+1,
 len(french_tokenizer.word_index)+1)
encodeco_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=20, validation_split=0.2)
print(logits_to_text(encodeco_model.predict(tmp_x[:1])[0], french_tokenizer))

The encoder-decoder model reaches a validation accuracy of 0.6406.
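
The bridge between the encoder and the decoder in the model above is RepeatVector. A minimal plain-Python sketch of that step, with a hypothetical repeat_vector helper and a made-up encoder output, looks like this:

```python
def repeat_vector(vector, n):
    """Copy one fixed-size vector n times, giving one copy per output time step."""
    return [list(vector) for _ in range(n)]

# Hypothetical summary of an English sentence produced by the encoder.
encoder_state = [0.2, -0.5, 0.7]

# The decoder sees the same summary at each of its 4 output time steps.
decoder_inputs = repeat_vector(encoder_state, 4)
for step in decoder_inputs:
    print(step)
```

Because the encoder uses return_sequences=False, it emits a single vector for the whole sentence. Repeating it is a simple way to give the decoder an input sequence of the required output length.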

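Model 5: Custom deep model

Build a model_final that combines word embeddings and a bidirectional RNN in a single model.

At this point we need to run some experiments, such as changing the number of GRU units to 256, changing the learning rate to 0.005, training the model for more (or fewer) than 20 epochs, and so on.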
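
One simple way to organize such experiments is to enumerate a small grid of settings. The sketch below only generates the configurations; the search_space values and the iter_configs helper are hypothetical, and the training of each candidate is left out.

```python
from itertools import product

# Hypothetical search space built around the values mentioned in the text.
search_space = {
    'gru_units': [128, 256],
    'learning_rate': [0.001, 0.005],
    'epochs': [10, 20],
}

def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))  # 8
for config in configs:
    print(config)
```

Each configuration would then be used to build and fit a model, keeping the one with the best validation accuracy.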

def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

 model = Sequential()
 model.add(Embedding(input_dim=english_vocab_size,output_dim=128,input_length=input_shape[1]))
 model.add(Bidirectional(GRU(256,return_sequences=False)))
 model.add(RepeatVector(output_sequence_length))
 model.add(Bidirectional(GRU(256,return_sequences=True)))
 model.add(TimeDistributed(Dense(french_vocab_size,activation='softmax')))
 learning_rate = 0.005

 model.compile(loss = sparse_categorical_crossentropy, 
 optimizer = Adam(learning_rate), 
 metrics = ['accuracy'])

 return model
tests.test_model_final(model_final)
print('Final Model Loaded')

Prediction

def final_predictions(x, y, x_tk, y_tk):
    # Train the final model on the full preprocessed dataset
    tmp_X = pad(preproc_english_sentences)
    model = model_final(tmp_X.shape,
                        preproc_french_sentences.shape[1],
                        len(english_tokenizer.word_index)+1,
                        len(french_tokenizer.word_index)+1)

    model.fit(tmp_X, preproc_french_sentences, batch_size = 1024, epochs = 17, validation_split = 0.2)

    # Map French ids back to words; id 0 is the padding token
    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'

    # Encode a new English sentence and predict it together with the first training sentence
    sentence = 'he saw a old yellow truck'
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
    sentences = np.array([sentence[0], x[0]])
    predictions = model.predict(sentences, len(sentences))

    print('Sample 1:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]))
    print('Il a vu un vieux camion jaune')
    print('Sample 2:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[1]]))
    print(' '.join([y_id_to_word[np.max(x)] for x in y[0]]))

final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)

We get a perfect translation of the sentences, with a validation accuracy of 0.9776!

Original article: medium.com/@actsusanli…
