[Translation] Part-of-Speech tagging tutorial with the Keras Deep Learning library

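Posted by luochen1992 on 2018-04-28

Source URL: https://juejin.im/post/5ae4613a5188256727742d7d

Original article: Part-of-Speech tagging tutorial with the Keras Deep Learning library

Original author: Cdiscount Data Science

Translation published by: the Juejin Translation Project

Permanent link to this article: github.com/xitu/gold-m…

Translator: luochen

Proofreaders: stormluke, mingxing47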

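In this tutorial, you will see how to use a simple Keras model to train and evaluate an artificial neural network for a multi-class classification problem.

Part-of-speech (POS) tagging is a well-known task in natural language processing. It refers to classifying words into their parts of speech (also known as word classes or lexical categories). It is a supervised learning approach.

Artificial neural networks have been applied successfully to POS tagging, with great performance. We will focus on the Multilayer Perceptron network, a very popular network architecture that is considered the state of the art for POS tagging problems. (Translator's note: RNNs perform better on POS tagging.)

Let's put it into practice!

In this post, you will get a quick tutorial on how to implement a simple Multilayer Perceptron in Keras and train it on an annotated corpus.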

### Ensuring reproducibility

To make our experiments reproducible, we need to set a random seed:

import numpy as np

CUSTOM_SEED = 42
np.random.seed(CUSTOM_SEED)
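
As a small illustration of why seeding matters, the sketch below uses Python's built-in `random` module (not NumPy or Keras) to show that two generators created with the same seed produce the same sequence. The helper `draw` is a hypothetical name used only here. Depending on the backend, the framework's own random generator may need to be seeded as well before training runs become fully repeatable.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

first_run = draw(42)
second_run = draw(42)

# Same seed, same sequence: this is what makes an experiment repeatable
print(first_run == second_run)
```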

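### Getting an annotated corpus

The Penn Treebank is a POS-tagged corpus. A sample of it ships with the [NLTK](https://github.com/nltk/nltk) Python library, which contains many corpora that can be used to train and test natural language processing (NLP) models.

First, we download the annotated corpus: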

import nltk

nltk.download('treebank')

Then we load the tagged sentences.

from nltk.corpus import treebank

sentences = treebank.tagged_sents(tagset='universal')

Then we pick a sentence at random to look at:

import random

print(random.choice(sentences))

It is a list of (term, tag) tuples.

[('Mr.', 'NOUN'), ('Otero', 'NOUN'), (',', '.'), ('who', 'PRON'), ('apparently', 'ADV'), ('has', 'VERB'), ('an', 'DET'), ('unpublished', 'ADJ'), ('number', 'NOUN'), (',', '.'), ('also', 'ADV'), ('could', 'VERB'), ("n't", 'ADV'), ('be', 'VERB'), ('reached', 'VERB'), ('.', '.')]
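
A tagged sentence in this format can be separated into its words and its tags with `zip`. The snippet is a minimal sketch on a shortened copy of the sentence above, so it runs without downloading anything from NLTK.

```python
# A shortened copy of the tagged sentence printed above
tagged_sentence = [('Mr.', 'NOUN'), ('Otero', 'NOUN'), (',', '.'), ('who', 'PRON')]

# zip(*pairs) transposes the list of (term, tag) pairs into two tuples
terms, tags = zip(*tagged_sentence)

print(terms)
print(tags)
```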

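With the original Penn Treebank tags, this is a multi-class classification problem with more than forty different classes; the `universal` tagset loaded above maps them onto 12 coarse categories. POS tagging on the Treebank corpus is a well-known problem, and we expect to reach a model accuracy above 95%.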

tags = set([
    tag for sentence in treebank.tagged_sents() 
    for _, tag in sentence
])
print('nb_tags: %s\ntags: %s' % (len(tags), tags))

This yields:

46
{'IN', 'VBZ', '.', 'RP', 'DT', 'VB', 'RBR', 'CC', '#', ',', 'VBP', 'WP$', 'PRP', 'JJ', 
'RBS', 'LS', 'PRP$', 'WRB', 'JJS', '``', 'EX', 'POS', 'WP', 'VBN', '-LRB-', '-RRB-', 
'FW', 'MD', 'VBG', 'TO', '$', 'NNS', 'NNPS', "''", 'VBD', 'JJR', ':', 'PDT', 'SYM', 
'NNP', 'CD', 'RB', 'WDT', 'UH', 'NN', '-NONE-'}
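
Beyond the set of distinct tags, it can help to know how often each tag occurs, since the classes can be quite uneven. The sketch below counts tags with `collections.Counter` on a hypothetical two-sentence toy corpus written in the same (term, tag) format; the counts say nothing about the real Treebank.

```python
from collections import Counter

# Hypothetical toy corpus in the same (term, tag) format as the Treebank sentences
toy_sentences = [
    [('The', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ'), ('.', '.')],
    [('A', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB'), ('.', '.')],
]

# Count every tag occurrence across all sentences
tag_counts = Counter(tag for sentence in toy_sentences for _, tag in sentence)

print(len(tag_counts))   # number of distinct tags
print(tag_counts)
```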

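### Dataset preprocessing for supervised learning

We divide the tagged sentences into 3 datasets:

Training set: the sample of data used to fit the model,
Validation set: used to tune the parameters of the classifier, for example to choose the number of units in the neural network,
Test set: used only to assess the performance of the classifier.

We use approximately 60% of the tagged sentences for training, 20% as the validation set and 20% to evaluate our model.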

train_test_cutoff = int(.80 * len(sentences)) 
training_sentences = sentences[:train_test_cutoff]
testing_sentences = sentences[train_test_cutoff:]

train_val_cutoff = int(.25 * len(training_sentences))
validation_sentences = training_sentences[:train_val_cutoff]
training_sentences = training_sentences[train_val_cutoff:]
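
The two cutoffs can look confusing at first: 25% of the first 80% of the corpus is 20% of the whole. The sketch below wraps the same slicing in a hypothetical helper, `split_sentences`, and checks the resulting sizes on a stand-in corpus of 100 items.

```python
def split_sentences(sentences):
    """Split a sequence into training, validation and test parts (60/20/20)."""
    train_test_cutoff = int(.80 * len(sentences))
    training = sentences[:train_test_cutoff]
    testing = sentences[train_test_cutoff:]

    # 25% of the remaining 80% equals 20% of the whole corpus
    train_val_cutoff = int(.25 * len(training))
    validation = training[:train_val_cutoff]
    training = training[train_val_cutoff:]
    return training, validation, testing

# Hypothetical stand-in for the corpus: 100 numbered "sentences"
toy_corpus = list(range(100))
train, val, test = split_sentences(toy_corpus)
print(len(train), len(val), len(test))
```

Note that the slices are taken in corpus order, without shuffling, exactly as in the code above.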

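### Feature engineering

Our set of features is very simple. For each term, we build a dictionary of features based on the sentence the term was extracted from. These properties include the words before and after the term as well as its prefixes and suffixes.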

def add_basic_features(sentence_terms, index):
    """ Compute some very basic word features.
        :param sentence_terms: [w1, w2, ...] 
        :type sentence_terms: list
        :param index: the index of the word 
        :type index: int
        :return: dict containing features
        :rtype: dict
    """
    term = sentence_terms[index]
    return {
        'nb_terms': len(sentence_terms),
        'term': term,
        'is_first': index == 0,
        'is_last': index == len(sentence_terms) - 1,
        'is_capitalized': term[0].upper() == term[0],
        'is_all_caps': term.upper() == term,
        'is_all_lower': term.lower() == term,
        'prefix-1': term[0],
        'prefix-2': term[:2],
        'prefix-3': term[:3],
        'suffix-1': term[-1],
        'suffix-2': term[-2:],
        'suffix-3': term[-3:],
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        'next_word': '' if index == len(sentence_terms) - 1 else sentence_terms[index + 1]
    }
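
To see what such a feature dictionary looks like, the sketch below applies a trimmed-down variant of the function to a three-word toy sentence. `basic_features` is a hypothetical, shortened stand-in that keeps only a few of the keys so that the example stays self-contained.

```python
def basic_features(sentence_terms, index):
    """Trimmed-down variant of the feature function, for illustration only."""
    term = sentence_terms[index]
    last = len(sentence_terms) - 1
    return {
        'term': term,
        'is_first': index == 0,
        'is_last': index == last,
        'is_capitalized': term[0].upper() == term[0],
        'prefix-2': term[:2],
        'suffix-2': term[-2:],
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        'next_word': '' if index == last else sentence_terms[index + 1],
    }

sentence = ['The', 'cat', 'sleeps']
features = basic_features(sentence, 1)  # features for the word 'cat'
for name, value in features.items():
    print(name, '=', repr(value))
```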

We map our list of sentences to a list of feature dictionaries.

def untag(tagged_sentence):
    """
    Remove the tag from each tagged term.

    :param tagged_sentence: a POS-tagged sentence
    :type tagged_sentence: list
    :return: a list of terms (the words without their tags)
    :rtype: list of strings
    """
    return [w for w, _ in tagged_sentence]

def transform_to_dataset(tagged_sentences):
    """
    Split tagged sentences into X and y and add some basic features.

    :param tagged_sentences: a list of POS-tagged sentences
    :type tagged_sentences: list of lists of (term_i, tag_i) tuples
    :return: the feature dicts X and the matching tags y
    """
    X, y = [], []

    for pos_tags in tagged_sentences:
        # Strip the tags once per sentence rather than once per term
        sentence_terms = untag(pos_tags)
        for index, (term, class_) in enumerate(pos_tags):
            # Add basic NLP features for each sentence term
            X.append(add_basic_features(sentence_terms, index))
            y.append(class_)
    return X, y
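
The result is one sample per term, not one per sentence. The sketch below shows this on a hypothetical two-sentence corpus; `to_dataset` and `tiny_features` are simplified stand-ins written for this example, not the functions defined above.

```python
def tiny_features(words, index):
    """Hypothetical, minimal feature function used only for this example."""
    return {'term': words[index], 'is_first': index == 0}

def to_dataset(tagged_sentences, featurize):
    """Flatten tagged sentences into parallel lists of features and tags."""
    X, y = [], []
    for tagged in tagged_sentences:
        words = [word for word, _ in tagged]
        for index, (_, tag) in enumerate(tagged):
            X.append(featurize(words, index))
            y.append(tag)
    return X, y

toy_corpus = [
    [('The', 'DET'), ('cat', 'NOUN')],
    [('Run', 'VERB')],
]
X, y = to_dataset(toy_corpus, tiny_features)
print(len(X), y)
```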

For the training, validation and testing sentences, we split the attributes into X (input variables) and y (output variables).

X_train, y_train = transform_to_dataset(training_sentences)
X_test, y_test = transform_to_dataset(testing_sentences)
X_val, y_val = transform_to_dataset(validation_sentences)

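### Features encoding

Our neural network takes vectors as input, so we need to convert our dict features to vectors. scikit-learn's [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) class provides a straightforward way to do that.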

from sklearn.feature_extraction import DictVectorizer

# Fit our DictVectorizer with our set of features
dict_vectorizer = DictVectorizer(sparse=False)
dict_vectorizer.fit(X_train + X_test + X_val)

# Convert dict features to vectors
X_train = dict_vectorizer.transform(X_train)
X_test = dict_vectorizer.transform(X_test)
X_val = dict_vectorizer.transform(X_val)
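
To make the encoding concrete, here is a pure-Python sketch of the idea behind dict vectorization: every (feature, string value) pair gets its own indicator column, while boolean and numeric features keep a single column. This is an illustration only, not scikit-learn's implementation, and the helpers `fit_vocabulary` and `vectorize` are hypothetical names.

```python
def column_name(key, value):
    """String values get one column per value; other values share one column."""
    return f'{key}={value}' if isinstance(value, str) else key

def fit_vocabulary(samples):
    """Map every column name seen in the samples to a column index."""
    names = {column_name(k, v) for sample in samples for k, v in sample.items()}
    return {name: i for i, name in enumerate(sorted(names))}

def vectorize(sample, vocabulary):
    """Turn one feature dict into a dense row; unseen columns are ignored."""
    row = [0.0] * len(vocabulary)
    for key, value in sample.items():
        column = vocabulary.get(column_name(key, value))
        if column is not None:
            row[column] = 1.0 if isinstance(value, str) else float(value)
    return row

samples = [
    {'term': 'cat', 'is_first': False},
    {'term': 'The', 'is_first': True},
]
vocabulary = fit_vocabulary(samples)
vectors = [vectorize(sample, vocabulary) for sample in samples]
print(vocabulary)
print(vectors)
```

Because each distinct word, prefix and suffix adds a column, the real input dimension grows with the vocabulary of the corpus.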

Our y vectors must be encoded as well. The output variable holds the tags as string values, which we encode as integers.

from sklearn.preprocessing import LabelEncoder

# Fit the LabelEncoder with our list of classes
label_encoder = LabelEncoder()
label_encoder.fit(y_train + y_test + y_val)

# Encode class values as integers
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
y_val = label_encoder.transform(y_val)
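
Conceptually, label encoding is a lookup table from each distinct tag to an integer, with the distinct tags kept in sorted order. The sketch below reproduces that idea in plain Python on a hypothetical handful of tags; it is not the scikit-learn code itself.

```python
labels = ['NOUN', 'VERB', 'DET', 'NOUN', '.']

# Sorted distinct classes; the position in this list is the integer code
classes = sorted(set(labels))
to_index = {label: index for index, label in enumerate(classes)}

encoded = [to_index[label] for label in labels]

# Decoding is a plain list lookup in the other direction
decoded = [classes[index] for index in encoded]

print(classes)
print(encoded)
```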

Then we need to convert these encoded values to dummy variables (one-hot encoding).

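# Convert integers to dummy variables (one-hot encoding)
from keras.utils import to_categorical

# Fix the width so every split gets one column per class,
# even if a split happens to miss the highest-numbered tag
num_classes = len(label_encoder.classes_)
y_train = to_categorical(y_train, num_classes=num_classes)
y_test = to_categorical(y_test, num_classes=num_classes)
y_val = to_categorical(y_val, num_classes=num_classes)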
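
One-hot encoding turns each integer code into a row that is zero everywhere except at the position of the class. The hypothetical helper `one_hot` below is a plain-Python sketch of that transformation, with the number of classes passed explicitly so that every row has the same width.

```python
def one_hot(indices, num_classes):
    """Return one row per index, with a single 1.0 at the index position."""
    rows = []
    for index in indices:
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows

# Three encoded labels out of four possible classes; class 3 never occurs here
encoded = [2, 0, 1]
rows = one_hot(encoded, num_classes=4)
for row in rows:
    print(row)
```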

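### Building a Keras model

[Keras](https://github.com/fchollet/keras/) is a high-level framework for designing and running neural networks on top of several backends such as [TensorFlow](https://github.com/tensorflow/tensorflow/), [Theano](https://github.com/Theano/Theano) and [CNTK](https://github.com/Microsoft/CNTK).

We want to create one of the most basic neural networks: the Multilayer Perceptron. This kind of linear stack of layers can easily be built with the Sequential model. The model will contain an input layer, a hidden layer, and an output layer. To counter overfitting, we use dropout regularization. We set the dropout rate to 20%, meaning that at each parameter update during training, each input unit is randomly dropped with a probability of 20%.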
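
The effect of a 20% dropout rate can be sketched in a few lines of plain Python. The hypothetical `dropout` helper below zeroes each value with probability `rate` during training and scales the survivors by 1 / (1 - rate) so that the expected sum is unchanged; at prediction time it passes values through untouched. It is an illustration of the technique under these assumptions, not the Keras layer.

```python
import random

def dropout(values, rate, rng, training=True):
    """Randomly zero values with probability `rate`; rescale the survivors."""
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale for value in values]

rng = random.Random(42)
activations = [1.0] * 10000

dropped = dropout(activations, rate=0.2, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print(round(zero_fraction, 2), round(mean_after, 2))
```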

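We use Rectified Linear Unit (ReLU) activations for the hidden layers, as they are the simplest non-linear activation functions available.

For multi-class classification, we want to convert the units' outputs to probabilities, which can be done with the softmax function. We decided to use the categorical cross-entropy loss function. Finally, we chose the Adam optimizer, as it seems to be well suited to classification tasks.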
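
As a worked example of these two pieces, the sketch below computes a softmax over three made-up output values and the categorical cross-entropy against a one-hot target. The helper names and the numbers are assumptions chosen for illustration; with a one-hot target, the loss reduces to the negative log of the probability assigned to the true class.

```python
import math

def softmax(logits):
    """Convert raw outputs to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def categorical_cross_entropy(one_hot_target, probabilities):
    """Negative log-likelihood of the true class under the prediction."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)

probabilities = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1.0, 0.0, 0.0], probabilities)

print([round(p, 3) for p in probabilities])
print(round(loss, 3))
```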

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

def build_model(input_dim, hidden_neurons, output_dim):
    """
    Construct, compile and return a Keras model which will be used to fit/predict.
    """
    model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim),
        Activation('relu'),
        Dropout(0.2),
        Dense(hidden_neurons),
        Activation('relu'),
        Dropout(0.2),
        Dense(output_dim, activation='softmax')
    ])

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
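
To get a feel for the size of this network, the trainable parameters of each `Dense` layer can be counted by hand: one weight per input/output pair plus one bias per output (the activation and dropout layers add none). The sketch below assumes an input dimension of 1000 and 12 output classes purely for illustration; the real values depend on the fitted vectorizer and on the tagset.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_outputs + n_outputs

def mlp_params(input_dim, hidden_neurons, output_dim):
    """Parameter count of the three Dense layers stacked in the model."""
    return (dense_params(input_dim, hidden_neurons)
            + dense_params(hidden_neurons, hidden_neurons)
            + dense_params(hidden_neurons, output_dim))

# input_dim and output_dim are hypothetical values chosen for the example
total = mlp_params(input_dim=1000, hidden_neurons=512, output_dim=12)
print(total)
```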

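### Creating a wrapper between the Keras API and Scikit-Learn

[Keras](https://github.com/fchollet/keras/) provides a wrapper called [KerasClassifier](https://keras.io/scikit-learn-api/), which implements the Scikit-Learn classifier interface.

All model parameters are defined below. We need to provide a function that returns the structure of the neural network (build_fn). The number of hidden neurons and the batch size are chosen quite arbitrarily. We set the number of epochs to 5 because with more iterations the Multilayer Perceptron starts overfitting (even with dropout regularization).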

from keras.wrappers.scikit_learn import KerasClassifier

model_params = {
    'build_fn': build_model,
    'input_dim': X_train.shape[1],
    'hidden_neurons': 512,
    'output_dim': y_train.shape[1],
    'epochs': 5,
    'batch_size': 256,
    'verbose': 1,
    'validation_data': (X_val, y_val),
    'shuffle': True
}

clf = KerasClassifier(**model_params)
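
The batch size and the number of epochs together determine how many weight updates the optimizer performs: one per batch, for every epoch. The sketch below works this out for a hypothetical training set of 60,000 samples; the real count depends on how many terms the training sentences contain.

```python
import math

def total_updates(n_samples, batch_size, epochs):
    """Number of gradient updates: one per batch, repeated for every epoch."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

# n_samples is a made-up figure; batch_size and epochs match the parameters above
updates = total_updates(n_samples=60000, batch_size=256, epochs=5)
print(updates)
```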

### Training our Keras model

Finally, we train our Multilayer Perceptron on the training set.

hist = clf.fit(X_train, y_train)

Thanks to the history callback, we can visualize the model's log loss and accuracy over time.

import matplotlib.pyplot as plt

def plot_model_performance(train_loss, train_acc, train_val_loss, train_val_acc):
    """  Plot model loss and accuracy through epochs.  """

    blue = '#34495E'
    green = '#2ECC71'
    orange = '#E23B13'

    # Plot the model loss curves
    fig, (ax1, ax2) = plt.subplots(2, figsize=(10, 8))
    ax1.plot(range(1, len(train_loss) + 1), train_loss, blue, linewidth=5, label='training')
    ax1.plot(range(1, len(train_val_loss) + 1), train_val_loss, green, linewidth=5, label='validation')
    ax1.set_xlabel('# epoch')
    ax1.set_ylabel('loss')
    ax1.tick_params('y')
    ax1.legend(loc='upper right', shadow=False)
    ax1.set_title('Model loss through #epochs', color=orange, fontweight='bold')

    # Plot the model accuracy curves
    ax2.plot(range(1, len(train_acc) + 1), train_acc, blue, linewidth=5, label='training')
    ax2.plot(range(1, len(train_val_acc) + 1), train_val_acc, green, linewidth=5, label='validation')
    ax2.set_xlabel('# epoch')
    ax2.set_ylabel('accuracy')
    ax2.tick_params('y')
    ax2.legend(loc='lower right', shadow=False)
    ax2.set_title('Model accuracy through #epochs', color=orange, fontweight='bold')

Then, let's look at the model's performance:

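plot_model_performance(
    train_loss=hist.history.get('loss', []),
    # Older Keras versions log 'acc'/'val_acc'; newer ones log
    # 'accuracy'/'val_accuracy', so fall back from one to the other
    train_acc=hist.history.get('acc', hist.history.get('accuracy', [])),
    train_val_loss=hist.history.get('val_loss', []),
    train_val_acc=hist.history.get('val_acc', hist.history.get('val_accuracy', []))
)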

Figure: model performance through epochs.

After two epochs, we see that the model starts to overfit.
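
One simple way to spot this from the recorded history is to look for the epoch with the lowest validation loss: once the validation loss rises while the training loss keeps falling, the model is fitting noise in the training set. The loss values below are made up for illustration and are not the results of this tutorial; `best_epoch` is a hypothetical helper. Keras also ships an `EarlyStopping` callback that can stop training automatically based on the same signal.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

# Made-up history in the same shape as hist.history
history = {
    'loss': [0.60, 0.20, 0.12, 0.08, 0.05],
    'val_loss': [0.25, 0.16, 0.17, 0.19, 0.22],
}

epoch = best_epoch(history['val_loss'])
print(epoch)
```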

### Evaluating our Multilayer Perceptron

Since our model is trained, we can evaluate it directly:

score = clf.score(X_test, y_test)
print(score)

[Out] 0.95816
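
The score here is a classification accuracy: the share of terms whose predicted tag matches the reference tag. With one-hot targets and probability outputs, that amounts to comparing the position of the largest value in each row. The sketch below does this on four made-up predictions; `argmax` and `accuracy` are hypothetical helpers, not part of the wrapper.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(one_hot_targets, predicted_probabilities):
    """Share of rows where the most probable class is the true class."""
    correct = sum(argmax(target) == argmax(predicted)
                  for target, predicted in zip(one_hot_targets, predicted_probabilities))
    return correct / len(one_hot_targets)

# Made-up targets and predictions for four terms and three classes
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]

toy_score = accuracy(targets, predictions)
print(toy_score)
```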

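We reach close to 96% accuracy on the test set, which is quite impressive considering the basic features we fed into the model. Keep in mind that 100% accuracy is not achievable even for human annotators. We estimate human accuracy on POS tagging at around 98%.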

### Visualizing the model

from keras.utils import plot_model

plot_model(clf.model, to_file='model.png', show_shapes=True)

### Saving the Keras model

Saving a Keras model is pretty simple, as the Keras library provides a native method for it:

clf.model.save('/tmp/keras_mlp.h5')

This saves the model's architecture, its weights, and its training configuration (loss function, optimizer).

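### Resources

Keras: The Python Deep Learning library: [doc]
Adam: A Method for Stochastic Optimization: [paper]
Improving neural networks by preventing co-adaptation of feature detectors: [paper]

In this post, you learned how to define and evaluate the accuracy of a neural network for multi-class classification using the Keras library. The code is available here: [.py|.ipynb].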

