Training a Text Classifier with a Neural Network

Posted by Anne90 on 2017-08-10

Neural networks, text classification

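It is important to understand how chatbots work. A fundamental piece of machinery inside a chatbot is the text classifier. Let's look at the inner workings of an artificial neural network (ANN) for text classification.

Multi-layer neural network

We will use a 2-layer network (1 hidden layer) and a "bag of words" approach to organize our training data. Text classification comes in 3 flavors: pattern matching, algorithms, and neural networks. The algorithmic approach using Multinomial Naive Bayes is surprisingly effective, but it suffers from 3 fundamental flaws: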

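The algorithm produces a score rather than a probability. We want a probability so that we can ignore predictions below some threshold. This is similar to the squelch control that suppresses noise on a radio.
The algorithm learns from examples of what is in a class, but not from what is not. Learning patterns of what does not belong to a class is often just as important.
Classes with disproportionately large training sets produce distorted classification scores, forcing the algorithm to adjust scores relative to class size. This is not ideal.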

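As with its "Naive" counterpart, this classifier does not attempt to understand the meaning of a sentence; it only tries to classify it. In fact, so-called "AI chatbots" do not understand language, but that is another story.

If you are new to artificial neural networks, here is how they work.

To understand the algorithmic approach to classification, see here.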

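Let's examine our text classifier one section at a time. We will take the following steps:

Import the libraries we need
Provide training data
Organize the data
Iterate: write code + test predictions + tune the model
Abstract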

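The code is here. We use the IPython notebook, a highly productive tool for data science projects. The code is written in Python.

We begin by importing the Natural Language Toolkit. We need a reliable way to tokenize sentences into words and a way to stem those words.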

# use natural language toolkit
import nltk
from nltk.stem.lancaster import LancasterStemmer
import os
import json
import datetime
stemmer = LancasterStemmer()

Here is our training data: 12 sentences belonging to 3 classes ("intents").

# 3 classes of training data
training_data = []
training_data.append({"class":"greeting", "sentence":"how are you?"})
training_data.append({"class":"greeting", "sentence":"how is your day?"})
training_data.append({"class":"greeting", "sentence":"good day"})
training_data.append({"class":"greeting", "sentence":"how is it going today?"})

training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"see you later"})
training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"talk to you soon"})

training_data.append({"class":"sandwich", "sentence":"make me a sandwich"})
training_data.append({"class":"sandwich", "sentence":"can you make a sandwich?"})
training_data.append({"class":"sandwich", "sentence":"having a sandwich today?"})
training_data.append({"class":"sandwich", "sentence":"what's for lunch?"})
print ("%s sentences in training data" % len(training_data))

12 sentences in training data

We can now organize our data structures for documents, classes and words.

words = []
classes = []
documents = []
ignore_words = ['?']
# loop through each sentence in our training data
for pattern in training_data:
    # tokenize each word in the sentence
    w = nltk.word_tokenize(pattern['sentence'])
    # add to our words list
    words.extend(w)
    # add to documents in our corpus
    documents.append((w, pattern['class']))
    # add to our classes list
    if pattern['class'] not in classes:
        classes.append(pattern['class'])

# stem and lower each word and remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = list(set(words))

# remove duplicates
classes = list(set(classes))

print (len(documents), "documents")
print (len(classes), "classes", classes)
print (len(words), "unique stemmed words", words)

12 documents
3 classes ['greeting', 'goodbye', 'sandwich']
26 unique stemmed words ['sandwich', 'hav', 'a', 'how', 'for', 'ar', 'good', 'mak', 'me', 'it', 'day', 'soon', 'nic', 'lat', 'going', 'you', 'today', 'can', 'lunch', 'is', "'s", 'see', 'to', 'talk', 'yo', 'what']

Notice that each word is stemmed and lower-cased. Stemming helps the machine equate words like "have" and "having". We also do not care about case.
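
To make the effect of stemming and lower-casing on vocabulary size concrete, here is a minimal sketch. It uses a deliberately crude suffix-stripping function, `toy_stem`, which is a hypothetical stand-in and not the Lancaster algorithm; the suffix list and the sample tokens are assumptions chosen for this example.

```python
# Hypothetical stand-in for a real stemmer: strips a few common suffixes.
# This is NOT the Lancaster algorithm; it only illustrates the idea.
SUFFIXES = ("ing", "es", "e", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # keep at least three characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["Have", "having", "have", "Sandwich", "sandwiches"]
stems = [toy_stem(t) for t in tokens]
vocabulary = sorted(set(stems))

print(stems)       # ['hav', 'hav', 'hav', 'sandwich', 'sandwich']
print(vocabulary)  # ['hav', 'sandwich']
```

Five surface forms collapse into two vocabulary entries, so the network needs fewer inputs and sees more examples per input.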

We transform each sentence in the training data into a bag of words.

# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(classes)

# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # stem each word
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
    # create our bag of words array
    for w in words:
        bag.append(1 if w in pattern_words else 0)

    training.append(bag)
    # output is a '0' for each tag and '1' for current tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    output.append(output_row)

# sample training/output
i = 0
w = documents[i][0]
print ([stemmer.stem(word.lower()) for word in w])
print (training[i])
print (output[i])

['how', 'ar', 'you', '?']
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0]

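The step above is a classic in text classification: each training sentence is reduced to an array of 0s and 1s, one position for each unique word in the corpus.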

['how', 'are', 'you', '?']

is stemmed to:

['how', 'ar', 'you', '?']

and then transformed to the input bag of words: a 1 for each vocabulary word present in the sentence (the ? is ignored)

[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

and the output: the first class

[1, 0, 0]

Note that a sentence could be given multiple classes, or none. Make sure the above makes sense, and play with the code until you understand it.
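
The transformation walked through above can be sketched in a few lines of plain Python. The five-word vocabulary and the helper names `to_bag` and `one_hot` are assumptions made for illustration; in the article the vocabulary holds 26 stems built from the training sentences.

```python
# Illustrative vocabulary; in the article it is built from the training data.
vocabulary = ["ar", "day", "good", "how", "you"]
class_names = ["greeting", "goodbye", "sandwich"]

def to_bag(stemmed_tokens, vocabulary):
    """Return a 0/1 list: 1 where the vocabulary word occurs in the sentence."""
    present = set(stemmed_tokens)
    return [1 if word in present else 0 for word in vocabulary]

def one_hot(label, class_names):
    """Return a 0/1 list with a single 1 at the position of `label`."""
    return [1 if name == label else 0 for name in class_names]

# '?' is not in the vocabulary, so it contributes nothing to the bag.
bag = to_bag(["how", "ar", "you", "?"], vocabulary)
target = one_hot("greeting", class_names)

print(bag)     # [1, 0, 0, 1, 1]
print(target)  # [1, 0, 0]
```

Word order and repetition are discarded; only presence or absence of each vocabulary word reaches the network.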

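Your first step in machine learning is to have clean data.

Next we have the core functions for our 2-layer neural network.

If you are new to artificial neural networks, here is how they work.

We use numpy because we want our matrix multiplication to be fast.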

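We use a sigmoid function to squash values into the range 0 to 1, and its derivative to scale the error when adjusting the weights. We iterate and adjust until the error rate is acceptably low.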
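
As a quick check on the sigmoid and its derivative, the sketch below compares the closed-form derivative, computed from the sigmoid's output, against a numerical estimate. It uses the standard `math` module instead of numpy, and the test point `x = 0.5` and step size `h` are arbitrary choices for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_output_to_derivative(output):
    # Expects the sigmoid's OUTPUT, not its input: s'(x) = s(x) * (1 - s(x))
    return output * (1.0 - output)

x = 0.5
out = sigmoid(x)
analytic = sigmoid_output_to_derivative(out)

# Central finite difference as an independent numerical estimate
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(out, analytic, numeric)
```

The two derivative values should agree to several decimal places, which confirms that the derivative can be computed from the layer output alone, with no need to keep the pre-activation values.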

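Below we also implement our bag-of-words function, which transforms an input sentence into an array of 0s and 1s. This matches precisely the transform applied to the training data, which is always crucial to get right.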

import numpy as np
import time

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)

def clean_up_sentence(sentence):
    # tokenize the pattern
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    return sentence_words

# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=False):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words
    bag = [0]*len(words)  
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s: 
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)

    return(np.array(bag))

def think(sentence, show_details=False):
    x = bow(sentence.lower(), words, show_details)
    if show_details:
        print ("sentence:", sentence, "\n bow:", x)
    # input layer is our bag of words
    l0 = x
    # matrix multiplication of input and hidden layer
    l1 = sigmoid(np.dot(l0, synapse_0))
    # output layer
    l2 = sigmoid(np.dot(l1, synapse_1))
    return l2

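Now we code our neural network training function to create the synaptic weights. Don't get too excited, this is mostly matrix multiplication, straight from middle-school math class.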
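
A training loop of this kind is mostly a forward pass, an error calculation, and two weight updates. The following is a minimal sketch of that loop, assuming a bias-free network with one sigmoid hidden layer and full-batch gradient descent. The function name `train_sketch`, its parameters, the `seed` argument and the toy data are all assumptions for illustration; dropout and saving the weights to JSON are left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sketch(X, y, hidden_neurons=10, alpha=0.5, epochs=5000, seed=1):
    """Hypothetical helper: full-batch gradient descent, no bias, no dropout."""
    rng = np.random.RandomState(seed)
    # start from random weights in [-1, 1)
    synapse_0 = 2 * rng.random_sample((X.shape[1], hidden_neurons)) - 1
    synapse_1 = 2 * rng.random_sample((hidden_neurons, y.shape[1])) - 1
    errors = []
    for _ in range(epochs):
        # forward pass through the hidden and output layers
        layer_1 = sigmoid(X.dot(synapse_0))
        layer_2 = sigmoid(layer_1.dot(synapse_1))
        # error at the output, recorded as mean absolute error
        layer_2_error = y - layer_2
        errors.append(float(np.mean(np.abs(layer_2_error))))
        # backward pass: scale each error by the sigmoid slope
        layer_2_delta = layer_2_error * layer_2 * (1 - layer_2)
        layer_1_error = layer_2_delta.dot(synapse_1.T)
        layer_1_delta = layer_1_error * layer_1 * (1 - layer_1)
        # weight updates, scaled by the learning rate alpha
        synapse_1 += alpha * layer_1.T.dot(layer_2_delta)
        synapse_0 += alpha * X.T.dot(layer_1_delta)
    return synapse_0, synapse_1, errors

# Toy data (made up): 4 bags over a 4-word vocabulary, 2 classes.
X = np.array([[1, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 0]])
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

synapse_0, synapse_1, errors = train_sketch(X, y)
predictions = sigmoid(sigmoid(X.dot(synapse_0)).dot(synapse_1))

print("first error:", errors[0], "last error:", errors[-1])
print(predictions.argmax(axis=1))
```

The returned `errors` list makes it easy to see whether a given `alpha` and hidden-layer size are converging before committing to a long run.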

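We are now ready to build our neural network model. We will save the synaptic weights as a JSON file.

You should experiment with different values of "alpha" (the gradient descent parameter) and see how it affects the error rate. This parameter controls the size of each error adjustment and helps the model find the lowest error rate: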

synapse_0 += alpha * synapse_0_weight_update

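We use 20 neurons in our hidden layer; you can adjust this easily. These parameters will vary with the dimensions and shape of your training data. Tuning until the error rate falls to around 10^-3 is a reasonable target.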
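
To get a feel for how alpha changes convergence without waiting on a full training run, the sketch below applies the same kind of update to a one-dimensional problem. The function `descend` and the quadratic objective `(w - 3)**2` are assumptions chosen for illustration; they are not part of the classifier.

```python
def descend(alpha, steps=50, start=0.0):
    """Minimise f(w) = (w - 3)**2 with a fixed learning rate `alpha`."""
    w = start
    for _ in range(steps):
        weight_update = -2.0 * (w - 3.0)  # negative gradient: points downhill
        w += alpha * weight_update        # same form as the synapse update
    return abs(w - 3.0)                   # remaining distance from the minimum

for alpha in (0.01, 0.1, 0.5, 1.1):
    print(alpha, descend(alpha))
```

A very small alpha is still far from the minimum after 50 steps, a moderate alpha gets close, and an alpha above 1.0 overshoots further on every step and diverges on this particular objective. The useful range is problem-specific, which is why it is worth trying several values.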

X = np.array(training)
y = np.array(output)

start_time = time.time()

train(X, y, hidden_neurons=20, alpha=0.1, epochs=100000, dropout=False, dropout_percent=0.2)

elapsed_time = time.time() - start_time
print ("processing time:", elapsed_time, "seconds")

Training with 20 neurons, alpha:0.1, dropout:False 
Input matrix: 12x26    Output matrix: 1x3
delta after 10000 iterations:0.0062613597435
delta after 20000 iterations:0.00428296074919
delta after 30000 iterations:0.00343930779307
delta after 40000 iterations:0.00294648034566
delta after 50000 iterations:0.00261467859609
delta after 60000 iterations:0.00237219554105
delta after 70000 iterations:0.00218521899378
delta after 80000 iterations:0.00203547284581
delta after 90000 iterations:0.00191211022401
delta after 100000 iterations:0.00180823798397
saved synapses to: synapses.json
processing time: 6.501226902008057 seconds

The synapses.json file contains all of our synaptic weights. This is our model.
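
The classification code in this article reads the weights back under the keys `synapse0` and `synapse1`. A minimal sketch of the save and load round trip follows, assuming toy weights stored as nested lists and a temporary directory in place of the working directory; the weight values are made up.

```python
import json
import os
import tempfile

# Toy weights as nested lists. NumPy arrays would need .tolist() first,
# because the json module cannot serialise ndarray objects directly.
synapse_0 = [[0.1, -0.2], [0.3, 0.4]]
synapse_1 = [[0.5], [-0.6]]

model = {"synapse0": synapse_0, "synapse1": synapse_1}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "synapses.json")
    with open(path, "w") as out_file:
        json.dump(model, out_file, indent=4, sort_keys=True)
    with open(path) as in_file:
        restored = json.load(in_file)

print(restored == model)  # True
```

In practice it is also worth storing the word list and class list next to the weights, since the weights are only meaningful for the exact vocabulary order they were trained on.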

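Once the synaptic weights have been calculated, the classify() function is all that is needed for classification: about 15 lines of code.

The catch: if the training data changes, the model needs to be recalculated. For a very large dataset this can take a significant amount of time.

We can now generate the probability of a sentence belonging to one or more of our classes. This is very fast because it is just the dot-product calculation in the think() function we defined earlier.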

# probability threshold
ERROR_THRESHOLD = 0.2
# load our calculated synapse values
synapse_file = 'synapses.json' 
with open(synapse_file) as data_file: 
    synapse = json.load(data_file) 
    synapse_0 = np.asarray(synapse['synapse0']) 
    synapse_1 = np.asarray(synapse['synapse1'])

def classify(sentence, show_details=False):
    results = think(sentence, show_details)

    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD ] 
    results.sort(key=lambda x: x[1], reverse=True) 
    return_results =[[classes[r[0]],r[1]] for r in results]
    print ("%s \n classification: %s" % (sentence, return_results))
    return return_results

classify("sudo make me a sandwich")
classify("how are you today?")
classify("talk to you tomorrow")
classify("who are you?")
classify("make me some lunch")
classify("how was your lunch today?")
print()
classify("good day", show_details=True)

sudo make me a sandwich
 [['sandwich', 0.99917711814437993]]
how are you today?
 [['greeting', 0.99864563257858363]]
talk to you tomorrow
 [['goodbye', 0.95647479275905511]]
who are you?
 [['greeting', 0.8964283843977312]]
make me some lunch
 [['sandwich', 0.95371924052636048]]
how was your lunch today?
 [['greeting', 0.99120883810944971], ['sandwich', 0.31626066870883057]]

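Experiment with other sentences and different probabilities; you can then add training data to improve and expand the model. Notice the solid predictions from very little training data.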

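Some sentences will produce multiple predictions (above the threshold). You will need to establish the right threshold level for your application. Not all text classification scenarios are the same: some predictive situations require more confidence than others.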
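
The effect of the threshold can be tried in isolation. In this sketch the helper `pick_classes` and the probability values are made up for illustration; only the idea of filtering by a threshold and sorting by probability comes from the classifier.

```python
ERROR_THRESHOLD = 0.2
class_names = ["greeting", "goodbye", "sandwich"]

def pick_classes(probabilities, threshold=ERROR_THRESHOLD):
    """Keep (class, probability) pairs above the threshold, best first."""
    kept = [(name, p) for name, p in zip(class_names, probabilities) if p > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

scores = [0.99, 0.03, 0.32]  # made-up network outputs, one per class

print(pick_classes(scores))                 # [('greeting', 0.99), ('sandwich', 0.32)]
print(pick_classes(scores, threshold=0.5))  # [('greeting', 0.99)]
print(pick_classes([0.1, 0.05, 0.15]))      # []
```

Raising the threshold trades recall for precision: fewer secondary intents are reported, and more inputs end up with no class at all, which a chatbot can treat as "I did not understand".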

The last classification shows some internal details:

found in bag: good
found in bag: day
sentence: good day
 bow: [0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
good day
 [['greeting', 0.99664077655648697]]

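Notice the bag of words (bow) for the sentence: 2 words matched our corpus. The neural network also learns from the 0s, the non-matching words.

We get a low-probability classification by providing a sentence in which the common word 'a' is the only match: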

found in bag: a
sentence: a burrito!
 bow: [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
a burrito!
 [['sandwich', 0.61776860634647834]]

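You now have a fundamental piece of machinery for building a chatbot, capable of handling a large number of classes ("intents") and suitable for classes with limited or extensive training data. Adding one or more responses to an intent is trivial, so we will not cover it here.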

Enjoy!
