Text Classification with TensorFlow and Neural Networks

Published by Synced (機器之心) on 2017-08-23

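In this article, an overseas analyst at Synced reviews a popular blog post on Medium (link at the end of the article). The post's author discusses six main topics involved in building a machine learning model for text classification: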

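How TensorFlow works
What a machine learning model is
What a neural network is
How a neural network learns
How to process data and feed it to the neural network's input
How to run the model and obtain predictions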

The author also provides code that can be run in a Jupyter notebook. I will review these six topics and combine them with my own experience.

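1. TensorFlow Overview

TensorFlow is one of the most popular open-source AI libraries. Its computational efficiency and rich development resources have led to wide adoption by companies and individual developers. In my view, the best way to learn TensorFlow is through the tutorials on its official website (https://www.tensorflow.org/), starting with the "getting started" tutorial.

I will first introduce TensorFlow's basic definitions and main features. A tensor is a data structure that arranges primitive values into an array with any number of dimensions [1]. A tensor's rank is its number of dimensions. I recommend reading the Python API documentation, which is friendly to TensorFlow beginners. To install TensorFlow and set up the environment, just follow the instructions on the official website. A simple way to check that the installation succeeded is to import the TensorFlow library.

The computational graph is the core component of TensorFlow. A dataflow graph represents the computation: within the graph, an operation (Operation) is the unit of computation and a tensor is the unit of data. To run the code, we need to create a Session. Here is the complete code for running an addition: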

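#import the library (TensorFlow 1.x API)
import tensorflow as tf
#build the graph and name it my_graph
my_graph = tf.Graph()
#make my_graph the default graph while its operations are defined
with my_graph.as_default():
    x = tf.constant([1,3,6])
    y = tf.constant([1,1,1])
    #element-wise add operation
    op = tf.add(x,y)
#tf.Session encapsulates the environment in which my_graph is run
with tf.Session(graph=my_graph) as sess:
    #run the operation by passing it as fetches
    result = sess.run(fetches=op)
    #print it
    print(result)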

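As you can see, writing TensorFlow code follows a pattern that is easy to remember. You import the library, create a graph, and define constant tensors and operations inside it. You then tell the Session which graph to use. Finally, you call the Session's run() method, which evaluates every tensor passed in the fetches argument.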
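
To make the notion of rank from earlier in this section concrete, here is a minimal sketch in plain Python that does not use TensorFlow at all. The helper name tensor_rank is hypothetical, and it assumes the value is a regular, non-empty nested list; it only counts nesting levels, which is what "number of dimensions" means for such a value.

```python
def tensor_rank(value):
    """Count the nesting levels of a regular nested list (its rank)."""
    rank = 0
    while isinstance(value, list):
        rank += 1
        value = value[0]  # assumes every level is non-empty and regular
    return rank


scalar = 7                  # rank 0: a single value
vector = [1, 3, 6]          # rank 1: the kind of constant used above
matrix = [[1, 2], [3, 4]]   # rank 2: rows and columns

print(tensor_rank(scalar), tensor_rank(vector), tensor_rank(matrix))
```

The two constants in the addition example are both rank-1 tensors with three elements, which is why they can be added element by element.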

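2. Predictive Model

A predictive model can be very simple: it combines a machine learning algorithm with a dataset. The process of building a model is shown in the figure below: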

[Figure: the process of building a model]

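We first need to find the right data to use as input and apply some data-processing functions to it. The processed data is then combined with a machine learning algorithm to build the model. Once you have the model, you can use it as a predictor: feed it the data you want predictions for, and it produces the results. The whole procedure is shown in the figure below: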

[Figure: using the trained model as a predictor]

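In this article, the input is text and the output is a category. This kind of machine learning is called supervised learning, because the training dataset consists of texts that have already been labeled with their categories. It is also a classification task, and the model is built with a neural network.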

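3. Neural Networks

The main characteristic of a neural network is that it learns by itself (self-learning) rather than being explicitly programmed. It is inspired by the human central nervous system. The first neural network algorithm was the Perceptron.

To explain how a neural network works, the author builds a neural network architecture with TensorFlow.

Neural network architecture

The author uses two hidden layers. The job of each hidden layer is to transform its inputs into something the output layer can use [1]. The number of nodes in the first hidden layer has to be defined. These nodes are called neurons, and each input to a neuron is multiplied by a weight. The training phase adjusts these weights so that the network produces the correct output. The network also introduces a bias, which lets you shift the activation function to the left or right and so helps the predictions fit the data better [2]. The data then passes through an activation function, which determines each neuron's final output. Here the author uses the rectified linear unit (ReLU), which adds non-linearity. It is defined as: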

f(x) = max(0, x) (the output is x or 0, whichever is larger)

The second hidden layer takes the output of the first layer as its input and uses the same activation function.
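
The weighted sum, bias, and ReLU steps described above can be sketched in plain Python for a single tiny layer. This is only an illustration: the function names relu and dense_relu and all the numbers are made up for the example, and the real model performs the same arithmetic with matrix operations in TensorFlow.

```python
def relu(x):
    """Rectified linear unit: x if x is positive, otherwise 0."""
    return max(0.0, x)


def dense_relu(inputs, weights, biases):
    """One fully connected layer; weights[i][j] links input i to neuron j."""
    outputs = []
    for j in range(len(biases)):
        weighted_sum = sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        outputs.append(relu(weighted_sum + biases[j]))
    return outputs


inputs = [1.0, 2.0]            # two input features (illustrative values)
weights = [[0.5, -1.0],        # weights leaving input 0
           [0.25, -1.0]]       # weights leaving input 1
biases = [0.0, 1.0]            # one bias per neuron

hidden = dense_relu(inputs, weights, biases)
print(hidden)  # [1.0, 0.0]: the second neuron's sum is negative, so ReLU clips it
```

Feeding hidden into a second call of dense_relu with its own weights and biases would give the second hidden layer.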

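For the output layer, the author uses one-hot encoding to represent the result. In one-hot encoding, every bit is 0 except for a single bit that is 1. Three categories are used as an example here, as shown in the figure below.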

[Figure: one-hot encoding of the three categories]

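Notice that the number of output nodes equals the number of categories. To choose between the categories, we can use the softmax function, which turns each unit's output into a value between 0 and 1 and makes the outputs of all units sum to 1. The result tells us the probability of each category.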

[Figure: softmax turning the output values into per-category probabilities]
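
As a rough illustration of one-hot encoding and softmax, here is a plain-Python sketch. The helper names one_hot and softmax and the example logits are assumptions made for this example; subtracting the maximum before exponentiating is a common numerical-stability trick, not something the original post discusses.

```python
import math


def one_hot(index, n_classes):
    """Return a list of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]


def softmax(logits):
    """Map raw output values to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                 # made-up outputs of the three output nodes
probs = softmax(logits)
predicted = probs.index(max(probs))      # index of the most probable category

print(one_hot(1, 3))                     # [0.0, 1.0, 0.0]
print(probs, predicted)
```

The largest logit always receives the largest probability, so taking the index of the maximum gives the predicted category.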

The process described above is implemented by the following code:

# Network Parameters
n_hidden_1 = 10        # 1st layer number of features
n_hidden_2 = 5         # 2nd layer number of features
n_input = total_words  # Words in vocab
n_classes = 3          # Categories: graphics, space and baseball
def multilayer_perceptron(input_tensor, weights, biases):
    # First hidden layer with ReLU activation
    layer_1_multiplication = tf.matmul(input_tensor, weights['h1'])
    layer_1_addition = tf.add(layer_1_multiplication, biases['b1'])
    layer_1_activation = tf.nn.relu(layer_1_addition)
    # Second hidden layer with ReLU activation
    layer_2_multiplication = tf.matmul(layer_1_activation, weights['h2'])
    layer_2_addition = tf.add(layer_2_multiplication, biases['b2'])
    layer_2_activation = tf.nn.relu(layer_2_addition)
    # Output layer with linear activation
    out_layer_multiplication = tf.matmul(layer_2_activation, weights['out'])
    out_layer_addition = out_layer_multiplication + biases['out']
    return out_layer_addition

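Here the code calls matmul() to multiply matrices and add() to add the bias to the result.

4. How the Neural Network Is Trained

As we have seen, the key points are to build a reasonable architecture and to optimize the network's weights so that it predicts well. Next, we need to train the neural network in TensorFlow. In TensorFlow, weights and biases are stored in Variable objects. During training, we compare the network's output with the expected values and steer the weights toward the minimum loss. There are many ways to compute a loss; because this is a classification task, we should use the cross-entropy error. James D. McCaffrey [3] has argued that cross-entropy helps keep training from stalling. We use the cross-entropy error by calling tf.nn.softmax_cross_entropy_with_logits(), and we average it over the batch by calling tf.reduce_mean().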

# Construct model
prediction = multilayer_perceptron(input_tensor, weights, biases)
# Define loss
entropy_loss = tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=output_tensor)
loss = tf.reduce_mean(entropy_loss)
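
To show what the two calls above compute, here is a plain-Python sketch of softmax cross-entropy followed by a mean over the batch. The function name softmax_cross_entropy and the example logits and labels are hypothetical; this mirrors the math, not TensorFlow's actual implementation.

```python
import math


def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    shift = max(logits)  # log-sum-exp trick for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    log_probs = [v - log_sum_exp for v in logits]
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))


# A made-up batch of two examples with three categories each.
batch_logits = [[2.0, 0.0, 0.0],   # fairly confident in the correct class 0
                [0.0, 0.0, 0.0]]   # no preference at all
batch_labels = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]]

per_example = [softmax_cross_entropy(l, y)
               for l, y in zip(batch_logits, batch_labels)]
mean_loss = sum(per_example) / len(per_example)   # the role played by tf.reduce_mean

print(per_example, mean_loss)
```

The undecided second example has a loss of ln(3), while the confident and correct first example has a much smaller loss, which is the behavior training tries to encourage.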

We need to find the weight values that minimize the output error. Here we use a method based on stochastic gradient descent (SGD):

[Figure: gradient descent]

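Over many iterations, we obtain weights whose loss is close to the minimum (ideally the global minimum). The learning rate should not be too large. The Adaptive Moment Estimation (Adam) method is often used to carry out the gradient descent. This optimization algorithm keeps running averages of both the gradients and the second moments of the gradients [4].

The code is shown below. In other projects, the learning rate can be made dynamic to speed up training.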

learning_rate = 0.001
# Construct model
prediction = multilayer_perceptron(input_tensor, weights, biases)
# Define loss
entropy_loss = tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=output_tensor)
loss = tf.reduce_mean(entropy_loss)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
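
For intuition about what the Adam optimizer does, here is a minimal plain-Python sketch of the update rule applied to a single weight. The function name adam_minimize, the toy loss (w - 3)^2, and the hyperparameter values are all assumptions chosen for illustration; the default betas and epsilon follow the commonly quoted Adam settings, not anything stated in the original post.

```python
import math


def adam_minimize(grad_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=2000):
    """Minimize a one-dimensional function given its gradient, using Adam."""
    m = 0.0  # running average of the gradient (first moment)
    v = 0.0  # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# Toy loss (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

In the real model the same update is applied to every entry of every weight and bias Variable, with gradients obtained by backpropagation.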

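5. Data Manipulation

This part is also important for successful classification. Machine learning developers should pay close attention to the data: doing so saves a great deal of time and makes the results more accurate, because you avoid having to rework your configuration from scratch. Two points deserve attention here. First, create an index for every word. Second, create a matrix for every text, in which the entry for a word is 1 if the word appears in the text and 0 otherwise (strictly speaking, the code adds 1 per occurrence, so a repeated word gets its count). The following code illustrates the process: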

import numpy as np    #numpy is a package for scientific computing
from collections import Counter
vocab = Counter()
text = "Hi from Brazil"
#Get all words (lowercased, so that "Hi" and "hi" share one entry)
for word in text.split(' '):
    vocab[word.lower()]+=1

#Convert words to indexes
def get_word_2_index(vocab):
    word2index = {}
    for i,word in enumerate(vocab):
        word2index[word] = i

    return word2index
#Now we have an index
word2index = get_word_2_index(vocab)
total_words = len(vocab)
#This is how we create a numpy array (our matrix)
matrix = np.zeros((total_words),dtype=float)
#Now we fill the values
for word in text.split():
    matrix[word2index[word.lower()]] += 1
print(matrix)
>>> [ 1.  1.  1.]

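Python's Counter() is a hash table (a dict subclass designed for counting). When the input is "Hi from Brazil", the matrix is [1, 1, 1]. With a different input, such as "Hi", the matrix changes: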

matrix = np.zeros((total_words),dtype=float)
text = "Hi"
for word in text.split():
    matrix[word2index[word.lower()]] += 1
print(matrix)
>>> [ 1.  0.  0.]
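
One practical detail the snippets above do not cover is a word that was never indexed: looking it up directly raises a KeyError. The sketch below is one possible way to handle that, using only the standard library; the function names build_word_index and text_to_vector are hypothetical, and silently skipping unknown words is an assumption made for the example, not the original author's approach.

```python
from collections import Counter


def build_word_index(texts):
    """Assign an integer index to every lowercased word seen in the texts."""
    vocab = Counter()
    for text in texts:
        for word in text.lower().split():
            vocab[word] += 1
    return {word: i for i, word in enumerate(vocab)}


def text_to_vector(text, word2index):
    """Count known words per index; words outside the vocabulary are skipped."""
    vector = [0.0] * len(word2index)
    for word in text.lower().split():
        index = word2index.get(word)
        if index is not None:
            vector[index] += 1
    return vector


word2index = build_word_index(["Hi from Brazil"])
print(text_to_vector("Hi from Mars", word2index))  # [1.0, 1.0, 0.0]
```

The full program later in this article sidesteps the problem differently, by building the vocabulary from both the training and the test texts.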

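6. Running the Model and Getting Results

In this part, we use 20 Newsgroups as the dataset. It contains about 18,000 posts on 20 topics. We load the data with the scikit-learn library. The author uses three categories: comp.graphics, sci.space, and rec.sport.baseball. The dataset comes in two subsets, one for training and one for testing. Here is how to load it: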

from sklearn.datasets import fetch_20newsgroups
categories = ["comp.graphics","sci.space","rec.sport.baseball"]
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

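The loader follows a common pattern and is easy for developers to use.

In the experiment, the number of epochs is set to 10, which means the whole dataset goes through 10 forward-and-backward passes. In TensorFlow, a placeholder serves as the target of a feed: it is used to pass in the data for each run step.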

n_input = total_words # Words in vocab
n_classes = 3         # Categories: graphics, sci.space and baseball
input_tensor = tf.placeholder(tf.float32,[None, n_input],name="input")
output_tensor = tf.placeholder(tf.float32,[None, n_classes],name="output")

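We train on the data in batches. The first dimension of each placeholder is None so that the batch size can vary, which matters because we feed the dict with a larger batch when testing the model. The get_batch() function returns the texts for one batch of size batch_size. Next, we can run the model.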

training_epochs = 10
# Launch the graph
with tf.Session() as sess:
    sess.run(init) #inits the variables (normal distribution, remember?)
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(newsgroups_train.data)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x,batch_y = get_batch(newsgroups_train,i,batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            c,_ = sess.run([loss,optimizer], feed_dict={input_tensor: batch_x, output_tensor:batch_y})
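
The batch arithmetic in the loop above can be checked without TensorFlow. The sketch below reproduces the same index formula in plain Python; the function name batch_slices and the dataset size of 1000 are hypothetical and chosen only to make the numbers easy to follow.

```python
def batch_slices(n_examples, batch_size):
    """Return the (start, stop) slice bounds used for each training batch."""
    total_batch = int(n_examples / batch_size)  # same formula as the training loop
    return [(i * batch_size, i * batch_size + batch_size)
            for i in range(total_batch)]


# Hypothetical sizes: 1000 training texts and batches of 150.
slices = batch_slices(n_examples=1000, batch_size=150)
leftover = 1000 - slices[-1][1]

print(slices)    # six full batches, from (0, 150) to (750, 900)
print(leftover)  # 100 examples that never form a full batch
```

Because the number of batches is rounded down, any examples beyond the last full batch are not used in an epoch. That is usually acceptable for a demonstration, but it is worth knowing when the dataset is small.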

Next we build the test step and compute the model's accuracy.

    # Test model
    index_prediction = tf.argmax(prediction, 1)
    index_correct = tf.argmax(output_tensor, 1)
    correct_prediction = tf.equal(index_prediction, index_correct)
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    total_test_data = len(newsgroups_test.target)
    batch_x_test,batch_y_test = get_batch(newsgroups_test,0,total_test_data)
    print("Accuracy:", accuracy.eval({input_tensor: batch_x_test, output_tensor: batch_y_test}))
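
The accuracy computation above (argmax of the prediction, argmax of the label, compare, then average) can be sketched in plain Python as follows. The helper names argmax and accuracy and the four example rows are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(logits_batch, one_hot_labels):
    """Fraction of rows whose predicted class matches the labeled class."""
    correct = [float(argmax(logits) == argmax(label))
               for logits, label in zip(logits_batch, one_hot_labels)]
    return sum(correct) / len(correct)


# Made-up network outputs and one-hot labels for four test texts.
logits_batch = [[2.0, 0.5, 0.1],
                [0.2, 0.1, 3.0],
                [1.5, 0.3, 0.2],
                [0.1, 2.2, 0.4]]
labels = [[1, 0, 0],
          [0, 0, 1],
          [0, 1, 0],
          [0, 1, 0]]

print(accuracy(logits_batch, labels))  # 0.75: three of the four rows match
```

Softmax is not needed here, because it does not change which output is the largest.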

Then we get the result:

[Figure: output of the training and test run]

Conclusion

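This article showed how to use a neural network and TensorFlow for a text classification task and covered the basics behind the experiment. When I ran it myself, however, the results were not as good as the author's. We could probably improve on this architecture; for example, adding dropout to the hidden layers may improve accuracy.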
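
For readers unfamiliar with dropout, here is a plain-Python sketch of the idea in its common "inverted" form. The function name dropout, the keep probability of 0.5, and the example activations are assumptions for illustration; this is neither the original author's code nor a tuned setting.

```python
import random


def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each value with probability 1 - keep_prob,
    and scale the survivors by 1 / keep_prob."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)             # seeded so repeated runs drop the same units
hidden = [1.0, 2.0, 3.0, 4.0]      # made-up hidden-layer activations

dropped = dropout(hidden, keep_prob=0.5, rng=rng)
print(dropped)
```

Dropout is applied only during training; at test time the layer is used as is, which corresponds to a keep probability of 1.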

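Before running the code, make sure you have a recent TensorFlow 1.x release installed (the code relies on the 1.x graph API, such as tf.Session and tf.placeholder). Sometimes you may be unable to import the twenty_newsgroups dataset. If that happens, use the following code to work around the problem.

# If the twenty_newsgroups dataset has not been downloaded yet, loading it may fail with an error.
# Configuring logging can help resolve the error.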
import logging
logging.basicConfig()

Here is the complete code:

import pandas as pd
import numpy as np
import tensorflow as tf
from collections import Counter
from sklearn.datasets import fetch_20newsgroups
# If the twenty_newsgroups dataset has not been downloaded yet, loading it may fail with an error.
# Configuring logging can help resolve the error.
import logging
logging.basicConfig()

categories = ["comp.graphics","sci.space","rec.sport.baseball"]
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

print('total texts in train:',len(newsgroups_train.data))
print('total texts in test:',len(newsgroups_test.data))

vocab = Counter()
for text in newsgroups_train.data:
    for word in text.split(' '):
        vocab[word.lower()]+=1
        
for text in newsgroups_test.data:
    for word in text.split(' '):
        vocab[word.lower()]+=1


total_words = len(vocab)
def get_word_2_index(vocab):
    word2index = {}
    for i,word in enumerate(vocab):
        word2index[word.lower()] = i
        
    return word2index

word2index = get_word_2_index(vocab)

def get_batch(df,i,batch_size):
    batches = []
    results = []
    texts = df.data[i*batch_size:i*batch_size+batch_size]
    categories = df.target[i*batch_size:i*batch_size+batch_size]
    for text in texts:
        layer = np.zeros(total_words,dtype=float)
        for word in text.split(' '):
            layer[word2index[word.lower()]] += 1
            
        batches.append(layer)
        
    for category in categories:
        y = np.zeros((3),dtype=float)
        if category == 0:
            y[0] = 1.
        elif category == 1:
            y[1] = 1.
        else:
            y[2] = 1.
        results.append(y)
            
     
    return np.array(batches),np.array(results)

# Parameters
learning_rate = 0.01
training_epochs = 10
batch_size = 150
display_step = 1

# Network Parameters
n_hidden_1 = 100      # 1st layer number of features
n_hidden_2 = 100       # 2nd layer number of features
n_input = total_words # Words in vocab
n_classes = 3         # Categories: graphics, sci.space and baseball

input_tensor = tf.placeholder(tf.float32,[None, n_input],name="input")
output_tensor = tf.placeholder(tf.float32,[None, n_classes],name="output") 

def multilayer_perceptron(input_tensor, weights, biases):
    layer_1_multiplication = tf.matmul(input_tensor, weights['h1'])
    layer_1_addition = tf.add(layer_1_multiplication, biases['b1'])
    layer_1 = tf.nn.relu(layer_1_addition)
    
    # Hidden layer with RELU activation
    layer_2_multiplication = tf.matmul(layer_1, weights['h2'])
    layer_2_addition = tf.add(layer_2_multiplication, biases['b2'])
    layer_2 = tf.nn.relu(layer_2_addition)
    
    # Output layer 
    out_layer_multiplication = tf.matmul(layer_2, weights['out'])
    out_layer_addition = out_layer_multiplication + biases['out']
    
    return out_layer_addition

# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

# Construct model
prediction = multilayer_perceptron(input_tensor, weights, biases)

# Define loss and optimizer
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=output_tensor))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

# Initializing the variables
# (tf.initialize_all_variables() is deprecated in favor of this call)
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(newsgroups_train.data)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x,batch_y = get_batch(newsgroups_train,i,batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            c,_ = sess.run([loss,optimizer], feed_dict={input_tensor: batch_x,output_tensor:batch_y})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch+1), "loss=", \
                "{:.9f}".format(avg_cost))
    print("Optimization Finished!")

    # Test model
    correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(output_tensor, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    total_test_data = len(newsgroups_test.target)
    batch_x_test,batch_y_test = get_batch(newsgroups_test,0,total_test_data)
    print("Accuracy:", accuracy.eval({input_tensor: batch_x_test, output_tensor: batch_y_test}))

References:

[1] https://stats.stackexchange.com/questions/63152/what-does-the-hidden-layer-in-a-neural-network-compute

[2] http://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks

[3] https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/

[4] https://en.wikipedia.org/wiki/Stochastic_gradient_descent

Medium article link: https://medium.freecodecamp.org/big-picture-machine-learning-classifying-text-with-neural-networks-and-tensorflow-d94036ac2274
