Code Implementation (Recognizing Handwritten Digits with Machine Learning)

Posted by 限量版愛 on 2017-11-05

You can read the previous part here.

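I'm 薛銀亮. Thanks go to the original English online book, which I found to be an excellent introduction for newcomers while I was learning machine learning myself. In the spirit of sharing knowledge, I'm translating it for anyone who wants to understand machine learning. My translation skills are limited, so readers are welcome to raise questions and point out mistakes, and guidance from experts is even more welcome. Because the material is long, the translation is split into several parts; if you're interested, follow my collection, which I'll keep updating.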

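Using what we learned in the previous part, let's write a program that recognizes handwritten digits using stochastic gradient descent and the MNIST data set. If you haven't covered the prerequisites yet, please go back to the previous article first; you can follow my collection to get later updates. We'll use Python (2.7), and the program is only 74 lines long. Bear in mind, though, that if the goal is to learn the ideas behind machine learning and apply them to other fields, you shouldn't focus too much on the code, and you certainly shouldn't try to memorize it, because that serves no purpose.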

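The first thing we need is the MNIST data. If you're a git user (I won't explain what git is, since most programmers and technical researchers already know it), you can get the data by cloning the code repository: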

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git, you can also download the data and code here.

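A note on the data: in earlier articles I said that MNIST has 60,000 training images and 10,000 test images. That is the official MNIST description. Here we split the data a little differently. We leave the 10,000 test images as they are, but divide the 60,000 training images into two parts: a set of 50,000 images used for training, and a separate set of 10,000 images used as a validation set.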

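We'll also need a Python library called Numpy, which we use for its linear algebra routines. If you don't already have Numpy installed, you can get it here.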

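Let me first explain the structure of the code. The centerpiece is a Network class, which represents a neural network. Here is the code that initializes a Network object: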

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(sizes[:-1], sizes[1:])]

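Here sizes is a list containing the number of neurons in each layer of the network. For example, to create a network with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd write: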

net = Network([2, 3, 1])

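The biases and weights are all initialized randomly, using Numpy's np.random.randn function to generate Gaussian random numbers with mean 0 and variance 1. These random values are the starting point for stochastic gradient descent. Later chapters introduce better ways of initializing the weights and biases, but this will do for now. Note that the first layer of the network is the input layer, and it has no biases, because biases are only ever used in computing the outputs of later layers.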
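
To make the shapes concrete, here is a small sketch that repeats the two list comprehensions from __init__ outside the class for the [2, 3, 1] example. The variable names bias_shapes and weight_shapes are introduced here only for illustration.

```python
import numpy as np

# Layer sizes from the example above: 2 inputs, 3 hidden neurons, 1 output.
sizes = [2, 3, 1]

# Same construction as in Network.__init__.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

# One bias column vector per non-input layer,
# one weight matrix per pair of adjacent layers.
bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)
print(weight_shapes)
```

Each weight matrix has one row per neuron in the later layer and one column per neuron in the earlier layer, which is why the input layer needs no entry in biases.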

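All the biases and weights are stored as lists of Numpy matrices. For example, net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons (not the first and second layers, since Python list indexing starts at 0). Because net.weights[1] is rather verbose, let's just call that matrix $w$. It is defined so that $w_{jk}$ is the weight for the connection between the k-th neuron in the second layer and the j-th neuron in the third layer. With this ordering of the indices, the vector of activations of the third layer is obtained by applying the σ function elementwise:

$$a' = \sigma(w a + b) \tag{22}$$

Here $a$ is the vector of activations of the second layer of neurons. Written out component by component, it is easy to see that equation (22) agrees with equation (4).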

We define the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when z is a vector or a Numpy array, Numpy automatically applies the sigmoid function to each element. This is what vectorization means here.
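
As a quick check of what the vectorized form means, the sketch below computes one layer's activations twice: once with a single matrix expression, and once neuron by neuron with explicit loops. The weights, biases, and input values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, biases, and input activations for a 2 -> 3 layer.
w = np.array([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]])
b = np.array([[0.0], [0.1], [-0.1]])
a = np.array([[1.0], [2.0]])

# Vectorized form: one matrix product, then sigmoid on every element at once.
a_vec = sigmoid(np.dot(w, a) + b)

# Component-wise form: neuron j computes sigmoid(sum_k w[j, k] * a[k] + b[j]).
a_loop = np.zeros((3, 1))
for j in range(3):
    total = b[j, 0]
    for k in range(2):
        total += w[j, k] * a[k, 0]
    a_loop[j, 0] = sigmoid(total)

print(np.allclose(a_vec, a_loop))
```

The two results agree up to floating-point rounding; the vectorized version is simply shorter and lets Numpy do the looping.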

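We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output. All the method does is apply equation (22) for each layer: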

    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

Of course, the main thing we want our Network objects to do is learn. To that end we give them an SGD method that implements stochastic gradient descent. Here's the code:

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

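The training_data is a list of tuples (x, y) representing the training inputs and the corresponding desired outputs. The variables epochs and mini_batch_size are the number of epochs to train for and the size of the mini-batches to use when sampling. eta is the learning rate, η. If the optional argument test_data is supplied, the program evaluates the network after each epoch of training and prints the partial progress. This is useful for tracking progress, but slows things down substantially.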
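
To make the role of mini_batch_size concrete, here is a minimal sketch of the shuffle-and-slice step on its own. The ten-item training_data list is a made-up stand-in for the real MNIST tuples.

```python
import random

# Hypothetical stand-in for training_data: 10 (x, y) pairs.
training_data = [(i, i % 2) for i in range(10)]
mini_batch_size = 4

random.shuffle(training_data)  # reorder in place, as SGD does each epoch
n = len(training_data)
mini_batches = [training_data[k:k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]

batch_lengths = [len(batch) for batch in mini_batches]
print(batch_lengths)
```

Every example lands in exactly one mini-batch per epoch. When n is not a multiple of mini_batch_size the last mini-batch is simply shorter; with 50,000 training images and a mini-batch size of 10 that case does not arise.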

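In each epoch, the code starts by randomly shuffling the training data and then partitions it into mini-batches of the specified size (mini_batches). This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. That is done by the line self.update_mini_batch(mini_batch, eta), which updates the network's weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method: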

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by this line:

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

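This invokes the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases accordingly.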
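
Read as a formula, the last two statements of update_mini_batch amount to the update below. The symbol m for the mini-batch size, len(mini_batch), is notation chosen here for illustration; C_x is the cost for a single training example x, as in the backprop docstring.

```latex
w \rightarrow w - \frac{\eta}{m} \sum_{x \in \text{mini-batch}} \nabla_w C_x,
\qquad
b \rightarrow b - \frac{\eta}{m} \sum_{x \in \text{mini-batch}} \nabla_b C_x
```

The sums correspond to nabla_w and nabla_b, which accumulate the per-example gradients returned by self.backprop, and the factor eta/len(mini_batch) turns each sum into a learning-rate-scaled average.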

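I won't show the code for self.backprop right now. We'll study backpropagation and its implementation in a later chapter. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated with the training example x.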

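Let's look at the full program, including the documentation strings that I omitted above. The self.backprop method uses a helper function, sigmoid_prime, which computes the derivative of the σ function. You can get the gist of the self.cost_derivative method by reading the code and its comments; we'll explain it in detail in the next chapter. All the code can be downloaded here: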

"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
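
One way to convince yourself that sigmoid_prime really is the derivative of sigmoid is a finite-difference check. The sketch below uses scalar versions built on the standard math module rather than Numpy; the helper numerical_derivative and the step size h are assumptions made for this illustration and are not part of network.py.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def numerical_derivative(f, z, h=1e-6):
    """Central finite difference; h is an arbitrary small step."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Largest disagreement between the formula and the numerical estimate.
max_error = max(abs(sigmoid_prime(z) - numerical_derivative(sigmoid, z))
                for z in [-2.0, -0.5, 0.0, 0.5, 2.0])
print(max_error < 1e-8)
```

The same idea, comparing an analytic gradient with a numerical estimate, is a common way to test a backpropagation implementation.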

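So how well does the program work? Well, let's start by loading the MNIST data. We'll use a small helper module, mnist_loader.py, to do this, and run the following commands in a Python shell: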

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Next we set up a Network with 30 hidden neurons:

>>> import network
>>> net = network.Network([784, 30, 10])

Then we train for 30 epochs (epochs = 30), with a mini-batch size of 10 (mini_batch_size = 10) and a learning rate of η = 3.0:

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

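If you run the code as you read along, it may take some time to finish. I suggest you set things running, keep reading, and check the output from time to time. If you're in a hurry, you can speed things up by reducing the number of epochs, reducing the number of hidden neurons, or using only part of the training data. Keep in mind that this code was written to help you understand how neural networks work, not to be high-performance code. Of course, once we've trained a good network, it can be ported to run elsewhere, for example as JavaScript in a web page or inside an app, and there it will run quickly. As the output shows, after just a single epoch of training the network already identifies 9,129 of the 10,000 test images correctly: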

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

However, because the weights and biases are initialized randomly, your results may not match mine exactly.
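
If repeatable runs are wanted, one option is to seed both random number generators that network.py relies on before building the network. This is a sketch under that assumption; the helper initial_weights and the seed value 42 are made up for illustration and are not part of the original code.

```python
import random
import numpy as np

def initial_weights(seed):
    """Seed both generators, then draw weights the way Network.__init__ does."""
    random.seed(seed)      # controls random.shuffle in SGD
    np.random.seed(seed)   # controls np.random.randn in __init__
    return np.random.randn(3, 2)

first = initial_weights(42)
second = initial_weights(42)
print(np.array_equal(first, second))
```

Seeding makes two runs on the same machine start from the same weights and shuffle the data the same way; it does not make your numbers match the ones printed in this article.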

Let's change the number of hidden neurons to 100 and see what happens:

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

We find that the accuracy improves. At least in this case, using more hidden neurons helps us get better results.
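
One way to see what changes when the hidden layer grows is to count the parameters the network has to learn. The helper count_parameters below is a hypothetical function written for this illustration; it follows the weight and bias shapes set up in __init__.

```python
def count_parameters(sizes):
    """Number of weights plus biases for a fully connected network."""
    weights = sum(x * y for x, y in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

small = count_parameters([784, 30, 10])
large = count_parameters([784, 100, 10])
print(small)
print(large)
```

The 100-neuron network has 79,510 parameters against 23,860 for the 30-neuron one, which is also part of why it takes longer to train.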

Now suppose we lower the learning rate to η = 0.001:

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging:

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

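Changing the learning rate to 0.01 gives somewhat better results. More generally, when changing a parameter improves the results, try pushing it further in the same direction a few more times. That is how we eventually settle on a value that suits the problem.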

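In general, debugging a neural network can be difficult. This is especially true when the initial choice of parameters produces results no better than random guessing. Suppose we try a network with 30 hidden neurons and a learning rate of η = 100.0: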

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

This time the learning rate is too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

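At this point we would lower the learning rate to improve accuracy. But if we were coming to this problem for the first time, we might not immediately suspect that the learning rate was too high. Instead, we might suspect the network itself. Did we initialize the weights and biases in a way that makes it hard for the network to learn? Is there a problem with the training data? Did we train for too few epochs? Should we be using a different learning algorithm? With so many possibilities, when you first meet this situation you can't be sure what is causing it. I won't go into these questions here; they're covered in later articles. For now, the purpose is just to show the source code.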
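
The effect of the learning rate is easier to see on a problem small enough to follow by hand. The sketch below runs plain gradient descent on a one-parameter toy cost, C(w) = w**2. It is not the MNIST network, and the three learning rates are chosen only to show the three regimes.

```python
def run_gradient_descent(eta, steps=30, w=1.0):
    """Minimize the toy cost C(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return abs(w)

too_small = run_gradient_descent(0.001)   # barely moves from the start
reasonable = run_gradient_descent(0.3)    # shrinks quickly toward the minimum at 0
too_large = run_gradient_descent(1.5)     # overshoots further on every step
print(too_small)
print(reasonable)
print(too_large)
```

With the smallest rate the parameter is still close to its starting value after 30 steps, with the middle rate it is essentially at the minimum, and with the largest rate each step jumps past the minimum and lands further away than before.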

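Let's look at the details of how the MNIST data is loaded, as mentioned earlier. The source code is below. The data structures used to store the MNIST data are described in the documentation strings; it's straightforward stuff, tuples and lists of Numpy ndarray objects. If you're not familiar with ndarrays, think of them as vectors.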

"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.

    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
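
The loader uses two label formats: training labels are 10-dimensional unit vectors, while validation and test labels are plain integers. The short sketch below shows how the two relate, using the same np.argmax call that evaluate applies to the network's output. The digit 7 is an arbitrary example.

```python
import numpy as np

def vectorized_result(j):
    """Same helper as in mnist_loader: a (10, 1) unit vector for digit j."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

target = vectorized_result(7)
recovered = int(np.argmax(target))  # index of the largest entry
print(target.shape)
print(recovered)
```

During training the network's output is compared with the unit vector; during evaluation the index of the most active output neuron is compared with the integer label.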

We know that an image of a 2 will typically be a bit darker than an image of a 1, simply because more of its pixels are blackened.

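One suggestion is to use the training data to compute the average darkness for each digit from 0 to 9. When presented with a new image, we compute how dark it is and guess the digit whose average darkness is closest. This is not hard to implement, so I won't write out the code here; it's in the GitHub repository. This approach can improve accuracy compared with guessing blindly.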
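
The average-darkness idea can be sketched in a few lines. This is an illustration only, not the code from the GitHub repository: the function names average_darkness and guess_digit are made up here, and the four-pixel "images" are invented toy data standing in for 784-pixel MNIST images.

```python
from collections import defaultdict

def average_darkness(training_data):
    """Map each digit to the mean total darkness of its training images."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pixels, digit in training_data:
        totals[digit] += sum(pixels)
        counts[digit] += 1
    return {digit: totals[digit] / counts[digit] for digit in totals}

def guess_digit(pixels, averages):
    """Pick the digit whose average darkness is closest to this image's."""
    darkness = sum(pixels)
    return min(averages, key=lambda digit: abs(averages[digit] - darkness))

# Made-up 4-pixel "images" (values in [0, 1]); real MNIST images have 784 pixels.
toy_training_data = [
    ([0.0, 0.9, 0.0, 0.0], 1),
    ([0.0, 1.0, 0.1, 0.0], 1),
    ([0.8, 0.9, 0.7, 0.0], 2),
    ([0.9, 0.8, 0.9, 0.2], 2),
]
averages = average_darkness(toy_training_data)
guess = guess_digit([0.7, 0.9, 0.8, 0.1], averages)
print(guess)
```

Because the classifier looks at a single number per image, any two digits with similar average darkness will be confused with each other, which is why it serves only as a baseline.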

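But if you want accuracy that is as high as possible, we can use the support vector machine (SVM) algorithm. Don't worry, we don't need to understand the details of how SVMs work for now. We can simply use the scikit-learn library, which provides a convenient Python interface to a C-based SVM implementation. The code is available here. The result shows that the SVM does better than our algorithm, which is a little discouraging, so in later parts we'll improve our algorithm until it is more accurate than the SVM.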

SVMs also have many tunable parameters. If you're interested, see this blog post by Andreas Mueller.