Keras: A Deep Learning Library Based on Theano and TensorFlow

Posted by Andrew.Hann on 2017-02-27

Contents

1. Introduction
2. Some Basic Concepts
3. The Sequential Model
4. The Functional Model
5. Common Layers
6. Convolutional Layers
7. Pooling Layers
8. Recurrent Layers
9. Embedding Layer

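1. Introduction

Keras is a high-level neural network library written in pure Python that runs on top of TensorFlow or Theano. Its main features are:

Easy and fast prototyping (Keras is highly modular, minimal, and extensible)
Support for CNNs and RNNs, or combinations of the two
Support for arbitrary connectivity schemes (including multi-input and multi-output training)
Seamless switching between CPU and GPU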

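0x1: Keras design principles

1. Modularity: A model can be understood as a standalone sequence or graph of fully configurable modules that can be freely combined at minimal cost. Specifically, network layers, loss functions, optimizers, initialization schemes, activation functions, and regularization methods are all independent modules that you can use to build your own models.
2. Minimalism: Each module should be as simple as possible. Every piece of code should be intuitive on first reading. There is no black magic, because it hurts iteration and innovation.
3. Easy extensibility: Adding a new module is very easy: just write a new class or function modeled on the existing ones. The ease of creating new modules makes Keras well suited for advanced research.
4. Working with Python: Keras has no separate model configuration file format. Models are described in Python code, which makes them more compact, easier to debug, and easy to extend.

0x2: Quick start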

sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran
pip install scipy

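The core data structure of Keras is the "model", a way of organizing network layers. The main model type in Keras is the Sequential model, a stack of network layers arranged in order.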

from keras.models import Sequential

model = Sequential()

Stack network layers with .add() to build up a model:

from keras.layers import Dense, Activation

model.add(Dense(output_dim=64, input_dim=100))
model.add(Activation("relu"))
model.add(Dense(output_dim=10))
model.add(Activation("softmax"))

Once the model is built, compile it with the .compile() method:

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

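When compiling a model you must specify a loss function and an optimizer; you can also define your own loss function if needed. A core idea of Keras is to be simple and easy to use while still giving users full control: you can customize your own models and layers, or even modify the source code.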

from keras.optimizers import SGD
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01, momentum=0.9, nesterov=True))

After compiling, train the model by iterating over the training data in batches for a given number of epochs:

model.fit(X_train, Y_train, nb_epoch=5, batch_size=32)

Alternatively, you can feed batches to the network manually, in which case use:

model.train_on_batch(X_batch, Y_batch)

You can then evaluate the model in one line to check whether its metrics meet your requirements:

loss_and_metrics = model.evaluate(X_test, Y_test, batch_size=32)

Or use the model to make predictions on new data:

classes = model.predict_classes(X_test, batch_size=32)
proba = model.predict_proba(X_test, batch_size=32)

Relevant Link:

https://github.com/fchollet/keras
http://playground.tensorflow.org/#activation=tanh&regularization=L1&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0.001&noise=45&networkShape=4,5&seed=0.75320&showTestData=true&discretize=true&percTrainData=50&x=true&y=true&xTimesY=true&xSquared=true&ySquared=true&cosX=false&sinX=true&cosY=false&sinY=true&collectStats=false&problem=classification&initZero=false&hideText=false

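2. Some Basic Concepts

0x1: Symbolic computation

Keras runs on top of Theano or TensorFlow, which are called the Keras backends. Both Theano and TensorFlow are "symbolic" libraries.
As a result, programming with Keras differs somewhat from writing ordinary Python code. Broadly speaking, symbolic computation first defines the variables and then builds a "computation graph" that specifies how those variables are related. The graph must then be compiled to fix its internal details, but at that point it is still an "empty shell" that holds no actual data. Only when you feed in the inputs does data flow through the model and produce output values.
Keras models are built the same way: once you finish building a Keras model it is an empty shell, and real data only flows after a callable function has been generated (K.function) and given input data.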
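The "define the graph first, feed data later" idea can be illustrated without any backend. The following is a minimal pure-Python sketch, not how Theano or TensorFlow are implemented; the class names Placeholder, Add, and Mul are hypothetical and exist only for this illustration.

```python
# A toy "symbolic" graph: nodes only describe a computation.
# Nothing is computed until evaluate() is called with concrete inputs.

class Placeholder:
    """A named input whose value is supplied later."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) + self.right.evaluate(feed)


class Mul:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) * self.right.evaluate(feed)


# Step 1: declare variables and build the graph (an "empty shell", no data yet).
x = Placeholder("x")
w = Placeholder("w")
b = Placeholder("b")
y = Add(Mul(w, x), b)          # describes y = w * x + b

# Step 2: data flows through the graph only when inputs are fed in.
result = y.evaluate({"x": 3.0, "w": 2.0, "b": 1.0})
print(result)  # 7.0
```

The same graph object y can be evaluated again with different inputs, which is the role a compiled backend function plays in Keras.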

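0x2: Tensors

The term "tensor" is used for uniformity of description. A tensor can be seen as the natural generalization of vectors and matrices, and we use tensors to represent a wide range of data.
The smallest tensor is the rank-0 tensor, a scalar, which is a single number.
Arranging numbers in order gives a rank-1 tensor, a vector.
Arranging a set of vectors in order gives a rank-2 tensor, a matrix.
Stacking matrices gives a rank-3 tensor, which can be thought of as a cube. A color image with 3 color channels is such a cube.
The rank of a tensor is sometimes also called its number of dimensions or axes. For example, the matrix [[1,2],[3,4]] is a rank-2 tensor with two dimensions or axes. Along axis 0 (to match Python's counting convention, dimensions and axes in this document are numbered from 0) you see the two vectors [1,2] and [3,4]; along axis 1 you see the two vectors [1,3] and [2,4].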

import numpy as np

a = np.array([[1,2],[3,4]])
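sum0 = np.sum(a, axis=0)  # collapse axis 0 (add the rows together): [4 6]
sum1 = np.sum(a, axis=1)  # collapse axis 1 (add the columns together): [3 7]

print(sum0)
print(sum1)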
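To make the rank terminology concrete, here is a small NumPy sketch that builds one tensor of each rank described above and inspects its ndim and shape. The 32x32 image size is an arbitrary choice for illustration.

```python
import numpy as np

scalar = np.array(5.0)                 # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])     # rank 1: numbers arranged in order
matrix = np.array([[1, 2], [3, 4]])    # rank 2: vectors arranged in order
image = np.zeros((3, 32, 32))          # rank 3: 3 color channels of 32x32 pixels (size is hypothetical)

for name, tensor in [("scalar", scalar), ("vector", vector),
                     ("matrix", matrix), ("image", image)]:
    print(name, "rank:", tensor.ndim, "shape:", tensor.shape)

# Walking along axis 0 of the matrix yields its rows,
# walking along axis 1 yields its columns.
rows = [matrix[i, :] for i in range(matrix.shape[0])]
cols = [matrix[:, j] for j in range(matrix.shape[1])]
```

Here rows holds [1, 2] and [3, 4], and cols holds [1, 3] and [2, 4], matching the description of axis 0 and axis 1.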

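0x3: The functional model

Earlier versions of Keras had two kinds of model:

1. Sequential, the sequential model: single input, single output, one straight path from start to finish. Layers are connected only to their neighbors, with no skip connections at all. This kind of model compiles quickly and is simple to work with.
2. Graph, the graph model: it supports multiple inputs and multiple outputs, and layers can be connected in any way, but it compiles slowly. Sequential is really a special case of Graph.

In the current version of Keras the Graph model has been removed and the "functional model API" added, which further emphasizes that Sequential is a special case. The general model is simply called Model, and Sequential remains as a shortcut when a simple linear stack is all you need.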

Relevant Link:

http://keras-cn.readthedocs.io/en/latest/getting_started/concepts/

3. The Sequential Model

A Sequential model is a linear stack of layers.
You can construct one by passing a list of layers to Sequential:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([
Dense(32, input_dim=784),
Activation('relu'),
Dense(10),
Activation('softmax'),
])

You can also add layers one at a time with the .add() method:

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))

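0x1: Specifying the input shape

The model needs to know the shape of its input data, so the first layer of a Sequential model must receive an argument describing the input shape. Subsequent layers infer the shapes of intermediate data automatically and do not need this argument. There are several ways to specify the input shape for the first layer:

1. Pass an input_shape keyword argument to the first layer. input_shape is a tuple whose entries may be None, meaning that any positive integer is acceptable at that position. The batch size is not included.
2. Pass a batch_input_shape keyword argument to the first layer, which does include the batch size. This is useful for specifying a fixed batch size, for example in stateful RNNs. Internally, Keras converts input_shape into batch_input_shape by prepending None.
3. Some 2D layers, such as Dense, let you specify the input shape implicitly through input_dim. Some 3D temporal layers accept input_dim and input_length.

The following three ways of specifying the input shape are strictly equivalent: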

model = Sequential()
model.add(Dense(32, input_shape=(784,)))

model = Sequential()
model.add(Dense(32, batch_input_shape=(None, 784)))
# note that batch dimension is "None" here,
# so the model will be able to process batches of any size.

model = Sequential()
model.add(Dense(32, input_dim=784))

The following three methods are also strictly equivalent:

from keras.layers import LSTM

model = Sequential()
model.add(LSTM(32, input_shape=(10, 64)))

model = Sequential()
model.add(LSTM(32, batch_input_shape=(None, 10, 64)))

model = Sequential()
model.add(LSTM(32, input_length=10, input_dim=64))

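0x2: The Merge layer

Multiple Sequential models can be combined into a single output through a Merge layer. The output of Merge is a layer object that can be added to a new Sequential model. The following example merges two Sequential branches and applies a softmax Dense layer on top to produce the final output: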

from keras.layers import Merge

left_branch = Sequential()
left_branch.add(Dense(32, input_dim=784))

right_branch = Sequential()
right_branch.add(Dense(32, input_dim=784))

merged = Merge([left_branch, right_branch], mode='concat')

final_model = Sequential()
final_model.add(merged)
final_model.add(Dense(10, activation='softmax'))

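The Merge layer supports several predefined merge modes:

sum (default): element-wise sum
concat: tensor concatenation; use the concat_axis keyword argument to choose the axis to concatenate along
mul: element-wise multiplication
ave: tensor average
dot: tensor product; use the dot_axes keyword argument to choose the axes to contract
cos: cosine proximity between the vectors in 2D tensors (matrices)

This two-branch model can be trained with the following code: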

final_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
final_model.fit([input_data_1, input_data_2], targets)  # we pass one data array per model input

You can also pass a function as the mode keyword argument of Merge to implement an arbitrary transformation, for example:

merged = Merge([left_branch, right_branch], mode=lambda x: x[0] - x[1])
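What each merge mode does to a pair of tensors can be illustrated with plain NumPy. This is only a sketch of the arithmetic on two small 2D arrays, assuming one sample per row; it does not call Keras, and the exact output shapes Keras produces for dot and cos may differ from the ones chosen here.

```python
import numpy as np

# Two hypothetical branch outputs: a batch of 2 samples with 2 features each.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

merged_sum = a + b                                  # 'sum': element-wise addition
merged_mul = a * b                                  # 'mul': element-wise multiplication
merged_ave = (a + b) / 2.0                          # 'ave': element-wise average
merged_concat = np.concatenate([a, b], axis=-1)     # 'concat' along the last axis -> shape (2, 4)

# 'dot': contract the feature axis, giving one number per sample.
merged_dot = np.sum(a * b, axis=1, keepdims=True)   # shape (2, 1)

# 'cos': dot product divided by the product of the vector norms, per sample.
norms = (np.linalg.norm(a, axis=1, keepdims=True) *
         np.linalg.norm(b, axis=1, keepdims=True))
merged_cos = merged_dot / norms                     # shape (2, 1)

# A custom mode, matching the lambda x: x[0] - x[1] example.
merged_custom = (lambda x: x[0] - x[1])([a, b])

print(merged_concat.shape, merged_dot.ravel(), merged_cos.ravel())
```

Note that sum, mul, ave, and the subtraction lambda require both inputs to have the same shape, while concat only requires agreement on the axes that are not concatenated.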

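For complex models that cannot be built from Sequential and Merge, see the functional model API.

0x3: Compilation

Before training a model, configure the learning process with compile, which takes three arguments:

1. optimizer: the name of a predefined optimizer, such as rmsprop or adagrad, or an instance of the Optimizer class
2. loss: the objective function the model tries to minimize. It can be the name of a predefined loss function, such as categorical_crossentropy or mse, or a loss function itself
3. metrics: a list of metrics. For classification problems this is usually set to metrics=['accuracy']. A metric can be the name of a predefined metric or a user-defined function. A metric function should return a single tensor, or a dictionary mapping metric_name -> metric_value

The metrics list determines which evaluation results are reported for the model.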

# for a multi-class classification problem
model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])

# for a binary classification problem
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])

# for a mean squared error regression problem
model.compile(optimizer='rmsprop',
loss='mse')

# for custom metrics
import keras.backend as K

def mean_pred(y_true, y_pred):
    return K.mean(y_pred)

def false_rates(y_true, y_pred):
    false_neg = ...
    false_pos = ...
    return {
        'false_neg': false_neg,
        'false_pos': false_pos,
    }

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy', mean_pred, false_rates])

0x4: Training

Keras takes Numpy arrays as input data and labels. Models are usually trained with the fit function:

# for a single-input model with 2 classes (binary):
model = Sequential()
model.add(Dense(1, input_dim=784, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# generate dummy data
import numpy as np
data = np.random.random((1000, 784))
labels = np.random.randint(2, size=(1000, 1))

# train the model, iterating on the data in batches
# of 32 samples
model.fit(data, labels, nb_epoch=10, batch_size=32)

Another example:

# for a multi-input model with 10 classes:

left_branch = Sequential()
left_branch.add(Dense(32, input_dim=784))

right_branch = Sequential()
right_branch.add(Dense(32, input_dim=784))

merged = Merge([left_branch, right_branch], mode='concat')

model = Sequential()
model.add(merged)
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# generate dummy data
import numpy as np
from keras.utils.np_utils import to_categorical
data_1 = np.random.random((1000, 784))
data_2 = np.random.random((1000, 784))

# these are integers between 0 and 9
labels = np.random.randint(10, size=(1000, 1))
# we convert the labels to a binary matrix of size (1000, 10)
# for use with categorical_crossentropy
labels = to_categorical(labels, 10)

# train the model
# note that we are passing a list of Numpy arrays as training data
# since the model has 2 inputs
model.fit([data_1, data_2], labels, nb_epoch=10, batch_size=32)
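The label conversion used above turns integer class labels into a binary matrix with one column per class. The following NumPy sketch shows the idea; one_hot is a hypothetical helper written for this illustration, not the Keras to_categorical implementation.

```python
import numpy as np

def one_hot(labels, nb_classes):
    """Convert integer labels of shape (n,) or (n, 1) to a binary matrix (n, nb_classes)."""
    labels = np.asarray(labels, dtype=int).ravel()
    encoded = np.zeros((labels.shape[0], nb_classes))
    # For row i, set the column given by labels[i] to 1.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

labels = np.array([[0], [2], [9]])   # three samples with classes 0, 2 and 9
binary = one_hot(labels, 10)
print(binary.shape)   # (3, 10)
print(binary[1])      # the row for class 2 has a single 1 at index 2
```

Each row sums to 1, which is the target format categorical_crossentropy expects alongside a softmax output.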

0x5: Some examples

1. MLP for multi-class softmax classification

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
# Dense(64) is a fully-connected layer with 64 hidden units.
# in the first layer, you must specify the expected input data shape:
# here, 20-dimensional vectors.
model.add(Dense(64, input_dim=20, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(10, init='uniform'))
model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

model.fit(X_train, y_train,
          nb_epoch=20,
          batch_size=16)
score = model.evaluate(X_test, y_test, batch_size=16)

2. An alternative implementation of a similar MLP

model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

3. MLP for binary classification

model = Sequential()
model.add(Dense(64, input_dim=20, init='uniform', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

4. VGG-like convolutional network

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import SGD

model = Sequential()
# input: 100x100 images with 3 channels -> (3, 100, 100) tensors.
# this applies 32 convolution filters of size 3x3 each.
model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(3, 100, 100)))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Convolution2D(64, 3, 3, border_mode='valid'))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
# Note: Keras does automatic shape inference.
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(10))
model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)

model.fit(X_train, Y_train, batch_size=32, nb_epoch=1)

5. Sequence classification with LSTM

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, 256, input_length=maxlen))
model.add(LSTM(output_dim=128, activation='sigmoid', inner_activation='hard_sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_train, Y_train, batch_size=16, nb_epoch=10)
score = model.evaluate(X_test, Y_test, batch_size=16)

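6. Stacked LSTM for sequence classification

In this model we stack three LSTM layers so that the model can learn higher-level temporal representations.
The first two LSTMs return their full output sequences, while the third returns only the last step of its output sequence, dropping the temporal dimension (that is, converting the input sequence into a single vector).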

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

data_dim = 16
timesteps = 8
nb_classes = 10

# expected input data shape: (batch_size, timesteps, data_dim)
model = Sequential()
model.add(LSTM(32, return_sequences=True,
               input_shape=(timesteps, data_dim)))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32))  # return a single vector of dimension 32
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# generate dummy training data
x_train = np.random.random((1000, timesteps, data_dim))
y_train = np.random.random((1000, nb_classes))

# generate dummy validation data
x_val = np.random.random((100, timesteps, data_dim))
y_val = np.random.random((100, nb_classes))

model.fit(x_train, y_train,
          batch_size=64, nb_epoch=5,
          validation_data=(x_val, y_val))

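7. The same model with stateful LSTM

A stateful LSTM is one whose internal state (memory) after processing a batch is reused as the initial state for the next batch. This lets us process longer sequences while keeping computational complexity manageable.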

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

data_dim = 16
timesteps = 8
nb_classes = 10
batch_size = 32

# expected input batch shape: (batch_size, timesteps, data_dim)
# note that we have to provide the full batch_input_shape since the network is stateful.
# the sample of index i in batch k is the follow-up for the sample i in batch k-1.
model = Sequential()
model.add(LSTM(32, return_sequences=True, stateful=True,
               batch_input_shape=(batch_size, timesteps, data_dim)))
model.add(LSTM(32, return_sequences=True, stateful=True))
model.add(LSTM(32, stateful=True))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# generate dummy training data
x_train = np.random.random((batch_size * 10, timesteps, data_dim))
y_train = np.random.random((batch_size * 10, nb_classes))

# generate dummy validation data
x_val = np.random.random((batch_size * 3, timesteps, data_dim))
y_val = np.random.random((batch_size * 3, nb_classes))

model.fit(x_train, y_train,
          batch_size=batch_size, nb_epoch=5,
          validation_data=(x_val, y_val))
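The effect of carrying state across batches can be seen with a much simpler recurrence than an LSTM. The sketch below uses a plain tanh RNN cell in NumPy, with hypothetical random weights, to show that processing a long sequence in two chunks while passing the state along gives the same final state as processing it in one go. It illustrates the idea only and is not the Keras implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(4, 3) * 0.5   # input-to-hidden weights (3 features -> 4 units), hypothetical
U = rng.randn(4, 4) * 0.5   # hidden-to-hidden weights, hypothetical

def run_rnn(sequence, h0):
    """Run a simple tanh recurrence over a sequence and return the final state."""
    h = h0
    for x_t in sequence:
        h = np.tanh(W.dot(x_t) + U.dot(h))
    return h

long_sequence = rng.randn(16, 3)   # 16 timesteps, 3 features
zero_state = np.zeros(4)

# Reference: the whole sequence processed at once.
h_full = run_rnn(long_sequence, zero_state)

# "Stateful": two chunks of 8 timesteps, the second starts from the first chunk's final state.
h_first = run_rnn(long_sequence[:8], zero_state)
h_stateful = run_rnn(long_sequence[8:], h_first)

# "Stateless": the second chunk starts from a fresh zero state, so earlier context is lost.
h_stateless = run_rnn(long_sequence[8:], zero_state)

print(np.allclose(h_full, h_stateful))   # True
```

This is why the stateful example requires a fixed batch_input_shape: sample i of batch k must line up with sample i of batch k-1 for the carried state to make sense.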

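8. Two merged LSTM encoders for classification over two parallel sequences

Two input sequences are encoded into feature vectors by two separate LSTMs.
The two feature vectors are concatenated and passed through a fully connected network to produce the result.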

from keras.models import Sequential
from keras.layers import Merge, LSTM, Dense
import numpy as np

data_dim = 16
timesteps = 8
nb_classes = 10

encoder_a = Sequential()
encoder_a.add(LSTM(32, input_shape=(timesteps, data_dim)))

encoder_b = Sequential()
encoder_b.add(LSTM(32, input_shape=(timesteps, data_dim)))

decoder = Sequential()
decoder.add(Merge([encoder_a, encoder_b], mode='concat'))
decoder.add(Dense(32, activation='relu'))
decoder.add(Dense(nb_classes, activation='softmax'))

decoder.compile(loss='categorical_crossentropy',
                optimizer='rmsprop',
                metrics=['accuracy'])

# generate dummy training data
x_train_a = np.random.random((1000, timesteps, data_dim))
x_train_b = np.random.random((1000, timesteps, data_dim))
y_train = np.random.random((1000, nb_classes))

# generate dummy validation data
x_val_a = np.random.random((100, timesteps, data_dim))
x_val_b = np.random.random((100, timesteps, data_dim))
y_val = np.random.random((100, nb_classes))

decoder.fit([x_train_a, x_train_b], y_train,
            batch_size=64, nb_epoch=5,
            validation_data=([x_val_a, x_val_b], y_val))

Relevant Link:

http://www.jianshu.com/p/9dc9f41f0b29
http://keras-cn.readthedocs.io/en/latest/getting_started/sequential_model/

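4. The Functional Model

The Keras functional model API is the way to define complex models such as multi-output models, directed acyclic graphs, or models with shared layers.

1. A layer object takes a tensor as its argument and returns a tensor. Mathematically a tensor is just a generalization of familiar data structures: a rank-1 tensor is a vector, a rank-2 tensor is a matrix, and a rank-3 tensor is a cube. Here "tensor" is simply a general way to describe data. For example, a color image is a rank-3 tensor made of the pixel values of three stacked channels, and a dataset of 10000 color images is a rank-4 tensor.
2. A framework whose inputs and outputs are tensors is a model.
3. Such a model can be trained just like a Keras Sequential model.

For example, this fully connected network: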

from keras.layers import Input, Dense
from keras.models import Model

# this returns a tensor
inputs = Input(shape=(784,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# this creates a model that includes
# the Input layer and three Dense layers
model = Model(input=inputs, output=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels)  # starts training

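0x1: All models are callable, just like layers

With the functional model API it is easy to reuse trained models: you can treat a model as if it were a layer and call it on a tensor. Note that when you call a model you reuse not only its architecture but also its weights.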

x = Input(shape=(784,))
# this works, and returns the 10-way softmax we defined above.
y = model(x)

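This makes it quick to build models that process sequences. You can turn an image classification model into a video classification model with a single line of code: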

from keras.layers import TimeDistributed

# input tensor for sequences of 20 timesteps,
# each containing a 784-dimensional vector
input_sequences = Input(shape=(20, 784))

# this applies our previous model to every timestep in the input sequences.
# the output of the previous model was a 10-way softmax,
# so the output of the layer below will be a sequence of 20 vectors of size 10.
processed_sequences = TimeDistributed(model)(input_sequences)

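0x2: Multi-input and multi-output models

A typical use of the functional model is building models with multiple inputs and outputs.
Consider the following model. We want to predict how many retweets and likes a news headline will receive on Twitter. The main input is the headline itself, as a sequence of words, but we can also have auxiliary inputs such as the date the headline was posted. The loss of this model has two parts: the auxiliary loss evaluates predictions based on the headline alone, and the main loss evaluates predictions based on the headline plus the extra information. Even if gradients from the main loss vanish, the signal from the auxiliary loss can still train the Embedding and LSTM layers. Using the main loss function earlier in a model is a good regularization mechanism for deep networks.

[Figure: block diagram of the two-input, two-output model]

Let's implement this model with the functional API.
The main input receives the headline as a sequence of integers (each integer encodes a word). The integers lie between 1 and 10,000 (a vocabulary of 10,000 words), and each sequence is 100 words long.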

from keras.layers import Input, Embedding, LSTM, Dense, merge
from keras.models import Model

# headline input: meant to receive sequences of 100 integers, between 1 and 10000.
# note that we can name any layer by passing it a "name" argument.
main_input = Input(shape=(100,), dtype='int32', name='main_input')

# this embedding layer will encode the input sequence
# into a sequence of dense 512-dimensional vectors.
x = Embedding(output_dim=512, input_dim=10000, input_length=100)(main_input)

# a LSTM will transform the vector sequence into a single vector,
# containing information about the entire sequence
lstm_out = LSTM(32)(x)

Next we insert an auxiliary loss so that the LSTM and Embedding layers can be trained smoothly even when the main loss is high:

auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)

Then we concatenate the LSTM output with the auxiliary input data and feed the result into the rest of the model:

auxiliary_input = Input(shape=(5,), name='aux_input')
x = merge([lstm_out, auxiliary_input], mode='concat')

# we stack a deep fully-connected network on top
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)

# and finally we add the main logistic regression layer
main_output = Dense(1, activation='sigmoid', name='main_output')(x)

Finally, we define the whole model with 2 inputs and 2 outputs:

model = Model(input=[main_input, auxiliary_input], output=[main_output, auxiliary_output])

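With the model defined, the next step is to compile it. We give the auxiliary loss a weight of 0.2. The keyword arguments loss and loss_weights set different loss functions or weights for different outputs, and both accept a Python list or dictionary. Here we pass a single loss function to loss, so it is applied to every output: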

model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              loss_weights=[1., 0.2])
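The total loss minimized with loss_weights=[1., 0.2] is the weighted sum of the per-output losses. The NumPy sketch below works through that arithmetic on made-up predictions; the prediction values and the binary_crossentropy helper are hypothetical and only illustrate the formula.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch, with clipping to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
main_pred = np.array([0.9, 0.2, 0.8, 0.1])   # hypothetical main_output predictions
aux_pred = np.array([0.6, 0.4, 0.7, 0.5])    # hypothetical aux_output predictions

main_loss = binary_crossentropy(y_true, main_pred)
aux_loss = binary_crossentropy(y_true, aux_pred)

# Weighted sum with the same weights as loss_weights=[1., 0.2].
total_loss = 1.0 * main_loss + 0.2 * aux_loss
print(main_loss, aux_loss, total_loss)
```

With a weight of 0.2 the auxiliary output still contributes gradient to the shared Embedding and LSTM layers, but the main output dominates the objective.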

After compiling, we train the model by passing in the training data and targets:

model.fit([headline_data, additional_data], [labels, labels],
          nb_epoch=50, batch_size=32)

Because the inputs and outputs are named (we passed a "name" argument when defining them), we can also compile and train the model like this:

model.compile(optimizer='rmsprop',
              loss={'main_output': 'binary_crossentropy', 'aux_output': 'binary_crossentropy'},
              loss_weights={'main_output': 1., 'aux_output': 0.2})

# and trained it via:
model.fit({'main_input': headline_data, 'aux_input': additional_data},
          {'main_output': labels, 'aux_output': labels},
          nb_epoch=50, batch_size=32)

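0x3: Shared layers

Another good use of the functional model is with shared layers.
Consider a dataset of tweets. We want to build a model that tells whether two tweets come from the same user, which can also be used to measure the similarity of a user's tweets.
One approach is to build a model that maps each of the two tweets to a feature vector, concatenates the vectors, and adds a logistic regression layer that outputs the probability that the two tweets share an author. The training data for this model is pairs of tweets.
Because the problem is symmetric, the model that processes the first tweet can be reused for the second, so we use a shared LSTM layer for the mapping.
First, we turn each tweet into a matrix of shape (140, 256): a tweet has 140 characters, and each character is represented by a 256-dimensional one-hot vector in which an element is 1 if the corresponding character is present and 0 otherwise.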

from keras.layers import Input, LSTM, Dense, merge
from keras.models import Model

tweet_a = Input(shape=(140, 256))
tweet_b = Input(shape=(140, 256))

To share a layer across different inputs, instantiate the layer once and call it multiple times:

# this layer can take as input a matrix
# and will return a vector of size 64
shared_lstm = LSTM(64)

# when we reuse the same layer instance
# multiple times, the weights of the layer
# are also being reused
# (it is effectively *the same* layer)
encoded_a = shared_lstm(tweet_a)
encoded_b = shared_lstm(tweet_b)

# we can then concatenate the two vectors:
merged_vector = merge([encoded_a, encoded_b], mode='concat', concat_axis=-1)

# and add a logistic regression on top
predictions = Dense(1, activation='sigmoid')(merged_vector)

# we define a trainable model linking the
# tweet inputs to the predictions
model = Model(input=[tweet_a, tweet_b], output=predictions)

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([data_a, data_b], labels, nb_epoch=10)

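0x4: The concept of a layer "node"

Whenever you call a layer on an input you create a new tensor (the layer's output) and also add a "(computation) node" to the layer. The node maps the input tensor to the output tensor. When you call the layer multiple times it has multiple nodes, indexed 0, 1, 2...

0x5: More examples

1. Inception module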

from keras.layers import merge, Convolution2D, MaxPooling2D, Input

input_img = Input(shape=(3, 256, 256))

tower_1 = Convolution2D(64, 1, 1, border_mode='same', activation='relu')(input_img)
tower_1 = Convolution2D(64, 3, 3, border_mode='same', activation='relu')(tower_1)

tower_2 = Convolution2D(64, 1, 1, border_mode='same', activation='relu')(input_img)
tower_2 = Convolution2D(64, 5, 5, border_mode='same', activation='relu')(tower_2)

tower_3 = MaxPooling2D((3, 3), strides=(1, 1), border_mode='same')(input_img)
tower_3 = Convolution2D(64, 1, 1, border_mode='same', activation='relu')(tower_3)

output = merge([tower_1, tower_2, tower_3], mode='concat', concat_axis=1)

2. Residual connection on a convolution layer (Residual Network)

from keras.layers import merge, Convolution2D, Input

# input tensor for a 3-channel 256x256 image
x = Input(shape=(3, 256, 256))
# 3x3 conv with 3 output channels(same as input channels)
y = Convolution2D(3, 3, 3, border_mode='same')(x)
# this returns x + y.
z = merge([x, y], mode='sum')

3. Shared vision model

This model reuses the same image-processing module on two inputs to decide whether two MNIST digits are the same digit.

from keras.layers import merge, Convolution2D, MaxPooling2D, Input, Dense, Flatten
from keras.models import Model

# first, define the vision modules
digit_input = Input(shape=(1, 27, 27))
x = Convolution2D(64, 3, 3)(digit_input)
x = Convolution2D(64, 3, 3)(x)
x = MaxPooling2D((2, 2))(x)
out = Flatten()(x)

vision_model = Model(digit_input, out)

# then define the tell-digits-apart model
digit_a = Input(shape=(1, 27, 27))
digit_b = Input(shape=(1, 27, 27))

# the vision model will be shared, weights and all
out_a = vision_model(digit_a)
out_b = vision_model(digit_b)

concatenated = merge([out_a, out_b], mode='concat')
out = Dense(1, activation='sigmoid')(concatenated)

classification_model = Model([digit_a, digit_b], out)

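4. Visual question answering model (question-based image CAPTCHA)

Given a natural-language question about a picture, this model produces a one-word answer about the picture.
It maps the question and the picture to feature vectors, merges the two, and trains a logistic regression layer on top to pick one answer from a set of possible answers.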

from keras.layers import Convolution2D, MaxPooling2D, Flatten
from keras.layers import Input, LSTM, Embedding, Dense, merge
from keras.models import Model, Sequential

# first, let's define a vision model using a Sequential model.
# this model will encode an image into a vector.
vision_model = Sequential()
vision_model.add(Convolution2D(64, 3, 3, activation='relu', border_mode='same', input_shape=(3, 224, 224)))
vision_model.add(Convolution2D(64, 3, 3, activation='relu'))
vision_model.add(MaxPooling2D((2, 2)))
vision_model.add(Convolution2D(128, 3, 3, activation='relu', border_mode='same'))
vision_model.add(Convolution2D(128, 3, 3, activation='relu'))
vision_model.add(MaxPooling2D((2, 2)))
vision_model.add(Convolution2D(256, 3, 3, activation='relu', border_mode='same'))
vision_model.add(Convolution2D(256, 3, 3, activation='relu'))
vision_model.add(Convolution2D(256, 3, 3, activation='relu'))
vision_model.add(MaxPooling2D((2, 2)))
vision_model.add(Flatten())

# now let's get a tensor with the output of our vision model:
image_input = Input(shape=(3, 224, 224))
encoded_image = vision_model(image_input)

# next, let's define a language model to encode the question into a vector.
# each question will be at most 100 word long,
# and we will index words as integers from 1 to 9999.
question_input = Input(shape=(100,), dtype='int32')
embedded_question = Embedding(input_dim=10000, output_dim=256, input_length=100)(question_input)
encoded_question = LSTM(256)(embedded_question)

# let's concatenate the question vector and the image vector:
merged = merge([encoded_question, encoded_image], mode='concat')

# and let's train a logistic regression over 1000 words on top:
output = Dense(1000, activation='softmax')(merged)

# this is our final model:
vqa_model = Model(input=[image_input, question_input], output=output)

# the next stage would be training this model on actual data.

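5. Video question answering model

Having built the image QA model, we can quickly turn it into a video QA model. With appropriate training, you can show the model a short video (say 100 frames) and ask a question about it, such as "what sport is the boy playing?" -> "football".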

from keras.layers import TimeDistributed

video_input = Input(shape=(100, 3, 224, 224))
# this is our video encoded via the previously trained vision_model (weights are reused)
encoded_frame_sequence = TimeDistributed(vision_model)(video_input)  # the output will be a sequence of vectors
encoded_video = LSTM(256)(encoded_frame_sequence)  # the output will be a vector

# this is a model-level representation of the question encoder, reusing the same weights as before:
question_encoder = Model(input=question_input, output=encoded_question)

# let's use it to encode the question:
video_question_input = Input(shape=(100,), dtype='int32')
encoded_video_question = question_encoder(video_question_input)

# and this is our video question answering model:
merged = merge([encoded_video, encoded_video_question], mode='concat')
output = Dense(1000, activation='softmax')(merged)
video_qa_model = Model(input=[video_input, video_question_input], output=output)

Relevant Link:

http://wiki.jikexueyuan.com/project/tensorflow-zh/resources/dims_types.html

5. Common Layers

0x1: Dense layer

Dense is the ordinary fully connected layer.

keras.layers.core.Dense(
    output_dim, 
    init='glorot_uniform', 
    activation='linear', 
    weights=None, 
    W_regularizer=None, 
    b_regularizer=None, 
    activity_regularizer=None, 
    W_constraint=None, 
    b_constraint=None, 
    bias=True, 
    input_dim=None
)

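1. output_dim: integer greater than 0, the output dimension of the layer. For fully connected layers other than the first, the input dimension is inferred automatically and does not need to be specified.
2. init: initialization method, either the name of a predefined initializer or a Theano function used to initialize the weights. Only relevant when the weights argument is not passed.
3. activation: activation function, either the name of a predefined activation (see the activations documentation) or an element-wise Theano function. If not specified, no activation is applied (that is, the linear activation a(x)=x).
4. weights: the weights, as a list of numpy arrays. The list should contain a weight matrix of shape (input_dim, output_dim) and a bias vector of shape (output_dim,).
5. W_regularizer: regularizer applied to the weights, a WeightRegularizer object
6. b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
7. activity_regularizer: regularizer applied to the output, an ActivityRegularizer object
8. W_constraint: constraint applied to the weights, a Constraints object
9. b_constraint: constraint applied to the bias, a Constraints object
10. bias: boolean, whether to include a bias vector (that is, whether the layer applies an affine or a purely linear transformation to its input)
11. input_dim: integer, the dimension of the input data. When Dense is the first layer of a network, either this argument or input_shape must be specified.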

after the first layer, you don't need to specify the size of the input anymore

0x2: Activation layer

The Activation layer applies an activation function to the output of a layer.

keras.layers.core.Activation(activation) 

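activation: the activation function to use, either the name of a predefined activation or a TensorFlow/Theano function

0x3: Dropout layer

Applies Dropout to the input. During training, Dropout randomly disconnects a fraction (p) of the input units at each parameter update. It is used to prevent overfitting.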

keras.layers.core.Dropout(p) 

p: float between 0 and 1, the fraction of input units to drop
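The random disconnection can be sketched in NumPy as a binary mask applied to the activations. The version below assumes "inverted" dropout, where the surviving units are scaled by 1/(1-p) during training so that nothing needs to change at inference time; that scaling choice and the dropout helper itself are assumptions for illustration, not a description of the Keras source.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero out roughly a fraction p of the entries of x during training."""
    if not training or p == 0.0:
        return x                                   # inference: pass the input through unchanged
    keep_mask = rng.uniform(size=x.shape) >= p     # True for units that are kept
    return x * keep_mask / (1.0 - p)               # rescale so the expected value is unchanged

rng = np.random.RandomState(42)
activations = np.ones((1000, 64))                  # hypothetical layer output

dropped = dropout(activations, p=0.5, rng=rng)
fraction_zeroed = float(np.mean(dropped == 0.0))
print(fraction_zeroed)                             # close to 0.5

unchanged = dropout(activations, p=0.5, rng=rng, training=False)
```

A fresh mask is drawn on every call, which corresponds to a different random subset of units being dropped at each parameter update.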

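0x4: Flatten layer

The Flatten layer "flattens" its input, turning a multi-dimensional input into a one-dimensional one. It is commonly used in the transition from convolutional layers to fully connected layers. Flatten does not affect the batch size.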

keras.layers.core.Flatten() 

model = Sequential()
model.add(Convolution2D(64, 3, 3, border_mode='same', input_shape=(3, 32, 32)))
# now: model.output_shape == (None, 64, 32, 32)

model.add(Flatten())
# now: model.output_shape == (None, 65536)

0x5: Reshape layer

The Reshape layer converts the input to a given shape.

keras.layers.core.Reshape(target_shape) 

target_shape: the target shape, a tuple of integers that does not include the sample (batch) dimension

# as first layer in a Sequential model
model = Sequential()
model.add(Reshape((3, 4), input_shape=(12,)))
# now: model.output_shape == (None, 3, 4)
# note: `None` is the batch dimension

# as intermediate layer in a Sequential model
model.add(Reshape((6, 2)))
# now: model.output_shape == (None, 6, 2)

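0x6: Permute layer

The Permute layer rearranges the dimensions of the input according to a given pattern. It can be useful, for example, when connecting RNNs and CNNs.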

keras.layers.core.Permute(dims) 

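dims: tuple of integers giving the permutation pattern, not including the sample dimension. Indexing starts at 1. For example, (2, 1) moves the second dimension of the input to the first dimension of the output and the first dimension of the input to the second.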
 
model = Sequential()
model.add(Permute((2, 1), input_shape=(10, 64)))
# now: model.output_shape == (None, 64, 10)
# note: `None` is the batch dimension

0x7: RepeatVector layer

The RepeatVector layer repeats its input n times.

keras.layers.core.RepeatVector(n) 

n: integer, the number of repetitions

model = Sequential()
model.add(Dense(32, input_dim=32))
# now: model.output_shape == (None, 32)
# note: `None` is the batch dimension

model.add(RepeatVector(3))
# now: model.output_shape == (None, 3, 32)
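The four shape-manipulating layers above (Flatten, Reshape, Permute, RepeatVector) each have a direct NumPy analogue, which can help when checking expected output shapes by hand. The sketch below mirrors the shapes from the examples in this section, with a hypothetical batch size of 2 standing in for the None batch dimension.

```python
import numpy as np

batch = 2  # hypothetical batch size standing in for `None`

# Flatten: (batch, 64, 32, 32) -> (batch, 65536); the batch axis is untouched.
conv_out = np.zeros((batch, 64, 32, 32))
flat = conv_out.reshape(batch, -1)

# Reshape((3, 4)) then Reshape((6, 2)): the target shape excludes the batch axis.
x = np.arange(batch * 12).reshape(batch, 12)
reshaped = x.reshape(batch, 3, 4).reshape(batch, 6, 2)

# Permute((2, 1)): dims are 1-based and skip the batch axis,
# so this swaps NumPy axes 1 and 2.
seq = np.zeros((batch, 10, 64))
permuted = np.transpose(seq, (0, 2, 1))

# RepeatVector(3): (batch, 32) -> (batch, 3, 32).
vec = np.arange(batch * 32, dtype=float).reshape(batch, 32)
repeated = np.repeat(vec[:, np.newaxis, :], 3, axis=1)

print(flat.shape, reshaped.shape, permuted.shape, repeated.shape)
```

In every case the total number of elements per sample is preserved, except for RepeatVector, which multiplies it by n.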

0x8: Merge layer

The Merge layer merges a list of tensors into a single tensor according to a given mode.

keras.engine.topology.Merge(
    layers=None, 
    mode='sum', 
    concat_axis=-1, 
    dot_axes=-1, 
    output_shape=None, 
    node_indices=None, 
    tensor_indices=None, 
    name=None
)

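1. layers: a list of Keras tensors or a list of Keras layer objects. The list must contain more than one element.
2. mode: merge mode, either the name of a predefined mode, a lambda, or a regular function. A lambda or function must take a list of tensors as input and return a single tensor. A string must be one of:
"sum", "mul", "concat", "ave", "cos", "dot"
3. concat_axis: integer, the axis to concatenate along when mode=concat
4. dot_axes: integer or tuple of integers, the axes to contract when mode=dot
5. output_shape: tuple of integers, or a lambda/function (when mode is a function). If output_shape is a function, it takes a list of input shapes (one per input) and returns the shape of the output tensor.
6. node_indices: optional list of integers. If some layers have multiple output nodes, this specifies the indices of the nodes to merge. If not provided, it defaults to all zeros, that is, the output of node 0 of each input layer is merged.
7. tensor_indices: optional list of integers. If some layers return multiple output tensors, this specifies which tensors to merge.

When merging, think carefully about which mode to use and which axis to merge along, because this strongly affects how the network trains.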

0x9: Lambda layer

Applies an arbitrary Theano/TensorFlow expression to the output of the previous layer.

keras.layers.core.Lambda(
    function, 
    output_shape=None, 
    arguments={}
) 

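1. function: the function to apply. It takes a single argument, the output of the previous layer
2. output_shape: the shape the function is expected to return, either a tuple or a function that computes the output shape from the input shape
3. arguments: optional dictionary of additional keyword arguments to pass to the function

0x10: ActivityRegularization layer

Data passes through this layer unchanged, but the loss function is updated based on the activation values.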

keras.layers.core.ActivityRegularization(l1=0.0, l2=0.0) 

l1: L1 regularization factor (positive float)
l2: L2 regularization factor (positive float)

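0x11: Masking layer

Masks an input sequence using a given value to mark timesteps that should be skipped.
For each timestep of the input tensor (dimension 1, counting from 0), if all values at that timestep equal mask_value, the timestep is skipped (masked) in all subsequent layers of the model that support masking.
If a subsequent layer does not support masking but receives masked data, an exception is raised.

Suppose the input x is a tensor of shape (samples, timesteps, features) to be fed into an LSTM layer, and you want to mask timesteps 3 and 5 because you have no data for them. You should:

Set x[:,3,:] = 0. and x[:,5,:] = 0.
Insert a Masking layer with mask_value=0. before the LSTM layer:
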
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(32))
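The rule "a timestep is masked when every feature at that timestep equals mask_value" can be checked in NumPy. This is a sketch of the rule as described above, with hypothetical small dimensions; it does not call Keras.

```python
import numpy as np

samples, timesteps, features = 2, 6, 4      # hypothetical sizes
x = np.ones((samples, timesteps, features))

# Mark timesteps 3 and 5 as missing by filling them with the mask value.
x[:, 3, :] = 0.0
x[:, 5, :] = 0.0

mask_value = 0.0
# A timestep is kept if at least one feature differs from mask_value,
# and skipped only if all of its features equal mask_value.
mask = np.any(x != mask_value, axis=-1)     # shape (samples, timesteps)

print(mask[0])   # True for timesteps 0, 1, 2, 4 and False for timesteps 3 and 5
```

One consequence of this rule is that a genuine timestep whose features all happen to equal mask_value would also be skipped, so the mask value should be chosen to be a value that real data never takes.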

0x12: Highway layer

The Highway layer implements a fully connected Highway network, an adaptation of the LSTM gating idea to feed-forward networks.

keras.layers.core.Highway(
    init='glorot_uniform', 
    transform_bias=-2, 
    activation='linear', 
    weights=None, 
    W_regularizer=None, 
    b_regularizer=None, 
    activity_regularizer=None, 
    W_constraint=None, 
    b_constraint=None, 
    bias=True, 
    input_dim=None
)

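output_dim: integer greater than 0, the output dimension of the layer. For fully connected layers that are not the first layer of a model the input dimension is inferred automatically, so it does not need to be specified. (Note that output_dim does not appear in the signature shown above.)
init: initialization method; the name of a predefined initializer as a string, or a Theano function used to initialize the weights. Only meaningful when the weights argument is not passed.
activation: activation function; the name of a predefined activation (see the activations documentation) or an element-wise Theano function. If not specified, no activation is applied (i.e. the linear activation a(x)=x)
weights: list of numpy arrays. The list should contain a weight matrix of shape (input_dim, output_dim) and a bias vector of shape (output_dim,).
W_regularizer: regularizer applied to the weights, a WeightRegularizer object
b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
activity_regularizer: regularizer applied to the output, an ActivityRegularizer object
W_constraint: constraint applied to the weights, a Constraints object
b_constraint: constraint applied to the bias, a Constraints object
bias: boolean, whether to include a bias vector (i.e. whether the layer applies a linear or an affine transformation to its input)
input_dim: integer, the dimensionality of the input. When this layer is the first layer of the network, either this argument or input_shape must be specified.
transform_bias: initial value of the transform gate bias, -2 by default (see the reference literature for the meaning of this parameter)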
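
To make the role of transform_bias more concrete, here is a scalar sketch of one highway unit, assuming the usual highway formulation y = T(x) * H(x) + (1 - T(x)) * x with a sigmoid transform gate T. The function highway_unit and its argument names are illustrative, not part of Keras.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_unit(x, w_h, b_h, w_t, b_t=-2.0, activation=lambda v: v):
    """Scalar highway unit: y = T * H + (1 - T) * x."""
    h = activation(w_h * x + b_h)   # candidate output H(x)
    t = sigmoid(w_t * x + b_t)      # transform gate T(x), between 0 and 1
    return t * h + (1.0 - t) * x    # the remainder of x is carried through


# With the gate weight at zero, the gate value is set by the bias alone.
initial_gate = sigmoid(-2.0)
y = highway_unit(x=1.0, w_h=3.0, b_h=0.0, w_t=0.0)
```

Under these assumptions a bias of -2 gives an initial gate of about 0.12, so at the start of training the unit mostly copies its input and only gradually learns to transform it.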

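0x13: MaxoutDense layer

A fully connected Maxout layer. MaxoutDense outputs the element-wise maximum of the outputs of nb_feature linear Dense(input_dim, output_dim) layers. This lets it learn a convex, piecewise linear activation function over its input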
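
A minimal pure-Python sketch of the maxout idea: several linear maps are applied to the same input and the element-wise maximum is taken. The function maxout_dense and the toy weights are illustrative assumptions.

```python
def maxout_dense(x, weights, biases):
    """x: list of length input_dim.
    weights: list of matrices, each input_dim x output_dim (one per linear piece).
    biases: list of vectors, each of length output_dim.
    """
    piece_outputs = []
    for W, b in zip(weights, biases):
        out = [
            sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))
        ]
        piece_outputs.append(out)
    # Element-wise maximum across the linear pieces.
    return [max(column) for column in zip(*piece_outputs)]


x = [1.0, -1.0]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],     # piece 1: identity
    [[-1.0, 0.0], [0.0, -1.0]],   # piece 2: negation
]
biases = [[0.0, 0.0], [0.0, 0.0]]
y = maxout_dense(x, weights, biases)
```

With these two pieces the layer computes max(x, -x), i.e. the absolute value, which is one example of a convex piecewise linear function.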

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/layers/core_layer/

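6. Convolutional layers

Data input layer: preprocesses the data, for example mean subtraction (centering every input dimension at 0 so that large offsets in the data do not hurt training), normalization (scaling all data to the same range), PCA/whitening and so on

In the middle are
CONV: convolution layer, a linear product followed by a sum (inner product)
RELU: activation layer (activation function), which turns a vector into a single "magnitude" used to judge how well the current parameters classify
POOL: pooling layer; in short, takes the average or the maximum over a region

On the far right is
FC: fully connected layer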

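0x0: The convolution layer of a CNN

1. Core CNN concept: filtering

In communications, filtering (wave filtering) means removing specific frequency bands from a signal, and is an important measure for suppressing and preventing interference. In CNN image recognition, taking the inner product (element-wise multiplication followed by summation) of the image (the data in successive data windows) with a filter matrix (a fixed set of weights; because each neuron's weights are fixed, it can be seen as a constant filter) is the so-called "convolution" operation, which is where convolutional neural networks get their name.
Intuitively, this extracts the "important details" from a region (whose size is the size of the filter). The extraction works by building "region weights" that pick out the key details of the region
More concretely, for the car image in the figure above, the filter's job is to pick out the tires, the side mirrors, the outline of the front end and the shape of the A-pillars, viewing an unstructured image in terms of its edge details
The theoretical basis is the view among researchers that human visual recognition is also hierarchical: the eye first picks up the outline details of an object and passes them to the cerebral cortex, which then abstracts on top of those details to build an overall perception of the object

Loosely speaking, the part framed in red in the figure above can be understood as one filter, i.e. a neuron with a fixed set of weights. Stacking multiple filters forms a convolutional layer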

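2. Convolution on images

In the computation shown in the figure below, the input is data of a given region size (width*height); taking its inner product with a filter (a neuron with a fixed set of weights) yields new two-dimensional data.

Specifically, the left side is the image input and the middle part is the filter (a neuron with a fixed set of weights). Different filters produce different output data, such as color intensity or contours. In other words, to extract different features of an image, use different filters, each extracting a specific kind of information about the image: color intensity or contours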
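
The window-by-window inner product can be sketched in a few lines of plain Python. This is an illustrative single-channel version without padding (the kernel is not flipped, as is usual in CNNs); the function conv2d_valid and the toy image and filter are assumptions made for the example.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide the kernel over a 2D image (list of rows) and take the inner
    product of the kernel with each data window."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - kh + 1, stride):
        row = []
        for c in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw)
            ))
        out.append(row)
    return out


# Dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A filter that responds to a dark-to-bright transition from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]
response = conv2d_valid(image, vertical_edge)
```

The response is non-zero only in the column where the window straddles the edge, which is the sense in which a filter "extracts" one kind of feature.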

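3. CNN filters

In a CNN, a filter (a neuron with a fixed set of weights) performs the convolution on local input data. After the local data in one data window has been processed, the window keeps sliding until all the data has been covered

In the figure we can see

Two neurons, i.e. depth=2, meaning there are two filters.
The data window moves two positions at a time and takes 3*3 local data, i.e. stride=2.
zero-padding=1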
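
The relationship between input size, filter size, stride and zero-padding can be written as a small helper using the common formula (W - F + 2P) / S + 1. The function name is hypothetical, and the 5*5 input in the first example is an assumption, since the figure with the actual input size is not reproduced here.

```python
def conv_output_size(input_size, filter_size, stride=1, zero_padding=0):
    """Number of window positions along one spatial dimension."""
    span = input_size - filter_size + 2 * zero_padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1


# Assumed 5*5 input with the 3*3 filter, stride=2, zero-padding=1 from the text.
small = conv_output_size(5, 3, stride=2, zero_padding=1)

# A 32*32 input with an 11*11 filter, stride 1 and no padding.
large = conv_output_size(32, 11)
```

With depth=2 the assumed 5*5 case would produce two 3*3 output maps, one per filter.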

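Convolution is then performed by sliding each of the two filters over the array, giving two different sets of results. This sliding-window filtering gradually extracts the detailed information in the image (edge contours, intensity). Two points are worth noting

1. Local receptive fields
The data on the left keeps changing, and each time the filter convolves only one local data window. This is the local perception mechanism of CNNs.
By analogy, a filter is like a pair of eyes. Human vision is limited: at a glance we see only part of the world. Seeing the whole world at once would be exhausting, and the brain could not take in all that information at the same time. Even within a local view, the eyes have their own emphasis and preferences. For example, when looking at an attractive person, attention concentrates on the face, chest and legs, so these three inputs carry relatively larger weights

2. Parameter (weight) sharing
As the data window slides, the data fed to the filter changes, but the weights of the filter in the middle, Filter w0 (the weights connecting each neuron to the data window), stay fixed. These unchanging weights are the parameter (weight) sharing mechanism of CNNs.
To use another analogy, someone traveling around the world sees changing scenery, but the eyes collecting the information stay the same. A person's perception of scenery is stable over a given period of time. Note, however, that the weights are not fixed forever: as training proceeds, the weights in the network are adjusted continually according to the error of the network's output (this is the BP backpropagation algorithm)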

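4. CNN activation layer

Commonly used nonlinear activation functions include sigmoid, tanh and relu. The first two, sigmoid/tanh, are more common in fully connected layers, while relu is common in convolutional layers

The sigmoid activation function

g(z) = 1 / (1 + e^(-z))

where z is a linear combination, for example z = b + w1*x1 + w2*x2

The horizontal axis is the domain z and the vertical axis is the range g(z). The sigmoid function squashes a real number into the interval between 0 and 1. When z is a very large positive number, g(z) approaches 1; when z is a very large negative number, g(z) approaches 0
The output of the activation function can therefore be read as a "classification probability": an output of 0.9, for example, can be interpreted as a 90% probability that the sample is positive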
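
A small sketch of the sigmoid applied to a linear combination of inputs. The function neuron_output and the example weights are illustrative assumptions; the sigmoid is written in a form that avoids overflow for large negative z.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), computed without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neuron_output(x, w, b):
    """Sigmoid of the linear combination z = b + sum_i w_i * x_i."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)


prob = neuron_output(x=[1.0, 2.0], w=[0.5, 0.25], b=-1.0)   # here z = 0
```

An input with z = 0 sits exactly on the decision boundary, so under the probability reading the sample is equally likely to be positive or negative.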

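The ReLU activation layer

ReLU, f(x) = max(0, x), has the advantages of fast convergence and a simple gradient

5. CNN pooling layer

Pooling, in short, takes the average or the maximum over a region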
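
Pooling over non-overlapping regions can be sketched as follows. The function pool2d and the 4*4 toy feature map are illustrative; the window size and stride are both assumed to be 2.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Reduce each non-overlapping size*size region to its max or its mean."""
    if mode == "max":
        reduce_window = max
    else:
        reduce_window = lambda window: sum(window) / len(window)
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[r + i][c + j]
                for i in range(size) for j in range(size)
            ]
            row.append(reduce_window(window))
        out.append(row)
    return out


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
```

Both variants halve each spatial dimension; they differ only in how each 2*2 region is summarized.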

Next, a real CNN is used to explain how a CNN is put together

1. Input layer of NxN pixels (N=32).
2. Convolutional layer (64 filter maps of size 11x11).
3. Max-pooling layer.
4. Densely-connected layer (4096 neurons)
5. Output layer. 9 neurons.

The input is a set of 32*32 images. The dimensions of the data at each layer are explained below

1. input layer: 32x32 neurons 
2. convolutional layer(64 filters, size 11x11): (32−11+1)∗(32−11+1) = 22∗22 = 484 for each feature map. As a result, the total output of the convolutional layer is 22∗22∗64 = 30976. 
3. pooling layer(2x2 regions): reduced to 11∗11∗64 = 7744.
4. fully-connected layer: 4096 neurons
5. output layer

The number of learnable parameters P of this network is:

P = 1024∗(11∗11∗64)+64+(11∗11∗64)∗4096+4096+4096∗9+9 = 39690313
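
The layer sizes and the total above can be checked with a few lines of arithmetic. The second count is added only for comparison and rests on assumptions not stated in the text: a single input channel and convolution weights fully shared across positions, in which case the convolutional layer alone would contribute 11*11*64 + 64 parameters.

```python
n, f, k = 32, 11, 64            # input side, filter side, number of filters
fc, classes = 4096, 9           # dense layer width, output neurons

conv_side = n - f + 1           # 22
conv_out = conv_side ** 2 * k   # 22*22*64
pool_side = conv_side // 2      # 2x2 pooling regions
pool_out = pool_side ** 2 * k   # 11*11*64

# The expression exactly as written in the text.
p_text = 1024 * (f * f * k) + k + pool_out * fc + fc + fc * classes + classes

# Comparison under the stated assumptions (one input channel, shared weights).
conv_shared = f * f * 1 * k + k
p_shared = conv_shared + pool_out * fc + fc + fc * classes + classes
```

In either count the dense layer term (11*11*64)*4096 dominates, which is why the fully connected part of such a network holds most of the parameters.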

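Note the convolutional layer (layer 2): it can be understood as looking at the same image with different points of focus (as the filter window moves), producing views of the image's details from different perspectives

0x1: Convolution1D layer

A one-dimensional convolution layer, used for neighborhood filtering over a one-dimensional input signal. When this layer is used as the first layer, the keyword argument input_dim or input_shape must be provided. For example, input_dim=128 denotes input sequences of 128-dimensional vectors, and input_shape=(10,128) denotes a sequence of 10 vectors of 128 dimensions each (for a byte-frequency feature vector of a code segment this would be input_shape=(15000, 256))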

keras.layers.convolutional.Convolution1D(
    nb_filter, 
    filter_length, 
    init='uniform', 
    activation='linear', 
    weights=None, 
    border_mode='valid', 
    subsample_length=1, 
    W_regularizer=None, 
    b_regularizer=None, 
    activity_regularizer=None, 
    W_constraint=None, 
    b_constraint=None, 
    bias=True, 
    input_dim=None, 
    input_length=None
)
 
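1. nb_filter: number of convolution kernels (i.e. the output dimension) (filters can be used to reduce the dimensionality coming out of the CNN input layer and lower the computational cost)
2. filter_length: spatial or temporal length of the convolution kernel
3. init: initialization method; the name of a predefined initializer as a string, or a Theano function used to initialize the weights. Only meaningful when the weights argument is not passed.
4. activation: activation function; the name of a predefined activation (see the activations documentation) or an element-wise Theano function. If not specified, no activation is applied (i.e. the linear activation a(x)=x)
5. weights: list of numpy arrays. The list should contain a weight matrix of shape (input_dim, output_dim) and a bias vector of shape (output_dim,).
6. border_mode: border mode, one of "valid", "same" or "full"; "full" requires the Theano backend
7. subsample_length: factor by which the output is subsampled relative to the input
8. W_regularizer: regularizer applied to the weights, a WeightRegularizer object
9. b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
10. activity_regularizer: regularizer applied to the output, an ActivityRegularizer object
11. W_constraint: constraint applied to the weights, a Constraints object
12. b_constraint: constraint applied to the bias, a Constraints object
13. bias: boolean, whether to include a bias vector (i.e. whether the layer applies a linear or an affine transformation to its input)
14. input_dim: integer, the dimensionality of the input. When this layer is the first layer of the network, either this argument or input_shape must be specified.
15. input_length: when the input sequences have a fixed length, this is that length. It must be specified if a Flatten layer followed by a Dense layer is connected after this layer; otherwise the output shape of the dense layer cannot be computed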

example

# apply a convolution 1d of length 3 to a sequence with 10 timesteps,
# with 64 output filters
model = Sequential()
model.add(Convolution1D(64, 3, border_mode='same', input_shape=(10, 32)))
# now model.output_shape == (None, 10, 64)

# add a new conv1d on top
model.add(Convolution1D(32, 3, border_mode='same'))
# now model.output_shape == (None, 10, 32)

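Convolution1D can be seen as a shortcut for Convolution2D: performing a 1D convolution on the (10, 32) signal in the example is equivalent to performing a 2D convolution with a kernel of size (filter_length, 32)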
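
The output shapes quoted in the example can be reproduced with a small helper that maps border_mode and subsample_length to an output length. The helper names below are hypothetical, written only to illustrate the usual rule; each filter is assumed to span filter_length timesteps across all input channels.

```python
def conv1d_output_length(input_length, filter_length, border_mode="valid",
                         subsample_length=1):
    """Length of the time axis after a 1D convolution."""
    if border_mode == "same":
        length = input_length
    elif border_mode == "valid":
        length = input_length - filter_length + 1
    elif border_mode == "full":
        length = input_length + filter_length - 1
    else:
        raise ValueError("unknown border_mode: %r" % border_mode)
    # Subsampling keeps every subsample_length-th position (rounded up).
    return (length + subsample_length - 1) // subsample_length


def conv1d_param_count(input_dim, nb_filter, filter_length, bias=True):
    """Each filter covers filter_length timesteps of input_dim channels."""
    return nb_filter * filter_length * input_dim + (nb_filter if bias else 0)


same_len = conv1d_output_length(10, 3, "same")     # as in the example above
valid_len = conv1d_output_length(10, 3, "valid")
first_layer_params = conv1d_param_count(input_dim=32, nb_filter=64, filter_length=3)
```

The parameter count makes the (filter_length, 32) equivalence visible: every one of the 64 filters holds 3*32 weights plus one bias.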

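0x2: AtrousConvolution1D layer

The AtrousConvolution1D layer filters a 1D signal using atrous convolution, also called dilated convolution or convolution with holes. When this layer is used as the first layer, the keyword argument input_dim or input_shape must be provided. For example, input_dim=128 denotes input sequences of 128-dimensional vectors, and input_shape=(10,128) denotes a sequence of 10 vectors of 128 dimensions each.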

keras.layers.convolutional.AtrousConvolution1D(
    nb_filter, 
    filter_length, 
    init='uniform', 
    activation='linear', 
    weights=None, 
    border_mode='valid', 
    subsample_length=1, 
    atrous_rate=1, 
    W_regularizer=None, 
    b_regularizer=None, 
    activity_regularizer=None, 
    W_constraint=None, 
    b_constraint=None, 
    bias=True
)

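nb_filter: number of convolution kernels (i.e. the output dimension)
filter_length: spatial or temporal length of the convolution kernel
init: initialization method; the name of a predefined initializer as a string, or a Theano function used to initialize the weights. Only meaningful when the weights argument is not passed.
activation: activation function; the name of a predefined activation (see the activations documentation) or an element-wise Theano function. If not specified, no activation is applied (i.e. the linear activation a(x)=x)
weights: list of numpy arrays. The list should contain a weight matrix of shape (input_dim, output_dim) and a bias vector of shape (output_dim,).
border_mode: border mode, one of "valid", "same" or "full"; "full" requires the Theano backend
subsample_length: factor by which the output is subsampled relative to the input
atrous_rate: dilation factor of the convolution kernel, elsewhere also called 'filter_dilation'
W_regularizer: regularizer applied to the weights, a WeightRegularizer object
b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
activity_regularizer: regularizer applied to the output, an ActivityRegularizer object
W_constraint: constraint applied to the weights, a Constraints object
b_constraint: constraint applied to the bias, a Constraints object
bias: boolean, whether to include a bias vector (i.e. whether the layer applies a linear or an affine transformation to its input)
input_dim: integer, the dimensionality of the input. When this layer is the first layer of the network, either this argument or input_shape must be specified.
input_length: when the input sequences have a fixed length, this is that length. It must be specified if a Flatten layer followed by a Dense layer is connected after this layer; otherwise the output shape of the dense layer cannot be computed.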

example

# apply an atrous convolution 1d with atrous rate 2 of length 3 to a sequence with 10 timesteps,
# with 64 output filters
model = Sequential()
model.add(AtrousConvolution1D(64, 3, atrous_rate=2, border_mode='same', input_shape=(10, 32)))
# now model.output_shape == (None, 10, 64)

# add a new atrous conv1d on top
model.add(AtrousConvolution1D(32, 3, atrous_rate=2, border_mode='same'))
# now model.output_shape == (None, 10, 32)
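
To show what the "holes" do, the following pure-Python sketch applies a single-channel kernel whose taps are spaced atrous_rate positions apart, without padding. The function atrous_conv1d_valid is a hypothetical helper for illustration only.

```python
def atrous_conv1d_valid(signal, kernel, atrous_rate=1):
    """1D dilated convolution on a list of numbers, no padding.

    Tap k of the kernel reads signal[start + k * atrous_rate], so a kernel of
    length L covers (L - 1) * atrous_rate + 1 input positions.
    """
    span = (len(kernel) - 1) * atrous_rate + 1
    return [
        sum(kernel[k] * signal[start + k * atrous_rate]
            for k in range(len(kernel)))
        for start in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = atrous_conv1d_valid(signal, kernel, atrous_rate=1)    # taps i, i+1, i+2
dilated = atrous_conv1d_valid(signal, kernel, atrous_rate=2)  # taps i, i+2, i+4
```

With atrous_rate=2 the same three weights cover five input positions, so the receptive field grows without adding parameters.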

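0x3: Convolution2D layer

A two-dimensional convolution layer performs sliding-window convolution over a two-dimensional input. When this layer is used as the first layer, the input_shape argument should be provided. For example, input_shape = (3,128,128) denotes a 128*128 RGB color image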

keras.layers.convolutional.Convolution2D(
    nb_filter, 
    nb_row, 
    nb_col, 
    init='glorot_uniform', 
    activation='linear', 
    weights=None, 
    border_mode='valid', 
    subsample=(1, 1), 
    dim_ordering='th', 
    W_regularizer=None, 
    b_regularizer=None, 
    activity_regularizer=None, 
    W_constraint=None, 
    b_constraint=None, 
    bias=True
)

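nb_filter: number of convolution kernels
nb_row: number of rows in the convolution kernel
nb_col: number of columns in the convolution kernel
init: initialization method; the name of a predefined initializer as a string, or a Theano function used to initialize the weights. Only meaningful when the weights argument is not passed.
activation: activation function; the name of a predefined activation (see the activations documentation) or an element-wise Theano function. If not specified, no activation is applied (i.e. the linear activation a(x)=x)
weights: list of numpy arrays. The list should contain a weight matrix of shape (input_dim, output_dim) and a bias vector of shape (output_dim,).
border_mode: border mode, one of "valid", "same" or "full"; "full" requires the Theano backend
subsample: tuple of length 2, the factor by which the output is subsampled relative to the input; more commonly called "strides"
W_regularizer: regularizer applied to the weights, a WeightRegularizer object
b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
activity_regularizer: regularizer applied to the output, an ActivityRegularizer object
W_constraint: constraint applied to the weights, a Constraints object
b_constraint: constraint applied to the bias, a Constraints object
dim_ordering: 'th' or 'tf'. In 'th' mode the channel dimension (e.g. the 3 channels of a color image) is at position 1 (counting from 0), while in 'tf' mode it is at position 3. For example, for a 128*128 three-channel color image, input_shape should be written as (3,128,128) in 'th' mode and (128,128,3) in 'tf' mode. Note that the 3 appears at position 0 here because input_shape does not include the sample dimension; internally the shapes are actually (None,3,128,128) and (None,128,128,3). The default is the mode given by image_dim_ordering, which can be found in ~/.keras/keras.json; if it has never been set, it is 'tf'.
bias: boolean, whether to include a bias vector (i.e. whether the layer applies a linear or an affine transformation to its input)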

example

# apply a 3x3 convolution with 64 output filters on a 256x256 image:
model = Sequential()
model.add(Convolution2D(64, 3, 3, border_mode='same', input_shape=(3, 256, 256)))
# now model.output_shape == (None, 64, 256, 256)

# add a 3x3 convolution on top, with 32 output filters:
model.add(Convolution2D(32, 3, 3, border_mode='same'))
# now model.output_shape == (None, 32, 256, 256)
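
The output shapes in the example, and the effect of dim_ordering on where the channel axis sits, can be reproduced with a small shape calculator. The function conv2d_summary is a hypothetical helper; it assumes stride (1, 1), handles only the 'valid' and 'same' border modes, and counts one bias per filter.

```python
def conv2d_summary(input_shape, nb_filter, nb_row, nb_col,
                   border_mode="same", dim_ordering="th"):
    """Return (output_shape, parameter_count) for one 2D convolution layer."""
    if dim_ordering == "th":
        channels, rows, cols = input_shape      # channels first
    else:
        rows, cols, channels = input_shape      # channels last
    if border_mode == "valid":
        rows, cols = rows - nb_row + 1, cols - nb_col + 1
    elif border_mode != "same":
        raise ValueError("only 'valid' and 'same' are sketched here")
    if dim_ordering == "th":
        output_shape = (None, nb_filter, rows, cols)
    else:
        output_shape = (None, rows, cols, nb_filter)
    params = nb_filter * channels * nb_row * nb_col + nb_filter
    return output_shape, params


first = conv2d_summary((3, 256, 256), 64, 3, 3)              # 'th' ordering
second = conv2d_summary((64, 256, 256), 32, 3, 3)            # stacked on top
tf_style = conv2d_summary((256, 256, 3), 64, 3, 3, dim_ordering="tf")
```

The parameter count is the same in both orderings; only the position of the channel axis in the shapes changes.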

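0x4: AtrousConvolution2D layer

This layer performs atrous convolution, also called dilated convolution or convolution with holes, over a two-dimensional input. When this layer is used as the first layer, the input_shape argument should be provided. For example, input_shape = (3,128,128) denotes a 128*128 RGB color image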

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/layers/convolutional_layer/
http://baike.baidu.com/item/%E6%BB%A4%E6%B3%A2
http://blog.csdn.net/v_july_v/article/details/51812459
http://cs231n.github.io/convolutional-networks/#overview
http://blog.csdn.net/stdcoutzyx/article/details/41596663

7. Pooling layers

0x1: MaxPooling1D layer

Max pooling over a temporal 1D signal

keras.layers.convolutional.MaxPooling1D(
    pool_length=2, 
    stride=None, 
    border_mode='valid'
)

pool_length: subsampling factor; for example, 2 subsamples the input to half its length
stride: integer or None, the stride
border_mode: 'valid' or 'same'
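
A pure-Python sketch of 1D max pooling over a single-channel sequence in 'valid' mode. It assumes that stride=None means the stride equals pool_length (non-overlapping windows); the function max_pooling1d is a hypothetical helper.

```python
def max_pooling1d(sequence, pool_length=2, stride=None):
    """Max over sliding windows of pool_length, moving by stride each step."""
    if stride is None:
        stride = pool_length            # assumed default: non-overlapping windows
    return [
        max(sequence[i:i + pool_length])
        for i in range(0, len(sequence) - pool_length + 1, stride)
    ]


sequence = [1, 5, 2, 4, 3, 0]
halved = max_pooling1d(sequence)                    # pool_length=2, stride=2
overlapping = max_pooling1d(sequence, stride=1)     # windows overlap by one
```

With the default stride the sequence is reduced to half its length, matching the description of pool_length above.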

0x2: MaxPooling2D layer

Applies max pooling to a spatial signal

keras.layers.convolutional.MaxPooling2D(
    pool_size=(2, 2), 
    strides=None, 
    border_mode='valid', dim_ordering='th'
) 

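1. pool_size: tuple of 2 integers, the subsampling factors in the two directions (vertical, horizontal); for example (2,2) halves the image in both dimensions
2. strides: tuple of 2 integers, or None; the stride values.
3. border_mode: 'valid' or 'same'
4. dim_ordering: 'th' or 'tf'. In 'th' mode the channel dimension (e.g. the 3 channels of a color image) is at position 1 (counting from 0), while in 'tf' mode it is at position 3. For example, for a 128*128 three-channel color image, input_shape should be written as (3,128,128) in 'th' mode and (128,128,3) in 'tf' mode. Note that the 3 appears at position 0 here because input_shape does not include the sample dimension; internally the shapes are actually (None,3,128,128) and (None,128,128,3). The default is the mode given by image_dim_ordering, which can be found in ~/.keras/keras.json; if it has never been set, it is 'tf'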

0x3: AveragePooling1D layer

Average pooling over a temporal 1D signal

keras.layers.convolutional.AveragePooling1D(
    pool_length=2, 
    stride=None, 
    border_mode='valid'
) 

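1. pool_length: subsampling factor; for example, 2 subsamples the input to half its length
2. stride: integer or None, the stride
3. border_mode: 'valid' or 'same'
Note that 'same' mode can currently only be used with the TensorFlow backend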

0x4: GlobalMaxPooling1D layer

Global max pooling over a temporal signal

keras.layers.pooling.GlobalMaxPooling1D()

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/layers/pooling_layer/

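8. Recurrent layers

0x1: Recurrent layer

This is the abstract base class for recurrent layers. Do not use it directly in a model (it is abstract and cannot be instantiated). Use its subclasses LSTM, GRU or SimpleRNN instead.
All recurrent layers (LSTM, GRU, SimpleRNN) share the properties of this class and accept all the keyword arguments it specifies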

keras.layers.recurrent.Recurrent(
    weights=None, 
    return_sequences=False, 
    go_backwards=False, 
    stateful=False, 
    unroll=False, 
    consume_less='cpu', 
    input_dim=None, 
    input_length=None
)

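1. weights: list of numpy arrays used to initialize the weights. The list has the form [(input_dim, output_dim),(output_dim, output_dim),(output_dim,)]
2. return_sequences: boolean, False by default; controls the return type. If True, the full output sequence is returned; otherwise only the last output of the sequence is returned
3. go_backwards: boolean, False by default. If True, the input sequence is processed in reverse
4. stateful: boolean, False by default. If True, the final state of the sample at index i in one batch is used as the initial state of the sample at the same index in the next batch.
5. unroll: boolean, False by default. If True, the recurrent layer is unrolled; otherwise a symbolic loop is used. With the TensorFlow backend the recurrent network is always unrolled, so this argument has no effect. Unrolling uses more memory but speeds up the RNN. It is only suitable for short sequences.
6. consume_less: one of 'cpu' or 'mem'. If set to 'cpu', the RNN is implemented with fewer, larger matrix multiplications, which runs faster on the CPU but uses more memory. If set to 'mem', the RNN is implemented with more, smaller matrix multiplications, which runs faster with GPU parallelism (but slower on the CPU) and uses less memory.
7. input_dim: dimensionality of the input. Specify it (or, equivalently, input_shape) when this layer is the first layer of a model
8. input_length: when the input sequences have a fixed length, this is that length. It must be specified if a Flatten layer followed by a Dense layer is connected after this layer; otherwise the output shape of the dense layer cannot be computed. Note that if the recurrent layer is not the first layer of the network, the sequence length must be specified in the first layer, for example through input_shape.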

0x2: SimpleRNN layer

A fully connected RNN in which the output is fed back to the input

keras.layers.recurrent.SimpleRNN(
    output_dim, 
    init='glorot_uniform', 
    inner_init='orthogonal', 
    activation='tanh', 
    W_regularizer=None, 
    U_regularizer=None, 
    b_regularizer=None, 
    dropout_W=0.0, 
    dropout_U=0.0
)

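output_dim: dimension of the internal projections and of the output
init: initialization method; the name of a predefined initializer as a string, or a Theano function used to initialize the weights.
inner_init: initialization method for the inner cells
activation: activation function; the name of a predefined activation (see the activations documentation)
W_regularizer: regularizer applied to the weights, a WeightRegularizer object
U_regularizer: regularizer applied to the recurrent weights, a WeightRegularizer object
b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
dropout_W: float between 0 and 1, the fraction of the input units to drop for the input gates
dropout_U: float between 0 and 1, the fraction of the input units to drop for the recurrent connections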
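
The feedback of the output into the next step, together with the return_sequences and go_backwards flags described for the Recurrent base class, can be sketched with a scalar recurrence h_t = tanh(w*x_t + u*h_(t-1) + b). The function simple_rnn and its scalar weights are illustrative assumptions, not the Keras implementation.

```python
import math


def simple_rnn(xs, w, u, b, return_sequences=False, go_backwards=False):
    """Scalar RNN over a non-empty list of inputs, starting from h = 0."""
    if go_backwards:
        xs = xs[::-1]                    # process the sequence in reverse
    h = 0.0
    outputs = []
    for x in xs:
        h = math.tanh(w * x + u * h + b)  # previous output feeds back in
        outputs.append(h)
    return outputs if return_sequences else outputs[-1]


xs = [1.0, 0.0, -1.0]
full_sequence = simple_rnn(xs, w=0.5, u=0.5, b=0.0, return_sequences=True)
last_only = simple_rnn(xs, w=0.5, u=0.5, b=0.0)
reversed_last = simple_rnn(xs, w=0.5, u=0.5, b=0.0, go_backwards=True)
```

With return_sequences=True there is one output per timestep; otherwise only the final one is returned, which is what a following Dense layer would normally consume.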

0x3: GRU layer

Gated Recurrent Unit

keras.layers.recurrent.GRU(
    output_dim, 
    init='glorot_uniform', 
    inner_init='orthogonal', 
    activation='tanh', 
    inner_activation='hard_sigmoid', 
    W_regularizer=None, 
    U_regularizer=None, 
    b_regularizer=None, 
    dropout_W=0.0, 
    dropout_U=0.0
)

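output_dim: dimension of the internal projections and of the output
init: initialization method; the name of a predefined initializer as a string, or a Theano function used to initialize the weights.
inner_init: initialization method for the inner cells
activation: activation function; the name of a predefined activation (see the activations documentation)
inner_activation: activation function for the inner cells
W_regularizer: regularizer applied to the weights, a WeightRegularizer object
U_regularizer: regularizer applied to the recurrent weights, a WeightRegularizer object
b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
dropout_W: float between 0 and 1, the fraction of the input units to drop for the input gates
dropout_U: float between 0 and 1, the fraction of the input units to drop for the recurrent connections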

0x4: LSTM layer

The Long Short-Term Memory model in Keras

keras.layers.recurrent.LSTM(
    output_dim, 
    init='glorot_uniform', 
    inner_init='orthogonal', 
    forget_bias_init='one', 
    activation='tanh', 
    inner_activation='hard_sigmoid', 
    W_regularizer=None, 
    U_regularizer=None, 
    b_regularizer=None, 
    dropout_W=0.0, 
    dropout_U=0.0
)

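output_dim: dimension of the internal projections and of the output
init: initialization method; the name of a predefined initializer as a string, or a Theano function used to initialize the weights.
inner_init: initialization method for the inner cells
forget_bias_init: initialization function for the forget gate bias; Jozefowicz et al. recommend initializing it to all ones
activation: activation function; the name of a predefined activation (see the activations documentation)
inner_activation: activation function for the inner cells
W_regularizer: regularizer applied to the weights, a WeightRegularizer object
U_regularizer: regularizer applied to the recurrent weights, a WeightRegularizer object
b_regularizer: regularizer applied to the bias vector, a WeightRegularizer object
dropout_W: float between 0 and 1, the fraction of the input units to drop for the input gates
dropout_U: float between 0 and 1, the fraction of the input units to drop for the recurrent connections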
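
To relate activation, inner_activation and forget_bias_init to the computation, here is a scalar sketch of one LSTM step using the standard gate equations. It assumes hard_sigmoid is the piecewise linear function clip(0.2*x + 0.5, 0, 1); the function lstm_step and the parameter layout are illustrative, not the library's code.

```python
import math


def hard_sigmoid(x):
    """Piecewise linear approximation of the sigmoid (assumed definition)."""
    return max(0.0, min(1.0, 0.2 * x + 0.5))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. params maps gate name -> (W, U, b)."""
    def pre(gate):
        w, u, b = params[gate]
        return w * x + u * h_prev + b

    i = hard_sigmoid(pre("i"))      # input gate   (inner_activation)
    f = hard_sigmoid(pre("f"))      # forget gate  (inner_activation)
    o = hard_sigmoid(pre("o"))      # output gate  (inner_activation)
    g = math.tanh(pre("c"))         # candidate cell value (activation)
    c = f * c_prev + i * g          # new cell state
    h = o * math.tanh(c)            # new output
    return h, c


# All weights zero; forget gate bias set to one, as forget_bias_init='one' suggests.
params = {
    "i": (0.0, 0.0, 0.0),
    "f": (0.0, 0.0, 1.0),
    "o": (0.0, 0.0, 0.0),
    "c": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=0.0, h_prev=0.0, c_prev=2.0, params=params)
```

With a forget bias of one, the initial forget gate sits at 0.7 instead of 0.5 under this definition, so more of the previous cell state survives each step early in training.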

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/layers/recurrent_layer/

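9. Embedding layer

0x1: Embedding layer

The embedding layer turns positive integers (indices) into vectors of fixed size, e.g. [[4],[20]]->[[0.25,0.1],[0.6,-0.2]]. It is an encoding from discrete numbers to vectors; using Embedding requires that the input feature vectors have spatial correlation
The Embedding layer can only be used as the first layer of a model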

keras.layers.embeddings.Embedding(
    input_dim, 
    output_dim, 
    init='uniform', 
    input_length=None, 
    W_regularizer=None, 
    activity_regularizer=None, 
    W_constraint=None, 
    mask_zero=False, 
    weights=None, 
    dropout=0.0
)

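input_dim: integer greater than or equal to 0, the vocabulary size, i.e. the largest index in the input data + 1
output_dim: integer greater than 0, the dimension of the dense embedding
init: initialization method; the name of a predefined initializer as a string, or a Theano function used to initialize the weights. Only meaningful when the weights argument is not passed.
weights: list of numpy arrays. The list should contain only one weight matrix of shape (input_dim, output_dim)
W_regularizer: regularizer applied to the weights, a WeightRegularizer object
W_constraint: constraint applied to the weights, a Constraints object
mask_zero: boolean, whether the value 0 in the input should be treated as a "padding" value to be ignored. This is useful when recurrent layers process variable-length input. If set to True, all subsequent layers in the model must support masking, otherwise an exception is raised
input_length: when the input sequences have a fixed length, this is that length. It must be specified if a Flatten layer followed by a Dense layer is connected after this layer; otherwise the output dimension of the Dense layer cannot be inferred.
dropout: float between 0 and 1, the fraction of the embeddings to drop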
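
Conceptually, an embedding is a lookup table with input_dim rows and output_dim columns. The sketch below uses hypothetical helpers (make_embedding, embed) and a random table in place of trained weights; the uniform range of 0.05 is an arbitrary assumption.

```python
import random


def make_embedding(input_dim, output_dim, seed=0, scale=0.05):
    """Table with one output_dim-sized vector per index in [0, input_dim)."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-scale, scale) for _ in range(output_dim)]
        for _ in range(input_dim)
    ]


def embed(table, indices, mask_zero=False):
    """Look up a vector for every index; optionally mark index 0 as padding."""
    vectors = [[table[i] for i in row] for row in indices]
    mask = [[i != 0 for i in row] for row in indices] if mask_zero else None
    return vectors, mask


# Largest index is 20, so input_dim must be at least 21.
table = make_embedding(input_dim=21, output_dim=2)
vectors, mask = embed(table, [[4], [20]])
padded_vectors, padded_mask = embed(table, [[3, 0]], mask_zero=True)
```

Each index is replaced by its row of the table, so a batch of shape (samples, sequence_length) becomes (samples, sequence_length, output_dim).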

Relevant Link:

https://keras-cn.readthedocs.io/en/latest/layers/embedding_layer/
