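- Introducing the temporal dimension with a 3D convolutional network: because a 3D convolutional network takes a fixed-length clip of frames as each input, this approach is better suited to tasks that depend on local (temporal) information.
- Handling temporal information with the RNN/LSTM family of networks: because these networks are recurrent, they are better at capturing global information across the whole video.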
tf.nn.conv3d(input, filter, strides, padding, name=None)

Computes a 3-D convolution given 5-D input and filter tensors.

In signal processing, cross-correlation is a measure of similarity of two waveforms as a function of a time-lag applied to one of them. This is also known as a sliding dot product or sliding inner-product. Our Conv3D implements a form of cross-correlation.

Args:
- input: A Tensor. Must be one of the following types: float32, float64, int64, int32, uint8, uint16, int16, int8, complex64, complex128, qint8, quint8, qint32, half. Shape [batch, in_depth, in_height, in_width, in_channels].
- filter: A Tensor. Must have the same type as input. Shape [filter_depth, filter_height, filter_width, in_channels, out_channels]. in_channels must match between input and filter.
- strides: A list of ints that has length >= 5. 1-D tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1.
- padding: A string from: "SAME", "VALID". The type of padding algorithm to use.
- name: A name for the operation (optional).

Returns: A Tensor. Has the same type as input.

tf.nn.avg_pool3d(input, ksize, strides, padding, name=None)

Performs 3D average pooling on the input.

Args:
- input: A Tensor. Must be one of the following types: float32, float64, int64, int32, uint8, uint16, int16, int8, complex64, complex128, qint8, quint8, qint32, half. Shape [batch, depth, rows, cols, channels] tensor to pool over.
- ksize: A list of ints that has length >= 5. 1-D tensor of length 5. The size of the window for each dimension of the input tensor. Must have ksize[0] = ksize[4] = 1.
- strides: A list of ints that has length >= 5. 1-D tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1.
- padding: A string from: "SAME", "VALID". The type of padding algorithm to use.
- name: A name for the operation (optional).

Returns: A Tensor. Has the same type as input. The average pooled output tensor.

tf.nn.max_pool3d(input, ksize, strides, padding, name=None)

Performs 3D max pooling on the input.

Args:
- input: A Tensor. Must be one of the following types: float32, float64, int64, int32, uint8, uint16, int16, int8, complex64, complex128, qint8, quint8, qint32, half. Shape [batch, depth, rows, cols, channels] tensor to pool over.
- ksize: A list of ints that has length >= 5. 1-D tensor of length 5. The size of the window for each dimension of the input tensor. Must have ksize[0] = ksize[4] = 1.
- strides: A list of ints that has length >= 5. 1-D tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1.
- padding: A string from: "SAME", "VALID". The type of padding algorithm to use.
- name: A name for the operation (optional).

Returns: A Tensor. Has the same type as input. The max pooled output tensor.
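To make the shape conventions of these three ops concrete, here is a small sketch in plain Python (it does not call TensorFlow) that predicts the output shape from the 5-D input shape, the window size, the strides, and the padding mode. The helper names `out_len`, `conv3d_output_shape`, and `pool3d_output_shape` are hypothetical and not part of TensorFlow. The formulas are the usual ones for "SAME" padding, ceil(in / stride), and "VALID" padding, ceil((in - window + 1) / stride). The example shapes are illustrative only.

```python
import math

def out_len(in_len, window, stride, padding):
    """Output length along one dimension for "SAME" or "VALID" padding."""
    if padding == "SAME":
        return math.ceil(in_len / stride)
    if padding == "VALID":
        return math.ceil((in_len - window + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

def conv3d_output_shape(input_shape, filter_shape, strides, padding):
    """Predict the output shape of a 3-D convolution.

    input_shape:  [batch, in_depth, in_height, in_width, in_channels]
    filter_shape: [filter_depth, filter_height, filter_width,
                   in_channels, out_channels]
    strides:      length-5 list with strides[0] == strides[4] == 1
    """
    if input_shape[4] != filter_shape[3]:
        raise ValueError("in_channels must match between input and filter")
    if strides[0] != 1 or strides[4] != 1:
        raise ValueError("strides[0] and strides[4] must be 1")
    dims = [
        out_len(input_shape[i + 1], filter_shape[i], strides[i + 1], padding)
        for i in range(3)
    ]
    return [input_shape[0]] + dims + [filter_shape[4]]

def pool3d_output_shape(input_shape, ksize, strides, padding):
    """Predict the output shape of 3-D average or max pooling.

    input_shape: [batch, depth, rows, cols, channels]
    ksize, strides: length-5 lists whose first and last entries are 1
    """
    if ksize[0] != 1 or ksize[4] != 1:
        raise ValueError("ksize[0] and ksize[4] must be 1")
    if strides[0] != 1 or strides[4] != 1:
        raise ValueError("strides[0] and strides[4] must be 1")
    dims = [
        out_len(input_shape[i], ksize[i], strides[i], padding)
        for i in range(1, 4)
    ]
    return [input_shape[0]] + dims + [input_shape[4]]

# Illustrative shapes: a batch of 8 clips, 7 frames of 60x40, 1 channel.
conv_shape = conv3d_output_shape(
    [8, 7, 60, 40, 1], [3, 7, 7, 1, 2], [1, 1, 1, 1, 1], "VALID")
pool_shape = pool3d_output_shape(
    conv_shape, [1, 1, 2, 2, 1], [1, 1, 2, 2, 1], "VALID")
print(conv_shape)  # [8, 5, 54, 34, 2]
print(pool_shape)  # [8, 5, 27, 17, 2]
```

Checking shapes this way before building a graph is a cheap way to catch a mismatched in_channels or a stride list of the wrong form.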
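The 3D CNN architecture in the paper consists of one hardwired layer, three convolutional layers, two subsampling layers, and one fully connected layer. Each 3D convolution kernel operates on a cube of 7 consecutive frames, and each frame patch is 60x40. The architecture is as follows: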
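In the first layer, a set of fixed, hardwired kernels is applied to the raw frames to produce several channels of information, and each channel is then processed separately. At the end, the information from all channels is combined into the final feature descriptor. The hardwired layer effectively encodes our prior knowledge about the features, which works better than random initialization (the paper's point is that extracting image features with prior knowledge lets backpropagation training perform better than it does from random initialization). Gradients characterize the distribution of edges in the image, while optical flow characterizes the tendency of object motion, and the 3D convolutions draw on both kinds of information for action recognition. Five channels are extracted per frame: grayscale, the gradients in the x and y directions, and the optical flow in the x and y directions. The first three can be computed from every single frame, whereas the horizontal and vertical optical flow fields each need two consecutive frames. This gives 7x3 + (7-1)x2 = 33 feature maps.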
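The count of 33 feature maps can be checked with a few lines of Python. This is only a bookkeeping sketch: the function name `hardwired_channels` is hypothetical, and it counts maps per channel rather than computing any gradients or optical flow.

```python
def hardwired_channels(num_frames):
    """Number of feature maps each hardwired channel contributes.

    Grayscale and the two gradient channels exist for every frame;
    the two optical-flow channels need a pair of consecutive frames,
    so they exist for num_frames - 1 positions.
    """
    per_frame = ["gray", "gradient-x", "gradient-y"]
    per_pair = ["optflow-x", "optflow-y"]
    counts = {name: num_frames for name in per_frame}
    counts.update({name: num_frames - 1 for name in per_pair})
    return counts

counts = hardwired_channels(7)
total = sum(counts.values())
print(counts)
print(total)  # 33
```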
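In the subsampling layer S3 that follows (max pooling), the feature maps from layer C2 are subsampled with a 2x2 window, which yields the same number of feature maps at a lower spatial resolution. After subsampling, each map is 27x17 = (54/2) x (34/2).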
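The 27x17 size can be traced from the 60x40 input patch. A minimal sketch, assuming that C2 uses a 7x7 spatial kernel with no padding and stride 1 (this kernel size is not stated in the text above and is an assumption here); the helper names are hypothetical.

```python
def valid_conv(size, kernel):
    """Spatial size after an unpadded, stride-1 convolution."""
    return tuple(s - k + 1 for s, k in zip(size, kernel))

def subsample(size, window):
    """Spatial size after non-overlapping subsampling."""
    if any(s % w for s, w in zip(size, window)):
        raise ValueError("window must divide the map size evenly")
    return tuple(s // w for s, w in zip(size, window))

patch = (60, 40)
c2 = valid_conv(patch, (7, 7))  # assumed 7x7 spatial kernel in C2
s3 = subsample(c2, (2, 2))      # 2x2 window in S3
print(c2)  # (54, 34)
print(s3)  # (27, 17)
```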
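By this stage, the number of frames along the temporal dimension is already small (3 for gray, gradient-x, and gradient-y, and 2 for optflow-x and optflow-y). In this layer, the convolution is therefore applied only in the spatial dimensions, using a 7x4 kernel, which reduces the output feature maps to a size of 1x1. Layer C6 contains 128 feature maps, and each one is fully connected to all 78 (13x6) feature maps in layer S5. Each C6 feature map is thus 1x1, that is, a single value, and together these values form the final 128-dimensional feature vector.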
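The frame counts and the 78 (13x6) figure in this paragraph can be reproduced with simple arithmetic. A rough sketch under stated assumptions, none of which appear in the text above: C2 and C4 each use an unpadded temporal kernel of size 3, C2 applies 2 sets of kernels, and C4 applies 3 sets of kernels to each of those 2 sets. The S5 map size of 7x4 is inferred from the statement that a 7x4 kernel reduces the maps to 1x1.

```python
def temporal_conv(frames_per_channel, kernel=3):
    """Frames left per channel after an unpadded temporal convolution."""
    return [f - kernel + 1 for f in frames_per_channel]

# Order: gray, gradient-x, gradient-y, optflow-x, optflow-y
hardwired = [7, 7, 7, 6, 6]
c2_frames = temporal_conv(hardwired)   # after C2 (assumed temporal kernel 3)
c4_frames = temporal_conv(c2_frames)   # after C4 (assumed temporal kernel 3)

c2_maps = sum(c2_frames) * 2           # assumed 2 kernel sets in C2
c4_maps = sum(c4_frames) * 2 * 3       # assumed 3 kernel sets per C2 set

# C6: a 7x4 spatial kernel on 7x4 maps leaves a single value per map.
s5_size = (7, 4)
c6_size = tuple(s - k + 1 for s, k in zip(s5_size, (7, 4)))
feature_dim = 128 * c6_size[0] * c6_size[1]

print(c4_frames)    # [3, 3, 3, 2, 2]
print(c4_maps)      # 78
print(feature_dim)  # 128
```

Pooling in S3 and S5 is spatial only in this sketch, so it does not change the map counts.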
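Timeline: temporal + spatial two-stream networks -> the 3D convolution + two-stream network from CVPR 2016
Two-Stream Convolutional Networks for Action Recognition in Videos proposes a two-stream CNN whose streams capture spatial and temporal information respectively.
3D Convolutional Neural Networks for Human Action Recognition proposes the 3D convolution method.
Convolutional Two-Stream Network Fusion for Video Action Recognition improves on the original two-stream paper.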