Paper -- DenseNet: Densely Connected Convolutional Networks

Posted by jerry173985 on 2020-12-07

Abstract:

DenseNet breaks away from the fixed mindset of improving network performance by making the network deeper (ResNet) or wider (Inception). Instead, it works from the perspective of features: through feature reuse and bypass connections, it greatly reduces the number of parameters and alleviates the vanishing-gradient problem to a certain extent.
[Figure from the paper]

DenseNets have several compelling advantages:

Alleviate the vanishing-gradient problem
Strengthen feature propagation
Encourage feature reuse
Substantially reduce the number of parameters

From the figure we can draw the following conclusions:

a) Some features extracted by earlier layers may still be used directly by much deeper layers.

b) Even the transition layer uses features from all layers of the preceding dense block.

c) Layers in the second and third dense blocks make little use of the outputs of the preceding transition layer, which indicates that the transition layer produces many redundant features. This provides supporting evidence for DenseNet-BC, i.e. the necessity of Compression.

d) Although the final classification layer uses information from many layers of the preceding dense block, it leans towards the last few feature-maps, which suggests that some high-level features are produced in the last few layers of the network.


Every layer receives a direct shortcut from all of the layers before it, so any two layers in the network can "communicate" directly, as shown in the figure below:

[Figure: dense connectivity, every layer connected to all preceding layers]
Benefits:

From the feature perspective, every time a layer's features are reused it can be regarded as going through a new normalization; the experiments show that even with BN removed, a deep DenseNet still converges well.
From the receptive-field perspective, shallow and deep fields can be combined more freely, which makes the model more robust.
From the wide-network perspective, DenseNet can be regarded as a genuinely wide network; during training its gradients are more stable than ResNet's, so it naturally converges faster (the experiments in the paper support this).

In DenseNet the input of each layer is the output of all preceding layers, so there is a connection between any two layers. In practice, however, layers whose feature-maps have different spatial sizes cannot be combined directly. Inspired by GoogLeNet, the paper therefore proposes the Dense Block: inside each block all layers are densely connected, while there is no dense connectivity between blocks, which are instead joined by transition layers.
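A quick sketch of why dense connectivity has to stay inside a block: concatenation along the channel axis only works when the spatial sizes match. The code below is illustrative (it assumes PyTorch; the tensor shapes are made up) and simply shows the failure mode that the transition layers are there to avoid.

```python
# Feature-maps at different resolutions cannot be concatenated directly,
# which is why dense connectivity is restricted to within a dense block.
import torch

a = torch.randn(1, 12, 32, 32)   # feature-maps before down-sampling
b = torch.randn(1, 12, 16, 16)   # feature-maps after a 2x2 pooling step

try:
    torch.cat([a, b], dim=1)     # channel-wise concatenation needs equal H and W
except RuntimeError as err:
    print("cannot concatenate across resolutions:", err)

# Hence the network is split into dense blocks of equal resolution,
# joined by transition layers (conv + pooling) that change the size.
```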


DenseNets
ResNets [11] add a skip-connection that bypasses the non-linear transformations with an identity function

$$\mathbf{x}_{\ell}=H_{\ell}\left(\mathbf{x}_{\ell-1}\right)+\mathbf{x}_{\ell-1}$$
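For contrast, here is a minimal PyTorch sketch of this additive shortcut; the choice of H and the shapes are illustrative, not taken from the paper.

```python
# ResNet rule: x_l = H_l(x_{l-1}) + x_{l-1}.
# The shortcut is combined by summation, so H_l must preserve the channel count.
import torch
import torch.nn as nn

H = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

x_prev = torch.randn(1, 64, 32, 32)   # x_{l-1}
x_next = H(x_prev) + x_prev           # identity shortcut: addition, not concatenation
```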

Dense connectivity

$$\mathbf{x}_{\ell}=H_{\ell}\left(\left[\mathbf{x}_{0}, \mathbf{x}_{1}, \ldots, \mathbf{x}_{\ell-1}\right]\right)$$

where $[\mathbf{x}_{0}, \mathbf{x}_{1}, \ldots, \mathbf{x}_{\ell-1}]$ refers to the concatenation of the feature-maps produced in layers $0, \ldots, \ell-1$.
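A minimal sketch of this rule, assuming PyTorch; the values of k0, k and the number of layers are illustrative. Each layer receives the channel-wise concatenation of all earlier feature-maps and contributes k new ones.

```python
# Dense connectivity: x_l = H_l([x_0, x_1, ..., x_{l-1}]).
import torch
import torch.nn as nn

k0, k, num_layers = 16, 12, 4            # input channels, growth rate, layers in the block
layers = nn.ModuleList(
    # layer l sees k0 + l*k input channels and emits k new channels
    nn.Conv2d(k0 + l * k, k, kernel_size=3, padding=1) for l in range(num_layers)
)

x = torch.randn(1, k0, 32, 32)           # x_0: the block input
features = [x]
for H in layers:
    out = H(torch.cat(features, dim=1))  # H_l applied to the concatenation of all earlier maps
    features.append(out)                 # the new feature-maps are reused by every later layer

print(torch.cat(features, dim=1).shape)  # block output: k0 + num_layers*k = 64 channels
```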

Composite function
$H_{\ell}$ is defined as a composite function of three consecutive operations: batch normalization (BN), followed by a rectified linear unit (ReLU) and a 3 × 3 convolution (Conv).
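A short sketch of this composite function in PyTorch (the function name and arguments are mine, not the paper's):

```python
# H_l as BN -> ReLU -> 3x3 Conv (pre-activation ordering).
import torch.nn as nn

def composite_fn(in_channels: int, growth_rate: int) -> nn.Sequential:
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False),
    )
```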

Pooling layers: concatenation is only possible when feature-map sizes match, so down-sampling (pooling) is done in the transition layers between dense blocks, which change the size of the feature-maps.

Growth rate: if each function $H_{\ell}$ produces $k$ feature-maps as output, we refer to the hyper-parameter $k$ as the growth rate of the network.

Bottleneck layers: although each layer only produces $k$ output feature-maps, it typically has many more inputs. As noted in [36, 11], a 1×1 convolution can be introduced as a bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, usually to $4k$; that is, a 1×1 convolution is added to the definition of $H_{\ell}$.

Why are there so many inputs? If each function $H_{\ell}$ produces $k$ feature-maps as output, it follows that the $\ell$-th layer has $k \times (\ell-1) + k_{0}$ input feature-maps, where $k_{0}$ is the number of channels in the input image.
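Putting the two points together, here is a sketch of a DenseNet-B bottleneck layer, assuming PyTorch; the defaults for k0 and k are illustrative, while the 4k width of the 1×1 convolution follows the paper.

```python
# The l-th layer receives k*(l-1) + k0 input channels; a 1x1 conv first
# squeezes them to 4*k before the 3x3 conv produces the k new feature-maps.
import torch.nn as nn

def bottleneck_layer(l: int, k0: int = 16, k: int = 12) -> nn.Sequential:
    in_channels = k * (l - 1) + k0                                   # grows linearly inside the block
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, 4 * k, kernel_size=1, bias=False),    # bottleneck: reduce to 4k channels
        nn.BatchNorm2d(4 * k),
        nn.ReLU(inplace=True),
        nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False),   # output k feature-maps
    )
```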

Compression: to further improve model compactness, the number of feature-maps can be reduced at the transition layers. If a dense block contains $m$ feature-maps, the following transition layer generates $\lfloor \theta m \rfloor$ output feature-maps, where $0 < \theta \le 1$ is referred to as the compression factor.
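A sketch of a transition layer with compression, again assuming PyTorch; θ = 0.5 is the value used for DenseNet-BC in the paper, the rest is illustrative.

```python
# If the preceding dense block outputs m feature-maps, the transition keeps
# floor(theta * m) of them and halves the spatial resolution.
import torch.nn as nn

def transition_layer(m: int, theta: float = 0.5) -> nn.Sequential:
    out_channels = int(theta * m)                                # floor(theta * m), 0 < theta <= 1
    return nn.Sequential(
        nn.BatchNorm2d(m),
        nn.ReLU(inplace=True),
        nn.Conv2d(m, out_channels, kernel_size=1, bias=False),   # compress the channel count
        nn.AvgPool2d(kernel_size=2, stride=2),                   # down-sample between dense blocks
    )
```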


In short: simply adding these extra shortcuts gives good results while reducing the amount of computation!
