# 深度学习基础网络 ResNet

## Highway Networks

A traditional network layer applies a nonlinear transformation (usually an affine transformation followed by a nonlinear activation):

\[y = H(x,W_H) \tag{1}\]

A highway network adds two nonlinear gates, a transform gate \(T(x,W_T)\) and a carry gate \(C(x,W_C)\):

\[y = H(x,W_H)\cdot T(x,W_T) + x\cdot C(x,W_C) \tag{2}\]

Setting \(C = 1 - T\) for simplicity gives:

\[y = H(x,W_H)\cdot T(x,W_T) + x\cdot (1-T(x,W_T)) \tag{3}\]

After training, a highway network has automatically learned which layers to pass through and which to skip, whereas ResNet simply adds the two branches directly (with equal weight).
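A minimal numpy sketch of one highway layer per Eq. (3); the function and weight names are illustrative, not from any released implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = H(x)*T(x) + x*(1 - T(x)).

    H is an affine transform + ReLU; T is the sigmoid transform gate.
    The carry gate is tied to C = 1 - T, as in Eq. (3).
    """
    H = np.maximum(0.0, x @ W_H + b_H)   # nonlinear transform H(x, W_H)
    T = sigmoid(x @ W_T + b_T)           # transform gate T(x, W_T)
    return H * T + x * (1.0 - T)
```

Initializing `b_T` to a large negative value drives `T` toward 0, so the layer starts out carrying its input through unchanged.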

## ResNet

ResNet differs from highway networks in two ways. First, highway networks introduce extra gate parameters, while ResNet needs none. Second, in a highway network the shortcut is nearly shut off when the carry gate approaches 0, so the layer no longer represents a residual function, whereas ResNet's residual formulation is always active.

### Identity mapping

A shortcut connection is added somewhere in the network to carry the features of an earlier layer forward unchanged; this new connection is called an identity mapping. As shown in the following figure:

The derivative of the identity mapping in the backward pass is exactly the identity matrix \(I\).
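Concretely, for a residual unit \(\mathbb y = \mathbb x + \mathcal F(\mathbb x)\), the shortcut contributes an identity term to the Jacobian, so the gradient always has a direct path back to \(\mathbb x\):

\[
\frac{\partial \mathbb y}{\partial \mathbb x} = I + \frac{\partial \mathcal F}{\partial \mathbb x},
\qquad
\frac{\partial \mathcal L}{\partial \mathbb x} = \frac{\partial \mathcal L}{\partial \mathbb y}\left(I + \frac{\partial \mathcal F}{\partial \mathbb x}\right)
\]

Even if \(\partial \mathcal F/\partial \mathbb x\) becomes small, the \(I\) term keeps the gradient from vanishing.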

It is assumed that optimizing the residual mapping \(\mathcal F(\mathbb x)\) is easier than optimizing the original mapping \(H(\mathbb x)\) (which the experimental results confirm). \(\mathcal F(\mathbb x)+\mathbb x\) can be realized through a shortcut connection, as shown in the following figure:

If the dimensions of \(\mathbb x\) and \(\mathcal F\) are the same, then:

\[\mathbb y=\mathcal F(\mathbb x,\{W_i\})+\mathbb x \tag{4}\]

If they differ, \(\mathbb x\) must be linearly projected on the shortcut connection to match dimensions before the addition:

\[\mathbb y=\mathcal F(\mathbb x,\{W_i\})+W_s\mathbb x \tag{5}\]

The two tensors being added must have exactly the same shape; the corresponding feature maps are then added element-wise, channel by channel.
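A numpy sketch of the two cases above, with a 1×1-convolution-style projection for the mismatched case (the function name and argument layout are illustrative):

```python
import numpy as np

def residual_add(F_x, x, W_s=None):
    """y = F(x) + x, or y = F(x) + W_s x when channel counts differ.

    Tensors have shape (C, H, W). W_s is a (C_out, C_in) matrix acting
    as a 1x1 convolution: the same linear projection of the channel
    vector at every spatial position.
    """
    if W_s is not None:
        x = np.einsum('oc,chw->ohw', W_s, x)  # project channels of x
    assert F_x.shape == x.shape, "shapes must match for element-wise add"
    return F_x + x                            # element-wise addition
```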

### Plain Network

The plain network is mainly inspired by VGG, uses mostly 3×3 filters, and follows two design rules:

1. Layers that output feature maps of the same size have the same number of filters (stride=1 keeps the output size unchanged).
2. When stride-2 downsampling halves the feature map size, the number of filters is doubled, so every conv layer has the same computational cost. Taking the 56→28 transition as an example: a conv on 56×56 maps costs (56×56×64)×(3×3×64) multiply operations, while one on 28×28 maps costs (28×28×128)×(3×3×128); the two are equal (the stride-2 conv layer itself, of course, costs half).
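The arithmetic in rule 2 can be checked directly:

```python
# Multiply count of one 3x3 conv layer: (output positions * C_out) * (3*3*C_in).
ops_56 = (56 * 56 * 64) * (3 * 3 * 64)    # stage with 56x56 maps, 64 filters
ops_28 = (28 * 28 * 128) * (3 * 3 * 128)  # stage with 28x28 maps, 128 filters
print(ops_56 == ops_28)
```

Halving each spatial dimension divides the cost by 4, and doubling the channel count on both input and output multiplies it by 4, so the two effects cancel exactly.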

### Residual Network

A. Use identity mappings everywhere; when a residual block's input and output dimensions differ, zero-pad the added dimensions;

B. Use identity mappings when the block's input and output dimensions match, and a linear projection (a single Conv + BatchNorm layer suffices) when they differ;

C. Use a linear projection for every block.

Experiments on these three options show that although C performs better than B and A, the gap is very small, so linear projection is not essential; zero padding keeps the model complexity lowest, which matters more for deeper networks. And because option A needs no extra parameters, it is the one chosen.
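A sketch of the parameter-free option A shortcut as it is commonly implemented (subsample spatially by striding, then zero-fill the new channels; the function name is illustrative):

```python
import numpy as np

def shortcut_option_a(x, out_channels, stride=2):
    """Option A shortcut: no learned parameters.

    x has shape (C, H, W). Spatial size is reduced by simple striding,
    and the extra output channels are filled with zeros.
    """
    x = x[:, ::stride, ::stride]                  # spatial downsampling
    pad = out_channels - x.shape[0]
    return np.pad(x, ((0, pad), (0, 0), (0, 0)))  # zero-pad new channels
```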

The deeper bottleneck design:

For deeper networks, a three-layer residual structure is used. As shown in the following figure, two 1×1 convolutions first reduce and then restore the dimensionality, shrinking the input and output dimensions of the 3×3 convolution in between; this lower-dimensional 3×3 convolution is the bottleneck. The two block designs shown have similar time complexity.
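The "similar complexity" claim can be sanity-checked by counting weights in the two blocks (a rough check, ignoring biases and BN parameters; at a fixed spatial size, per-layer FLOPs are proportional to these counts):

```python
# Basic block: two 3x3 convs, 64 -> 64 -> 64 channels.
basic = 2 * (3 * 3 * 64 * 64)
# Bottleneck block: 1x1 reduce 256->64, 3x3 conv at 64, 1x1 expand 64->256.
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256
print(basic, bottleneck)
```

The two counts differ by only a few percent, even though the bottleneck block works with a 4× wider (256-d) input and output.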

The parameter-free identity shortcut is especially important for bottleneck structures: replacing it with a projection shortcut would increase both the model size and its complexity.

The 18-, 34-, 50-, 101- and 152-layer networks given in the paper are listed in the following table:

The differences between the plain network, the residual network, and VGG-19:

The paper points out that the degradation of the plain networks is unlikely to be caused by vanishing gradients, because BatchNorm layers are used throughout the network to keep the signals propagating well; the exact cause remains unclear and is left for follow-up research.

It is also found that even with 3× more training iterations, the degradation persists.

### ResNet解读

The residual network unit can be decomposed into the form shown on the right of the figure, which makes clear that a residual network is really a collection of many paths: in effect a combination of many parallel subnetworks, so the whole residual network acts like an ensemble, a kind of multi-model voting system.
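Unrolling, say, units 2 and 3 makes this explicit (a sketch of the ensemble view, with \(y_i\) denoting the output of unit \(i\)):

\[
y_3 = y_2 + \mathcal F_3(y_2) = \big[y_1 + \mathcal F_2(y_1)\big] + \mathcal F_3\big(y_1 + \mathcal F_2(y_1)\big)
\]

Expanding \(y_1 = y_0 + \mathcal F_1(y_0)\) as well yields \(2^3\) distinct input-to-output paths, since each unit can either be traversed or skipped; most of these paths are short.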

ResNet only looks deep on the surface; in effect the network is quite shallow.

Does ResNet, then, really solve the vanishing-gradient problem in deep networks? Apparently not: ResNet is in effect a multi-model voting system, an ensemble of shallow networks.

### 代码实现

In Caffe, feature addition is implemented with the SUM operation of an Eltwise layer:

```
layer {
  name: "Eltwise3"
  type: "Eltwise"
  bottom: "Eltwise2"
  bottom: "Convolution7"
  top: "Eltwise3"
  eltwise_param {
    operation: SUM
  }
}
```
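What this layer computes is just an element-wise sum of its two bottom blobs, which must have identical (N, C, H, W) shapes; in numpy terms (the shapes here are illustrative):

```python
import numpy as np

# Two bottom blobs of identical shape, standing in for the layer outputs
# "Eltwise2" and "Convolution7" in the prototxt above.
eltwise2 = np.full((1, 64, 28, 28), 0.5)
convolution7 = np.full((1, 64, 28, 28), 1.5)
eltwise3 = eltwise2 + convolution7  # the top blob "Eltwise3"
```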


Original: https://www.cnblogs.com/makefile/p/ResNet.html
Author: 康行天下
Title: 深度学习基础网络 ResNet
