Deep Learning Backbone Networks: ResNet

Highway Networks

Paper: arXiv:1505.00387 [cs.LG] (ICML 2015); full-length version: Training Very Deep Networks (arXiv:1507.06228)

Training with gradient-descent-based algorithms becomes harder and harder as the number of layers grows (this is not simply the vanishing-gradient problem, since batch norm already addresses that). Inspired by the gate mechanism of LSTM and GRU cells in RNNs, the paper drops the recurrent sequence input at each step and the reset gate (no history needs to be forgotten), but keeps a gate that controls how the previous output is blended with the current layer's post-activation output. This yields the highway network, which adds shortcut connections called _information highways_ so that information can be passed across layers unchanged, making the trainable depth of the network almost unlimited in principle.

The nonlinear transformation performed by a traditional network layer (usually an affine transformation followed by a nonlinear activation) is:

$$y = H(x, W_H) \tag{1}$$

A highway network adds two nonlinear transforms: a transform gate $T(x, W_T)$ and a carry gate $C(x, W_C)$:

$$y = H(x, W_H)\cdot T(x, W_T) + x\cdot C(x, W_C) \tag{2}$$

Setting $C = 1 - T$ gives

$$y = H(x, W_H)\cdot T(x, W_T) + x\cdot (1 - T(x, W_T)) \tag{3}$$

When $T(x, W_T) = 0$, $y = x$; when $T(x, W_T) = 1$, $y = H(x, W_H)$. The gate therefore lets the network control its own behavior flexibly: intuitively, a layer no longer has to apply a full nonlinear feature transform, since the original features can be mixed back in directly, which makes the layer more elastic.

Equation (3) above requires $x$, $y$, $H(x, W_H)$ and $T(x, W_T)$ to have the same size. When the sizes do not match, $x$ can be down-sampled or zero-padded; alternatively, an extra layer can be used to change the dimensionality of $x$.

In the paper, $T(x) = \sigma(W_T^{T}x + b_T) \in (0, 1),\ \forall x \in \mathbb{R}$.
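
A minimal PyTorch sketch of a fully connected highway layer implementing equation (3); here $H$ is assumed to be an affine transform followed by ReLU, and the negative gate bias (so that the layer starts close to the identity) follows the paper's initialization suggestion. Names such as `HighwayLayer` are illustrative:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = H(x, W_H) * T(x, W_T) + x * (1 - T(x, W_T))  -- equation (3)."""
    def __init__(self, dim, gate_bias=-1.0):
        super().__init__()
        self.plain = nn.Linear(dim, dim)           # H(x, W_H)
        self.transform_gate = nn.Linear(dim, dim)  # T(x, W_T)
        # Negative bias makes T close to 0 at the start, so the layer begins near identity.
        nn.init.constant_(self.transform_gate.bias, gate_bias)

    def forward(self, x):
        h = torch.relu(self.plain(x))
        t = torch.sigmoid(self.transform_gate(x))
        return h * t + x * (1.0 - t)

# Usage: input and output keep the same width, so many layers can be stacked.
x = torch.randn(8, 64)
y = HighwayLayer(64)(x)   # shape (8, 64)
```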

After training, a highway network has automatically learned which layers are needed and which are not, whereas ResNet simply adds the two branches directly (in fixed, equal proportion).

ResNet

Paper: Deep Residual Learning for Image Recognition (CVPR 2016)

The differences from highway networks are that highway networks add extra gating parameters while ResNet does not, and that in a highway network the shortcut is essentially shut off when the carry gate approaches 0, whereas ResNet's identity shortcut is never closed and the residual function is always active.

The motivation for ResNet is the degradation problem: as a plain network gets deeper, its accuracy drops instead of improving. The reason is that as the model becomes more complex, optimization with SGD becomes harder.

[Figure: training/testing error on CIFAR-10. Going from a 20-layer to a 56-layer plain network, the error goes up rather than down.]

Advantage: by using identity mappings, ResNet converges faster without adding any extra parameters.

Identity mapping

A shortcut connection is added somewhere in the network so that the features of an earlier layer are passed through directly; this new connection is called an identity mapping, as shown in the figure below:

[Figure: a building block with a shortcut (identity mapping) connection]

When back-propagating through an identity mapping, its derivative is exactly the identity matrix $I$.

It is assumed that optimizing the residual mapping $F(x)$ is easier than optimizing the original mapping $H(x)$ (and the experimental results confirm this); the block therefore learns $F(x) = H(x) - x$, and $F(x) + x$ is realized with a shortcut connection.

If the dimensions of $\mathbf{x}$ and $\mathcal{F}$ are the same, then:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} \tag{1}$$

If they are not the same, the shortcut branch's $\mathbf{x}$ needs a linear projection to match the dimensions before the two can be added:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\mathbf{x} \tag{2}$$

The two tensors being added have exactly the same shape, and the corresponding feature maps are merged by element-wise addition, channel by channel.
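
A minimal PyTorch sketch (not the author's Caffe implementation) of a basic residual block following equations (1) and (2): the identity shortcut is used when the shapes match, and a 1×1 convolution plus BatchNorm stands in as the projection $W_s$ otherwise. Class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """y = F(x, {W_i}) + x, with a projection W_s on the shortcut when shapes differ."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Residual branch F(x): conv3x3-BN-ReLU, conv3x3-BN
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Shortcut branch: identity if shapes match, otherwise a 1x1 projection (W_s)
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)   # element-wise addition of feature maps
        return torch.relu(out)

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64, 128, stride=2)(x).shape)  # torch.Size([1, 128, 28, 28])
```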

Plain Network

The plain network is mainly inspired by VGG: it mostly uses 3×3 filters and follows two design rules:

  1. Conv layers that output feature maps of the same size have the same number of filters (with stride 1 and padding, the spatial size is unchanged).
  2. When a stride-2 conv halves the feature map size, the number of filters is doubled so that every conv layer has the same computational cost. Taking the 56→28 transition as an example, the 56×56 layer costs (56×56×64)×(3×3×64) multiply-adds and the 28×28 layer costs (28×28×128)×(3×3×128), which are equal (the stride-2 conv layer itself, of course, costs half as much); see the quick check below.
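
A quick sanity check of rule 2 in plain Python (counting one multiply-accumulate per weight per output position; the layer sizes are the ones from the example above):

```python
# multiply-accumulates per conv layer ≈ H_out * W_out * C_out * (k * k * C_in)
ops_56 = 56 * 56 * 64 * (3 * 3 * 64)     # 3x3 conv, 64 -> 64 channels, 56x56 output
ops_28 = 28 * 28 * 128 * (3 * 3 * 128)   # 3x3 conv, 128 -> 128 channels, 28x28 output
print(ops_56, ops_28, ops_56 == ops_28)  # 115605504 115605504 True
```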

Residual Network

Adding shortcut connections to the plain network yields ResNet. For how the shortcuts are connected, the paper proposes three options:

A. Use identity mappings everywhere; when a residual block's input and output dimensions differ, pad the added dimensions with zeros;

B. Use identity mappings when the block's input and output dimensions match, and a linear projection (a single Conv + BatchNorm layer is enough) when they do not;

C. Use a linear projection for every block.

Experiments with these three options show that although C performs better than B and A, the gap is very small, so linear projections are not essential; zero padding also keeps the model complexity at its lowest, which matters more for deeper networks. Since option A adds no extra parameters, it is the one chosen (a sketch of such a parameter-free shortcut follows).
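
A minimal PyTorch sketch of option A's parameter-free shortcut; how exactly the subsampling and padding are done here (strided slicing plus channel zero-padding) is an assumption for illustration, not the paper's reference code:

```python
import torch
import torch.nn.functional as F

def option_a_shortcut(x, out_ch, stride=2):
    """Parameter-free shortcut: subsample spatially, zero-pad the extra channels."""
    if stride > 1:
        x = x[:, :, ::stride, ::stride]   # spatial subsampling without parameters
    pad_ch = out_ch - x.size(1)
    # pad order for a 4-D NCHW tensor: (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, pad_ch))

x = torch.randn(1, 64, 56, 56)
print(option_a_shortcut(x, 128).shape)  # torch.Size([1, 128, 28, 28])
```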

Deeper Bottleneck structure

For deeper networks, a three-layer residual structure is used. As shown in the figure below, two 1×1 convolutions first reduce and then restore the dimensionality, so that the 3×3 convolution in between operates on fewer input and output channels; this lower-dimensional 3×3 convolution is the bottleneck. The two block designs shown have similar time complexity.

[Figure: a two-layer basic block (left) and a three-layer bottleneck block (right) with similar complexity]
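
A minimal PyTorch sketch of the bottleneck block on the right (1×1 reduce, 3×3, 1×1 restore, plus the identity shortcut); the 4× channel expansion follows the paper, while the names are illustrative:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, plus the identity shortcut."""
    expansion = 4  # e.g. 256 -> 64 -> 64 -> 256 channels

    def __init__(self, in_ch, mid_ch):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)   # reduce dimension
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)  # restore dimension
        self.bn1, self.bn2, self.bn3 = (nn.BatchNorm2d(mid_ch),
                                        nn.BatchNorm2d(mid_ch),
                                        nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = torch.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return torch.relu(out + x)  # identity shortcut (in_ch must equal out_ch here)

x = torch.randn(1, 256, 56, 56)
print(Bottleneck(256, 64)(x).shape)  # torch.Size([1, 256, 56, 56])
```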

The parameter-free identity shortcut is especially important for the bottleneck structure: if it were replaced by a projection with parameters, the model size and complexity would increase noticeably, because the shortcut connects the two high-dimensional ends of the block.

The 18-, 34-, 50-, 101- and 152-layer networks given in the paper are listed in the following table:

[Table: ResNet-18/34/50/101/152 architectures]
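
For reference, the per-stage block counts from that table, written as a plain Python dictionary (`basic` refers to the two-layer block and `bottleneck` to the three-layer block sketched above):

```python
# (block type, number of blocks in each of the four stages conv2_x ... conv5_x)
resnet_configs = {
    18:  ("basic",      [2, 2, 2, 2]),
    34:  ("basic",      [3, 4, 6, 3]),
    50:  ("bottleneck", [3, 4, 6, 3]),
    101: ("bottleneck", [3, 4, 23, 3]),
    152: ("bottleneck", [3, 8, 36, 3]),
}
```
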
Differences between the plain network, the residual network, and VGG-19:

[Figure: VGG-19, the 34-layer plain network, and the 34-layer residual network side by side]

On degradation

The paper points out that the degradation is unlikely to be caused by vanishing gradients, because BatchNorm layers are used throughout the network to keep the signals propagating. The exact cause is still unclear and is left for follow-up research.

It is also found that even with many more iterations (3×), the degradation persists.

Interpreting ResNet

Following the interpretation of ResNet in the authors' follow-up paper, Identity Mappings in Deep Residual Networks:

[Figure: a residual unit (left) and its unrolled, multi-path form (right)]

A residual unit can be unrolled into the form shown on the right, which makes it clear that a residual network is really a network made up of many parallel paths. Put bluntly, a residual network is a combination of many parallel sub-networks, and the whole network behaves much like an ensemble of voters.
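
Concretely, with identity shortcuts (and assuming the after-addition activation is also an identity, as in Identity Mappings in Deep Residual Networks), the output of any deeper unit $L$ unrolls into a sum over all shallower units, and the gradient contains a term that bypasses every weight layer:

$$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1}\mathcal{F}(\mathbf{x}_i, W_i), \qquad \frac{\partial E}{\partial \mathbf{x}_l} = \frac{\partial E}{\partial \mathbf{x}_L}\left(1 + \frac{\partial}{\partial \mathbf{x}_l}\sum_{i=l}^{L-1}\mathcal{F}(\mathbf{x}_i, W_i)\right)$$

The additive 1 means part of the gradient flows from unit $L$ to unit $l$ directly, without passing through any weight layer, and the sum form makes the many-paths (ensemble-like) view explicit.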

In this view, ResNet only looks deep on the surface; in effect the network is fairly shallow.
Does ResNet really solve the vanishing-gradient problem of deep networks? Apparently not: ResNet works more like an ensemble voting system.

Code implementation

In Caffe, the feature-map addition is implemented simply with an Eltwise layer using the SUM operation:

layer {
  name: "Eltwise3"
  type: "Eltwise"
  bottom: "Eltwise2"
  bottom: "Convolution7"
  top: "Eltwise3"
  eltwise_param {
    operation: SUM
  }
}

Original: https://www.cnblogs.com/makefile/p/ResNet.html
Author: 康行天下
Title: 深度学习基础网络 ResNet
