Deep Learning Paper Translation and Analysis - MobileNetV2: Inverted Residuals and Linear Bottlenecks

Paper title: MobileNetV2: Inverted Residuals and Linear Bottlenecks

Authors: Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen

Paper link: https://arxiv.org/pdf/1801.04381.pdf

Reference MobileNetV2 translation blog: (linked in the original post)

(This translation is also good: https://blog.csdn.net/qq_31531635/article/details/80550412)

Disclaimer: this translation is for study purposes only. If there is any infringement, please contact the editor to delete the post. Thank you!

The editor is a beginner in machine learning and intends to study the paper carefully, but my English is limited, so Google Translate was used and the result was checked sentence by sentence. There may still be some obscure passages, grammatical errors, or mistranslated technical terms; please forgive them and feel free to point them out.

If you need the editor's other paper translations, please visit the editor's GitHub:

https://github.com/LeBron-Jian/DeepLearningNote

Abstract

In this paper, we describe a new mobile architecture, MobileNetV2, that improves the state-of-the-art performance of mobile models on multiple tasks and benchmark datasets as well as across a spectrum of different model sizes. We also describe an efficient way of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3, which we call Mobile DeepLabv3.

The MobileNetV2 architecture is based on an inverted residual structure, where the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. In addition, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide the intuition that led to this design.

Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet [1] classification, COCO object detection [2], and VOC image segmentation [3]. We evaluate the trade-offs between accuracy, the number of operations measured by multiply-adds (MAdds), the actual latency, and the number of parameters.

1. Introduction

Neural networks have revolutionized many areas of machine intelligence, enabling superhuman accuracy on challenging image recognition tasks. However, the drive to improve accuracy often comes at a cost: state-of-the-art networks require high computational resources that are beyond the capabilities of many mobile and embedded applications.

This paper introduces a new neural network architecture that is specifically tailored for mobile and resource-constrained environments. Our network pushes the state of the art for mobile-tailored computer vision models by significantly reducing the number of operations and the memory needed while retaining the same accuracy.

Our main contribution is a novel layer module: the inverted residual with linear bottleneck. This module takes as input a low-dimensional compressed representation, first expands it to a high dimension and filters it with a lightweight depthwise convolution. Features are subsequently projected back to a low-dimensional representation with a linear convolution. The official implementation is available as part of the TensorFlow-Slim model library in [4] (https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet).

This module can be efficiently implemented using standard operations in any modern framework and allows our models to beat the state of the art along multiple performance points on standard benchmarks. Furthermore, this convolutional module is particularly suitable for mobile designs, because it allows the memory footprint needed during inference to be significantly reduced by never fully materializing large intermediate tensors. This reduces the need for main-memory access in many embedded hardware designs, which provide only small amounts of very fast software-controlled cache memory.

2. Related Work

Tuning deep neural architectures to strike an optimal balance between accuracy and performance has been an active area of research over the last several years. Both manual architecture search and improvements in training algorithms carried out by numerous teams have led to dramatic improvements over early designs such as AlexNet [5], VGGNet [6], GoogLeNet [7], and ResNet [8]. Recently there has been a lot of progress in algorithmic architecture exploration, including hyper-parameter optimization [9, 10, 11], various methods of network pruning [12, 13, 14, 15, 16, 17], and connectivity learning [18, 19]. A substantial amount of work has also been dedicated to changing the connectivity structure of the internal convolutional blocks, such as ShuffleNet [20], or introducing sparsity [21] and others [22].

Recently, [23, 24, 25, 26] opened up a new direction of bringing optimization methods, including genetic algorithms and reinforcement learning, into architecture search. However, one drawback is that the resulting networks end up very complex. In this work, our goal is to develop a better intuition about how neural networks operate and use that to guide the simplest possible network design. Our approach should be seen as complementary to the methods described in [23] and related work. In this vein, our approach is similar to the ones taken by [20, 22] and allows us to further improve performance while providing a glimpse into its internal operation. Our network design is based on MobileNetV1 [27]. It retains its simplicity, does not require any special operators, and significantly improves accuracy, achieving the state of the art on multiple image classification and detection tasks for mobile applications.

3. Preliminaries, Discussion, and Intuition

3.1 Depthwise Separable Convolutions

Depthwise separable convolutions are a key building block of many efficient neural network architectures (ShuffleNet, MobileNetV1, Xception). For our work, the basic idea is to replace the original standard convolution operator with a factorized version that splits the standard convolution into two steps. The first step is called depthwise convolution; it performs lightweight filtering by applying a single convolutional filter per input channel. The second step is a 1×1 convolution, called pointwise convolution, which is responsible for building new features by computing linear combinations of the input channels.

MobileNet is a model based on depthwise separable convolutions, which factorize a standard convolution into a depthwise convolution and a 1×1 convolution (pointwise convolution). In MobileNet, the depthwise convolution applies a single filter to each input channel, and the pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and combines all inputs into a new set of outputs in one step; the depthwise separable convolution splits this into two steps, a separate layer for filtering and a separate layer for combining. This factorization drastically reduces the computation and the model size.

A standard convolutional layer takes as input a D_F × D_F × M feature map F and produces a D_G × D_G × N output feature map G, where D_F is the spatial width and height of the input feature map, M is the number of input channels (input depth), D_G is the spatial width and height of the output feature map, and N is the number of output channels (output depth).

The standard convolutional layer is parameterized by a convolution kernel K of size D_K × D_K × M × N, where D_K is the spatial dimension of the kernel, M is the number of input channels, and N is the number of output channels.

The output feature map of a standard convolution, assuming stride one and padding, is computed as:

$$ G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m} $$

The computational cost is D_K × D_K × M × N × D_F × D_F, determined by the number of input channels M, the number of output channels N, the kernel size D_K, and the output feature map size D_F. The MobileNet model improves on each of these terms: first, it uses depthwise separable convolutions to break the interaction between the number of output channels and the size of the kernel.

The standard convolution operation filters features based on the convolutional kernels and combines the features to produce a new representation. The filtering and combination steps can be split into two separate parts via factorized convolutions, called depthwise separable convolutions, which greatly reduces the computational cost.

A depthwise separable convolution consists of two layers: a depthwise convolution and a pointwise convolution. We use the depthwise convolution to convolve each input channel with a single filter, producing as many outputs as there are input channels, and then use the pointwise convolution, a simple 1×1 convolution, to linearly combine the outputs of the depthwise convolution. MobileNets use BatchNorm and ReLU non-linearities after each layer.
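
As a concrete illustration (my own sketch, not code from the paper), a MobileNetV1-style depthwise separable convolution — a 3×3 depthwise filter per channel followed by a 1×1 pointwise combination, each with BatchNorm and ReLU — could look like this in PyTorch; the class and parameter names are mine.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Sketch of a MobileNetV1-style block: 3x3 depthwise conv + 1x1 pointwise conv,
    each followed by BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise step: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise step: 1x1 conv builds new features as linear combinations of channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

# Example: map a 32-channel 112x112 feature map to 64 channels
x = torch.randn(1, 32, 112, 112)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 112, 112])
```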

Depthwise convolution, with one filter per input channel, can be written as:

$$ \hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m} $$

where K̂ is the depthwise convolutional kernel of size D_K × D_K × M; the m-th filter in K̂ is applied to the m-th channel in F to produce the m-th channel of the filtered output feature map Ĝ.

The computational cost of depthwise convolution is D_K × D_K × M × D_F × D_F.

Depthwise convolution is extremely efficient relative to standard convolution, but it only filters the input channels; it does not combine them to create new features. So an additional layer is needed that computes a linear combination of the outputs of the depthwise convolution, via a 1×1 convolution, to generate these new features.

The combination of a depthwise convolution and a 1×1 (pointwise) convolution is called a depthwise separable convolution, which was originally introduced in (Rigid-motion scattering for image classification).

The computational cost of a depthwise separable convolution is D_K × D_K × M × D_F × D_F + M × N × D_F × D_F.

By expressing convolution as a two-step process of filtering and combining, we get a reduction in computation of:

$$ \frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2} $$

MobileNet uses 3×3 depthwise separable convolutions, which require 8 to 9 times less computation than standard convolutions at only a small reduction in accuracy. MobileNetV2 also uses k = 3 (3×3 depthwise separable convolutions).
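
To make the 8-9× figure concrete, here is a small calculation (my own example layer sizes) that plugs the two cost formulas above into plain Python:

```python
def standard_conv_madds(dk, m, n, df):
    # D_K * D_K * M * N * D_F * D_F
    return dk * dk * m * n * df * df

def separable_conv_madds(dk, m, n, df):
    # depthwise: D_K * D_K * M * D_F * D_F, plus pointwise: M * N * D_F * D_F
    return dk * dk * m * df * df + m * n * df * df

# Hypothetical layer: 3x3 kernel, 256 input and 256 output channels, 14x14 output map
std = standard_conv_madds(3, 256, 256, 14)
sep = separable_conv_madds(3, 256, 256, 14)
print(std / sep)  # about 8.7, i.e. roughly the 8-9x reduction quoted above
```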

3.2 Linear Bottlenecks

Consider a deep neural network consisting of n layers L_i, each followed by an activation tensor of dimensions h_i × w_i × d_i. In this section we discuss the basic properties of these activation tensors, which we treat as containers of h_i × w_i "pixels" with d_i dimensions. Informally, for an input set of real images, the set of layer activations (for any layer L_i) forms a "manifold of interest". It has long been assumed that manifolds of interest in neural networks can be embedded into low-dimensional subspaces. In other words, when we look at all the individual d-dimensional channel pixels of a deep convolutional layer, the information encoded in those values lies, in various forms, in some low-dimensional subspace.

This is probably the hardest part of the paper. Combining what others have written about it, my understanding is the following: we consider a deep neural network made up of n layers L_i, where the output tensor of each layer is h_i × w_i × d_i. We think of a sequence of convolution and activation layers as forming a manifold of interest (the content of the data we actually care about). At this stage such manifolds cannot be described quantitatively, so their properties are studied empirically here. It has long been believed that the manifold of interest in a neural network can be embedded into a low-dimensional subspace. Roughly speaking, when we look at all the individual d-channel pixels of a convolutional layer, these values encode information in many forms, and the manifold of interest lies within them. We can transform it and embed it into the next low-dimensional subspace (for example, by changing the dimensionality of a 1×1 convolution we change the dimensionality of the space in which the manifold of interest lives).

At first glance, this could be exploited simply by reducing the dimensionality of a layer, thereby reducing the dimensionality of the operating space. This has been exploited in MobileNetV1 to trade off computation and accuracy effectively via a width multiplier, and it has been incorporated into other efficient model designs as well (ShuffleNet: An extremely efficient convolutional neural network for mobile devices). Following that intuition, the width multiplier allows one to reduce the dimensionality of the activation space until the manifold of interest spans the entire space. However, this intuition breaks down once we recall that deep convolutional neural networks actually have non-linear per-coordinate transformations, such as ReLU. For example, ReLU applied to a line in 1D space produces a ray, whereas in R^n space it generally produces a piecewise linear curve with n joints.

On the other hand, when ReLU collapses a channel, it inevitably loses the information in that channel. However, if we have very many channels and there is structure in the activation manifold, that information might still be preserved in the other channels. In supplementary material, we show that if the input manifold can be embedded into a significantly lower-dimensional subspace of the activation space, then the ReLU transformation preserves the information while introducing the needed complexity into the set of expressible functions.

To summarize, we have highlighted two properties that indicate that the manifold of interest should lie in a low-dimensional subspace of the higher-dimensional activation space:

1. If the manifold of interest remains non-zero volume after a ReLU transformation, it corresponds to a linear transformation.

2. ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.

These two insights provide an empirical hint for optimizing existing neural architectures: assuming the manifold of interest is low-dimensional, we can capture it by inserting linear bottleneck layers into the convolutional blocks. Experimental evidence suggests that using linear layers is crucial, as it prevents non-linearities from destroying too much information. In Section 6, we show empirically that using non-linear layers in bottlenecks indeed hurts performance by several percentage points, further validating our hypothesis. We note that similar experiments were reported in (Deep Pyramidal Residual Networks), where removing the non-linearity from the input of a traditional residual block improved performance on the CIFAR dataset. In the remainder of this paper we will use bottleneck convolutions, and we refer to the ratio between the size of the input bottleneck and the inner size as the expansion ratio.

Figure 1: Examples of ReLU transformations of low-dimensional manifolds embedded in higher-dimensional spaces. In these examples, an initial spiral is embedded into an n-dimensional space with a random matrix T followed by ReLU, and then projected back into 2D space using T⁻¹. In the examples, n = 2, 3 results in information loss, where one can see points of the manifold collapsing into each other, while for n = 15, 30 the transformation becomes highly non-convex.

What this figure shows: if the manifold of interest occupies most of the current activation space, ReLU may cause the activation space to collapse and inevitably lose information. So when designing the network, if we want to reduce computation we want the dimensionality to be as low as possible; but if the dimensionality is low, the ReLU transformation may filter out a lot of useful information. This suggests a question: since ReLU acts as a linear mapping on the non-negative part of its input anyway, could we use a linear layer instead, and thereby design lower-dimensional layers without losing dimensional information?
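
The experiment behind Figure 1 can be approximated in a few lines of NumPy (my own sketch, not the authors' code): embed 2-D spiral points into n dimensions with a random matrix T, apply ReLU, project back with the pseudo-inverse of T, and measure how badly the shape is distorted (after allowing a global rescaling). The distortion is large for small n and small for large n, matching the qualitative behaviour in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.linspace(0, 4 * np.pi, 1000)
spiral = np.stack([theta * np.cos(theta), theta * np.sin(theta)])   # 2 x 1000 points

for n in (2, 3, 15, 30):
    T = rng.standard_normal((n, 2))                         # random embedding into n dims
    back = np.linalg.pinv(T) @ np.maximum(T @ spiral, 0.0)  # ReLU in n dims, then project back
    c = np.sum(back * spiral) / np.sum(back * back)         # best global rescaling factor
    distortion = np.linalg.norm(c * back - spiral) / np.linalg.norm(spiral)
    print(n, round(float(distortion), 3))                   # large for n=2,3; small for n=15,30
```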

To address this problem, the paper uses a Linear Bottleneck (a linear transformation with no ReLU activation) in place of the original non-linear activation. This yields the idea for optimizing the architecture: insert a linear bottleneck after the convolutional module to capture the manifold of interest. Experiments show that using a linear bottleneck prevents non-linearities from destroying too much information.

The ratio between the dimensionality of the linear bottleneck and that of the depthwise convolution is called the expansion factor; this factor controls the number of channels in the whole block.

Figure 2: Evolution of separable convolution blocks. The diagonally hatched texture indicates layers that do not contain non-linearities. The last (lightly colored) layer indicates the beginning of the next block. Note: (d) and (c) are equivalent blocks, as can be seen from the coloring.

3.3 Inverted Residuals

The bottleneck block appears similar to the residual block: each block contains an input followed by several bottlenecks and then an expansion. However, inspired by the intuition that the bottleneck layers actually contain all the necessary information, while the expansion layer acts merely as an implementation detail that accompanies a non-linear transformation of the tensor, we use shortcuts directly between the bottleneck layers. Figure 3 provides a schematic visualization of the difference in the designs. The motivation for inserting shortcuts is the same as for the classical residual connection: we want to improve the ability of the gradient to propagate across multiple layers. However, the inverted design is considerably more memory-efficient (described in detail in Section 5) and works slightly better in our experiments.

The running time and parameter count of the bottleneck layer, together with its basic implementation structure, are shown in Table 1. For a block of size h × w, expansion factor t and kernel size k, with d' input channels and d'' output channels, the total number of multiply-adds required is:

$$ h \cdot w \cdot d' \cdot t \,(d' + k^2 + d'') $$

Compared with the earlier computation, this expression has an extra term, since we have the additional 1×1 convolution; however, the nature of our network allows us to exploit much smaller input and output dimensions. In Table 3 we compare the sizes required by MobileNetV1, MobileNetV2, and ShuffleNet at different resolutions.
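
The multiply-add expression above is easy to evaluate directly; the helper below (my own, with an example layer size) shows how the cost depends on the expansion factor t:

```python
def bottleneck_madds(h, w, d_in, d_out, t, k=3):
    """Multiply-adds of one inverted residual bottleneck, per the expression above:
    1x1 expansion + kxk depthwise + 1x1 projection = h*w*d_in*t*(d_in + k*k + d_out)."""
    return h * w * d_in * t * (d_in + k * k + d_out)

# Hypothetical block: 14x14 feature map, 64 -> 96 channels, expansion t=6
print(bottleneck_madds(14, 14, 64, 96, 6))  # 12,719,616 multiply-adds (~12.7M)
```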

Figure 3: The difference between a residual block (as in ResNet and Aggregated Residual Transformations for Deep Neural Networks) and an inverted residual block. Diagonally hatched layers do not use non-linearities, and the thickness of each block indicates its relative number of channels. Note that the classical residual connects layers with a large number of channels, whereas the inverted residual connects the bottleneck layers.

Table 3: The maximum number of channels/memory that needs to be materialized at each spatial resolution for the different architectures. We assume activations require 16 bits. For ShuffleNet, we use 2x, g=3, which matches MobileNetV1 and MobileNetV2. For the first layer of MobileNetV2 and ShuffleNet we can employ the trick described in Section 5 to reduce the memory requirement. Even though ShuffleNet employs bottlenecks elsewhere, the non-bottleneck tensors still need to be materialized because of the shortcuts between the non-bottleneck tensors.

3.4 Information Flow Interpretation

One property of our architecture is that the input and output domains of the building blocks (bottleneck layers) provide a natural separation, while the layer transformation is a non-linear function between input and output. The former can be seen as the capacity of the network at each layer, whereas the latter can be seen as the expressiveness of the network. This is in contrast with traditional convolutional blocks, both regular and separable, where expressiveness and capacity are tangled together and are functions of the output layer depth.

In particular, in our case, when the inner layer depth is 0, the underlying convolution becomes the identity function thanks to the shortcut. When the expansion ratio is smaller than 1, it becomes a classical residual convolutional block. However, for our purposes, we show that an expansion ratio greater than 1 is the most effective.

This interpretation allows us to study the expressiveness of the network separately from its capacity, and we believe that further exploration of this separation is warranted, as it can ensure a deeper understanding of the network's properties.

4. Model Architecture

We now describe the structure of our model in detail. As mentioned earlier, the basic building block is a bottleneck depth-separable convolution with residuals; the detailed structure of this block is shown in Table 1. MobileNetV2 contains an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers (see Table 2). We use ReLU6 as the non-linear activation function, because ReLU6 is more robust when used with low-precision computation. We always use kernels of size 3×3, and we use dropout and batch normalization during training.

With the exception of the first layer, we use a constant expansion ratio throughout the network. In our experiments, we find that expansion ratios between 5 and 10 result in nearly identical performance curves: smaller networks do slightly better with slightly smaller expansion ratios, while larger networks have somewhat better performance with larger expansion ratios.

For all of our main experiments we use an expansion factor of 6 applied to the size of the input tensor. For example, for a bottleneck layer that takes a 64-channel input tensor and produces a tensor with 128 channels, the intermediate expansion layer then has 64 × 6 = 384 channels.
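
The 64 → 384 → 128 example in code form (trivial, but it makes the role of the expansion factor explicit; the helper function is mine):

```python
def bottleneck_channel_flow(d_in, d_out, t=6):
    """Channel counts through one bottleneck: expand -> depthwise -> linear projection."""
    d_mid = d_in * t
    return [("1x1 expansion", d_in, d_mid),
            ("3x3 depthwise", d_mid, d_mid),
            ("1x1 linear projection", d_mid, d_out)]

for name, c_in, c_out in bottleneck_channel_flow(64, 128):
    print(f"{name}: {c_in} -> {c_out} channels")   # 64 -> 384, 384 -> 384, 384 -> 128
```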

Trade-off hyper-parameters. As in MobileNetV1, we tailor our architecture to different performance points by using the input image resolution and a width multiplier as tunable hyper-parameters, which can be adjusted according to the desired accuracy/performance trade-off. Our primary network (width multiplier 1, 224×224) has a computational cost of about 300 million multiply-adds and uses 3.4 million parameters. We explore the performance trade-offs for input resolutions from 96 to 224 and width multipliers from 0.35 to 1.4. The network computational cost ranges from 7 to 585 million MAdds, while the model size varies between 1.7M and 6.9M parameters.

One minor implementation difference from MobileNetV1 is that for width multipliers less than one, we apply the width multiplier to all layers except the very last convolutional layer. This improves performance for smaller models.
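
A hedged sketch of how such a width multiplier might be applied in code; the function name and rounding rule are my own simplification (real implementations typically also round channel counts to a multiple of 8), but it reflects the rule above of keeping the last convolutional layer at full width when the multiplier is below 1:

```python
def scaled_channels(base_channels, width_mult, is_last_conv=False):
    """Apply a width multiplier to a layer's channel count (illustrative only)."""
    if is_last_conv and width_mult < 1.0:
        return base_channels                      # last conv layer keeps its full width
    return max(8, int(round(base_channels * width_mult)))

print(scaled_channels(320, 0.35))                      # intermediate layer shrinks to 112
print(scaled_channels(1280, 0.35, is_last_conv=True))  # last 1x1 conv stays at 1280
```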

Table 2: MobileNetV2. Each line describes a sequence of one or more identical layers, repeated n times. All layers in the same sequence have the same number c of output channels. The first layer of each sequence has stride s, and all the others use stride 1. All spatial convolutions use 3×3 kernels. The expansion factor t is always applied to the input size as described in Table 1.
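
Table 2 appears only as an image in the original post; for reference, the (t, c, n, s) settings it lists — expansion factor, output channels, number of repeats, and stride of the first repeat — can be written down as a plain Python list (values taken from Table 2 of the paper):

```python
# (t, c, n, s): expansion factor, output channels, number of repeats, stride of first repeat
MOBILENETV2_BOTTLENECKS = [
    (1,  16, 1, 1),
    (6,  24, 2, 2),
    (6,  32, 3, 2),
    (6,  64, 4, 2),
    (6,  96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]
# This sequence is preceded by a 3x3 convolution with 32 filters (stride 2) and followed
# by a 1x1 convolution to 1280 channels, global average pooling, and the classifier.
```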

Let us go over this once more. The MobileNetV2 building block has the following structure (similar to the figure above; the illustration here is taken from a web page):

(Figure: MobileNetV2 building block — expansion layer, depthwise convolution, projection layer.)

We know that the main idea of the MobileNetV1 network is the stacking of depthwise separable convolutions. In the V2 design, in addition to continuing to use the depthwise separable structure (the middle part), we also add an Expansion layer and a Projection layer. The projection layer likewise uses a 1×1 convolution, here to map a high-dimensional space down to a low-dimensional one; it is sometimes also called the bottleneck layer.

The Expansion layer does the opposite: it uses a 1×1 convolution to map a low-dimensional space to a high-dimensional one. The expansion layer has a hyper-parameter, the expansion factor, which controls by how much the dimensionality is expanded. It can be adjusted as needed; the default is 6, i.e. a 6× expansion.

(Figure: detailed structure of the MobileNetV2 block — 1×1 expansion, 3×3 depthwise convolution, 1×1 linear projection, with a shortcut between input and output.)

This figure shows the structure of the whole block in more detail. The input here is 24-dimensional and the final output is also 24-dimensional, but in between we expand by a factor of 6 and then apply a depthwise separable convolution. The whole block is thick in the middle and thin at both ends, like a spindle, whereas the bottleneck residual block in ResNet is thick at both ends and thin in the middle. Since the MobileNetV2 block is exactly the opposite, it is called an inverted residual. In addition, the residual connection links the input and the output, and in the final projection conv of the linear bottleneck we no longer use a ReLU activation but a linear activation instead.
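
Putting the pieces together, an inverted residual block — 1×1 expansion with ReLU6, 3×3 depthwise with ReLU6, then a linear 1×1 projection, with a shortcut only when the stride is 1 and the input and output widths match — might be sketched in PyTorch as follows. The official implementation is in TensorFlow-Slim; this is an illustrative re-implementation, not the authors' code.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2 bottleneck sketch: expand -> depthwise -> linear projection."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        layers = []
        if expand_ratio != 1:
            # 1x1 expansion to a higher-dimensional space, with ReLU6
            layers += [nn.Conv2d(in_ch, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True)]
        layers += [
            # 3x3 depthwise filtering in the expanded space, with ReLU6
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # linear 1x1 projection back to a low-dimensional space (no ReLU here)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

x = torch.randn(1, 24, 56, 56)
print(InvertedResidual(24, 24, stride=1)(x).shape)  # shortcut used: [1, 24, 56, 56]
print(InvertedResidual(24, 32, stride=2)(x).shape)  # no shortcut:   [1, 32, 28, 28]
```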

5. Implementation Notes

5.1 Memory-Efficient Inference

The inverted residual bottleneck layers allow a particularly memory-efficient implementation, which is very important for mobile applications. A standard implementation of inference, such as those used in TensorFlow or Caffe, builds a directed acyclic compute hypergraph G, consisting of edges representing the operations and nodes representing the tensors of intermediate computation. The computation is scheduled in order to minimize the total number of tensors that need to be stored in memory. In the most general case, it searches over all plausible computation orders Σ(G) and picks the one that minimizes the following:

$$ M(G) = \min_{\pi \in \Sigma(G)} \; \max_{i \in 1..n} \Big[ \sum_{A \in R(i,\pi,G)} |A| \Big] + \operatorname{size}(\pi_i) $$

where R(i, π, G) is the list of intermediate tensors that are connected to any of the nodes π_i ... π_n, |A| denotes the size of tensor A, and size(i) is the total amount of memory needed for internal storage during operation i.

For graphs that have only trivial parallel structure (such as residual connections), there is only one non-trivial feasible computation order, and therefore the total amount of memory needed for inference on the compute graph G can be simplified to:

$$ M(G) = \max_{op \in G} \Big[ \sum_{A \in op_{\text{inp}}} |A| + \sum_{B \in op_{\text{out}}} |B| + |op| \Big] $$

Bottleneck Residual Block. The F(x) shown in Figure 3b can be expressed as a composition of three operators, F(x) = [A ∘ N ∘ B]x, where A is a linear transformation, N is a non-linear per-channel transformation, and B is again a linear transformation to the output domain.

For our networks N = ReLU6 ∘ dwise ∘ ReLU6, but the result applies to any per-channel transformation. Suppose the size of the input domain is |x| and the size of the output domain is |y|; then the memory required to compute F(x) can be as low as |s²k| + |s'²k'| + O(max(s², s'²)).

The algorithm is based on the fact that the inner tensor I can be represented as a concatenation of t tensors of size n/t each, so that our function can be represented as:

$$ F(x) = \sum_{i=1}^{t} (A_i \circ N \circ B_i)(x) $$

With the summation split this way, we only need to keep one intermediate block of size n/t in memory at all times. Using n = t, we end up having to keep only a single channel of the intermediate representation. The two constraints that enable us to use this trick are (a) the fact that the inner transformation (which includes the non-linearity and the depthwise convolution) is per-channel, and (b) that the consecutive non-per-channel operators have a significant ratio of input size to output size. For most traditional neural networks, this trick would not produce a significant improvement.
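
A toy NumPy illustration of this t-way split (my own sketch): with A an expansion matrix, N an element-wise non-linearity, and B a projection matrix, splitting A by rows and B by columns yields exactly the same result while only ever materializing an n/t-sized slice of the intermediate tensor. The same argument carries over to the real block because the depthwise convolution is also per-channel.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, k_out, t = 16, 96, 24, 4           # input dim, expanded dim, output dim, number of splits
x = rng.standard_normal(k)
A = rng.standard_normal((n, k))          # expansion (plays the role of the 1x1 expansion)
B = rng.standard_normal((k_out, n))      # projection (plays the role of the linear 1x1 projection)
relu = lambda z: np.maximum(z, 0.0)      # N: per-element non-linearity

full = B @ relu(A @ x)                   # materializes the whole n-dimensional intermediate tensor

split = np.zeros(k_out)
for i in range(t):
    rows = slice(i * n // t, (i + 1) * n // t)
    split += B[:, rows] @ relu(A[rows] @ x)   # only an n/t-sized slice exists at any time

print(np.allclose(full, split))          # True: same output, smaller peak memory
```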

We note that the number of multiply-add operations needed to compute F(X) using a t-way split is independent of t; however, in existing implementations we find that replacing one matrix multiplication with several smaller ones hurts runtime performance because of increased cache misses. We find this approach most useful when t is a small constant between 2 and 5: it significantly reduces the memory requirement while still allowing us to realize most of the efficiency gained by the highly optimized matrix multiplication and convolution operators provided by deep learning frameworks. It remains to be seen whether special framework-level optimizations could lead to further runtime improvements.

6. Experiments

6.1 ImageNet Classification

Training setup. We train our models using TensorFlow, with the standard RMSProp optimizer with both decay and momentum set to 0.9. We use batch normalization after every layer, and the standard weight decay is set to 0.00004. Following the MobileNetV1 setup, we use an initial learning rate of 0.045 and a learning rate decay of 0.98 per epoch. We use 16 GPU asynchronous workers and a batch size of 96.
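
The same hyper-parameters expressed in code: the paper trains with TensorFlow, but an equivalent PyTorch configuration (my own mapping, not the authors' training script) would look roughly like this:

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in for the actual MobileNetV2 model
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.045,            # initial learning rate
    alpha=0.9,           # RMSProp decay
    momentum=0.9,        # momentum
    weight_decay=4e-5,   # weight decay of 0.00004
)
# decay the learning rate by a factor of 0.98 every epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
```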

Results. We compare against the MobileNetV1, ShuffleNet, and NASNet-A models. The statistics of several models are shown in Table 4, and the performance comparison is shown in Figure 5.

6.2 Object Detection

We evaluate and compare the performance of MobileNetV2 and MobileNetV1 as feature extractors [33] for object detection with a modified version of the Single Shot Detector (SSD) [34] on the COCO dataset [2]. We also compare against YOLOv2 [35] and the original SSD (with VGG-16 as the base network) as baselines. Since we focus on mobile/real-time models, we do not compare the performance of other architectures such as Faster R-CNN [36] and R-FCN [37].

SSDLite. In this paper, we introduce a mobile-friendly variant of regular SSD. We replace all the regular convolutions with separable convolutions (depthwise followed by a 1×1 projection) in the SSD prediction layers. This design is in line with the overall design of MobileNets and is considerably more computationally efficient. We call this modified version SSDLite. Compared with regular SSD, SSDLite dramatically reduces both the parameter count and the computational cost, as shown in Table 5.
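
A hedged sketch of the SSDLite idea: each regular 3×3 convolution in an SSD prediction layer is replaced by a 3×3 depthwise convolution followed by a 1×1 projection. The function name and the exact normalization/activation placement are my own illustration, not the paper's definition.

```python
import torch.nn as nn

def ssdlite_prediction_head(in_ch, out_ch):
    """SSDLite-style prediction layer: 3x3 depthwise conv + 1x1 projection,
    replacing the single regular 3x3 conv used in standard SSD heads."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1),   # e.g. out_ch = num_anchors * 4 for box regression
    )
```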

For MobileNetV1, we follow the setup in [33]. For MobileNetV2 SSDLite, the first layer of SSDLite is attached to the expansion of layer 15 (with an output stride of 16), and the second and the rest of the SSDLite layers are attached on top of the last layer (with an output stride of 32). This setup is consistent with MobileNetV1, as all layers are attached to feature maps with the same output stride.

6.3 Semantic Segmentation

In this section, we compare the MobileNetV1 and MobileNetV2 models used as feature extractors with DeepLabv3 [39] on the task of mobile semantic segmentation. DeepLabv3 adopts atrous convolution [40, 41, 42], a powerful tool for explicitly controlling the resolution of the computed feature maps, and builds five parallel heads including (a) an Atrous Spatial Pyramid Pooling module (ASPP) containing three 3×3 convolutions with different atrous rates, (b) a 1×1 convolution head, and (c) image-level features [44]. We denote by output stride the ratio of the input image spatial resolution to the final output resolution, which is controlled by applying atrous convolution appropriately. For semantic segmentation, we usually employ output stride = 16 or 8 to obtain denser feature maps. We conduct the experiments on the PASCAL VOC 2012 dataset, with the extra annotated images from [45] and the evaluation metric mIOU.

To build a mobile model, we experiment with three design variants: (1) different feature extractors, (2) simplifying the DeepLabv3 heads for faster computation, and (3) different inference strategies for boosting performance. Our results are summarized in Table 7. We observe that: (a) inference strategies that include multi-scale inputs and adding left-right flipped images significantly increase the MAdds, so they are not suitable for on-device applications; (b) using output stride = 16 is more efficient than output stride = 8; (c) MobileNetV1 is already a powerful feature extractor and requires only about 4.9 to 5.7 times fewer MAdds than ResNet-101 [8] (e.g., mIOU: 78.56 vs 82.70, and MAdds: 941.9B vs 4870.6B); (d) it is more efficient to build the DeepLabv3 heads on top of the second-to-last feature map of MobileNetV2 than on the original last feature map, because the second-to-last feature map contains 320 channels instead of 1280, and by doing so we attain similar performance with roughly 2.5 times fewer operations than the MobileNetV1 counterpart; (e) the DeepLabv3 heads are computationally expensive, and removing the ASPP module significantly reduces MAdds with only a slight performance degradation. At the end of Table 7, we identify a potential candidate for on-device applications (in bold), which reaches 75.32% mIOU and requires only 2.75B MAdds.

6.4 Ablation Study

Inverted residual connections. The importance of residual connections has been studied extensively [8, 30, 46]. The new result reported in this paper is that the shortcut connecting the bottlenecks performs better than the shortcut connecting the expanded layers (see Figure 6b for a comparison).

Importance of linear bottlenecks. The linear bottleneck models are strictly less powerful than models with non-linearities, because the activations can always operate in the linear regime with an appropriate change of biases and scaling. However, our experiments shown in Figure 6a indicate that linear bottlenecks improve performance, providing support for the claim that non-linearity destroys information in low-dimensional space.

7. Conclusions and Future Work

We described a very simple network architecture that allows us to build highly efficient mobile models. Our basic building unit has special properties that make it particularly suitable for mobile applications: it allows very memory-efficient inference and can be implemented using standard operations present in all neural frameworks.

For the ImageNet dataset, our architecture improves the state of the art over a wide range of performance points.

For the object detection task, our network outperforms the best real-time detector models on the COCO dataset in terms of both accuracy and model complexity. Notably, when our model is combined with the SSDLite detection module, it requires more than 20 times less computation and more than 10 times fewer parameters than YOLOv2.

On the theoretical side, the proposed convolutional block has a unique property that allows the expressiveness of the network (encoded by the expansion layers) to be separated from its capacity (encoded by the bottleneck inputs). Exploring this is an important direction for future research.

Original: https://www.cnblogs.com/wj-1314/p/14077776.html
Author: 战争热诚
Title: 深度学习论文翻译解析-MobileNetV2: Inverted Residuals and Linear Bottlenecks
