Deep Learning Paper Translation and Analysis: Searching for MobileNetV3

Paper title: Searching for MobileNetV3

Paper authors: Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, Hartwig Adam

Paper link: https://arxiv.org/abs/1905.02244.pdf

Reference MobileNets translation blogs: https://blog.csdn.net/Chunfengyanyulove/article/details/91358187 (https://blog.csdn.net/thisiszdy/article/details/90167304)

Disclaimer: this translation is provided for study purposes only. If there is any infringement, please contact the editor to delete the blog post. Thank you!

The editor is a beginner in machine learning and intends to study the paper carefully, but my English is limited, so Google Translate was used and the result was checked sentence by sentence. There may still be obscure passages, grammatical errors or mistranslated technical terms; please forgive them and feel free to point them out.

If you need my other paper translations, please visit my GitHub:

https://github.com/LeBron-Jian/DeepLearningNote


The key techniques in MobileNet V3 are as follows:

  • 1. Network structure searched with MnasNet
  • 2. Depthwise separable convolutions from V1
  • 3. Inverted residuals with linear bottlenecks from V2
  • 4. Squeeze-and-Excitation (SE) modules
  • 5. A new activation function, h-swish(x)
  • 6. Two strategies in the network search: resource-constrained NAS and NetAdapt
  • 7. A redesigned final stage of V2 to reduce computation

Personally, I do not find it as striking as V1 and V2 (nothing revolutionary), but it does optimize every part of V2, combines the latest techniques, introduces a new activation function, and stacks up a collection of tricks for speed.


Abstract

We present the next generation of MobileNets based on complementary search techniques and a novel architecture design. MobileNetV3 uses complementary methods, combining hardware-aware network architecture search (NAS) with the NetAdapt algorithm, to advance the overall state of mobile network design. Through this process we create two new MobileNet models: MobileNetV3-Large and MobileNetV3-Small, targeted at high- and low-resource use cases respectively. These models are then applied to object detection and semantic segmentation. For semantic segmentation (or any dense pixel prediction) task, we propose a new efficient segmentation decoder, Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state-of-the-art results for mobile classification, detection and segmentation. Compared with MobileNetV2, MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 20%. MobileNetV3-Small is 6.6% more accurate than MobileNetV2 with 5% lower latency. MobileNetV3-Large detection is 25% faster than MobileNetV2 at roughly the same accuracy on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP, and on Cityscapes segmentation MobileNetV3-Large LR-ASPP is 34% faster than MobileNetV2 R-ASPP.


Figure 1. The trade-off between Pixel 1 latency and top-1 ImageNet accuracy. All models use input resolution 224. V3-Large and V3-Small use multipliers 0.75, 1 and 1.25 to show the optimal frontier. All latencies were measured on a single large core of the same device using TFLite [1]. MobileNetV3-Small and MobileNetV3-Large are our proposed next-generation mobile models.


Figure 2. The trade-off between MAdds and top-1 accuracy. This allows comparison of models targeting different hardware or software frameworks. All MobileNetV3 models use input resolution 224 and multipliers 0.35, 0.5, 0.75, 1 and 1.25. See Section 6 for other resolutions. Best viewed in color.


1. Introduction

Efficient neural networks are becoming ubiquitous in mobile applications, enabling entirely new on-device experiences. They are also a key enabler of personal privacy, allowing users to gain the benefits of neural networks without needing to send their data to a server for evaluation. Advances in neural network efficiency not only improve the user experience through higher accuracy and lower latency, but also help preserve battery life through reduced power consumption.

This paper describes the approach we took to develop the large and small MobileNetV3 models in order to deliver the next generation of high-accuracy, efficient neural network models to power on-device computer vision. The new networks push the state of the art forward and demonstrate how to blend automated search with novel architecture advances to build effective models.

The goal of this paper is to develop the best possible mobile computer vision architectures that optimize the accuracy-latency trade-off on mobile devices. To accomplish this we introduce (1) complementary search techniques, (2) new efficient versions of nonlinearities practical for the mobile setting, (3) new efficient network design, and (4) a new efficient segmentation decoder. We present thorough experiments demonstrating the effectiveness and value of each technique across a wide range of use cases and mobile phones.

The paper is organized as follows. We start with a discussion of related work in Section 2. Section 3 reviews the efficient building blocks used for mobile models. Section 4 reviews architecture search and the complementary nature of the MnasNet and NetAdapt algorithms. Section 5 describes the novel architecture design that improves the efficiency of the models found through the joint search. Section 6 presents extensive classification, detection and segmentation experiments to demonstrate effectiveness and to show the contributions of the different elements. Section 7 contains conclusions and future work.


2. Related Work

Designing deep neural network architectures for the optimal trade-off between accuracy and efficiency has been an active research area in recent years. Both novel hand-crafted structures and algorithmic neural architecture search have played important roles in advancing this field.

SqueezeNet [22] makes extensive use of 1×1 convolutions with squeeze and expand modules, focusing mainly on reducing the number of parameters. More recent work shifts the focus from reducing parameters to reducing the number of operations (MAdds) and the actual measured latency. MobileNetV1 [19] employs depthwise separable convolutions to substantially improve computational efficiency. MobileNetV2 [39] expands on this by introducing a resource-efficient block with inverted residuals and linear bottlenecks. ShuffleNet [49] utilizes group convolutions and channel-shuffle operations to further reduce MAdds. CondenseNet [21] learns group convolutions at the training stage to keep useful dense connections between layers for feature reuse. ShiftNet [46] proposes the shift operation interleaved with pointwise convolutions to replace expensive spatial convolutions.

To automate the architecture design process, reinforcement learning (RL) was introduced to search for efficient architectures with competitive accuracy [53, 54, 3, 27, 35]. A fully configurable search space can grow exponentially large and become intractable. As a result, early architecture search work focused on cell-level structure search, reusing the same cell in all layers. Recently, [43] explored a block-level hierarchical search space that allows different layer structures at different resolution blocks of a network. To reduce the computational cost of search, differentiable architecture search frameworks are used in [28, 5, 45] with gradient-based optimization. To adapt existing networks to constrained mobile platforms, [48, 15, 12] propose more efficient automated network simplification algorithms.

Quantization [23, 25, 37, 41, 51, 52, 37] is another important complementary effort for improving network efficiency by reducing the precision of the arithmetic. Finally, knowledge distillation [4, 17] offers an additional complementary method to generate small, accurate "student" networks with the guidance of a large "teacher" network.

To summarize the above, these are the commonly used approaches for reducing the computational cost of a network:

  • Lightweight network design: e.g. the MobileNet series, the ShuffleNet series, Xception, etc., which use group convolutions, 1×1 convolutions and similar techniques to reduce the amount of computation while preserving accuracy as much as possible.
  • Model pruning: large networks often contain a certain degree of redundancy, and removing the redundant parts reduces the amount of computation.
  • Quantization: e.g. TensorRT quantization, which usually yields a several-fold speed-up on GPU.
  • Knowledge distillation: use a large model (the teacher) to help a small model (the student) learn, improving the accuracy of the student model.

The MobileNet series is, of course, a typical example of the first approach.


3. Efficient Mobile Building Blocks

Mobile models have been built on increasingly efficient building blocks. MobileNetV1 [17] introduced depthwise separable convolutions as an efficient replacement for traditional convolution layers. Depthwise separable convolutions effectively factorize traditional convolution by separating spatial filtering from the feature generation mechanism. They are defined by two separate layers: a lightweight depthwise convolution for spatial filtering and a heavier 1×1 pointwise convolution for feature generation.
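To make the factorization concrete, here is a minimal PyTorch-style sketch of a depthwise separable convolution block. This is an illustrative assumption, not the authors' TensorFlow implementation; the BatchNorm and ReLU placement simply follows common MobileNetV1-style practice.

```python
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch, stride=1):
    """Depthwise 3x3 spatial filtering followed by 1x1 pointwise feature mixing."""
    return nn.Sequential(
        # depthwise: one 3x3 filter per input channel (groups=in_ch)
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # pointwise: 1x1 convolution mixes channels to generate new features
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

block = depthwise_separable_conv(32, 64, stride=2)  # example usage
```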

MobileNetV2 [37] introduced the linear bottleneck and inverted residual structure in order to make layer structures even more efficient by leveraging the low-rank nature of the problem. This structure, shown in Figure 3, is defined by a 1×1 expansion convolution followed by a depthwise convolution and a 1×1 projection layer. The input and output are connected by a residual connection if and only if they have the same number of channels. This structure maintains a compact representation at the input and output while internally expanding to a higher-dimensional feature space, increasing the expressiveness of the nonlinear per-channel transformations.
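The following is a minimal PyTorch-style sketch of such an inverted residual block with a linear bottleneck; it is an assumption for illustration rather than the paper's code, and the ReLU6 activations and the expand_ratio parameter follow the usual MobileNetV2 convention.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expansion -> 3x3 depthwise -> 1x1 linear projection (no activation)."""
    def __init__(self, in_ch, out_ch, stride, expand_ratio):
        super().__init__()
        hidden = in_ch * expand_ratio
        # residual connection only when spatial size and channel count match
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),            # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),               # depthwise filtering
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),           # linear projection
            nn.BatchNorm2d(out_ch),                             # no nonlinearity here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```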

MnasNet [43] builds upon the MobileNetV2 structure by introducing lightweight attention modules based on squeeze-and-excitation into the bottleneck structure. Note that the squeeze-and-excitation module is integrated in a different location than the ResNet-based module proposed in [20]: it is placed after the depthwise filters in the expansion, so that attention is applied on the largest representation, as shown in Figure 4.
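Below is a minimal sketch of a squeeze-and-excitation module of the kind described above; the reduction ratio of 4 and the plain sigmoid gate are illustrative assumptions (MobileNetV3 later replaces the sigmoid with a hard sigmoid, see Section 5.2).

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)   # squeeze: global average pooling
        s = torch.relu(self.fc1(s))
        s = torch.sigmoid(self.fc2(s))         # excitation: per-channel gate in [0, 1]
        return x * s                           # reweight the feature map channels
```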

For MobileNetV3, we use a combination of these layers as building blocks in order to build the most effective models. Layers are also upgraded with a modified swish nonlinearity [34]. Both squeeze-and-excitation and the swish nonlinearity rely on the sigmoid, which is inefficient to compute and hard to keep accurate in fixed-point arithmetic, so we replace it with the hard sigmoid, as discussed in Section 5.2.


4. Network Search

Network search has proven to be a very powerful tool for discovering and optimizing network architectures. For MobileNetV3 we use platform-aware NAS to search for the global network structure by optimizing each network block. We then use the NetAdapt algorithm to search for the number of filters per layer. These techniques are complementary and can be combined to efficiently find optimized models for a given hardware platform.

4.1 Platform-Aware NAS for Block-wise Search

Similar to [43], we employ a platform-aware neural architecture approach to find the global network structure. Since we use the same RNN-based controller and the same factorized hierarchical search space, we find results similar to [43] for large mobile models with a target latency around 80 ms. Therefore, we simply reuse the same MnasNet-A1 [43] as our initial large mobile model, and then apply NetAdapt [48] and other optimizations on top of it.

However, we observe that the original reward design is not optimized for small mobile models. Specifically, it uses a multi-objective reward ACC(m) × [LAT(m)/TAR]^w to approximate Pareto-optimal solutions, balancing model accuracy ACC(m) and latency LAT(m) for each model m based on the target latency TAR. We observe that accuracy changes much more dramatically with latency for small models; therefore, we need a smaller weight factor w = -0.15 (vs. the original w = -0.07) to compensate for the larger accuracy change at different latencies. Enhanced with this new weight factor w, we start a new architecture search from scratch to find the initial seed model and then apply NetAdapt and other optimizations to obtain the final MobileNetV3-Small model.
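For concreteness, here is a small sketch of the multi-objective reward described above; the function name and the example accuracy/latency numbers are purely illustrative assumptions, not values from the paper.

```python
def nas_reward(acc, latency_ms, target_ms, w=-0.07):
    """Approximate Pareto-optimal reward: ACC(m) * [LAT(m) / TAR]^w."""
    return acc * (latency_ms / target_ms) ** w

# The same hypothetical model scored with the original exponent (w = -0.07)
# and with the exponent used for MobileNetV3-Small (w = -0.15), which
# penalizes latency overshoot more strongly for small models.
print(nas_reward(0.675, 20.0, 15.0, w=-0.07))
print(nas_reward(0.675, 20.0, 15.0, w=-0.15))
```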


4.2 NetAdapt for Layer-wise Search

The second technique we employ in our architecture search is NetAdapt [48]. This approach is complementary to platform-aware NAS: it allows individual layers to be fine-tuned sequentially, rather than trying to infer a coarse but global architecture. Please refer to the original paper for full details. In short, the technique proceeds as follows:

1. Start with the seed network architecture found by platform-aware NAS.

2. For each step:

(a) Generate a set of new proposals. Each proposal represents a modification of the architecture that reduces latency by at least δ compared to the previous step.

(b) For each proposal, use the pre-trained model from the previous step and populate the newly proposed architecture, truncating and randomly initializing missing weights as appropriate. Fine-tune each proposal for T steps to obtain a coarse estimate of its accuracy.

(c) Select the best proposal according to some metric.

3. Repeat the previous step until the target latency is reached.


In [48], the metric was to minimize the accuracy change. We modify this algorithm and instead minimize the ratio between the latency change and the accuracy change. That is, among all proposals generated during each NetAdapt step, we pick the one that maximizes ΔAcc / |Δlatency|, with Δlatency satisfying the constraint in 2(a). The intuition is that, because our proposals are discrete, we prefer proposals that maximize the slope of the trade-off curve.

This process is repeated until the latency reaches its target, and then the new architecture is retrained from scratch. We use the same proposal generator as was used for MobileNetV2 in [46]. Specifically, we allow the following two types of proposals:

1. Reduce the size of any expansion layer;

2. Reduce the bottleneck in all blocks that share the same bottleneck size, in order to preserve the residual connections.

In our experiments we used T = 10000 and found that, while it increases the accuracy of the initial fine-tuning of the proposals, it does not generally change the final accuracy when trained from scratch. We set δ = 0.01 · |L|, where L is the latency of the seed model.
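Putting the steps and the modified selection metric together, the following is a simplified, framework-agnostic Python sketch of the NetAdapt loop as described above. All the callables (generate_proposals, finetune, evaluate, measure_latency) are hypothetical placeholders passed in by the caller, not an existing API.

```python
def netadapt_search(seed_model, target_latency, delta, generate_proposals,
                    finetune, evaluate, measure_latency):
    """Shrink a seed model until it meets target_latency, greedily picking the
    proposal with the best accuracy-change / latency-change ratio each step."""
    model = seed_model
    lat = measure_latency(model)
    acc = evaluate(model)
    while lat > target_latency:
        best = None
        for proposal in generate_proposals(model):   # e.g. shrink an expansion layer
            p_lat = measure_latency(proposal)
            if p_lat > lat - delta:                  # must cut latency by at least delta
                continue
            finetune(proposal, steps=10000)          # T = 10000 fine-tuning steps
            p_acc = evaluate(proposal)
            score = (p_acc - acc) / (lat - p_lat)    # maximize delta-acc / delta-latency
            if best is None or score > best[0]:
                best = (score, proposal, p_acc, p_lat)
        if best is None:                             # no proposal can reduce latency further
            break
        _, model, acc, lat = best
    return model  # the chosen architecture is then retrained from scratch
```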


5. Network Improvements

In addition to network search, we introduce several new components to further improve the final models. We redesign the computationally expensive layers at the beginning and the end of the network. We also introduce a new nonlinearity, h-swish, a modified version of the recent swish nonlinearity that is faster to compute and more quantization-friendly.

5.1 Redesigning Expensive Layers

Once models are found through architecture search, we observe that some of the last layers, as well as some of the earlier layers, are more expensive than the others. We propose some modifications to the architecture to reduce the latency of these slow layers while maintaining accuracy. These modifications are outside the scope of the current search space.

The first modification reworks how the last few layers of the network interact in order to produce the final-layer features more efficiently. Current models, based on the MobileNetV2 inverted bottleneck structure and its variants, use a 1×1 convolution as the final layer in order to expand to a high-dimensional feature space. This layer is critically important for having rich features for prediction. However, it comes at the cost of extra latency.

To reduce latency and preserve the high-dimensional features, we move this layer past the final average pooling. The final set of features is now computed at 1×1 spatial resolution instead of 7×7 spatial resolution. The outcome of this design choice is that the computation of these features becomes nearly free in terms of computation and latency.

Once the cost of this feature generation layer is mitigated, the previous bottleneck projection layer is no longer needed to reduce computation. Based on this observation, we remove the projection and filtering layers in the previous bottleneck layer, further reducing computational complexity. The original and optimized last stages are shown in Figure 5. The efficient last stage reduces latency by 10 milliseconds, or 15% of the running time, and reduces the number of operations by 30 million MAdds, with almost no loss of accuracy. Section 6 contains detailed results.
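A minimal PyTorch-style sketch of this efficient last stage is shown below. This is an assumption about the structure rather than the released implementation; the channel sizes used here (160 → 960 → 1280 → 1000) are illustrative defaults, and the key point is that the second 1×1 expansion runs after global average pooling, on 1×1 instead of 7×7 feature maps, with the old bottleneck projection/filtering layers removed.

```python
import torch.nn as nn

def efficient_last_stage(in_ch=160, expand_ch=960, head_ch=1280, num_classes=1000):
    return nn.Sequential(
        nn.Conv2d(in_ch, expand_ch, 1, bias=False),  # last 1x1 expansion (still 7x7 maps)
        nn.BatchNorm2d(expand_ch),
        nn.Hardswish(),
        nn.AdaptiveAvgPool2d(1),                     # global average pooling -> 1x1 maps
        nn.Conv2d(expand_ch, head_ch, 1),            # high-dim features computed at 1x1
        nn.Hardswish(),
        nn.Conv2d(head_ch, num_classes, 1),          # classifier
        nn.Flatten(),
    )
```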

Another expensive layer is the initial set of filters. Current mobile models tend to use 32 filters in a full 3×3 convolution to build the initial filter bank for edge detection. Often these filters are mirror images of each other. We experimented with reducing the number of filters and using different nonlinearities to try to reduce the redundancy. We settled on using the hard-swish nonlinearity for this layer, as it performed as well as the other nonlinearities tested. We were able to reduce the number of filters to 16 while maintaining the same accuracy as 32 filters using either ReLU or swish. This saves an additional 3 milliseconds and 10 million MAdds.
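As a small illustration, the reduced stem described above might look like this in PyTorch; the stride-2 3×3 convolution and BatchNorm placement are assumptions following common practice, not the authors' code.

```python
import torch.nn as nn

stem = nn.Sequential(
    # 16 filters instead of the usual 32, with hard-swish instead of ReLU
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.Hardswish(),
)
```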


5.2 Nonlinearities

A nonlinearity called swish was introduced in [36, 13, 16]. When used as a drop-in replacement for ReLU, it significantly improves the accuracy of neural networks. The nonlinearity is defined as:

$$\text{swish}(x) = x \cdot \sigma(x)$$

While this nonlinearity improves accuracy, it comes at a non-zero cost in embedded environments, since the sigmoid function is much more expensive to compute on mobile devices. We deal with this problem in two ways.

1. We replace the sigmoid function with its piecewise linear hard analog, ReLU6(x + 3)/6, similar to [11, 44]. The minor difference is that we use ReLU6 rather than a custom clipping constant. Similarly, the hard version of swish becomes (a code sketch is given at the end of this subsection):

$$\text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$$

A similar version of hard-swish was also recently proposed in [2]. Figure 6 shows a comparison of the soft and hard versions of the sigmoid and swish nonlinearities. Our choice of constants was motivated by simplicity and by being a good match to the original smooth version. In our experiments we found that the hard versions of all these functions show no discernible difference in accuracy, but from a deployment perspective they have several advantages. First, optimized implementations of ReLU6 are available in virtually all software and hardware frameworks. Second, in quantized mode, the hard versions eliminate the potential numerical precision loss caused by different implementations of the approximate sigmoid. Finally, even an optimized quantized sigmoid implementation is much slower than its ReLU counterpart. In our experiments, replacing h-swish with swish in quantized mode increased inference latency by 15%.

2. The cost of applying a nonlinearity decreases as we go deeper into the network, because each layer's activation memory typically halves every time the resolution drops. Incidentally, we find that most of the benefits of swish are realized by using it only in the deeper layers. Thus, in our architectures we only use h-swish in the second half of the model. We refer to Tables 1 and 2 for the precise layout.

Even with these optimizations, h-swish still introduces some latency cost. However, as we demonstrate in Section 6, the net effect on accuracy and latency is positive even without an optimized implementation, and substantial when an optimized implementation based on a piecewise function is used.
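To make the two nonlinearities above concrete, here is a minimal PyTorch sketch of swish and its hard counterpart built from ReLU6. Recent PyTorch versions also ship nn.Hardswish; the standalone functions below are just for illustration.

```python
import torch
import torch.nn.functional as F

def swish(x):
    # swish(x) = x * sigmoid(x)
    return x * torch.sigmoid(x)

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, a piecewise-linear approximation of swish
    return x * F.relu6(x + 3.0) / 6.0

def h_sigmoid(x):
    # hard sigmoid used e.g. inside squeeze-and-excite: ReLU6(x + 3) / 6
    return F.relu6(x + 3.0) / 6.0

x = torch.linspace(-6, 6, steps=7)
print(swish(x))
print(h_swish(x))
```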


5.3 Large Squeeze-and-Excite

In [43], the size of the squeeze-and-excite bottleneck was relative to the size of the convolutional bottleneck. Instead, we fix them all relative to the number of channels in the expansion layer. We find that doing so increases accuracy, at a modest increase in the number of parameters, and with no discernible latency cost.


5.4 MobileNetV3 Definitions

MobileNetV3 is defined as two models: MobileNetV3-Large and MobileNetV3-Small, targeted at high- and low-resource use cases respectively. The models are created by applying platform-aware NAS and NetAdapt for network search and incorporating the network improvements defined in this section. The full specifications of our networks are given in Tables 1 and 2.


6. Experiments

We present experimental results demonstrating the effectiveness of the new MobileNetV3 models. We report results on classification, detection and segmentation. We also report various ablation studies to shed light on the effects of the different design decisions.

6.1 Classification

As has become standard, we use ImageNet [38] for all our classification experiments and compare accuracy against various measures of resource usage, such as latency and multiply-adds (MAdds).

For the classification experiments, Google, with its abundant hardware, trained on 16 TPU chips with a batch size of 4096, and the authors then measured latency on Google Pixel phones.

6.1.1 Training Setup

We train our models using synchronous training on a 4x4 TPU Pod [24] with the standard TensorFlow RMSProp optimizer and 0.9 momentum. We use an initial learning rate of 0.1, a batch size of 4096 (128 images per chip), and a learning-rate decay rate of 0.01 every 3 epochs. We use dropout of 0.8 and L2 weight decay of 1e-5, with the same image preprocessing as Inception [40]. Finally, we use an exponential moving average of the weights with decay 0.9999. All our convolutional layers use batch normalization with an average decay of 0.99.
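For reference, a rough PyTorch translation of these hyperparameters might look like the sketch below. The paper used TensorFlow's RMSProp optimizer on TPUs, so this is only an assumption-laden illustration; in particular, reading "decay rate of 0.01 every 3 epochs" as multiplying the learning rate by 0.99 every 3 epochs is itself an assumption.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder module standing in for a MobileNetV3 network

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.1,             # initial learning rate
    momentum=0.9,       # RMSProp momentum as stated above
    weight_decay=1e-5,  # L2 weight decay
)

# One possible reading of "learning-rate decay rate of 0.01 every 3 epochs":
# multiply the learning rate by (1 - 0.01) every 3 epochs (assumption).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.99)

# The remaining stated settings (batch size 4096, dropout 0.8, weight EMA with
# decay 0.9999, batch-norm averaging decay 0.99) would live in the data loader,
# the model definition, and a separate EMA helper respectively.
```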

6.1.2 Measurement Setup

To measure latency, we use standard Google Pixel phones and run all networks through the standard TFLite benchmark tool. We use single-threaded large cores in all our measurements. We do not report multi-core inference time, since we find this setup not very practical for mobile applications.


From the authors' ImageNet results, the accuracy of V3-Large is about 3 points higher than that of V2 1.0, while the latency drops from 64 ms to 51 ms (Pixel 1 phone). Compared with V2 0.35, the accuracy of V3-Small improves by about 7 points while the latency also improves slightly, from 16.5 ms to 15.8 ms (Pixel 1 phone).


6.2 Results

As shown in Figure 1, our models outperform the current state of the art, such as MnasNet, ProxylessNAS and MobileNetV2. We report floating-point performance on different Pixel phones in Table 3, and include quantized results in Table 4.

In Figure 7 we show the MobileNetV3 performance trade-off as a function of multiplier and resolution. Note that MobileNetV3-Small outperforms MobileNetV3-Large with its multiplier scaled to match the performance by nearly 3%. On the other hand, resolution provides an even better trade-off than the multiplier. However, it should be noted that resolution is often determined by the problem (for example, segmentation and detection problems generally require higher resolution), and so it cannot always be used as a tunable parameter.

Impact of the nonlinearities. In Table 5 we study where to insert h-swish and the improvement of an optimized implementation over a naive one. Using an optimized implementation of h-swish saves 6 ms (more than 10% of the runtime), and optimized h-swish adds only about 1 ms compared with traditional ReLU.

Figure 8 shows the efficient frontier based on the choice of nonlinearity and network width. MobileNetV3 uses h-swish in the middle of the network and clearly dominates ReLU. Interestingly, adding h-swish to the entire network is slightly better than the interpolated frontier obtained by widening the network.

Impact of the other components. In Figure 9 we show how the introduction of the different components moves the results along the latency/accuracy curve.


The quantized results compare the time spent on different Google phones after model quantization, where P-1, P-2 and P-3 denote phones of different performance levels. The comparison here is for the V3-Large network. After quantization, the top-1 accuracy improves from 70.9 (V2) to 73.8 (V3-Large). In terms of speed-up, V3 is about 8 ms faster on P-1, 6 ms faster on P-2 and 5 ms faster on P-3 compared with the V2 network. However, the speed-up for the quantized V3-Small is not as large.


The authors also compare the accuracy obtained with different input resolutions and different width multipliers. The resolutions are 96, 128, 160, 192, 224 and 256, and the width multipliers are the original 0.35, 0.5, 0.75, 1.0 and 1.25. It can be seen that resolution actually offers a better balance between accuracy and speed: at the same speed, scaling the resolution rather than the width of the model yields higher accuracy.


The paper's ablation figure shows the impact of the individual components of MobileNetV3 as they are introduced step by step.


6.3 Detection

We use MobileNetV3 as a drop-in replacement for the backbone feature extractor in SSDLite and compare it with other backbone networks on the COCO dataset.

Following MobileNetV2, we attach the first layer of SSDLite to the last feature extractor layer that has an output stride of 16, and the second layer of SSDLite to the last feature extractor layer that has an output stride of 32. Following the detection literature, we refer to these two feature extraction layers as C4 and C5, respectively. For MobileNetV3-Large, C4 is the expansion layer of the 13th bottleneck block. For MobileNetV3-Small, C4 is the expansion layer of the 9th bottleneck block. For both networks, C5 is the layer immediately before pooling.

We additionally reduce the channel counts of all feature layers between C4 and C5 by a factor of 2. This is because the last few layers of MobileNetV3 are tuned to output 1000 classes, which may be redundant when transferring to COCO with 90 classes.

The results on the COCO test set are given in Table 6. With the channel reduction, MobileNetV3-Large is 25% faster than MobileNetV2 at virtually identical mAP. At the same latency, MobileNetV3 is 2.4 and 0.5 mAP higher than MobileNetV2 and MnasNet, respectively. For both MobileNetV3 models, the channel-reduction trick reduces latency by roughly 15% with no mAP loss, suggesting that ImageNet classification and COCO object detection may prefer different feature extractor shapes.


6.4 Semantic Segmentation

In this section, we use MobileNetV2 and the proposed MobileNetV3 as network backbones for mobile semantic segmentation. Additionally, we compare two segmentation heads. The first, referred to as R-ASPP, was proposed in [39]; it is a reduced design of the Atrous Spatial Pyramid Pooling module, adopting only two branches consisting of a 1×1 convolution and a global average pooling operation. In this paper we propose another lightweight segmentation head, referred to as Lite R-ASPP (or LR-ASPP). As shown in Figure 10, Lite R-ASPP improves over R-ASPP by deploying the global average pooling in a fashion similar to the squeeze-and-excitation module, in which we employ a large pooling kernel with a large stride (to save some computation) and only one 1×1 convolution in the module. We apply atrous convolution to the last block of MobileNetV3 to extract denser features, and further add a skip connection from low-level features to capture more detailed information.
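A minimal PyTorch-style sketch of such an LR-ASPP head is given below. This is an assumption-level simplification: the global average pooling here stands in for the paper's large-kernel, large-stride pooling, and the channel counts are illustrative rather than the paper's values.

```python
import torch.nn as nn
import torch.nn.functional as F

class LiteRASPP(nn.Module):
    """Simplified LR-ASPP head: a 1x1 conv branch gated, SE-style, by a pooled
    branch with a sigmoid, plus a skip connection from low-level features."""
    def __init__(self, high_ch, low_ch, num_classes, inter_ch=128):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(high_ch, inter_ch, 1, bias=False),
            nn.BatchNorm2d(inter_ch),
            nn.ReLU(inplace=True),
        )
        self.gate = nn.Sequential(               # SE-like global context gate
            nn.AdaptiveAvgPool2d(1),             # stand-in for large-kernel, large-stride pooling
            nn.Conv2d(high_ch, inter_ch, 1),
            nn.Sigmoid(),
        )
        self.high_cls = nn.Conv2d(inter_ch, num_classes, 1)
        self.low_cls = nn.Conv2d(low_ch, num_classes, 1)  # skip from low-level features

    def forward(self, high, low):
        x = self.branch(high) * self.gate(high)  # gate broadcasts over the spatial dims
        x = F.interpolate(x, size=low.shape[-2:], mode='bilinear', align_corners=False)
        return self.high_cls(x) + self.low_cls(low)
```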

We experiment on the Cityscapes dataset with the mIOU metric, using only the 'fine' annotations. We adopt the same training protocol as [8, 39]. All our models are trained from scratch without pretraining on ImageNet [36], and are evaluated with a single-scale input. Similar to object detection, we observe that we can reduce the channels in the last block of the network backbone by a factor of 2 without significantly degrading performance. We believe this is because the backbone is designed for 1000-class ImageNet classification while Cityscapes has only 19 classes, meaning there is some channel redundancy in the backbone.


7. Conclusions and Future Work

In this paper we introduced MobileNetV3-Large and MobileNetV3-Small, which achieve new state-of-the-art results in mobile classification, detection and segmentation. We described our efforts to harness multiple types of network architecture search as well as advances in network design to deliver the next generation of mobile models. We also showed how to adapt nonlinearities such as swish and how to apply squeeze-and-excite in a quantization-friendly and efficient manner, introducing them into the mobile model domain as effective tools. In addition, we introduced a new lightweight segmentation decoder called LR-ASPP. While it remains an open question how best to blend automatic search techniques with human intuition, we are pleased to present these first positive results and will continue to refine the methods in future work.

Original: https://www.cnblogs.com/wj-1314/p/12108424.html
Author: 战争热诚
Title: 深度学习论文翻译解析-Searching for MobileNetV3
