Deep Learning Paper Translation and Analysis: YOLOv4: Optimal Speed and Accuracy of Object Detection

Paper title: YOLOv4: Optimal Speed and Accuracy of Object Detection

Authors: Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao

Paper link: https://arxiv.org/abs/2004.10934.pdf

Reference YOLOv4 translation blog: https://www.machunjie.com/translate/695.html

YOLOv4 source code: https://github.com/AlexeyAB/darknet

Disclaimer: this translation is for study purposes only. If there is any infringement, please contact the editor to delete this blog post. Thank you!

The editor is a beginner in machine learning and intends to study the paper carefully, but my English is limited, so Google Translate was used and the text was then checked sentence by sentence. Some obscure spots may remain, such as grammatical or terminology errors; please forgive them and feel free to point them out.

If you need my other paper translations, please visit my GitHub: https://github.com/LeBron-Jian/DeepLearningNote

The YOLOv4 algorithm builds on the original YOLO detection architecture and adopts the best optimization strategies from the CNN field in recent years, with improvements of varying degrees in data processing, the backbone network, network training, activation functions, loss functions, and more. Although it offers no theoretical innovation, it appeals to many engineers because of the breadth of optimization techniques it tries. The paper reads like a survey of object-detection tricks, and the result is a new baseline that balances FPS and precision.

The main contributions of this paper are as follows:

  • 1. An efficient and powerful model is developed, so that anyone can use a single 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.
  • 2. The influence of a series of state-of-the-art object-detector training methods is verified.
  • 3. State-of-the-art methods, including CBN, PAN, and SAM, are modified to be more effective and better suited to single-GPU training.

The authors divide the training methods into two categories:

  • 1. Bag of freebies: methods that only change the training strategy or only increase the training cost, such as data augmentation.
  • 2. Bag of specials: plugin modules and post-processing methods that add only a small inference cost but can greatly improve detection accuracy.

Abstract

There are a huge number of features which are said to improve the accuracy of convolutional neural networks (CNNs). Practical testing of combinations of such features on large datasets, and theoretical justification of the results, is required. Some features operate on certain models exclusively, for certain problems exclusively, or only for small-scale datasets, while some features, such as batch normalization and residual connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial connections (CSP), Cross mini-Batch Normalization (CmBN), Self-Adversarial Training (SAT), and Mish activation. We use these new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of about 65 FPS on a Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.

1. Introduction

The majority of CNN-based object detectors are largely applicable only to recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow but accurate models, whereas car collision warning is handled by fast but less accurate models. Improving the accuracy of real-time object detectors enables using them not only for hint-generating recommendation systems, but also for stand-alone process management and the reduction of human input. Real-time object detection on conventional GPUs makes mass usage affordable. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with large mini-batch sizes. We address such problems by creating a CNN that operates in real time on a conventional GPU, and for which training requires only one conventional GPU.

The main goal of this work (YOLOv4) is to design a fast object detector for production environments and to optimize for parallel computation, rather than to chase a low theoretical computation indicator (BFLOP). At the same time, YOLOv4 is meant to be easy to train and use: anyone who trains and tests with a conventional GPU can achieve real-time, high-quality, convincing object-detection results, as the YOLOv4 results shown in Figure 1 illustrate. Our contributions are summarized as follows:

  • 1. We build a simple and efficient object-detection model that lowers the training barrier, so that anyone can use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.
  • 2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods during detector training.
  • 3. We modify state-of-the-art methods, including CBN [89], PAN [49], and SAM [85], to make them more efficient and suitable for single-GPU training, so that YOLOv4 can be trained on a single GPU.

A detection algorithm can be understood as: Object detector = backbone + neck + head

Backbone: the part that extracts image features. Since the shallow (low-level) features of images are fairly similar, e.g. edges, colors, and textures, this part can conveniently reuse well-designed and already trained networks such as VGG16/19, ResNet-50, ResNeXt-101, and DarkNet53, as well as some lightweight backbones (MobileNetV1/2/3, ShuffleNetV1/2).

Neck: this part is best understood as a feature-enhancement module. The backbone has already extracted some low-level features; the neck processes and enhances these low-level features so that the model learns exactly the features we want. Typical examples include SPP, ASPP (in DeepLabV3+), RFB, and SAM, as well as FPN, PAN, NAS-FPN, BiFPN, ASFF, and SFAM.

Head: the detection head, the most critical part of the algorithm, which outputs the desired result. For example, to obtain a heatmap (as in CenterNet), add deconvolution layers to upsample step by step. To output bounding boxes directly, attach convolutional layers, as in YOLO and SSD. For multi-task output (Mask R-CNN), emit three heads: classification, regression, and segmentation (the mask part).
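
To make this decomposition concrete, here is a toy PyTorch sketch (illustrative only; the layer shapes and the `TinyDetector` name are made up for this example and are not YOLOv4's actual modules):

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Minimal sketch of 'object detector = backbone + neck + head'."""
    def __init__(self, num_classes=80, num_anchors=3):
        super().__init__()
        # Backbone: extracts low-level image features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Neck: refines and enhances the backbone features.
        self.neck = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        # Head: predicts per-anchor box offsets, objectness, and class scores.
        self.head = nn.Conv2d(128, num_anchors * (4 + 1 + num_classes), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

out = TinyDetector()(torch.randn(1, 3, 416, 416))
print(out.shape)  # torch.Size([1, 255, 104, 104])
```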

2. Related Work

2.1 Object Detection Models

A modern object detector is usually composed of two parts: a backbone pre-trained on ImageNet and a head used to predict classes and bounding boxes. For detectors running on GPU platforms, the backbone can be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For detectors running on CPU platforms, the backbone can be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As for the head, it is usually divided into two categories: one-stage and two-stage object detectors. The most representative two-stage detectors are the R-CNN [19] series, including Fast R-CNN [18], Faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. A two-stage detector can also be made anchor-free, such as RepPoints [87]. For one-stage detectors, the most representative ones are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage detectors have been developed, such as CenterNet [13], CornerNet [37, 38], and FCOS [78]. Detectors developed in recent years often insert layers between the backbone and the head to collect feature maps from different stages; we can call this part the neck of the detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include FPN [44], PAN [49], BiFPN [77], and NAS-FPN [17].

In addition to the above models, some researchers build a new backbone (DetNet [43], DetNAS [7]) or a whole new model (SpineNet [12], HitDetector [20]) for object detection.

To sum up, an ordinary object detector is composed of an input, a backbone, a neck, and a head.

2.2 Bag of freebies

What is a bag of freebies? Literally, it is a free gift. In object detection, it means training the model with useful tricks so that it achieves better accuracy without increasing model complexity, i.e. without increasing the computational cost of inference. When a bag of freebies comes up in object detection, the first thing that comes to mind is data augmentation.

In this paper, BoF refers to techniques that improve accuracy without increasing inference time:

  • 1. Data augmentation methods: geometric image transformations, CutOut, GridMask, etc.
  • 2. Network regularization methods: Dropout, DropBlock, etc.
  • 3. Methods for handling class imbalance
  • 4. Hard example mining methods
  • 5. Loss function design

Usually, a conventional object detector is trained offline. Therefore, researchers like to take advantage of this and develop better training methods that give the detector better accuracy without increasing the inference cost. We call these methods, which only change the training strategy or only increase the training cost, a "bag of freebies." Data augmentation is often adopted in object detection and meets this definition. Its purpose is to increase the variability of the input images, so that the designed detection model is more robust to images obtained from different environments. For example, photometric distortion and geometric distortion are two commonly used data augmentation methods that definitely benefit the detection task. For photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotation.

The data augmentation methods mentioned above are pixel-wise adjustments, and all the original pixel information in the adjusted area is retained. In addition, some researchers working on data augmentation emphasize simulating object occlusion, with good results in image classification and object detection. For example, random erase [100] and CutOut [11] randomly select a rectangular region in the image and fill it with random values or zeros. Hide-and-Seek [69] and GridMask [6] randomly or evenly select multiple rectangular regions in the image and replace them with zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16]. In addition, some researchers have proposed combining multiple images for data augmentation. For example, MixUp [92] multiplies two images by different coefficient ratios and superimposes them, then adjusts the labels with these ratios. CutMix [91] covers a cropped image over a rectangular region of another image and adjusts the label according to the size of the mixed region. Besides the above methods, style transfer GANs [15] are also used for data augmentation, which can effectively reduce the texture bias learned by a CNN.

MixUp: in classification, MixUp adds two images with different weights, e.g. 0.1·A + 0.9·B = C, so C's label is [0.1 for A's class, 0.9 for B's class]. In object detection, the boxes of the two images are combined, so the labels gain boxes with different confidences.
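
A minimal NumPy sketch of MixUp for classification (the Beta-distributed mixing coefficient is the usual choice from the MixUp paper, not something this passage specifies):

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=0.2):
    """Blend two images and their one-hot labels with a Beta-sampled ratio."""
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    image = lam * img_a.astype(np.float32) + (1 - lam) * img_b.astype(np.float32)
    label = lam * label_a + (1 - lam) * label_b  # soft label, e.g. [0.1, 0.9]
    return image, label
```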

Style-transfer GAN: use a style-transfer GAN for data augmentation, e.g. apply style transfer to GTA5 data to expand the dataset when doing street-scene segmentation and detection, although it is used more for domain adaptation.

Different from the methods above, some other bag-of-freebies methods are designed to address the problem that the semantic distribution of the dataset may be biased. One very important such problem is data imbalance between different classes, which in two-stage detectors is often solved by hard negative example mining [72] or online hard example mining [67]. However, example mining is not applicable to one-stage detectors, because they belong to the dense-prediction architecture. Therefore, Lin et al. [45] proposed focal loss to deal with the data-imbalance problem between classes. Another very important issue is that it is difficult to express the degree of relationship between different categories with one-hot hard representations, a scheme that is often used when labeling. Label smoothing, proposed in [73], converts hard labels into soft labels for training, making the model more robust. To obtain better soft labels, Islam et al. [33] introduced the concept of knowledge distillation to design a label refinement network.
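
As a quick illustration of the label smoothing mentioned above, here is the standard uniform formulation (a sketch; eps = 0.1 is a common default, not a value taken from this paper):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Uniform label smoothing: y_soft = (1 - eps) * y_hard + eps / K."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

print(smooth_labels(np.eye(4)[0]))  # [0.925 0.025 0.025 0.025]
```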

The third bag of freebies concerns changes to the loss function. The commonly used loss for bounding-box regression is MSE, but losses computed on IoU are now used instead, such as GIoU loss. A classic related example is Mask Scoring R-CNN, a modification of Mask R-CNN. The logic is as follows: when selecting RoIs, if they are sorted and filtered by classification score, there is a problem: an RoI with high confidence is not necessarily accurate in its box location. The authors of that work then tried filtering RoIs by IoU and found that it works better.

The last bag of freebies is the objective function of bounding box (BBox) regression. The traditional object detector usually uses mean squared error (MSE) to perform regression directly on the center coordinates and the height and width of the BBox, i.e. {x_center, y_center, w, h}, or on the upper-left and lower-right points, i.e. {x_top_left, y_top_left, x_bottom_right, y_bottom_right}. Anchor-based methods estimate the corresponding offsets, e.g. {x_center_offset, y_center_offset, w_offset, h_offset} and {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}. However, directly estimating the coordinate values of each point of the BBox treats these points as independent variables and ignores the integrity of the object itself. To handle this better, some researchers recently proposed the IoU loss [90], which takes the coverage of the predicted BBox area and the ground-truth BBox area into account. The IoU loss computation triggers the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connects the generated results into a whole. Because IoU is a scale-invariant representation, it solves the problem that the L1 or L2 loss of {x, y, w, h} used by traditional methods increases with scale. Recently, researchers have continued to improve the IoU loss. For example, GIoU loss [65] includes the shape and orientation of the object in addition to the coverage area; the authors propose finding the smallest-area BBox that simultaneously covers the predicted BBox and the ground-truth BBox, and use this BBox as the denominator to replace the denominator originally used in the IoU loss. DIoU loss [99] additionally considers the distance between object centers, while CIoU loss [99] simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU achieves better convergence speed and accuracy on the BBox regression problem.
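
To make these losses concrete, here is a small sketch of IoU and GIoU for axis-aligned boxes in (x1, y1, x2, y2) format (my own illustrative code, not taken from the paper or from darknet; the loss forms are L_IoU = 1 - IoU and L_GIoU = 1 - GIoU):

```python
import torch

def iou_giou(pred, target, eps=1e-7):
    """pred, target: [N, 4] boxes as (x1, y1, x2, y2). Returns (IoU, GIoU)."""
    # Intersection rectangle.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest enclosing box C; GIoU penalizes the area of C not in the union.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c_area = cw * ch + eps
    return iou, iou - (c_area - union) / c_area
```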

2.3 Bag of specials

Bag of specials refers to plugin modules (e.g. feature-enhancement modules, or some post-processing) that add only a small computational cost but can effectively improve detection accuracy. These plugin modules can enhance certain attributes of a network, e.g. enlarging the receptive field (modules such as ASFF, ASPP, RFB), introducing attention mechanisms (two kinds: spatial attention and channel attention), or strengthening feature-integration capability (FPN, ASFF, BiFPN). Post-processing applies algorithms to screen the results predicted by the model.

BoS refers to methods that slightly increase the inference cost but can improve model accuracy:

  • 1. Modules that enlarge the receptive field: SPP, ASPP, RFB, etc.
  • 2. Attention mechanisms: Squeeze-and-Excitation (SE), Spatial Attention Module (SAM), etc.
  • 3. Feature-integration methods: SFAM, ASFF, BiFPN, etc.
  • 4. Improved activation functions: Swish, Mish, etc.
  • 5. Post-processing methods: Soft-NMS, DIoU-NMS, etc.

For those plugin modules and post-processing methods that increase the inference cost only by a small amount but can significantly improve detection accuracy, we use the name "bag of specials." Generally speaking, these plugin modules enhance certain attributes in a model, such as enlarging the receptive field, introducing attention mechanisms, or strengthening feature-integration capability, while post-processing is a method for screening the model's prediction results.

Enlarging the receptive field

Common modules that can be used to enlarge the receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], whose original method was to split a feature map into several d×d equal blocks, where d can be {1, 2, 3, ...}, forming a spatial pyramid, and then extract bag-of-words features. SPP integrates SPM into CNNs and uses max-pooling instead of the bag-of-words operation. Since the SPP module proposed by He et al. [25] outputs a one-dimensional feature vector, it cannot be applied in fully convolutional networks (FCN). Therefore, in the design of YOLOv3 [63], Redmon and Farhadi improved the SPP module to the concatenation of max-pooling outputs with k×k kernels, where k = {1, 5, 9, 13}, and stride equal to 1. Under this design, a relatively large k×k max-pooling effectively enlarges the receptive field of the backbone feature. After adding the improved SPP module, YOLOv3-608 improves AP50 by 2.7% on MS COCO at the cost of 0.5% extra computation. The operational difference between the ASPP [5] module and the improved SPP module is mainly replacing the original stride-1 k×k max-poolings with several 3×3 dilated convolutions with dilation rate k and stride 1. The RFB module uses several k×k dilated convolutions with dilation rate k and stride 1 to obtain more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time but improves the AP50 of SSD on MS COCO by 5.7%.
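
Here is a PyTorch sketch of the YOLOv3-style SPP block described above, i.e. concatenated stride-1 max-poolings with k = {1, 5, 9, 13} (the channel count in the usage line is illustrative):

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """YOLOv3-style SPP: concatenate stride-1 max-poolings with k = 1, 5, 9, 13."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # padding = k // 2 keeps the spatial size unchanged at stride 1.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # k = 1 is the identity, so the input itself is the first branch.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

feat = torch.randn(1, 512, 13, 13)
print(SPPBlock()(feat).shape)  # torch.Size([1, 2048, 13, 13])
```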

Introducing attention mechanisms

Attention modules used in object detection are usually divided into channel-wise attention and point-wise (spatial) attention; the representative models are SE [29] and SAM (Spatial Attention Module) [85]. Although the SE module can improve the top-1 accuracy of ResNet50 on the ImageNet classification task by 1% at the cost of only a 2% increase in computation, on a GPU it usually increases inference time by about 10%, so it is more suitable for mobile devices. For SAM, it only needs 0.1% extra computation to improve the top-1 accuracy of ResNet50-SE by 0.5%, and it does not affect inference speed on a GPU at all.
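
A minimal sketch of a CBAM-style spatial attention module (SAM); the 7×7 convolution over channel-wise max and mean maps follows the common formulation of the SAM paper rather than anything stated in this passage:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style SAM: a spatial mask built from channel-wise max and mean maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool across channels -> two [N, 1, H, W] descriptor maps.
        max_map, _ = x.max(dim=1, keepdim=True)
        mean_map = x.mean(dim=1, keepdim=True)
        mask = torch.sigmoid(self.conv(torch.cat([max_map, mean_map], dim=1)))
        return x * mask  # re-weight every spatial position
```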

Feature fusion and feature integration

In terms of feature fusion, the early practices were skip connections [51] or hyper-columns [22] to integrate low-level and high-level features. As multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed, including SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use SE modules to perform channel-wise re-weighting on multi-scale concatenated feature maps. ASFF uses softmax for point-wise re-weighting and then adds feature maps of different scales. BiFPN proposes multi-input weighted residual connections to perform scale-wise re-weighting, and then adds feature maps of different scales.

Activation functions

In deep learning research, some people focus on finding good activation functions. A good activation function allows gradients to propagate more efficiently without incurring too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU, which substantially solves the vanishing-gradient problem frequently encountered with tanh and sigmoid. It was followed by LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], which also address the gradient problem. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. ReLU6 and hard-Swish are designed specifically for quantized networks. For self-normalizing neural networks, the SELU activation was proposed to satisfy that goal. Note that Swish and Mish are both continuously differentiable activation functions.
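
For reference, Mish has the closed form mish(x) = x · tanh(softplus(x)); a one-line sketch:

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish activation: smooth, non-monotonic, and continuously differentiable."""
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-2.0, 0.0, 2.0])))  # tensor([-0.2525, 0.0000, 1.9440])
```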

Post-processing

The post-processing method most commonly used in deep-learning-based object detection is NMS, which filters badly predicted BBoxes and keeps only the candidate BBoxes with higher responses. The way NMS has been improved is consistent with how objective functions are optimized. The original NMS did not consider context information, so Girshick et al. [19] added the classification confidence score as a reference in R-CNN and performed greedy NMS in order from high score to low score. Soft-NMS [1] considers the problem that object occlusion may cause the degradation of confidence scores in greedy NMS with IoU scores. The idea behind DIoU-NMS [99] is to add center-point distance information to the BBox screening process on the basis of soft-NMS. It is worth mentioning that since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
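
A compact NumPy sketch of plain greedy NMS as described above (DIoU-NMS would additionally subtract a normalized center-point-distance term from the IoU used in the threshold test):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: [N, 4] as (x1, y1, x2, y2); returns kept indices."""
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]  # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box against the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (area[i] + area[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```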

3. Methodology

The basic goal is fast operating speed of the neural network in production systems and optimization for parallel computations, rather than a low theoretical computation indicator (BFLOPS). We present two options for real-time neural networks:

  • For GPUs, we use a small number of groups (1-8) in convolutional layers: CSPResNeXt50 / CSPDarknet53
  • For VPUs, we use grouped convolution but refrain from using Squeeze-and-Excitation (SE) blocks; specifically, this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3

CSPDarknet53 contains 29 convolutional 3×3 layers, a 725×725 receptive field, and 27.6M parameters.

3.1 Selection of Architecture

Our objective is to find the optimal balance among the input network resolution, the number of convolutional layers, the number of parameters (filter_size² × filters × channels / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that CSPResNeXt50 is considerably better than CSPDarknet53 in terms of object classification on the ILSVRC 2012 (ImageNet) dataset [10]. Conversely, however, CSPDarknet53 is better than CSPResNeXt50 in terms of detecting objects on the MS COCO dataset [46].

The next objective is to select additional blocks to enlarge the receptive field, and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN.

A reference model that is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:

  • 1. Higher input network size (resolution), for detecting multiple small objects
  • 2. More layers, for a larger receptive field to cover the increased input network size
  • 3. More parameters, for greater capacity of the model to detect multiple objects of different sizes in a single image

Hypothetically speaking, we can assume that a model with a larger receptive field (a larger number of convolutional 3×3 layers) and a larger number of parameters should be selected as the backbone. Table 1 shows the information for CSPResNeXt50, CSPDarknet53, and EfficientNet-B3. CSPResNeXt50 contains only 16 convolutional 3×3 layers, a 425×425 receptive field, and 20.6M parameters, while CSPDarknet53 contains 29 convolutional 3×3 layers, a 725×725 receptive field, and 27.6M parameters. This theoretical justification, together with our numerous experiments, shows that CSPDarknet53 is the optimal model of the two as the backbone for a detector.

The influence of receptive fields of different sizes is summarized as follows:

  • 1. Up to the object size: allows viewing the entire object
  • 2. Up to the network size: allows viewing the context around the object
  • 3. Exceeding the network size: increases the number of connections between an image point and the final activation

We add the SPP block over CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features, and causes almost no reduction in network operating speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.

Finally, we choose CSPDarknet53 as the backbone, the SPP additional module, the PANet path-aggregation neck, and the YOLOv3 (anchor-based) head as the architecture of YOLOv4.

In the future, we plan to expand significantly the content of the bag of freebies (BoF) for the detector, which could in theory address some problems and increase detector accuracy, and to check the influence of each feature sequentially and experimentally.

We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art results on a conventional graphics processor, e.g. a GTX 1080 Ti or RTX 2080 Ti.

3.2 Selection of BoF and BoS

To improve object-detection training, a CNN usually uses the following:

  • Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
  • Bounding-box regression losses: MSE, IoU, GIoU, CIoU, DIoU
  • Data augmentation: CutOut, MixUp, CutMix
  • Regularization methods: DropOut, DropPath, Spatial DropOut, or DropBlock
  • Normalization of network activations by their mean and variance: Batch Normalization (BN), Cross-GPU Batch Normalization (CGBN or SyncBN), Filter Response Normalization (FRN), or Cross-Iteration Batch Normalization (CBN)
  • Skip-connections: residual connections, weighted residual connections, multi-input weighted residual connections, or Cross-Stage-Partial connections (CSP)

Regarding training activation functions, since PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantized networks, we remove these activations from the candidate list. For regularization, the authors of DropBlock compared their method with others in detail, and their regularization method won by a clear margin; therefore, we chose DropBlock as our regularization method without hesitation. As for the choice of normalization method, since we focus on a training strategy that uses only one GPU, syncBN is not considered.

3.3 Additional Improvements

To make the designed detector more suitable for training on a single GPU, we made the following additional designs and improvements:

  • 1. We introduce a new method of data augmentation, Mosaic, together with self-adversarial training (SAT)
  • 2. We select optimal hyperparameters while applying genetic algorithms

We modify some existing methods to make our design suitable for efficient training and detection: modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN).

Mosaic is a new data augmentation method that mixes four training images. Thus four different contexts are mixed, while CutMix mixes only two input images. This allows detection of objects outside their normal context. In addition, batch normalization calculates activation statistics from four different images on each layer. This significantly reduces the need for a large mini-batch size.
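
A rough sketch of the Mosaic idea: tile four training images into the four quadrants of one canvas around a random split point (a deliberate simplification; a real implementation must also remap the bounding-box labels into the new canvas):

```python
import numpy as np

def mosaic(imgs, out_size=608):
    """Combine four HxWx3 images into one out_size x out_size mosaic."""
    assert len(imgs) == 4
    canvas = np.zeros((out_size, out_size, 3), dtype=imgs[0].dtype)
    # Random center dividing the canvas into four quadrants.
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(imgs, regions):
        h, w = y2 - y1, x2 - x1
        # Naive nearest-neighbor resize so each image fills its quadrant.
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas
```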

Self-adversarial training (SAT) also represents a new data augmentation technique that operates in two forward-backward stages. In the first stage, the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object in the image. In the second stage, the neural network is trained to detect an object on this modified image in the normal way.

CmBN denotes a modified version of CBN, as shown in Figure 4, defined as Cross mini-Batch Normalization (CmBN). It collects statistics only between mini-batches within a single batch.

We modify SAM from spatial-wise attention to point-wise attention, and replace the shortcut connection of PAN with concatenation, as shown in Figures 5 and 6.
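
A sketch of the modified point-wise SAM as Figure 5 depicts it: the channel pooling step is dropped, and a convolution followed by a sigmoid produces a full-resolution mask that re-weights the input element-wise (my reading of the figure, kept deliberately minimal):

```python
import torch
import torch.nn as nn

class PointwiseSAM(nn.Module):
    """YOLOv4's modified SAM: conv + sigmoid mask, no max/avg pooling step."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))  # element-wise re-weighting
```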

BN normalizes over the current mini-batch. CBN normalizes over the current mini-batch together with the three preceding mini-batches, while the CmBN proposed in this paper accumulates only within a single batch. In the experiments, CmBN outperforms BN by a little less than one percentage point.

3.4 YOLOv4

In this section, we elaborate on the details of YOLOv4.

YOLOv4 uses:

  • 1. Bag of Freebies (BoF) for the backbone: CutMix and Mosaic data augmentation, DropBlock regularization, class label smoothing
  • 2. Bag of Specials (BoS) for the backbone: Mish activation, Cross-Stage-Partial connections (CSP), Multi-input Weighted Residual Connections (MiWRC)
  • 3. Bag of Freebies (BoF) for the detector: CIoU loss, CmBN, DropBlock regularization, Mosaic data augmentation, self-adversarial training, eliminating grid sensitivity, using multiple anchors for a single ground truth, cosine annealing scheduler [52], optimal hyperparameters, random training shapes
  • 4. Bag of Specials (BoS) for the detector: Mish activation, SPP block, SAM block, PAN path-aggregation block, DIoU-NMS

4. Experiments

We test the influence of different training improvements on the classification accuracy of the classifier on the ImageNet (ILSVRC 2012 val) dataset, and then on the accuracy of the detector on the MS COCO (test-dev 2017) dataset.

4.1 Experimental Setup

In the ImageNet image-classification experiments, the default hyperparameters are as follows: training steps 8,000,000; batch size and mini-batch size 128 and 32, respectively; a polynomial-decay learning-rate scheduling strategy with an initial learning rate of 0.1; warm-up steps 1,000; momentum and weight decay set to 0.9 and 0.005, respectively. All our BoS experiments use the same default hyperparameters, while in BoF experiments we add an extra 50% of training steps. In the BoF experiments we verify MixUp, CutMix, Mosaic, and blurring data augmentation, as well as label smoothing regularization. In the BoS experiments we compare the effects of LReLU, Swish, and Mish. All experiments are trained with a 1080 Ti or 2080 Ti GPU.

In summary, in the ImageNet image-classification experiments the default hyperparameters are:

  1. training steps: 8,000,000
  2. batch size: 128
  3. mini-batch size: 32
  4. polynomial decay learning rate: initial learning rate 0.1
  5. warm-up steps: 1,000
  6. momentum: 0.9
  7. weight decay: 0.005
  8. In the BoF experiments, an extra 50% of training steps is added

In the MS COCO object-detection experiments, the default hyperparameters are as follows: training steps 500,500; a step-decay learning-rate scheduling strategy with an initial learning rate of 0.01, multiplied by a factor of 0.1 at the 400,000th step and the 450,000th step, respectively; momentum and weight decay set to 0.9 and 0.0005, respectively. All architectures use a single GPU for multi-scale training with a batch size of 64, while the mini-batch size is 8 or 4 depending on the architecture and GPU memory limits. Except for the hyperparameter-search experiments using a genetic algorithm, all other experiments use the default settings. The genetic-algorithm experiments use YOLOv3-SPP trained with GIoU loss and search 300 epochs on min-val 5k data. We adopt the searched learning rate 0.00261, momentum 0.949, IoU threshold for assigning ground truth 0.213, and loss normalizer 0.07. We have verified a large number of BoF, including grid sensitivity elimination, mosaic data augmentation, IoU threshold, genetic algorithm, class label smoothing, cross mini-batch normalization, self-adversarial training, cosine annealing scheduler, dynamic mini-batch size, DropBlock, optimized anchors, and different kinds of IoU losses. We also verify various BoS, including Mish, SPP, SAM, RFB, BiFPN, and Gaussian YOLO [8]. All experiments use only one GPU for training, so techniques such as syncBN that optimize multi-GPU training are not used.

In summary, in the MS COCO object-detection experiments the default hyperparameters are:

  1. training steps: 500,500
  2. step decay learning rate (see the sketch after this list): steps 0-399,999, learning rate 0.01; steps 400,000-449,999, learning rate 0.001; steps 450,000-500,500, learning rate 0.0001
  3. momentum: 0.9
  4. weight decay: 0.0005
  5. batch size: 64
  6. mini-batch size: 8 or 4
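
This step-decay schedule maps directly onto PyTorch's built-in MultiStepLR; a hedged sketch of the equivalent setup (the model is a placeholder, and the scheduler is stepped once per training iteration to match the step counts above):

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder for the actual detector
opt = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=0.0005)
# Multiply the learning rate by 0.1 at steps 400,000 and 450,000.
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[400_000, 450_000], gamma=0.1)

for step in range(500_500):
    # ... forward pass, loss.backward(), opt.step(), opt.zero_grad() ...
    sched.step()  # stepped per iteration, not per epoch, in this setup
```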

Except for the experiments that use a genetic algorithm to search for optimal hyperparameters, all others use the default settings.

4.2 Influence of Different Features on Classifier Training

First, we study the influence of different features on classifier training: specifically, class label smoothing; the influence of different data augmentation techniques, bilateral blurring, MixUp, CutMix, and Mosaic, as shown in Figure 7; and the influence of different activation functions, such as Leaky ReLU (by default), Swish, and Mish.

In our experiments, as shown in Table 2, classifier accuracy is improved by introducing features such as CutMix and Mosaic data augmentation, class label smoothing, and Mish activation. Therefore, our BoF-backbone (bag of freebies) for classifier training includes CutMix and Mosaic data augmentation and class label smoothing. In addition, we use Mish activation as a complementary option, as shown in Tables 2 and 3.

4.3 Influence of Different Features on Detector Training

Further study concerns the influence of different bag-of-freebies methods (BoF-detector) on detector training accuracy, as shown in Table 4. We significantly expand the BoF list by studying features that increase detector accuracy without affecting FPS.

4.4 Influence of Different Backbones and Pre-trained Weights on Detector Training

We further study the influence of different backbone models on detector accuracy, as shown in Table 6. We notice that the model with the best classification accuracy is not always the best in terms of detector accuracy.
First, although the classification accuracy of CSPResNeXt50 models trained with different features is higher than that of CSPDarknet53 models, the CSPDarknet53 model shows higher accuracy in object detection.
Second, using BoF and Mish for CSPResNeXt50 classifier training increases its classification accuracy, but further application of these pre-trained weights reduces detector accuracy. However, using BoF and Mish for CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector that uses these pre-trained weights. The net result is that the backbone CSPDarknet53 is more suitable for a detector than CSPResNeXt50.
We observe that the CSPDarknet53 model demonstrates a greater ability to increase detector accuracy thanks to various improvements.

4.5 Influence of Different Mini-batch Sizes on Detector Training

Finally, we analyze the results obtained with models trained with different mini-batch sizes, as shown in Table 7. From these results we find that after adding the BoF and BoS training strategies, the mini-batch size has almost no effect on the detector's performance. This shows that after the introduction of BoF and BoS, it is no longer necessary to use expensive GPUs for training. In other words, anyone can use just a conventional GPU to train an excellent detector.

5. Results

Figure 8 shows the comparison with other state-of-the-art object detectors. Our YOLOv4 lies on the Pareto optimality curve and is superior to the fastest and most accurate detectors in terms of both speed and accuracy.

Since different methods use GPUs of different architectures for inference-time verification, we run YOLOv4 on commonly adopted GPUs of the Maxwell, Pascal, and Volta architectures and compare it with other state-of-the-art methods. Table 8 lists the frame-rate comparison results using a Maxwell GPU, which can be a GTX Titan X (Maxwell), Titan Xp, GTX 1080 Ti, or Tesla P100 GPU. Table 10 lists the frame-rate comparison results using a Volta GPU, which can be a Titan Volta or Tesla V100 GPU.

6. Conclusions

We offer a state-of-the-art detector that is faster (FPS) and more accurate (MS COCO AP50:95 and AP50) than all available alternative detectors. The detector described can be trained and used on a conventional GPU with 8-16 GB of VRAM, which makes its broad use possible. The original concept of one-stage anchor-based detectors has proven its viability. We have verified a large number of features and selected those that improve the accuracy of both the classifier and the detector. These features can be used as best practices for future studies and development.

Original: https://www.cnblogs.com/wj-1314/p/14506507.html
Author: 战争热诚
Title: Deep Learning Paper Translation and Analysis: YOLOv4: Optimal Speed and Accuracy of Object Detection
