Weakly-Supervised / Unsupervised Salient Object Detection (SOD): Abstracts of Selected Classic Papers

Deep Neural Networks (DNNs) have substantially improved the state-of-the-art in salient object detection. However, training DNNs requires costly pixel-level annotations. In this paper, we leverage the observation that image-level tags provide important cues of foreground salient objects, and develop a weakly supervised learning method for saliency detection using image-level tags only. The Foreground Inference Network (FIN) is introduced for this challenging task. In the first stage of our training method, FIN is jointly trained with a fully convolutional network (FCN) for image-level tag prediction. A global smooth pooling layer is proposed, enabling the FCN to assign object category tags to the corresponding object regions, while FIN captures all potential foreground regions in its predicted saliency maps. In the second stage, FIN is fine-tuned with its predicted saliency maps as ground truth. To refine this ground truth, an iterative Conditional Random Field is developed to enforce spatial label consistency and further boost performance. Our method alleviates annotation effort and allows the use of existing large-scale training sets with image-level tags. Our model runs at 60 FPS, outperforms unsupervised methods by a large margin, and achieves performance comparable or even superior to fully supervised counterparts.
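
A useful way to see where global smooth pooling sits: global max pooling credits a single pixel per class (good localization, weak training signal), while global average pooling spreads credit over the whole map. The sketch below approximates this middle ground in PyTorch with a top-k mean; the paper's exact formulation differs, and `topk_ratio` is our illustrative parameter:

```python
import torch
import torch.nn.functional as F

def global_smooth_pool(class_maps: torch.Tensor, topk_ratio: float = 0.1) -> torch.Tensor:
    """Pool per-class activation maps (B, C, H, W) into image-level scores (B, C).

    Averaging the top-k activations per class map sits between global max
    pooling (credits one pixel) and global average pooling (credits all
    pixels), encouraging the classifier to light up whole object regions.
    NOTE: this top-k mean is an illustrative stand-in, not the paper's
    exact global smooth pooling formulation.
    """
    b, c, h, w = class_maps.shape
    flat = class_maps.view(b, c, h * w)
    k = max(1, int(topk_ratio * h * w))
    topk, _ = flat.topk(k, dim=2)   # (B, C, k) strongest responses per class
    return topk.mean(dim=2)         # (B, C) image-level class scores

# Image-level tags then supervise the scores with a multi-label loss:
scores = global_smooth_pool(torch.randn(2, 20, 28, 28))
tags = torch.zeros(2, 20)
tags[0, 3] = 1.0
tags[1, 7] = 1.0
loss = F.binary_cross_entropy_with_logits(scores, tags)
```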


In light of the powerful learning capability of deep neural networks (DNNs), deep (convolutional) models have been built in recent years to address the task of salient object detection. Although training such deep saliency models can significantly improve detection performance, it requires large-scale manual supervision in the form of pixel-level human annotation, which is highly labor-intensive and time-consuming. To address this problem, this paper makes the earliest effort to train a deep salient object detector without using any human annotation. The key insight is "supervision by fusion", i.e., generating useful supervisory signals from the fusion process of weak but fast unsupervised saliency models. Based on this insight, we combine an intra-image fusion stream and an inter-image fusion stream in the proposed framework to generate the learning curriculum and pseudo ground truth for supervising the training of the deep salient object detector. Comprehensive experiments on four benchmark datasets demonstrate that our method approaches the same network trained with full supervision (within a 2–5% performance gap) and, more encouragingly, even outperforms a number of fully supervised state-of-the-art approaches.
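
To make the "supervision by fusion" idea concrete, the sketch below fuses several weak unsupervised saliency maps into a pseudo ground-truth map, keeping only confident pixels for training. The plain average and both thresholds are illustrative simplifications of the paper's intra-/inter-image fusion streams:

```python
import numpy as np

def fuse_saliency_maps(maps, fg_thresh=0.8, bg_thresh=0.2):
    """Fuse outputs of several weak unsupervised saliency models (each an
    HxW array in [0, 1]) into one pseudo ground-truth map.

    Pixels whose fused score is confidently high/low become foreground/
    background labels; ambiguous pixels are marked 255 so the loss can
    ignore them. The plain average and both thresholds are illustrative
    simplifications of the paper's fusion streams.
    """
    fused = np.mean(np.stack(maps, axis=0), axis=0)
    pseudo = np.full(fused.shape, 255, dtype=np.uint8)  # 255 = ignore
    pseudo[fused >= fg_thresh] = 1                      # confident foreground
    pseudo[fused <= bg_thresh] = 0                      # confident background
    return pseudo

# Example: three weak models voting on a 4x4 image
maps = [np.random.rand(4, 4) for _ in range(3)]
print(fuse_saliency_maps(maps))
```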


Deep learning based salient object detection has recently achieved great success, with performance that greatly outperforms unsupervised methods. However, annotating per-pixel saliency masks is a tedious and inefficient procedure. In this paper, we note that superior salient object detection can be obtained by iteratively mining and correcting the labeling ambiguity in saliency maps from traditional unsupervised methods. We propose to use the combination of a coarse salient object activation map from a classification network and saliency maps generated by unsupervised methods as pixel-level annotation, and develop a simple yet very effective algorithm to train fully convolutional networks for salient object detection supervised by these noisy annotations. Our algorithm alternates between exploiting a graphical model and training a fully convolutional network for model updating. The graphical model corrects the internal labeling ambiguity through spatial consistency and structure preservation, while the fully convolutional network helps to correct the cross-image semantic ambiguity and simultaneously updates the coarse activation map for the next iteration. Experimental results demonstrate that our proposed method greatly outperforms all state-of-the-art unsupervised saliency detection methods and is comparable, on all public benchmarks, to the current best strongly-supervised methods trained with thousands of pixel-level saliency map annotations.
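
The alternating scheme can be summarized in a few lines. In the sketch below, `fcn.fit`, `fcn.predict`, and `graphical_refine` are hypothetical interfaces standing in for the paper's FCN and graphical model; only the overall control flow follows the abstract:

```python
def make_noisy_annotation(cam, unsup_sal, alpha=0.5):
    """Blend a coarse class activation map with an unsupervised saliency
    map (both HxW in [0, 1]) into one noisy pixel-level annotation.
    The convex combination with weight `alpha` is our simplification."""
    return alpha * cam + (1.0 - alpha) * unsup_sal

def train_alternating(images, cams, unsup_maps, fcn, graphical_refine, n_rounds=3):
    """Alternate between (a) refining the noisy labels with a graphical
    model that enforces spatial consistency and structure preservation,
    and (b) retraining the FCN on the refined labels; the FCN's own
    predictions become the coarse maps for the next round.
    `fcn.fit`, `fcn.predict` and `graphical_refine` are hypothetical
    interfaces standing in for the paper's components."""
    labels = [make_noisy_annotation(c, u) for c, u in zip(cams, unsup_maps)]
    for _ in range(n_rounds):
        labels = [graphical_refine(img, lab) for img, lab in zip(images, labels)]
        fcn.fit(images, labels)                         # fixes cross-image semantics
        labels = [fcn.predict(img) for img in images]   # updated coarse maps
    return fcn
```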


The success of current deep saliency detection methods heavily depends on the availability of large-scale supervision in the form of per-pixel labeling. Such supervision is labor-intensive and not always possible, and it tends to hinder the generalization ability of the learned models. By contrast, traditional unsupervised saliency detection methods based on handcrafted features, even though they have been surpassed by deep supervised methods, are generally dataset-independent and can be applied in the wild. This raises a natural question: "Is it possible to learn saliency maps without using labeled data while improving the generalization ability?" To this end, we present a novel perspective on unsupervised saliency detection: learning from the multiple noisy labels generated by "weak" and "noisy" unsupervised handcrafted saliency methods. Our end-to-end deep learning framework for unsupervised saliency detection consists of a latent saliency prediction module and a noise modeling module that work collaboratively and are optimized jointly. Explicit noise modeling enables us to deal with noisy saliency maps in a probabilistic way. Extensive experimental results on various benchmark datasets show that our model not only outperforms all unsupervised saliency methods by a large margin but also achieves performance comparable to recent state-of-the-art supervised deep saliency methods.
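
The probabilistic treatment can be sketched as a Gaussian noise model: each handcrafted method's map is viewed as the latent clean prediction corrupted by method-specific noise with a learned variance, and the negative log-likelihood is minimized. This is a minimal sketch; the paper's noise module is richer (e.g., spatially varying noise):

```python
import torch
import torch.nn as nn

class NoiseAwareLoss(nn.Module):
    """Gaussian negative log-likelihood for saliency prediction from
    multiple noisy labels: each handcrafted method's map is treated as
    the latent clean prediction corrupted by zero-mean Gaussian noise
    with a learned, method-specific variance. A minimal sketch; the
    paper's noise module is richer (e.g., spatially varying noise)."""

    def __init__(self, n_methods):
        super().__init__()
        # assumption: one scalar log-variance per unsupervised method
        self.log_var = nn.Parameter(torch.zeros(n_methods))

    def forward(self, pred, noisy_labels):
        # pred: (B, 1, H, W) latent saliency; noisy_labels: (B, M, H, W)
        log_var = self.log_var.view(1, -1, 1, 1)
        sq_err = (noisy_labels - pred) ** 2
        nll = 0.5 * (sq_err / log_var.exp() + log_var)
        return nll.mean()

# Example: 2 images, 4 noisy labeling methods, 8x8 maps
criterion = NoiseAwareLoss(n_methods=4)
loss = criterion(torch.rand(2, 1, 8, 8), torch.rand(2, 4, 8, 8))
```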


The high cost of pixel-level annotations makes it appealing to train saliency detection models with weak supervision. However, a single weak supervision source usually does not contain enough information to train a well-performing model. To this end, we propose a unified framework to train saliency detection models with diverse weak supervision sources. In this paper, we use category labels, captions, and unlabelled data for training, yet other supervision sources can also be plugged into this flexible framework. We design a classification network (CNet) and a caption generation network (PNet), which learn to predict object categories and generate captions, respectively, while highlighting the most important regions for their corresponding tasks. An attention transfer loss is designed to transmit supervision signals between the networks, so that a network trained with one supervision source can benefit from another. An attention coherence loss is defined on unlabelled data to encourage the networks to detect generally salient regions instead of task-specific regions. We use CNet and PNet to generate pixel-level pseudo labels to train a saliency prediction network (SNet). During the testing phase, only SNet is needed to predict saliency maps. Experiments demonstrate that the performance of our method compares favourably against unsupervised and weakly supervised methods, and even some supervised methods.
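
A minimal sketch of what an attention transfer loss can look like: wherever the source network's attention is confident, it serves as a pseudo label for the other network's attention. The thresholds and the BCE formulation are our illustrative choices, not the paper's exact definition:

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(attn_src, attn_dst, fg_thresh=0.7, bg_thresh=0.3):
    """Transmit supervision between two networks: wherever the source
    attention map (values in [0, 1]) is confident, it acts as a pseudo
    label for the destination attention map. The thresholds and the BCE
    formulation are illustrative choices, not the paper's exact loss."""
    with torch.no_grad():
        confident = (attn_src > fg_thresh) | (attn_src < bg_thresh)
        pseudo = (attn_src > fg_thresh).float()
    if not confident.any():
        return attn_dst.sum() * 0.0  # no confident pixels, zero loss
    return F.binary_cross_entropy(attn_dst[confident], pseudo[confident])

# Example: CNet's attention supervising PNet's attention
loss = attention_transfer_loss(torch.rand(2, 1, 16, 16), torch.rand(2, 1, 16, 16))
```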


Compared with laborious pixel-wise dense labeling, it is much easier to label data with scribbles, which cost only 1–2 seconds per image. However, using scribble labels to learn salient object detection has not been explored. In this paper, we propose a weakly-supervised salient object detection model to learn saliency from such annotations. In doing so, we first relabel an existing large-scale salient object detection dataset with scribbles, yielding the S-DUTS dataset. Since scribbles do not identify object structure and detail information, directly training with scribble labels leads to saliency maps with poor boundary localization. To mitigate this problem, we propose an auxiliary edge detection task to localize object edges explicitly, and a gated structure-aware loss to constrain the scope of structure to be recovered. Moreover, we design a scribble boosting scheme to iteratively consolidate our scribble annotations, which are then employed as supervision to learn high-quality saliency maps. As existing saliency evaluation metrics neglect to measure the structure alignment of predictions, their rankings of saliency maps may not comply with human perception. We present a new metric, termed the saliency structure measure, to measure the structure alignment of predicted saliency maps, which is more consistent with human perception. Extensive experiments on six benchmark datasets demonstrate that our method not only outperforms existing weakly-supervised/unsupervised methods, but is also on par with several fully-supervised state-of-the-art models.
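
The core difficulty of scribble supervision is that the loss can only be evaluated on the handful of scribbled pixels. A minimal sketch of such a partial cross-entropy term (the paper adds the edge task and the gated structure-aware loss on top of this):

```python
import torch
import torch.nn.functional as F

def partial_cross_entropy(pred, scribble):
    """Cross-entropy evaluated only on scribbled pixels.

    pred: (B, 1, H, W) sigmoid saliency in (0, 1).
    scribble: (B, 1, H, W) with 1 = foreground scribble,
              0 = background scribble, 255 = unlabeled.
    Ignoring unlabeled pixels is the standard way to learn from scribbles;
    the paper adds the edge task and the gated structure-aware loss on top."""
    labeled = scribble != 255
    if not labeled.any():
        return pred.sum() * 0.0
    return F.binary_cross_entropy(pred[labeled], scribble[labeled].float())

# Example: a mostly-unlabeled scribble mask
pred = torch.rand(1, 1, 8, 8)
scribble = torch.full((1, 1, 8, 8), 255.0)
scribble[0, 0, 2, 2:5] = 1.0   # a short foreground scribble
scribble[0, 0, 6, 1:4] = 0.0   # a short background scribble
loss = partial_cross_entropy(pred, scribble)
```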


Sparse labels have been attracting much attention in recent years. However, the performance gap between weakly supervised and fully supervised salient object detection methods is huge, and most previous weakly supervised works adopt complex training methods with many bells and whistles. In this work, we propose a one-round end-to-end training approach for weakly supervised salient object detection via scribble annotations, without pre/post-processing operations or extra supervision data. Since scribble labels fail to offer detailed salient regions, we propose a local coherence loss to propagate the labels to unlabeled regions based on image features and pixel distance, so as to predict integral salient regions with complete object structures. We design a saliency structure consistency loss as a self-consistency mechanism to ensure that consistent saliency maps are predicted when different scales of the same image are given as input, which can be viewed as a regularization technique to enhance the model's generalization ability. Additionally, we design an aggregation module (AGGM) to better integrate high-level features, low-level features, and global context information for the decoder. Extensive experiments show that our method achieves a new state-of-the-art performance on six benchmarks (e.g., for the ECSSD dataset: $F_\beta = 0.8995$, $E_\xi = 0.9079$ and $MAE = 0.0489$), with an average gain of 4.60% in F-measure, 2.05% in E-measure, and 1.88% in MAE over the previous best method on this task.
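
The saliency structure consistency loss can be sketched as a simple multi-scale self-consistency term: the prediction for a downscaled image should agree with the downscaled prediction for the original image. An L1 formulation is used below for brevity; the exact loss in the paper may differ (e.g., include a structure/SSIM-style term):

```python
import torch
import torch.nn.functional as F

def structure_consistency_loss(model, image, scale=0.75):
    """Self-consistency regularizer: the saliency map predicted for a
    downscaled image should match the downscaled saliency map of the
    original image. `model` is any saliency network returning (B, 1, H, W)
    maps; the L1 formulation is for brevity, while the paper's loss also
    involves a structure (SSIM-style) term."""
    pred_full = model(image)
    small = F.interpolate(image, scale_factor=scale,
                          mode='bilinear', align_corners=False)
    pred_small = model(small)
    pred_full_ds = F.interpolate(pred_full, size=pred_small.shape[-2:],
                                 mode='bilinear', align_corners=False)
    return (pred_small - pred_full_ds).abs().mean()
```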


Original: https://blog.csdn.net/qq_40714949/article/details/120685329
Author: xiongxyowo
Title: Weakly-Supervised / Unsupervised Salient Object Detection (SOD): Abstracts of Selected Classic Papers
