Paper Reading Notes: Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

Paper: https://arxiv.org/abs/1804.01452
Code: https://github.com/LiqunChen0606/Jointly-Discovering-Visual-Objects-and-Spoken-Words

These are my reading notes on the paper; if you notice any problems, please point them out in the comments.

Abstract

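The paper designs a neural network that associates spoken audio captions with their corresponding images. By training on image-audio retrieval as a proxy task, the network also learns to localize, inside the image, the regions that the speech refers to. The method needs no conventional supervision in the form of labels, segmentations, or alignments between the modalities. Experiments on the Places 205 and ADE20K datasets show that it semantically pairs objects in the image with words in the speech. The authors work directly on raw sensory input, i.e. image pixels and the speech waveform.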

1. Background

The authors want to find out whether speech can be linked to vision using raw data that is unaligned and unannotated.

The authors emphasize that the method uses no conventional speech recognition or transcription and no object detection or recognition model, yet it detects and segments both the objects in the image and the words in the speech without any supervision.

2. Model

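Unlike earlier approaches, the authors no longer map a whole image and a whole spoken utterance to each other. Instead they learn representations that are distributed over space and time, so that co-localization can be done directly within each modality. The optimization objective is ranking-based.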
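To make "ranking-based" concrete, here is a minimal sketch of a margin ranking loss over a batch of image-caption similarities. It is an illustration under assumptions, not the authors' code: the function name, the margin of 1.0, and the choice to treat every other item in the batch as an impostor are all assumptions.

```python
def ranking_loss(sim, margin=1.0):
    """Margin ranking loss over a square similarity matrix.

    sim[i][j] is the similarity between image i and audio caption j.
    Matched pairs sit on the diagonal. Assumption: every off-diagonal
    entry is used as an impostor, and margin=1.0 is illustrative.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]  # similarity of the matched pair
        for j in range(n):
            if j == i:
                continue
            # Image i paired with the wrong caption j should score
            # at least `margin` below the matched pair.
            total += max(0.0, sim[i][j] - pos + margin)
            # Caption i paired with the wrong image j, same constraint.
            total += max(0.0, sim[j][i] - pos + margin)
    return total / n


# Matched pairs clearly win, so no hinge term is active.
well_separated = [[2.0, 0.0], [0.0, 2.0]]
# Image 0 scores higher with the wrong caption, so the loss is positive.
confused = [[1.0, 1.5], [0.5, 1.0]]
print(ranking_loss(well_separated), ranking_loss(confused))
```

The loss is zero only when every matched pair beats every mismatched pair by at least the margin, which is what pushes the two branches toward a shared embedding space.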

The authors use two branches to process the image and the audio separately.

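For the image branch, earlier work generally needed a pretrained VGG, whereas this paper does not. In addition, the network is kept only up to conv5, and the pooling and later operations after it are removed.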
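A rough size check shows why dropping the last pooling step matters for localization. The sketch assumes a VGG16-style stack in which the padded 3x3 convolutions keep the resolution and each 2x2 max-pool halves it; the 224-pixel input and the function name are assumptions for illustration.

```python
def feature_map_size(input_size, num_pools, pool_stride=2):
    """Spatial side length after `num_pools` stride-2 poolings.

    Assumption: convolutions are padded so that only the pooling
    layers change the resolution, as in a VGG16-style network.
    """
    size = input_size
    for _ in range(num_pools):
        size //= pool_stride
    return size


# Stopping at conv5 leaves four poolings before the output.
print(feature_map_size(224, 4))
# Keeping the fifth pooling as well would halve the grid again.
print(feature_map_size(224, 5))
```

Under these assumptions the truncated network produces a 14x14 grid instead of 7x7, so each spatial cell covers a smaller image region.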

To compute the similarity between audio and image, the first step is to take dot products between the image features and the audio features.
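The dot-product step can be sketched as follows: every spatial cell of the image feature map is compared with every audio frame, giving a 3-D array of scores (the paper calls this a matchmap). The nested-list layout and the function name are assumptions chosen to keep the example dependency-free.

```python
def matchmap(image_feats, audio_feats):
    """Return M with M[r][c][t] = dot(image_feats[r][c], audio_feats[t]).

    image_feats: rows x cols grid of D-dimensional vectors.
    audio_feats: list of T D-dimensional vectors, one per audio frame.
    """
    return [
        [
            [sum(i * a for i, a in zip(cell, frame)) for frame in audio_feats]
            for cell in row
        ]
        for row in image_feats
    ]


# A 1x2 image grid and two audio frames, all with D = 2.
image_feats = [[[1.0, 0.0], [0.0, 1.0]]]
audio_feats = [[1.0, 0.0], [2.0, 3.0]]
print(matchmap(image_feats, audio_feats))
```

High entries in the result mark image regions and audio frames whose embeddings point in the same direction, which is what makes the later localization possible.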

Several alternative similarity scores can then be computed from these dot products.
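The formulas are not reproduced in these notes, so the sketch below follows the three scores as the paper defines them: SISA (sum image, sum audio), MISA (max image, sum audio) and SIMA (sum image, max audio), where "sum" means averaging. The input is a 3-D score array indexed as m[r][c][t] (row, column, audio frame); the function names and the nested-list layout are assumptions for illustration.

```python
def sisa(m):
    """Average over all spatial cells and all audio frames."""
    vals = [v for row in m for cell in row for v in cell]
    return sum(vals) / len(vals)


def misa(m):
    """For each audio frame take the best image cell, then average over frames."""
    cells = [cell for row in m for cell in row]
    n_t = len(cells[0])
    return sum(max(cell[t] for cell in cells) for t in range(n_t)) / n_t


def sima(m):
    """For each image cell take the best audio frame, then average over cells."""
    cells = [cell for row in m for cell in row]
    return sum(max(cell) for cell in cells) / len(cells)


# One row, two columns, two audio frames.
m = [[[1.0, 2.0], [0.0, 3.0]]]
print(sisa(m), misa(m), sima(m))
```

The max-based variants let a single strong match dominate along one axis, so a word only has to match somewhere in the image (MISA) or a region only has to match somewhere in the utterance (SIMA).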

3. Experiments

First, retrieval experiments were carried out.
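Retrieval of this kind is commonly scored with Recall@K, the fraction of queries whose true match appears among the top K results. The notes do not state the metric, so treat the sketch below as an assumption about the evaluation; the function name and the convention that query q matches candidate q are hypothetical.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose true match ranks in the top k.

    sim[q][c] is the score of query q against candidate c.
    Assumption: the correct candidate for query q has index q.
    """
    hits = 0
    for q, scores in enumerate(sim):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(sim)


sim = [
    [0.9, 0.1, 0.2],  # query 0: true match ranked first
    [0.8, 0.3, 0.5],  # query 1: true match ranked last
    [0.1, 0.2, 0.7],  # query 2: true match ranked first
]
print(recall_at_k(sim, 1), recall_at_k(sim, 3))
```

The same function covers both directions of retrieval: pass the similarity matrix as is for one direction and its transpose for the other.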

Next come the localization experiments.

Clustering experiments were also carried out.

Different losses and network architectures are compared.

Visualizations.

Original: https://blog.csdn.net/qq_39233881/article/details/121211789
Author: 住在新手村的小木子
Title: Paper Reading Notes: Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
