[Paper Notes] 2021 – Context-aware Biaffine Localizing Network for Temporal Sentence Grounding


Paper Overview

Original paper: Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

Paper: https://arxiv.org/abs/2103.11555

Code: https://github.com/liudaizong/CBLN

The notes below were taken while reading the paper; my knowledge is limited, so corrections are welcome.

Paper Content

Abstract

  • This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.

Background: the paper addresses temporal sentence grounding (TSG): given a sentence query and an untrimmed video, automatically locate the start-end timestamps of the segment the sentence describes. As illustrated below, the input is the query and the video, and the output is the temporal boundary of the matching segment.

[Figure: illustration of the TSG task (query sentence + untrimmed video → temporal boundary)]
  • Previous works either compare pre-defined candidate segments with the query and select the best one by ranking, or directly regress the boundary timestamps of the target segment.

Prior work: previous approaches generally fall into two types: (1) compare pre-defined candidate segments with the query and select the best one by ranking; (2) directly regress the temporal boundary of the target segment.
  • In this paper, we propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.

Method: the paper proposes a novel localization framework that uses a biaffine mechanism to score every start-end index pair simultaneously (a minimal sketch of this scoring follows the abstract).
  • In particular, we present a Context-aware Biaffine Localizing Network (CBLN) which incorporates both local and global contexts into features of each start/end position for biaffine-based localization.

The proposed Context-aware Biaffine Localizing Network (CBLN) incorporates both local and global contexts into the features of each start/end position for biaffine-based localization.
  • The local contexts from the adjacent frames help distinguish the visually similar appearance, and the global contexts from the entire video contribute to reasoning the temporal relation.

Local contexts from adjacent frames help distinguish visually similar appearances, while global contexts from the entire video help reason about temporal relations.
  • Besides, we also develop a multi-modal self-attention module to provide fine-grained query-guided video representation for this biaffine strategy.

In addition, a multi-modal self-attention module provides a fine-grained, query-guided video representation for the biaffine strategy (a sketch appears after the architecture figure below).
  • Extensive experiments show that our CBLN significantly outperforms state-of-the-arts on three public datasets (ActivityNet Captions, TACoS, and Charades-STA), demonstrating the effectiveness of the proposed localization framework.

Results: extensive experiments show that CBLN outperforms the state of the art on three public datasets: ActivityNet Captions, TACoS, and Charades-STA.
  • The code is available at https://github.com/liudaizong/CBLN.

Code: https://github.com/liudaizong/CBLN
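
To make the biaffine scoring concrete, here is a minimal PyTorch sketch of scoring all start-end pairs with a bilinear map, in the style of Dozat & Manning (2017), which the paper cites as its inspiration. The class name, MLP sizes, and initialization are illustrative assumptions, not the exact CBLN configuration.

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Scores every (start, end) index pair of a video feature sequence.

    A minimal sketch of biaffine span scoring in the style of
    Dozat & Manning (2017); dimensions are illustrative, not CBLN's.
    """

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # Separate MLPs specialize each position's feature for the
        # "start" role and the "end" role.
        self.start_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.end_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        # Bilinear weight; the extra "+1" dimension folds the linear
        # and bias terms of the biaffine form into one matrix.
        self.W = nn.Parameter(torch.randn(hidden + 1, hidden + 1) * 0.01)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, dim) query-guided video features, one per snippet.
        B, T, _ = feats.shape
        hs = self.start_mlp(feats)              # (B, T, hidden)
        he = self.end_mlp(feats)                # (B, T, hidden)
        ones = feats.new_ones(B, T, 1)
        hs = torch.cat([hs, ones], dim=-1)      # append bias feature
        he = torch.cat([he, ones], dim=-1)
        # scores[b, i, j] = hs[b, i]^T W he[b, j]: all pairs at once.
        return torch.einsum("bid,de,bje->bij", hs, self.W, he)  # (B, T, T)

# Usage: only the upper triangle (end >= start) holds valid segments;
# the argmax over it gives the predicted boundary.
scorer = BiaffineScorer(dim=512)
scores = scorer(torch.randn(2, 64, 512))                    # (2, 64, 64)
valid = torch.triu(torch.ones(64, 64, dtype=torch.bool))
flat = scores.masked_fill(~valid, float("-inf")).view(2, -1).argmax(-1)
start, end = flat // 64, flat % 64
```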

1 Introduction

  • Video understanding is a fundamental task in computer vision with many applications:
    • video event detection
    • video summarization
    • video captioning
    • temporal action localization
    • temporal sentence grounding (TSG), the task of this paper
  • TSG challenges: the model must handle multi-modal features across vision and text, and capture semantically aligned contextual information;
  • Prior research falls into two categories:
    (1) multi-modal matching architectures, which generate candidate segments over different temporal intervals and rank them by similarity to the query sentence. Drawback: performance depends heavily on candidate quality, and the approach breaks the video's original temporal structure and global context;
    (2) approaches that directly regress the start-end timestamps of the target segment, or classify each frame as a start/end frame. Drawback: they do not jointly model the relation between start and end frames. For example, opening a door to leave and opening a door to enter both begin with opening a door, so judging the start frame in isolation is ambiguous;
  • This paper's method: the Context-aware Biaffine Localizing Network (CBLN), whose architecture is shown in the figure below;
  • Core novelty: a context-aware biaffine mechanism;
    Inspiration: Timothy Dozat and Christopher D. Manning. Deep biaffine attention for neural dependency parsing. ICLR, 2017.
  • Challenges for the biaffine mechanism (see the sketch after this list):
    (1) adjacent frames in a video look similar, making boundaries hard to pin down, so local information is crucial;
    (2) grounding a query's semantics requires reasoning over the whole event, so global information is crucial;
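
As a companion to those two points, here is a hedged sketch of one way to inject local and global contexts into each position's feature: local context from a small temporal convolution over adjacent snippets, global context from attention pooling over the whole video. This is an illustrative simplification; the paper's actual multi-scale context modeling may differ.

```python
import torch
import torch.nn as nn

class LocalGlobalContext(nn.Module):
    """Fuses local and global contexts into each position's feature.

    A hedged sketch of the idea only, not CBLN's exact design: local
    context from a temporal convolution over adjacent snippets, global
    context from attention pooling over the entire video.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Local: a 3-snippet receptive field over neighboring frames.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Global: one attention score per snippet for video-level pooling.
        self.attn = nn.Linear(dim, 1)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, dim)
        local = self.local_conv(feats.transpose(1, 2)).transpose(1, 2)
        weights = torch.softmax(self.attn(feats), dim=1)         # (B, T, 1)
        glob = (weights * feats).sum(dim=1, keepdim=True)        # (B, 1, dim)
        glob = glob.expand_as(feats)
        # Each position now carries its own, local, and global evidence.
        return self.fuse(torch.cat([feats, local, glob], dim=-1))
```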

[Figure: overall architecture of CBLN]
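
For the multi-modal self-attention module mentioned in the abstract, a minimal sketch of the underlying idea: video snippets and query words are concatenated into one token sequence and attend to each other in a single self-attention pass, and the video positions are read back as the query-guided representation. The single layer, head count, and dimensions are assumptions for illustration; the paper's module is likely deeper.

```python
import torch
import torch.nn as nn

class MultiModalSelfAttention(nn.Module):
    """Query-guided video representation via joint self-attention.

    An illustrative single-layer sketch, not the paper's exact module.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # video: (B, T, dim), words: (B, L, dim), in a shared feature space.
        tokens = torch.cat([video, words], dim=1)   # (B, T + L, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + out)
        # Read back only the video positions, now conditioned on the query.
        return tokens[:, : video.size(1)]
```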

Original: https://blog.csdn.net/weixin_43862112/article/details/121767199
Author: EmoryDodin
Title: [论文记录] 2021 – Context-aware Biaffine Localizing Network for Temporal Sentence Grounding
