This paper aims to verify the feasibility of TransVG, a Transformer-based framework for visual grounding, and to visualize its results.

Currently, visual grounding methods fall into two categories, two-stage and one-stage, and both rely on hand-crafted, complex modules for query reasoning and multimodal fusion. However, mechanisms such as image scene graphs and query decomposition in these designs are prone to overfitting and prevent full interaction between visual and linguistic features. Inspired by the success of Transformers in vision tasks, TransVG therefore replaces the complex fusion modules with stacked Transformer encoder layers to establish multimodal correspondence, which simplifies the model design while enabling sufficient interaction between the image and the text. Unlike previous studies, TransVG formulates visual grounding as a direct regression of the bounding box coordinates.
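
To make the described design concrete, the following is a minimal PyTorch-style sketch of a TransVG-like fusion stage: projected visual and linguistic tokens are concatenated with a learnable [REG] token, passed through stacked Transformer encoder layers, and the fused [REG] output is regressed to normalized box coordinates. The module name, dimensions, and layer counts are illustrative assumptions, not the exact configuration of TransVG.

```python
import torch
import torch.nn as nn

class FusionGroundingSketch(nn.Module):
    """Illustrative TransVG-style fusion stage (not the official implementation):
    visual and text tokens are projected to a shared width, joined with a
    learnable [REG] token, fused by stacked Transformer encoder layers, and the
    [REG] output is regressed to a normalized box (cx, cy, w, h)."""

    def __init__(self, visual_dim=256, text_dim=768, hidden_dim=256,
                 num_layers=6, num_heads=8):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # MLP head: directly regresses 4 normalized box coordinates
        self.box_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4))

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, visual_dim), e.g. a flattened backbone feature map
        # text_tokens:   (B, Nt, text_dim),   e.g. language-model token embeddings
        b = visual_tokens.size(0)
        tokens = torch.cat([
            self.reg_token.expand(b, -1, -1),
            self.visual_proj(visual_tokens),
            self.text_proj(text_tokens),
        ], dim=1)
        fused = self.encoder(tokens)
        # Box regression from the fused [REG] token, constrained to [0, 1]
        return self.box_head(fused[:, 0]).sigmoid()

# Toy usage with random tensors standing in for real backbone outputs
model = FusionGroundingSketch()
box = model(torch.randn(2, 400, 256), torch.randn(2, 20, 768))
print(box.shape)  # torch.Size([2, 4])
```

Regressing the box from a single fused [REG] token reflects the point made above: grounding is cast as direct coordinate regression rather than ranking candidate regions.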

Experiments on four mainstream datasets verify that TransVG delivers efficient performance, and the grounding results are visualized.

TransVG: Feasibility Verification and Visualization of a Transformer-Based Visual Grounding Framework

