This is the abstract of my paper; please translate it into English in a professional style: This paper aims to verify the feasibility of TransVG, a Transformer-based visual grounding framework, and to present its results visually. Current visual grounding methods fall into two categories: two-stage and one-stage methods. Both rely on manually designed, complex modules for query reasoning and multimodal fusion. However, mechanisms in the model design such as image scene graphs and query decomposition tend to cause overfitting and prevent full interaction between image and text. Therefore, TransVG draws…
This paper aims to verify the feasibility of TransVG, a Transformer-based visual grounding framework, and to visualize its behavior. Existing visual grounding methods fall into two categories: two-stage and one-stage methods. Both rely on manually designed, complex modules for query reasoning and multimodal fusion; mechanisms built into these designs, such as image scene graphs and query decomposition, are prone to overfitting and prevent full interaction between visual and linguistic features. TransVG therefore draws on the success of Transformers in vision tasks and replaces the complex fusion modules with a stack of Transformer encoder layers to establish multimodal correspondence, simplifying the model design while enabling thorough interaction between image and text. Unlike prior work, TransVG formulates visual grounding as a direct regression of bounding-box coordinates. Experiments on four mainstream datasets confirm that TransVG achieves stronger performance, and its mechanisms are presented through visualizations.
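The core idea described above, replacing hand-crafted fusion modules with stacked Transformer encoder layers and regressing box coordinates directly, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name VisualLinguisticFusion, the layer counts, the token dimensions, and the wiring of a learnable [REG] token into the regression head are assumptions made for illustration based on the description in the abstract.

```python
import torch
import torch.nn as nn

class VisualLinguisticFusion(nn.Module):
    """Minimal sketch of a TransVG-style fusion stage: a stack of
    Transformer encoder layers over concatenated visual tokens, text
    tokens, and a learnable [REG] token, followed by an MLP head that
    regresses normalized box coordinates from the [REG] output."""

    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           dim_feedforward=2048)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [REG] token
        # 3-layer MLP predicting (cx, cy, w, h), squashed to [0, 1]
        self.bbox_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4),
        )

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, dim) projected image features
        # text_tokens:   (B, Nt, dim) projected language features
        b = visual_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        tokens = torch.cat([reg, visual_tokens, text_tokens], dim=1)
        tokens = tokens.transpose(0, 1)            # (seq, batch, dim) layout
        fused = self.encoder(tokens)               # joint multimodal attention
        return self.bbox_head(fused[0]).sigmoid()  # box from the [REG] slot

# usage sketch: random features stand in for backbone outputs
model = VisualLinguisticFusion()
box = model(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(box.shape)  # torch.Size([2, 4])
```

In a full system the two inputs would come from a visual backbone and a language encoder, each projected to the shared embedding dimension; the sketch only shows how a plain encoder stack can replace a bespoke fusion module and yield a single regression target.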