The visual grounding task, also referred to as phrase localization, referring expression comprehension, or natural language object retrieval, aims to localize the object in an image that is described by a natural language expression. It requires a model to understand the language query and accurately predict the location of the referred object, and therefore spans both the computer vision and natural language processing modalities. Existing methods fall into two categories: two-stage and one-stage. Two-stage methods first generate candidate regions in the image and then rank them by their similarity to the query, whereas one-stage methods fuse visual and textual features directly at the object detection level and output the most probable region.
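To make the two-stage pipeline concrete, the following is a minimal sketch of its ranking step only; the module names, feature dimensions, and the assumption that region features come from an external detector are illustrative and do not correspond to any specific published model.

```python
# Illustrative sketch (not any specific published method): the ranking step of a
# two-stage visual grounding pipeline. Region proposals are assumed to be produced
# by an external detector; here each candidate region is only scored against the query.
import torch
import torch.nn as nn


class RegionQueryRanker(nn.Module):
    """Scores pre-extracted region features against a pooled sentence embedding."""

    def __init__(self, region_dim=2048, query_dim=768, hidden_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        # MLP-style similarity head, in the spirit of early two-stage approaches.
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, region_features, query_embedding):
        # region_features: (num_regions, region_dim); query_embedding: (query_dim,)
        r = self.region_proj(region_features)                        # (N, H)
        q = self.query_proj(query_embedding).expand_as(r)            # (N, H)
        scores = self.scorer(torch.cat([r, q], dim=-1)).squeeze(-1)  # (N,)
        return scores


ranker = RegionQueryRanker()
regions = torch.randn(36, 2048)                 # e.g. 36 detector proposals
query = torch.randn(768)                        # pooled sentence embedding
best_region = ranker(regions, query).argmax()   # index of the highest-scoring proposal
```

The key point of the sketch is that the model never predicts a box itself: its accuracy is capped by whatever regions the external detector happens to propose.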

The core challenge in visual grounding lies in fusing and reasoning over multimodal information. Early work addressed this with simple mechanisms, such as MLP-based similarity networks or directly encoding a language vector into the visual features. More recent work has introduced increasingly elaborate structures, including modular attention networks, graph-based models, and multimodal tree structures, to establish better cross-modal correspondences. However, such fusion modules rely on manually designed, task-specific structures and tend to overfit to particular dataset characteristics, such as certain query lengths and relation patterns, which ultimately restricts the interaction between visual and linguistic information.
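As a point of reference for the "simple mechanisms" mentioned above, the snippet below sketches the idea of directly encoding a language vector into visual features by broadcasting and concatenation; the function name, shapes, and dimensions are assumptions made for illustration only.

```python
# Illustrative sketch of a simple fusion mechanism: tile a sentence embedding over
# every spatial location of a visual feature map and concatenate along channels.
# All names and shapes are placeholder assumptions for this example.
import torch


def concat_fusion(visual_map, language_vec):
    """visual_map: (C, H, W) feature map; language_vec: (D,) sentence embedding."""
    _, h, w = visual_map.shape
    tiled = language_vec[:, None, None].expand(-1, h, w)  # (D, H, W)
    return torch.cat([visual_map, tiled], dim=0)          # (C + D, H, W)


fused = concat_fusion(torch.randn(256, 20, 20), torch.randn(512))
print(fused.shape)  # torch.Size([768, 20, 20])
```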

Moreover, prior methods decompose the ultimate goal of localizing the described object into a sequence of surrogate sub-goals, which can degrade performance. Two-stage methods first generate candidate regions and then select the one most similar to the query, while one-stage methods rely on densely predefined anchors to propose regions containing the object and then output the region that best matches the query. In both cases, performance is bounded by the quality of the candidate regions or the predefined anchors.

To address these issues, this paper proposes TransVG, a transformer-based framework for visual grounding. TransVG replaces structured fusion modules with a simple stack of transformer encoder layers, leveraging the attention mechanism to establish both intra-modality and inter-modality correspondences. Moreover, instead of relying on candidate regions or predefined anchors, TransVG directly regresses the four-dimensional coordinates of the referred object.
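A highly simplified sketch of this design is given below. It reflects only the overall data flow described in this paper, namely visual and linguistic token streams fused by stacked transformer encoder layers and a learnable token whose output is regressed to a four-dimensional box; the module names, backbones, and dimensions are placeholder assumptions rather than the released implementation.

```python
# Minimal sketch of a TransVG-style grounding head, assuming pre-extracted visual
# and linguistic token embeddings. It mirrors only the data flow described above
# (transformer-encoder fusion + direct 4-D box regression); it is not the authors'
# released code, and all names/dimensions are illustrative.
import torch
import torch.nn as nn


class GroundingTransformer(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Learnable regression token whose fused representation is mapped to a box.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4),            # (cx, cy, w, h), normalized to [0, 1]
        )

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, dim); text_tokens: (B, Nt, dim)
        b = visual_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        tokens = torch.cat([reg, visual_tokens, text_tokens], dim=1)  # (B, 1+Nv+Nt, dim)
        fused = self.fusion(tokens.transpose(0, 1)).transpose(0, 1)   # encoder expects (S, B, dim)
        return self.box_head(fused[:, 0]).sigmoid()                   # (B, 4) box prediction


model = GroundingTransformer()
boxes = model(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(boxes.shape)  # torch.Size([2, 4])
```

Because the box is regressed directly from the fused token, no proposal generation or anchor matching is needed at any point in the pipeline.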

This paper evaluates TransVG on four widely used datasets and compares it with state-of-the-art methods, demonstrating its feasibility and effectiveness. It further provides visualizations that offer insight into TransVG's behavior.
