The task of visual grounding, also known as phrase localization [1-2], referring expression comprehension [3-4], and natural language object retrieval [5-6], aims to localize the object described by a natural language expression in an image. The task spans the computer vision and natural language processing modalities and requires a model to understand the language query and predict the referred object accurately. Existing methods can be divided into two-stage methods [1-2,7-8] and one-stage methods [9-11]. Two-stage methods first generate candidate regions in the image, rank them by their similarity to the query, and output the best-matching region as the final result. One-stage methods fuse visual and textual features directly at the object detection level and predict the most likely bounding box over a set of predefined dense anchors, as sketched below.
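
To make the contrast between the two paradigms concrete, the following is a minimal, framework-agnostic Python sketch of their inference patterns. The `propose`, `score`, and `predict` callables are hypothetical placeholders for a region proposal module, a region-query similarity model, and a dense anchor-based grounding network, respectively; they do not correspond to any specific published implementation.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates


def two_stage_grounding(image, query: str,
                        propose: Callable[..., List[Box]],
                        score: Callable[..., float]) -> Box:
    """Two-stage paradigm: generate candidate regions, then rank them against the query."""
    candidates = propose(image)                        # e.g. boxes from an off-the-shelf detector
    similarities = [score(image, box, query) for box in candidates]
    best = max(range(len(candidates)), key=similarities.__getitem__)
    return candidates[best]                            # highest-scoring candidate is the prediction


def one_stage_grounding(image, query: str,
                        predict: Callable[..., Sequence[Tuple[float, Box]]]) -> Box:
    """One-stage paradigm: fuse visual-textual features densely, then pick the best anchor."""
    per_anchor = predict(image, query)                 # (confidence, decoded box) for every dense anchor
    _, box = max(per_anchor, key=lambda pair: pair[0])
    return box
```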

The core problem in visual grounding is the fusion of, and reasoning over, the two modalities. Prior work has proposed various structures to establish better multimodal correspondences, such as modular attention networks [12], graph structures [13-15], and multimodal tree structures [16]. However, these complex multimodal fusion modules are typically designed around specific, predefined structures of the image scene or the language query, which can lead to overfitting to particular dataset characteristics and limit the full interaction between visual and linguistic information.

Moreover, previous methods decompose the ultimate goal of localizing the object described by the query into several sub-goals, which can degrade performance. Two-stage methods first generate candidate regions that may contain objects [1-2,8] and then select the region whose features are most similar to the query features. One-stage methods follow the design of YOLOv3 [18] and locate the object through predefined dense anchors [10]. Because the target bounding box in these methods is predicted from candidate boxes or predefined anchors, model performance depends heavily on how the candidate regions or anchors are generated, and matching the target against these candidates further limits the achievable accuracy.

To address these issues, TransVG [19] introduces a transformer-based framework for visual grounding. Its transformer encoder layers replace the previously hand-designed structured fusion modules in a simple and uniform way, and the attention layers within the transformer establish correspondences across the visual and language modalities. Moreover, TransVG directly regresses the four coordinates of the object bounding box instead of predicting it from candidate boxes.
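
To illustrate this idea, below is a minimal PyTorch-style sketch of a transformer fusion-and-regression head in the spirit of TransVG. The class name, embedding dimension, and layer counts are illustrative assumptions rather than the exact TransVG configuration, and the visual and linguistic token embeddings are assumed to be produced by separate backbone branches.

```python
import torch
import torch.nn as nn


class TransformerGroundingHead(nn.Module):
    """A learnable [REG] token is concatenated with visual and language tokens,
    a transformer encoder fuses all modalities via self-attention, and an MLP
    regresses the 4-D box (cx, cy, w, h) directly from the [REG] output,
    without candidate boxes or predefined anchors."""

    def __init__(self, dim: int = 256, num_layers: int = 6, num_heads: int = 8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.bbox_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, 4))

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, dim) flattened feature-map tokens
        # text_tokens:   (B, Nt, dim) projected language embeddings
        batch = visual_tokens.size(0)
        reg = self.reg_token.expand(batch, -1, -1)
        tokens = torch.cat([reg, visual_tokens, text_tokens], dim=1)
        fused = self.encoder(tokens)
        # Normalized (cx, cy, w, h) regressed directly from the [REG] token.
        return self.bbox_head(fused[:, 0]).sigmoid()


# Example: a batch of 2 images with 400 visual tokens and 20 text tokens each.
head = TransformerGroundingHead()
boxes = head(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(boxes.shape)  # torch.Size([2, 4])
```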

The transformer-based visual grounding framework TransVG is evaluated on four widely used datasets, and its effectiveness is verified by comparing its results with those of other state-of-the-art methods. Building on these experiments, this paper also presents visualization analyses.
