Visual grounding, also known as phrase localization [1-2], referring expression comprehension [3-4], and natural language object retrieval [5-6], aims to localize in an image the object described by a natural language expression, and therefore involves both computer vision and natural language processing. A model must understand the language input and predict the described object accurately. Existing methods can be broadly divided into two-stage [1-2,7-8] and one-stage [9-11] methods. Two-stage methods first generate candidate regions in the image and then rank these regions by their similarity to the query, returning the best-matching region as the output box. One-stage methods fuse visual and textual features directly at the detection level and, based on predefined dense anchors, output the region most likely to contain the described object.
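To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch of its ranking step; the function name, feature dimensions, and cosine-similarity scoring are illustrative assumptions rather than the design of any specific cited method.

```python
import torch
import torch.nn.functional as F

def rank_regions(region_feats: torch.Tensor, query_feat: torch.Tensor) -> int:
    """Return the index of the candidate region most similar to the query.

    region_feats: (N, D) features of N candidate regions (e.g. from a detector)
    query_feat:   (D,)   sentence-level embedding of the referring expression
    """
    # Cosine similarity between each candidate region and the query
    sims = F.cosine_similarity(region_feats, query_feat.unsqueeze(0), dim=-1)  # (N,)
    return int(sims.argmax())

# Usage with dummy features: 100 candidate regions, 256-d embeddings
regions = torch.randn(100, 256)
query = torch.randn(256)
best = rank_regions(regions, query)
```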

Fusing and reasoning over the two modalities is the core problem in visual grounding. Early studies handled it in a straightforward way, as shown in Figure 1.1, with simple designs such as Similarity Net [8] and FAOA [10]. However, such designs can lead to suboptimal results. Subsequent work proposed new structures to achieve better performance, such as modular attention networks [12], graph structures [13-15], and multimodal tree structures [16], which establish better multimodal correspondences. Among one-stage methods, a recursive sub-query construction framework [17] was proposed to reduce reasoning confusion between the image and the query and to explore better query modeling.

However, these improvements can cause overfitting to the dataset and restrict the interaction between visual and linguistic information. The reason is that the complex multimodal fusion modules are tailored to particular image scenes or language queries and are built on specific predefined structures. The hand-crafted mechanisms in these fusion modules typically make the model overfit to particular situations in the dataset, such as queries of limited length or with certain relation patterns, which in turn prevents full interaction between vision and language.

Moreover, previous methods borrow ideas from object detection and decompose the ultimate goal, selecting the region where the described object is located, into several sub-goals, which may degrade performance. For example, two-stage methods first generate candidate regions that may contain objects [1-2,8] and then select, among these candidates, the region whose features are most similar to the query. One-stage methods follow YOLOv3 [18] and locate the object through predefined dense anchors [10], outputting the region that best matches the query. Since the target box in these methods is predicted from candidate boxes, performance is easily affected by the quality of the generated candidate regions or predefined anchors, and matching the target against these candidates may further limit performance.
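The dependence on predefined boxes can be illustrated with a small, self-contained sketch of IoU-based anchor matching; the anchor grid, sizes, and tensor values below are hypothetical and only meant to show that the prediction a model refines is capped by how well some anchor overlaps the target.

```python
import torch

def iou(boxes: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between a set of boxes (N, 4) and one ground-truth box (4,), xyxy format."""
    x1 = torch.maximum(boxes[:, 0], gt[0])
    y1 = torch.maximum(boxes[:, 1], gt[1])
    x2 = torch.minimum(boxes[:, 2], gt[2])
    y2 = torch.minimum(boxes[:, 3], gt[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_boxes + area_gt - inter)

# Dense anchors tiled on a 16-pixel stride grid, one fixed 64x64 box per cell
# (real detectors use several scales/ratios per cell; one suffices here)
centers = torch.arange(8, 512, 16, dtype=torch.float32)
cy, cx = torch.meshgrid(centers, centers, indexing="ij")
cx, cy = cx.reshape(-1), cy.reshape(-1)
anchors = torch.stack([cx - 32, cy - 32, cx + 32, cy + 32], dim=1)  # (1024, 4) xyxy

gt_box = torch.tensor([120.0, 80.0, 260.0, 220.0])

# The target can only be assigned to the best-overlapping anchor; if no anchor
# fits the object well, the starting box the model refines is already poor.
best_anchor = anchors[iou(anchors, gt_box).argmax()]
```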

TransVG [19] therefore proposes a new approach to address these problems. First, it introduces a transformer framework for the visual grounding task. Empirical evidence suggests that the previously designed structured fusion modules can be replaced by a simple stack of transformer encoder layers. Although this simplification predefines no specific fusion mechanism, the attention layers at the core of the transformer establish both intra-modality and inter-modality correspondences between vision and language. Second, for determining the target region, directly regressing the box coordinates is shown to outperform indirectly selecting a candidate that matches the query description: TransVG directly outputs a 4-dimensional coordinate box, rather than predicting on top of candidate boxes.
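As an illustration of this design, a minimal PyTorch sketch of transformer-based fusion with direct 4-d box regression might look as follows; this is not the authors' exact implementation, and the layer sizes, token counts, and the [REG]-token MLP head are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SimpleVGFusion(nn.Module):
    """Sketch of a TransVG-style fusion module (dimensions are illustrative).

    Visual tokens and language tokens are concatenated with a learnable [REG]
    token and processed by a stack of standard transformer encoder layers; the
    output [REG] token is mapped to a 4-d box (cx, cy, w, h) by an MLP.
    """

    def __init__(self, d_model: int = 256, num_layers: int = 6, nhead: int = 8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),  # normalized (cx, cy, w, h)
        )

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, D) flattened feature map; text_tokens: (B, Nt, D)
        b = visual_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        x = torch.cat([reg, visual_tokens, text_tokens], dim=1)
        x = self.encoder(x)                # attention mixes the two modalities
        return self.box_head(x[:, 0])      # (B, 4) box regressed from [REG]

# Usage with dummy features: 400 visual tokens (20x20 map) and 20 text tokens
model = SimpleVGFusion()
box = model(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
```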

This paper evaluates the transformer-based visual grounding framework TransVG on four widely used datasets and verifies its feasibility by comparing it with other state-of-the-art methods. On this basis, qualitative visualizations are also presented.
