The task of visual grounding, also referred to as phrase localization [1-2], referring expression comprehension [3-4], and natural language object retrieval [5-6], aims to localize the object described by a natural language expression with a bounding box in an image. The task lies at the intersection of computer vision and natural language processing, and requires a model to understand the natural language query in order to make accurate predictions about objects in the image. Existing methods can be broadly categorized into two-stage approaches [1-2,7-8] and single-stage approaches [9-11]. Two-stage methods first generate candidate regions in the image and then rank them by their similarity to the query, whereas single-stage methods fuse visual and textual features densely at the detection stage and output the most likely region from a set of pre-defined dense anchors, as sketched below.
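To make the contrast between the two paradigms concrete, the following minimal sketch shows how a two-stage method ranks pre-extracted proposals against a query embedding, while a single-stage method fuses the query into dense anchor features and refines the most confident anchor. The tensor names, shapes, and the single-layer prediction head are illustrative assumptions, not the implementation of any cited method.

```python
# Minimal sketch contrasting the two paradigms; tensor names, shapes and the
# single-layer prediction head are illustrative assumptions, not any cited model.
import torch
import torch.nn as nn

def two_stage_grounding(region_feats, region_boxes, query_embed):
    # region_feats: (R, D) features of pre-extracted proposals,
    # region_boxes: (R, 4) proposal boxes, query_embed: (D,) query embedding.
    scores = region_feats @ query_embed          # similarity of each proposal to the query
    return region_boxes[scores.argmax()]         # return the highest-scoring candidate box

class SingleStageHead(nn.Module):
    # Dense-anchor head: fuse the query into every anchor feature, then score
    # and refine the anchors in one shot.
    def __init__(self, dim):
        super().__init__()
        self.predict = nn.Linear(2 * dim, 5)     # 1 confidence + 4 box offsets per anchor

    def forward(self, anchor_feats, anchors, query_embed):
        # anchor_feats: (A, D), anchors: (A, 4), query_embed: (D,)
        tiled = query_embed.expand(anchor_feats.size(0), -1)
        out = self.predict(torch.cat([anchor_feats, tiled], dim=-1))  # (A, 5)
        conf, offsets = out[:, 0], out[:, 1:]
        best = conf.argmax()
        return anchors[best] + offsets[best]     # refine the most confident anchor
```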
The core problem in visual grounding is multi-modal fusion and reasoning. Early work addressed this problem in a relatively simple manner, as shown in Figure 1.1: Similarity Net [8] measures the similarity between region and expression embeddings with an MLP, and FAOA [10] directly encodes the language vector into the visual features. Although effective, such designs can lead to suboptimal results. Subsequent two-stage methods introduce new structures, such as modular attention networks [12], graph structures [13-15], and multi-modal tree structures [16], to better establish multi-modal correspondences, while in single-stage methods a recursive sub-query construction framework [17] performs multiple rounds of reasoning between the image and the query to reduce ambiguity during inference and to explore better query modeling.
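For illustration, the sketch below approximates the two simple fusion designs mentioned above. The module and function names and all dimensions are hypothetical assumptions rather than the original implementations of [8] or [10].

```python
# Illustrative approximations of the two simple fusion designs above; names and
# dimensions are hypothetical, not the original code of Similarity Net [8] or FAOA [10].
import torch
import torch.nn as nn

class SimilarityMLP(nn.Module):
    # Concatenate a region feature with the expression embedding and let a
    # small MLP output a matching score (Similarity-Net-style scoring).
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, region_feat, expr_embed):
        return self.mlp(torch.cat([region_feat, expr_embed], dim=-1))  # scalar score

def language_conditioned_fuse(visual_map, lang_vec):
    # FAOA-style fusion: broadcast the sentence vector over the spatial grid and
    # concatenate it with the visual feature map along the channel dimension.
    # visual_map: (B, C, H, W), lang_vec: (B, D) -> fused map: (B, C + D, H, W)
    b, _, h, w = visual_map.shape
    tiled = lang_vec[:, :, None, None].expand(b, lang_vec.size(1), h, w)
    return torch.cat([visual_map, tiled], dim=1)
```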
However, these increasingly complex multi-modal fusion modules are built on specific pre-defined structures, which may cause the model to overfit the dataset and restrict the full interaction between visual and linguistic information. In particular, the handcrafted mechanisms inside the fusion modules tend to overfit particular properties of the dataset, such as a limited range of query lengths or query relationships, which further limits the interaction between the two modalities. Moreover, previous methods borrow ideas from object detection and decompose the ultimate goal, selecting the region described by the query, into sub-goals, which may degrade performance. For example, two-stage approaches first generate candidate regions that may contain objects [1-2,8] and then select the region most similar to the query feature, while single-stage approaches, inspired by YOLOv3 [18], locate the object through pre-defined dense anchors [10] and output the region that best matches the query description. Since the target box in these methods is predicted from candidate boxes, performance is sensitive to the quality of the generated proposals or the pre-defined anchors, and matching the object against these candidate boxes further limits performance.
To address these issues, TransVG [19] proposes a new formulation. First, it introduces the transformer framework into visual grounding: empirical evidence suggests that the previously handcrafted, structured fusion modules can be readily replaced by a simple stack of transformer encoder layers. While simplifying the module, the attention layers at the core of the transformer establish both intra- and inter-modality correspondences between vision and language without defining any specific fusion mechanism. Second, for selecting the target region, experiments show that directly regressing the box coordinates is superior to indirectly selecting the box of the object described by the query. TransVG therefore directly outputs the 4-dimensional box coordinates instead of making predictions based on candidate boxes.
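As a rough illustration of this design, the sketch below projects visual and linguistic tokens into a shared embedding space, prepends a learnable [REG] token, applies a stack of standard transformer encoder layers, and regresses the normalized box from the [REG] output. The dimensions and layer counts are assumed for illustration; this is not the released TransVG implementation.

```python
# A simplified sketch of a TransVG-style fusion and regression stage under
# assumed dimensions and layer counts; not the released TransVG code.
import torch
import torch.nn as nn

class TransVGSketch(nn.Module):
    def __init__(self, vis_dim=2048, lang_dim=768, d_model=256, num_layers=6):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)    # project visual tokens
        self.lang_proj = nn.Linear(lang_dim, d_model)  # project language tokens
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [REG] token
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4))       # box regression MLP

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens: (B, Nv, vis_dim) flattened visual features,
        # lang_tokens: (B, Nl, lang_dim) token embeddings of the query.
        b = vis_tokens.size(0)
        tokens = torch.cat([self.reg_token.expand(b, -1, -1),
                            self.vis_proj(vis_tokens),
                            self.lang_proj(lang_tokens)], dim=1)
        fused = self.encoder(tokens)                 # intra- and inter-modality attention
        return self.box_head(fused[:, 0]).sigmoid()  # normalized (cx, cy, w, h)
```

In such a design, every visual and linguistic token can attend to every other token, so the correspondences are learned by attention rather than imposed by a handcrafted fusion structure, and the [REG] token directly yields the regression target.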
This paper evaluates the transformer-based visual grounding framework TransVG on four widely used datasets, compares its results with other state-of-the-art methods to verify its effectiveness, and presents visualization analyses based on these results.