Visual Grounding: Connecting Language and Images
Visual grounding is a task that combines natural language with visual information, aiming to establish a connection between language and images. In simpler terms, it links a sentence to a picture so that a computer can understand how the two correspond. Visual grounding supports a variety of applications, such as image retrieval, natural language interaction, machine translation, and explainable AI.
Visual grounding typically encompasses two sub-tasks: visual question answering and visual-text alignment. In visual question answering, a question-answering system takes a provided image and a question and generates an answer. Visual-text alignment aims to align a given natural language description with the objects, scenes, actions, and so on within an image, producing the corresponding annotations or correspondences.
In visual question answering, the input is usually an image and a question, and the output is an answer. For instance, given an image and the question "What is this person doing?", the system might answer "This person is playing basketball." The challenge lies in understanding the visual content of the image, aligning the question with that content, and then generating the answer.
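The sketch below shows what this input-output pattern looks like in practice, using the Hugging Face `transformers` visual-question-answering pipeline. The model checkpoint, image filename, and example score are illustrative assumptions, not part of the original text.

```python
# A minimal VQA sketch, assuming a local image file and a
# publicly available VQA checkpoint.
from transformers import pipeline

# ViLT fine-tuned on VQA v2; any VQA-capable checkpoint works here.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# Ask a free-form question about the image (hypothetical filename).
result = vqa(image="basketball_game.jpg",
             question="What is this person doing?")

# The pipeline returns candidate answers with confidence scores,
# e.g. [{'answer': 'playing basketball', 'score': 0.87}, ...]
print(result[0]["answer"], result[0]["score"])
```

Note that the model returns a ranked list of short answers rather than a full sentence; turning the top answer into a fluent response like "This person is playing basketball" is a separate generation step.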
In visual-text alignment, the input typically consists of a natural language description and an image, and the output is the alignment between the description and the objects, scenes, actions, and so on within the image. For example, given the description "A man is riding a bicycle on the street" and a matching image, the output could localize the regions corresponding to the man, the bicycle, and the street, linking each phrase to its place in the picture. The difficulty stems from aligning the natural language description with the image content in order to produce these correspondences.
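One common way to realize this alignment is to score candidate image regions against a phrase in a joint text-image embedding space. The sketch below uses CLIP for this; the image path and the candidate boxes are illustrative assumptions, and in practice the boxes would come from an object detector or a region proposal network.

```python
# A minimal phrase-grounding sketch, assuming candidate regions are given.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # hypothetical filename
phrase = "a man riding a bicycle"

# Hypothetical candidate regions as (left, upper, right, lower) boxes.
boxes = [(0, 0, 200, 300), (180, 50, 420, 360), (400, 100, 640, 400)]
crops = [image.crop(box) for box in boxes]

# Embed the phrase once and every candidate crop, then compare with
# cosine similarity in CLIP's shared text-image space.
inputs = processor(text=[phrase], images=crops,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
image_embs = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
similarities = (image_embs @ text_emb.T).squeeze(-1)

# The best-scoring box is taken as the grounding of the phrase.
best = similarities.argmax().item()
print(f"Phrase '{phrase}' grounded to box {boxes[best]} "
      f"(similarity {similarities[best]:.3f})")
```

This region-scoring approach is only one possible design; end-to-end grounding models instead predict boxes directly from the sentence, but the crop-and-compare version makes the alignment idea easy to see.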