ViLT: Vision-and-Language Transformer for Image Captioning with Unaligned Data, by Liunian Harold Li et al. (2021)

"ViLT: Vision-and-Language Transformer for Image Captioning with Unaligned Data" is a research paper written by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. It was published in 2021.

The paper introduces a model called ViLT, which stands for Vision-and-Language Transformer. ViLT is designed for image captioning and is trained on unaligned data, meaning that the training images and texts are not explicitly paired with one another. The authors argue that this approach is more practical and scalable, because aligned image-text data is difficult and expensive to obtain.

ViLT combines both visual and textual inputs using a transformer-based architecture. It leverages a pre-trained ViT (Vision Transformer) model for visual feature extraction and a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model for text embeddings. These features are then fed into a transformer decoder to generate captions for the given images.
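The generation step described above can be illustrated with a minimal sketch. It assumes greedy decoding, which the summary does not specify, and it uses hypothetical names throughout: `greedy_caption`, `score_next`, and `toy_scorer` are illustrative stand-ins, not functions from the paper or from any library. The `toy_scorer` replaces a real transformer decoder with a fixed script so that the loop can run on its own.

```python
from typing import Callable, Dict, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end markers

def greedy_caption(
    visual_features: Sequence[float],
    score_next: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Generate a caption one token at a time, always taking the top-scoring token.

    `score_next` stands in for the decoder: given the visual features and the
    tokens generated so far, it returns a score for each candidate next token.
    """
    tokens = [BOS]
    for _ in range(max_len):
        scores = score_next(visual_features, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start marker

def toy_scorer(visual_features: Sequence[float], prefix: List[str]) -> Dict[str, float]:
    """Toy stand-in for a decoder: ignores the image and follows a fixed script."""
    script = ["a", "dog", "runs", EOS]
    step = len(prefix) - 1  # number of tokens generated so far
    target = script[min(step, len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in script}

caption = greedy_caption([0.1, 0.2], toy_scorer)
print(" ".join(caption))
```

In a real system the scorer would be a trained decoder conditioned on the visual features, and beam search is a common alternative to the greedy choice shown here.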

To train ViLT, the authors propose a learning framework called UniVL, which stands for Universal Vision-and-Language pre-training. UniVL has two stages: pre-training and fine-tuning. In the pre-training stage, ViLT is trained on several image-text retrieval tasks to learn joint representations of images and text. In the fine-tuning stage, the pre-trained model is trained further on captioning data to improve its captioning performance.
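To make the idea of image-text retrieval over a joint representation concrete, here is a minimal sketch that ranks candidate texts for one image by cosine similarity. It is an assumption-laden illustration: the summary does not state which similarity measure or retrieval objective is used, and the function names `cosine` and `rank_texts`, as well as the two-dimensional embeddings, are hypothetical.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_texts(image_emb: Sequence[float], text_embs: List[Sequence[float]]) -> List[int]:
    """Return indices of candidate texts, most similar to the image first."""
    scores = [cosine(image_emb, t) for t in text_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical embeddings in a shared 2-D space.
image = [1.0, 0.0]
texts = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(rank_texts(image, texts))
```

A retrieval pre-training task would adjust the encoders so that the text belonging with an image ranks first in a list like this one.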

The authors evaluate ViLT on several benchmark datasets for image captioning, including COCO and Conceptual Captions. The experimental results show that ViLT outperforms existing state-of-the-art models on these datasets, demonstrating its effectiveness in generating accurate and coherent captions for images.

Overall, the paper presents ViLT as an image captioning model that can be trained on unaligned data, which makes it more practical and scalable for real-world applications.
