GCC Dataset: A Large-Scale Benchmark for Vision-Language Models

The GCC dataset [43] contains about 3.3 million images collected from the Internet, each paired with an alt-text caption. When the dataset was built, images were filtered to remove pornography, and captions were post-processed to replace named entities with hypernyms, e.g. 'Harrison Ford' → 'actor'. No further filtering was applied to remove toxic content or to balance demographic representation. Owing to its size, GCC has been used to pre-train several vision-and-language models, including ViLBERT [31], VL-BERT [45], Unicoder-VL [28], UNITER [14], OSCAR [29], and ERNIE-ViL [59]. Its wide use in pre-training, combined with the absence of demographic curation, makes it an ideal testbed for studying how the representation of different demographic attributes affects downstream tasks.
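
The caption post-processing described above can be illustrated with a minimal sketch. It is not the pipeline used to build GCC: the lookup table `HYPERNYMS`, the function name `hypernymize`, and the exact-string matching are all assumptions made for illustration. A real system would presumably rely on a named-entity recognizer and a knowledge base rather than a hand-written dictionary.

```python
import re

# Hypothetical entity-to-hypernym table (an assumption for illustration);
# the mapping used to build GCC is not given in the text.
HYPERNYMS = {
    "Harrison Ford": "actor",
    "Eiffel Tower": "tower",
}

# Try longer names first so a long entity is not shadowed by a shorter one.
_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(HYPERNYMS, key=len, reverse=True))
)


def hypernymize(caption: str) -> str:
    """Replace each known named entity in `caption` with its hypernym."""
    return _PATTERN.sub(lambda match: HYPERNYMS[match.group(0)], caption)


print(hypernymize("Harrison Ford at the premiere"))  # actor at the premiere
```

Captions that contain no entry from the table pass through unchanged, so the transformation only removes identifying names and leaves the rest of the caption intact.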