In this paper, we contribute to the analysis, evaluation, and mitigation of bias in vision-and-language tasks by annotating six attributes, four demographic (age, gender, skin-tone, ethnicity) and two contextual (emotion, activity), in a large dataset: Google Conceptual Captions (GCC) [43], one of the first automatically crawled datasets, containing 3.3 million image-caption pairs.

We name our annotations PHASE (Perceived Human Annotations for Social Evaluation) and use them to conduct a comprehensive analysis of the distribution of demographic attributes in the GCC dataset. We complement our findings with experiments on three main vision-and-language tasks: image captioning, text-image embeddings, and text-to-image generation. Overall, we found that a dataset crawled from the internet like GCC presents large imbalances across all the demographic attributes under analysis. Moreover, when compared against the demographic annotations on MSCOCO by Zhao et al. [60], GCC exhibits larger representation gaps in gender and skin-tone. As for the downstream tasks, all three show evidence of performance disparities across demographic groups.
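To make the notion of representation imbalance concrete, the sketch below computes a simple max-to-min frequency ratio per attribute over a set of per-image labels. The label values and data layout are hypothetical placeholders, not the actual PHASE annotation schema; the ratio is just one of several possible imbalance measures.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the most- to least-frequent group for one attribute.

    A value of 1.0 means the groups are perfectly balanced; larger
    values indicate a larger representation gap.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy per-image labels (hypothetical values, not real PHASE data)
gender = ["man", "man", "man", "woman"]
skin_tone = ["lighter", "lighter", "lighter", "darker"]

print(imbalance_ratio(gender))     # 3.0
print(imbalance_ratio(skin_tone))  # 3.0
```

In a real audit one would compute this (or an entropy-based measure) per attribute and compare values across datasets, e.g. GCC versus MSCOCO.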
