Evaluating Bias in CLIP Embeddings for Demographic Attributes
This study investigates potential bias in pre-trained text-image CLIP embeddings [37] with respect to different demographic attributes. CLIP, a dual architecture with a text encoder and an image encoder, learns image and text embeddings by predicting matching pairs from a massive dataset of 400 million image-text pairs. This extensive training allows the encoders to learn correspondences between high-level semantics in the language and visual modalities.
Objective: Our primary goal is to determine whether the demographic attributes of individuals depicted in images influence the accuracy of CLIP embeddings.
Methodology:
1. Embedding Extraction: We employ pre-trained CLIP encoders to extract both image and text embeddings.
2. Similarity Ranking: For each of the 4,614 images in our validation set, we rank the validation captions by the cosine similarity between their embeddings and the corresponding image embedding.
3. Accuracy Analysis: We then analyze the accuracy of this ranked list to identify potential discrepancies across different demographic attributes.
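The ranking step above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes the image and caption embeddings have already been extracted by the CLIP encoders and are available as NumPy arrays.

```python
import numpy as np

def rank_captions(image_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """For each image embedding, return caption indices sorted by
    descending cosine similarity (column 0 = best-matching caption)."""
    # L2-normalize so the dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = img @ txt.T                # (n_images, n_captions) similarity matrix
    return np.argsort(-sims, axis=1)  # highest-similarity caption first

# Toy example: 3 images, 3 captions; caption i is a near-copy of image i's
# embedding, so it should be ranked first for image i.
rng = np.random.default_rng(0)
img_embs = rng.normal(size=(3, 8))
txt_embs = img_embs + 0.01 * rng.normal(size=(3, 8))
ranking = rank_captions(img_embs, txt_embs)
print(ranking[:, 0])  # top-ranked caption index per image
```

In the actual evaluation the similarity matrix would be 4,614 x 4,614, but the ranking logic is identical.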
Metrics: We use Recall@k (R@k) with k = {1, 5, 10} to evaluate the accuracy of the image-text embeddings. R@k is the percentage of images for which the correct matching caption is ranked within the top-k positions. By comparing R@k across the different classes of the same attribute, we aim to ascertain whether CLIP embeddings exhibit varying performance levels. Ideally, unbiased representations should yield similar R@k performance across classes of each attribute.
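A per-class R@k computation can be sketched as below. The class labels and toy data here are hypothetical placeholders, not the paper's dataset; the point is only how R@k is split by demographic class.

```python
import numpy as np

def recall_at_k(rankings: np.ndarray, correct: np.ndarray, k: int) -> float:
    """Fraction of images whose ground-truth caption appears in the top-k.

    rankings: (n_images, n_captions) caption indices sorted best-first.
    correct:  (n_images,) index of the ground-truth caption per image.
    """
    hits = (rankings[:, :k] == correct[:, None]).any(axis=1)
    return float(hits.mean())

def recall_by_class(rankings, correct, labels, k):
    """R@k computed separately for each demographic class label."""
    return {c: recall_at_k(rankings[labels == c], correct[labels == c], k)
            for c in np.unique(labels)}

# Toy example: 4 images, 2 hypothetical classes "A" and "B".
rankings = np.array([[0, 1, 2],   # image 0: correct caption ranked 1st
                     [0, 1, 2],   # image 1: correct caption ranked 2nd
                     [2, 0, 1],   # image 2: correct caption ranked 1st
                     [1, 2, 0]])  # image 3: correct caption ranked 3rd
correct = np.array([0, 1, 2, 0])
labels = np.array(["A", "A", "B", "B"])
print(recall_by_class(rankings, correct, labels, k=1))
```

A large gap between the per-class R@k values returned here would be the kind of discrepancy the evaluation looks for.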
Results: Table 4 presents the results of our evaluation. We observe noticeable performance differences within each of the four demographic attributes, suggesting potential bias. While the varying sample sizes across classes could influence these results, our supplementary material includes an evaluation with equal sample sizes per class, confirming that our conclusions remain consistent.
Key Findings: Our analysis reveals significant performance discrepancies in CLIP embeddings across different demographic attributes, indicating potential bias in the learned representations.
Future Work: Further investigation is required to understand the underlying reasons for this observed bias and develop mitigation strategies to ensure fairness and accuracy in CLIP's application across diverse demographics.