Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Hadi Mohammadi1, Tina Shahedi1, Pablo Mosteiro Romero1, Massimo Poesio2, Ayoub Bagheri1, Anastasia Giachanou1
1Utrecht University, The Netherlands 2Queen Mary University of London, UK
Workshop on Gender Bias in Natural Language Processing (GeBNLP), ACL 2025

Abstract

Large Language Models (LLMs) are increasingly used for data annotation in NLP, particularly for subjective tasks such as hate speech detection, where human annotation is expensive and potentially traumatic. However, the reliability of LLM-generated annotations, especially in the context of demographic biases and model explanations, remains understudied. This paper presents a comprehensive evaluation of LLM annotation reliability for sexism detection, examining how demographic factors influence both annotation quality and explanation consistency.

Using a mixed-effects modeling approach on the EXIST 2021 dataset, we analyze annotations from multiple state-of-the-art LLMs (GPT-4, Claude, Llama) and compare them with human annotations. Our findings reveal significant variations in annotation reliability based on the demographic context of the content, with LLMs showing systematic biases in their predictions and explanations. We also discover that while LLMs can provide plausible explanations for their annotations, these explanations often lack consistency and may not reflect the actual decision-making process.

This work contributes to our understanding of when and how LLM annotations can be trusted, providing guidelines for researchers using LLMs for data annotation in sensitive domains. We propose a framework for assessing annotation reliability that considers both prediction accuracy and explanation quality, offering practical recommendations for improving LLM-based annotation pipelines.

Key Contributions

  • Reliability Assessment Framework: We develop a comprehensive framework for evaluating LLM annotation reliability that considers both accuracy and consistency across demographic groups.
  • Mixed-Effects Analysis: Using advanced statistical modeling, we quantify the impact of demographic factors on LLM annotation quality and identify systematic biases.
  • Explanation Evaluation: We analyze the quality and consistency of LLM-generated explanations, revealing important discrepancies between stated reasoning and actual predictions.
  • Practical Guidelines: Based on our findings, we provide concrete recommendations for using LLMs in annotation tasks, including strategies for bias mitigation and quality control.

Methodology

Experimental Design

Our study follows a four-step experimental design:

  1. Data Selection: We use stratified sampling from EXIST 2021 to ensure balanced representation across demographic groups.
  2. LLM Annotation: We collect annotations from GPT-4, Claude, and Llama using standardized prompts with and without demographic information.
  3. Human Baseline: Expert annotators provide ground truth labels and explanations for comparison.
  4. Statistical Analysis: Mixed-effects models capture the interactions between LLM type, demographic factors, and annotation quality (see the sketch after this list).
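
As a concrete illustration of step 4, the sketch below fits a linear mixed-effects model with statsmodels. The column names (`agreement`, `llm`, `target_demographic`, `item_id`), the file name, and the use of a continuous agreement score as the outcome are illustrative assumptions, not the paper's exact model specification.

```python
# Minimal mixed-effects sketch (assumed setup, not the paper's exact specification).
# Expects a long-format DataFrame with one row per (item, LLM) annotation:
#   agreement           - continuous agreement score with the human label (assumed outcome)
#   llm                 - which model produced the annotation (GPT-4, Claude, Llama)
#   target_demographic  - demographic group targeted by the content
#   item_id             - identifier of the annotated post (random-effect grouping)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("llm_annotations.csv")  # hypothetical file name

# Fixed effects for model and demographic context (plus their interaction),
# with random intercepts per item to account for repeated annotations of the same post.
model = smf.mixedlm(
    "agreement ~ C(llm) * C(target_demographic)",
    data=df,
    groups=df["item_id"],
)
result = model.fit()
print(result.summary())
```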

Evaluation Metrics

  • Agreement Metrics: Cohen's kappa and Krippendorff's alpha for inter-annotator agreement (an illustrative computation follows this list)
  • Bias Metrics: Demographic parity and equalized odds across different groups
  • Explanation Quality: Consistency, relevance, and faithfulness of generated explanations
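
The snippet below shows one way the agreement metrics and a demographic-parity gap could be computed. The toy labels, variable names, and the per-group parity calculation are illustrative assumptions, not the paper's implementation.

```python
# Illustrative computation of agreement and bias metrics (assumed setup).
import numpy as np
import krippendorff                       # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

human = np.array([1, 0, 1, 1, 0, 1])      # human labels (1 = sexist)
llm   = np.array([1, 0, 0, 1, 0, 1])      # LLM labels
group = np.array(["a", "a", "b", "b", "a", "b"])  # demographic group of the target

# Pairwise agreement between the LLM and the human annotator.
kappa = cohen_kappa_score(human, llm)

# Krippendorff's alpha over the two "annotators" (rows = annotators, columns = items).
alpha = krippendorff.alpha(
    reliability_data=np.vstack([human, llm]).astype(float),
    level_of_measurement="nominal",
)

# Demographic-parity gap: difference in positive-prediction rates across groups.
rates = {g: llm[group == g].mean() for g in np.unique(group)}
dp_gap = max(rates.values()) - min(rates.values())

print(f"kappa={kappa:.2f}, alpha={alpha:.2f}, demographic-parity gap={dp_gap:.2f}")
```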

Key Findings

1. Demographic Bias in Annotations

LLMs show significant bias in sexism detection depending on the perceived demographics of the people targeted by the content:

  • GPT-4 is 23% more likely to label content as sexist when the content targets women
  • Claude shows the most balanced performance across demographics
  • All models struggle with intersectional identities

2. Explanation Reliability

Analysis of LLM explanations reveals:

  • Only 67% of explanations are consistent with the actual prediction
  • Explanations often cite surface features rather than semantic content
  • Models generate more detailed explanations for false positives than for true positives

3. Model Comparison

Comparative analysis shows:

  • GPT-4: Highest accuracy (0.84 F1) but most demographically biased
  • Claude: Best balance of accuracy (0.81 F1) and fairness
  • Llama: Most consistent explanations but lower accuracy (0.76 F1)

Implications and Recommendations

Based on our findings, we recommend:

  1. Multi-Model Ensemble: Use multiple LLMs and aggregate their predictions to reduce individual model biases (see the sketch after this list)
  2. Demographic Awareness: Include demographic diversity checks in annotation quality control
  3. Explanation Validation: Don't rely solely on LLM explanations; validate with human review for critical applications
  4. Continuous Monitoring: Implement ongoing bias monitoring as LLMs are updated
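
As a simple illustration of recommendation 1, the sketch below aggregates labels from several LLMs by majority vote. The model names, the item identifiers, and the tie-breaking rule (defer ties to human review) are assumptions for illustration, not the paper's prescribed pipeline.

```python
# Minimal majority-vote aggregation across LLM annotators (illustrative sketch).
from collections import Counter
from typing import Optional

def aggregate_labels(labels: list[int]) -> Optional[int]:
    """Return the majority label, or None when there is a tie (flag for human review)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority: route the item to a human annotator
    return counts[0][0]

# Example: per-item labels from three models (e.g. GPT-4, Claude, Llama).
item_annotations = {
    "post_1": [1, 1, 0],
    "post_2": [0, 1, 1],
    "post_3": [1, 0, None],   # a model may abstain; drop missing labels first
}

for item, labels in item_annotations.items():
    votes = [label for label in labels if label is not None]
    print(item, aggregate_labels(votes))
```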

Citation

@inproceedings{mohammadi2025assessing,
  title={Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation},
  author={Mohammadi, Hadi and Shahedi, Tina and Mosteiro Romero, Pablo and Poesio, Massimo and Bagheri, Ayoub and Giachanou, Anastasia},
  booktitle={Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)},
  year={2025},
  organization={Association for Computational Linguistics}
}