Large Language Models (LLMs) are increasingly used for data annotation in NLP, particularly for subjective tasks such as hate speech detection, where human annotation is expensive and potentially traumatic for annotators. However, the reliability of LLM-generated annotations, especially with respect to demographic biases and model explanations, remains understudied. This paper presents a comprehensive evaluation of the reliability of LLM annotations for sexism detection, examining how demographic factors influence both annotation quality and explanation consistency.
Using a mixed-effects modeling approach on the EXIST 2021 dataset, we analyze annotations from multiple state-of-the-art LLMs (GPT-4, Claude, Llama) and compare them with human annotations. Our findings reveal significant variation in annotation reliability depending on the demographic context of the content, with LLMs exhibiting systematic biases in both their predictions and their explanations. We also find that while LLMs can produce plausible explanations for their annotations, these explanations are often inconsistent and may not reflect the models' actual decision-making process.
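To make the modeling setup concrete, the sketch below shows one way such a mixed-effects analysis could be set up, assuming an annotation table with one row per (tweet, model) pair; the file name and column names (agreement, demographic_group, model, tweet_id) are illustrative assumptions, not the paper's actual variables.

```python
# Minimal sketch of a mixed-effects analysis of LLM annotation agreement.
# All file and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# Each row: one LLM annotation of one EXIST 2021 tweet.
# agreement: 1 if the LLM label matches the human gold label, else 0.
df = pd.read_csv("llm_annotations.csv")  # hypothetical input file

# Linear-probability mixed model: fixed effects for demographic context and
# annotating model, random intercepts per tweet. A logistic mixed model
# would be a closer fit for a binary outcome; the linear form keeps the
# sketch short.
model = smf.mixedlm(
    "agreement ~ C(demographic_group) + C(model)",
    data=df,
    groups="tweet_id",
)
result = model.fit()
print(result.summary())
```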
This work contributes to our understanding of when and how LLM annotations can be trusted, and provides guidelines for researchers using LLMs for data annotation in sensitive domains. We propose a framework for assessing annotation reliability that considers both prediction accuracy and explanation quality, and we offer practical recommendations for improving LLM-based annotation pipelines.
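As a rough illustration of how such a framework might combine the two components, the sketch below scores reliability as a weighted sum of label agreement (Cohen's kappa against human labels) and explanation consistency across repeated runs; the TF-IDF similarity measure, the helper names, and the equal weighting are assumptions made for this example, not the paper's exact formulation.

```python
# Hypothetical combined reliability score: prediction accuracy plus
# explanation consistency. Function names and weighting are illustrative.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def explanation_consistency(explanations: list[str]) -> float:
    """Mean pairwise cosine similarity of TF-IDF vectors for repeated explanations."""
    if len(explanations) < 2:
        return 1.0
    vectors = TfidfVectorizer().fit_transform(explanations)
    sims = [
        cosine_similarity(vectors[i], vectors[j])[0, 0]
        for i, j in combinations(range(len(explanations)), 2)
    ]
    return float(np.mean(sims))


def reliability_score(llm_labels, human_labels, repeated_explanations, w=0.5):
    """Weighted combination of label agreement (kappa) and explanation consistency.

    repeated_explanations: one list of explanation strings per item,
    gathered from repeated prompting of the same model.
    """
    kappa = cohen_kappa_score(human_labels, llm_labels)
    consistency = np.mean([explanation_consistency(e) for e in repeated_explanations])
    return w * kappa + (1 - w) * consistency
```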