Sexism, discrimination based on a person's gender, is increasingly prevalent on social media platforms, where it often manifests as hate speech targeting individuals or groups. While machine learning models can detect such content, their "black box" nature obscures their decision-making process, making it difficult for users and moderators to understand why a particular post is flagged as sexist.
This paper addresses the need for transparency in automated sexism detection by proposing an explainable pipeline that combines accurate classification with interpretable explanations. We demonstrate that incorporating explainability techniques such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) maintains high detection accuracy while also providing insight into model behavior, revealing which words and phrases most strongly indicate sexist content.
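As a concrete illustration of the word-level explanations described above, the following sketch applies LIME to a toy TF-IDF and logistic-regression classifier. The example texts, labels, and model choice are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch: using LIME to surface the tokens that drive a text
# classifier's "sexist" prediction. Dataset and model are illustrative only.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for a labelled corpus such as EXIST 2021.
texts = [
    "women belong in the kitchen",        # sexist
    "great match by the national team",   # not sexist
    "she is too emotional to lead",       # sexist
    "the weather is lovely today",        # not sexist
]
labels = [1, 0, 1, 0]  # 1 = sexist, 0 = not sexist

# Simple TF-IDF + logistic regression classifier as a stand-in model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# LIME perturbs the input text and fits a local surrogate model to
# estimate each word's contribution to the predicted class.
explainer = LimeTextExplainer(class_names=["not sexist", "sexist"])
explanation = explainer.explain_instance(
    "she is too emotional for this job",
    model.predict_proba,
    num_features=5,
)

# Word-level weights: positive values push the prediction toward "sexist".
print(explanation.as_list())
```

The output is a ranked list of (word, weight) pairs, which is the kind of per-prediction explanation the pipeline is intended to expose to end users.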
Our comprehensive evaluation on the EXIST 2021 dataset shows that our transparent approach achieves an F1-score of 0.82 while providing clear, understandable explanations for each prediction. This dual focus on accuracy and interpretability makes our system particularly suitable for real-world deployment, where understanding the reasoning behind content moderation decisions is crucial for both platform operators and users.
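For reference, the reported F1-score is the harmonic mean of precision and recall:

\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]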