The widespread use of Large Language Models (LLMs) has raised concerns about the potential misuse of AI-generated text in academic, journalistic, and creative contexts. While various detection methods have been developed to identify AI-generated content, they often struggle when text has been modified or deliberately crafted to evade them. This paper presents a novel approach that leverages explainability techniques to strategically modify AI-generated text, making it less detectable while preserving semantic meaning and readability.
Our method uses attribution-based explainability to identify tokens that most strongly signal AI generation to detection models. By replacing these high-attribution tokens with semantically similar alternatives, we can significantly reduce detectability while maintaining text coherence. We evaluate our approach against multiple state-of-the-art AI text detectors, demonstrating that our method reduces detection accuracy by up to 45% while preserving 92% of the original semantic content.
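A minimal sketch of the attribution-guided replacement idea is shown below. It assumes a Hugging Face sequence-classification detector and a masked language model that share a RoBERTa vocabulary; the checkpoint names, the gradient-times-input attribution, the choice of class index 0 as the "machine-generated" label, and the top-k replacement heuristic are illustrative assumptions, not the exact configuration used in the paper.

```python
# Sketch: rank tokens by gradient-x-input attribution of the detector's
# "machine-generated" logit, then swap the top-k tokens for MLM-suggested
# alternatives. Checkpoints, class index, and k are assumptions.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          AutoModelForMaskedLM)

DETECTOR = "roberta-base-openai-detector"  # assumed detector checkpoint
FILLER = "roberta-base"                    # assumed masked LM (same vocab)

det_tok = AutoTokenizer.from_pretrained(DETECTOR)
detector = AutoModelForSequenceClassification.from_pretrained(DETECTOR)
mlm = AutoModelForMaskedLM.from_pretrained(FILLER)

def token_attributions(text: str):
    """Per-token |gradient . input| attribution of the detector's AI logit."""
    enc = det_tok(text, return_tensors="pt", truncation=True)
    embeds = detector.get_input_embeddings()(enc["input_ids"])
    embeds.retain_grad()
    logits = detector(inputs_embeds=embeds,
                      attention_mask=enc["attention_mask"]).logits
    logits[0, 0].backward()  # assume class 0 = "machine-generated"
    scores = (embeds.grad * embeds).sum(-1).abs().squeeze(0)
    return enc["input_ids"][0], scores

def replace_high_attribution_tokens(text: str, k: int = 5) -> str:
    """Mask the k highest-attribution tokens and fill with MLM alternatives."""
    ids, scores = token_attributions(text)
    ids = ids.clone()
    inner = scores[1:-1]  # skip <s> and </s>
    top = inner.topk(min(k, inner.numel())).indices + 1
    for pos in top.tolist():
        masked = ids.clone()
        masked[pos] = det_tok.mask_token_id
        with torch.no_grad():
            preds = mlm(input_ids=masked.unsqueeze(0)).logits[0, pos]
        # Take the highest-probability substitute that differs from the original.
        for cand in preds.topk(5).indices.tolist():
            if cand != ids[pos].item():
                ids[pos] = cand
                break
    return det_tok.decode(ids, skip_special_tokens=True)
```

In a full pipeline one would additionally verify, for example with a sentence-embedding similarity threshold, that each substitution preserves the original meaning before accepting it, mirroring the semantic-preservation constraint described above.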
This research has important implications for understanding the robustness of AI text detection systems and highlights the need for more sophisticated detection methods that go beyond surface-level features. We discuss the ethical considerations of this work and propose guidelines for responsible use of such techniques.