The widespread use of Large Language Models (LLMs) has raised concerns about the potential misuse of AI-generated text in academic, journalistic, and creative contexts. While various detection methods have been developed to identify AI-generated content, they often struggle when text has been modified or deliberately crafted to evade them. This paper presents a novel approach that leverages explainability techniques to strategically modify AI-generated text, making it less detectable while preserving semantic meaning and readability.
Our method uses attribution-based explainability to identify tokens that most strongly signal AI generation to detection models. By replacing these high-attribution tokens with semantically similar alternatives, we can significantly reduce detectability while maintaining text coherence. We evaluate our approach against multiple state-of-the-art AI text detectors, demonstrating that our method reduces detection accuracy by up to 45% while preserving 92% of the original semantic content.
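A minimal sketch of the attribution-guided replacement idea is shown below. It assumes a Hugging Face sequence-classification detector and a masked language model that share a RoBERTa vocabulary; the checkpoint names, the gradient-times-input attribution, the choice of class index 0 as the "machine-generated" label, and the top-k replacement heuristic are illustrative assumptions, not the exact configuration used in the paper.

```python
# Sketch: rank tokens by gradient-x-input attribution of the detector's
# "machine-generated" logit, then swap the top-k tokens for MLM-suggested
# alternatives. Checkpoints, class index, and k are assumptions.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          AutoModelForMaskedLM)

DETECTOR = "roberta-base-openai-detector"  # assumed detector checkpoint
FILLER = "roberta-base"                    # assumed masked LM (same vocab)

det_tok = AutoTokenizer.from_pretrained(DETECTOR)
detector = AutoModelForSequenceClassification.from_pretrained(DETECTOR)
mlm = AutoModelForMaskedLM.from_pretrained(FILLER)

def token_attributions(text: str):
    """Per-token |gradient . input| attribution of the detector's AI logit."""
    enc = det_tok(text, return_tensors="pt", truncation=True)
    embeds = detector.get_input_embeddings()(enc["input_ids"])
    embeds.retain_grad()
    logits = detector(inputs_embeds=embeds,
                      attention_mask=enc["attention_mask"]).logits
    logits[0, 0].backward()  # assume class 0 = "machine-generated"
    scores = (embeds.grad * embeds).sum(-1).abs().squeeze(0)
    return enc["input_ids"][0], scores

def replace_high_attribution_tokens(text: str, k: int = 5) -> str:
    """Mask the k highest-attribution tokens and fill with MLM alternatives."""
    ids, scores = token_attributions(text)
    ids = ids.clone()
    inner = scores[1:-1]  # skip <s> and </s>
    top = inner.topk(min(k, inner.numel())).indices + 1
    for pos in top.tolist():
        masked = ids.clone()
        masked[pos] = det_tok.mask_token_id
        with torch.no_grad():
            preds = mlm(input_ids=masked.unsqueeze(0)).logits[0, pos]
        # Take the highest-probability substitute that differs from the original.
        for cand in preds.topk(5).indices.tolist():
            if cand != ids[pos].item():
                ids[pos] = cand
                break
    return det_tok.decode(ids, skip_special_tokens=True)
```

In a full pipeline one would additionally verify, for example with a sentence-embedding similarity threshold, that each substitution preserves the original meaning before accepting it, mirroring the semantic-preservation constraint described above.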
This research has important implications for understanding the robustness of AI text detection systems and highlights the need for more sophisticated detection methods that go beyond surface-level features. We discuss the ethical considerations of this work and propose guidelines for responsible use of such techniques.