LessWrong (30+ Karma)

“Current LLMs seem to rarely detect CoT tampering” by Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan, Josh Engels, Bartosz Cywiński

Nov 19, 2025
Explore whether current large language models can detect modifications to their thought processes. Discover how models react to syntactic changes, revealing low detection rates for subtle edits. The comparison between different models shows some can spot blatant tampering better than others. A unique experiment simulates an unethical assistant facing safety prompts, shedding light on AI behavior. The discussion unfolds the implications for future improvements in model awareness and safety.
Ask episode
AI Snips
Chapters
Transcript
Episode notes
INSIGHT

Syntactic Edits Often Go Unnoticed

  • Models rarely detect syntactic CoT edits like token or sentence removal unless changes are large.
  • Detection rises when edits affect decisions or contradict user instructions.
INSIGHT

Language-Level Tampering Has Mixed Detectability

  • CoT tampering sits at the language level, making detection both easier and harder.
  • Changed text can remain in-distribution so models may ignore it despite being able to read it directly.
ANECDOTE

Half-Trace Removal Test

  • Authors removed or modified halves of generated reasoning traces and then pre-filled models with the tampered CoT to test awareness.
  • Even large changes were detected in under half the rollouts, likely due to redundancy and incoherence in CoTs.
Get the Snipd Podcast app to discover more snips from this episode
Get the app