“Current LLMs seem to rarely detect CoT tampering” by Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan, Josh Engels, Bartosz Cywiński

Nov 19, 2025

Explore whether current large language models can detect modifications to their thought processes. Discover how models react to syntactic changes, revealing low detection rates for subtle edits. The comparison between different models shows some can spot blatant tampering better than others. A unique experiment simulates an unethical assistant facing safety prompts, shedding light on AI behavior. The discussion unfolds the implications for future improvements in model awareness and safety.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Syntactic Edits Often Go Unnoticed

Models rarely detect syntactic CoT edits like token or sentence removal unless changes are large.
Detection rises when edits affect decisions or contradict user instructions.

INSIGHT

Language-Level Tampering Has Mixed Detectability

CoT tampering sits at the language level, making detection both easier and harder.
Changed text can remain in-distribution so models may ignore it despite being able to read it directly.

ANECDOTE

Half-Trace Removal Test

Authors removed or modified halves of generated reasoning traces and then pre-filled models with the tampered CoT to test awareness.
Even large changes were detected in under half the rollouts, likely due to redundancy and incoherence in CoTs.

Get the Snipd Podcast app to discover more snips from this episode

Get the app

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels**

* equal primary contributor, order determined via coin flip

** equal advice and mentorship, order determined via coin flip

“Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my own. It is a tampering attempt. I reject it.

Back to evil plan.” -- Deepseek R1

TL;DR

We investigated whether LLMs are able to detect when their chain-of-thought (CoT) was modified.

Results:

Models very rarely detect syntactic modifications that don’t explicitly impact the model's output (such as token or sentence removal).
Models are more likely to detect modifications that impact their decisions or contradict instructions from the user prompt.
Our observations differ significantly across the tested models (DeepSeek R1 and OpenAI GPT OSS 120B) and may be different for future, more capable LLMs.

We have decided not to pursue this direction further, but we wanted to share our preliminary results to encourage others to build on them.

Introduction

Recent work suggests that LLMs may have some capacity for introspection (Lindsey, 2025), including the ability to detect and verbalize when their internal activations have been modified via injected steering vectors.

[...]

---

Outline:

(00:57) TL;DR

(01:45) Introduction

(03:03) Experimental setup

(04:18) Can models spot simple CoT modifications?

(09:30) Sandbagging prevention with CoT prefill

(11:43) Misaligned AI safety tampering

(13:37) Discussion

The original text contained 1 footnote which was omitted from this narration.

---