

“Unfaithful chain-of-thought as nudged reasoning” by Paul Bogdan, Uzay Macar, Arthur Conmy, Neel Nanda
This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models.
tl;dr
- Research on chain-of-thought (CoT) unfaithfulness shows how models’ CoTs may omit information that is relevant to their final decision.
- Here, we sketch hypotheses for why key information may be omitted from CoTs:
- Training regimes could teach LLMs to omit information from their reasoning.
- But perhaps more importantly, statements within a CoT generally have functional purposes, and faithfully mentioning some information may carry no benefits, so models don’t do it.
- We make further claims about what's going on in faithfulness experiments and how hidden information impacts a CoT:
- Unfaithful CoTs are often not purely post hoc rationalizations and instead can be understood as biased reasoning.
- Models continually make choices during CoT, repeatedly drawing new propositions supporting or discouraging different answers.
- Hidden information can nudge [...]
---
Outline:
(00:21) tl;dr
(01:54) Unfaithfulness
(04:16) CoT is functional, and faithfulness lacks benefits
(09:29) Hidden information can nudge CoTs
(09:48) Silent, soft, and steady spectres
(13:48) Nudges are plausible
(14:42) Nudged CoTs are hard to spot
(15:49) Safety and CoT monitoring
(18:11) Final summary
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
July 22nd, 2025
Source:
https://www.lesswrong.com/posts/vPAFPpRDEg3vjhNFi/unfaithful-chain-of-thought-as-nudged-reasoning
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.