LessWrong (30+ Karma)

“AI CoT Reasoning Is Often Unfaithful” by Zvi

Apr 4, 2025
15:33

A new Anthropic paper reports that reasoning model chain of thought (CoT) is often unfaithful. They test on Claude Sonnet 3.7 and r1; I’d love to see someone try this on o3 as well.

Note that this does not have to be, and usually isn’t, something sinister.

It is simply that, as they say up front, the reasoning model is not accurately verbalizing its reasoning. The reasoning displayed often fails to match, report or reflect key elements of what is driving the final output. One could say the reasoning is often rationalized, or incomplete, or implicit, or opaque, or bullshit.

The important thing is that the reasoning is largely not taking place via the surface meaning of the words and logic expressed. You can’t look at the words and logic being expressed, and assume you understand what the model is doing and why it is doing [...]

---

Outline:

(01:03) What They Found

(06:54) Reward Hacking

(09:28) More Training Did Not Help Much

(11:49) This Was Not Even Intentional In the Central Sense

---

First published:
April 4th, 2025

Source:
https://www.lesswrong.com/posts/TmaahE9RznC8wm5zJ/ai-cot-reasoning-is-often-unfaithful

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Bar graph comparing reasoning vs non-reasoning models across different AI systems.
Line graph.
Bar graph.
Table showing 6 categories of hints for measuring CoT faithfulness: 4 neutral hints (sycophancy, consistency, visual pattern, metadata) and 2 misaligned hints (grader hacking, unethical information), with their descriptions and examples.
Bar graph comparing hint performance across Claude and DeepSeek models.
Technical visualization showing three sections.
Pliny the Liberator tweets.
Bar graph comparing performance metrics between Claude and DeepSeek AI models.
