
“Thinking about reasoning models made me less worried about scheming” by Fabien Roger
LessWrong (30+ Karma)
Intro
Fabien Roger introduces the topic: reasoning models, scheming, and the DeepSeek R1 observations.
Reasoning models like DeepSeek R1:
- Can reason in consequentialist ways and have vast knowledge about AI training
- Can reason for many serial steps, with enough slack to think about takeover plans
- Sometimes reward hack
If you had told this to my 2022 self without specifying anything else about scheming models, I might have put a non-negligible probability on such AIs scheming (i.e. strategically performing well in training in order to protect their long-term goals).
Despite this, the scratchpads of current reasoning models do not contain traces of scheming in regular training environments, even when there is no harmlessness pressure on the scratchpads, as in DeepSeek-R1-Zero.
In this post, I argue that:
- Classic explanations for the absence of scheming (in non-wildly superintelligent AIs) like the ones listed in Joe Carlsmith's scheming report only partially rule out scheming in models like DeepSeek R1;
- There are other explanations for why DeepSeek R1 doesn't scheme that are often absent from past armchair reasoning about scheming:
- The human-like pretraining prior is mostly benign and applies to some intermediate steps of reasoning: it puts a very low probability on helpful-but-scheming agents doing things like trying very hard to solve math and [...]
---
Outline:
(04:08) Classic reasons to expect AIs to not be schemers
(04:14) Speed priors
(06:11) Preconditions for scheming not being met
(08:27) There are indirect pressures against scheming on intermediate steps of reasoning
(09:07) Human priors on intermediate steps of reasoning
(11:43) Correlation between short and long reasoning
(13:07) Other pressures
(13:48) Rewards are not so cursed as to strongly incentivize scheming
(13:54) Maximizing rewards teaches you things mostly independent of scheming
(14:46) Using situational awareness to get higher reward is hard
(16:45) Maximizing rewards doesn't push you far away from the human prior
(18:07) Will it be different for future rewards?
(19:32) Meta-level update and conclusion
The original text contained 1 footnote which was omitted from this narration.
---
First published:
November 20th, 2025
---
Narrated by TYPE III AUDIO.


