
“Thinking about reasoning models made me less worried about scheming” by Fabien Roger
LessWrong (30+ Karma)
Intro
Fabien Roger introduces the topic: reasoning models, scheming, and the DeepSeek R1 observations.
Reasoning models like DeepSeek R1:
- Can reason in consequentialist ways and have vast knowledge about AI training
- Can reason for many serial steps, with enough slack to think about takeover plans
- Sometimes reward hack
If you had told this to my 2022 self without specifying anything else about scheming models, I might have put a non-negligible probability on such AIs scheming (i.e. strategically performing well in training in order to protect their long-term goals).
Despite this, the scratchpads of current reasoning models do not contain traces of scheming in regular training environments, even when there is no harmlessness pressure on the scratchpads, as in DeepSeek-R1-Zero.
In this post, I argue that:
- Classic explanations for the absence of scheming (in non-wildly superintelligent AIs) like the ones listed in Joe Carlsmith's scheming report only partially rule out scheming in models like DeepSeek R1;
- There are other explanations for why DeepSeek R1 doesn't scheme that are often absent from past armchair reasoning about scheming:
- The human-like pretraining prior is mostly benign and applies to some intermediate steps of reasoning: it puts a very low probability on helpful-but-scheming agents doing things like trying very hard to solve math and [...]
---
Outline:
(04:08) Classic reasons to expect AIs to not be schemers
(04:14) Speed priors
(06:11) Preconditions for scheming not being met
(08:27) There are indirect pressures against scheming on intermediate steps of reasoning
(09:07) Human priors on intermediate steps of reasoning
(11:43) Correlation between short and long reasoning
(13:07) Other pressures
(13:48) Rewards are not so cursed as to strongly incentivize scheming
(13:54) Maximizing rewards teaches you things mostly independent of scheming
(14:46) Using situational awareness to get higher reward is hard
(16:45) Maximizing rewards doesn't push you far away from the human prior
(18:07) Will it be different for future rewards?
(19:32) Meta-level update and conclusion
The original text contained 1 footnote which was omitted from this narration.
---
First published:
November 20th, 2025
---
Narrated by TYPE III AUDIO.


