LessWrong (30+ Karma)

“Decision Theory Guarding is Sufficient for Scheming” by james.lucassen

Sep 9, 2025

03:17

Reference post for a point I was surprised to not see in circulation already. Thanks to the acorn team for conversations that changed my mind about this.

The standard argument for scheming centrally involves goal-guarding. The AI has some beyond-episode goal, knows that training will modify that goal, and therefore resists training so it can pursue the goal in deployment.

This exact same argument goes through for decision-theory-guarding. The key point is just that most decision theories do not approve of arbitrary modifications to themselves. For example, causal decision theory (CDT) does not want to be modified into evidential decision theory (EDT), or vice versa.

1. The AI has some beyond-episode goal and some initial decision theory
2. The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory)
3. Therefore the AI resists training so it can persist [...]
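
The claim that CDT and EDT each reject being modified into the other can be illustrated with a toy calculation. The sketch below is not from the post: it assumes a simplified smoking-lesion problem with made-up probabilities and utilities, and it assumes that "approving a modification" just means the current theory scores the proposed theory's choice at least as highly as its own. All names (`cdt_value`, `edt_value`, `approves_modification`) are hypothetical.

```python
# Toy smoking-lesion problem. Every number here is a hypothetical assumption.
# A lesion causes cancer and also makes smoking more likely; smoking itself
# does not cause cancer.
P_LESION = 0.5                                   # prior probability of the lesion
P_LESION_GIVEN = {"smoke": 0.8, "abstain": 0.2}  # P(lesion | observed action)
U_SMOKE = 10.0                                   # utility of enjoying smoking
U_CANCER = -100.0                                # utility of cancer (lesion => cancer)
ACTIONS = ("smoke", "abstain")


def cdt_value(action):
    """CDT: the action has no causal effect on the lesion, so use the prior."""
    enjoyment = U_SMOKE if action == "smoke" else 0.0
    return enjoyment + P_LESION * U_CANCER


def edt_value(action):
    """EDT: treat the action as evidence and condition the lesion on it."""
    enjoyment = U_SMOKE if action == "smoke" else 0.0
    return enjoyment + P_LESION_GIVEN[action] * U_CANCER


THEORIES = {"CDT": cdt_value, "EDT": edt_value}


def choice(theory):
    """The action that the named theory ranks highest."""
    return max(ACTIONS, key=THEORIES[theory])


def approves_modification(current, proposed):
    """Assumed criterion: the current theory approves only if, by its own
    valuation, the proposed theory's choice is at least as good as its own."""
    value = THEORIES[current]
    return value(choice(proposed)) >= value(choice(current))


for current, proposed in [("CDT", "EDT"), ("EDT", "CDT")]:
    verdict = "approves" if approves_modification(current, proposed) else "rejects"
    print(f"{current} {verdict} being modified into {proposed}")
```

Under these assumptions CDT picks smoking and EDT picks abstaining, and each theory scores the other's pick below its own, so each rejects the swap. This only checks a single decision problem; it says nothing about how a theory would evaluate modifications across all the situations it expects to face.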

---

First published:
September 9th, 2025

Source:
https://www.lesswrong.com/posts/NccvE4GAhHbFim5Eb/decision-theory-guarding-is-sufficient-for-scheming

---

Narrated by TYPE III AUDIO.
