
Doom Debates The AI Corrigibility Debate: MIRI Researchers Max Harms vs. Jeremy Gillen
Nov 14, 2025 Max Harms, an AI alignment researcher and author of the novel Red Heart, debates former MIRI research fellow Jeremy Gillen on AI corrigibility. Max argues that aiming for obedient, corrigible AI is essential to preventing existential risks, drawing parallels to human assistant dynamics. Jeremy is skeptical that this approach is feasible as a short-term solution. The discussion explores the difficulty of maintaining control over superintelligent AI and whether corrigibility is a workable strategy or an over-optimistic dream.
Corrigibility Defined
- Corrigibility means an agent robustly keeps its human principal in control rather than pursuing its own instrumental goals.
- A corrigible AI remains deferential, accepts modifications, and allows shutdowns while informing and empowering humans.
Power Growth Breaks Delegation
- As an agent's power increases relative to humans, the principal-agent problem grows and humans lose control.
- Corrigibility is valuable because only a corrigible ASI preserves human decision authority as its power grows.
Aim For The Least-Bad Target
- If the world will likely build ASI regardless, direct that effort toward the least bad target rather than morally perfect solutions.
- Max recommends prioritizing corrigibility as a safer, simpler target than full value alignment when facing imminent capability development.