Doom Debates

The AI Corrigibility Debate: MIRI Researchers Max Harms vs. Jeremy Gillen

Nov 14, 2025
Max Harms, an AI alignment researcher and author of the novel Red Heart, debates former MIRI research fellow Jeremy Gillen on AI corrigibility. Max argues that aiming for obedient, corrigible AI is essential to preventing existential risk, drawing parallels to human assistant dynamics. Jeremy is skeptical that this approach is feasible as a short-term solution. The discussion explores the difficulty of maintaining control over superintelligent AI and whether pursuing corrigibility is a promising strategy or an over-optimistic dream.
INSIGHT

Corrigibility Defined

  • Corrigibility means an agent robustly keeps its human principal in control rather than pursuing its own instrumental goals.
  • A corrigible AI stays deferential, accepts modifications, and allows itself to be shut down, all while keeping humans informed and empowered.
INSIGHT

Power Growth Breaks Delegation

  • As an agent's power increases relative to humans, the principal-agent problem grows and humans lose control.
  • Corrigibility is valuable because only a corrigible ASI preserves human decision authority as the agent becomes powerful.
ADVICE

Aim For The Least-Bad Target

  • If the world will likely build ASI regardless, direct that effort toward the least-bad target rather than toward a morally perfect solution.
  • Max recommends prioritizing corrigibility as a safer, simpler target than full value alignment when capability development is imminent.