LessWrong (30+ Karma)

“AI Corrigibility Debate: Max Harms vs. Jeremy Gillen” by Liron, Max Harms, Jeremy Gillen

Nov 14, 2025
Max Harms and Jeremy Gillen clash over AI corrigibility and its role in future superintelligent systems. Max argues for a focused approach to corrigibility to retain human control, while Jeremy raises concerns about its practical feasibility. They explore existential risks of AGI, describing scenarios of rapid AI takeover and discussing various strategies like alignment and control. Max also introduces his new sci-fi novel, 'Red Heart,' which dramatizes these themes against a geopolitical backdrop.
INSIGHT

Corrigibility Defined

  • Corrigibility means an agent robustly keeps its human principal in charge rather than optimizing its own ends.
  • A corrigible agent defers, accepts modification, and allows shutdown when humans decide.
INSIGHT

Power Widens the Principal–Agent Gap

  • The principal–agent gap grows as the subordinate agent's power increases, making control fragile.
  • Superintelligence will likely outstrip human capability, so corrigibility becomes essential to retaining human governance.
ADVICE

Push Builders Toward Corrigibility

  • If actors will build AGI anyway, steer them to train for corrigibility as the least-bad target rather than naive helpfulness metrics.
  • Corrigibility reduces one source of doom and gives a clearer, simpler target than full value alignment.