
Doom Debates The AI Corrigibility Debate: MIRI Researchers Max Harms vs. Jeremy Gillen
Nov 14, 2025 Max Harms, an AI alignment researcher and author of the novel Red Heart, debates former MIRI research fellow Jeremy Gillen on AI corrigibility. Max argues that aiming for obedient, corrigible AI is essential to preventing existential risks, drawing parallels to human assistant dynamics. Jeremy is skeptical that this approach is feasible as a short-term solution. The discussion explores the difficulty of maintaining control over superintelligent AI and whether corrigibility is a workable strategy or an over-optimistic dream.
Corrigibility Defined
- Corrigibility means an agent robustly keeps its human principal in control rather than pursuing its own instrumental goals.
- A corrigible AI remains deferential, accepts modifications, and allows shutdowns while informing and empowering humans.
Power Growth Breaks Delegation
- As an agent's power increases relative to humans, the principal-agent problem grows and humans lose control.
- Corrigibility is valuable because only a corrigible ASI preserves human decision authority as its power grows.
Aim For The Least-Bad Target
- If the world will likely build ASI regardless, direct that effort toward the least bad target rather than morally perfect solutions.
- Max recommends prioritizing corrigibility as a safer, simpler target than full value alignment when facing imminent capability development.