LessWrong (30+ Karma)

“The corrigibility basin of attraction is a misleading gloss” by Jeremy Gillen

Dec 6, 2025
Dive into the complexities of AGI as Jeremy Gillen challenges the idea of corrigibility and its role in designing safer AI. He examines the basin-of-attraction concept, arguing that merely approximating human values may fall short. With engaging analogies, like iterating on a cargo ship's design, he highlights potential pitfalls of trial-and-error engineering. Jeremy critiques common assumptions and argues for clearer theoretical frameworks. Get ready for an insightful discussion on the balance between empirical fixes and robust understanding in AGI development!
INSIGHT

Basin-of-Attraction Intuition

  • The basin-of-attraction idea says corrigible AIs will help make successors corrigible, creating an attractor for safe iteration.
  • This motivates prosaic alignment by enlarging the acceptable initial target and enabling iterative improvement.
INSIGHT

Corrigibility Requires Self-Doubt

  • Eliezer's original notion of corrigibility demands that the agent understand its own fallibility and avoid becoming overconfident.
  • This is hard because systems that fully trust their own reasoning tend to abandon the rule-like fail-safes they were given.
INSIGHT

Different Corrigibility Flavors

  • Paul's and Max's versions implement corrigibility differently, often via act-based objectives or empowerment-focused principles.
  • These versions emphasize iterative human-AI collaboration, but may lose the explicit awareness of its own flaws that MIRI-style corrigibility requires.