
LessWrong (30+ Karma) “The corrigibility basin of attraction is a misleading gloss” by Jeremy Gillen
Dec 6, 2025
Dive into the complexities of AGI alignment as Jeremy Gillen challenges the idea of corrigibility and its role in designing safer AI. He scrutinizes the basin-of-attraction concept, arguing that merely approximating human values may fall short. With engaging analogies like a cargo ship, he highlights potential pitfalls in iterative engineering. Jeremy critiques common assumptions and makes the case for clearer theoretical frameworks. Get ready for an insightful discussion on the balance between empirical fixes and robust understanding in AGI development!
Basin-of-Attraction Intuition
- The basin-of-attraction idea says corrigible AIs will help make successors corrigible, creating an attractor for safe iteration.
- This motivates prosaic alignment by enlarging the acceptable initial target and enabling iterative improvement.
Corrigibility Requires Self-Doubt
- Eliezer's original formulation of corrigibility demands that the agent understand its own fallibility and avoid becoming overconfident.
- This is hard because systems that fully trust their own reasoning tend to abandon the rule-like fail-safes they were given.
Different Corrigibility Flavors
- Paul's and Max's versions implement corrigibility differently, often via act-based objectives or principles aimed at empowering the human principal.
- These versions emphasize iterative human-AI collaboration but may lose the explicit awareness of the agent's own flaws that MIRI-style corrigibility demands.
