
LessWrong (30+ Karma) Will We Get Alignment by Default? — with Adrià Garriga-Alonso
Nov 28, 2025
A thought-provoking discussion unfolds about AI alignment. Adrià posits that aligning AI may be easier than anticipated, with models like Claude 3 Opus already demonstrating core goodness. He advocates an iterative approach in which each AI generation helps align the next. Simon, however, raises concerns about the fragility of current methods and the possibility of missing a crucial alignment opportunity. The debate digs into the current state of AI safety and the challenges ahead.
AI Snips
Spontaneous Debate Sparked By A Comment
- The episode grew from a LessWrong post and a lengthy comment that sparked a spontaneous debate between Adrià and Simon.
- Their exchange framed a broader conversation about where AI safety actually stands.
Tension Between Progress And Systemic Risk
- The central question contrasts optimism that alignment is getting easier with caution that current methods may not scale to more capable systems.
- This frames a key tension in current AI safety discourse between empirical progress and systemic risk.
Iterative Alignment Through Successive Models
- Adrià argues that current large models like Claude 3 Opus are fundamentally well-aligned and reliable.
- He suggests that an iterative process, in which each model generation helps align the next, could carry us safely to superintelligence.
