LessWrong (30+ Karma)

Will We Get Alignment by Default? — with Adrià Garriga-Alonso

Nov 28, 2025
A thought-provoking discussion about AI alignment. Adrià posits that aligning AI may be easier than anticipated, with models like Claude 3 Opus already demonstrating a core goodness. He advocates an iterative approach in which each AI generation helps align the next. Simon, however, raises concerns about the fragility of current methods and the risk of missing a crucial alignment opportunity. The debate digs into the current state of AI safety and the challenges that lie ahead.
ANECDOTE

Spontaneous Debate Sparked By A Comment

  • The episode grew from a LessWrong post and a lengthy comment that sparked a spontaneous debate between Adrià and Simon.
  • Their exchange framed a broader conversation about where AI safety actually stands.
INSIGHT

Tension Between Progress And Systemic Risk

  • The central question pits optimism that alignment is becoming empirically easier against caution that today's fragile methods may not scale.
  • This frames a key tension in current AI safety discourse between empirical progress and systemic risk.
INSIGHT

Iterative Alignment Through Successive Models

  • Adrià argues that current large models like Claude 3 Opus are fundamentally well-aligned and reliable.
  • He suggests that iterative generations, in which each model helps align the next, could carry us safely to superintelligence.