LessWrong (Curated & Popular)

“6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa” by Steven Byrnes

Dec 4, 2025
In this episode, Steven Byrnes, an AI alignment researcher, examines the cultural clash surrounding alignment theories. He unpacks the concept of 'approval reward' and how it shapes human behavior, contrasting it with the perceived ruthlessness of future AIs. Byrnes challenges existing explanations of why humans don't usually act like power-seeking agents, arguing that innate social instincts foster kindness and corrigibility. The discussion then asks whether future AGI will share similar approval-driven motivations.
AI Snips
INSIGHT

Approval Reward Explains Human Sociality

  • Humans carry an innate approval-reward signal that makes social approval feel intrinsically valuable and shapes self-reflective preferences.
  • This approval reward explains why humans often act cooperatively and resist purely consequentialist scheming.
INSIGHT

Approval Reward Is An Alignment Crux

  • Whether future AGI has approval reward is a core crux underlying many alignment disagreements.
  • Alignment-is-hard thinkers generally assume AGI will lack approval reward, and will therefore behave very differently from humans.
ADVICE

Design Approval Reward Intentionally

  • Don't assume approval-like motivations will appear spontaneously in RL-based AGI; they must be designed and implemented explicitly.
  • Expect technical and competitive barriers to adding human-like approval reward to future AGI.