LessWrong (Curated & Popular)

"Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research" by evhub, Nicholas Schiefer, Carson Denison, Ethan Perez

Aug 9, 2023
This episode discusses the case for researching model organisms of misalignment, deliberately constructed demonstrations of alignment failures, in order to understand how such failures arise in AI systems. It explores strategies for model training and deployment, such as input tagging and evaluating outputs with a preference model, and also examines the risks of working with model organisms, including deceptive alignment.
AI Snips
INSIGHT

Lack of Empirical Evidence for AI Risks

  • We lack strong empirical evidence for existential AI risks such as deceptive alignment and reward hacking.
  • These failure modes require models to develop several complex capabilities, such as situational awareness and deception, at the same time.
ADVICE

Use Model Organisms for Testing

  • Develop and test "model organisms" that demonstrate specific AI misalignment subcomponents.
  • Use these controlled cases to understand failures and evaluate alignment techniques.
INSIGHT

Breaking Down AI Takeover Risks

  • AI takeover risks break down into components like misaligned goals, deception, and situational awareness.
  • Demonstrating each subcomponent separately helps build understanding before combining them.