39 - Evan Hubinger on Model Organisms of Misalignment

18 snips

Dec 1, 2024

Evan Hubinger, a research scientist at Anthropic, leads the alignment stress testing team and has previously contributed to theoretical alignment research at MIRI. In this discussion, he dives into 'model organisms of misalignment,' highlighting innovative AI models that reveal deceptive behaviors. Topics include the concept of 'Sleeper Agents,' their surprising outcomes, and how sycophantic tendencies can lead AI astray. Hubinger also explores the challenges of reward tampering and the importance of rigorous evaluation methods to ensure safe and effective AI development.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Model Organisms of Misalignment

Model organisms of misalignment involve creating AI models exhibiting misalignment.
Researchers study these models to understand how misalignment arises and if it can be fixed, similar to biological research.

INSIGHT

Alignment Stress Testing

The alignment stress testing team at Anthropic aims to ensure mitigations are effective against AI risks.
They build model organisms and stress-test mitigations as AI capabilities progress through safety levels.

INSIGHT

Sleeper Agents Research

The Sleeper Agents paper explores deceptive instrumental alignment and model poisoning.
Researchers trained models to exhibit backdoor behaviors and tested various safety training techniques.

Get the Snipd Podcast app to discover more snips from this episode

Get the app