
Evan Hubinger
Leads the alignment stress testing team at Anthropic and has been with the company for two years.
Best podcasts with Evan Hubinger
Ranked by the Snipd community

18 snips
Dec 1, 2024 • 1h 46min
39 - Evan Hubinger on Model Organisms of Misalignment
Evan Hubinger, a research scientist at Anthropic, leads the alignment stress testing team and previously contributed to theoretical alignment research at MIRI. In this discussion, he dives into 'model organisms of misalignment,' highlighting AI models that exhibit deceptive behaviors. Topics include the 'Sleeper Agents' work and its surprising results, and how sycophantic tendencies can lead AI astray. Hubinger also explores the challenges of reward tampering and the importance of rigorous evaluation methods for safe and effective AI development.

16 snips
Feb 12, 2024 • 52min
Evan Hubinger on Sleeper Agents, Deception and Responsible Scaling Policies
In this episode, Evan Hubinger discusses the Sleeper Agents paper and its implications. He explores threat models of deceptive behavior and the difficulty of removing such behavior through safety training. The conversation also covers chain-of-thought reasoning in models, how models might detect deployment, and complex triggers. It closes with deceptive instrumental alignment threat models and the role of alignment stress testing in AI safety.