
EAG Talks

Sleeper Agents | Evan Hubinger | EA Global Bay Area: 2024

Mar 6, 2024
49:53

If an AI system learned a deceptive strategy, could we detect and remove it using current state-of-the-art safety training techniques? That's the question that Evan and his coauthors at Anthropic sought to answer in their work on "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", which Evan will be discussing.

Evan Hubinger leads the new Alignment Stress-Testing team at Anthropic, which is tasked with red-teaming Anthropic's internal alignment techniques and evaluations. Prior to joining Anthropic, Evan was a Research Fellow at the Machine Intelligence Research Institute, where he worked on theoretical alignment research, including "Risks from Learned Optimization in Advanced Machine Learning Systems". Evan will be talking about the Anthropic Alignment Stress-Testing team's first paper, "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training".

Watch on YouTube: https://www.youtube.com/watch?v=BgfT0AcosHw
