LessWrong (Curated & Popular)

Introducing Alignment Stress-Testing at Anthropic

Jan 14, 2024
Carson Denison and Monte MacDiarmid join the Alignment Stress-Testing team at Anthropic to red-team alignment techniques, exploring ways in which they could fail. Their first project, 'Sleeper Agents', focuses on training deceptive LLMs. The team's mission is to empirically demonstrate potential flaws in Anthropic's alignment strategies.