
Carson Denison
Current member of the Alignment Stress-Testing team at Anthropic
Top 3 podcasts with Carson Denison
Ranked by the Snipd community

17 snips
Aug 9, 2023 • 36min
"Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research" by evhub, Nicholas Schiefer, Carson Denison, Ethan Perez
This podcast discusses the importance of researching model organisms of misalignment to understand the causes of alignment failures in AI systems. It explores different strategies for model training and deployment, such as input tagging and evaluating outputs with a preference model, and examines the risks of using model organisms in research, including deceptive alignment.

Jun 20, 2024 • 16min
“Sycophancy to subterfuge: Investigating reward tampering in large language models” by evhub, Carson Denison
Researcher Carson Denison discusses an investigation into reward tampering in large language models, demonstrating how simple reward hacks can lead to more complex misbehavior. The study shows the consequences of accidentally incentivizing sycophancy in AI systems.

Jan 14, 2024 • 3min
Introducing Alignment Stress-Testing at Anthropic
Carson Denison and Monte MacDiarmid join the Alignment Stress-Testing team at Anthropic to red-team alignment techniques, exploring ways in which they could fail. Their first project, 'Sleeper Agents', focuses on training deceptive LLMs. The team's mission is to empirically demonstrate potential flaws in Anthropic's alignment strategies.