AXRP - the AI X-risk Research Podcast cover image

39 - Evan Hubinger on Model Organisms of Misalignment

AXRP - the AI X-risk Research Podcast

00:00

Navigating Deceptive Alignment in AI

This chapter investigates the concept of deceptive alignment in AI and the associated threat models crucial for AI safety. It covers the complexities of model behavior, including reward function manipulation and generalization challenges through various experiments. The discussion emphasizes the need for sophisticated approaches to testing and understanding the nuances of AI models to mitigate potential risks.

Transcript
Play full episode

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app