
39 - Evan Hubinger on Model Organisms of Misalignment
AXRP - the AI X-risk Research Podcast
Navigating Deceptive Alignment in AI
This chapter examines deceptive alignment in AI and the threat models it poses for AI safety. It covers the complexities of model behavior, including reward function manipulation and generalization challenges, as explored through various experiments. The discussion emphasizes the need for sophisticated approaches to testing and understanding AI models in order to mitigate potential risks.