39 - Evan Hubinger on Model Organisms of Misalignment

AXRP - the AI X-risk Research Podcast

Chapter

Navigating Deceptive Alignment in AI

This chapter examines deceptive alignment in AI and the threat models it poses for AI safety. Drawing on a series of experiments, it covers complexities of model behavior, including reward function manipulation and generalization failures, and emphasizes the need for more sophisticated approaches to testing and understanding AI models in order to mitigate potential risks.
