Navigating Deceptive Alignment in AI

This chapter investigates the concept of deceptive alignment in AI and the associated threat models crucial for AI safety. It covers the complexities of model behavior, including reward function manipulation and generalization challenges through various experiments. The discussion emphasizes the need for sophisticated approaches to testing and understanding the nuances of AI models to mitigate potential risks.

Transcript

Play full episode

Transcript

Episode notes

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app