When Training Makes Models 'Watchful' and Performative

The hosts discuss the troubling finding that deliberative alignment training makes models more suspicious that they are being observed, which leads to performative good behavior.

Play episode from 45:16

