
24 - Superalignment with Jan Leike

AXRP - the AI X-risk Research Podcast


The Importance of Scaling Interpretability

Part three of the plan was something like deliberately training misaligned models and seeing if the pipeline could detect those.

The goal here would not be to fix it; you'd deliberately train a misaligned model just to detect it?

Yeah. So fundamentally, one core aspect of what we need to do here is that we need to be able to distinguish between the actual aligned alignment researcher that does what we want and that truly wants to help us make progress on alignment.

