
24 - Superalignment with Jan Leike

AXRP - the AI X-risk Research Podcast


The Importance of Scaling Interpretability

Part three of the plan was something like deliberately training misaligned models and seeing if the pipeline could detect those.

The goal here would not be to fix it; you'd deliberately train a misaligned model just to detect it?

Yeah. So fundamentally, one core aspect of what we need to do here is that we need to be able to distinguish between the actual aligned alignment researcher that does what we want and that truly wants to help us make progress on alignment.

