24 - Superalignment with Jan Leike

AXRP - the AI X-risk Research Podcast

The Importance of Interpretability in Language Models

Jan Leike: I think there's a good chance that we could solve alignment purely behaviorally, without actually understanding the models internally. There's no reason to believe that individual neurons should correspond to concepts, or anything near what humans think they are, right?

Jan Leike: Any amount of non-trivial insight we can gain from interpretability will be super useful, or could potentially be super useful, because it gives us a new avenue of attack. And if you think about it, it's crazy not to try to do interpretability.

