24 - Superalignment with Jan Leike

AXRP - the AI X-risk Research Podcast

The Importance of Interpretability in Language Models

Jan Leike: I think there's a good chance that we could solve alignment purely behaviorally, without actually understanding the models internally. There's no reason to believe that individual neurons should correspond to concepts, or anything near what humans think they are, right?

Jan Leike: Any amount of non-trivial insight we can gain from interpretability will be super useful, or could potentially be super useful, because it gives us a new avenue of attack. And if you think about it, it's crazy not to try to do interpretability.

