
LessWrong (Curated & Popular) "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds
Oct 9, 2023
The podcast discusses the challenges of understanding the behavior of neural networks, along with possible solutions. It explores the use of features rather than individual neurons and the decomposition of language models into interpretable parts. It also covers interpretability in language models, highlighting the importance of features and the influence of activating specific features. The potential for decomposing models into interpretable features is discussed, along with the universality of learned features and Anthropic's investment in mechanistic interpretability.
