
"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds
LessWrong (Curated & Popular)
Exploring Interpretability and Feature Decomposition in Language Models
This chapter explores interpretability in language models: why features matter, how feature interpretability scores compare to those of individual neurons, and how activating specific features can steer the model's behavior. It also covers the universality of learned features, the prospect of decomposing models into interpretable features, and Anthropic's investment in mechanistic interpretability.