LessWrong (Curated & Popular)

[HUMAN VOICE] "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

Nov 9, 2023
This episode discusses why artificial neural networks are hard to understand and the importance of recording neuron activations and testing how they respond. It explores decomposing language models into interpretable features with dictionary learning and the benefits of using those features for interpretation. It also covers the universality of learned features, the potential benefits of decomposing a model into a small or a large set of features, and the challenges of scaling this approach to larger models.