
"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds
LessWrong (Curated & Popular)
00:00
Understanding the Behaviors of Neural Networks and Decomposing Language Models with Dictionary Learning
Exploring the challenges and solutions for understanding the behaviors and failure modes of neural networks, focusing on the use of features instead of individual neurons and the decomposition of language models into interpretable parts.
Play episode from 00:00
Transcript


