"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

Oct 9, 2023

This episode discusses the challenges of understanding the behavior of neural networks and approaches to addressing them. It explores analyzing features rather than individual neurons and decomposing language models into interpretable parts. It also covers interpretability in language models, including the influence that activating specific features has on model behavior. Finally, it discusses the potential for decomposing models into interpretable features, the universality of learned features, and Anthropic's investment in mechanistic interpretability.

Chapters

Understanding the Behaviors of Neural Networks and Decomposing Language Models with Dictionary Learning

00:00 • 3min

Exploring Interpretability and Feature Decomposition in Language Models

02:34 • 2min