[HUMAN VOICE] "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

Nov 9, 2023

This episode discusses the challenges of understanding artificial neural networks and the importance of recording neuron activations and testing their responses. It explores how dictionary learning can decompose a language model into interpretable features, and the benefits of using those features for interpretation. It also covers the universality of learned features, the potential benefits of decomposing a model into a smaller or larger set of features, and the challenges of scaling this approach to larger models.

Chapters

Introduction

00:00 • 2min

Decomposing Language Models with Dictionary Learning

02:14 • 3min

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

04:54 • 3min