Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

This chapter explores decomposing language models into interpretable features and what that means for understanding and steering model behavior, including how activating a feature produces predictable changes in output. It also discusses the universality of learned features, the potential benefits of decomposing a model into a smaller or larger set of features, and the challenges of scaling the approach to larger models.

Play episode from 04:54
