
"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

LessWrong (Curated & Popular)


Exploring Interpretability and Feature Decomposition in Language Models

This chapter explores interpretability in language models: the importance of features, how feature activations compare to neuron activations, and how activating specific features can steer the model's behavior. It also discusses the universality of learned features, the potential for decomposing models into interpretable features, and Anthropic's investment in mechanistic interpretability.

