LessWrong (Curated & Popular) cover image

"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds

LessWrong (Curated & Popular)

00:00

Understanding the Behaviors of Neural Networks and Decomposing Language Models with Dictionary Learning

Exploring the challenges and solutions for understanding the behaviors and failure modes of neural networks, focusing on the use of features instead of individual neurons and the decomposition of language models into interpretable parts.

Play episode from 00:00
Transcript

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app