

Mechanistic Interpretability: Philosophy, Practice & Progress with Goodfire's Dan Balsam & Tom McGrath
May 29, 2025
In a thought-provoking discussion, Dan Balsam, CTO of Goodfire, and Tom McGrath, Chief Scientist, dive into the exciting world of mechanistic interpretability in AI. They analyze how understanding neural networks can spark breakthroughs in scientific discovery and creative domains. The pair tackle challenges in natural language processing and model debugging, drawing fascinating parallels with biology. Additionally, they underscore the importance of funding and innovative approaches in advancing AI explainability, paving the way for a more transparent future.
AI Snips
Interpretability as Empirical Science
- Interpretability relies heavily on rich empirical data from models' internal activations.
- Progress is like natural science: observing phenomena and forming hypotheses gradually.
Sparse Autoencoders as Microscopes
- Sparse autoencoders (SAEs) act as a reductive sensor into a model's internal activations, trading some reconstruction fidelity for sparse, interpretable features (see the sketch after this snip).
- Improving SAEs and building experimental scaffolding around them is key to forming better abstractions of models and advancing interpretability.
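As a rough illustration of the SAE idea discussed here, the following is a minimal PyTorch sketch, not Goodfire's implementation: the class name, dimensions, and the `l1_coeff` sparsity penalty are illustrative assumptions. The loss makes the reconstruction-versus-sparsity trade-off explicit.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over model activations (illustrative sizes)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # Sparse feature activations: ReLU keeps most features at zero.
        f = torch.relu(self.encoder(x))
        # Reconstruction of the original activation vector.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Trade-off: reconstruction fidelity vs. sparsity of the learned features.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Usage sketch: encode a batch of residual-stream activations.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x = torch.randn(8, 512)            # stand-in for captured activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```

Raising `l1_coeff` drives features sparser (and more interpretable) at the cost of higher reconstruction error, which is the trade-off the snip refers to.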
Interpretability's Proto-Paradigm
- Mechanistic interpretability is now proto-paradigmatic, not pre-paradigmatic.
- There is growing consensus that features are linear directions that compose into circuits, with superposition letting a model represent more concepts than it has dimensions (a toy illustration follows below).
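To make the superposition point concrete, here is a toy NumPy sketch (an illustration assumed for this write-up, not an example from the episode): three feature directions are packed into a two-dimensional space, so each feature can still be read out linearly but with slight interference from the others.

```python
import numpy as np

# Toy superposition: 3 feature directions packed into 2 dimensions.
# The directions cannot all be orthogonal, so features interfere slightly,
# but each can still be read out approximately with a dot product.
angles = np.array([0, 2 * np.pi / 3, 4 * np.pi / 3])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

# Encode a sparse concept vector (only feature 0 active) into the 2-d space.
concepts = np.array([1.0, 0.0, 0.0])
activation = concepts @ directions            # 2-d activation vector

# Linear readout: project the activation back onto each feature direction.
readout = directions @ activation
print(readout)  # approx. [1.0, -0.5, -0.5]: feature 0 dominates, the rest is interference
```

With sparse activations, this interference stays small, which is what lets superposition store more concepts than dimensions.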