"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Mechanistic Interpretability: Philosophy, Practice & Progress with Goodfire's Dan Balsam & Tom McGrath

May 29, 2025
In a thought-provoking discussion, Dan Balsam, CTO of Goodfire, and Tom McGrath, the company's Chief Scientist, dive into the world of mechanistic interpretability in AI. They analyze how understanding neural networks' internals can spark breakthroughs in scientific discovery and creative domains. The pair tackle challenges in natural language processing and model debugging, drawing parallels with biology, and underscore the importance of funding and innovative approaches in advancing AI explainability, paving the way for a more transparent future.
AI Snips
INSIGHT

Interpretability as Empirical Science

  • Interpretability relies heavily on rich empirical data from models' internal activations.
  • Progress resembles natural science: observing phenomena first and forming hypotheses gradually.
INSIGHT

Sparse Autoencoders as Microscopes

  • Sparse autoencoders (SAEs) act as a reductive sensor on model internals, trading reconstruction fidelity for sparsity (a minimal training sketch follows these notes).
  • Improving SAEs and developing experimental scaffolding are key to better model abstraction and interpretability.
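To make the reconstruction-versus-sparsity trade-off concrete, here is a minimal sketch of a sparse autoencoder trained on model activations. It is written in PyTorch with hypothetical dimensions and random stand-in data; it is an illustration of the general technique, not Goodfire's implementation.

# Minimal sparse autoencoder (SAE) sketch over model activations.
# Dimensions, learning rate, and data are hypothetical placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: d_hidden is larger than d_model
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Non-negative sparse codes via ReLU
        codes = torch.relu(self.encoder(x))
        recon = self.decoder(codes)
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff=1e-3):
    # The trade-off mentioned in the episode: reconstruction error vs. sparsity
    recon_loss = (recon - x).pow(2).mean()
    sparsity_loss = codes.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Example: activations from a residual stream of width 512 (hypothetical)
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)  # stand-in for real model activations
recon, codes = sae(acts)
loss = sae_loss(acts, recon, codes)
opt.zero_grad()
loss.backward()
opt.step()

Raising l1_coeff pushes the codes toward sparser, more interpretable features at the cost of higher reconstruction error, which is the "reductive sensor" framing in the insight above.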
INSIGHT

Interpretability's Proto-Paradigm

  • Mechanistic interpretability is now proto-paradigmatic, not pre-paradigmatic.
  • There is growing consensus that features are linear directions in activation space, that they compose into circuits, and that superposition lets models represent more concepts than they have dimensions (see the toy example below).
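As a toy illustration of superposition, the sketch below packs more feature directions than dimensions into one vector space and reads them back with dot products. All numbers are made up for illustration; it demonstrates the idea, not a result from the episode.

# Toy superposition example: 8 concepts stored in a 3-dimensional space.
# Numbers are hypothetical; this is a sketch of the concept only.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 3, 8

# Random unit vectors serve as (necessarily non-orthogonal) feature directions
directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Activate a sparse subset of features and superpose them in the same space
active = np.zeros(n_features)
active[[1, 5]] = 1.0
activation = active @ directions  # one d-dimensional activation vector

# Dot-product readout recovers the active features, plus small interference
# terms because 8 directions cannot all be orthogonal in 3 dimensions
readout = directions @ activation
print(np.round(readout, 2))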