"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Popular Mechanistic Interpretability: Goodfire Lights the Way to AI Safety

Aug 17, 2024
Dan Balsam, CTO of Goodfire with extensive startup engineering experience, and Tom McGrath, Chief Scientist and former DeepMind AI safety researcher, dive into mechanistic interpretability. They explore the complexities of AI training, discuss advances such as sparse autoencoders, and weigh the balance between model complexity and interpretability. The conversation also covers how hierarchical structure in AI models relates to human cognition, and the collaborative effort needed to navigate the evolving landscape of AI research and safety.
AI Snips
INSIGHT

Interpretability's Evolution

  • Interpretability research was initially unfashionable, and many believed models contained nothing meaningful to find.
  • The emergence of sparse autoencoders enabled large-scale analysis, shifting the field from microscopic inspection to industrial-scale understanding (see the sketch below).
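
For readers unfamiliar with the technique, the sketch below shows the core of a sparse autoencoder: an overcomplete linear dictionary trained to reconstruct model activations under an L1 sparsity penalty. This is a minimal PyTorch illustration on synthetic activations; the dimensions, penalty weight, and training loop are assumptions made for the example, not the setup used by Goodfire or described in the episode.

  # Minimal sparse autoencoder sketch on synthetic "activations".
  # All hyperparameters here are illustrative assumptions.
  import torch
  import torch.nn as nn

  class SparseAutoencoder(nn.Module):
      def __init__(self, d_model: int, d_hidden: int):
          super().__init__()
          self.encoder = nn.Linear(d_model, d_hidden)   # activations -> feature space
          self.decoder = nn.Linear(d_hidden, d_model)   # feature space -> reconstruction

      def forward(self, x: torch.Tensor):
          features = torch.relu(self.encoder(x))        # non-negative feature activations
          recon = self.decoder(features)
          return recon, features

  d_model, d_hidden, l1_coeff = 64, 512, 1e-3           # overcomplete feature dictionary
  sae = SparseAutoencoder(d_model, d_hidden)
  opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

  acts = torch.randn(4096, d_model)                     # stand-in for real model activations
  for step in range(200):
      batch = acts[torch.randint(0, acts.shape[0], (256,))]
      recon, features = sae(batch)
      # Reconstruction error plus an L1 penalty that pushes most features to zero.
      loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
      opt.zero_grad()
      loss.backward()
      opt.step()

The L1 term is what makes the dictionary interpretable at scale: each activation vector gets explained by a handful of active features rather than a dense mix of all of them.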
INSIGHT

Meaningful Representations and Polysemanticity

  • Early interpretability research showed that models learn semantically meaningful representations without being explicitly designed to.
  • Polysemanticity, where individual neurons fire on multiple unrelated concepts, still pointed to manageable structure within models (a toy illustration follows this list).
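
As a toy illustration of polysemanticity (an assumption of this page, not a result from the episode): when more sparse concepts than neurons share a representation space, individual neurons inevitably respond to several unrelated concepts.

  # Toy superposition demo: 20 concepts packed into 5 neurons.
  import numpy as np

  rng = np.random.default_rng(0)
  n_concepts, n_neurons = 20, 5
  # Each concept gets a random unit direction in neuron space.
  concept_dirs = rng.normal(size=(n_concepts, n_neurons))
  concept_dirs /= np.linalg.norm(concept_dirs, axis=1, keepdims=True)

  # |contribution| of concept j to neuron i when concept j is active alone.
  responses = np.abs(concept_dirs)
  for neuron in range(n_neurons):
      top = np.argsort(responses[:, neuron])[::-1][:3]
      print(f"neuron {neuron} responds strongly to concepts {top.tolist()}")

Running this prints several strongly-responding concepts per neuron, which is the basic picture a sparse autoencoder tries to untangle back into one feature per concept.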
INSIGHT

Learning as Compression

  • Learning systems can be viewed as compression systems that balance generality against complexity (a back-of-the-envelope example follows this list).
  • Models learn hierarchical structures to represent patterns in the data, which makes interpretability less surprising.
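
A back-of-the-envelope way to see the compression view (illustrative only, not from the episode): once a model captures the structure in the data, the residuals it leaves behind are cheaper to encode than the raw values, at the small cost of storing the model's parameters.

  # Two-part-code flavour of "learning as compression".
  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0, 1, 1000)
  y = 3.0 * x + 0.5 + rng.normal(scale=0.05, size=x.shape)   # structured data + noise

  # Fit a simple linear model (the "learned" hypothesis) and keep its residuals.
  slope, intercept = np.polyfit(x, y, 1)
  residuals = y - (slope * x + intercept)

  # Bits per sample for a Gaussian code: 0.5 * log2(2*pi*e*variance).
  def gaussian_bits_per_sample(values):
      return 0.5 * np.log2(2 * np.pi * np.e * values.var())

  print("raw data:  %.2f bits/sample" % gaussian_bits_per_sample(y))
  print("residuals: %.2f bits/sample (plus a few bits for 2 parameters)"
        % gaussian_bits_per_sample(residuals))

The residuals cost far fewer bits than the raw data, so "model plus residuals" is the shorter description: learning the pattern is compressing the dataset.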