

Popular Mechanistic Interpretability: Goodfire Lights the Way to AI Safety
Aug 17, 2024
Dan Balsam, CTO of Goodfire with extensive startup engineering experience, and Tom McGrath, Goodfire's Chief Scientist and a former DeepMind researcher focused on AI safety, dive into mechanistic interpretability. They explore the complexities of AI training, discussing advances such as sparse autoencoders and the balance between model complexity and interpretability. The conversation also examines how hierarchical structure in AI models relates to human cognition, and why collaborative effort is needed to navigate the evolving landscape of AI research and safety.
AI Snips
Interpretability's Evolution
- Interpretability research was initially unfashionable, with a common belief that models contained nothing meaningful to find.
- The emergence of sparse autoencoders enabled analysis at scale, shifting the field from microscopic, hand-crafted study to industrial-scale understanding (see the sketch after this list).
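A minimal sketch of the sparse autoencoder idea mentioned above, assuming a standard setup: a ReLU encoder into an overcomplete feature dictionary, a linear decoder, and an L1 sparsity penalty. The dimensions and weight names are illustrative assumptions, not taken from the episode or from Goodfire's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 4096          # dictionary is wider than the model
W_enc = rng.normal(0, 0.02, (n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0, 0.02, (d_model, n_features))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps only positively-activated features, encouraging sparsity
    return np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)

def decode(f):
    # Reconstruct the original activation vector from sparse feature activations
    return W_dec @ f + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)        # reconstruction error
    sparsity = l1_coeff * np.sum(np.abs(f))  # L1 penalty drives most features to zero
    return recon + sparsity

x = rng.normal(size=d_model)  # stand-in for one model activation vector
print(sae_loss(x))
```

Training many such dictionaries over a model's activations is what makes the "industrial-scale" analysis possible: each learned feature can be inspected individually rather than neuron by neuron.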
Meaningful Representations and Polysemanticity
- Early interpretability research showed that models learn semantically meaningful representations even though no one explicitly designed them to.
- Polysemanticity, where a single neuron fires on multiple unrelated concepts, complicated analysis but still suggested manageable structure within models (a toy illustration follows this list).
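A toy sketch of why polysemanticity arises, assuming the common superposition picture: when there are more sparse concepts than neurons, concept directions must share neurons, so any single neuron ends up participating in several unrelated concepts. All numbers here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_concepts = 8, 32
# Each concept gets a random direction in a space with fewer dimensions
# than there are concepts, so directions cannot all be orthogonal.
directions = rng.normal(size=(n_concepts, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

neuron = 0
# How strongly neuron 0 participates in each concept's direction:
loadings = np.abs(directions[:, neuron])
top = np.argsort(loadings)[::-1][:5]
print(f"neuron {neuron} loads on concepts {top.tolist()}")
# A single neuron carries weight for many unrelated concepts -- the
# polysemanticity that sparse autoencoders try to disentangle.
```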
Learning as Compression
- Learning systems can be viewed as compression systems, trading off generality against complexity.
- Models learn hierarchical structure to represent patterns in their data, which makes interpretability less surprising (a back-of-envelope example follows this list).
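A back-of-envelope sketch of the learning-as-compression view: under an arithmetic coder, a symbol with model probability p costs roughly -log2(p) bits, so a model that predicts its data better also compresses it better. The probabilities below are made up purely for illustration.

```python
import numpy as np

def code_length_bits(probs):
    # Total bits to encode a sequence of symbols given the model's
    # probability for each symbol (ideal arithmetic-coding cost).
    return -np.sum(np.log2(probs))

uniform = np.full(4, 1 / 50000)          # clueless model over a 50k-token vocabulary
good = np.array([0.6, 0.3, 0.25, 0.4])   # confident model's probs for the same 4 tokens

print(f"uniform model: {code_length_bits(uniform):.1f} bits")
print(f"trained model: {code_length_bits(good):.1f} bits")
```

The shorter code length for the better predictor is the sense in which training a predictive model is implicitly learning a compressed, structured description of its data.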