

Inside the Black Box: The Urgency of AI Interpretability
Oct 2, 2025
Jack Lindsey, a researcher at Anthropic with a background in theoretical neuroscience, teams up with Tom McGrath, co-founder and Chief Scientist at Goodfire and a former member of DeepMind's interpretability team. They tackle the critical topic of AI interpretability, discussing why understanding modern AI models is urgent for safety and reliability. They explore technical challenges, real-world applications, and how larger models complicate analysis. Insights from neuroscience inform their work, making the case for interpretability as essential to trustworthy AI.
AI Snips
Models Outpace Our Understanding
- Models are outpacing our understanding, creating unacceptable risk as they're used in high-stakes tasks.
- We need ways to trust model reasoning even when humans can't verify every output.
Multiple Ways To Explain Why
- Mechanistic interpretability asks "why" by describing the structures and causal mechanisms inside a model's computations (see the sketch after this list).
- Broader interpretability also includes explanations grounded in the training data and in utility (the model's objective), to fully account for model behavior.
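
The "why" question in the first snip is typically probed with causal interventions on a model's internal activations. Below is a minimal, illustrative sketch of one such intervention, activation patching, on a hypothetical toy PyTorch model; the model, inputs, and the choice to patch the first half of the hidden units are assumptions for illustration, not details from the episode.

```python
# Minimal sketch of "activation patching", one mechanistic-interpretability
# technique for testing whether part of a model's internal computation
# causally drives its output. Toy model and inputs are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny two-layer MLP standing in for a trained network.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

clean_input = torch.randn(1, 4)      # input whose behavior we want to explain
corrupted_input = torch.randn(1, 4)  # contrasting input that behaves differently

# 1. Record the hidden activations from the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach().clone()

handle = model[1].register_forward_hook(save_hook)
with torch.no_grad():
    clean_logits = model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, but splice the first half of the clean
#    hidden units back in. If the output moves toward the clean run's output,
#    those units causally carry the relevant information.
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, :4] = cache["hidden"][:, :4]
    return patched  # returning a tensor from a forward hook replaces the output

handle = model[1].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(corrupted_input)
handle.remove()

with torch.no_grad():
    corrupted_logits = model(corrupted_input)

print("clean:    ", clean_logits)
print("corrupted:", corrupted_logits)
print("patched:  ", patched_logits)  # partially shifted toward the clean logits
```

In practice the same idea is applied to individual attention heads, neurons, or learned features at specific token positions in a transformer, rather than to half of a toy hidden layer.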
Interpretability Is Reverse-Engineering, Like Biology
- Neural networks are not human-written programs; their behavior emerges from training, creating a reverse-engineering problem.
- Interpretability resembles biology: we must discover hierarchical abstractions to explain complex, distributed systems.