

The Utility of Interpretability — Emmanuel Ameisen
Jun 6, 2025
Emmanuel Ameisen, lead author of Anthropic's work on AI model interpretability, joins guest host Vibhu Sapra, an AI enthusiast with a background in economics and data science. They dive into new open-source tools for analyzing language model behavior and explain how circuit tracing makes models more interpretable. The two explore model complexity, the significance of feature interpretation, and the challenge of bias in AI systems. They also discuss the interplay between research and engineering roles, emphasizing transparency and safety in AI development.
AI Snips
Explore Multi-Hop Reasoning
- Explore how models perform multi-hop reasoning by studying circuit traces on smaller models like Gemma 2 and Llama.
- Use the open-source tools to examine and intervene on model features and understand how a token prediction is computed (a rough probe of the multi-hop idea is sketched below).
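To get a concrete feel for the kind of multi-hop prompt studied in the circuit-tracing work (e.g., "the capital of the state containing Dallas" resolving through Texas to Austin), here is a minimal logit-lens-style probe. This is not the circuit-tracing method itself, just a sketch under assumptions: the model name, layer paths, and token choices are illustrative, and any small Llama-style causal LM would do.

```python
# Minimal logit-lens-style sketch (NOT the circuit-tracer library itself):
# run a multi-hop prompt through a small open model and check at which layer
# the intermediate "Texas" hop becomes visible before the final "Austin"
# answer. Model name and module paths are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumption: any small Llama-style LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Fact: the capital of the state containing Dallas is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

    # Final next-token prediction at the last position.
    top = out.logits[0, -1].topk(5).indices
    print("top next tokens:", tok.batch_decode(top.unsqueeze(-1)))

    # Decode each intermediate residual stream with the unembedding matrix and
    # watch the logits of the bridge concept ("Texas") vs. the answer ("Austin").
    texas_id = tok(" Texas", add_special_tokens=False).input_ids[0]
    austin_id = tok(" Austin", add_special_tokens=False).input_ids[0]
    for layer, hidden in enumerate(out.hidden_states):
        h = model.model.norm(hidden[0, -1])  # final norm before unembedding
        logits = model.lm_head(h)
        print(f"layer {layer:2d}  Texas={logits[texas_id]:6.2f}  Austin={logits[austin_id]:6.2f}")
```

If the "Texas" logit rises at middle layers before "Austin" dominates at the top, that is the flavor of intermediate step the attribution graphs make explicit.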
Use Tools to Explore and Intervene
- Use the open-source circuit tracing UI and notebooks to generate and explore feature graphs on your own prompts.
- Run interventions such as suppressing or promoting features to test hypotheses about model behavior without needing expensive GPUs (see the hook-based sketch below).
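The intervention idea can be approximated outside the official tooling with a plain forward hook: add or subtract a feature direction in a layer's residual stream and watch how the next-token distribution shifts. This is a hedged sketch, not the circuit-tracer API; the layer index is arbitrary and the feature direction is a random placeholder standing in for a real transcoder/SAE decoder vector.

```python
# Sketch of a feature intervention via a forward hook: suppress (negative scale)
# or promote (positive scale) a single direction in the residual stream, then
# compare next-token predictions before and after. The direction is a stand-in
# for a trained feature's decoder vector, and the layer index is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumption: any small Llama-style LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 8                                   # illustrative layer
feature_dir = torch.randn(model.config.hidden_size)
feature_dir = feature_dir / feature_dir.norm()  # placeholder feature direction
scale = -10.0                                   # negative = suppress, positive = promote

def intervene(module, args, output):
    # Shift the residual stream at the final token position by the feature direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, -1, :] += scale * feature_dir.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

prompt = "Fact: the capital of the state containing Dallas is"
inputs = tok(prompt, return_tensors="pt")

def top_tokens():
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return tok.batch_decode(logits.topk(5).indices.unsqueeze(-1))

print("baseline:   ", top_tokens())
handle = model.model.layers[layer_idx].register_forward_hook(intervene)
print("intervened: ", top_tokens())
handle.remove()
```

Everything here runs on CPU for a 1B-parameter model, which is the point of the snip: hypothesis testing by intervention does not require large GPU budgets.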
Errors Reveal Uninterpreted Computation
- Circuit tracing graphs include error nodes, which represent the parts of the model's computation that the interpretable features do not explain.
- Some components, such as attention heads, remain uninterpreted, highlighting the current limits of interpretability (the toy calculation below shows how an error node arises).
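To make the error-node idea concrete, here is a toy calculation under assumed shapes: the error node is simply the gap between a layer's true output and what the sparse feature reconstruction accounts for. The weights and activations below are made up; only the arithmetic mirrors the real setup.

```python
# Toy illustration of an "error node": the slice of a layer's true output that
# the interpretable feature reconstruction does not explain. All values here
# are synthetic stand-ins, not real trained transcoder weights.
import torch

torch.manual_seed(0)
d_model, n_features = 64, 512

decoder = torch.randn(n_features, d_model)                  # toy decoder weights
feature_acts = torch.relu(torch.randn(n_features))
feature_acts *= (torch.rand(n_features) < 0.02).float()     # keep activations sparse

reconstruction = feature_acts @ decoder                     # what the features explain
true_output = reconstruction + 0.3 * torch.randn(d_model)   # pretend features cover most of it
error_node = true_output - reconstruction                   # what they don't

explained = 1 - error_node.norm() / true_output.norm()
print(f"fraction of output norm explained by features: {explained:.2f}")
```

The larger the error node, the more of the model's computation at that point stays uninterpreted, which is exactly the limitation the snip points to.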