

The Utility of Interpretability — Emmanuel Ameisen
Jun 6, 2025
Emmanuel Ameisen, lead author of Anthropic's work on AI model interpretability, joins guest host Vibhu Sapra, an AI enthusiast with a background in economics and data science. They dive into new open-source tools for analyzing language model behavior and explain how circuit tracing makes models more interpretable. The two explore model complexity, the significance of feature interpretation, and the challenge of bias in AI systems. They also discuss the interplay between research and engineering roles, emphasizing transparency and safety in AI development.
AI Snips
Explore Multi-Hop Reasoning
- Explore how models perform multi-hop reasoning by studying circuit traces on smaller models like Gemma 2 and Llama.
- Use the open-source tools to examine and intervene on model features and understand how a token prediction is computed (a rough probe of the multi-hop idea is sketched below).
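To get a concrete feel for the kind of multi-hop prompt studied in the circuit-tracing work (e.g., "the capital of the state containing Dallas" resolving through Texas to Austin), here is a minimal logit-lens-style probe. This is not the circuit-tracing method itself, just a sketch under assumptions: the model name, layer paths, and token choices are illustrative, and any small Llama-style causal LM would do.

```python
# Minimal logit-lens-style sketch (NOT the circuit-tracer library itself):
# run a multi-hop prompt through a small open model and check at which layer
# the intermediate "Texas" hop becomes visible before the final "Austin"
# answer. Model name and module paths are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumption: any small Llama-style LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Fact: the capital of the state containing Dallas is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

    # Final next-token prediction at the last position.
    top = out.logits[0, -1].topk(5).indices
    print("top next tokens:", tok.batch_decode(top.unsqueeze(-1)))

    # Decode each intermediate residual stream with the unembedding matrix and
    # watch the logits of the bridge concept ("Texas") vs. the answer ("Austin").
    texas_id = tok(" Texas", add_special_tokens=False).input_ids[0]
    austin_id = tok(" Austin", add_special_tokens=False).input_ids[0]
    for layer, hidden in enumerate(out.hidden_states):
        h = model.model.norm(hidden[0, -1])  # final norm before unembedding
        logits = model.lm_head(h)
        print(f"layer {layer:2d}  Texas={logits[texas_id]:6.2f}  Austin={logits[austin_id]:6.2f}")
```

If the "Texas" logit rises at middle layers before "Austin" dominates at the top, that is the flavor of intermediate step the attribution graphs make explicit.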
Use Tools to Explore and Intervene
- Use the open-source circuit tracing UI and notebooks to generate and explore feature graphs on your own prompts.
- Run interventions such as suppressing or promoting features to test hypotheses about model behavior without needing expensive GPUs (see the hook-based sketch below).
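The intervention idea can be approximated outside the official tooling with a plain forward hook: add or subtract a feature direction in a layer's residual stream and watch how the next-token distribution shifts. This is a hedged sketch, not the circuit-tracer API; the layer index is arbitrary and the feature direction is a random placeholder standing in for a real transcoder/SAE decoder vector.

```python
# Sketch of a feature intervention via a forward hook: suppress (negative scale)
# or promote (positive scale) a single direction in the residual stream, then
# compare next-token predictions before and after. The direction is a stand-in
# for a trained feature's decoder vector, and the layer index is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumption: any small Llama-style LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 8                                   # illustrative layer
feature_dir = torch.randn(model.config.hidden_size)
feature_dir = feature_dir / feature_dir.norm()  # placeholder feature direction
scale = -10.0                                   # negative = suppress, positive = promote

def intervene(module, args, output):
    # Shift the residual stream at the final token position by the feature direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, -1, :] += scale * feature_dir.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

prompt = "Fact: the capital of the state containing Dallas is"
inputs = tok(prompt, return_tensors="pt")

def top_tokens():
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return tok.batch_decode(logits.topk(5).indices.unsqueeze(-1))

print("baseline:   ", top_tokens())
handle = model.model.layers[layer_idx].register_forward_hook(intervene)
print("intervened: ", top_tokens())
handle.remove()
```

Everything here runs on CPU for a 1B-parameter model, which is the point of the snip: hypothesis testing by intervention does not require large GPU budgets.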
Errors Reveal Uninterpreted Computation
- Circuit tracing graphs include error nodes, which represent the parts of the model's computation that the interpretable features do not explain.
- Some components, such as attention heads, remain uninterpreted, highlighting the current limits of interpretability (the toy calculation below shows how an error node arises).
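To make the error-node idea concrete, here is a toy calculation under assumed shapes: the error node is simply the gap between a layer's true output and what the sparse feature reconstruction accounts for. The weights and activations below are made up; only the arithmetic mirrors the real setup.

```python
# Toy illustration of an "error node": the slice of a layer's true output that
# the interpretable feature reconstruction does not explain. All values here
# are synthetic stand-ins, not real trained transcoder weights.
import torch

torch.manual_seed(0)
d_model, n_features = 64, 512

decoder = torch.randn(n_features, d_model)                  # toy decoder weights
feature_acts = torch.relu(torch.randn(n_features))
feature_acts *= (torch.rand(n_features) < 0.02).float()     # keep activations sparse

reconstruction = feature_acts @ decoder                     # what the features explain
true_output = reconstruction + 0.3 * torch.randn(d_model)   # pretend features cover most of it
error_node = true_output - reconstruction                   # what they don't

explained = 1 - error_node.norm() / true_output.norm()
print(f"fraction of output norm explained by features: {explained:.2f}")
```

The larger the error node, the more of the model's computation at that point stays uninterpreted, which is exactly the limitation the snip points to.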