

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
Apr 1, 2024
Kevin Wang, a researcher in mechanistic interpretability, discusses reverse-engineering how GPT-2 small performs indirect object identification. The episode explores the circuit of 26 attention heads, grouped into 7 classes, that implements this behavior, the reliability of circuit-level explanations, and the feasibility of understanding large ML models. It delves into attention head behaviors, model architecture, and the search for mathematical, mechanistic explanations of language model behavior.
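For listeners who want to see the task concretely, below is a minimal sketch of an indirect object identification prompt run against GPT-2 small. It assumes the Hugging Face transformers and PyTorch packages (a tooling choice not prescribed by the episode); on a prompt like this, GPT-2 small should assign a higher logit to the indirect object " Mary" than to the repeated subject " John".

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 small (the 124M-parameter "gpt2" checkpoint).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When John and Mary went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# Single-token ids for the two candidate names (the leading space matters for GPT-2's BPE).
io_id = tokenizer.encode(" Mary")[0]
s_id = tokenizer.encode(" John")[0]

last = logits[0, -1]
print("logit difference (Mary - John):", (last[io_id] - last[s_id]).item())

A positive logit difference on prompts of this form is the behavior the 26-head circuit is meant to explain.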
Chapters
Introduction
00:00 • 3min
Analyzing Indirect Object Identification in Transformer-Based Language Models
03:06 • 5min
Detailed Exploration of GPT-2 Small, Attention Heads, and Model Architecture
07:57 • 3min
Exploring Circuits and Knockouts in Computational Models
10:57 • 2min
Analyzing Indirect Object Identification in GPT-2 Small
12:44 • 12min