AI Safety Fundamentals: Alignment

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

Apr 1, 2024
Kevin Wang, a researcher in mechanistic interpretability, discusses reverse-engineering how GPT-2 small performs indirect object identification, the task of completing a sentence such as "When Mary and John went to the store, John gave a drink to" with "Mary". The episode explores a circuit of 26 attention heads grouped into 7 classes, the reliability of such explanations, and the feasibility of understanding large ML models. It also covers attention head behaviors, model architecture, and the mathematical framing behind mechanistic interpretability in language models.