Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
Apr 1, 2024
Kevin Wang, a researcher in mechanistic interpretability, discusses reverse-engineering how GPT-2 small performs indirect object identification. The podcast explores the 26 attention heads, grouped into 7 classes, that make up the circuit; the reliability of the resulting explanation; and the feasibility of understanding large ML models. It also delves into individual attention head behaviors, the model's architecture, and mathematical explanations in mechanistic interpretability for language models.
Mechanistic interpretability explains behaviors of ML models by analyzing internal components like attention heads.
Understanding a complex behavior in GPT-2 small through causal interventions highlights both the challenges and the opportunities of scaling this understanding to large ML models.
Deep dives
Interpretability in Machine Learning Models
Research in mechanistic interpretability aims to explain the behaviors of machine learning models in terms of their internal components. The podcast discusses the challenge of understanding complex behaviors in models like GPT-2 small and presents an explanation for how the model performs a task called indirect object identification (IOI). Using interpretability approaches based on causal interventions, the researchers identify 26 attention heads, grouped into seven classes, that together explain the model's behavior on this task.
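To make the task concrete: given a prompt like "When Mary and John went to the store, John gave a drink to", GPT-2 small should favor the indirect object " Mary" over the repeated subject " John". The sketch below measures this preference as a logit difference using the Hugging Face transformers library rather than the authors' code; the prompt and metric follow the paper, but the snippet itself is only an illustration.

```python
# Illustrative only: measuring GPT-2 small's IOI preference as a logit
# difference, via the Hugging Face transformers library (not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

last = logits[0, -1]                  # next-token logits
io_id = tokenizer.encode(" Mary")[0]  # indirect object (assumes a single-token name)
s_id = tokenizer.encode(" John")[0]   # repeated subject
print("logit diff (IO - S):", (last[io_id] - last[s_id]).item())
```

A large positive logit difference means the model prefers the indirect object, which is the behavior the circuit analysis sets out to explain.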
Importance of Mechanistic Understanding
The podcast emphasizes the importance of mechanistically understanding large ML models in order to predict out-of-distribution behavior, identify errors, and understand emergent behavior. By studying how GPT-2 small implements a natural language task, the researchers use circuit analysis and causal interventions to uncover the subgraph of the model responsible for completing it. They introduce systematic techniques such as path patching to trace which components causally contribute to the behavior.
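As a rough illustration of this kind of causal intervention, the sketch below performs simplified activation patching with the TransformerLens library (the successor to the Easy-Transformer repo linked at the end): it runs a corrupted prompt while restoring one head's output from the clean run. The paper's path patching is more fine-grained, patching specific paths between components, and the layer/head choice here is only illustrative.

```python
# Simplified activation patching with TransformerLens; the paper's path
# patching is finer-grained, and the layer/head choice here is illustrative.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

clean = "When Mary and John went to the store, John gave a drink to"
corrupt = "When Mary and John went to the store, Alice gave a drink to"

clean_tokens = model.to_tokens(clean)
corrupt_tokens = model.to_tokens(corrupt)
assert clean_tokens.shape == corrupt_tokens.shape  # prompts must align token-for-token

with torch.no_grad():
    _, clean_cache = model.run_with_cache(clean_tokens)  # cache clean activations

LAYER, HEAD = 9, 9  # one head discussed in the paper; chosen here for illustration
hook_name = f"blocks.{LAYER}.attn.hook_z"  # per-head attention outputs

def patch_head(z, hook):
    # On the corrupted run, overwrite this head's output with its clean value.
    z[:, :, HEAD, :] = clean_cache[hook.name][:, :, HEAD, :]
    return z

with torch.no_grad():
    patched_logits = model.run_with_hooks(corrupt_tokens,
                                          fwd_hooks=[(hook_name, patch_head)])

# How much of the clean behavior does restoring this one head recover?
io, s = model.to_single_token(" Mary"), model.to_single_token(" John")
print("patched logit diff:", (patched_logits[0, -1, io] - patched_logits[0, -1, s]).item())
```

If restoring a single head's output recovers much of the clean logit difference, that head is a strong candidate for inclusion in the circuit.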
Evaluating Model Explanations
The podcast discusses how model explanations are evaluated using three criteria: faithfulness, completeness, and minimality. While these criteria support the explanation of GPT-2 small's behavior, they also point to remaining gaps in understanding. By presenting detailed insights into how the model performs the indirect object identification task, the researchers aim to validate the structural correspondence between the circuit-level explanation and the model itself, highlighting both the challenges and the opportunities for mechanistic interpretability in large ML models.
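One way to picture the three criteria is the toy sketch below. Here `score(H)` stands in for the paper's task metric (roughly, the average IOI logit difference when everything outside the head set H is mean-ablated); the `score` function is a runnable placeholder rather than the real measurement, and the listed heads are only illustrative.

```python
# Toy sketch of faithfulness, completeness, and minimality; score() is a
# placeholder for the real ablation-based metric, and the head sets are illustrative.
ALL_HEADS = {(layer, head) for layer in range(12) for head in range(12)}  # GPT-2 small
CIRCUIT = {(9, 6), (9, 9), (10, 0)}  # small stand-in for the paper's 26-head circuit

def score(heads):
    # Placeholder: a real experiment would run GPT-2 small with all heads
    # outside `heads` mean-ablated and return the average IOI logit difference.
    return len(heads & CIRCUIT) / len(CIRCUIT)

# Faithfulness: the circuit on its own should roughly match the full model.
faithfulness_gap = abs(score(CIRCUIT) - score(ALL_HEADS))

# Completeness: removing a subset K of the circuit should hurt the circuit
# about as much as removing K hurts the full model.
K = {(10, 0)}
completeness_gap = abs(score(CIRCUIT - K) - score(ALL_HEADS - K))

# Minimality: each head v in the circuit should matter for some subset K.
v = (9, 9)
minimality_effect = abs(score(CIRCUIT - K - {v}) - score(CIRCUIT - K))

print(faithfulness_gap, completeness_gap, minimality_effect)
```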
Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, pointing toward opportunities to scale our understanding to both larger models and more complex tasks. Code for all experiments is available at https://github.com/redwoodresearch/Easy-Transformer.