Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

AI Safety Fundamentals: Alignment

Analyzing Indirect Object Identification in GPT-2 Small

The chapter examines a circuit in GPT-2 Small that implements Indirect Object Identification (IOI), focusing on how attention heads interact and move information between tokens in a sentence. It discusses path patching, a technique for separating the direct effects of attention heads from their indirect effects, and uses the resulting changes in logit difference to identify critical pathways in the model's computation. The chapter also covers scaling issues, methods for studying attention head outputs, and the influence of Name Mover Heads on the output logits.
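As a rough illustration of the logit difference metric mentioned above, the sketch below computes the logit of the indirect object minus the logit of the subject, before and after a patch. The three-word vocabulary, the logit values, and the function name `logit_diff` are all made up for illustration and do not come from the episode or from a real model run.

```python
def logit_diff(logits, io_token, s_token):
    """Logit of the indirect object (IO) minus logit of the subject (S)."""
    return logits[io_token] - logits[s_token]

# Hypothetical toy vocabulary for a prompt such as
# "When Mary and John went to the store, John gave a drink to"
# where the correct next token is the indirect object, "Mary".
vocab = {"Mary": 0, "John": 1, "the": 2}

# Made-up final-position logits: one clean run, and one run where
# some attention head's contribution has been patched out.
clean_logits = [5.0, 2.0, 0.5]
patched_logits = [3.0, 2.5, 0.5]

clean = logit_diff(clean_logits, vocab["Mary"], vocab["John"])
patched = logit_diff(patched_logits, vocab["Mary"], vocab["John"])

# A large drop suggests the patched component matters for the task.
drop = clean - patched
print(f"clean={clean}, patched={patched}, drop={drop}")
```

A positive logit difference means the model prefers the indirect object over the subject; comparing it across clean and patched runs gives one number per intervention.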
