Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

AI Safety Fundamentals: Alignment

CHAPTER

Analyzing Indirect Object Identification in GPT-2 Small

This chapter examines the circuit in GPT-2 Small that implements Indirect Object Identification (IOI), focusing on how attention heads interact and move information between token positions in a sentence. It discusses path patching, a technique for separating the direct and indirect effects of attention heads, and uses changes in logit difference to identify the critical pathways in the model's computation. The chapter also covers scaling issues, methods for studying attention head outputs, and the influence of Name Mover Heads on the output logits.
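
As a rough illustration of the logit difference metric mentioned in the summary, the sketch below computes the logit of the indirect object minus the logit of the subject, before and after a hypothetical intervention. The function name `logit_difference`, the example sentence, and all numbers are made up for illustration; none of them come from the episode or from a real model run.

```python
def logit_difference(logits, io_token, s_token):
    """Return the logit of the indirect object minus the logit of the subject."""
    return logits[io_token] - logits[s_token]


# Hypothetical next-token logits for a prompt such as
# "When Mary and John went to the store, John gave a drink to"
# (toy values, not taken from GPT-2 Small).
clean_logits = {"Mary": 5.0, "John": 2.0, "the": 1.0}

# Hypothetical logits after patching one attention head's output.
patched_logits = {"Mary": 3.0, "John": 2.5, "the": 1.0}

clean_diff = logit_difference(clean_logits, "Mary", "John")
patched_diff = logit_difference(patched_logits, "Mary", "John")

# A large negative change suggests the patched head mattered for the task.
effect = patched_diff - clean_diff
print(clean_diff, patched_diff, effect)
```

In this toy setup, a drop in logit difference after patching is read as evidence that the patched component contributes to predicting the indirect object. Path patching applies the same comparison while restricting which downstream paths the patched activation is allowed to influence.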
