
Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
AI Safety Fundamentals: Alignment
Analyzing Indirect Object Identification in GPT-2-Small
The chapter examines a circuit in GPT-2 Small that implements Indirect Object Identification (IOI), focusing on how attention heads interact and move information between tokens in a sentence. It discusses path patching, a technique for separating the direct effect of an attention head on the output from its indirect effect through later heads, and uses the change in logit difference to identify the critical pathways in the model's computation. The chapter also covers scaling issues, methods for studying attention head outputs, and the influence of Name Mover Heads on the output logits.
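The logit difference mentioned above can be illustrated with a minimal sketch. For an IOI prompt such as "When Mary and John went to the store, John gave a drink to", it is the logit of the indirect object ("Mary") minus the logit of the subject ("John"). The function names and all numbers below are hypothetical toy values chosen for illustration, not results from the paper or the episode.

```python
def logit_difference(logits, io_token, s_token):
    """Return logit(indirect object) - logit(subject) for one prompt."""
    return logits[io_token] - logits[s_token]


def effect_of_patch(clean_logits, patched_logits, io_token, s_token):
    """Drop in logit difference caused by an intervention (e.g. patching a head)."""
    clean = logit_difference(clean_logits, io_token, s_token)
    patched = logit_difference(patched_logits, io_token, s_token)
    return clean - patched


# Hypothetical next-token logits for the prompt above (toy values).
clean_logits = {"Mary": 5.0, "John": 2.0, "the": 0.5}
# Hypothetical logits after patching one head's output (toy values).
patched_logits = {"Mary": 3.5, "John": 2.5, "the": 0.5}

clean_diff = logit_difference(clean_logits, "Mary", "John")
drop = effect_of_patch(clean_logits, patched_logits, "Mary", "John")
print(clean_diff, drop)
```

A large drop after an intervention suggests the patched component matters for the task; a drop near zero suggests it does not, at least along the pathway being patched.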