Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

AI Safety Fundamentals: Alignment

Introduction

This episode covers interpretability in machine learning models through a case study: the behavior of GPT-2 small on indirect object identification (IOI), the task of completing a sentence like "When Mary and John went to the store, John gave a drink to" with "Mary" rather than "John". It evaluates the reliability of the circuit-level explanation against three criteria, faithfulness, completeness, and minimality, uncovering gaps in current understanding while demonstrating the feasibility of mechanistic interpretability in large ML models.
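To make the task concrete, here is a minimal sketch of the IOI setup using the Hugging Face transformers library (the library choice and the specific prompt are illustrative assumptions, not the paper's exact tooling). It asks GPT-2 small to complete an IOI prompt and compares the logits for the indirect object versus the subject, the logit-difference metric the paper uses to measure task performance.

```python
# Minimal IOI sketch (assumes the Hugging Face transformers library;
# the prompt below is the canonical IOI example, but the exact setup
# here is illustrative, not the paper's code).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the final position

# Note the leading space: GPT-2's BPE treats " Mary" / " John" as single tokens.
io_id = tokenizer.encode(" Mary")[0]    # indirect object (correct completion)
subj_id = tokenizer.encode(" John")[0]  # subject (the distractor name)

logit_diff = (logits[io_id] - logits[subj_id]).item()
print(f"logit(' Mary') - logit(' John') = {logit_diff:.3f}")
# A positive logit difference means the model prefers the indirect object,
# i.e. it performs the IOI task correctly on this prompt.
```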
