The Nonlinear Library

AF - Compact Proofs of Model Performance via Mechanistic Interpretability by Lawrence Chan

Jun 24, 2024
Lawrence Chan discusses using mechanistic interpretability to create compact proofs of model performance. Topics include proof strategies for small transformers, the importance of mechanistic understanding for tighter bounds, challenges in scaling proofs, and structureless noise in model behavior.