
40 - Jason Gross on Compact Proofs and Interpretability
AXRP - the AI X-risk Research Podcast
00:00
Mechanistic Interpretability Unpacked
This chapter explores the balance between overestimation and underestimation in mechanistic interpretability and proposes a new scale for assessing how effective explanations are. It emphasizes the difficulty of aligning theoretical claims with neural network parameters and introduces crosscoders as a key advance in improving clarity within the field.