40 - Jason Gross on Compact Proofs and Interpretability

AXRP - the AI X-risk Research Podcast

Mechanistic Interpretability Unpacked

This chapter explores the balance of overestimations and underestimations in mechanistic interpretability, proposing a new scale for assessing the efficacy of explanations. It emphasizes the challenges of aligning theoretical claims with neural network parameters and introduces CrossCoders as a key advancement in enhancing clarity within the field.
