
“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

LessWrong (Curated & Popular)


Analyzing Auto-Interp Scores and SAE Fine-Tuning Impacts

This chapter compares Auto-Interp scores for IT-trained and IT-finetuned models and examines how those scores relate to the L0 norm. It highlights how hard interpretability is to assess under fine-tuning, and questions both whether the training modifications help and whether the results are statistically significant.
