“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

LessWrong (Curated & Popular)

CHAPTER

Analyzing Auto-Interp Scores and SAE Fine-Tuning Impacts

This chapter examines how Auto-Interp scores are evaluated by comparing IT-trained and IT-fine models, with a focus on how the scores relate to the L0 norm. It highlights the interpretability challenges that arise during fine-tuning and questions both the efficacy of the training modifications and the statistical significance of the results.
