
“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

LessWrong (Curated & Popular)

CHAPTER

Measuring Auto-Interpretability of SAE Latents

This chapter examines how auto-interpretability scores are evaluated for sparse autoencoder (SAE) latents, with a focus on techniques for handling high-frequency latents. The discussion covers modifications to the sparsity penalty in JumpReLU SAEs and how those adjustments affect interpretability and performance metrics.
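
For readers who want the mechanics behind that discussion, here is a minimal sketch of a JumpReLU SAE with an L0-style sparsity penalty, written in PyTorch. It is an illustrative reconstruction, not the GDM team's code or the episode's exact formulation: the `StepSTE` straight-through estimator, the rectangular-kernel bandwidth, the initialisation, and the `sparsity_coeff` value are all assumptions made for the example. A modification of the kind mentioned above, such as reweighting the penalty for latents that fire very frequently, would slot in where `l0_penalty` is computed.

```python
# Minimal sketch, assuming PyTorch; hyperparameters and names are illustrative,
# not taken from the episode or the GDM team's codebase.
import torch
import torch.nn as nn


class StepSTE(torch.autograd.Function):
    """Heaviside step with a straight-through-style pseudo-derivative.

    The forward pass gates latents that exceed their threshold; the backward
    pass uses a rectangular kernel of width `bandwidth` so the learnable
    thresholds receive a gradient signal despite the step being flat.
    """

    @staticmethod
    def forward(ctx, pre_acts, threshold, bandwidth):
        ctx.save_for_backward(pre_acts, threshold)
        ctx.bandwidth = bandwidth
        return (pre_acts > threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        pre_acts, threshold = ctx.saved_tensors
        eps = ctx.bandwidth
        # The gradient of H(z - theta) w.r.t. theta is approximated by
        # -K((z - theta) / eps) / eps with a rectangular kernel K.
        near = ((pre_acts - threshold).abs() < eps / 2).float() / eps
        grad_threshold = -(grad_output * near)
        # Sum over batch dimensions so the gradient matches threshold's shape.
        while grad_threshold.dim() > threshold.dim():
            grad_threshold = grad_threshold.sum(0)
        return torch.zeros_like(pre_acts), grad_threshold, None


class JumpReLUSAE(nn.Module):
    """Tiny JumpReLU sparse autoencoder with an L0-style sparsity penalty."""

    def __init__(self, d_model: int, d_sae: int, sparsity_coeff: float = 1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.log_threshold = nn.Parameter(torch.full((d_sae,), -2.0))
        self.sparsity_coeff = sparsity_coeff

    def forward(self, x: torch.Tensor):
        pre_acts = x @ self.W_enc + self.b_enc
        gate = StepSTE.apply(pre_acts, self.log_threshold.exp(), 1e-3)
        acts = pre_acts * gate              # JumpReLU: zero below threshold
        recon = acts @ self.W_dec + self.b_dec
        recon_loss = (x - recon).pow(2).sum(-1).mean()
        l0_penalty = gate.sum(-1).mean()    # expected number of active latents
        # A frequency-aware variant would reweight l0_penalty per latent here.
        loss = recon_loss + self.sparsity_coeff * l0_penalty
        return recon, acts, loss


# Usage: one backward pass on random activations (shapes are illustrative).
if __name__ == "__main__":
    sae = JumpReLUSAE(d_model=768, d_sae=8192)
    x = torch.randn(32, 768)
    _, _, loss = sae(x)
    loss.backward()
```

The straight-through estimator is the design choice doing the work here: the L0 count is piecewise constant in the thresholds, so without a surrogate gradient the sparsity term could not be trained at all.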
