LessWrong (Curated & Popular)

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

Apr 12, 2025
The team explores the challenges of using sparse autoencoders (SAEs) to detect harmful user intent in AI models. They discuss the surprising effectiveness of simple linear probes relative to SAEs, which raises questions about dataset biases in the evaluation. The technical discussion covers how interpretability scores are evaluated and how high-frequency latents affect performance. The conversation emphasizes the need for a deeper understanding of why SAEs fall short, rather than just chasing better performance metrics.
AI Snips
INSIGHT

SAE Limitations

  • SAEs lack a ground truth, which makes them difficult to evaluate.
  • They reveal structure in model activations, but struggle to give crisp explanations and miss some concepts (see the sketch after this snip).
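For readers unfamiliar with the architecture, here is a minimal sketch of a standard ReLU SAE with an L1 sparsity penalty, in the style common in mech interp work; the dimensions, penalty coefficient, and names are illustrative assumptions, not the team's exact setup. It also shows why the lack of ground truth bites: the loss only measures reconstruction and sparsity, never whether the latents match any true underlying concepts.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: reconstruct activations through a wide, sparse bottleneck."""
    def __init__(self, d_model=768, d_sae=16384):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse latent activations ("features")
        x_hat = self.dec(f)          # reconstruction of the model activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse latents.
    # Note there is no ground-truth dictionary in this objective, which is
    # why evaluation must lean on proxies (reconstruction, sparsity,
    # interpretability scores) or on downstream tasks.
    recon = (x - x_hat).pow(2).sum(-1).mean()
    sparsity = f.abs().sum(-1).mean()
    return recon + l1_coeff * sparsity
```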
ADVICE

Evaluating SAEs

  • Evaluate SAEs on downstream tasks that are objectively measurable.
  • Prioritize tasks where SAEs might have an edge, such as out-of-distribution generalization (one such evaluation is sketched below).
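One concrete instance of an objectively measurable downstream evaluation is 1-sparse probing: score each SAE latent individually on the task and report the best one, yielding a single number with no human judgment in the loop. This is only a sketch; the arrays, task, and sizes below are hypothetical stand-ins, not data from the episode.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical setup: `latents` holds SAE feature activations per prompt
# (n_prompts, d_sae) and `labels` is a downstream task (e.g. harmful intent).
rng = np.random.default_rng(0)
latents = np.maximum(rng.normal(size=(1000, 512)), 0.0)
labels = rng.integers(0, 2, size=1000)

# 1-sparse probing: treat each latent alone as a classifier score and keep
# the best. The result is an objectively measurable task metric per latent.
aucs = np.array([roc_auc_score(labels, latents[:, j])
                 for j in range(latents.shape[1])])
best = int(aucs.argmax())
print(f"best latent {best}: AUROC {aucs[best]:.3f}")
```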
INSIGHT

Linear Probe Effectiveness

  • Linear probes performed surprisingly well at detecting harmful intent, even out of distribution.
  • This suggests they could serve as cheap monitors for unsafe behavior (see the sketch below).
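A linear probe of the kind described here can be as simple as logistic regression on a layer's activations. The following is a minimal sketch under stated assumptions: `acts` and `labels` are hypothetical placeholders for residual-stream activations and harmful-intent labels, which in practice would come from the model being monitored.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical inputs: `acts` are activations at one layer (n_prompts, d_model);
# `labels` mark whether each prompt carries harmful intent.
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 768))
labels = rng.integers(0, 2, size=2000)

probe = LogisticRegression(max_iter=1000)
probe.fit(acts[:1500], labels[:1500])

# Score held-out (or out-of-distribution) prompts. A single linear probe is
# cheap enough to run as an always-on monitor for unsafe behavior.
scores = probe.predict_proba(acts[1500:])[:, 1]
print("AUROC:", roc_auc_score(labels[1500:], scores))
```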