

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah
Apr 12, 2025
The team explores the challenges of using sparse autoencoders (SAEs) to detect harmful user intent in AI models. They discuss the surprising effectiveness of simple linear probes compared to SAEs, which raises questions about possible dataset biases in the evaluation. Technical discussion covers how interpretability scores are evaluated and how high-frequency latents affect performance. The episode emphasizes the need for a deeper understanding of why SAEs fall short, rather than simply chasing better performance metrics.
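To make the high-frequency-latents point concrete, here is a minimal sketch of measuring how often each SAE latent fires; the encoder weights, activations, and the 10% threshold are placeholder assumptions for illustration, not values from the team's work.

```python
# Minimal sketch (placeholders throughout): measuring how often each SAE latent fires.
# In practice the encoder weights come from a trained SAE and the activations from a
# language model's residual stream; here both are random so the script runs standalone.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, n_tokens = 512, 4096, 10_000

W_enc = rng.normal(scale=0.02, size=(d_model, n_latents))  # placeholder encoder weights
b_enc = np.zeros(n_latents)                                 # placeholder encoder bias
acts = rng.normal(size=(n_tokens, d_model))                 # placeholder model activations

# SAE encoding: latent activations are ReLU(x @ W_enc + b_enc).
latents = np.maximum(acts @ W_enc + b_enc, 0.0)

# Firing frequency: fraction of tokens on which each latent is non-zero.
firing_freq = (latents > 0).mean(axis=0)

# "High-frequency" latents fire on a large fraction of tokens and tend to be hard
# to interpret; the 10% cutoff here is an arbitrary illustrative threshold.
high_freq = np.where(firing_freq > 0.10)[0]
print(f"{len(high_freq)} latents fire on more than 10% of tokens")
```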
AI Snips
SAE Limitations
- SAEs lack ground truth, making evaluation difficult.
- They reveal genuine structure, but struggle to produce crisp explanations and can miss concepts entirely.
Evaluating SAEs
- Focus on objectively measurable downstream tasks to evaluate SAEs.
- Prioritize tasks where SAEs might have an edge, such as out-of-distribution generalization (a minimal evaluation sketch follows below).
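As a rough illustration of that kind of downstream evaluation, here is a minimal sketch comparing a dense linear probe against a probe trained on SAE latents, scored on an out-of-distribution split; the synthetic data and the ReLU encoder are assumptions for illustration, not the team's actual setup.

```python
# Minimal sketch (synthetic data, placeholder SAE): compare a dense linear probe with a
# probe on SAE latents for a downstream classification task, evaluated out-of-distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model, n_latents = 256, 2048
signal = rng.normal(size=d_model)  # shared "harmful intent" direction in activation space

def make_split(n, shift=0.0):
    """Synthetic activations and binary labels; `shift` mimics a distribution shift."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d_model)) + np.outer(y, signal) * 0.5 + shift
    return X, y

X_train, y_train = make_split(2000)         # in-distribution training data
X_ood, y_ood = make_split(500, shift=0.3)   # out-of-distribution test data

# Placeholder SAE encoder: ReLU(x @ W_enc); a real comparison would use a trained SAE.
W_enc = rng.normal(scale=0.05, size=(d_model, n_latents))
def encode(X):
    return np.maximum(X @ W_enc, 0.0)

dense_probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
sae_probe = LogisticRegression(max_iter=1000).fit(encode(X_train), y_train)

print("dense probe OOD AUROC:     ", roc_auc_score(y_ood, dense_probe.predict_proba(X_ood)[:, 1]))
print("SAE-latent probe OOD AUROC:", roc_auc_score(y_ood, sae_probe.predict_proba(X_ood)[:, 1]))
```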
Linear Probe Effectiveness
- Linear probes performed surprisingly well in detecting harmful intent, even out-of-distribution.
- This suggests linear probes could serve as cheap monitors for unsafe behavior (a minimal monitoring sketch follows below).
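The cheap-monitoring idea can be sketched as a dot product and a threshold on the model's activations at inference time; the probe weights, activations, and threshold below are placeholders, not the team's monitor.

```python
# Minimal sketch (placeholder numbers): using a trained linear probe as a cheap runtime
# monitor for harmful intent. In practice `probe_w`/`probe_b` come from a probe fit
# offline and `acts` from the model's residual stream at some layer.
import numpy as np

rng = np.random.default_rng(0)
d_model = 256

probe_w = rng.normal(size=d_model)    # placeholder probe weights
probe_b = 0.0                         # placeholder probe bias
acts = rng.normal(size=(8, d_model))  # placeholder per-prompt activations (8 prompts)

def harmful_intent_score(x):
    """Probe score in [0, 1]: sigmoid of the probe's logit for one activation vector."""
    return 1.0 / (1.0 + np.exp(-(x @ probe_w + probe_b)))

THRESHOLD = 0.9  # arbitrary illustrative operating point
scores = np.array([harmful_intent_score(x) for x in acts])
flagged = np.where(scores > THRESHOLD)[0]
print("flagged prompt indices:", flagged, "scores:", np.round(scores[flagged], 3))
```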