

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah
Apr 12, 2025
The team explores the challenges of using sparse autoencoders (SAEs) to detect harmful user intent in AI models. They discuss the surprising effectiveness of simple linear probes compared to SAEs, which raises questions about possible dataset biases in the evaluation. Technical discussion covers how interpretability scores are evaluated and how high-frequency latents affect performance. The episode emphasizes the need for a deeper understanding of why SAEs fall short, rather than simply chasing better performance metrics.
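To make the high-frequency-latents point concrete, here is a minimal sketch of measuring how often each SAE latent fires; the encoder weights, activations, and the 10% threshold are placeholder assumptions for illustration, not values from the team's work.

```python
# Minimal sketch (placeholders throughout): measuring how often each SAE latent fires.
# In practice the encoder weights come from a trained SAE and the activations from a
# language model's residual stream; here both are random so the script runs standalone.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, n_tokens = 512, 4096, 10_000

W_enc = rng.normal(scale=0.02, size=(d_model, n_latents))  # placeholder encoder weights
b_enc = np.zeros(n_latents)                                 # placeholder encoder bias
acts = rng.normal(size=(n_tokens, d_model))                 # placeholder model activations

# SAE encoding: latent activations are ReLU(x @ W_enc + b_enc).
latents = np.maximum(acts @ W_enc + b_enc, 0.0)

# Firing frequency: fraction of tokens on which each latent is non-zero.
firing_freq = (latents > 0).mean(axis=0)

# "High-frequency" latents fire on a large fraction of tokens and tend to be hard
# to interpret; the 10% cutoff here is an arbitrary illustrative threshold.
high_freq = np.where(firing_freq > 0.10)[0]
print(f"{len(high_freq)} latents fire on more than 10% of tokens")
```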
AI Snips
SAE Limitations
- SAEs lack ground truth, making evaluation difficult.
- They reveal genuine structure, but struggle to produce crisp explanations and can miss concepts entirely.
Evaluating SAEs
- Focus on objectively measurable downstream tasks to evaluate SAEs.
- Prioritize tasks where SAEs might have an edge, such as out-of-distribution generalization (a minimal evaluation sketch follows below).
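As a rough illustration of that kind of downstream evaluation, here is a minimal sketch comparing a dense linear probe against a probe trained on SAE latents, scored on an out-of-distribution split; the synthetic data and the ReLU encoder are assumptions for illustration, not the team's actual setup.

```python
# Minimal sketch (synthetic data, placeholder SAE): compare a dense linear probe with a
# probe on SAE latents for a downstream classification task, evaluated out-of-distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model, n_latents = 256, 2048
signal = rng.normal(size=d_model)  # shared "harmful intent" direction in activation space

def make_split(n, shift=0.0):
    """Synthetic activations and binary labels; `shift` mimics a distribution shift."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d_model)) + np.outer(y, signal) * 0.5 + shift
    return X, y

X_train, y_train = make_split(2000)         # in-distribution training data
X_ood, y_ood = make_split(500, shift=0.3)   # out-of-distribution test data

# Placeholder SAE encoder: ReLU(x @ W_enc); a real comparison would use a trained SAE.
W_enc = rng.normal(scale=0.05, size=(d_model, n_latents))
def encode(X):
    return np.maximum(X @ W_enc, 0.0)

dense_probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
sae_probe = LogisticRegression(max_iter=1000).fit(encode(X_train), y_train)

print("dense probe OOD AUROC:     ", roc_auc_score(y_ood, dense_probe.predict_proba(X_ood)[:, 1]))
print("SAE-latent probe OOD AUROC:", roc_auc_score(y_ood, sae_probe.predict_proba(X_ood)[:, 1]))
```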
Linear Probe Effectiveness
- Linear probes performed surprisingly well in detecting harmful intent, even out-of-distribution.
- This suggests linear probes could serve as cheap monitors for unsafe behavior (a minimal monitoring sketch follows below).
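The cheap-monitoring idea can be sketched as a dot product and a threshold on the model's activations at inference time; the probe weights, activations, and threshold below are placeholders, not the team's monitor.

```python
# Minimal sketch (placeholder numbers): using a trained linear probe as a cheap runtime
# monitor for harmful intent. In practice `probe_w`/`probe_b` come from a probe fit
# offline and `acts` from the model's residual stream at some layer.
import numpy as np

rng = np.random.default_rng(0)
d_model = 256

probe_w = rng.normal(size=d_model)    # placeholder probe weights
probe_b = 0.0                         # placeholder probe bias
acts = rng.normal(size=(8, d_model))  # placeholder per-prompt activations (8 prompts)

def harmful_intent_score(x):
    """Probe score in [0, 1]: sigmoid of the probe's logit for one activation vector."""
    return 1.0 / (1.0 + np.exp(-(x @ probe_w + probe_b)))

THRESHOLD = 0.9  # arbitrary illustrative operating point
scores = np.array([harmful_intent_score(x) for x in acts])
flagged = np.where(scores > THRESHOLD)[0]
print("flagged prompt indices:", flagged, "scores:", np.round(scores[flagged], 3))
```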