“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

LessWrong (Curated & Popular)

CHAPTER

Intro

This chapter gives an update on the GDM mechanistic interpretability team's evaluation of sparse autoencoders (SAEs) on downstream tasks. It discusses the difficulties SAEs faced in detecting harmful intent and the shift in focus towards training chat-specific SAEs to improve dataset quality.
