Intro

This chapter gives an update on the mechanistic interpretability team's evaluation of sparse autoencoders (SAEs) on downstream tasks. It covers the difficulties SAEs had in detecting harmful intent and the team's shift in focus towards training chat-specific SAEs to improve dataset quality.

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

LessWrong (Curated & Popular)
