Analyzing Auto-Interp Scores and SAE Fine-Tuning Impacts

This chapter examines Auto-Interp scores by comparing IT-trained and IT-fine models, focusing on how the scores relate to the L0 norm. It highlights the challenges of interpretability in the fine-tuning process, and questions both the efficacy of the training modifications and the statistical significance of the results.

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

LessWrong (Curated & Popular)

