
Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

80,000 Hours Podcast


Navigating AI Interpretability with Sparse Autoencoders

This chapter explores the complexities of AI interpretability, focusing on sparse autoencoders (SAEs) and their limitations in revealing neural network behavior. The discussion highlights amusing instances like 'Golden Gate Claude' while critically assessing how effective SAEs have actually been at helping researchers understand AI models. It advocates diversifying research approaches to improve model interpretation and emphasizes the importance of task-specific strategies.
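
For readers unfamiliar with the technique being critiqued, the sketch below is a minimal, illustrative sparse autoencoder in PyTorch: it encodes a model's activations into a wider, sparsely active feature space and reconstructs them, trained with an L1 sparsity penalty. The dimensions, coefficient, and class names are assumptions for illustration, not the setup used by DeepMind or discussed in the episode.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: activations -> wide sparse features -> reconstructed activations."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # project into an overcomplete feature basis
        self.decoder = nn.Linear(d_features, d_model)   # map features back to activation space

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))       # non-negative, encouraged to be sparse
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty pushing most feature activations to zero.
    mse = (recon - acts).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Example usage on stand-in activations (a real run would use residual-stream activations).
sae = SparseAutoencoder(d_model=768, d_features=768 * 8)
acts = torch.randn(64, 768)
recon, features = sae(acts)
loss = sae_loss(recon, acts, features)
loss.backward()
```

The hope behind the method is that individual learned features correspond to human-interpretable concepts; the chapter's critique concerns how far that hope has held up in practice.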
