Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

80,000 Hours Podcast

Decoding AI: Recursive Self-Improvement and Interpretability

This chapter explores the challenges posed by recursive self-improvement in AI and its impact on mechanistic interpretability, including techniques like sparse autoencoders. It emphasizes how hard it is to extract meaningful concepts from models, comparing the process to solving a puzzle amid chaotic signals.
