80,000 Hours Podcast

Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

Sep 8, 2025
Neel Nanda, a researcher at Google DeepMind and a pioneer in mechanistic interpretability, dives into the enigmatic world of AI decision-making. He shares the alarming reality that fully grasping AI thoughts may be unattainable. Neel advocates for a 'Swiss cheese' model of safety, layering various safeguards rather than relying on a single solution. The complexities of AI reasoning, challenges in monitoring behavior, and the critical need for skepticism in research highlight the ongoing struggle to ensure AI systems remain trustworthy as they evolve.
ADVICE

Layer Cheap Monitors Into The Pipeline

  • Use interpretability to augment evaluation, monitoring, and incident analysis rather than treating it as a silver-bullet fix.
  • Build cheap monitors (e.g., activation probes) to flag risky thoughts, and layer multiple safeguards together (a minimal probe sketch follows this list).
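The sketch below is not from the episode; it is a minimal illustration of the kind of cheap probe monitor described above, assuming you already have hidden-state vectors from prompts labelled safe vs. risky. The activations, labels, and the 512-dimensional width here are synthetic stand-ins.

```python
# Illustrative sketch: a cheap linear probe that flags "risky" activations.
# The activation vectors and labels are synthetic stand-ins; in practice they
# would come from a model's hidden states on labelled prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512                      # hidden-state width (assumed)
n_train = 1000

# Synthetic "activations": risky examples are shifted along one direction.
risky_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_train)
acts = rng.normal(size=(n_train, d_model)) + np.outer(labels, risky_direction)

# A logistic-regression probe is cheap to train and cheap to run at inference.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)

def flag_risky(activation: np.ndarray, threshold: float = 0.9) -> bool:
    """Return True if the probe thinks this activation looks risky."""
    p = probe.predict_proba(activation.reshape(1, -1))[0, 1]
    return p > threshold

print(flag_risky(rng.normal(size=d_model) + risky_direction))  # likely True
print(flag_risky(rng.normal(size=d_model)))                    # likely False
```

Because a linear probe adds only a dot product per activation, it can run alongside normal inference as one layer of the "Swiss cheese" stack rather than replacing other safeguards.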
INSIGHT

Read The Model, Not Just Its Outputs

  • Mechanistic interpretability studies model internals (weights, activations) to explain how and why models produce their outputs.
  • Treat neural networks like biology: rich, emergent structure that can be probed causally and experimentally (see the hook sketch below).
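As a rough illustration of reading the model rather than just its outputs, the sketch below captures an internal activation with a PyTorch forward hook. The two-layer toy model is an assumption for brevity; the same hook pattern applies to a real transformer's attention or MLP layers.

```python
# Illustrative sketch: reading internal activations, not just outputs,
# via a forward hook on an intermediate layer of a toy model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Attach the hook to an internal layer rather than the final output.
model[1].register_forward_hook(save_activation("post_relu"))

x = torch.randn(8, 16)
logits = model(x)

# We now have both the behaviour (logits) and the internals (activations),
# which is what allows causal experiments such as patching or ablating them.
print(logits.shape)                 # torch.Size([8, 4])
print(captured["post_relu"].shape)  # torch.Size([8, 32])
```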
INSIGHT

From Ambition To Pragmatism

  • Ambitious reverse engineering of models likely won't yield full, robust guarantees about deception.
  • But mechanistic interpretability still delivers medium-impact, practical safety tools we should keep developing.