The paper reverse engineers the algorithm behind grokking in a one-layer transformer trained to perform modular addition. The authors discover a Fourier multiplication algorithm, which uses trigonometric identities and composition to perform modular addition. The model gradually transitions from memorization to generalization, with the help of regularization techniques like weight decay. The paper also highlights the importance of mechanistic understanding in disentangling memorization and generalization. The progress measures they develop shed light on the distinct phases of training: memorization, circuit formation, and cleanup, confirming that grokking is not sudden generalization but rather gradual generalization followed by a sudden cleanup that shows up as the sharp drop in test loss.
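To make the algorithm concrete, here is a minimal sketch (my own illustration, not the paper's code) of how trigonometric identities alone can compute a + b mod p: embed each input as sines and cosines at a set of frequencies, combine them with the angle-addition identities, and give each candidate answer c the score cos(w(a + b − c)), which peaks exactly at c = a + b mod p.

```python
# Minimal sketch (not the paper's code) of the Fourier multiplication algorithm
# for a + b mod p.
import numpy as np

p = 113                        # modulus used in the grokking paper
freqs = np.arange(1, p // 2)   # the trained model reportedly uses only a handful of key frequencies

def fourier_mod_add(a, b):
    c = np.arange(p)           # all candidate answers
    logits = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # cos(w(a+b)) and sin(w(a+b)) built only from cos/sin of a and b alone,
        # i.e. from terms an embedding layer could supply:
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))

assert fourier_mod_add(57, 80) == (57 + 80) % p   # 24
```

Roughly, in the trained model the embedding supplies the sine/cosine terms, attention and the MLP perform the multiplications, and the unembedding reads off the cosine logits, using only a few key frequencies rather than all of them.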
The research explores whether neural networks learn universal algorithms or idiosyncratic ones. It focuses on group operations and group representations, with modular addition as the specific case studied. The paper shows that models learn representations that correspond to group symmetries, and it weighs the case for intrinsic solutions and universality in neural network learning. However, different models, even ones differing only in random seed, can learn different representations, suggesting real variation in the simplicity and structure of the solutions found.
The study investigates the mechanism behind grokking and the presence of universality in neural networks. It covers the Fourier multiplication algorithm used by a one-layer transformer to perform modular addition, which involves trigonometric identities, composition, and group representations. The research demonstrates the progression from memorization to generalization, with regularization techniques playing a crucial role. It also explores the concept of universality: different training runs land on different group representations, highlighting the existence of intrinsic solutions while showing that "simplicity" is itself a subtle notion in neural network learning.
The paper discusses the phenomenon of superposition in neural networks: representing more features than a layer has neurons by assigning them to almost-orthogonal directions. The authors argue that superposition is a core aspect of how models function and that it exists in both computational and representational forms. They propose the superposition hypothesis to explain why models use superposition to balance the trade-off between representing more features and tolerating interference between them. The paper critiques existing research on superposition and presents empirical evidence from case studies in language models, showing that superposition is prevalent for simple features in early layers and decreases in middle layers for more complex features.
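The geometric fact behind the hypothesis can be illustrated with a toy calculation (my own, not from the paper): in d dimensions you can pack far more than d nearly orthogonal directions, so a layer can represent more features than it has neurons at the cost of a small amount of interference.

```python
# Toy calculation behind the superposition hypothesis: random unit vectors in
# d dimensions are nearly orthogonal, so many more than d "feature directions"
# can coexist with only small pairwise interference.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 100, 2000                   # 20x more features than dimensions
dirs = rng.normal(size=(n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

overlaps = np.abs(dirs @ dirs.T)
np.fill_diagonal(overlaps, 0.0)
print(f"mean overlap {overlaps.mean():.3f}, worst-case overlap {overlaps.max():.3f}")
# Roughly 0.08 mean and 0.5 worst case here: small enough that, if features are
# sparse (rarely active together), the interference is usually tolerable.
```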
The paper highlights the challenges and limitations of probing neural networks to understand their representations. It emphasizes the need for careful interpretation and consideration of the differences between linear and nonlinear probing techniques. The authors argue that linear probing is more reliable and meaningful for extracting representations, while nonlinear probing can be misleading. They also discuss the pitfalls of interpreting distributed representations and the difficulty in distinguishing between representational and computational superposition. Despite these challenges, the paper acknowledges the importance of probing in gaining insights into the inner workings of neural networks.
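A hedged sketch of the point about probe choice, with hypothetical activation and label files standing in for real data: both probes are trained on frozen activations, but only the linear probe's success is strong evidence that the feature is explicitly represented as a direction.

```python
# Hedged sketch of the linear-vs-nonlinear probing contrast; the .npy files are
# hypothetical placeholders for cached model activations and a binary feature.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

acts = np.load("layer6_resid_acts.npy")     # (n_examples, d_model), assumed precomputed
labels = np.load("feature_labels.npy")      # (n_examples,), binary feature labels
X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)

linear_probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
nonlinear_probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(X_tr, y_tr)

print("linear probe accuracy:   ", linear_probe.score(X_te, y_te))
print("nonlinear probe accuracy:", nonlinear_probe.score(X_te, y_te))
# If only the nonlinear probe succeeds, the probe may be computing the feature
# itself rather than reading off something the model explicitly represents.
```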
The paper presents an empirical study on superposition in language models, specifically focusing on sparse probing techniques. The study demonstrates that superposition exists in language models, and that the prevalence of features affects the occurrence of superposition. The authors explore the effects of activation range on superposition, highlighting the benefits of binary features over continuous features. Furthermore, the paper discusses the limitations and caveats of the study, including the emphasis on representational rather than computational superposition, and the need for further research on real models. Overall, the study contributes to our understanding of superposition and its implications in language models.
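In the same hypothetical setup, a sparse probe (here approximated with an L1 penalty, which is my simplification rather than the study's exact method) asks whether a feature is carried by a few dedicated neurons or smeared across many in superposition.

```python
# Probing individual MLP neurons with an L1 penalty (a simplification of the
# paper's k-sparse probes): how few neurons are enough to recover the feature?
import numpy as np
from sklearn.linear_model import LogisticRegression

neuron_acts = np.load("layer6_mlp_neuron_acts.npy")   # (n_examples, n_neurons), assumed precomputed
labels = np.load("feature_labels.npy")                # (n_examples,), binary feature labels

sparse_probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
sparse_probe.fit(neuron_acts, labels)

n_used = int(np.count_nonzero(sparse_probe.coef_))
print(f"accuracy {sparse_probe.score(neuron_acts, labels):.2f} using {n_used} neurons")
# A few neurons sufficing points to dedicated neurons; needing many points to a
# feature that is distributed, i.e. in superposition.
```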
Induction heads are a key circuit that enables in-context learning, allowing the model to track long-range dependencies in text. They emerge in a sudden phase transition during training. By mapping the current token and the token preceding each earlier position into the same latent space, an induction head can locate a previous occurrence of the current token and copy the token that followed it, letting the model efficiently represent and process information that spans many positions in the text. The discovery of induction heads provides a deeper understanding of how models learn and reason, and their formation is linked to the emergence of in-context learning.
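As a toy illustration (mine, not from the episode), the behaviour an induction head implements on repeated text is roughly "A B … A → predict B": find an earlier occurrence of the current token and copy whatever followed it.

```python
# Toy illustration of the behaviour an induction head implements: "A B ... A -> B".
def induction_prediction(tokens, pos):
    current = tokens[pos]
    for prev in range(pos - 1, -1, -1):       # scan backwards for an earlier occurrence
        if tokens[prev] == current:           # match on the current token
            return tokens[prev + 1]           # copy the token that followed it last time
    return None

seq = "the cat sat on the mat and the cat".split()
print(induction_prediction(seq, pos=8))       # -> "sat": completes "... the cat" as before
```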
Surgical interventions, such as activation patching and causal interventions, are valuable techniques for gaining insight into the inner workings of models. These interventions allow researchers to isolate and analyze specific circuits or components of the model. By causally intervening on the model, it becomes possible to identify necessary and sufficient components for specific tasks or behaviors. Metrics used in these interventions play a crucial role in the effectiveness of the analysis, and metrics like log probability can provide more accurate and surgical results compared to metrics like accuracy or rank. Surgical interventions are a promising avenue for advancing the understanding and interpretability of machine learning models.
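Here is a minimal activation-patching sketch in the spirit of what is described, using Neel Nanda's TransformerLens library; the prompts, the patched layer, and the log-probability metric are my illustrative assumptions rather than details from the episode.

```python
# Minimal activation-patching sketch with TransformerLens. Prompts, layer, and
# metric are illustrative choices, not taken from the episode.
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2-small")

clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
answer = model.tokenizer.encode(" Mary")[0]          # the clean prompt's correct completion

_, clean_cache = model.run_with_cache(clean)         # cache every activation on the clean run

def patch_resid(resid, hook):
    resid[:] = clean_cache[hook.name]                # overwrite with the clean activation
    return resid

layer = 6
hook_name = utils.get_act_name("resid_pre", layer)   # "blocks.6.hook_resid_pre"
patched_logits = model.run_with_hooks(corrupt, fwd_hooks=[(hook_name, patch_resid)])

# Surgical metric, as discussed: log P(" Mary") on the corrupted prompt after
# patching in the clean residual stream at this layer.
log_prob = patched_logits[0, -1].log_softmax(dim=-1)[answer]
print(f"patched log P(' Mary') at layer {layer}: {log_prob.item():.2f}")
```

Sweeping the patched layer (and position) and watching how the log probability recovers is what localizes which components are necessary or sufficient for the behaviour.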
Emergence, the sudden and unexpected change in model behavior or capabilities, is a topic of great interest and importance. Understanding the underlying mechanisms and circuits that drive emergent phenomena is crucial for predicting and comprehending how models learn and evolve. Induction heads, along with other circuits, have been linked to the emergence of specific behaviors. There is a need for further research and scientific investigation into emergent phenomena, including the development of better prediction techniques. This will enable the discovery and analysis of novel emergent capabilities and mitigate potential risks associated with these phenomena.
Understanding the capabilities and risks of AI is crucial to addressing its potential dangers. While there are many problems one could focus on, the podcast highlights the significance of AI's existential risks. Mechanistic interpretability, which aims to understand the internals of AI systems, including ones that may eventually be smarter than us, is viewed as a valuable approach to mitigating these risks. The emphasis is on discussing whether AI's catastrophic and existential risks are significant at all, rather than debating their ranking among other pressing global problems.
The podcast challenges misconceptions around AI goals and intelligence growth. It argues that the focus should not be on recursive self-improvement or a sudden intelligence explosion. Instead, attention should be directed towards concerns about aligning AI goals with human values and the potential dangers of systems that exhibit goal-directed behavior. The discussion addresses the debate around whether intelligence is a significant advantage and the ambiguity of intelligence growth. The importance of addressing these misconceptions and defining goals for AI systems is highlighted to ensure safety and responsible deployment.
In this wide-ranging conversation, Tim Scarfe interviews Neel Nanda, a researcher at DeepMind working on mechanistic interpretability, which aims to understand the algorithms and representations learned by machine learning models. Neel discusses how models can represent their thoughts using motifs, circuits, and linear directional features which are often communicated via a "residual stream", an information highway models use to pass information between layers.
Neel argues that "superposition", the ability for models to represent more features than they have neurons, is one of the biggest open problems in interpretability. This is because superposition thwarts our ability to understand models by decomposing them into individual units of analysis. Despite this, Neel remains optimistic that ambitious interpretability is possible, citing examples like his work reverse engineering how models do modular addition. However, Neel notes we must start small, build rigorous foundations, and not assume our theoretical frameworks perfectly match reality.
The conversation turns to whether models can have goals or agency, with Neel arguing they likely can, based on heuristics like models executing long-term plans towards some objective. However, we currently lack techniques to build models with specific goals, meaning any goals would likely be learned or emergent. Neel highlights how induction heads, circuits models use to track long-range dependencies, seem crucial for phenomena like in-context learning to emerge.
On the existential risks from AI, Neel believes we should avoid overly confident claims that models will or will not be dangerous, as we do not understand them enough to make confident theoretical assertions. However, models could pose risks through being misused, having undesirable emergent properties, or being imperfectly aligned. Neel argues we must pursue rigorous empirical work to better understand and ensure model safety, avoid "philosophizing" about definitions of intelligence, and focus on ensuring researchers have standards for what it means to decide a system is "safe" before deploying it. Overall, a thoughtful conversation on one of the most important issues of our time.
Support us! https://www.patreon.com/mlst
MLST Discord: https://discord.gg/aNPkGUQtc5
Twitter: https://twitter.com/MLStreetTalk
Neel Nanda: https://www.neelnanda.io/
TOC
[00:00:00] Introduction and Neel Nanda's Interests (walk and talk)
[00:03:15] Mechanistic Interpretability: Reverse Engineering Neural Networks
[00:13:23] Discord questions
[00:21:16] Main interview kick-off in studio
[00:49:26] Grokking and Sudden Generalization
[00:53:18] The Debate on Systematicity and Compositionality
[01:19:16] How do ML models represent their thoughts
[01:25:51] Do Large Language Models Learn World Models?
[01:53:36] Superposition and Interference in Language Models
[02:43:15] Transformers discussion
[02:49:49] Emergence and In-Context Learning
[03:20:02] Superintelligence/XRisk discussion
Transcript: https://docs.google.com/document/d/1FK1OepdJMrqpFK-_1Q3LQN6QLyLBvBwWW_5z8WrS1RI/edit?usp=sharing
Refs: https://docs.google.com/document/d/115dAroX0PzSduKr5F1V4CWggYcqIoSXYBhcxYktCnqY/edit?usp=sharing