19 - Mechanistic Interpretability with Neel Nanda

AXRP - the AI X-risk Research Podcast

The Modeling of Attention Heads in Image Models

The attention head is the bit of the model that's to do with routing information between the different token positions, and in part because this is just much more legible: you can literally look at the attention patterns in the model, the attention pattern being which previous positions the head thinks are most relevant to the current position. Okay. Yeah, it is very easy to be misled by these, but it does give you quite a lot of information. Sorry, why is it easy to be misled? The final answer to the question is we're just much better at interpreting the cognition around attention heads than we are about neurons.
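
To make "attention pattern" concrete, here is a minimal sketch, not taken from the episode, of how one causal attention head could turn query and key vectors into a pattern over the current and previous positions. It assumes standard scaled dot-product attention with a causal mask; the function name `causal_attention_pattern` and the toy vectors are made up for illustration.

```python
import math

def causal_attention_pattern(queries, keys):
    """Return one row per position; row i holds the weights that position i
    places on positions 0..i. Later positions are masked out (weight 0)."""
    d = len(queries[0])
    n = len(queries)
    pattern = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores against the current and earlier positions only.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        pattern.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return pattern

# Hypothetical query/key vectors for three token positions.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

pattern = causal_attention_pattern(queries, keys)
for row in pattern:
    print([round(w, 3) for w in row])
```

Each printed row is the kind of thing you would look at when inspecting a head: a set of weights over the visible positions that sums to 1.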


Episode notes

How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope at getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.

Topics we discuss, and timestamps:

- 00:01:05 - What is mechanistic interpretability?

- 00:24:16 - Types of AI cognition

- 00:54:27 - Automating mechanistic interpretability

- 01:11:57 - Summarizing the papers

- 01:24:43 - 'A Mathematical Framework for Transformer Circuits'

- 01:39:31 - How attention works

- 01:49:26 - Composing attention heads

- 01:59:42 - Induction heads

- 02:11:05 - 'In-context Learning and Induction Heads'

- 02:12:55 - The multiplicity of induction heads

- 02:30:10 - Lines of evidence

- 02:38:47 - Evolution in loss-space

- 02:46:19 - Mysteries of in-context learning

- 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'

- 02:50:57 - How neural nets learn modular addition

- 03:11:37 - The suddenness of grokking

- 03:34:16 - Relation to other research

- 03:43:57 - Could mechanistic interpretability possibly work?

- 03:49:28 - Following Neel's research

The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html

Links to Neel's things:

- Neel on Twitter: twitter.com/NeelNanda5

- Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1

- Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability

- TransformerLens: github.com/neelnanda-io/TransformerLens

- Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic

- Neel on YouTube: youtube.com/@neelnanda2469

- 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj

- Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

Writings we discuss:

- A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework/index.html

- In-context Learning and Induction Heads: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

- Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217

- Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/2212.14052

- interpreting GPT: the logit lens: lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

- Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/2202.05262

- Human-level play in the game of Diplomacy by combining language models with strategic reasoning: science.org/doi/10.1126/science.ade9097

- Causal Scrubbing: alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN

- An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143

- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/2211.00593

- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177

- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/2201.03544

- Collaboration & Credit Principles: colah.github.io/posts/2019-05-Collaboration

- Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/2012.14913

- Multi-Component Learning and S-Curves: alignmentforum.org/posts/RKDQCB6smLWgs2Mhr/multi-component-learning-and-s-curves

- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/1803.03635

- Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.mlr.press/v119/frankle20a
