Being ambitious about understanding the algorithms learned by neural networks is important. It is crucial to believe that there is structure inside these models that can be comprehended with effort and persistence, a mindset that pushes back against the view that understanding is impossible or simply not a priority in machine learning research.
A willingness to focus deeply on understanding one specific model, rather than trying to generalize across many models at once, is key. Different models may have different internal structures and algorithms, and exploring what is unique to each one can lead to deeper insights.
A commitment to truth-seeking and skepticism is crucial when conducting mechanistic interpretability research. Challenging assumptions, considering alternative hypotheses, and running rigorous experiments are vital to ensuring robust and accurate interpretations of the models.
Linear representations, where models encode meaningful features as directions in activation space, are posited as a plausible way to understand the internal workings of neural networks. Exploring the hypothesis that models use linear combinations of neurons to represent complex features can provide valuable insights into how models perceive and process information.
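To make the hypothesis concrete, here is a minimal numerical sketch (not from the episode) of what it means for features to be directions: feature values are written into an activation vector as scaled directions and read back with a dot product. The dimension, feature names, and values are all illustrative.

```python
import numpy as np

# Minimal sketch of the linear representation hypothesis (illustrative
# names and sizes): a "feature" is a direction in activation space, and
# its value is read off with a dot product.

rng = np.random.default_rng(0)
d_model = 64                                   # hypothetical activation width

f_gender = rng.normal(size=d_model)
f_gender /= np.linalg.norm(f_gender)
f_royalty = rng.normal(size=d_model)
f_royalty /= np.linalg.norm(f_royalty)

# An activation vector that linearly superposes the two features.
activation = 2.0 * f_gender + 0.5 * f_royalty

# Under the hypothesis, each feature's value is recovered by projection.
print(activation @ f_gender)   # ~2.0, plus small interference
print(activation @ f_royalty)  # ~0.5, plus small interference
```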
The podcast discusses the nature of high-dimensional spaces and how models operating in them can exhibit interference. In a high-dimensional space there are exponentially many almost-orthogonal directions, directions whose pairwise dot products are small but non-zero, so a model can pack in far more feature directions than it has dimensions. The result is that models represent sparse features with vectors that have non-trivial interference with one another. The discussion highlights the distinction between a neuron's input weights (which determine when it activates) and its output weights (which determine which features it boosts). Models learn to distinguish between overlapping features by using multiple neurons and by accumulating information in the residual stream.
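A quick illustrative check of that geometric claim (not code discussed in the episode): sampling far more random unit vectors than dimensions and measuring their pairwise dot products shows the directions are almost, but not exactly, orthogonal, which is exactly the interference being described.

```python
import numpy as np

# Illustrative check: random unit vectors in a high-dimensional space are
# nearly orthogonal, so far more directions than dimensions can coexist
# at the cost of small, non-zero interference.

rng = np.random.default_rng(0)
d, n = 512, 4096                    # 4096 candidate directions in 512 dims
dirs = rng.normal(size=(n, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

dots = dirs @ dirs.T
off_diag = np.abs(dots[~np.eye(n, dtype=bool)])
print("mean |dot product|:", off_diag.mean())   # roughly 0.035
print("max  |dot product|:", off_diag.max())    # roughly 0.25, well below 1
```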
The podcast explores the concept of superposition in language models. Superposition is a trade-off between representing more features and representing them without interference. Features in language models are often sparse: they are rare and seldom occur at the same time. The discussion cites a paper showing how language models detect compound words by effectively performing boolean operations on common sequences of tokens. Because the features are sparse, the model can compress many of them into fewer dimensions with little loss and still efficiently detect specific combinations of words.
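Here is a toy sketch of that trade-off, assuming nothing beyond the summary above: many sparse features are projected into a much smaller space through random directions, and because only a few are active at once, a simple linear read-out still separates them despite the interference. All sizes and indices are made up for illustration.

```python
import numpy as np

# Toy sketch of the superposition trade-off: 1000 sparse features are
# squeezed into 200 dimensions via random directions. Because only a few
# features are active at once, a linear read-out still separates active
# from inactive features despite the interference.

rng = np.random.default_rng(0)
n_features, d_model = 1000, 200

W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # one direction per feature

x = np.zeros(n_features)
x[[3, 250, 781]] = 1.0                          # three arbitrary active features

h = x @ W                                       # compressed representation
x_hat = h @ W.T                                 # naive linear read-out

active = x.astype(bool)
print("read-out on active features:", np.round(x_hat[active], 2))               # ~1
print("max read-out on inactive features:", np.round(x_hat[~active].max(), 2))  # well below 1
```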
The podcast introduces sparse probing, in which linear classifiers restricted to different numbers of neurons are trained to detect specific features; the goal is to understand how sparsely different features are represented in models. As the sparsity of features increases, detection of known combinations of words becomes more accurate, while detecting unseen combinations becomes more challenging. The discussion also mentions an experiment in which neurons are deleted to erase the model's knowledge of specific words.
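As a rough illustration of the probing idea (synthetic data, not the paper's actual setup or code), the sketch below plants a binary feature in a handful of "neurons", ranks neurons by a simple class-difference statistic, and trains logistic-regression probes restricted to the top-k neurons; probe accuracy grows as k increases.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rough sketch of sparse probing on synthetic "activations": plant a
# binary feature in 8 of 512 neurons, rank neurons by how different their
# mean activation is across the two classes, and train probes restricted
# to the top-k neurons.

rng = np.random.default_rng(0)
n_samples, n_neurons = 2000, 512
y = rng.integers(0, 2, size=n_samples)          # hypothetical binary feature
acts = rng.normal(size=(n_samples, n_neurons))
acts[:, :8] += 0.8 * y[:, None]                 # the feature lives in 8 neurons

mean_diff = acts[y == 1].mean(axis=0) - acts[y == 0].mean(axis=0)
ranking = np.argsort(-np.abs(mean_diff))

for k in (1, 4, 16, 64):
    idx = ranking[:k]
    probe = LogisticRegression(max_iter=1000).fit(acts[:1500, idx], y[:1500])
    acc = probe.score(acts[1500:, idx], y[1500:])
    print(f"k={k:3d}  probe accuracy = {acc:.2f}")
```

Zeroing out the top-ranked neurons before probing would be the ablation analogue of the neuron-deletion experiment mentioned above.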
The podcast unpacks the phenomenon of grokking, which involves three distinct phases: memorization, circuit formation, and cleanup. Models first memorize the training data, then transition into circuit formation, during which they learn a generalizing circuit while training performance stays fixed. Finally, during cleanup, they remove the parameters devoted to memorization, which is when test performance improves. Contrary to popular belief, grokking is not sudden generalization but a gradual transition from memorization to generalization.
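For context, the kind of experiment in which grokking is observed looks roughly like the sketch below (illustrative architecture and hyperparameters, not the paper's code): a small network trained on modular addition with a limited training split and strong weight decay, with train and test accuracy logged over a long run. Whether and when the transition appears depends heavily on these choices.

```python
import torch
import torch.nn as nn

# Compressed sketch of the kind of setup where grokking is studied:
# modular addition, a small training split, and strong weight decay.
# Hyperparameters are illustrative; real runs may need far more steps
# for the memorization -> circuit formation -> cleanup phases to show up.

P = 113
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train, test = perm[: len(pairs) // 3], perm[len(pairs) // 3 :]

model = nn.Sequential(
    nn.Embedding(P, 128),            # shared embedding for both operands
    nn.Flatten(start_dim=1),         # (a, b) -> concatenated embeddings
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train]), labels[train])
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            train_acc = (model(pairs[train]).argmax(-1) == labels[train]).float().mean()
            test_acc = (model(pairs[test]).argmax(-1) == labels[test]).float().mean()
        # Expect train accuracy to saturate early (memorization) and test
        # accuracy to climb only much later (circuit formation, then cleanup).
        print(f"step {step:6d}  train acc {train_acc:.2f}  test acc {test_acc:.2f}")
```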
Neel Nanda is a researcher at Google DeepMind working on mechanistic interpretability. He is also known for his YouTube channel where he explains what is going on inside of neural networks to a large audience.
In this conversation, we discuss what mechanistic interpretability is, how Neel got into it, his research methodology, and his advice for people who want to get started, as well as papers on superposition, toy models of universality, and grokking, among other things.
Youtube: https://youtu.be/cVBGjhN4-1g
Transcript: https://theinsideview.ai/neel
OUTLINE
(00:00) Intro
(00:57) Why Neel Started Doing Walkthroughs Of Papers On Youtube
(07:59) Induction Heads, Or Why Nanda Comes After Neel
(12:19) Detecting Induction Heads In Basically Every Model
(14:35) How Neel Got Into Mechanistic Interpretability
(16:22) Neel's Journey Into Alignment
(22:09) Enjoying Mechanistic Interpretability And Being Good At It Are The Main Multipliers
(24:49) How Is AI Alignment Work At DeepMind?
(25:46) Scalable Oversight
(28:30) Most Ambitious Degree Of Interpretability With Current Transformer Architectures
(31:05) To Understand Neel's Methodology, Watch The Research Walkthroughs
(32:23) Three Modes Of Research: Confirming, Red Teaming And Gaining Surface Area
(34:58) You Can Be Both Hypothesis Driven And Capable Of Being Surprised
(36:51) You Need To Be Able To Generate Multiple Hypotheses Before Getting Started
(37:55) All the theory is bullshit without empirical evidence and it's overall dignified to make the mechanistic interpretability bet
(40:11) Mechanistic interpretability is alien neuroscience for truth seeking biologists in a world of math
(42:12) Actually, Othello-GPT Has A Linear Emergent World Representation
(45:08) You Need To Use Simple Probes That Don't Do Any Computation To Prove The Model Actually Knows Something
(47:29) The Mechanistic Interpretability Researcher Mindset
(49:49) The Algorithms Learned By Models Might Or Might Not Be Universal
(51:49) On The Importance Of Being Truth Seeking And Skeptical
(54:18) The Linear Representation Hypothesis: Linear Representations Are The Right Abstractions
(00:57:26) Superposition Is How Models Compress Information
(01:00:15) The Polysemanticity Problem: Neurons Are Not Meaningful
(01:05:42) Superposition and Interference are at the Frontier of the Field of Mechanistic Interpretability
(01:07:33) Finding Neurons in a Haystack: Superposition Through De-Tokenization And Compound Word Detectors
(01:09:03) Not Being Able to Be Both Blood Pressure and Social Security Number at the Same Time Is Prime Real Estate for Superposition
(01:15:02) The Two Differences Of Superposition: Computational And Representational
(01:18:07) Toy Models Of Superposition
(01:25:39) How Mentoring Nine People at Once Through SERI MATS Helped Neel's Research
(01:31:25) The Backstory Behind Toy Models of Universality
(01:35:19) From Modular Addition To Permutation Groups
(01:38:52) The Model Needs To Learn Modular Addition On A Finite Number Of Token Inputs
(01:41:54) Why Is The Paper Called Toy Model Of Universality
(01:46:16) Progress Measures For Grokking Via Mechanistic Interpretability, Circuit Formation
(01:52:45) Getting Started In Mechanistic Interpretability And Which Walkthroughs To Start With
(01:56:15) Why Does Mechanistic Interpretability Matter From an Alignment Perspective
(01:58:41) How Detecting Deception With Mechanistic Interpretability Compares to Collin Burns' Work
(02:01:20) Final Words From Neel