AXRP - the AI X-risk Research Podcast

19 - Mechanistic Interpretability with Neel Nanda

Feb 4, 2023

Chapters

1. Introduction (00:00 • 2min)
2. What's Happening in the Final Neural Network? (02:17 • 2min)
3. Is Mechanistic Interpretability the Only Path to Understanding Neural Networks? (04:18 • 2min)
4. The Science of Deep Learning and Mechanistic Interpretability (06:10 • 5min)
5. Is There a Threshold for Not Publishing Mechanistic Interpretability Work? (11:05 • 5min)
6. How Scale-Invariant Should We Think of These Insights as Being? (16:10 • 2min)
7. Scaling Laws and Deep Learning (17:48 • 4min)
8. Scaling Laws Are Less Useful for AI X-Risk Reduction or AI Alignment (21:40 • 3min)
9. Is There a Spectrum of Cognitive Abilities? (24:15 • 2min)
10. Language Model Interpretability (25:48 • 5min)
11. The Second Messy Thing to Bear in Mind When Using Transformers (30:44 • 3min)
12. What's the Difference Between Sensory Reasoning and Processing? (33:25 • 2min)
13. Is the Eiffel Tower Located in Paris? (35:08 • 2min)
14. Using MLPs to Reverse Engineer the Network (37:15 • 3min)
15. The Modeling of Attention Heads in Image Models (40:00 • 5min)
16. Is It Possible to Do the Same in Image Models? (45:03 • 2min)
17. Using a Multilayer Perceptron to Train a Linear Map in a Vision Transformer (46:52 • 3min)
18. AI Learning How to Do What You're Doing? (49:59 • 2min)
19. Is Its Output Not Interpretable? (51:59 • 2min)
20. Automated Machine Learning and Machine Learning in a Neural Network (53:41 • 2min)
21. How Close Do You Think We Are to Automation at Any Level of the Spectrum? (56:10 • 4min)
22. Activation Patching Is a Great Way to Find Out What a Neuron Does (01:00:38 • 3min)
23. Using GPT-2 to Find a Neuron (01:03:35 • 2min)
24. Red-Teaming Mech Interp Research (01:05:58 • 4min)
25. How to Get Into the Field of Mechanistic Interpretability (01:09:57 • 2min)
26. The Three Papers in Which You've Helped Reverse Engineer a Network (01:11:49 • 3min)
27. Reverse Engineering a Transformer and Induction Heads (01:14:50 • 5min)
28. How to Train a Smaller Model to Grok Modular Addition (01:19:51 • 4min)
29. How to Get Higher Reward in a Way That You Didn't Think Possible (01:23:35 • 2min)
30. Anthropic's Contribution Statements (01:25:20 • 2min)
31. Is This Path Analysis Going to Be Too Unwieldy to Be Useful? (01:27:18 • 4min)
32. Reverse Engineering and Networks (01:31:12 • 3min)
33. A Softmax Is a Matrix of Keys and Values and Attention Heads (01:33:58 • 5min)
34. MLPs Are Really Hard, So What's Going On Here? (01:38:54 • 6min)
35. The Key Takeaway From This Paper Is That Attention Is a Parameterized Matrix (01:44:53 • 6min)
36. Using the Token Embeddings in a Model Is a Good Idea (01:50:41 • 2min)
37. Using Contextual Information in Model Composition (01:52:24 • 3min)
38. How to Use Q, K, and V Composition in Path Analysis (01:55:22 • 4min)
39. Induction Heads in a Two-Layer Model (01:58:53 • 3min)
40. The Induction Head in a Two-Layer Attention-Only Model (02:02:16 • 3min)
41. Q-Composition Is Using Prior Information to Figure Out Previously Computed Information (02:05:23 • 2min)
42. Why Do You Think Induction Was the First Thing to Emerge? (02:07:25 • 4min)
43. Short Text Learning and Induction Heads (02:11:05 • 2min)
44. Is There More Than One Induction Head? (02:12:54 • 2min)
45. Induction Heads Are Relevant to In-Context Learning? (02:15:18 • 4min)
46. Using Induction Heads to Match Translation Heads (02:19:48 • 2min)
47. How Do Induction Heads Work? (02:21:34 • 2min)
48. Indirect Object Identification (02:23:42 • 2min)
49. Are Induction Heads Different Kinds of Things? (02:25:34 • 2min)
50. How Many Induction Heads Do You Have? (02:27:14 • 2min)
51. Induction Heads Paper - Grokking (02:29:30 • 2min)
52. Is the Fourth Line of Evidence Really the Case? (02:31:03 • 2min)
53. The Correlation Between the Types of Evidence (02:33:30 • 3min)
54. The Induction Heads in Large Models Aren't as Important as They Used to Be (02:36:35 • 2min)
55. The Principal Component Analysis of Induction Heads (02:38:48 • 2min)
56. The Losses Depend on the Log Prob of the Correct Next Token (02:40:33 • 2min)
57. Is the Principal Axis of the Models Positive or Negative? (02:42:23 • 2min)
58. Is That the First Principal Component? (02:44:12 • 2min)
59. PCA - Is There a Kick Don't You? (02:45:43 • 2min)
60. How to Improve the Loss of a Token (02:47:17 • 2min)
61. Is There Any Light Shone on This Mystery? (02:49:22 • 2min)
62. ICLR - Random Loss in a One-Layer Transformer (02:50:58 • 3min)
63. The Modular Addition Algorithm (02:53:32 • 4min)
64. Modular Addition Algorithm (02:57:08 • 2min)
65. Yep and That's the Basis of This Algorithm (02:59:04 • 2min)
66. The Basic Algorithm of a Transformer (03:00:46 • 3min)
67. Using Mod-113 Arithmetic, You're Learning the Sine Function on 113 Data Points (03:04:00 • 2min)
68. Generalizing or Memorizing? (03:06:01 • 4min)
69. Using the Trigonometric Algorithm to Define a Second Progress Measure: Excluded Loss (03:09:46 • 2min)
70. Is There a Suspension in Test Loss? (03:11:38 • 2min)
71. Is There Something Weird About the Optimizer? (03:13:52 • 2min)
72. The Third Reason Why You Shouldn't Expect Phase Transitions With Adam-Based Optimizers (03:15:57 • 5min)
73. How to Train a Neural Network to Complete a Lottery Ticket Hypothesis (03:21:00 • 3min)
74. Is It a Phase Transition? (03:24:16 • 3min)
75. A Sharp Left Turn Is the New Hot Phrase for This (03:27:31 • 2min)
76. Modular Addition in a Toy One-Layer Transformer (03:29:42 • 4min)
77. What's Up With the MLP Layers? (03:34:09 • 5min)
78. Reverse Engineering Models on Reinforcement Learning Problems (03:38:39 • 5min)
79. Aren't RL Networks Just Fundamentally Not Interpretable? (03:43:59 • 5min)
80. How Can People Follow You on Twitter? (03:49:29 • 3min)