19 - Mechanistic Interpretability with Neel Nanda

1

Introduction

00:00 • 2min

2

What's Happening in the Final Neural Network?

02:17 • 2min

3

Is Mechanistic Interpretability the Only Path to Understanding Neural Networks?

04:18 • 2min

4

The Science of Deep Learning and Mechanistic Interpretability

06:10 • 5min

5

Is There a Threshold for Not Publishing Mechanistic Interpretability Work?

11:05 • 5min

6

How Scale Invariant Do You Think We Should Think of the Insights as Being?

16:10 • 2min

7

Scaling Laws and Deep Learning

17:48 • 4min

8

Scaling Laws Are Less Useful for AI X-Risk Reduction or AI Alignment

21:40 • 3min

9

Is There a Spectrum of Cognitive Abilities?

24:15 • 2min

10

Language Model Interpretability

25:48 • 5min

11

The Second Mesh Thing to Bear in Mind When Using Transformers

30:44 • 3min

12

What's the Difference Between Sensory Reasoning and Processing?

33:25 • 2min

13

Is the Eiffel Tower Located in Paris?

35:08 • 2min

14

Using MLPs to Reverse Engineer the Network

37:15 • 3min

15

The Modeling of Attention Heads in Image Models

40:00 • 5min

16

Is It Possible to Do the Same in Image Models?

45:03 • 2min

17

Using a Multilayer Perceptron to Train a Linear Map in a Vision Transformer

46:52 • 3min

18

AI Learning How to Do What Are You Doing?

49:59 • 2min

19

Is Its Output Not Interpretable?

51:59 • 2min

20

Automated Machine Learning and Machine Learning in a Neural Network

53:41 • 2min

21

How Close Do You Think We Are to Automation at Any Level of the Spectrum?

56:10 • 4min

22

Activation Patching Is a Great Way to Find Out What a Neuron Does

01:00:38 • 3min

23

Using GPT-2 to Find a Neuron

01:03:35 • 2min

24

Red Teaming Mech Interp Research

01:05:58 • 4min

25

How to Get Into the Field of Mechanistic Interpretability

01:09:57 • 2min

26

The Three Papers You've Helped Reverse Engineer a Network

01:11:49 • 3min

27

Reverse Engineering a Transformer and Induction Heads

01:14:50 • 5min

28

How to Train a Smaller Model to Grok Modular Addition

01:19:51 • 4min

29

How to Get Higher Reward in a Way That You Didn't Think Possible

01:23:35 • 2min

30

Anthropic Contributions Statements

01:25:20 • 2min

31

Is This Path Analysis Going to Be Too Unwieldy to Be Useful?

01:27:18 • 4min

32

Reverse Engineering and Networks

01:31:12 • 3min

33

A Softmax Is a Matrix of Keys and Values and Attention Heads

01:33:58 • 5min

34

MLPs Are Really Hot Yeah So What's Going on Here?

01:38:54 • 6min

35

The Key Takeaway From This Paper Is That Attention Is a Parameterized Matrix

01:44:53 • 6min

36

Using the Token Embeddings in a Model Is a Good Idea

01:50:41 • 2min

37

Using Contextual Information in Model Composition

01:52:24 • 3min

38

How to Use Q, K, and V Composition in Path Analysis

01:55:22 • 4min

39

Induction Heads in a Two-Layer Model

01:58:53 • 3min

40

The Induction Head in a Two-Layer Attention-Only Model

02:02:16 • 3min

41

Q-Composition Is Using Prior Information to Figure Out Previously Computed Information

02:05:23 • 2min

42

Why Do You Think Induction Was the First Thing at Tulum?

02:07:25 • 4min

43

In-Context Learning and Induction Heads

02:11:05 • 2min

44

Is There More Than One Induction Head?

02:12:54 • 2min

45

Are Induction Heads Relevant to In-Context Learning?

02:15:18 • 4min

46

Using Induction Heads to Match Translation Heads

02:19:48 • 2min

47

How Do Induction Heads Work?

02:21:34 • 2min

48

Indirect Identification

02:23:42 • 2min

49

Induction Heads Are Different Kinds of Things?

02:25:34 • 2min

50

How Many Induction Heads Do You Have?

02:27:14 • 2min

51

Induction Heads Paper - Grokking

02:29:30 • 2min

52

Is the Fourth Line of Evidence Really the Case?

02:31:03 • 2min

53

The Correlation Between the Types of Evidence

02:33:30 • 3min

54

The Induction Heads in Large Models Aren't as Important as They Used to Be

02:36:35 • 2min

55

The Principal Component Analysis of Induction Heads

02:38:48 • 2min

56

The Losses Depend on the Log Prob of the Correct Next Token

02:40:33 • 2min

57

Is the Principal Axis of the Models Positive or Negative?

02:42:23 • 2min

58

Is That the First Principal Component?

02:44:12 • 2min

59

PCA - Is There a Kick Don't You?

02:45:43 • 2min

60

How to Improve the Loss of a Token

02:47:17 • 2min

61

Is There Any Light Shone on This Mystery?

02:49:22 • 2min

62

ICLR - Random Loss in a One-Layer Transformer

02:50:58 • 3min

63

The Modular Addition Algorithm

02:53:32 • 4min

64

Modular Addition Algorithm

02:57:08 • 2min

65

Yep and That's the Basis of This Algorithm

02:59:04 • 2min

66

The Basic Algorithm of a Transformer

03:00:46 • 3min

67

Using a 113 Arithmetic Module You're Learning the Sine Function on 113 Data

03:04:00 • 2min

68

Generalizing or Memorizing?

03:06:01 • 4min

69

Using a Trigonometrical Algorithm I'm Defining a Second Progress Measure - Excluded Loss

03:09:46 • 2min

70

Is There a Suspension in Test Loss?

03:11:38 • 2min

71

Is There Something Weird About the Optimizer?

03:13:52 • 2min

72

The Third Reason Why You Shouldn't Expect Phase Transitions With Adam-Based Optimizers

03:15:57 • 5min

73

How to Train a Neural Network to Complete a Lottery Ticket Hypothesis

03:21:00 • 3min

74

Is It a Phase Transition?

03:24:16 • 3min

75

A Sharp Left Turn Is the New Hot Phrase for This

03:27:31 • 2min

76

How Much Addition in a Toy One Layer Transformer?

03:29:42 • 4min

77

What's Up With the MLP Layers?

03:34:09 • 5min

78

Reverse Engineering Models on Reinforcement Learning Problems

03:38:39 • 5min

79

Aren't Neural Networks Just Fundamentally Not Interpretable?

03:43:59 • 5min

80

How Can You Follow Me on Twitter?

03:49:29 • 3min