

Neel Nanda on Avoiding an AI Catastrophe with Mechanistic Interpretability
Feb 16, 2023
Chapters
Introduction
00:00 • 3min
Why Should Anyone Care About Machine Learning?
03:13 • 3min
Mechanistic Interpretability in Deep Learning - Multiple Results
06:39 • 2min
Toy Models of Superposition From Anthropic
08:27 • 5min
Are There Concepts That Humans Do Not Have That Could Be Found in Artificial Neural Networks?
13:04 • 2min
How Promising Is Mechanistic Interpretability?
15:30 • 5min
A Transformer Is a Sequence Modeling Architecture
20:04 • 2min
How to Predict What Comes After Apple?
22:23 • 2min
Induction Heads: The Most Thoroughly Reverse-Engineered Circuit in Mechanistic Interpretability So Far
24:05 • 2min
Reverse Engineering Induction Heads
26:14 • 5min
Using Induction Heads in Artificial Intelligence Models
31:14 • 2min
How Does Mechanistic Interpretability Help Reduce AI Risk?
32:50 • 4min
Is Mechanistic Interpretability a Part of AI Safety?
36:52 • 3min
Can Future Language Models Deceive Us?
39:58 • 3min
Is Mechanistic Interpretability Not Fast Enough?
43:27 • 5min
Could AIs Out-Compete Systems That Translate Their Reasoning to Humans?
48:12 • 5min
Is Mechanistic Interpretability Really Necessary?
52:48 • 3min
How to Get Into Mech Interp?
56:16 • 4min
Getting Into the Computer Science Field
59:52 • 2min