Future of Life Institute Podcast

Neel Nanda on Avoiding an AI Catastrophe with Mechanistic Interpretability

Feb 16, 2023
Chapters
1. Introduction (00:00 • 3 min)
2. Why Should Anyone Care About Machine Learning? (03:13 • 3 min)
3. Mechanistic Interpretability in Deep Learning: Multiple Results (06:39 • 2 min)
4. Toy Models of Superposition From Anthropic (08:27 • 5 min)
5. Are There Concepts That Humans Do Not Have That Could Be Found in Artificial Neural Networks? (13:04 • 2 min)
6. How Promising Is Mechanistic Interpretability? (15:30 • 5 min)
7. A Transformer Is a Sequence Modeling Thing (20:04 • 2 min)
8. How to Predict What Comes After "Apple"? (22:23 • 2 min)
9. Reverse-Engineered Induction Heads Within Mechanistic Interpretability So Far (24:05 • 2 min)
10. Reverse-Engineering Induction Heads (26:14 • 5 min)
11. Using Induction Heads in Artificial Intelligence Models (31:14 • 2 min)
12. How Does Mechanistic Interpretability Help Reduce AI Risk? (32:50 • 4 min)
13. Is Mechanistic Interpretability a Part of AI Safety? (36:52 • 3 min)
14. Can Future Language Models Deceive Us? (39:58 • 3 min)
15. Is Mechanistic Interpretability Not Fast Enough? (43:27 • 5 min)
16. Could AIs Out-Compete Systems That Translate to Humans? (48:12 • 5 min)
17. Is Mechanistic Interpretability Really Necessary? (52:48 • 3 min)
18. How to Get Into Mech Interp? (56:16 • 4 min)
19. Getting Into the Computer Science Field (59:52 • 2 min)