LessWrong (Curated & Popular)

"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.

May 18, 2023
Chapters
1. Introduction (00:00, 2 min)
2. How Activation Additions Affect GPT-2's Capabilities (01:49, 2 min)
3. How GPT-2-XL Modifies Forward Passes (04:03, 2 min)
4. GPT-2's Byte-Pair Encoding Tokenizer (05:38, 6 min)
5. The Relationship Between Activation Addition and Language Model Behavior (11:26, 4 min)
6. How to Find an Activation Addition That Leads to Improbable Completions (15:46, 2 min)
7. How Steering Vectors Impact GPT-2's Capabilities (17:48, 2 min)
8. The Unsteered Completions of Barack Obama (19:43, 2 min)
9. How to Deal With Death (21:31, 2 min)
10. How to Explain the Eiffel Tower (23:13, 3 min)
11. The Unsteered Completions of a Dragon (26:16, 2 min)
12. The Steering Vector for Talk About Weddings (28:01, 2 min)
13. The Unsteered Completions (30:10, 4 min)
14. The Effect of Steering Vectors on the Output of Weddings (34:25, 2 min)
15. Activation Additions Mess Up Output Tokens for Directly Modified Residual Streams (36:31, 2 min)
16. The Effects of Steering Vectors on Model Performance (38:30, 5 min)
17. Anger Minus Calm in Lowercase Doesn't Work at All (43:42, 2 min)
18. Anger and Random Vectors in GPT-2-XL (46:00, 5 min)
19. The Effect of Anger Steering Vectors on the Quality of Completions (50:51, 2 min)
20. The Effects of Steering Vectors on Anger (52:24, 2 min)
21. The Effect of Steering Vectors on Weddingness (54:38, 4 min)
22. How to Interpret GPT-2-XL Completions as Weddings (58:28, 2 min)
23. How Steering Vectors Impact GPT-2's Capabilities (01:00:02, 2 min)
24. The Effects of Activation Additions on GPT-2-XL's Next-Token Probabilities (01:02:01, 2 min)
25. The Effects of Activation on Next-Token Probabilities (01:04:31, 2 min)
26. The Effects of Steering Modification on Coherent Sentences (01:06:55, 2 min)
27. The Effects of Intervention on Model Capabilities (01:09:02, 2 min)
28. The Effects of Injection on Wedding Perplexity (01:10:55, 2 min)
29. The Effects of the Weddings Vector on Perplexity (01:12:27, 2 min)
30. Sentences About Shipping Aren't Changed (01:14:31, 3 min)
31. The Effects of Prompting on GPT-2-XL (01:17:35, 2 min)
32. How to Optimize Your Yelp Reviews for Maximum Perplexity (01:19:23, 3 min)
33. The Worst Vector Improves Perplexity on Negative-Sentiment Reviews (01:21:55, 2 min)
34. The Effect of Activation Additions on LLMs (01:23:57, 2 min)
35. Activation Additions Give Strong Evidence of Feature Linearity (01:25:44, 2 min)
36. GPT-2-XL Is Robust to Activation Noise (01:28:02, 3 min)
37. How to Live in a High-Promised World (01:31:05, 2 min)
38. The Importance of Activation Additions in Training Processes (01:33:09, 3 min)
39. The Importance of Activation Additions in Language Models (01:35:45, 3 min)
40. Editing Models With Task Arithmetic (01:38:44, 2 min)
41. Steering GPT-2-XL by Adding an Activation Vector (01:40:50, 2 min)