
LessWrong (Curated & Popular) "Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.
May 18, 2023
Chapters
Introduction
00:00 • 2min
How Activation Additions Affect GPT-2's Capabilities
01:49 • 2min
How GPT-2 XL Modifies Forward Passes
04:03 • 2min
GPT-2's Byte-Pair Encoding Tokenizer
05:38 • 6min
The Relationship Between Activation Addition and Language Model Behavior
11:26 • 4min
How to Find an Activation Addition That Leads to Improbable Completions
15:46 • 2min
How Steering Vectors Impact GPT-2's Capabilities
17:48 • 2min
The Unsteered Completions of Barack Obama
19:43 • 2min
How to Deal With Death
21:31 • 2min
How to Explain the Eiffel Tower
23:13 • 3min
The Unsteered Completions of a Dragon
26:16 • 2min
The Steering Vector for Talk About Weddings
28:01 • 2min
The Unsteered Completions
30:10 • 4min
The Effect of Steering Vectors on the Output of Weddings
34:25 • 2min
Activation Additions Mess Up Output Tokens for Directly Modified Residual Streams
36:31 • 2min
The Effects of Steering Vectors on Model Performance
38:30 • 5min
Anger Minus Calm in Lower Case Doesn't Work at All
43:42 • 2min
Anger and Random Vectors in GPT-2 XL
46:00 • 5min
The Effect of Anger Steering Vectors on the Quality of Completions
50:51 • 2min
The Effects of Steering Vectors on Anger
52:24 • 2min
The Effect of Steering Vectors on Weddingness
54:38 • 4min
How to Interpret GPT-2 XL Completions as Weddings
58:28 • 2min
How Steering Vectors Impact GPT-2's Capabilities
01:00:02 • 2min
The Effects of Activation Additions on GPT-2 XL's Next Token Probabilities
01:02:01 • 2min
The Effects of Activation on Next Token Probabilities
01:04:31 • 2min
The Effects of Steering Modification on Coherent Sentences
01:06:55 • 2min
The Effects of Intervention on Model Capabilities
01:09:02 • 2min
The Effects of Injection on Wedding Perplexity
01:10:55 • 2min
The Effects of the Weddings Vector on Perplexity
01:12:27 • 2min
Sentences About Shipping Aren't Changed
01:14:31 • 3min
The Effects of Prompting on GPT-2 XL
01:17:35 • 2min
How to Optimize Your Yelp Reviews for Maximum Perplexity
01:19:23 • 3min
The Worst Vector Improves Perplexity on Negative Sentiment Reviews
01:21:55 • 2min
The Effect of Activation Additions on LLMs
01:23:57 • 2min
Activation Additions Give Strong Evidence of Feature Linearity
01:25:44 • 2min
GPT-2-XL Is Robust to Activation Noise
01:28:02 • 3min
How to Live in a High-Promised World
01:31:05 • 2min
The Importance of Activation Additions in Training Processes
01:33:09 • 3min
The Importance of Activation Additions in Language Models
01:35:45 • 3min
Editing Models With Task Arithmetic
01:38:44 • 2min
Steering GPT-2 XL by Adding an Activation Vector
01:40:50 • 2min
