

High-Stakes Alignment via Adversarial Training [Redwood Research Report]
May 13, 2023
Chapters
Introduction
00:00 • 2min
How to Reduce Misalignment Risk in the High Stakes Setting
02:13 • 3min
Building a Classifier to Detect Injuries
05:12 • 4min
How to Use a Tool-Assisted Attack to Improve Your Classifier Score
09:20 • 2min
The Importance of Reliable Classifiers
11:47 • 2min
How to Train a Classifier to Generate Adversarial Examples
14:05 • 5min