
High-Stakes Alignment via Adversarial Training [Redwood Research Report]

AI Safety Fundamentals: Alignment


The Importance of Reliable Classifiers

After training on several thousand adversarial examples, the classifiers became significantly harder to break: the tool-assisted attack went from taking 13 minutes to taking 26 minutes per example. Later, more adversarially trained classifiers also defended against most of the adversarial data targeted at earlier classifiers. This suggests that additional adversarial training patches existing weaknesses without introducing many new ones, and it did not hurt in-distribution performance.
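The loop this describes is straightforward to state in code. Below is a minimal, hypothetical Python sketch of it; the names, signatures, and the `catch_rate` check are illustrative assumptions rather than the report's actual implementation (Redwood's setup used tool-assisted human attackers against a fine-tuned language-model classifier).

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: none of these names come from the report.
Example = str                              # a text snippet to classify
Classifier = Callable[[Example], bool]     # True = flagged as unsafe

def adversarial_training(
    train: Callable[[Sequence[Example]], Classifier],
    attack: Callable[[Classifier], List[Example]],
    clean_data: Sequence[Example],
    num_rounds: int,
) -> Tuple[Classifier, List[List[Example]]]:
    """Each round, attackers search for inputs the current classifier
    mislabels; those failures are folded into the training set before
    retraining, patching the holes the attack found."""
    data: List[Example] = list(clean_data)
    per_round_adv: List[List[Example]] = []
    classifier = train(data)
    for _ in range(num_rounds):
        adv = attack(classifier)   # e.g., a tool-assisted human search
        per_round_adv.append(adv)
        data.extend(adv)           # add the discovered failures
        classifier = train(data)   # retrain on the enlarged dataset
    return classifier, per_round_adv

def catch_rate(classifier: Classifier, adv_set: Sequence[Example]) -> float:
    """Fraction of an earlier round's adversarial examples the final
    classifier now catches (the cross-round check described above)."""
    return sum(classifier(x) for x in adv_set) / len(adv_set)
```

Under this framing, the quoted result corresponds to `catch_rate` being high when the final classifier is evaluated on `per_round_adv` sets collected against earlier classifiers, while attack time per new adversarial example roughly doubles.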
