Rethinking AI Training: Latent Adversarial Training

This chapter explores latent adversarial training as a method for addressing harmful capabilities that persist inside AI models. It discusses the challenge of rethinking fine-tuning so that developers can have high confidence in a model's worst-case performance under adversarial pressure. The conversation contrasts attacking a model in the input space with perturbing its internal activations in the latent space, with the aim of improving overall robustness.
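To make the input-space versus latent-space distinction concrete, here is a minimal sketch and not the method discussed in the episode. It assumes a hypothetical two-layer toy model with made-up weights (W1, w2), a squared-error loss, and a single signed-gradient step of size eps. The input-space attack perturbs x, while the latent-space attack perturbs the hidden activations h directly. In adversarial training, the model would then be trained on these perturbed cases; that training loop is omitted here.

```python
import math

# Hypothetical toy model: h = tanh(W1 @ x), output = w2 . h
W1 = [[0.5, -0.3], [0.8, 0.2]]
w2 = [1.0, -1.5]

def hidden(x):
    # Latent (hidden-layer) activations for input x.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def head(h):
    # Linear read-out from the latent activations.
    return sum(w * hi for w, hi in zip(w2, h))

def loss(out, y):
    return (out - y) ** 2

def sign(v):
    return (v > 0) - (v < 0)

def input_space_attack(x, y, eps):
    # One signed-gradient step on the input x.
    h = hidden(x)
    dout = 2 * (head(h) - y)
    # Backpropagate through the head and the tanh nonlinearity.
    dpre = [dout * w2[j] * (1 - h[j] ** 2) for j in range(len(h))]
    dx = [sum(dpre[j] * W1[j][i] for j in range(len(h))) for i in range(len(x))]
    return [xi + eps * sign(g) for xi, g in zip(x, dx)]

def latent_space_attack(x, y, eps):
    # One signed-gradient step applied directly to the activations h.
    h = hidden(x)
    dout = 2 * (head(h) - y)
    dh = [dout * w for w in w2]
    return [hi + eps * sign(g) for hi, g in zip(h, dh)]

x, y, eps = [1.0, 2.0], 0.0, 0.1
clean = loss(head(hidden(x)), y)
adv_input = loss(head(hidden(input_space_attack(x, y, eps))), y)
adv_latent = loss(head(latent_space_attack(x, y, eps)), y)
print(clean, adv_input, adv_latent)
```

In this toy setup both perturbations raise the loss above its clean value. The latent perturbation is not constrained to activations that some real input could produce, which is one reason the two kinds of attack can expose different failure modes.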

Play episode from 01:54:27
