Ep 14 - Interp, latent robustness, RLHF limitations w/ Stephen Casper (PhD AI researcher, MIT)

Artificial General Intelligence (AGI) Show with Soroush Pour

Rethinking AI Training: Latent Adversarial Training

This chapter explores latent adversarial training as a method for addressing harmful capabilities that persist inside AI models. It discusses the challenge of rethinking fine-tuning so that a model's worst-case performance under adversarial pressure can be trusted with high confidence. The conversation contrasts attacking a model in input space with perturbing its internal activations in latent space, with the aim of improving overall model robustness.
