Ep 14 - Interp, latent robustness, RLHF limitations w/ Stephen Casper (PhD AI researcher, MIT)

Artificial General Intelligence (AGI) Show with Soroush Pour

CHAPTER

Rethinking AI Training: Latent Adversarial Training

This chapter covers latent adversarial training, a method for addressing harmful capabilities that persist inside an AI model. It discusses why fine-tuning needs rethinking if we want high confidence in worst-case performance under adversarial pressure. The conversation contrasts attacking a model in the input space with perturbing its internal activations in the latent space, with the aim of improving overall model robustness.
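
The contrast between input-space attacks and latent-space perturbations can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the method discussed in the episode: the two-layer model, its weights (`W1`, `w2`), the squared-error loss, and the single signed-gradient step bounded by `eps` are all hypothetical stand-ins.

```python
import math

# Hypothetical toy model: x -> h = tanh(W1 x) -> y = w2 . h
W1 = [[0.5, -0.3], [0.8, 0.2]]  # illustrative weights, not from the episode
w2 = [1.0, -1.5]

def encode(x):
    """Map an input to its hidden (latent) activations."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def head(h):
    """Map hidden activations to a scalar output."""
    return sum(w * hi for w, hi in zip(w2, h))

def loss(y, target):
    return (y - target) ** 2

def sign(v):
    return (v > 0) - (v < 0)

def input_attack(x, target, eps):
    """One signed-gradient step on the input, each coordinate moved by eps."""
    h = encode(x)
    y = head(h)
    # Chain rule: dL/dx_j = sum_i 2 (y - target) * w2_i * (1 - h_i^2) * W1[i][j]
    grad = [
        sum(2 * (y - target) * w2[i] * (1 - h[i] ** 2) * W1[i][j]
            for i in range(len(h)))
        for j in range(len(x))
    ]
    return [xj + eps * sign(g) for xj, g in zip(x, grad)]

def latent_attack(h, target, eps):
    """One signed-gradient step applied directly to the hidden activations."""
    y = head(h)
    # dL/dh_i = 2 (y - target) * w2_i
    grad = [2 * (y - target) * w for w in w2]
    return [hi + eps * sign(g) for hi, g in zip(h, grad)]

x, target, eps = [1.0, 2.0], 0.0, 0.1
h = encode(x)
clean_loss = loss(head(h), target)
input_loss = loss(head(encode(input_attack(x, target, eps))), target)
latent_loss = loss(head(latent_attack(h, target, eps)), target)
print(clean_loss, input_loss, latent_loss)
```

In an adversarial training loop, the perturbed input or perturbed activations would then be fed back into the training loss; that step is omitted here.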