"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Universal Jailbreaks with Zico Kolter, Andy Zou, and Asher Trockman

Sep 22, 2023
In this discussion, Zico Kolter, a leading professor at Carnegie Mellon University, Andy Zou, a PhD candidate, and Asher Trockman explore universal adversarial attacks on language models. They cover the motivations behind these attacks and how simple tweaks to a prompt can disrupt model behavior. The conversation highlights the potential short-term harms and long-term risks of 'jailbreaking' AI, including implications for training data and the complexities of model responses. They also touch on the future of AI defenses in this evolving landscape.
02:17:07

Podcast summary created with Snipd AI

Quick takeaways

  • Adversarial attacks on language models can transfer across different models and prompts, which points to vulnerabilities rooted in the training data.
  • Defending language models against adversarial attacks is challenging, and traditional defenses often degrade model performance.

Deep dives

Transferability of Attacks on Language Models

This podcast episode discusses the surprising transferability of attacks on language models. The attacks were initially constructed on open-source models but were found to also work on commercial models like GPT-3.5 and Claude 2. They manipulate a model into producing responses it should refuse, such as instructions on how to build a bomb. That the attacks transfer across different models and prompts suggests the vulnerabilities are deeply embedded in the training data and behaviors of these language models. One possible underlying cause is the existence of non-robust features in the pre-training data.
