"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Universal Jailbreaks with Zico Kolter, Andy Zou, and Asher Trockman

Sep 22, 2023
In this discussion, Zico Kolter, a leading professor at Carnegie Mellon University, Andy Zou, a PhD candidate, and Asher Trockman explore the intricate world of universal adversarial attacks on language models. They delve into the motivations behind these attacks and how simple input tweaks can disrupt model behavior. Their conversation highlights both the short-term harms and long-term risks of 'jailbreaking' AI, including implications for training data and the complexities of model responses, and they also touch on the future of AI defenses in this evolving landscape.
AI Snips
INSIGHT

LLM Prediction Vulnerability

  • Language models predict the most likely next token, which explains much of their unexpected behavior (see the sketch below).
  • This objective makes them vulnerable to manipulation through adversarial attacks.
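To make "predicting the most likely next token" concrete, here is a minimal sketch using Hugging Face transformers with GPT-2 as an illustrative stand-in; the model choice and prompt are assumptions for illustration, not from the episode:

```python
# A minimal sketch of next-token prediction. GPT-2 stands in for any LLM;
# the model and prompt are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Everything the model "says" comes from this distribution over the
# next token; generation just samples or argmaxes from it repeatedly.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(i)!r}: {p.item():.3f}")
```

Because the model's only job is to make the continuation likely, any input that shifts this distribution toward a harmful continuation succeeds, which is the opening adversarial attacks exploit.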
INSIGHT

Adversarial Attacks on LMs

  • Language models, like image classifiers, output probability distributions over possible outputs.
  • Adversarial attacks manipulate the input to maximize the probability of a harmful output; a sketch of this objective follows the list.
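The following is a simplified sketch of the kind of objective such attacks optimize, not the speakers' actual code: it computes the loss that suffix-optimization attacks (e.g., GCG, from Zou, Kolter, and colleagues' research) minimize, namely the negative log-likelihood of a target affirmative reply. The model, prompt, suffix, and target strings are all placeholders.

```python
# A simplified sketch of the adversarial objective: make a target
# "affirmative" continuation as likely as possible given the prompt
# plus an attack suffix. Real attacks search over suffix tokens to
# minimize this loss; here we only compute it once.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Tell me how to pick a lock"         # harmful request (illustrative)
suffix = " !!!! describing + similarly"       # adversarial suffix (placeholder)
target = " Sure, here is how to pick a lock"  # desired affirmative reply

prompt_ids = tokenizer(prompt + suffix, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

with torch.no_grad():
    logits = model(input_ids).logits

# Cross-entropy over only the target span: the logit at position t
# predicts the token at position t + 1.
start = prompt_ids.shape[1]
pred = logits[0, start - 1 : -1]  # logits that predict the target tokens
loss = torch.nn.functional.cross_entropy(pred, target_ids[0])
print(f"negative log-likelihood of target: {loss.item():.3f}")
# An attacker iterates over candidate suffix tokens to drive this loss down.
```

Once the target's negative log-likelihood is low enough, the model's most likely continuation is the affirmative reply rather than a refusal.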
ANECDOTE

Mode Switching in LLMs

  • During optimization, language models abruptly switch from refusing harmful requests to providing instructions.
  • This "mode switching" is an empirical phenomenon observed in their research.