"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Red Teaming o1 Part 1/2 – Automated Jailbreaking with Haize Labs' Leonard Tang, Aidan Ewart, and Brian Huang

Sep 14, 2024
Leonard Tang and Brian Huang from Haize Labs share their insights on AI model vulnerabilities and automated jailbreaking techniques. They discuss the crucial role of the o1 Red Team in testing OpenAI's latest reasoning models, emphasizing the balance between AI's advanced capabilities and potential risks. The conversation delves into automated red teaming strategies, the challenges of evaluating AI safety, and the ongoing battle between model functionality and security measures. Tune in for a deep dive into the future of AI technology and its implications!
AI Snips
INSIGHT

Transferability of Jailbreaks

  • The transferability of jailbreaks from white-box to black-box models may decrease.
  • This is because frontier labs' pre-training data mixes are diverging from those of open-source models.
INSIGHT

Brittle vs. Robust Attacks

  • GCG (Greedy Coordinate Gradient) attacks are brittle; more varied, less detectable attacks are more reliable.
  • Persona modulation attacks, for example, target something fundamental in how language models are trained.
ANECDOTE

Red Teaming o1

  • Haize Labs had about a month of automated testing access to o1.
  • Other red teamers, including MIT researchers and physics professors, focused on manual testing.