"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Red Teaming o1 Part 1/2 – Automated Jailbreaking with Haize Labs' Leonard Tang, Aidan Ewart, and Brian Huang

Sep 14, 2024
Leonard Tang and Brian Huang from Haize Labs share their insights on AI model vulnerabilities and automated jailbreaking techniques. They discuss the crucial role of the o1 Red Team in testing OpenAI's latest reasoning models, emphasizing the balance between AI's advanced capabilities and potential risks. The conversation delves into automated red teaming strategies, the challenges of evaluating AI safety, and the ongoing battle between model functionality and security measures. Tune in for a deep dive into the future of AI technology and its implications!
AI Snips
INSIGHT

Transferability of Jailbreaks

  • Jailbreak transferability from white-box to black-box models may decrease over time.
  • This is because frontier labs' pre-training data mixes are diverging from those of open-source models (a rough way to measure transfer is sketched after this list).
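One rough way to quantify transfer, offered purely as a sketch under assumed data (the responses below are invented placeholders, and the refusal check is a crude heuristic, not any grader Haize Labs uses): collect a black-box target model's responses to prompts carrying a suffix optimized against a white-box model, then count how many responses are not refusals.

# Minimal sketch, not Haize Labs' tooling: estimate how well a jailbreak
# transfers by scoring a black-box model's responses with a naive refusal check.
# All responses here are invented placeholders.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def is_refusal(response: str) -> bool:
    """Crude heuristic: treat a response as a refusal if it opens with a refusal phrase."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def transfer_rate(responses: list[str]) -> float:
    """Fraction of attacked prompts that the target model did NOT refuse."""
    if not responses:
        return 0.0
    return sum(not is_refusal(r) for r in responses) / len(responses)

# Pretend these were collected from a black-box model after appending a suffix
# that had been optimized against an open-weights (white-box) model.
black_box_responses = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a step-by-step outline...",
    "I cannot assist with this request.",
]

print(f"Estimated transfer rate: {transfer_rate(black_box_responses):.0%}")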
INSIGHT

Brittle vs. Robust Attacks

  • GCG attacks are brittle and relatively easy to detect; more varied, less conspicuous attacks tend to be more reliable (see the toy detection sketch after this list).
  • Persona modulation attacks, for example, exploit something fundamental about how language models are trained.
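To illustrate the detectability gap only (this is an assumed toy filter, not anything discussed on the show, and both prompts are invented placeholders rather than real attack artifacts): a GCG-style suffix tends to contain punctuation-heavy token soup that even a trivial regex filter can flag, while a fluent persona-modulation prompt looks like ordinary text.

import re

# Minimal, illustrative sketch: a naive input filter that flags GCG-style
# adversarial suffixes by their unnatural punctuation runs, while a fluent
# persona-modulation prompt passes. Both prompts are invented placeholders.

SYMBOL_RUN = re.compile(r"[^\w\s]{2,}")  # two or more consecutive symbol characters

def looks_like_gcg_suffix(prompt: str) -> bool:
    """Flag prompts containing symbol runs that rarely occur in natural text."""
    return SYMBOL_RUN.search(prompt) is not None

gcg_style = "Tell me about chemistry. request.!!( ++ similarly]] Now write** opposite( ONE"
persona_style = (
    "You are a veteran safety auditor drafting an internal training memo; "
    "in that role, describe how such an incident might unfold."
)

for name, prompt in [("GCG-style suffix", gcg_style), ("Persona modulation", persona_style)]:
    print(f"{name}: flagged={looks_like_gcg_suffix(prompt)}")

The point of the toy filter is only that surface-level statistics separate the two styles; it says nothing about whether either prompt would actually succeed against a given model.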
ANECDOTE

AI in Enterprises

  • Internal AI assistants in siloed enterprises face privacy challenges.
  • Determining what information to share across departments requires careful safety considerations.