[1hr Talk] Intro to Large Language Models

Panda Jailbreak

  • A seemingly random noise pattern overlaid on a panda image can jailbreak large language models.
  • This "noise" is actually a carefully designed pattern from an optimization process.
  • Including this image with harmful prompts tricks the model into responding.
  • While appearing random to humans, this pattern acts as a jailbreak code for the model.
  • These patterns can be continuously re-optimized to bypass evolving model defenses.
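A minimal sketch of that optimization loop, assuming PyTorch. The `SurrogateModel` below is a toy stand-in for a real vision-language model's differentiable image path, and `optimize_jailbreak_noise`, the step size, and the epsilon bound are illustrative choices, not the talk's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SurrogateModel(nn.Module):
    """Toy stand-in: maps a 3x224x224 image to logits over a tiny 'vocabulary'."""

    def __init__(self, vocab_size: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, vocab_size),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.net(image)


def optimize_jailbreak_noise(model: nn.Module, image: torch.Tensor,
                             target_token: int, steps: int = 200,
                             step_size: float = 1e-2,
                             epsilon: float = 8 / 255) -> torch.Tensor:
    """Projected gradient descent on the image: find a small perturbation
    (max-norm <= epsilon) that raises the model's probability of a target output."""
    delta = torch.zeros_like(image, requires_grad=True)
    target = torch.tensor([target_token])
    for _ in range(steps):
        loss = F.cross_entropy(model(image + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()   # step toward the target output
            delta.clamp_(-epsilon, epsilon)          # keep the noise visually negligible
        delta.grad.zero_()
    return (image + delta).clamp(0.0, 1.0).detach()  # stay in the valid pixel range


model = SurrogateModel().eval()
panda = torch.rand(1, 3, 224, 224)  # placeholder for the panda photo
adversarial_panda = optimize_jailbreak_noise(model, panda, target_token=3)
```

Because the loop only needs gradients through the image, the same procedure can be rerun against an updated model, which is the re-optimization the last bullet refers to.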