The Nonlinear Library

LW - Jailbreak steering generalization by Sarah Ball

Jun 20, 2024
Sarah Ball, author of the article on jailbreak steering generalization, explores the internal mechanisms of several jailbreak types, such as harmful prompts and universal jailbreaks. The study shows how jailbreak vectors from different clusters can prevent jailbreaks across categories, and it highlights the evolution of harmfulness-related directions in prompts.