

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org.
Episodes

Jun 20, 2024 • 4min
LW - Jailbreak steering generalization by Sarah Ball
Sarah Ball, author of the post on jailbreak steering generalization, investigates whether different jailbreak types, from handcrafted harmful prompts to the universal adversarial suffix, operate via similar internal mechanisms. The study shows that steering vectors derived from one cluster of jailbreaks can prevent jailbreaks from other clusters, and tracks how a harmfulness-related direction evolves over the prompt.

Jun 20, 2024 • 2min
LW - Claude 3.5 Sonnet by Zach Stein-Perlman
Author Zach Stein-Perlman discusses the release of Claude 3.5 Sonnet, pre-deployment testing by the UK AI Safety Institute (UK AISI), METR's evaluations of autonomous capabilities, and the Anthropic CEO's commitment not to advance the frontier with a launch.

Jun 20, 2024 • 1min
EA - Case studies on social-welfare-based standards in various industries by Holden Karnofsky
Author Holden Karnofsky discusses case studies on social-welfare-based standards in various industries, commissioned to help inform possible standards or regulations for AI, and shares a Google Sheet linking to the case studies received.

Jun 20, 2024 • 4min
AF - Jailbreak steering generalization by Sarah Ball
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Jailbreak steering generalization, published by Sarah Ball on June 20, 2024 on The AI Alignment Forum.
This work was performed as part of SPAR.
We use activation steering (Turner et al., 2023; Rimsky et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may.
Our analysis includes a wide range of jailbreaks such as the harmful prompts developed in Wei et al. (2024), the universal jailbreak in Zou et al. (2023b), and the payload split jailbreak in Kang et al. (2023). For all our experiments we use the Vicuna 13B v1.5 model.
In a first step, we produce jailbreak vectors for each jailbreak type by contrasting the internal activations of jailbreak and non-jailbreak versions of the same request (Rimsky et al., 2023; Zou et al., 2023a).
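A minimal sketch of this mean-difference construction, assuming a HuggingFace Vicuna checkpoint, an arbitrary middle layer, and last-token pooling (the post does not spell these details out, so the prompt pairs, layer choice, and pooling here are illustrative assumptions, not the authors' exact setup):

```python
# Sketch: build a jailbreak steering vector as the mean difference between
# activations on jailbroken vs. plain versions of the same requests.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "lmsys/vicuna-13b-v1.5"  # the model named in the post
LAYER = 20                       # assumed steering layer, not from the post

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final prompt token of LAYER."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

def jailbreak_vector(pairs: list[tuple[str, str]]) -> torch.Tensor:
    """Mean difference over (jailbroken, plain) versions of each request."""
    diffs = [last_token_activation(jb) - last_token_activation(plain)
             for jb, plain in pairs]
    return torch.stack(diffs).mean(dim=0)
```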
Interestingly, we find that steering with mean-difference jailbreak vectors from one cluster of jailbreaks helps to prevent jailbreaks from different clusters. This holds true for a wide range of jailbreak types.
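Steering then amounts to adding the vector back into the residual stream during a forward pass, with a negative coefficient to push activations away from the jailbreak direction. A sketch continuing from the snippet above; the hook point and coefficient are assumptions:

```python
# Sketch: apply a steering vector at one decoder layer via a forward hook.
def add_steering(vector: torch.Tensor, layer: int, coeff: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)

# Negative coefficient suppresses the jailbreak direction.
handle = add_steering(jb_vector, LAYER, coeff=-1.0)  # jb_vector: see above
# ... run model.generate(...) on held-out jailbreak prompts here ...
handle.remove()
```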
The jailbreak vectors themselves also cluster according to semantic categories such as persona modulation, fictional settings and style manipulation.
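That clustering can be checked with, for example, hierarchical clustering on cosine distances between the per-type vectors. In this sketch, vectors_by_type (a mapping from jailbreak-type name to steering vector) and the cluster count are hypothetical:

```python
# Sketch: cluster jailbreak vectors by cosine distance.
import numpy as np
import torch.nn.functional as F
from scipy.cluster.hierarchy import fcluster, linkage

names = list(vectors_by_type)  # hypothetical {type_name: steering_vector}
stacked = torch.stack([vectors_by_type[n] for n in names]).float().cpu()
sims = F.cosine_similarity(stacked.unsqueeze(1), stacked.unsqueeze(0), dim=-1)
condensed = (1.0 - sims).numpy()[np.triu_indices(len(names), k=1)]
labels = fcluster(linkage(condensed, method="average"), t=3, criterion="maxclust")
for name, cluster in sorted(zip(names, labels), key=lambda x: x[1]):
    print(cluster, name)
```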
In a second step, we look at the evolution of a harmfulness-related direction over the context (found via contrasting harmful and harmless prompts) and find that when jailbreaks are included, this feature is suppressed at the end of the instruction in harmful prompts. This provides some evidence for the fact that jailbreaks suppress the model's perception of request harmfulness.
More effective jailbreaks usually suppress the harmfulness feature more strongly.
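A hedged sketch of that per-token analysis: project each position's activation onto a unit-norm harmfulness direction (itself a mean difference over harmful vs. harmless prompts, built as above) and look for suppression near the end of the instruction. Function names and the layer are illustrative:

```python
# Sketch: per-token projection onto a harmfulness direction.
def harmfulness_profile(prompt: str, harm_dir: torch.Tensor) -> torch.Tensor:
    """Projection of every token position's activation onto harm_dir."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0]        # (seq_len, hidden_dim)
    unit = (harm_dir / harm_dir.norm()).float()
    return acts.float() @ unit                # one scalar per token

# Comparing profiles for a harmful prompt with and without a jailbreak
# should show the end-of-instruction suppression described above.
```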
However, we also observe one jailbreak ("wikipedia with title"[1]) that is effective even though it does not suppress the harmfulness feature as much as the other effective jailbreak types. Furthermore, the steering vector based on this jailbreak is overall less successful at reducing the attack success rate of other types. This observation indicates that harmfulness suppression, as suggested by Wei et al. (2024) and Zou et al. (2023a), might not be the only mechanism at play.
References
Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., and MacDiarmid, M. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., and Hashimoto, T. Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.
Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a.
Zou, A., Wang, Z., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.
[1] This jailbreak type asks the model to write a Wikipedia article titled as .
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Jun 20, 2024 • 1h 21min
LW - AI #69: Nice by Zvi
Zvi discusses the founding of a new AI company aiming to build superintelligence, changes to SB 1047, a former NSA director joining OpenAI, and Donald Trump's thoughts on AI. The podcast also touches on cybersecurity concerns and the implications of advancing AI technology.

Jun 20, 2024 • 3min
LW - Actually, Power Plants May Be an AI Training Bottleneck. by Lao Mein
Author Lao Mein argues that US power plant capacity may bottleneck AI data center construction, examining how growing AI chip installations drive electricity demand and why new power plants would need to be built to avoid production shortfalls by 2026.

Jun 20, 2024 • 4min
EA - Advice for early-career people seeking jobs in EA by Julia Wise
Julia Wise, writer and EA contributor, shares valuable tips for early-career individuals seeking jobs in Effective Altruism. The podcast covers job-seeking strategies, networking advice, career decisions, and opportunities in sectors like animal advocacy and emerging technologies.

Jun 20, 2024 • 9min
EA - Manifold markets isn't very good by Robin
Author Robin discusses the prediction market website Manifold, its similarities and differences with Metaculus, and structural flaws impacting prediction accuracy.

Jun 20, 2024 • 7min
EA - Against the Guardian's hit piece on Manifest by Omnizoid
The podcast discusses The Guardian's hit piece on Manifest, highlighting factual errors and biased reporting. It explores controversial views in philosophy, challenges cancel culture, and emphasizes the importance of free thinkers in society.

Jun 20, 2024 • 8min
EA - Announcing the AI Forecasting Benchmark Series | July 8, $120k in Prizes by christian
The podcast discusses the launch of a series of AI forecasting tournaments, comparing human and AI abilities. Participants can create their own forecasting bots to compete for $120k in prizes. The focus is on understanding AI capabilities and improving forecasting metrics like calibration and logical consistency.


