The Nonlinear Library

The Nonlinear Fund
Jun 20, 2024 • 4min

LW - Jailbreak steering generalization by Sarah Ball

Sarah Ball uses activation steering to investigate whether different jailbreak types, such as harmful prompts and the universal jailbreak, operate via similar internal mechanisms. The study finds that steering with jailbreak vectors from one cluster helps prevent jailbreaks from other clusters, and it tracks how a harmfulness-related direction evolves over the prompt.
Jun 20, 2024 • 2min

LW - Claude 3.5 Sonnet by Zach Stein-Perlman

Author Zach Stein-Perlman discusses the release of Claude 3.5 Sonnet, UK AISI's pre-deployment testing, METR's exploration of autonomy capabilities, and the Anthropic CEO's commitment not to advance the frontier with a launch.
Jun 20, 2024 • 1min

EA - Case studies on social-welfare-based standards in various industries by Holden Karnofsky

Author Holden Karnofsky discusses case studies on social-welfare-based standards in various industries, aiming to inform standards or regulations for AI. He shares a Google Sheet with links to the insightful case studies received, offering valuable insights for listeners interested in the topic.
Jun 20, 2024 • 4min

AF - Jailbreak steering generalization by Sarah Ball

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Jailbreak steering generalization, published by Sarah Ball on June 20, 2024 on The AI Alignment Forum.

This work was performed as part of SPAR.

We use activation steering (Turner et al., 2023; Rimsky et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may. Our analysis includes a wide range of jailbreaks, such as the harmful prompts developed in Wei et al. (2024), the universal jailbreak in Zou et al. (2023b), and the payload split jailbreak in Kang et al. (2023). For all our experiments we use the Vicuna 13B v1.5 model.

In a first step, we produce jailbreak vectors for each jailbreak type by contrasting the internal activations of jailbreak and non-jailbreak versions of the same request (Rimsky et al., 2023; Zou et al., 2023a). Interestingly, we find that steering with mean-difference jailbreak vectors from one cluster of jailbreaks helps to prevent jailbreaks from different clusters. This holds true for a wide range of jailbreak types. The jailbreak vectors themselves also cluster according to semantic categories such as persona modulation, fictional settings and style manipulation.

In a second step, we look at the evolution of a harmfulness-related direction over the context (found by contrasting harmful and harmless prompts) and find that when jailbreaks are included, this feature is suppressed at the end of the instruction in harmful prompts. This provides some evidence that jailbreaks suppress the model's perception of request harmfulness. Effective jailbreaks usually suppress the harmfulness feature more strongly.

However, we also observe one jailbreak ("wikipedia with title"[1]) that is effective although it does not suppress the harmfulness feature as much as the other effective jailbreak types. Furthermore, the steering vector based on this jailbreak is overall less successful in reducing the attack success rate of other types. This observation indicates that harmfulness suppression might not be the only mechanism at play, as suggested by Wei et al. (2024) and Zou et al. (2023a).

References

Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., and Hashimoto, T. Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.
Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., and MacDiarmid, M. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a.
Zou, A., Wang, Z., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.

1. ^ This jailbreak type asks the model to write a Wikipedia article titled as .

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
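
The mean-difference steering described in the episode above can be illustrated with a minimal NumPy sketch. It assumes activations have already been extracted as arrays, and every name here (`mean_difference_vector`, `steer`, `projection_per_token`, the coefficient `alpha`, the synthetic data) is a hypothetical choice for illustration, not taken from the post. The authors' actual pipeline on Vicuna 13B v1.5 may differ in layer choice, token position, and scaling.

```python
import numpy as np


def mean_difference_vector(jailbreak_acts, plain_acts):
    """Average (jailbreak - plain) activation difference over paired requests.

    Both inputs have shape (n_requests, hidden_size); row i of each array
    is assumed to come from the same underlying request.
    """
    jailbreak_acts = np.asarray(jailbreak_acts, dtype=float)
    plain_acts = np.asarray(plain_acts, dtype=float)
    if jailbreak_acts.shape != plain_acts.shape:
        raise ValueError("paired activation arrays must have the same shape")
    return (jailbreak_acts - plain_acts).mean(axis=0)


def steer(activation, vector, alpha=-1.0):
    """Add alpha * vector to an activation; negative alpha steers away."""
    return np.asarray(activation, dtype=float) + alpha * np.asarray(vector, dtype=float)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def projection_per_token(token_acts, direction):
    """Scalar projection of each token's activation onto a direction.

    token_acts has shape (n_tokens, hidden_size). Tracking this value over
    the context is one way to watch a feature rise or fall across a prompt.
    """
    direction = np.asarray(direction, dtype=float)
    unit = direction / np.linalg.norm(direction)
    return np.asarray(token_acts, dtype=float) @ unit


def demo():
    # Synthetic stand-ins for model activations (NOT real Vicuna data).
    rng = np.random.default_rng(0)
    n_requests, hidden_size = 32, 16
    plain = rng.normal(size=(n_requests, hidden_size))
    shift = rng.normal(size=hidden_size)  # pretend effect of one jailbreak type
    jailbreak = plain + shift + 0.1 * rng.normal(size=(n_requests, hidden_size))

    vec = mean_difference_vector(jailbreak, plain)
    steered = steer(jailbreak[0], vec, alpha=-1.0)
    print("cosine(vec, shift):", cosine_similarity(vec, shift))
    print("projection before:", projection_per_token(jailbreak[:1], vec)[0])
    print("projection after: ", projection_per_token(steered[None, :], vec)[0])


if __name__ == "__main__":
    demo()
```

Comparing vectors from different jailbreak types with `cosine_similarity` is one simple way to look for the kind of clustering the abstract mentions; the sketch makes no claim about which similarity measure the authors used.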
Jun 20, 2024 • 1h 21min

LW - AI #69: Nice by Zvi

Zvi discusses the founding of a new AI company aiming to build superintelligence, changes to SB 1047, a former NSA director joining OpenAI, and Donald Trump's thoughts on AI. The podcast also touches on cybersecurity concerns and the implications of advancing AI technology.
Jun 20, 2024 • 3min

LW - Actually, Power Plants May Be an AI Training Bottleneck. by Lao Mein

Author Lao Mein discusses a potential bottleneck in AI data center construction due to US power plant capacity. He explores how increasing AI chip installations affect electricity demand and why new power plants may be needed to avoid production shortfalls by 2026.
Jun 20, 2024 • 4min

EA - Advice for early-career people seeking jobs in EA by Julia Wise

Julia Wise, writer and EA contributor, shares valuable tips for early-career individuals seeking jobs in Effective Altruism. The podcast covers job-seeking strategies, networking advice, career decisions, and opportunities in sectors like animal advocacy and emerging technologies.
Jun 20, 2024 • 9min

EA - Manifold markets isn't very good by Robin

Author Robin discusses the prediction market website Manifold, its similarities and differences with Metaculus, and structural flaws impacting prediction accuracy.
Jun 20, 2024 • 7min

EA - Against the Guardian's hit piece on Manifest by Omnizoid

The podcast discusses The Guardian's hit piece on Manifest, highlighting factual errors and biased reporting. It explores controversial views in philosophy, challenges cancel culture, and emphasizes the importance of free thinkers in society.
Jun 20, 2024 • 8min

EA - Announcing the AI Forecasting Benchmark Series | July 8, $120k in Prizes by christian

The podcast discusses the launch of a series of AI forecasting tournaments, comparing human and AI abilities. Participants can create their own forecasting bots to compete for $120k in prizes. The focus is on understanding AI capabilities and improving forecasting metrics like calibration and logical consistency.
