The Nonlinear Library: LessWrong

The Nonlinear Fund

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

Episodes

Mentioned books

Sep 13, 2024 • 15min

LW - [Paper] Programming Refusal with Conditional Activation Steering by Bruce W. Lee

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Paper] Programming Refusal with Conditional Activation Steering, published by Bruce W. Lee on September 12, 2024 on LessWrong. For full content, refer to the arXiv preprint at https://arxiv.org/abs/2409.05907. This post is a lighter, 15-minute version. Abstract Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants. We propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context. Using CAST, one can systematically control LLM behavior with rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse." This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization. We release an open-source implementation of the activation steering toolkit at https://github.com/IBM/activation-steering. Introduction Problem: Lack of conditional control in activation steering. Activation steering offers a promising alternative to optimization-based techniques by directly manipulating the model's native representations, often requiring only a simple activation addition step during each forward call. Our work here builds on Refusal in LLMs is mediated by a single direction, which has shown promise in altering LLM behavior, such as removing or inducing refusal behavior. However, the key limitation of current methods is the inability to condition when and what to refuse. That is, adding a "refusal vector" using existing activation steering methods increases refusal rates indiscriminately across all inputs, limiting the model's utility. Contribution: Expanding activation steering formulation. We introduce Conditional Activation Steering (CAST), a method that enables fine-grained, context-dependent control over LLM behaviors. We introduce a new type of steering vector in the activation steering formulation, the condition vector, representing certain activation patterns induced by the prompt during the inference process. A simple similarity calculation between this condition vector and the model's activation at inference time effectively serves as a switch, determining whether to apply the refusal vector. This approach allows for selective refusal of harmful prompts while maintaining the ability to respond to harmless ones, as depicted below. Application: Selecting what to refuse. Many alignment goals concern contextually refusing specific classes of instructions. Traditional methods like preference modeling are resource-intensive and struggle with subjective, black-box rewards. Additionally, the definition of harmful content varies across contexts, complicating the creation of universal harm models. The usage context further complicates this variability; for instance, discussing medical advice might be harmful in some situations but essential in others, such as in medical chatbots. We show CAST can implement behavioral rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse", allowing for selective modification of responses to specific content without weight optimization. On a technical level, our primary insight is that different prompts consistently activate distinct patterns in the model's hidden states during inference. These patterns can be extracted as a steering vector and used as reference points for detecting specific prompt categories or contexts. This observation allows us to use steering vectors not only as behavior modification mechanisms but also as condition ...

Sep 11, 2024 • 5min

LW - Refactoring cryonics as structural brain preservation by Andy McKenzie

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Refactoring cryonics as structural brain preservation, published by Andy McKenzie on September 11, 2024 on LessWrong. I first learned about cryonics when I read Eliezer and Robin's posts about it on Overcoming Bias years ago. I got cryopilled. Somewhat amazingly to me, I'm now a researcher in this field. So I thought this community might be interested to know that I was one of several co-authors on a paper just published in Frontiers in Medical Technology, titled "Structural brain preservation: a potential bridge to future medical technologies". In this paper, we propose reframing cryonics as a type of structural brain preservation, focusing on maintaining the brain's physical structure that encodes memories and personality, rather than making the focus about low-temperature storage per se. We explore what brain structures likely need to be preserved to retain long-term memories and other valued aspects of personal identity. We then review different methods of brain preservation, including cryopreservation, aldehyde-stabilized cryopreservation, fluid preservation, and fixation followed by polymer embedding. The paper also discusses the two most commonly discussed potential future revival technologies, i.e. molecular nanotechnology and whole brain emulation. We argue that this structural preservation framing may be more technically grounded and agreeable to mainstream researchers than some of the traditional ways that cryonics has been discussed. As a personal reflection here, I want to briefly discuss the idea of fluid preservation, which is one topic discussed in our review paper. I remember first reading about this idea in approximately 2017 on a cryonics mailing list. Even though I was already sold on the idea of aldehyde-stabilized cryopreservation -- using fixation as a short-term bridge to cryoprotection and cryopreservation, I remember thinking that the idea of simply leaving the brain in fixative solution for the long-term was bizarre and outlandish. Around 2020-2022, I spent a good amount of time researching different options for higher temperature (and thus lower cost) brain preservation. Mostly I was looking into different methods for embedding fixed brain tissue in polymers, such as paraffin, epoxy, acrylates, or silicon. I also studied the options of dehydrated preservation and preserving the fixed brain in the fluid state, which I was mostly doing for the sake of completeness. To be clear, I certainly don't want to make it seem like this was a lone wolf effort or anything. I was talking about the ideas with friends and it was also in the zeitgeist. For example, John Smart wrote a blog post in 2020 about this, titled "Do we need a noncryogenic brain preservation prize?" (There still is no such prize.) In 2022, I was reading various papers on brain preservation (as one does), when I came across Rosoklija 2013. If I recall correctly, I had already seen this paper but was re-reading it with a different eye. They studied human and monkey brain tissue that had been preserved in formalin for periods ranging from 15 months to 55 years, using the Golgi-Kopsch silver staining method to visualize neuronal structures. They reported that even after 50 years of formalin fixation at room temperature, the method yielded excellent results. In particular, they had this figure: That's a picture showing well-impregnated neurons with preserved dendritic spines. Looking at this picture was a viewquake for me. I thought, if fluid preservation can preserve the structure of the 1-5% of cells that are stained by the Golgi-Kopsch method, why not other cells? And if it can work in this one part of the brain, why not the whole brain? And if it can do it for 50 years, why not 100 or 150? Chemically, it is not clear why there would be differences across the tissue. Aldehydes crosslin...

Sep 11, 2024 • 5min

LW - Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It. by Andrew Critch

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It., published by Andrew Critch on September 11, 2024 on LessWrong. People often attack frontier AI labs for "hypocrisy" when the labs admit publicly that AI is an extinction threat to humanity. Often these attacks ignore the difference between various kinds of hypocrisy, some of which are good, including what I'll call "reformative hypocrisy". Attacking good kinds of hypocrisy can be actively harmful for humanity's ability to survive, and as far as I can tell we (humans) usually shouldn't do that when our survival is on the line. Arguably, reformative hypocrisy shouldn't even be called hypocrisy, due to the negative connotations of "hypocrisy". That said, bad forms of hypocrisy can be disguised as the reformative kind for long periods, so it's important to pay enough attention to hypocrisy to actually figure out what kind it is. Here's what I mean, by way of examples: *** 0. No Hypocrisy Lab: "Building AGI without regulation shouldn't be allowed. Since there's no AGI regulation, I'm not going to build AGI." Meanwhile, the lab doesn't build AGI. This is a case of honest behavior, and what many would consider very high integrity. However, it's not obviously better, and arguably sometimes worse, than... 1. Reformative Hypocrisy: Lab: "Absent adequate regulation for it, building AGI shouldn't be allowed at all, and right now there is no adequate regulation for it. Anyway, I'm building AGI, and calling for regulation, and making lots of money as I go, which helps me prove the point that AGI is powerful and needs to be regulated." Meanwhile, the lab builds AGI and calls for regulation. So, this is a case of honest hypocrisy. I think this is straightforwardly better than... 2. Erosive Hypocrisy: Lab: "Building AGI without regulation shouldn't be allowed, but it is, so I'm going to build it anyway and see how that goes; the regulatory approach to safety is hopeless." Meanwhile, the lab builds AGI and doesn't otherwise put efforts into supporting regulation. This could also be a case of honest hypocrisy, but it erodes the norm that AGI should regulated rather than supporting it. Some even worse forms of hypocrisy include... 3. Dishonest Hypocrisy, which comes in at least two importantly distinct flavors: a) feigning abstinence: Lab: "AGI shouldn't be allowed." Meanwhile, the lab secretly builds AGI, contrary to what one might otherwise guess according to their stance that building AGI is maybe a bad thing, from a should-it-be-allowed perspective. b) feigning opposition: Lab: "AGI should be regulated." Meanwhile, the lab overtly builds AGI, while covertly trying to confuse and subvert regulatory efforts wherever possible. *** It's important to remain aware that reformative hypocrisy can be on net a better thing to do for the world than avoiding hypocrisy completely. It allows you to divert resources from the thing you think should be stopped, and to use those resources to help stop the thing. For mathy people, I'd say this is a way of diagonalizing against a potentially harmful thing, by turning the thing against itself, or against the harmful aspects of itself. For life sciencey people, I'd say this is how homeostasis is preserved, through negative feedback loops whereby bad stuff feeds mechanisms that reduce the bad stuff. Of course, a strategy of feigning opposition (3a) can disguise itself as reformative hypocrisy, so it can be hard to distinguish the two. For example, if a lab says for long time that they're going to admit their hypocritical stance, and then never actually does, then it turns out to be dishonest hypocrisy. On the other hand, if the dishonesty ever does finally end in a way that honestly calls for reform, it's good to reward the honest and reformative aspects of their behavior....

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

The Nonlinear Library: LessWrong

Episodes

Mentioned books

LW - The Great Data Integration Schlep by sarahconstantin

LW - AI, centralization, and the One Ring by owencb

LW - Open Problems in AIXI Agent Foundations by Cole Wyeth

LW - How to Give in to Threats (without incentivizing them) by Mikhail Samin

LW - Contra papers claiming superhuman AI forecasting by nikos

LW - OpenAI o1 by Zach Stein-Perlman

LW - AI #81: Alpha Proteo by Zvi

LW - [Paper] Programming Refusal with Conditional Activation Steering by Bruce W. Lee

LW - Refactoring cryonics as structural brain preservation by Andy McKenzie

LW - Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It. by Andrew Critch

The AI-powered Podcast Player