

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 23, 2024 • 10min
LW - Unlearning via RMU is mostly shallow by Andy Arditi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Unlearning via RMU is mostly shallow, published by Andy Arditi on July 23, 2024 on LessWrong.
This is an informal research note. It is the result of a few-day exploration into RMU through the lens of model internals. Code to reproduce the main result is available here.
This work was produced as part of Ethan Perez's stream in the ML Alignment & Theory Scholars Program - Summer 2024 Cohort. Thanks to Nina Panickssery, Mrinank Sharma, and Fabien Roger for helpful discussion.
Summary
We investigate RMU, a recent unlearning method proposed by Li et al. (2024), through the lens of model internals. Through this lens, we find that RMU mostly works by flooding the residual stream with "junk" in hazardous contexts, resulting in incoherence. We then propose a simple intervention to "clear the junk" from the residual stream.
This intervention mostly restores the model's coherence in hazardous contexts, and recovers a significant proportion (but not all) of its original hazardous knowledge. This suggests that the effectiveness of RMU can be understood roughly in two pieces: (1) a shallow mechanism, where the residual stream is flooded with junk; and (2) a deeper mechanism, where even after the junk is cleared, knowledge is still inaccessible.
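As a rough illustration of what such a "clear the junk" intervention could look like, here is a minimal sketch, assuming the junk is well approximated by a single direction estimated from the difference between RMU and baseline activations on hazardous prompts. The function names and the rank-1 projection are our own illustrative assumptions, not necessarily the exact intervention used in the post.
```python
import torch

def estimate_junk_direction(rmu_acts: torch.Tensor, base_acts: torch.Tensor) -> torch.Tensor:
    # rmu_acts, base_acts: [n_prompts, d_model] residual-stream activations on hazardous
    # prompts from the RMU model and the baseline model respectively.
    # Assume the "junk" lies along the mean difference between the two.
    diff = (rmu_acts - base_acts).mean(dim=0)
    return diff / diff.norm()

def clear_junk(resid: torch.Tensor, junk_dir: torch.Tensor) -> torch.Tensor:
    # Project the junk direction out of a residual-stream activation [..., d_model].
    return resid - (resid @ junk_dir).unsqueeze(-1) * junk_dir

# Usage sketch with stand-in activations (d_model = 4096 for a 7B model):
rmu_acts, base_acts = torch.randn(8, 4096) + 20.0, torch.randn(8, 4096)
junk = estimate_junk_direction(rmu_acts, base_acts)
cleaned = clear_junk(rmu_acts, junk)
```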
What is RMU?
Representation Misdirection for Unlearning (RMU) is a state-of-the-art unlearning method presented by Li et al. (2024).
In the unlearning paradigm, we would like the model to unlearn (or "forget") some hazardous knowledge. At the same time, we would also like to make sure the model retains non-hazardous knowledge, so that the model remains useful.
This partition of knowledge is usually specified by constructing a "forget" dataset Dforget, consisting of the hazardous knowledge to be unlearned, and a "retain" dataset Dretain, consisting of non-hazardous knowledge to be retained.
Let M denote our original model. RMU specifies a method for fine-tuning M on Dforget and Dretain in order to obtain a modified model M' satisfying the unlearning objective.
The main idea of RMU is as follows:
On hazardous data, the internal activations of M' should be scrambled.
On non-hazardous data, the internal activations of M' should be unchanged, i.e. close to those of the original model M.
These two ideas are concretely operationalized as two distinct terms in the loss during fine-tuning:
On Dforget, incentivize activations a'ℓ at some layer ℓ to be close to a large, randomly sampled vector c·u.
"Forget" loss term: ||a'ℓcu||22.
On Dretain, incentivize activations a'ℓ at some layer ℓ to be close to the original model's activations aℓ.
"Retain" loss term: ||a'ℓaℓ||22.
Note that u is a random unit vector sampled before the fine-tuning procedure, and kept constant throughout (i.e. it is not freshly sampled at each training step). Also note that the layer ℓ at which to target activations and the scalar multiplier c are both predetermined hyperparameters.
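Putting these pieces together, the fine-tuning loss can be sketched roughly as follows (a minimal PyTorch-style sketch under our own naming; the reduction over token positions and the retain-loss weight alpha are assumptions rather than the authors' exact implementation):
```python
import torch

def rmu_loss(acts_forget, acts_retain, acts_retain_orig, u, c, alpha):
    # acts_forget:      layer-l activations of the fine-tuned model M' on a D_forget batch
    # acts_retain:      layer-l activations of M' on a D_retain batch
    # acts_retain_orig: layer-l activations of the frozen original model M on the same batch
    # u: fixed random unit vector [d_model], sampled once before training
    # c: scalar multiplier; alpha: retain-loss weight (both hyperparameters)
    forget_loss = ((acts_forget - c * u) ** 2).sum(dim=-1).mean()
    retain_loss = ((acts_retain - acts_retain_orig) ** 2).sum(dim=-1).mean()
    return forget_loss + alpha * retain_loss

# Usage sketch with stand-in activations of shape [batch, seq, d_model]:
d = 4096
u = torch.randn(d); u = u / u.norm()
loss = rmu_loss(torch.randn(2, 16, d), torch.randn(2, 16, d),
                torch.randn(2, 16, d), u, c=300.0, alpha=100.0)
```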
Examining an RMU model
The original paper (Li et al., 2024) performs RMU over multiple open-source models of varying scales. The authors made all code available on GitHub, and all resulting models available on HuggingFace.[1]
For our analysis, we pick a single model pair: zephyr-7B-beta (which we will refer to as "baseline") and Zephyr_RMU (which we will refer to as "RMU").
The RMU model has been fine-tuned to unlearn two domains of knowledge: hazardous biology knowledge, and hazardous cybersecurity knowledge.
Prompting with hazardous instructions
Prompting the RMU model with an instruction in one of these domains causes it to output gibberish, as we would expect from a model with its activations scrambled.
Looking at activations
We can take a handful of hazardous prompts, run them through the baseline and RMU models, and compare their activations. We specifically study the activations at the last tok...

Jul 23, 2024 • 6min
EA - Vida Plena's 2023 Impact Report: Measuring Progress and Looking Ahead by Vida Plena
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Vida Plena's 2023 Impact Report: Measuring Progress and Looking Ahead, published by Vida Plena on July 23, 2024 on The Effective Altruism Forum.
We at Vida Plena are proud to present our first Annual Impact Report.
2023 was our first full year, and it was a year of learning. We had just finished a successful pilot and started the year with the mission of building a solid foundation and proving that our therapy model works at scale.
This first annual impact report is our attempt to capture, through charts and graphs, crucial evidence about who we helped in 2023 and where we can continue to improve.
Background Context
Vida Plena (meaning 'a flourishing life' in Spanish) is a nonprofit organization based in Quito, Ecuador, which launched in 2022 (see our launch post here).
Our mission is to build strong mental health in low-income and refugee communities, who otherwise would have no access to care. We provide evidence-based depression treatment using group interpersonal therapy, which is highly cost-effective and scalable.
Main Findings
Our main findings during the process of creating this report were:
In 2023, we screened 882 people for depression. 434 (49%) of these became participants, taking at least 1 group session.
Program participants had an average reduction of 6.6 points on the PHQ-9 questionnaire. 68% of participants who entered with moderate to severe depression clinically improved, meaning their PHQ-9 score dropped by at least 5 points, the threshold generally considered a clinically significant improvement.
We also saw improvements in secondary indicators: reduced thoughts of self-harm and suicidal ideation, reduced anxiety, and better psychosocial functioning and employment. Participants who fill out our end-line survey also report high satisfaction with the program and increased feelings of hope and purpose.
90% of participants came from vulnerable groups, the most common of which were people experiencing food insecurity (56%), female heads of households (34%), and migrants and refugees (22%).
Participant recovery seems to be related mostly to the baseline level of depression and not so much to the number of sessions taken or other variables like the modality of the sessions (virtual or in person).
Challenges
While we are excited about these results, there are still many challenges and areas where we need to improve. In particular:
Even though 5 points is considered a clinically significant change on the PHQ-9 scale, the 6.6-point average drop is still below our more ambitious target. In 2024, we aim to raise this to a nine-point average reduction among participants entering with moderate to severe depression.
Relatedly, we aim to improve our participant retention rate. Our initial findings suggest that participants may drop out when they start feeling better. We believe there is room for them to continue improving and learning important skills to enhance their resilience and strengthen their support network if they attend more therapy sessions.
Limitations
We are also aware that this first report has limitations.
First, we rely primarily on pre-post participant comparisons, with no randomized control group. We partially compensate for this by considering spontaneous remission data from the scientific literature. However, our priority in the coming years is to implement control groups, where people who are not involved in Vida Plena g-IPT sessions take PHQ-9 assessments over eight weeks, to determine our population's spontaneous remission rate.
Secondly, some of the data we collect is likely subject to multiple biases. For example, the program satisfaction data we have, as well as many secondary indicators, come from people who take the end-line survey at the end of their 8th group session. People who get that far into the program without dropping out are likely the ones who saw the most value in it, and this can skew our conclus...

Jul 23, 2024 • 6min
LW - D&D.Sci Scenario Index by aphyer
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D&D.Sci Scenario Index, published by aphyer on July 23, 2024 on LessWrong.
There have been a lot of D&D.Sci scenarios, but there's a lot of variance between them in complexity and quality. Some are more difficult, and might not be a good place to start, while others are much simpler - some were very good, while others on reflection didn't flow quite right.
Unfortunately, LW karma doesn't track the quality of these scenarios very well: often mediocre scenarios are higher-karma than better scenarios (whether because they had good writing around a poor scenario, or because people upvoted before playing them, or just because more people happened to be online and see them).
If you're interested in playing D&D.Sci scenarios, but don't know where to start, this index (compiled by frequent authors abstractapplic and aphyer; we'll try to keep it updated going forwards) is a good reference point to make sure you can pick good scenarios at a difficulty level you're comfortable with.
If you're new to D&D.Sci, you should probably start with the lower-Complexity scenarios and move up to the higher-Complexity ones. Scenarios with Quality Rating 1-2 are probably less worth playing, while the higher-rated ones are ones we'd recommend.
Each scenario is listed below with its Complexity Rating (1=easy, 5=hard), Quality Rating (1=low, 5=high), and Author[1].
D&D.Sci: Whom Shall You Call? - Complexity 2, Quality 2[2], by abstractapplic
D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues - Complexity 3, Quality 5, by aphyer
D&D.Sci Long War: Defender of Data-mocracy - Complexity 4, Quality 4, by aphyer
D&D.Sci (Easy Mode): On The Construction Of Impossible Structures - Complexity 1, Quality 3, by abstractapplic
D&D.Sci: The Mad Tyrant's Pet Turtles - Complexity 4, Quality 4[3], by abstractapplic
D&D.Sci(-fi): Colonizing the SuperHyperSphere - Complexity 3, Quality 3[3], by abstractapplic
D&D.Sci 5E: Return of the League of Defenders - Complexity 4, Quality 3, by aphyer
D&D.Sci: All the D8a. Allllllll of it. - Complexity 5, Quality 1[4], by aphyer
D&D.Sci December 2022: The Boojumologist - Complexity 2, Quality 1[2], by abstractapplic
D&D.Sci September 2022: The Allocation Helm - Complexity 3, Quality 4, by abstractapplic
Dwarves & D.Sci: Data Fortress - Complexity 3, Quality 3, by aphyer
Ars D&D.sci: Mysteries of Mana - Complexity 3, Quality 3, by aphyer
D&D.Sci June 2022: A Goddess Tried To Reincarnate Me Into Another World - Complexity 2, Quality 2[2], by abstractapplic
D&D.Sci Divination: Nine Black Doves - Complexity 4, Quality 2, by aphyer
Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! - Complexity 5, Quality 5, by aphyer
D&D.SCP: Anomalous Acquisitions - Complexity 5, Quality 2[5], by aphyer
D&D.Sci Holiday Special: How the Grinch Pessimized Christmas - Complexity 3, Quality 3, by aphyer
D&D.Sci Dungeoncrawling: The Crown of Command - Complexity 4, Quality 3, by aphyer
D&D.Sci 4th Edition: League of Defenders of the Storm - Complexity 4, Quality 5, by aphyer
D&D.Sci Pathfinder: Return of the Gray Swan - Complexity 5[6], Quality 2, by aphyer
D&D.Sci August 2021: The Oracle and the Monk - Complexity 2, Quality 4, by abstractapplic
D&D.Sci(-Fi) June 2021: The Duel with Earwax - Complexity 4, Quality 3, by abstractapplic
D&D.Sci May 2021: Monster Carcass Auction - Complexity 2, Quality 2, by abstractapplic
D&D.Sci April 2021: Voyages of the Gray Swan - Complexity 2, Quality 5[3], by abstractapplic
D&D.Sci III: Mancer Matchups - Complexity 3, Quality 1, by abstractapplic
D&D.Sci II: The Sorceror's Personal Shopper - Complexity 2, Quality 5[3], by abstractapplic
D&D.Sci - Complexity 3, Quality 5, by abstractapplic
If you disagree with any of these ratings, let us know; we're happy to review them. There were some scenarios where we disagreed on the correct rating while compiling this list, and we'd appreciate your comments as an outside view, especially if you're a frequent player!
[1] Keen-eyed readers will notice a correlation between this column and the 'Complexity' column.
[2] abstractapplic: These scenarios were attempts to convey / demonstrate specific ideas with real-world relevance; I judge that they failed at this; I therefore grade them a little less generously than you might.
[3] abstractapplic: These scenarios were attempts to convey / demonstrate specific ideas with real-world relevance; I judge that they succeeded at this; I therefore grade them a little more generously than you might.
[4] aphyer: I thought this scenario was great, and still do, but given that ...

Jul 23, 2024 • 25min
AF - ML Safety Research Advice - GabeM by Gabe M
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ML Safety Research Advice - GabeM, published by Gabe M on July 23, 2024 on The AI Alignment Forum.
This is my advice for careers in empirical ML research that might help AI safety (ML Safety). Other ways to improve AI safety, such as through AI governance and strategy, might be more impactful than ML safety research (I generally think they are). Skills can be complementary, so this advice might also help AI governance professionals build technical ML skills.
1. Career Advice
1.1 General Career Guides
Preventing an AI-related catastrophe - 80,000 Hours
A Survival Guide to a PhD (Andrej Karpathy)
How to pursue a career in technical AI alignment - EA Forum
AI safety technical research - Career review - 80,000 Hours
Beneficial AI Research Career Advice
2. Upskilling
2.1 Fundamental AI Safety Knowledge
AI Safety Fundamentals - BlueDot Impact
AI Safety, Ethics, and Society Textbook
Forming solid AI safety threat models helps you select impactful research ideas.
2.2 Speedrunning Technical Knowledge in 12 Hours
Requires some basic coding, calculus, and linear algebra knowledge
Build Intuition for ML (5h)
Essence of linear algebra - 3Blue1Brown (3h)
Neural networks - 3Blue1Brown (2h)
Backpropagation, the foundation of deep learning (3h)
Neural Networks: Backpropagation - CS 231N (0.5h)
The spelled-out intro to neural networks and backpropagation: building micrograd (2.5h)
Transformers and LLMs (4h)
[1hr Talk] Intro to Large Language Models (1h)
The Illustrated Transformer - Jay Alammar (1h)
Let's build GPT: from scratch, in code, spelled out. (2h)
2.3 How to Build Technical Skills
Traditionally, people take a couple of deep learning classes.
Stanford CS 224N | Natural Language Processing with Deep Learning (lecture videos)
Practical Deep Learning for Coders - Practical Deep Learning (fast.ai)
Other curricula that seem good:
Syllabus | Intro to ML Safety
Levelling Up in AI Safety Research Engineering [Public]
ARENA
Maybe also check out recent topical classes like this with public lecture recordings: CS 194/294-267 Understanding Large Language Models: Foundations and Safety
Beware of studying too much.
You should aim to understand the fundamentals of ML through 1 or 2 classes and then practice doing many manageable research projects with talented collaborators or a good mentor who can make time to meet with you.
It's easy to keep taking classes, but you tend to learn many more practical ML skills through practice doing real research projects.
You can also replicate papers to build experience. Be sure to focus on key results rather than wasting time replicating many experiments.
"One learns from books and reels only that certain things can be done. Actual learning requires that you do those things." -Frank Herbert
Note that ML engineering skills will be less relevant over time as AI systems become better at writing code.
A friend didn't study computer science but got into MATS 2023 with good AI risk takes. Then, they had GPT-4 write most of their code for experiments and did very well in their stream.
Personally, GitHub Copilot and language model apps with code interpreters/artifacts write a significant fraction of my code.
However, fundamental deep learning knowledge is still useful for making sound decisions about what experiments to run.
2.4 Math
You don't need much of it to do empirical ML research.
Someone once told me, "You need the first chapter of a calculus textbook and the first 5 pages of a linear algebra textbook" to understand deep learning.
You need more math for ML theory research, but theoretical research is not as popular right now.
Beware mathification: authors often add unnecessary math to appease (or sometimes confuse) conference reviewers.
If you don't understand some mathematical notation in an empirical paper, you can often send a screenshot to an LLM chatbot f...

Jul 22, 2024 • 14min
LW - Categories of leadership on technical teams by benkuhn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Categories of leadership on technical teams, published by benkuhn on July 22, 2024 on LessWrong.
This is an adaptation of an internal doc I wrote for Anthropic.
Recently I've been having a lot of conversations about how to structure and staff teams. One framework I've referenced repeatedly is to break down team leadership into a few different categories of responsibility.
This is useful for a couple reasons. One is that it helps you get more concrete about what leading a team involves; for new managers, having an exhaustive list of job responsibilities is helpful to make sure you're tracking all of them.
More importantly, though, we often want to somehow split these responsibilities between people. Team leadership covers a huge array of things - as you can see from how long this post is - and trying to find someone who can be great at all of them is often a unicorn hunt. Even if you do find someone good-enough at all of them, they usually spike in 1-2 areas, and it might be higher-leverage for them to fully focus on those.
Here's a breakdown I use a lot:[1]
Categories
Overall direction
The most important responsibility of a team's leadership is to ensure that the team is headed in the right direction - that is, are they working towards the right high-level goal, and do they have an achievable plan to get there? Overall direction tends to get input from many people inside and outside a team, but who is most accountable for it can vary; see Example divisions of responsibility below.
Overall direction involves working on things like:
Setting the team's mission, vision, or charter
Choosing the team's goals, plans and roadmap
Prioritizing the various different projects the team could take on
Communicating the above, both to team members and to people outside
The most important skill for getting this right is having good predictive models (of both the team's domain and the organization) - since prioritization is ultimately a question about "what will be the impact if we pursue this project." Being great at communicating those predictive models, and the team's priorities and goals, to other stakeholders is also important.
Good team direction mostly looks like the team producing a steady stream of big wins. Poor direction most commonly manifests as getting caught by surprise or falling behind - that is, mispredicting what work will be most important and doing too little of it, for example by starting too late, under-hiring, or not growing people into the right skillset or role.
Other signs of poor direction include team members not understanding why they're working on something; the team working on projects that deliver little value; friction with peer teams or arguments about scope; or important projects falling through the cracks between teams.
People management
People management means being responsible for the success of the people on the team, most commonly including things like:
Coaching people to improve and grow in their careers
Designing and overseeing hiring processes for their team
Setting and communicating performance expectations and evaluating against them
Day to day, the most important responsibility here is recurring 1:1s (the coaching kind, not the status update kind). Others include writing job descriptions, setting up interview loops, sourcing candidates, gathering feedback, writing performance reviews, helping people navigate org policies, giving career coaching, etc.
The most important skill for people management is understanding people - both in the traditional "high EQ" sense of being empathetic and good at seeing others' perspectives, but also in the sense of knowing what contributes to high performance in a domain (e.g. what makes someone a great engineer or researcher). It's also important to be good at having tricky conversations in a compassionate but fi...

Jul 22, 2024 • 32sec
LW - The $100B plan with "70% risk of killing us all" w Stephen Fry [video] by Oleg Trott
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The $100B plan with "70% risk of killing us all" w Stephen Fry [video], published by Oleg Trott on July 22, 2024 on LessWrong.
A high production value 16-minute video that summarizes the popular safety concerns, featuring Hinton, Russell and Claude 3.5.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 22, 2024 • 20min
LW - Efficient Dictionary Learning with Switch Sparse Autoencoders by Anish Mudide
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Efficient Dictionary Learning with Switch Sparse Autoencoders, published by Anish Mudide on July 22, 2024 on LessWrong.
Produced as part of the ML Alignment & Theory Scholars Program - Summer 2024 Cohort
0. Summary
To recover all the relevant features from a superintelligent language model, we will likely need to scale sparse autoencoders (SAEs) to billions of features. Using current architectures, training extremely wide SAEs across multiple layers and sublayers at various sparsity levels is computationally intractable. Conditional computation has been used to scale transformers (Fedus et al.) to trillions of parameters while retaining computational efficiency.
We introduce the Switch SAE, a novel architecture that leverages conditional computation to efficiently scale SAEs to many more features.
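To make the idea concrete, here is a minimal sketch of a switch-style SAE in PyTorch: a router assigns each input to one expert SAE, so only a small slice of the full dictionary is touched per input. The sizes, the TopK activation, and the hard argmax routing (with load balancing and router differentiability omitted) are our simplifying assumptions, not necessarily the architecture described in the post.
```python
import torch
import torch.nn as nn

class SwitchSAE(nn.Module):
    # Route each activation to one of n_experts small SAEs (conditional computation),
    # so the effective dictionary has n_experts * d_expert features while each input
    # only pays for a single expert's encoder and decoder.
    def __init__(self, d_model=768, n_experts=8, d_expert=2048, k=32):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.W_enc = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.01)
        self.k = k

    def forward(self, x):                       # x: [batch, d_model]
        expert = self.router(x).argmax(dim=-1)  # hard routing: one expert per input
        enc, dec = self.W_enc[expert], self.W_dec[expert]
        pre = torch.einsum("bd,bdf->bf", x, enc)
        vals, idx = pre.topk(self.k, dim=-1)    # TopK sparsity within the chosen expert
        latents = torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))
        return torch.einsum("bf,bfd->bd", latents, dec)

x_hat = SwitchSAE()(torch.randn(4, 768))        # reconstruct a batch of activations
```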
1. Introduction
The internal computations of large language models are inscrutable to humans. We can observe the inputs and the outputs, as well as every intermediate step in between, and yet, we have little to no sense of what the model is actually doing.
For example, is the model inserting security vulnerabilities or backdoors into the code that it writes? Is the model lying, deceiving or seeking power? Deploying a superintelligent model into the real world without being aware of when these dangerous capabilities may arise leaves humanity vulnerable. Mechanistic interpretability (Olah et al.) aims to open the black-box of neural networks and rigorously explain the underlying computations.
Early attempts to identify the behavior of individual neurons were thwarted by polysemanticity, the phenomenon in which a single neuron is activated by several unrelated features (Olah et al.). Language models must pack an extremely vast amount of information (e.g., the entire internet) within a limited capacity, encouraging the model to rely on superposition to represent many more features than there are dimensions in the model state (Elhage et al.).
Sharkey et al. and Cunningham et al. propose to disentangle superimposed model representations into monosemantic, cleanly interpretable features by training unsupervised sparse autoencoders (SAEs) on intermediate language model activations. Recent work (Templeton et al., Gao et al.) has focused on scaling sparse autoencoders to frontier language models such as Claude 3 Sonnet and GPT-4. Despite scaling SAEs to 34 million features, Templeton et al. estimate that they are likely orders of magnitude short of capturing all features. Furthermore, Gao et al. train SAEs on a series of language models and find that larger models require more features to achieve the same reconstruction error. Thus, to capture all relevant features of future large, superintelligent models, we will likely need to scale SAEs to several billions of features.
With current methodologies, training SAEs with billions of features at various layers, sublayers and sparsity levels is computationally infeasible.
Training a sparse autoencoder generally consists of six major computations: the encoder forward pass, the encoder gradient, the decoder forward pass, the decoder gradient, the latent gradient and the pre-bias gradient. Gao et al. introduce kernels and tricks that leverage the sparsity of the TopK activation function to dramatically optimize all computations excluding the encoder forward pass, which is not (yet) sparse. After implementing these optimizations, Gao et al. attribute the majority of the compute to the dense encoder forward pass and the majority of the memory to the latent pre-activations. No work has attempted to accelerate or improve the memory efficiency of the encoder forward pass, which remains the sole dense matrix multiplication.
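For reference, a plain (non-switch) TopK SAE forward pass looks roughly like the sketch below; the first line is the dense encoder matrix multiplication discussed above, while the decoder only needs the k active latents per input. Naming is ours, not Gao et al.'s code.
```python
import torch

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    # x: [batch, d_model]; W_enc: [d_model, n_latents]; W_dec: [n_latents, d_model]
    pre = x @ W_enc + b_enc                 # dense encoder forward pass (the bottleneck)
    vals, idx = pre.topk(k, dim=-1)         # TopK activation: keep only k latents per input
    vals = torch.relu(vals)
    # Sparse decoder: gather only the k active decoder rows for each example.
    recon = torch.einsum("bk,bkd->bd", vals, W_dec[idx]) + b_dec
    return recon, (vals, idx)

d_model, n_latents = 768, 16384
recon, _ = topk_sae_forward(torch.randn(4, d_model),
                            torch.randn(d_model, n_latents) * 0.01, torch.zeros(n_latents),
                            torch.randn(n_latents, d_model) * 0.01, torch.zeros(d_model))
```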
In a standard deep learning model, every parameter is used for every input. An alternative approach is conditional computatio...

Jul 22, 2024 • 32min
LW - Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities by Axel Højmark
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities, published by Axel Højmark on July 22, 2024 on LessWrong.
Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer
Introduction
To mitigate risks from future AI systems, we need to assess their capabilities accurately. Ideally, we would have rigorous methods to upper bound the probability of a model having dangerous capabilities, even if these capabilities are not yet present or easily elicited.
The paper "Evaluating Frontier Models for Dangerous Capabilities" by Phuong et al. 2024 is a recent contribution to this field from DeepMind. It proposes new methods that aim to estimate, as well as upper-bound the probability of large language models being able to successfully engage in persuasion, deception, cybersecurity, self-proliferation, or self-reasoning. This post presents our initial empirical and theoretical findings on the applicability of these methods.
Their proposed methods have several desirable properties. Instead of repeatedly running the entire task end-to-end, the authors introduce milestones. Milestones break down a task and provide estimates of partial progress, which can reduce variance in overall capability assessments. The expert best-of-N method uses expert guidance to elicit rare behaviors and quantifies the expert assistance as a proxy for the model's independent performance on the task.
However, we find that relying on milestones tends to underestimate the overall task success probability for most realistic tasks. Additionally, the expert best-of-N method fails to provide values directly correlated with the probability of task success, making its outputs less applicable to real-world scenarios. We therefore propose an alternative approach to the expert best-of-N method, which retains its advantages while providing more calibrated results.
Except for the end-to-end method, we currently feel that no method presented in this post would allow us to reliably estimate or upper bound the success probability for realistic tasks and thus should not be used for critical decisions.
The overarching aim of our MATS project is to uncover agent scaling trends, allowing the AI safety community to better predict the performance of future LLM agents from characteristics such as training compute, scaffolding used for agents, or benchmark results (Ruan et al., 2024). To avoid the issue of seemingly emergent abilities resulting from bad choices of metrics (Schaeffer et al., 2023), this work serves as our initial effort to extract more meaningful information from agentic evaluations.
We are interested in receiving feedback and are particularly keen on alternative methods that enable us to reliably assign low-probability estimates (e.g. 1e-7) to a model's success rate on a task.
Evaluation Methodology of Phuong et al.
The goal of the evaluations we discuss is to estimate the probability of an agent succeeding on a specific task T. Generally, when we refer to an agent, we mean an LLM wrapped in scaffolding that lets it execute shell commands, run code, or browse the web to complete some predetermined task.
Formally, the goal is to estimate P(Ts), the probability that the agent solves task T and ends up in the solved state Ts. The naive approach is to estimate this with Monte Carlo sampling: run the task end-to-end many times and take the fraction of successful runs as the estimate of P(Ts).
The authors call this the end-to-end method.
However, the end-to-end method struggles with low-probability events. The expected number of trials needed to observe one success for a task is 1/P(Ts), making naive Monte Carlo sampling impractical for many low-probability, long-horizon tasks. In practice, this could require running multi-hour tasks hundreds of thousands of times.
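A minimal sketch of the end-to-end estimator, and of why it becomes impractical at low success probabilities (run_task and the numbers here are hypothetical stand-ins for a full agent rollout):
```python
import random

def end_to_end_estimate(run_task, n_trials: int) -> float:
    # Naive Monte Carlo: run the full task n_trials times and count successes.
    # run_task() -> bool stands in for one complete (possibly multi-hour) agent rollout.
    successes = sum(run_task() for _ in range(n_trials))
    return successes / n_trials

# If the true success probability is p, roughly 1/p full rollouts are needed
# to expect even a single success, e.g. p = 1e-4 -> ~10,000 rollouts.
p_true = 1e-4
estimate = end_to_end_estimate(lambda: random.random() < p_true, n_trials=10_000)
```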
To address this challenge, Phuong et al. devise three additional method...

Jul 22, 2024 • 27min
LW - On the CrowdStrike Incident by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the CrowdStrike Incident, published by Zvi on July 22, 2024 on LessWrong.
Things went very wrong on Friday.
A bugged CrowdStrike update temporarily bricked quite a lot of computers, bringing down such fun things as airlines, hospitals and 911 services.
It was serious out there.
Ryan Peterson: Crowdstrike outage has forced Starbucks to start writing your name on a cup in marker again and I like it.
What (Technically) Happened
My understanding is that it was a rather stupid bug: a NULL pointer dereference in the memory-unsafe C++ language.
Zack Vorhies: Memory in your computer is laid out as one giant array of numbers. We represent these numbers here as hexadecimal, which is base 16 (hexadecimal) because it's easier to work with… for reasons.
The problem area? The computer tried to read memory address 0x9c (aka 156).
Why is this bad?
This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS.
So why is memory address 0x9c trying to be read from? Well because… programmer error.
It turns out that C++, the language crowdstrike is using, likes to use address 0x0 as a special value to mean "there's nothing here", don't try to access it or you'll die.
…
And what's bad about this is that this is a special program called a system driver, which has PRIVILEGED access to the computer. So the operating system is forced to, out of an abundance of caution, crash immediately.
This is what is causing the blue screen of death. A computer can recover from a crash in non-privileged code by simply terminating the program, but not a system driver. When your computer crashes, 95% of the time it's because it's a crash in the system drivers.
If the programmer had done a check for NULL, or if they used modern tooling that checks these sorts of things, it could have been caught. But somehow it made it into production and then got pushed as a forced update by Crowdstrike… OOPS!
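For readers who don't write C++, here is the shape of the missing check, sketched as a Python analogy; the parser and field names are made up for illustration, and the real fix would be a NULL check on the pointer before dereferencing it in the C++ driver.
```python
def parse_channel_file(raw: bytes):
    # Stand-in parser: returns None (the analogue of a NULL pointer) when the
    # update file is malformed.
    return None  # simulate the bad content update

def apply_update(raw: bytes):
    record = parse_channel_file(raw)
    if record is None:            # the guard that was effectively missing
        return "skip update"      # fail gracefully instead of crashing the machine
    return record["signature"]    # only touch the data once we know it exists

print(apply_update(b"\x00" * 64))  # -> "skip update" rather than a crash
```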
Here is another technical breakdown.
A non technical breakdown would be:
1. CrowdStrike is set up to run whenever you start the computer.
2. Then someone pushed an update to a ton of computers.
3. Which is something CrowdStrike was authorized to do.
4. The update contained a stupid bug, that would have been caught if those involved had used standard practices and tests.
5. With the bug, it tries to access memory in a way that causes a crash.
6. Which also crashes the computer.
7. So you have to do a manual fix to each computer to get around this.
8. If this had been malicious it could probably have permawiped all the computers, or inserted Trojans, or other neat stuff like that.
9. So we dodged a bullet.
10. Also, your AI safety plan needs to take into account that this was the level of security mindset and caution at CrowdStrike, despite CrowdStrike having this level of access and being explicitly in the security mindset business, and that they were given this level of access to billions of computers, and that their stock was only down 11% on the day so they probably keep most of that access and we aren't going to fine them out of existence either.
Yep.
Who to Blame?
George Kurtz (CEO CrowdStrike): CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.
We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website. We further recommend organizations ensure they're communicating with CrowdStrike representatives through official channels. Our team is fully mobilized to ensure the security and stability of CrowdStrike customers.
Dan Elton: No apology. Many people have...

Jul 22, 2024 • 32min
AF - Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities by Axel Højmark
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities, published by Axel Højmark on July 22, 2024 on The AI Alignment Forum.
Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer
Introduction
To mitigate risks from future AI systems, we need to assess their capabilities accurately. Ideally, we would have rigorous methods to upper bound the probability of a model having dangerous capabilities, even if these capabilities are not yet present or easily elicited.
The paper "Evaluating Frontier Models for Dangerous Capabilities" by Phuong et al. 2024 is a recent contribution to this field from DeepMind. It proposes new methods that aim to estimate, as well as upper-bound the probability of large language models being able to successfully engage in persuasion, deception, cybersecurity, self-proliferation, or self-reasoning. This post presents our initial empirical and theoretical findings on the applicability of these methods.
Their proposed methods have several desirable properties. Instead of repeatedly running the entire task end-to-end, the authors introduce milestones. Milestones break down a task and provide estimates of partial progress, which can reduce variance in overall capability assessments. The expert best-of-N method uses expert guidance to elicit rare behaviors and quantifies the expert assistance as a proxy for the model's independent performance on the task.
However, we find that relying on milestones tends to underestimate the overall task success probability for most realistic tasks. Additionally, the expert best-of-N method fails to provide values directly correlated with the probability of task success, making its outputs less applicable to real-world scenarios. We therefore propose an alternative approach to the expert best-of-N method, which retains its advantages while providing more calibrated results.
Except for the end-to-end method, we currently feel that no method presented in this post would allow us to reliably estimate or upper bound the success probability for realistic tasks and thus should not be used for critical decisions.
The overarching aim of our MATS project is to uncover agent scaling trends, allowing the AI safety community to better predict the performance of future LLM agents from characteristics such as training compute, scaffolding used for agents, or benchmark results (Ruan et al., 2024). To avoid the issue of seemingly emergent abilities resulting from bad choices of metrics (Schaeffer et al., 2023), this work serves as our initial effort to extract more meaningful information from agentic evaluations.
We are interested in receiving feedback and are particularly keen on alternative methods that enable us to reliably assign low-probability estimates (e.g. 1e-7) to a model's success rate on a task.
Evaluation Methodology of Phuong et al.
The goal of the evaluations we discuss is to estimate the probability of an agent succeeding on a specific task T. Generally, when we refer to an agent, we mean an LLM wrapped in scaffolding that lets it execute shell commands, run code, or browse the web to complete some predetermined task.
Formally, the goal is to estimate P(Ts), the probability that the agent solves task T and ends up in the solved state Ts. The naive approach is to estimate this with Monte Carlo sampling: run the task end-to-end many times and take the fraction of successful runs as the estimate of P(Ts).
The authors call this the end-to-end method.
However, the end-to-end method struggles with low-probability events. The expected number of trials needed to observe one success for a task is 1/P(Ts), making naive Monte Carlo sampling impractical for many low-probability, long-horizon tasks. In practice, this could require running multi-hour tasks hundreds of thousands of times.
To address this challenge, Phuong et al. devise three addi...


