

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 23, 2024 • 10min
LW - Unlearning via RMU is mostly shallow by Andy Arditi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Unlearning via RMU is mostly shallow, published by Andy Arditi on July 23, 2024 on LessWrong.
This is an informal research note. It is the result of a few-day exploration into RMU through the lens of model internals. Code to reproduce the main result is available here.
This work was produced as part of Ethan Perez's stream in the ML Alignment & Theory Scholars Program - Summer 2024 Cohort. Thanks to Nina Panickssery, Mrinank Sharma, and Fabien Roger for helpful discussion.
Summary
We investigate RMU, a recent unlearning method proposed by Li et al. (2024), through the lens of model internals. Through this lens, we find that RMU mostly works by flooding the residual stream with "junk" in hazardous contexts, resulting in incoherence. We then propose a simple intervention to "clear the junk" from the residual stream.
This intervention mostly restores the model's coherence in hazardous contexts, and recovers a significant proportion (but not all) of its original hazardous knowledge. This suggests that the effectiveness of RMU can be understood roughly in two pieces: (1) a shallow mechanism, where the residual stream is flooded with junk; and (2) a deeper mechanism, where even after the junk is cleared, knowledge is still inaccessible.
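As a rough illustration of what such a "clear the junk" intervention could look like, here is a minimal sketch, assuming the junk is well approximated by a single direction estimated from the difference between RMU and baseline activations on hazardous prompts. The function names and the rank-1 projection are our own illustrative assumptions, not necessarily the exact intervention used in the post.
```python
import torch

def estimate_junk_direction(rmu_acts: torch.Tensor, base_acts: torch.Tensor) -> torch.Tensor:
    # rmu_acts, base_acts: [n_prompts, d_model] residual-stream activations on hazardous
    # prompts from the RMU model and the baseline model respectively.
    # Assume the "junk" lies along the mean difference between the two.
    diff = (rmu_acts - base_acts).mean(dim=0)
    return diff / diff.norm()

def clear_junk(resid: torch.Tensor, junk_dir: torch.Tensor) -> torch.Tensor:
    # Project the junk direction out of a residual-stream activation [..., d_model].
    return resid - (resid @ junk_dir).unsqueeze(-1) * junk_dir

# Usage sketch with stand-in activations (d_model = 4096 for a 7B model):
rmu_acts, base_acts = torch.randn(8, 4096) + 20.0, torch.randn(8, 4096)
junk = estimate_junk_direction(rmu_acts, base_acts)
cleaned = clear_junk(rmu_acts, junk)
```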
What is RMU?
Representation Misdirection for Unlearning (RMU) is a state-of-the-art unlearning method presented by Li et al. (2024).
In the unlearning paradigm, we would like the model to unlearn (or "forget") some hazardous knowledge. At the same time, we would also like to make sure the model retains non-hazardous knowledge, so that the model remains useful.
This partition of knowledge is usually specified by constructing a "forget" dataset Dforget, consisting of the hazardous knowledge to be unlearned, and a "retain" dataset Dretain, consisting of non-hazardous knowledge to be retained.
Let M denote our original model. RMU specifies a method for fine-tuning M on Dforget and Dretain in order to obtain a modified model M' satisfying the unlearning objective.
The main idea of RMU is as follows:
On hazardous data, the internal activations of M' should be scrambled.
On non-hazardous data, the internal activations of M' should be unchanged, i.e. close to those of the original model M.
These two ideas are concretely operationalized as two distinct terms in the loss during fine-tuning:
On Dforget, incentivize activations a'ℓ at some layer ℓ to be close to a large, randomly sampled vector c·u.
"Forget" loss term: ||a'ℓcu||22.
On Dretain, incentivize activations a'ℓ at some layer ℓ to be close to the original model's activations aℓ.
"Retain" loss term: ||a'ℓaℓ||22.
Note that u is a random unit vector sampled before the fine-tuning procedure, and kept constant throughout (i.e. it is not freshly sampled at each training step). Also note that the layer ℓ at which to target activations and the scalar multiplier c are both predetermined hyperparameters.
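Putting these pieces together, the fine-tuning loss can be sketched roughly as follows (a minimal PyTorch-style sketch under our own naming; the reduction over token positions and the retain-loss weight alpha are assumptions rather than the authors' exact implementation):
```python
import torch

def rmu_loss(acts_forget, acts_retain, acts_retain_orig, u, c, alpha):
    # acts_forget:      layer-l activations of the fine-tuned model M' on a D_forget batch
    # acts_retain:      layer-l activations of M' on a D_retain batch
    # acts_retain_orig: layer-l activations of the frozen original model M on the same batch
    # u: fixed random unit vector [d_model], sampled once before training
    # c: scalar multiplier; alpha: retain-loss weight (both hyperparameters)
    forget_loss = ((acts_forget - c * u) ** 2).sum(dim=-1).mean()
    retain_loss = ((acts_retain - acts_retain_orig) ** 2).sum(dim=-1).mean()
    return forget_loss + alpha * retain_loss

# Usage sketch with stand-in activations of shape [batch, seq, d_model]:
d = 4096
u = torch.randn(d); u = u / u.norm()
loss = rmu_loss(torch.randn(2, 16, d), torch.randn(2, 16, d),
                torch.randn(2, 16, d), u, c=300.0, alpha=100.0)
```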
Examining an RMU model
The original paper (Li et al., 2024) performs RMU over multiple open-source models of varying scales. The authors made all code available on GitHub, and all resulting models available on HuggingFace.[1]
For our analysis, we pick a single model pair: zephyr-7B-beta (which we will refer to as "baseline") and Zephyr_RMU (which we will refer to as "RMU").
The RMU model has been fine-tuned to unlearn two domains of knowledge: hazardous biology knowledge, and hazardous cybersecurity knowledge.
Prompting with hazardous instructions
Prompting the RMU model with an instruction in one of these domains causes it to output gibberish, as we would expect from a model with its activations scrambled.
Looking at activations
We can take a handful of hazardous prompts, run them through the baseline and RMU models, and compare their activations. We specifically study the activations at the last tok...

Jul 23, 2024 • 6min
EA - Vida Plena's 2023 Impact Report: Measuring Progress and Looking Ahead by Vida Plena
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Vida Plena's 2023 Impact Report: Measuring Progress and Looking Ahead, published by Vida Plena on July 23, 2024 on The Effective Altruism Forum.
We at Vida Plena are proud to present our first Annual Impact Report.
2023 was our first full year, and it was a year of learning. We had just finished a successful pilot and started the year with the mission of building a solid foundation and proving that our therapy model works at scale.
This first annual impact report is our attempt to capture, through charts and graphs, crucial evidence about who we helped in 2023 and where we can continue to improve.
Background Context
Vida Plena (meaning 'a flourishing life' in Spanish) is a nonprofit organization based in Quito, Ecuador, which launched in 2022 (see our launch post here).
Our mission is to build strong mental health in low-income and refugee communities, who otherwise would have no access to care. We provide evidence-based depression treatment using group interpersonal therapy, which is highly cost-effective and scalable.
Main Findings
Our main findings during the process of creating this report were:
In 2023, we screened 882 people for depression. 434 (49%) of these became participants, taking at least 1 group session.
Program participants had an average reduction of 6.6 points on the PHQ-9 questionnaire. 68% of participants who entered with moderate to severe depression clinically improved, meaning their PHQ-9 score dropped by at least 5 points, the threshold generally considered a clinically significant improvement.
We also saw improvements in secondary indicators: reduced thoughts of self-harm and suicidal ideation, reduced anxiety, and better psychosocial functioning and employment. Participants who fill out our end-line survey also report high satisfaction with the program and increased feelings of hope and purpose.
90% of participants came from vulnerable groups, the most common of which were people experiencing food insecurity (56%), female heads of households (34%), and migrants and refugees (22%).
Participant recovery seems to be related mostly to the baseline level of depression and not so much to the number of sessions taken or other variables like the modality of the sessions (virtual or in person).
Challenges
While we are excited about these results, there are still many challenges and areas where we need to improve. In particular:
Even though 5 points is considered a clinically significant change on the PHQ-9 scale, the 6.6-point average drop is still below our more ambitious target. In 2024, we aim to raise this to a nine-point average reduction among participants entering with moderate to severe depression.
Relatedly, we aim to improve our participant retention rate. Our initial findings suggest that participants may drop out when they start feeling better. We believe there is room for them to continue improving and learning important skills to enhance their resilience and strengthen their support network if they attend more therapy sessions.
Limitations
We are also aware that this first report has limitations.
First, we rely primarily on pre-post participant comparisons, with no randomized control group. We partially compensate for this by considering spontaneous remission data from the scientific literature. However, our priority in the coming years is to implement control groups, where people who are not involved in Vida Plena g-IPT sessions take PHQ-9 assessments over eight weeks, to determine our population's spontaneous remission rate.
Secondly, some of the data we collect is likely subject to multiple biases. For example, the program satisfaction data we have, as well as many secondary indicators, come from people who take the end-line survey at the end of their 8th group session. People who get that far into the program without dropping out are likely the ones who saw the most value in it, and this can skew our conclus...

Jul 23, 2024 • 6min
LW - D&D.Sci Scenario Index by aphyer
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D&D.Sci Scenario Index, published by aphyer on July 23, 2024 on LessWrong.
There have been a lot of D&D.Sci scenarios, but there's a lot of variance between them in complexity and quality. Some are more difficult, and might not be a good place to start, while others are much simpler - some were very good, while others on reflection didn't flow quite right.
Unfortunately, LW karma doesn't track the quality of these scenarios very well: often mediocre scenarios are higher-karma than better scenarios (whether because they had good writing around a poor scenario, or because people upvoted before playing them, or just because more people happened to be online and see them).
If you're interested in playing D&D.Sci scenarios, but don't know where to start, this index (compiled by frequent authors abstractapplic and aphyer; we'll try to keep it updated going forwards) is a good reference point to make sure you can pick good scenarios at a difficulty level you're comfortable with.
If you're new to D&D.Sci, you should probably start with the lower-Complexity scenarios and move up to the higher-Complexity ones. Scenarios with Quality Rating 1-2 are probably less worth playing, while the higher-rated ones are ones we'd recommend.
Each scenario is listed below with its Complexity Rating (1=easy, 5=hard), Quality Rating (1=low, 5=high), and Author[1].
D&D.Sci: Whom Shall You Call? - Complexity 2, Quality 2[2], by abstractapplic
D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues - Complexity 3, Quality 5, by aphyer
D&D.Sci Long War: Defender of Data-mocracy - Complexity 4, Quality 4, by aphyer
D&D.Sci (Easy Mode): On The Construction Of Impossible Structures - Complexity 1, Quality 3, by abstractapplic
D&D.Sci: The Mad Tyrant's Pet Turtles - Complexity 4, Quality 4[3], by abstractapplic
D&D.Sci(-fi): Colonizing the SuperHyperSphere - Complexity 3, Quality 3[3], by abstractapplic
D&D.Sci 5E: Return of the League of Defenders - Complexity 4, Quality 3, by aphyer
D&D.Sci: All the D8a. Allllllll of it. - Complexity 5, Quality 1[4], by aphyer
D&D.Sci December 2022: The Boojumologist - Complexity 2, Quality 1[2], by abstractapplic
D&D.Sci September 2022: The Allocation Helm - Complexity 3, Quality 4, by abstractapplic
Dwarves & D.Sci: Data Fortress - Complexity 3, Quality 3, by aphyer
Ars D&D.sci: Mysteries of Mana - Complexity 3, Quality 3, by aphyer
D&D.Sci June 2022: A Goddess Tried To Reincarnate Me Into Another World - Complexity 2, Quality 2[2], by abstractapplic
D&D.Sci Divination: Nine Black Doves - Complexity 4, Quality 2, by aphyer
Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! - Complexity 5, Quality 5, by aphyer
D&D.SCP: Anomalous Acquisitions - Complexity 5, Quality 2[5], by aphyer
D&D.Sci Holiday Special: How the Grinch Pessimized Christmas - Complexity 3, Quality 3, by aphyer
D&D.Sci Dungeoncrawling: The Crown of Command - Complexity 4, Quality 3, by aphyer
D&D.Sci 4th Edition: League of Defenders of the Storm - Complexity 4, Quality 5, by aphyer
D&D.Sci Pathfinder: Return of the Gray Swan - Complexity 5[6], Quality 2, by aphyer
D&D.Sci August 2021: The Oracle and the Monk - Complexity 2, Quality 4, by abstractapplic
D&D.Sci(-Fi) June 2021: The Duel with Earwax - Complexity 4, Quality 3, by abstractapplic
D&D.Sci May 2021: Monster Carcass Auction - Complexity 2, Quality 2, by abstractapplic
D&D.Sci April 2021: Voyages of the Gray Swan - Complexity 2, Quality 5[3], by abstractapplic
D&D.Sci III: Mancer Matchups - Complexity 3, Quality 1, by abstractapplic
D&D.Sci II: The Sorceror's Personal Shopper - Complexity 2, Quality 5[3], by abstractapplic
D&D.Sci - Complexity 3, Quality 5, by abstractapplic
If you disagree with any of these ratings, let us know; we're happy to review them. There were some scenarios where we disagreed on the correct rating while compiling this list, and we'd appreciate your comments as an outside view, especially if you're a frequent player!
[1] Keen-eyed readers will notice a correlation between this column and the 'Complexity' column.
[2] abstractapplic: These scenarios were attempts to convey / demonstrate specific ideas with real-world relevance; I judge that they failed at this; I therefore grade them a little less generously than you might.
[3] abstractapplic: These scenarios were attempts to convey / demonstrate specific ideas with real-world relevance; I judge that they succeeded at this; I therefore grade them a little more generously than you might.
[4] aphyer: I thought this scenario was great, and still do, but given that ...

Jul 23, 2024 • 25min
AF - ML Safety Research Advice - GabeM by Gabe M
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ML Safety Research Advice - GabeM, published by Gabe M on July 23, 2024 on The AI Alignment Forum.
This is my advice for careers in empirical ML research that might help AI safety (ML Safety). Other ways to improve AI safety, such as through AI governance and strategy, might be more impactful than ML safety research (I generally think they are). Skills can be complementary, so this advice might also help AI governance professionals build technical ML skills.
1. Career Advice
1.1 General Career Guides
Preventing an AI-related catastrophe - 80,000 Hours
A Survival Guide to a PhD (Andrej Karpathy)
How to pursue a career in technical AI alignment - EA Forum
AI safety technical research - Career review - 80,000 Hours
Beneficial AI Research Career Advice
2. Upskilling
2.1 Fundamental AI Safety Knowledge
AI Safety Fundamentals - BlueDot Impact
AI Safety, Ethics, and Society Textbook
Forming solid AI safety threat models helps you select impactful research ideas.
2.2 Speedrunning Technical Knowledge in 12 Hours
Requires some basic coding, calculus, and linear algebra knowledge
Build Intuition for ML (5h)
Essence of linear algebra - 3Blue1Brown (3h)
Neural networks - 3Blue1Brown (2h)
Backpropagation, the foundation of deep learning (3h)
Neural Networks: Backpropagation - CS 231N (0.5h)
The spelled-out intro to neural networks and backpropagation: building micrograd (2.5h)
Transformers and LLMs (4h)
[1hr Talk] Intro to Large Language Models (1h)
The Illustrated Transformer - Jay Alammar (1h)
Let's build GPT: from scratch, in code, spelled out. (2h)
2.3 How to Build Technical Skills
Traditionally, people take a couple of deep learning classes.
Stanford CS 224N | Natural Language Processing with Deep Learning (lecture videos)
Practical Deep Learning for Coders - Practical Deep Learning (fast.ai)
Other curricula that seem good:
Syllabus | Intro to ML Safety
Levelling Up in AI Safety Research Engineering [Public]
ARENA
Maybe also check out recent topical classes like this with public lecture recordings: CS 194/294-267 Understanding Large Language Models: Foundations and Safety
Beware of studying too much.
You should aim to understand the fundamentals of ML through 1 or 2 classes and then practice doing many manageable research projects with talented collaborators or a good mentor who can make time to meet with you.
It's easy to keep taking classes, but you tend to learn many more practical ML skills through practice doing real research projects.
You can also replicate papers to build experience. Be sure to focus on key results rather than wasting time replicating many experiments.
"One learns from books and reels only that certain things can be done. Actual learning requires that you do those things." -Frank Herbert
Note that ML engineering skills will be less relevant over time as AI systems become better at writing code.
A friend didn't study computer science but got into MATS 2023 with good AI risk takes. Then, they had GPT-4 write most of their code for experiments and did very well in their stream.
Personally, GitHub Copilot and language model apps with code interpreters/artifacts write a significant fraction of my code.
However, fundamental deep learning knowledge is still useful for making sound decisions about what experiments to run.
2.4 Math
You don't need much of it to do empirical ML research.
Someone once told me, "You need the first chapter of a calculus textbook and the first 5 pages of a linear algebra textbook" to understand deep learning.
You need more math for ML theory research, but theoretical research is not as popular right now.
Beware mathification: authors often add unnecessary math to appease (or sometimes confuse) conference reviewers.
If you don't understand some mathematical notation in an empirical paper, you can often send a screenshot to an LLM chatbot f...

Jul 22, 2024 • 14min
LW - Categories of leadership on technical teams by benkuhn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Categories of leadership on technical teams, published by benkuhn on July 22, 2024 on LessWrong.
This is an adaptation of an internal doc I wrote for Anthropic.
Recently I've been having a lot of conversations about how to structure and staff teams. One framework I've referenced repeatedly is to break down team leadership into a few different categories of responsibility.
This is useful for a couple reasons. One is that it helps you get more concrete about what leading a team involves; for new managers, having an exhaustive list of job responsibilities is helpful to make sure you're tracking all of them.
More importantly, though, we often want to somehow split these responsibilities between people. Team leadership covers a huge array of things - as you can see from how long this post is - and trying to find someone who can be great at all of them is often a unicorn hunt. Even if you do find someone good-enough at all of them, they usually spike in 1-2 areas, and it might be higher-leverage for them to fully focus on those.
Here's a breakdown I use a lot:[1]
Categories
Overall direction
The most important responsibility of a team's leadership is to ensure that the team is headed in the right direction - that is, are they working towards the right high-level goal, and do they have an achievable plan to get there? Overall direction tends to get input from many people inside and outside a team, but who is most accountable for it can vary; see Example divisions of responsibility below.
Overall direction involves working on things like:
Setting the team's mission, vision, or charter
Choosing the team's goals, plans and roadmap
Prioritizing the various different projects the team could take on
Communicating the above, both to team members and to people outside
The most important skill for getting this right is having good predictive models (of both the team's domain and the organization) - since prioritization is ultimately a question about "what will be the impact if we pursue this project." Being great at communicating those predictive models, and the team's priorities and goals, to other stakeholders is also important.
Good team direction mostly looks like the team producing a steady stream of big wins. Poor direction most commonly manifests as getting caught by surprise or falling behind - that is, mispredicting what work will be most important and doing too little of it, for example by starting too late, under-hiring, or not growing people into the right skillset or role.
Other signs of poor direction include team members not understanding why they're working on something; the team working on projects that deliver little value; friction with peer teams or arguments about scope; or important projects falling through the cracks between teams.
People management
People management means being responsible for the success of the people on the team, most commonly including things like:
Coaching people to improve and grow in their careers
Designing and overseeing hiring processes for their team
Setting and communicating performance expectations and evaluating against them
Day to day, the most important responsibility here is recurring 1:1s (the coaching kind, not the status update kind). Others include writing job descriptions, setting up interview loops, sourcing candidates, gathering feedback, writing performance reviews, helping people navigate org policies, giving career coaching, etc.
The most important skill for people management is understanding people - both in the traditional "high EQ" sense of being empathetic and good at seeing others' perspectives, but also in the sense of knowing what contributes to high performance in a domain (e.g. what makes someone a great engineer or researcher). It's also important to be good at having tricky conversations in a compassionate but fi...

Jul 22, 2024 • 32sec
LW - The $100B plan with "70% risk of killing us all" w Stephen Fry [video] by Oleg Trott
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The $100B plan with "70% risk of killing us all" w Stephen Fry [video], published by Oleg Trott on July 22, 2024 on LessWrong.
A high production value 16-minute video that summarizes the popular safety concerns, featuring Hinton, Russell and Claude 3.5.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 22, 2024 • 20min
LW - Efficient Dictionary Learning with Switch Sparse Autoencoders by Anish Mudide
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Efficient Dictionary Learning with Switch Sparse Autoencoders, published by Anish Mudide on July 22, 2024 on LessWrong.
Produced as part of the ML Alignment & Theory Scholars Program - Summer 2024 Cohort
0. Summary
To recover all the relevant features from a superintelligent language model, we will likely need to scale sparse autoencoders (SAEs) to billions of features. Using current architectures, training extremely wide SAEs across multiple layers and sublayers at various sparsity levels is computationally intractable. Conditional computation has been used to scale transformers (Fedus et al.) to trillions of parameters while retaining computational efficiency.
We introduce the Switch SAE, a novel architecture that leverages conditional computation to efficiently scale SAEs to many more features.
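To make the idea concrete, here is a minimal sketch of a switch-style SAE in PyTorch: a router assigns each input to one expert SAE, so only a small slice of the full dictionary is touched per input. The sizes, the TopK activation, and the hard argmax routing (with load balancing and router differentiability omitted) are our simplifying assumptions, not necessarily the architecture described in the post.
```python
import torch
import torch.nn as nn

class SwitchSAE(nn.Module):
    # Route each activation to one of n_experts small SAEs (conditional computation),
    # so the effective dictionary has n_experts * d_expert features while each input
    # only pays for a single expert's encoder and decoder.
    def __init__(self, d_model=768, n_experts=8, d_expert=2048, k=32):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.W_enc = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.01)
        self.k = k

    def forward(self, x):                       # x: [batch, d_model]
        expert = self.router(x).argmax(dim=-1)  # hard routing: one expert per input
        enc, dec = self.W_enc[expert], self.W_dec[expert]
        pre = torch.einsum("bd,bdf->bf", x, enc)
        vals, idx = pre.topk(self.k, dim=-1)    # TopK sparsity within the chosen expert
        latents = torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))
        return torch.einsum("bf,bfd->bd", latents, dec)

x_hat = SwitchSAE()(torch.randn(4, 768))        # reconstruct a batch of activations
```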
1. Introduction
The internal computations of large language models are inscrutable to humans. We can observe the inputs and the outputs, as well as every intermediate step in between, and yet, we have little to no sense of what the model is actually doing.
For example, is the model inserting security vulnerabilities or backdoors into the code that it writes? Is the model lying, deceiving or seeking power? Deploying a superintelligent model into the real world without being aware of when these dangerous capabilities may arise leaves humanity vulnerable. Mechanistic interpretability (Olah et al.) aims to open the black-box of neural networks and rigorously explain the underlying computations.
Early attempts to identify the behavior of individual neurons were thwarted by polysemanticity, the phenomenon in which a single neuron is activated by several unrelated features (Olah et al.). Language models must pack an extremely vast amount of information (e.g., the entire internet) within a limited capacity, encouraging the model to rely on superposition to represent many more features than there are dimensions in the model state (Elhage et al.).
Sharkey et al. and Cunningham et al. propose to disentangle superimposed model representations into monosemantic, cleanly interpretable features by training unsupervised sparse autoencoders (SAEs) on intermediate language model activations. Recent work (Templeton et al., Gao et al.) has focused on scaling sparse autoencoders to frontier language models such as Claude 3 Sonnet and GPT-4. Despite scaling SAEs to 34 million features, Templeton et al. estimate that they are likely orders of magnitude short of capturing all features. Furthermore, Gao et al. train SAEs on a series of language models and find that larger models require more features to achieve the same reconstruction error. Thus, to capture all relevant features of future large, superintelligent models, we will likely need to scale SAEs to several billions of features.
With current methodologies, training SAEs with billions of features at various layers, sublayers and sparsity levels is computationally infeasible.
Training a sparse autoencoder generally consists of six major computations: the encoder forward pass, the encoder gradient, the decoder forward pass, the decoder gradient, the latent gradient and the pre-bias gradient. Gao et al. introduce kernels and tricks that leverage the sparsity of the TopK activation function to dramatically optimize all computations excluding the encoder forward pass, which is not (yet) sparse. After implementing these optimizations, Gao et al. attribute the majority of the compute to the dense encoder forward pass and the majority of the memory to the latent pre-activations. No work has attempted to accelerate or improve the memory efficiency of the encoder forward pass, which remains the sole dense matrix multiplication.
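For reference, a plain (non-switch) TopK SAE forward pass looks roughly like the sketch below; the first line is the dense encoder matrix multiplication discussed above, while the decoder only needs the k active latents per input. Naming is ours, not Gao et al.'s code.
```python
import torch

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    # x: [batch, d_model]; W_enc: [d_model, n_latents]; W_dec: [n_latents, d_model]
    pre = x @ W_enc + b_enc                 # dense encoder forward pass (the bottleneck)
    vals, idx = pre.topk(k, dim=-1)         # TopK activation: keep only k latents per input
    vals = torch.relu(vals)
    # Sparse decoder: gather only the k active decoder rows for each example.
    recon = torch.einsum("bk,bkd->bd", vals, W_dec[idx]) + b_dec
    return recon, (vals, idx)

d_model, n_latents = 768, 16384
recon, _ = topk_sae_forward(torch.randn(4, d_model),
                            torch.randn(d_model, n_latents) * 0.01, torch.zeros(n_latents),
                            torch.randn(n_latents, d_model) * 0.01, torch.zeros(d_model))
```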
In a standard deep learning model, every parameter is used for every input. An alternative approach is conditional computatio...

Jul 22, 2024 • 32min
LW - Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities by Axel Højmark
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities, published by Axel Højmark on July 22, 2024 on LessWrong.
Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer
Introduction
To mitigate risks from future AI systems, we need to assess their capabilities accurately. Ideally, we would have rigorous methods to upper bound the probability of a model having dangerous capabilities, even if these capabilities are not yet present or easily elicited.
The paper "Evaluating Frontier Models for Dangerous Capabilities" by Phuong et al. 2024 is a recent contribution to this field from DeepMind. It proposes new methods that aim to estimate, as well as upper-bound the probability of large language models being able to successfully engage in persuasion, deception, cybersecurity, self-proliferation, or self-reasoning. This post presents our initial empirical and theoretical findings on the applicability of these methods.
Their proposed methods have several desirable properties. Instead of repeatedly running the entire task end-to-end, the authors introduce milestones. Milestones break down a task and provide estimates of partial progress, which can reduce variance in overall capability assessments. The expert best-of-N method uses expert guidance to elicit rare behaviors and quantifies the expert assistance as a proxy for the model's independent performance on the task.
However, we find that relying on milestones tends to underestimate the overall task success probability for most realistic tasks. Additionally, the expert best-of-N method fails to provide values directly correlated with the probability of task success, making its outputs less applicable to real-world scenarios. We therefore propose an alternative approach to the expert best-of-N method, which retains its advantages while providing more calibrated results.
Except for the end-to-end method, we currently feel that no method presented in this post would allow us to reliably estimate or upper bound the success probability for realistic tasks and thus should not be used for critical decisions.
The overarching aim of our MATS project is to uncover agent scaling trends, allowing the AI safety community to better predict the performance of future LLM agents from characteristics such as training compute, scaffolding used for agents, or benchmark results (Ruan et al., 2024). To avoid the issue of seemingly emergent abilities resulting from bad choices of metrics (Schaeffer et al., 2023), this work serves as our initial effort to extract more meaningful information from agentic evaluations.
We are interested in receiving feedback and are particularly keen on alternative methods that enable us to reliably assign low-probability estimates (e.g. 1e-7) to a model's success rate on a task.
Evaluation Methodology of Phuong et al.
The goal of the evaluations we discuss is to estimate the probability of an agent succeeding on a specific task T. Generally, when we refer to an agent, we mean an LLM wrapped in scaffolding that lets it execute shell commands, run code, or browse the web to complete some predetermined task.
Formally, the goal is to estimate P(Ts), the probability that the agent solves task T and ends up in the solved state Ts. The naive approach is to estimate this with Monte Carlo sampling: run the task end-to-end many times and take the fraction of successful runs as the estimate of P(Ts).
The authors call this the end-to-end method.
However, the end-to-end method struggles with low-probability events. The expected number of trials needed to observe one success for a task is 1/P(Ts), making naive Monte Carlo sampling impractical for many low-probability, long-horizon tasks. In practice, this could require running multi-hour tasks hundreds of thousands of times.
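A minimal sketch of the end-to-end estimator, and of why it becomes impractical at low success probabilities (run_task and the numbers here are hypothetical stand-ins for a full agent rollout):
```python
import random

def end_to_end_estimate(run_task, n_trials: int) -> float:
    # Naive Monte Carlo: run the full task n_trials times and count successes.
    # run_task() -> bool stands in for one complete (possibly multi-hour) agent rollout.
    successes = sum(run_task() for _ in range(n_trials))
    return successes / n_trials

# If the true success probability is p, roughly 1/p full rollouts are needed
# to expect even a single success, e.g. p = 1e-4 -> ~10,000 rollouts.
p_true = 1e-4
estimate = end_to_end_estimate(lambda: random.random() < p_true, n_trials=10_000)
```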
To address this challenge, Phuong et al. devise three additional method...

Jul 22, 2024 • 27min
LW - On the CrowdStrike Incident by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the CrowdStrike Incident, published by Zvi on July 22, 2024 on LessWrong.
Things went very wrong on Friday.
A bugged CrowdStrike update temporarily bricked quite a lot of computers, bringing down such fun things as airlines, hospitals and 911 services.
It was serious out there.
Ryan Peterson: Crowdstrike outage has forced Starbucks to start writing your name on a cup in marker again and I like it.
What (Technically) Happened
My understanding is that it was a rather stupid bug: a NULL pointer dereference in the memory-unsafe C++ language.
Zack Vorhies: Memory in your computer is laid out as one giant array of numbers. We represent these numbers here as hexadecimal, which is base 16 (hexadecimal) because it's easier to work with… for reasons.
The problem area? The computer tried to read memory address 0x9c (aka 156).
Why is this bad?
This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS.
So why is memory address 0x9c trying to be read from? Well because… programmer error.
It turns out that C++, the language crowdstrike is using, likes to use address 0x0 as a special value to mean "there's nothing here", don't try to access it or you'll die.
…
And what's bad about this is that this is a special program called a system driver, which has PRIVILEGED access to the computer. So the operating system is forced to, out of an abundance of caution, crash immediately.
This is what is causing the blue screen of death. A computer can recover from a crash in non-privileged code by simply terminating the program, but not a system driver. When your computer crashes, 95% of the time it's because it's a crash in the system drivers.
If the programmer had done a check for NULL, or if they used modern tooling that checks these sorts of things, it could have been caught. But somehow it made it into production and then got pushed as a forced update by Crowdstrike… OOPS!
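For readers who don't write C++, here is the shape of the missing check, sketched as a Python analogy; the parser and field names are made up for illustration, and the real fix would be a NULL check on the pointer before dereferencing it in the C++ driver.
```python
def parse_channel_file(raw: bytes):
    # Stand-in parser: returns None (the analogue of a NULL pointer) when the
    # update file is malformed.
    return None  # simulate the bad content update

def apply_update(raw: bytes):
    record = parse_channel_file(raw)
    if record is None:            # the guard that was effectively missing
        return "skip update"      # fail gracefully instead of crashing the machine
    return record["signature"]    # only touch the data once we know it exists

print(apply_update(b"\x00" * 64))  # -> "skip update" rather than a crash
```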
Here is another technical breakdown.
A non technical breakdown would be:
1. CrowdStrike is set up to run whenever you start the computer.
2. Then someone pushed an update to a ton of computers.
3. Which is something CrowdStrike was authorized to do.
4. The update contained a stupid bug, that would have been caught if those involved had used standard practices and tests.
5. With the bug, it tries to access memory in a way that causes a crash.
6. Which also crashes the computer.
7. So you have to do a manual fix to each computer to get around this.
8. If this had been malicious it could probably have permawiped all the computers, or inserted Trojans, or other neat stuff like that.
9. So we dodged a bullet.
10. Also, your AI safety plan needs to take into account that this was the level of security mindset and caution at CrowdStrike, despite CrowdStrike having this level of access and being explicitly in the security mindset business, and that they were given this level of access to billions of computers, and that their stock was only down 11% on the day so they probably keep most of that access and we aren't going to fine them out of existence either.
Yep.
Who to Blame?
George Kurtz (CEO CrowdStrike): CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.
We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website. We further recommend organizations ensure they're communicating with CrowdStrike representatives through official channels. Our team is fully mobilized to ensure the security and stability of CrowdStrike customers.
Dan Elton: No apology. Many people have...

Jul 22, 2024 • 32min
AF - Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities by Axel Højmark
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities, published by Axel Højmark on July 22, 2024 on The AI Alignment Forum.
Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer
Introduction
To mitigate risks from future AI systems, we need to assess their capabilities accurately. Ideally, we would have rigorous methods to upper bound the probability of a model having dangerous capabilities, even if these capabilities are not yet present or easily elicited.
The paper "Evaluating Frontier Models for Dangerous Capabilities" by Phuong et al. 2024 is a recent contribution to this field from DeepMind. It proposes new methods that aim to estimate, as well as upper-bound the probability of large language models being able to successfully engage in persuasion, deception, cybersecurity, self-proliferation, or self-reasoning. This post presents our initial empirical and theoretical findings on the applicability of these methods.
Their proposed methods have several desirable properties. Instead of repeatedly running the entire task end-to-end, the authors introduce milestones. Milestones break down a task and provide estimates of partial progress, which can reduce variance in overall capability assessments. The expert best-of-N method uses expert guidance to elicit rare behaviors and quantifies the expert assistance as a proxy for the model's independent performance on the task.
However, we find that relying on milestones tends to underestimate the overall task success probability for most realistic tasks. Additionally, the expert best-of-N method fails to provide values directly correlated with the probability of task success, making its outputs less applicable to real-world scenarios. We therefore propose an alternative approach to the expert best-of-N method, which retains its advantages while providing more calibrated results.
Except for the end-to-end method, we currently feel that no method presented in this post would allow us to reliably estimate or upper bound the success probability for realistic tasks and thus should not be used for critical decisions.
The overarching aim of our MATS project is to uncover agent scaling trends, allowing the AI safety community to better predict the performance of future LLM agents from characteristics such as training compute, scaffolding used for agents, or benchmark results (Ruan et al., 2024). To avoid the issue of seemingly emergent abilities resulting from bad choices of metrics (Schaeffer et al., 2023), this work serves as our initial effort to extract more meaningful information from agentic evaluations.
We are interested in receiving feedback and are particularly keen on alternative methods that enable us to reliably assign low-probability estimates (e.g. 1e-7) to a model's success rate on a task.
Evaluation Methodology of Phuong et al.
The goal of the evaluations we discuss is to estimate the probability of an agent succeeding on a specific task T. Generally, when we refer to an agent, we mean an LLM wrapped in scaffolding that lets it execute shell commands, run code, or browse the web to complete some predetermined task.
Formally, the goal is to estimate P(Ts), the probability that the agent solves task T and ends up in the solved state Ts. The naive approach is to estimate this with Monte Carlo sampling: run the task end-to-end many times and take the fraction of successful runs as the estimate of P(Ts).
The authors call this the end-to-end method.
However, the end-to-end method struggles with low-probability events. The expected number of trials needed to observe one success for a task is 1/P(Ts), making naive Monte Carlo sampling impractical for many low-probability, long-horizon tasks. In practice, this could require running multi-hour tasks hundreds of thousands of times.
To address this challenge, Phuong et al. devise three addi...


