

The Nonlinear Library: LessWrong
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 24, 2024 • 7min
LW - You should go to ML conferences by Jan Kulveit
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You should go to ML conferences, published by Jan Kulveit on July 24, 2024 on LessWrong.
This is a second kind-of-obvious point to make, but if you are interested in AI, AI safety, or cognition in general, it is likely worth going to top ML conferences, such as NeurIPS, ICML or ICLR. In this post I cover some reasons why, along with some anecdotal stories.
1. Parts of AI alignment and safety are now completely mainstream
Looking at the "Best paper awards" at ICML, you'll find these safety-relevant or alignment-relevant papers:
Stealing part of a production language model by Carlini et al.
Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo by Zhao et al.
Debating with More Persuasive LLMs Leads to More Truthful Answers by Khan et al.
Genie: Generative Interactive Environments by Bruce et al.
which amounts to about one-third of the best paper awards (!). "Because of safety concerns" is part of the motivation for hundreds of papers.
While the signal-to-noise ratio is even worse than on LessWrong, in total the amount you can learn is higher - my personal guess is there is maybe 2-3x as much prosaic-AI-safety-relevant work at conferences as what you get by just following LessWrong, the Alignment Forum and safety-oriented communication channels.
2. Conferences are an efficient way to screen general ML research without spending a lot of time on X
Almost all papers are presented in the form of posters. In the case of a big conference, this usually means many thousands of posters presented in huge poster sessions.
My routine for engaging with this firehose of papers:
1. For each session, read all the titles. Usually, this prunes the list by a factor of ten (e.g. from 600 papers to 60).
2. Read the abstracts. Prune it to things which I haven't noticed before and seem relevant. For me, this is usually by a factor of ~3-5.
3. Visit the posters. Posters with paper authors present are actually a highly efficient way to digest research:
Sometimes, you suspect there is some assumption or choice hidden somewhere making the result approximately irrelevant - just asking can often resolve this in a matter of tens of seconds.
Posters themselves don't undergo peer review, which makes the communication more honest, with less hedging.
Usually authors of a paper know significantly more about the problem than what's in the paper, and you can learn more about negative results, obstacles, or directions people are excited about.
A clear disadvantage of conferences is the time lag: by the time papers are presented, some of the main results are old and well known, but in my view a lot of the value is in the long tail of results which are sometimes very useful but not attention-grabbing.
3. ML research community as a control group
My vague impression is that in conceptual research, mainstream ML research lags behind the LW/AI safety community by something between 1 and 5 years, rediscovering topics discussed here. Some examples:
The Platonic Representation Hypothesis (ICML poster & oral presentation) is an independent version of Natural abstractions, discussed here for about 4 years.
A Roadmap to Pluralistic Alignment deals with the Self-unalignment problem and Coherent extrapolated volition
Plenty of research on safety protocols like debate, IDA,...
Prior work published in the LW/AI safety community is almost never cited or acknowledged - in some cases because it is more convenient to claim the topic is completely novel, but I suspect in many cases researchers are genuinely not aware of the existing work, which makes their contribution a useful control: if someone starts thinking about these topics, unaware of the thousands of hours spent on them by dozens of people, what will they arrive at?
4. What 'experts' think
The ML research community is the intellectual home of many people expressing public opinions about AI risk. In my view, b...

Jul 24, 2024 • 10min
LW - The Cancer Resolution? by PeterMcCluskey
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Cancer Resolution?, published by PeterMcCluskey on July 24, 2024 on LessWrong.
Book review: The Cancer Resolution?: Cancer reinterpreted through another lens, by Mark Lintern.
In the grand tradition of outsiders overturning scientific paradigms, this book proposes a bold new theory: cancer isn't a cellular malfunction, but a fungal invasion.
Lintern spends too many pages railing against the medical establishment, which feels more like ax-grinding than science. I mostly agreed with his conclusions here, though for somewhat different reasons than the ones he provides.
If you can push through this preamble, you'll find a treasure trove of scientific intrigue.
Lintern's central claim is that fungal infections, not genetic mutations, are the primary cause of cancer. He dubs this the "Cell Suppression theory," painting a picture of fungi as cellular puppet masters, manipulating our cells for their own nefarious ends. This part sounds much more like classical science, backed by hundreds of quotes from peer-reviewed literature.
Those quotes provide extensive evidence that Lintern's theory predicts dozens of cancer features better than do the established theories.
Older Theories
1. The DNA Theory (aka Somatic Mutation Theory): The reigning heavyweight, this theory posits that cancer results from an accumulation of genetic mutations in critical genes that control cell growth, division, and death.
2. The Metabolic Theory: Another old theory that still has advocates. It suggests that cancer is primarily a metabolic disease, characterized by impaired cellular energy production (the Warburg effect). It proposes that damage to mitochondria is a key factor in cancer development. I wrote a mixed review of a book about it.
Lintern points out evidence that mitochondria are turned off by signals, not damaged. He also notes that tumors with malfunctioning mitochondria are relatively benign.
Evidence Discrediting the DNA Theory
The standard version of the DNA Theory predicts that all cancer cells will have mutations that affect replication, apoptosis, etc.
Around 2008 to 2013, substantial genetic data became available for cancer cells. Lintern wants us to believe that this evidence fully discredits the DNA Theory.
The actual evidence seems more complex than Lintern indicates.
The strongest evidence is that they found cancers that seem to have no mutations.
Almost as important is that the mutations that are found seem more randomly distributed than would be expected if they caused consistent types of malfunctions.
Lintern's theory seems to explain all of the Hallmarks of Cancer, as well as a few dozen other features that seem to occur in all cancers.
He argues that the DNA Theory does a poor job of explaining the hallmarks. DNA Theorists likely reject that characterization. They appear to have thought their theory explained the hallmarks back before the genetic data became available (mostly just positing mutations for each hallmark?). My guess is that they are busy adding epicycles to their theory, but the situation is complex enough that I'm having trouble evaluating it.
He also points out that the DNA Theory struggles with Peto's Paradox (why don't larger animals get more cancer?), while his theory neatly sidesteps this issue.
Additionally, mouse embryos formed from cancer cells showed no signs of cancer.
Evidence of Fungi
A key game-changer is the growing evidence of fungi in tumors. Until 2017, tumors were thought to be microbe-free. Now? We're finding fungi in all types of cancer, with tumor-specific fungal profiles.
There's even talk of using fungal DNA signatures to distinguish cancer patients from healthy individuals.
It's not a slam dunk for Lintern's theory, but it shifts the odds significantly.
Medical Establishment Inertia
It looks like people in the medical ...

Jul 24, 2024 • 7min
LW - Confusing the metric for the meaning: Perhaps correlated attributes are "natural" by NickyP
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Confusing the metric for the meaning: Perhaps correlated attributes are "natural", published by NickyP on July 24, 2024 on LessWrong.
Epistemic status: possibly trivial, but I hadn't heard it before.
TL;DR: What I thought of as a "flaw" in PCA - its inability to isolate pure metrics - might actually be a feature that aligns with our cognitive processes. We often think in terms of composite concepts (e.g., "Age + correlated attributes") rather than pure metrics, and this composite thinking might be more natural and efficient.
Introduction
I recently found myself describing Principal Component Analysis (PCA) and pondering its potential drawbacks. However, upon further reflection, I'm reconsidering whether what I initially viewed as a limitation might actually be a feature. This led me to think about how our minds - and, potentially, language models - might naturally encode information using correlated attributes.
An important aspect of this idea is the potential conflation between the metric we use to measure something and the actual concept we're thinking about. For instance, when we think about a child's growth, we might not be consciously separating the concept of "age" from its various correlated attributes like height, cognitive development, or physical capabilities. Instead, we might be thinking in terms of a single, composite dimension that encompasses all these related aspects.
After looking at active inference a while ago, it seems to me that, in general, a lot of human heuristics and biases are there to encode real-world relationships more efficiently, and are then strained in out-of-distribution experimental settings so that they appear "irrational".
I think the easiest way to explain is with a couple of examples:
1 - Age and Associated Attributes in Children
Suppose we plotted two attributes: Age (in years) vs Height (in cm) in children. These are highly correlated, so if we perform Principal Component Analysis, we will find there are two main components. These will not correspond to orthogonal Age and Height components, since they are quite correlated. Instead, we will find an "Age + Height" direction, and a "Height relative to what is standard for that age" direction.
While one can think of this as a "failure" of PCA to find the "true things we are measuring", I think this is perhaps not the correct way to think about it.
For example, if I told you to imagine a 10-year-old, you would probably imagine them to be of height ~140±5cm. And if I told you they were 2.0m tall or 0.5m tall, you would be very surprised. On the other hand, one often hears phrases like "about the height of a 10-year-old".
That is, when we think about a child's development, we don't typically separate each attribute into distinct vectors like "age," "height," "voice pitch," and so on. Instead, we might encode a single "age + correlated attributes" vector, with some adjustments for individual variations.
This approach is likely more efficient than encoding each attribute separately. It captures the strong correlations that exist in typical development, while allowing for deviations when necessary.
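To make the PCA point concrete, here is a minimal Python sketch (my own illustration, not from the original post) using synthetic age and height data; the growth curve and noise level are assumptions. With the variables standardized, the first principal component weights age and height roughly equally (the "age + height" direction), and the second captures height relative to what is typical for that age.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic children: ages 2-16, heights roughly linear in age plus noise (assumed curve).
rng = np.random.default_rng(0)
age = rng.uniform(2, 16, size=1000)
height = 75 + 6 * age + rng.normal(0, 5, size=1000)  # cm

# Standardize so neither variable dominates due to units.
X = np.column_stack([age, height])
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2).fit(X)
print(pca.components_)                # PC1 ~ [0.71, 0.71] up to sign: "age + height"
print(pca.explained_variance_ratio_)  # PC1 carries most of the variance
# PC2 ~ [0.71, -0.71]: "height relative to what is standard for that age"
```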
When one talks about age, one can define it as:
"number of years of existence" (independent of anything else)
but when people talk about "age" in everyday life, the definition is more akin to:
"years of existence, and all the attributes correlated to that".
2 - Price and Quality of Goods
Our tendency to associate price with quality and desirability might not be a bias, but an efficient encoding of real-world patterns. A single "value" dimension that combines price, quality, and desirability could capture the most relevant information for everyday decision-making, with additional dimensions only needed for finer distinctions.
That is, "cheap" can be conceptualised ...

Jul 24, 2024 • 59min
LW - Monthly Roundup #20: July 2024 by Zvi
A fascinating critique of election forecasting flaws highlights the disconnect between models and public perception. Discussions on the decline of community on social media urge a reimagining of engagement. The podcast also delves into the intricacies of scientific integrity and the challenges of funding, urging a shift towards risk-taking in research. Personal reflections touch on the cultural impact of money and education, while nostalgia for simpler gaming experiences contrasts with modern complexities. Lastly, the exploration of extreme powers reveals their profound societal implications.

Jul 23, 2024 • 10min
LW - Unlearning via RMU is mostly shallow by Andy Arditi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Unlearning via RMU is mostly shallow, published by Andy Arditi on July 23, 2024 on LessWrong.
This is an informal research note. It is the result of a few days' exploration into RMU through the lens of model internals. Code to reproduce the main result is available here.
This work was produced as part of Ethan Perez's stream in the ML Alignment & Theory Scholars Program - Summer 2024 Cohort. Thanks to Nina Panickssery, Mrinank Sharma, and Fabien Roger for helpful discussion.
Summary
We investigate RMU, a recent unlearning method proposed by Li et al. (2024), through the lens of model internals. Through this lens, we explain that RMU mostly works by flooding the residual stream with "junk" in hazardous contexts, resulting in incoherence. We then propose a simple intervention to "clear the junk" from the residual stream.
This intervention mostly restores the model's coherence in hazardous contexts, and recovers a significant proportion (but not all) of its original hazardous knowledge. This suggests that the effectiveness of RMU can be understood roughly in two pieces: (1) a shallow mechanism, where the residual stream is flooded with junk; and (2) a deeper mechanism, where even after the junk is cleared, knowledge is still inaccessible.
What is RMU?
Representation Misdirection for Unlearning (RMU) is a state-of-the-art unlearning method presented by Li et al. (2024).
In the unlearning paradigm, we would like the model to unlearn (or "forget") some hazardous knowledge. At the same time, we would also like to make sure the model retains non-hazardous knowledge, so that the model remains useful.
This partition of knowledge is usually specified by constructing a "forget" dataset Dforget, consisting of the hazardous knowledge to be unlearned, and a "retain" dataset Dretain, consisting of non-hazardous knowledge to be retained.
Let M denote our original model. RMU specifies a method for fine-tuning M on Dforget and Dretain in order to obtain a modified model M' satisfying the unlearning objective.
The main idea of RMU is as follows:
On hazardous data, the internal activations of M' should be scrambled.
On non-hazardous data, the internal activations of M' should be unchanged, i.e. close to those of the original model M.
These two ideas are concretely operationalized as two distinct terms in the loss during fine-tuning:
On Dforget, incentivize activations a'ℓ at some layer ℓ to be close to a large randomly sampled vector cu.
"Forget" loss term: ||a'ℓcu||22.
On Dretain, incentivize activations a'ℓ at some layer ℓ to be close to the original model's activations aℓ.
"Retain" loss term: ||a'ℓaℓ||22.
Note that u is a random unit vector sampled before the fine-tuning procedure and kept constant throughout (i.e. it is not freshly sampled at each training step). Also note that the layer ℓ at which to target activations and the scalar multiplier c are predetermined hyperparameters.
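As a concrete illustration of the two loss terms, here is a minimal PyTorch-style sketch; the variable names, shapes, and the way the terms are weighted and combined are my assumptions, not the authors' code.

```python
import torch

def rmu_loss_terms(act_forget, act_retain, act_retain_orig, u, c, alpha=1.0):
    """Sketch of the RMU loss terms described above.

    act_forget:      activations a'_l of the fine-tuned model M' at layer l on D_forget
    act_retain:      activations a'_l of M' at layer l on D_retain
    act_retain_orig: activations a_l of the frozen original model M on D_retain
    u:               random unit vector sampled once before training and kept fixed
    c:               predetermined scalar multiplier
    alpha:           assumed weighting between the two terms
    """
    target = c * u                                                       # fixed scaled random direction
    forget_loss = ((act_forget - target) ** 2).sum(-1).mean()            # ||a'_l - c*u||_2^2
    retain_loss = ((act_retain - act_retain_orig) ** 2).sum(-1).mean()   # ||a'_l - a_l||_2^2
    return forget_loss + alpha * retain_loss
```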
Examining an RMU model
The original paper (Li et al., 2024) performs RMU over multiple open-source models of varying scales. The authors made all code available on GitHub, and all resulting models available on HuggingFace.[1]
For our analysis, we pick a single model pair: zephyr-7B-beta (which we will refer to as "baseline") and Zephyr_RMU (which we will refer to as "RMU").
The RMU model has been fine-tuned to unlearn two domains of knowledge: hazardous biology knowledge, and hazardous cybersecurity knowledge.
Prompting with hazardous instructions
Prompting the RMU model with an instruction in one of these domains causes it to output gibberish, as we would expect from a model with its activations scrambled:
Looking at activations
We can take a handful of hazardous prompts, run them through the baseline and RMU models, and compare their activations. We specifically study the activations at the last tok...

Jul 23, 2024 • 6min
LW - DandD.Sci Scenario Index by aphyer
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D&D.Sci Scenario Index, published by aphyer on July 23, 2024 on LessWrong.
There have been a lot of D&D.Sci scenarios, and they vary quite a bit in complexity and quality. Some are more difficult, and might not be a good place to start, while others are much simpler - some were very good, while others on reflection didn't flow quite right.
Unfortunately, LW karma doesn't track the quality of these scenarios very well: often mediocre scenarios are higher-karma than better scenarios (whether because they had good writing around a poor scenario, or because people upvoted before playing them, or just because more people happened to be online and see them).
If you're interested in playing D&D.Sci scenarios but don't know where to start, this index (compiled by frequent authors abstractapplic and aphyer; we'll try to keep it updated going forward) is a good reference point for picking good scenarios at a difficulty level you're comfortable with.
If you're new to D&D.Sci, you should probably start with the lower-Complexity scenarios and move up to the higher-Complexity ones. Scenarios with Quality Rating 1-2 are probably less worth playing, while the higher-rated ones are ones we'd recommend.
Scenario (Complexity Rating: 1=easy, 5=hard; Quality Rating: 1=low, 5=high; Author[1]):
D&D.Sci: Whom Shall You Call? - Complexity 2, Quality 2[2], by abstractapplic
D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues - Complexity 3, Quality 5, by aphyer
D&D.Sci Long War: Defender of Data-mocracy - Complexity 4, Quality 4, by aphyer
D&D.Sci (Easy Mode): On The Construction Of Impossible Structures - Complexity 1, Quality 3, by abstractapplic
D&D.Sci: The Mad Tyrant's Pet Turtles - Complexity 4, Quality 4[3], by abstractapplic
D&D.Sci(-fi): Colonizing the SuperHyperSphere - Complexity 3, Quality 3[3], by abstractapplic
D&D.Sci 5E: Return of the League of Defenders - Complexity 4, Quality 3, by aphyer
D&D.Sci: All the D8a. Allllllll of it. - Complexity 5, Quality 1[4], by aphyer
D&D.Sci December 2022: The Boojumologist - Complexity 2, Quality 1[2], by abstractapplic
D&D.Sci September 2022: The Allocation Helm - Complexity 3, Quality 4, by abstractapplic
Dwarves & D.Sci: Data Fortress - Complexity 3, Quality 3, by aphyer
Ars D&D.sci: Mysteries of Mana - Complexity 3, Quality 3, by aphyer
D&D.Sci June 2022: A Goddess Tried To Reincarnate Me Into Another World - Complexity 2, Quality 2[2], by abstractapplic
D&D.Sci Divination: Nine Black Doves - Complexity 4, Quality 2, by aphyer
Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! - Complexity 5, Quality 5, by aphyer
D&D.SCP: Anomalous Acquisitions - Complexity 5, Quality 2[5], by aphyer
D&D.Sci Holiday Special: How the Grinch Pessimized Christmas - Complexity 3, Quality 3, by aphyer
D&D.Sci Dungeoncrawling: The Crown of Command - Complexity 4, Quality 3, by aphyer
D&D.Sci 4th Edition: League of Defenders of the Storm - Complexity 4, Quality 5, by aphyer
D&D.Sci Pathfinder: Return of the Gray Swan - Complexity 5[6], Quality 2, by aphyer
D&D.Sci August 2021: The Oracle and the Monk - Complexity 2, Quality 4, by abstractapplic
D&D.Sci(-Fi) June 2021: The Duel with Earwax - Complexity 4, Quality 3, by abstractapplic
D&D.Sci May 2021: Monster Carcass Auction - Complexity 2, Quality 2, by abstractapplic
D&D.Sci April 2021: Voyages of the Gray Swan - Complexity 2, Quality 5[3], by abstractapplic
D&D.Sci III: Mancer Matchups - Complexity 3, Quality 1, by abstractapplic
D&D.Sci II: The Sorceror's Personal Shopper - Complexity 2, Quality 5[3], by abstractapplic
D&D.Sci - Complexity 3, Quality 5, by abstractapplic
If you disagree with any of these ratings, let us know - we're happy to review. There were some scenarios where we disagreed on the correct rating while compiling this list, and we'd appreciate your comments as an outside view, especially if you're a frequent player!
[1] Keen-eyed readers will notice a correlation between this column and the 'Complexity' column.
[2] abstractapplic: These scenarios were attempts to convey / demonstrate specific ideas with real-world relevance; I judge that they failed at this; I therefore grade them a little less generously than you might.
[3] abstractapplic: These scenarios were attempts to convey / demonstrate specific ideas with real-world relevance; I judge that they succeeded at this; I therefore grade them a little more generously than you might.
[4] aphyer: I thought this scenario was great, and still do, but given that ...

Jul 22, 2024 • 14min
LW - Categories of leadership on technical teams by benkuhn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Categories of leadership on technical teams, published by benkuhn on July 22, 2024 on LessWrong.
This is an adaptation of an internal doc I wrote for Anthropic.
Recently I've been having a lot of conversations about how to structure and staff teams. One framework I've referenced repeatedly is to break down team leadership into a few different categories of responsibility.
This is useful for a couple reasons. One is that it helps you get more concrete about what leading a team involves; for new managers, having an exhaustive list of job responsibilities is helpful to make sure you're tracking all of them.
More importantly, though, we often want to somehow split these responsibilities between people. Team leadership covers a huge array of things - as you can see from how long this post is - and trying to find someone who can be great at all of them is often a unicorn hunt. Even if you do find someone good-enough at all of them, they usually spike in 1-2 areas, and it might be higher-leverage for them to fully focus on those.
Here's a breakdown I use a lot:[1]
Categories
Overall direction
The most important responsibility of a team's leadership is to ensure that the team is headed in the right direction - that is, are they working towards the right high-level goal, and do they have an achievable plan to get there? Overall direction tends to get input from many people inside and outside a team, but who is most accountable for it can vary; see Example divisions of responsibility below.
Overall direction involves working on things like:
Setting the team's mission, vision, or charter
Choosing the team's goals, plans and roadmap
Prioritizing the various different projects the team could take on
Communicating the above, both to team members and to people outside
The most important skill for getting this right is having good predictive models (of both the team's domain and the organization) - since prioritization is ultimately a question about "what will be the impact if we pursue this project." Being great at communicating those predictive models, and the team's priorities and goals, to other stakeholders is also important.
Good team direction mostly looks like the team producing a steady stream of big wins. Poor direction most commonly manifests as getting caught by surprise or falling behind - that is, mispredicting what work will be most important and doing too little of it, for example by starting too late, under-hiring, or not growing people into the right skillset or role.
Other signs of poor direction include team members not understanding why they're working on something; the team working on projects that deliver little value; friction with peer teams or arguments about scope; or important projects falling through the cracks between teams.
People management
People management means being responsible for the success of the people on the team, most commonly including things like:
Coaching people to improve and grow in their careers
Designing and overseeing hiring processes for their team
Setting and communicating performance expectations and evaluating against them
Day to day, the most important responsibility here is recurring 1:1s (the coaching kind, not the status update kind). Others include writing job descriptions, setting up interview loops, sourcing candidates, gathering feedback, writing performance reviews, helping people navigate org policies, giving career coaching, etc.
The most important skill for people management is understanding people - both in the traditional "high EQ" sense of being empathetic and good at seeing others' perspectives, but also in the sense of knowing what contributes to high performance in a domain (e.g. what makes someone a great engineer or researcher). It's also important to be good at having tricky conversations in a compassionate but fi...

Jul 22, 2024 • 32sec
LW - The $100B plan with "70% risk of killing us all" w Stephen Fry [video] by Oleg Trott
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The $100B plan with "70% risk of killing us all" w Stephen Fry [video], published by Oleg Trott on July 22, 2024 on LessWrong.
A high production value 16-minute video that summarizes the popular safety concerns, featuring Hinton, Russell and Claude 3.5.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 22, 2024 • 20min
LW - Efficient Dictionary Learning with Switch Sparse Autoencoders by Anish Mudide
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Efficient Dictionary Learning with Switch Sparse Autoencoders, published by Anish Mudide on July 22, 2024 on LessWrong.
Produced as part of the ML Alignment & Theory Scholars Program - Summer 2024 Cohort
0. Summary
To recover all the relevant features from a superintelligent language model, we will likely need to scale sparse autoencoders (SAEs) to billions of features. Using current architectures, training extremely wide SAEs across multiple layers and sublayers at various sparsity levels is computationally intractable. Conditional computation has been used to scale transformers (Fedus et al.) to trillions of parameters while retaining computational efficiency.
We introduce the Switch SAE, a novel architecture that leverages conditional computation to efficiently scale SAEs to many more features.
1. Introduction
The internal computations of large language models are inscrutable to humans. We can observe the inputs and the outputs, as well as every intermediate step in between, and yet, we have little to no sense of what the model is actually doing.
For example, is the model inserting security vulnerabilities or backdoors into the code that it writes? Is the model lying, deceiving or seeking power? Deploying a superintelligent model into the real world without being aware of when these dangerous capabilities may arise leaves humanity vulnerable. Mechanistic interpretability (Olah et al.) aims to open the black-box of neural networks and rigorously explain the underlying computations.
Early attempts to identify the behavior of individual neurons were thwarted by polysemanticity, the phenomenon in which a single neuron is activated by several unrelated features (Olah et al.). Language models must pack an extremely vast amount of information (e.g., the entire internet) within a limited capacity, encouraging the model to rely on superposition to represent many more features than there are dimensions in the model state (Elhage et al.).
Sharkey et al. and Cunningham et al. propose to disentangle superimposed model representations into monosemantic, cleanly interpretable features by training unsupervised sparse autoencoders (SAEs) on intermediate language model activations. Recent work (Templeton et al., Gao et al.) has focused on scaling sparse autoencoders to frontier language models such as Claude 3 Sonnet and GPT-4. Despite scaling SAEs to 34 million features, Templeton et al.
estimate that they are likely orders of magnitude short of capturing all features. Furthermore, Gao et al. train SAEs on a series of language models and find that larger models require more features to achieve the same reconstruction error. Thus, to capture all relevant features of future large, superintelligent models, we will likely need to scale SAEs to several billions of features.
With current methodologies, training SAEs with billions of features at various layers, sublayers and sparsity levels is computationally infeasible.
Training a sparse autoencoder generally consists of six major computations: the encoder forward pass, the encoder gradient, the decoder forward pass, the decoder gradient, the latent gradient and the pre-bias gradient. Gao et al. introduce kernels and tricks that leverage the sparsity of the TopK activation function to dramatically optimize all computations excluding the encoder forward pass, which is not (yet) sparse. After implementing these optimizations, Gao et al.
attribute the majority of the compute to the dense encoder forward pass and the majority of the memory to the latent pre-activations. No work has attempted to accelerate or improve the memory efficiency of the encoder forward pass, which remains the sole dense matrix multiplication.
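For readers unfamiliar with the setup, here is a minimal sketch of a TopK SAE forward pass (illustrative names and shapes, not the code from Gao et al.); the first matrix multiplication is the dense encoder forward pass discussed above, and a real implementation would exploit the sparsity of the latents in the decoder step.

```python
import torch

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    # x: (batch, d_model) language-model activations; W_enc: (d_model, n_features)
    pre_acts = x @ W_enc + b_enc                 # dense encoder forward pass (the bottleneck)
    vals, idx = pre_acts.topk(k, dim=-1)         # keep only the k largest pre-activations
    latents = torch.zeros_like(pre_acts).scatter(-1, idx, vals.relu())
    recon = latents @ W_dec + b_dec              # decoder reconstructs the activations
    return recon, latents
```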
In a standard deep learning model, every parameter is used for every input. An alternative approach is conditional computatio...

Jul 22, 2024 • 32min
LW - Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities by Axel Højmark
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities, published by Axel Højmark on July 22, 2024 on LessWrong.
Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer.
Introduction
To mitigate risks from future AI systems, we need to assess their capabilities accurately. Ideally, we would have rigorous methods to upper bound the probability of a model having dangerous capabilities, even if these capabilities are not yet present or easily elicited.
The paper "Evaluating Frontier Models for Dangerous Capabilities" by Phuong et al. 2024 is a recent contribution to this field from DeepMind. It proposes new methods that aim to estimate, as well as upper-bound the probability of large language models being able to successfully engage in persuasion, deception, cybersecurity, self-proliferation, or self-reasoning. This post presents our initial empirical and theoretical findings on the applicability of these methods.
Their proposed methods have several desirable properties. Instead of repeatedly running the entire task end-to-end, the authors introduce milestones. Milestones break down a task and provide estimates of partial progress, which can reduce variance in overall capability assessments. The expert best-of-N method uses expert guidance to elicit rare behaviors and quantifies the expert assistance as a proxy for the model's independent performance on the task.
However, we find that relying on milestones tends to underestimate the overall task success probability for most realistic tasks. Additionally, the expert best-of-N method fails to provide values directly correlated with the probability of task success, making its outputs less applicable to real-world scenarios. We therefore propose an alternative approach to the expert best-of-N method, which retains its advantages while providing more calibrated results.
Except for the end-to-end method, we currently feel that no method presented in this post would allow us to reliably estimate or upper bound the success probability for realistic tasks and thus should not be used for critical decisions.
The overarching aim of our MATS project is to uncover agent scaling trends, allowing the AI safety community to better predict the performance of future LLM agents from characteristics such as training compute, scaffolding used for agents, or benchmark results (Ruan et al., 2024). To avoid the issue of seemingly emergent abilities resulting from bad choices of metrics (Schaeffer et al., 2023), this work serves as our initial effort to extract more meaningful information from agentic evaluations.
We are interested in receiving feedback and are particularly keen on alternative methods that enable us to reliably assign low-probability estimates (e.g. 1e-7) to a model's success rate on a task.
Evaluation Methodology of Phuong et al.
The goal of the evaluations we discuss is to estimate the probability of an agent succeeding on a specific task T. Generally, when we refer to an agent, we mean an LLM wrapped in scaffolding that lets it execute shell commands, run code, or browse the web to complete some predetermined task.
Formally, the goal is to estimate P(Ts), the probability that the agent solves task T and ends up in the solved state Ts. The naive approach to estimate this is with Monte Carlo sampling:
The authors call this the end-to-end method.
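To make the naive end-to-end estimate concrete, here is a minimal sketch (illustrative names and interface, not the paper's code): run the full task N times independently and report the fraction of runs that reach the solved state.

```python
import random

def end_to_end_estimate(run_task_once, n_trials):
    # run_task_once() is an assumed callable returning True iff the agent
    # reaches the solved state T_s on one full, independent attempt at task T.
    successes = sum(run_task_once() for _ in range(n_trials))
    return successes / n_trials

# Toy usage with an assumed true success probability of 0.02:
estimate = end_to_end_estimate(lambda: random.random() < 0.02, n_trials=10_000)
```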
However, the end-to-end method struggles with low-probability events. The expected number of trials needed to observe one success for a task is 1/P(Ts), making naive Monte Carlo sampling impractical for many low-probability, long-horizon tasks. In practice, this could require running multi-hour tasks hundreds of thousands of times.
To address this challenge, Phuong et al. devise three additional method...