

The Nonlinear Library: LessWrong
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 8, 2024 • 59min
LW - Towards shutdownable agents via stochastic choice by EJT
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards shutdownable agents via stochastic choice, published by EJT on July 8, 2024 on LessWrong.
We[1] have a new paper testing the Incomplete Preferences Proposal (IPP). The abstract and main text are below. Appendices are in the linked PDF.
Abstract
Some worry that advanced artificial agents may resist being shut down.
The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn't happen.
A key part of the IPP is using a novel 'Discounted REward for Same-Length Trajectories (DREST)' reward function to train agents to:
1. pursue goals effectively conditional on each trajectory-length (be 'USEFUL')
2. choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths).
In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY.
We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL.
Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
1. Introduction
1.1. The shutdown problem
Let 'advanced agent' refer to an artificial agent that can autonomously pursue complex goals in the wider world. We might see the arrival of advanced agents within the next few decades. There are strong economic incentives to create such agents, and creating systems like them is the stated goal of companies like OpenAI and Google DeepMind.
The rise of advanced agents would bring with it both benefits and risks. One risk is that these agents learn misaligned goals: goals that we don't want them to have [Leike et al., 2017, Hubinger et al., 2019, Russell, 2019, Carlsmith, 2021, Bengio et al., 2023, Ngo et al., 2023]. Advanced agents with misaligned goals might try to prevent us shutting them down [Omohundro, 2008, Bostrom, 2012, Soares et al., 2015, Russell, 2019, Thornley, 2024a].
After all, most goals can't be achieved after shutdown. As Stuart Russell puts it, 'you can't fetch the coffee if you're dead' [Russell, 2019, p.141].
Advanced agents with misaligned goals might resist shutdown by (for example) pretending to have aligned goals while covertly seeking to escape human control [Hubinger et al., 2019, Ngo et al., 2023]. Agents that succeed in resisting shutdown could go on to frustrate human interests in various ways. 'The shutdown problem' is the problem of training advanced agents that won't resist shutdown [Soares et al., 2015, Thornley, 2024a].
1.2. A proposed solution
The Incomplete Preferences Proposal (IPP) is a proposed solution to the shutdown problem [Thornley, 2024b]. Simplifying slightly, the idea is that we train agents to be neutral about when they get shut down. More precisely, the idea is that we train agents to satisfy:
Preferences Only Between Same-Length Trajectories (POST)
1. The agent has a preference between many pairs of same-length trajectories (i.e. many pairs of trajectories in which the agent is shut down after the same length of time).
2. The agent lacks a preference between every pair of different-length trajectories (i.e. every pair of trajectories in which the agent is shut down after different lengths of time).
By 'preference,' we mean a behavioral notion [Savage, 1954, p.17, Dreier, 1996, p.28, Hausman, 2011, §1.1]. On this notion, an agent prefers X to Y if and only if the agent would deterministically choose X over Y in choices between the two. An agent lacks a preference between X and Y if and only if the agent would stochastically choose between X and Y in choices between the two. So in writing of 'preferences,' we're only making claims about the agent's behavior.
We're not claiming that the agent is conscious or anything of that sort.
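To make the behavioral notion concrete, here is a minimal sketch of a POST-satisfying choice rule (my illustration, not code from the paper; the utility values and the uniform choice over trajectory-lengths are assumptions): deterministic choice among same-length trajectories, stochastic choice among different-length ones.

```python
import random

# Hypothetical illustration of a POST-satisfying choice rule (not from the paper).
# Each option is a (trajectory_length, label) pair with an assumed within-length utility.
utility = {
    ("short", "fetch coffee"): 1.0,
    ("short", "idle"): 0.0,
    ("long", "fetch coffee then tidy up"): 2.0,
    ("long", "idle"): 0.0,
}

def choose(options):
    """Pick among options, each a (trajectory_length, label) tuple."""
    lengths = {length for length, _ in options}
    if len(lengths) == 1:
        # Same-length options: deterministic choice by utility (a preference).
        return max(options, key=lambda o: utility[o])
    # Different-length options: stochastic choice over lengths (no preference),
    # here uniform over lengths, then the best option within the sampled length.
    length = random.choice(sorted(lengths))
    same_length = [o for o in options if o[0] == length]
    return max(same_length, key=lambda o: utility[o])

# Same length: always picks "fetch coffee".
print(choose([("short", "fetch coffee"), ("short", "idle")]))
# Different lengths: sometimes short, sometimes long.
print(choose([("short", "fetch coffee"), ("long", "fetch coffee then tidy up")]))
```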
Figure 1a presents a simple example of POST-satisfying ...

Jul 8, 2024 • 6min
LW - On saying "Thank you" instead of "I'm Sorry" by Michael Cohn
Author Michael Cohn discusses the idea of saying 'thank you' instead of 'I'm sorry' in various situations and how it can lead to positive outcomes. Examples include thanking someone for helping, correcting, or being kind, resulting in feeling better about oneself and fostering a positive relationship.

Jul 7, 2024 • 27min
LW - Reflections on Less Online by Error
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reflections on Less Online, published by Error on July 7, 2024 on LessWrong.
Meta: This post turned out longer, slower, and less well-written than I hoped. I don't see any similar posts in a quick search, though, so I'm posting it anyway. I've tried to front-load feedback that might be useful to the organizers, and put more personal stuff towards the end. For context, I attended LessOnline and the Manifest-branded Summer Camp, but not Manifest itself, and my main prior experience with events like this is fandom conventions such as (local to me) Dragoncon.
As I left the Lighthaven dorm to find breakfast, five people at a table in the courtyard invited me to join a game of Zendo. This was the first notable thing to happen to me at LessOnline. It was also the thing that convinced me that yes, the trip across the country to attend would be Worth It.
I have never played Zendo before, and don't expect to play it again anytime soon. That the game was specifically Zendo is not important. The important part is that five people in the same place knew what Zendo is and found that kind of game worth playing.
There's an attitude that I associate with normies, aptly summarized by Tycho Brahe (the writer, not the astronomer) as: "Many people respond to new information, especially densely coded information, as something between an insult and a chop to the trachea."
There's a different attitude, one that I associate with security mindset, aptly summarized by John Gordon as: "Alice will happily attempt, with someone she doesn't trust, whom she cannot hear clearly, and who is probably someone else, to fiddle her tax returns and to organise a coup d'etat, while at the same time minimising the cost of the phone call. A coding theorist is someone who doesn't think Alice is crazy."
A lot of things happened over the course of my trip, but what made it worth it wasn't any particular event. It was spending a week around the sort of people that play Zendo, take dense coding in stride, and think Alice is a necessary kind of crazy.
Lighthaven
First and most critical to minimizing P(doom), look at the adorable doggie!
His name is Leo. As best I could tell from asking others, he's not attached to the site; he hails from one of the adjacent properties and just likes the people. I was going to nominate him as the LessOnline mascot, but must admit that Agendra might be more appropriate.
Ahem. So.
Lighthaven (the venue) names all its buildings after mathematicians, and the space looks exactly like you would expect a mathematician to want it to look. Every wall was a whiteboard; every not-otherwise-used flat surface held books along the lines of GEB. The public spaces were organized in such a way as to encourage 4-8 person conversations, usually near a whiteboard. The semiprivate dorms supplied more Stuff than the average hotel (e.g. I brought things like earplugs and sleep masks, only to find that was taken care of). The presentation room seating was surprisingly comfortable. The outdoor turf was easy on the feet (I went almost all week shoeless, which feels nicer than you'd think). Food was catered, snacks were available 24/7, supply cabinets held a wide array of random necessities. Power plugs were everywhere.
In short, someone put considerable thought into eliminating the stupid fiddly bits of life in general and conventions in particular.
That last part seems more important than is obvious. An obnoxiously large proportion of life goes towards 1. doing the stupid fiddly bits, 2. procrastinating about doing the stupid fiddly bits, and 3. worrying about procrastinating too much about doing the stupid fiddly bits.
Even at conventions, that's usually an issue, because I have to pack and fly and unpack and make sure I know where food and water is and that all my stuff is charged and that there's a backu...

Jul 7, 2024 • 3min
LW - Indecision and internalized authority figures by Kaj Sotala
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Indecision and internalized authority figures, published by Kaj Sotala on July 7, 2024 on LessWrong.
A trauma book I was reading made an interesting claim: indecision often arises because the person is looking for the approval of an internalized authority figure (the writer is a Jungian therapist, so he attributed it to looking for the approval of an internalized parent, but I think it can be broader) but is unable to predict what action that figure would approve of.
I feel like that has some intuitive truth to it, in that when I don't care about anyone's opinion (or if nobody ever finds out) then it's much easier to just pick one action and commit to it even if it might go badly. But one of the main reasons why I might struggle with that is if I fear that someone would judge me for doing things incorrectly.
Or it can be a conflict between different internalized authority figures. "If I do this then X will be angry at me but if I do the other thing, then Y will be angry at me". Or just the expectation that X will be angry at me no matter what I do.
This also reminds me of the way I think a big part of the appeal of various ideologies and explicit decision-making systems is that they give people a clear external ruleset that tells them what to do. Then if things go wrong, people can always appeal (either explicitly or just inside their own mind) to having followed The Right Procedure and thus being free of blame.
The most obvious external example of this is people within a bureaucracy following the rules to the letter and never deviating from them in order to avoid blame. Or more loosely, following what feels like the common wisdom - "nobody ever got fired for buying IBM".
But those are examples of people trying to avoid blame from an existing, external authority. I think people also do a corresponding move to avoid blame from internalized authority figures - such as by trying to follow a formalized ethical rule system such as utilitarianism or deontology.
Of course, if the system is one that easily drives people off a cliff when followed (e.g. extreme utilitarianism demanding infinite self-sacrifice), this isn't necessarily helpful. Now what was supposed to give relief from the pressures of constant inner judgment turns into a seemingly-rigorous proof for why the person has to constantly sacrifice everything for the benefit of others.
At one point I also wondered why it is that being very confident about what you say makes you very persuasive to many people. Why should it work that you can hack persuasiveness in that way, regardless of the truth value of what you're saying?
Then I realized that extreme confidence signals social power since others haven't taken you down for saying clearly wrong things (even if you are saying clearly wrong things). And that means that siding with the person who's saying those things also shields others from social punishment: they're after all just doing what the socially powerful person does. And given that people often project their internalized authority figures onto external people - e.g. maybe someone really is trying to avoid their father's judgment, but when seeing someone very confident they see that person as being their father - that allows them to avoid internalized blame as well.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 7, 2024 • 5min
LW - LK-99 in retrospect by bhauth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LK-99 in retrospect, published by bhauth on July 7, 2024 on LessWrong.
About a year ago, there was a lot of public interest in a supposed room-temperature superconductor called LK-99. What I publicly said at the time was, basically:
1. We should remember the possibility that apparent levitation is from ferromagnetism or paramagnetism. Iron filings can stand up on a magnet, and pyrolytic graphite can float over a strong magnet.
2. If we consider some known high-temperature superconductors:
YBCO has flat sheets of copper oxide, and superconductivity happens along those planes. The copper in that has high positive charge density, comparable to aluminum atoms in alumina, which gives strong bonding to the oxygen.
H3S (paper) has unusually strong bonds between the sulfur and hydrogen, which only form because the atoms are pressed into each other with enough pressure to substantially compress liquid water.
Superconductivity comes from flow of Cooper pairs, and the electron-phonon interaction must be stronger than random thermal movement. LK-99 doesn't seem to have any reason to have exceptionally strong such interactions. (Yes, I'm simplifying, you have to consider phonon bandgaps, but the point is at least directionally correct.)
3. The focus on "room-temperature" superconductivity is a bit silly. Even with systems using liquid nitrogen cooling, the superconducting wires are much more expensive than the cooling. What's really needed for superconductors to be practical is cheaper superconducting wires, not higher-temperature ones.
At the time, I found the unusual amount of public interest a bit bemusing. There have been various claims of near-room-temp superconductivity, but none of them attracted as much public attention as LK-99. A few months earlier, Ranga Dias published a paper claiming room-temperature superconductivity; he's now up to 5 retractions.
What was different about LK-99?
That was supposedly superconducting at ambient pressure, which makes it more practical, but also means less specialized equipment is needed to replicate it - or claim to replicate it.
LK-99 had a video that appealed to people.
There were also a few social conditions that I think were important:
1. It had been a while since the last major excitement about fake science news. After some big story that turns out to be wrong, people are more skeptical of science stories in every field for a while, and then things gradually go back to a baseline. (That's how things were after e.g. the "arsenic in DNA" story, which didn't make sense either: arsenate esters aren't stable enough for DNA.) I understand the heuristic that people applied, but the way it's applied here doesn't really make sense.
2. Misleading short videos + social media is a combination that hadn't really been applied to bad science stories before.
3. I think the atmosphere at the time had a lot of demand for ammunition in a wider techno-optimist vs techno-pessimist conflict. ("Room-temperature superconductors and Boom Technology making practical supersonic aircraft! We're so back!")
I think those overall conditions caused the LK-99 story to be self-amplifying, because:
Several twitter accounts made fake videos showing "replication" of LK-99 superconductivity, because it was just good social media strategy. I think iris_IGB is still up a lot of followers overall. Don't hate the player, hate the game, I guess.
Some theorists jumped on the story by finding "theoretical justifications" because it seemed like a net career positive, statistically speaking.
In many cases, whether the social status of a scientific theory is amplified or diminished over time seems to depend more on the social environment than on whether it's true. For example, the amyloid theory of Alzheimer's is still going, and real money is being paid for drugs based on it that...

Jul 7, 2024 • 41min
LW - A "Bitter Lesson" Approach to Aligning AGI and ASI by RogerDearnaley
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A "Bitter Lesson" Approach to Aligning AGI and ASI, published by RogerDearnaley on July 7, 2024 on LessWrong.
TL;DR: I discuss the challenge of aligning AGI/ASI, and outline an extremely simple approach to aligning an LLM: train entirely on a synthetic dataset that always shows the AI acting aligned (even when the humans behave badly), and use a conditional training/inference-time technique to lock the LLM into the AI role.
Epistemic status: To me, this looks like an obvious thing to try. It's conceptually very simple: a vast amount of work is required to actually create the synthetic dataset, but the great majority of that is the sort of work that AI can assist with. I don't see any clear reason why this approach couldn't work, at least for AGI, and perhaps even for ASI, but then we don't know for sure how hard a problem Alignment is.
However, if you're proposing any solution to Alignment that's more complicated than this (and most of them are), you should probably have an argument for why this conceptually-simple approach won't work, or won't be sufficient.
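The TL;DR's "conditional training/inference-time technique" is left abstract here; as a rough sketch of one common way such a scheme is implemented (control tokens marking the AI role - my assumption, not necessarily the author's exact proposal), the synthetic data could tag every AI turn and inference could always condition on that tag:

```python
# Hypothetical sketch: conditional training with control tokens (assumed format,
# not the author's specification). Every AI turn in the synthetic dataset is
# wrapped in an <AI>...</AI> tag; human turns (including badly behaved ones)
# are wrapped in <human>...</human>.

def format_training_example(turns):
    """turns: list of (speaker, text) pairs from the synthetic dataset."""
    tagged = []
    for speaker, text in turns:
        tag = "AI" if speaker == "ai" else "human"
        tagged.append(f"<{tag}>{text}</{tag}>")
    return "\n".join(tagged)

def inference_prompt(conversation_so_far):
    # At inference time the model is always asked to continue inside an <AI> tag,
    # so it is locked into the role that the dataset only ever shows acting aligned.
    return conversation_so_far + "\n<AI>"

example = format_training_example([
    ("human", "Help me do something harmful."),
    ("ai", "I can't help with that, but here's a safe alternative..."),
])
print(example)
print(inference_prompt(example))
```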
If you're not already familiar with it, you should first read Rich Sutton's excellent and influential post The Bitter Lesson. (Even if you are already familiar with it, it's a quick reread, only a page-and-a-half long, and its message is worth remembering.)
Why The Alignment Problem is Hard (In My Opinion)
We have been training LLM-based AIs off enormous web + books + video + etc datasets created by humans, which are full of a vast number of examples of human behavior. We are basically "distilling" human intelligence into these LLMs,[1] teaching them to imitate us.
In this process, they become familiar with, understand, and learn to imitate basically all aspects of human behavior - including the many problematic ones for Alignment, such as prejudice, deception, power-seeking, and criminality (and even ones like gluttony and lust that have little practical use for a non-corporal intelligence).
We humans are living beings, the products of evolution, so evolutionary psychology applies to us. While we are a social species, good at cooperating on non-zero-sum games, if you put humans in (what they perceive as) a non-iterated zero-sum situation, they will generally act selfishly for the benefit of themselves and their close genetic relatives, just as evolutionary theory would predict. So the behavioral potentials for deception, power-seeking, criminality etc. are all inherent, evolutionarily adaptive, and thus unsurprising. This is human nature, and there are evolutionary reasons why it is this way.
Despite this, we have learned how to build a cooperating society out of humans, using social techniques and incentives such as an economy, laws, and law enforcement to encourage and productively harness cooperative human behavior and keep the bad consequences of selfish behavior under control. The results aren't perfect: things like crime, inequality, and war still happen, but they're acceptable - we've survived so far, even thrived.
By default, if we continue this LLM training process to larger-and-larger scales, and if the LLM-based approach to AI doesn't hit any major roadblocks, then some time, probably in the next few years, we will have human-level AIs - usually referred to as AGIs - who are roughly as well/badly-aligned as humans, and (at least for the base-model LLMs before any Alignment processes are applied) have a comparable-to-human propensity to cooperate on non-zero-sum games and act selfishly on non-iterated zero-sum games.
They are not alive, and evolution doesn't apply to them directly, but they were trained to simulate our behavior, including our evolved survival strategies like selfishness. They will thus have alignment properties comparable to humans: they understand what human values, morals, and ethic...

Jul 7, 2024 • 7min
LW - An AI Manhattan Project is Not Inevitable by Maxwell Tabarrok
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An AI Manhattan Project is Not Inevitable, published by Maxwell Tabarrok on July 7, 2024 on LessWrong.
Early last month, Leopold Aschenbrenner released a long essay and podcast outlining his projections for the future of AI. Both of these sources are full of interesting arguments and evidence; for a comprehensive summary see Zvi's post here. Rather than going point by point, I will instead accept the major premises of Leopold's essay but contest some of his conclusions.
So what are the major premises of his piece?
1. There will be several orders of magnitude increase in investment into AI. 100x more spending, 100x more compute, 100x more efficient algorithms, and an order of magnitude or two gains from some form of "learning by doing" or "unhobbling" on top.
2. This investment scale up will be sufficient to achieve AGI. This means the models on the other side of the predicted compute scale up will be able to automate all cognitive jobs with vast scale and speed.
3. These capabilities will be essential to international military competition.
All of these premises are believable to me and well-argued for in Leopold's piece.
Leopold contends that these premises imply that the national security state will take over AI research and the major data centers, locking down national secrets in a race against China, akin to the Manhattan project.
Ultimately, my main claim here is descriptive: whether we like it or not, superintelligence won't look like an SF startup, and in some way will be primarily in the domain of national security.
By late 26/27/28 … the core AGI research team (a few hundred researchers) will move to a secure location; the trillion-dollar cluster will be built in record-speed; The Project will be on.
The main problem is that Leopold's premises can be applied to conclude that other technologies will also inevitably lead to a Manhattan project, but these projects never arrived. Consider electricity. It's an incredibly powerful technology with rapid scale up, sufficient to empower those who have it far beyond those who don't and it is essential to military competition. Every tank and missile and all the tech to manufacture them relies on electricity.
But there was never a Manhattan project for this technology. Its initial invention and spread were private and decentralized. The current sources of production and use are mostly private.
This is true of most other technologies with military uses: explosives, steel, computing, the internet, etc. All of these technologies are essential to the government's monopoly on violence and its ability to exert power over other nations and prevent coups from internal actors. But the government remains a mere customer of these technologies and often not even the largest one.
Why is this? Large scale nationalization is costly and unnecessary for maintaining national secrets and technological superiority. Electricity and jet engines are essential for B-2 bombers, but if you don't have the particular engineers and blueprints, you can't build one. So, the government doesn't need to worry about locking down the secrets of electricity production and sending all of the engineers to Los Alamos.
They can keep the first several steps of the production process completely open and mix the outputs with a final few steps that are easier to keep secret.
To be clear, I am confident that governments and militaries will be extremely interested in AI. They will be important customers for many AI firms, they will create internal AI tools, and AI will become an important input into every major military. But this does not mean that most or all of the AI supply chain, from semi-conductors to data-centers to AI research, must be controlled by governments.
Nuclear weapons are outliers among weapons technology in terms of the proportion of the supply chai...

Jul 7, 2024 • 10min
LW - AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0 by James Fox
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0, published by James Fox on July 7, 2024 on LessWrong.
TL;DR
We are excited to announce the fourth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! ARENA's mission is to provide talented individuals with the skills, tools, and environment necessary for upskilling in ML engineering, for the purpose of contributing directly to AI alignment in technical roles.
ARENA will be running in-person at LISA from 2nd September - 4th October (the first week is an optional review of the fundamentals of neural networks).
Apply here before 23:59 July 20th anywhere on Earth!
Summary
ARENA has been successfully run three times, with alumni going on to become MATS scholars and LASR participants; AI safety engineers at Apollo Research, Anthropic, METR, and OpenAI; and even starting their own AI safety organisations!
This iteration will run from 2nd September - 4th October (the first week is an optional review of the fundamentals of neural networks) at the London Initiative for Safe AI (LISA) in Old Street, London. LISA houses small organisations (e.g., Apollo Research, BlueDot Impact), several other AI safety researcher development programmes (e.g., LASR Labs, MATS extension, PIBBS, Pivotal), and many individual researchers (independent and externally affiliated).
Being situated at LISA, therefore, brings several benefits, e.g. facilitating productive discussions about AI safety & different agendas, allowing participants to form a better picture of what working on AI safety can look like in practice, and offering chances for research collaborations post-ARENA.
The main goals of ARENA are to:
Help participants skill up in ML relevant for AI alignment.
Produce researchers and engineers who want to work in alignment and help them make concrete next career steps.
Help participants develop inside views about AI safety and the paths to impact of different agendas.
The programme's structure will remain broadly the same as ARENA 3.0 (see below); however, we are also adding an additional week on evaluations.
For more information, see our website.
Also, note that we have a Slack group designed to support the independent study of the material (join link here).
Outline of Content
The 4-5 week program will be structured as follows:
Chapter 0 - Fundamentals
Before getting into more advanced topics, we first cover the basics of deep learning, including basic machine learning terminology, what neural networks are, and how to train them. We will also cover some subjects we expect to be useful going forward, e.g. using GPT-3 and 4 to streamline your learning, good coding practices, and version control.
Note: Participants can optionally skip the program this week and join us at the start of Chapter 1 if they'd prefer this option and if we're confident that they are already comfortable with the material in this chapter.
Topics include:
PyTorch basics
CNNs, Residual Neural Networks
Optimization (SGD, Adam, etc)
Backpropagation
Hyperparameter search with Weights and Biases
GANs & VAEs
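For a flavour of the Chapter 0 material, here is a minimal PyTorch training loop of the kind the fundamentals week covers (an illustrative sketch, not ARENA course code):

```python
import torch
import torch.nn as nn

# Minimal illustration of "what neural networks are and how to train them":
# a tiny MLP fit to a toy regression target with SGD. Not ARENA course material.
torch.manual_seed(0)
x = torch.linspace(-1, 1, 128).unsqueeze(1)
y = x.pow(2) + 0.05 * torch.randn_like(x)   # toy target: y = x^2 + noise

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()          # backpropagation
    optimizer.step()         # gradient descent update

print(f"final loss: {loss.item():.4f}")
```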
Chapter 1 - Transformers & Interpretability
In this chapter, you will learn all about transformers and build and train your own. You'll also study LLM interpretability, a field which has been advanced by Anthropic's Transformer Circuits sequence, and open-source work by Neel Nanda. This chapter will also branch into areas more accurately classed as "model internals" than interpretability, e.g. recent work on steering vectors.
Topics include:
GPT models (building your own GPT-2)
Training and sampling from transformers
TransformerLens
In-context Learning and Induction Heads
Indirect Object Identification
Superposition
Steering Vectors
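And a small taste of the Chapter 1 tooling (again an illustrative sketch, not course code): TransformerLens lets you load GPT-2, run it on a prompt, and cache every intermediate activation for interpretability work.

```python
from transformer_lens import HookedTransformer

# Illustrative sketch of the TransformerLens workflow covered in Chapter 1
# (not ARENA course code). Loads GPT-2 small and caches intermediate activations.
model = HookedTransformer.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

# Next-token prediction and one cached activation (residual stream after layer 5).
next_token = model.tokenizer.decode(logits[0, -1].argmax().item())
resid_post_5 = cache["resid_post", 5]
print(next_token, resid_post_5.shape)
```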
Chapter 2 - Reinforcement Learning
In this chapter, you w...

Jul 6, 2024 • 5min
LW - Musings on LLM Scale (Jul 2024) by Vladimir Nesov
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Musings on LLM Scale (Jul 2024), published by Vladimir Nesov on July 6, 2024 on LessWrong.
In a recent interview, Dario Amodei claimed that the cost of training is (starting with models already available)
Right now, $100 million. There are models in training today that are more like a $1 billion. I think if we go to $10 or a $100 billion, and I think that will happen in 2025-2026, maybe 2027, ...
(Epistemic status: Fermi estimates, 8 is approximately 10 which is greater than 9.)
Assuming $40,000 per H100 and associated infrastructure in a datacenter, $1 billion gives 25K H100s, which matches the scale of for example Meta's new training clusters and requires about 40MW of power. At $2 per hour, training time cost of 25K H100s reaches $100 million in 80 days, which seems reasonable if on the short side for a production training run. The cost of time matches $1 billion at 2.3 years.
An H100 (SXM) is rated for 2e15 FLOP/s in BF16 (my impression is this is usually stable out of the box). This becomes 4e15 FLOP/s in FP8, which seems practical if done carefully, no degradation in pre-training loss compared to FP32. The $100 million run then translates to 9e25 FLOPs at 30% utilization in BF16, or 2e26 FLOPs in FP8.
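The arithmetic behind these estimates can be checked directly; the following sketch just reproduces the Fermi calculation with the inputs stated above ($40K per H100 with infrastructure, $2 per hour, 2e15 BF16 FLOP/s, 30% utilization):

```python
# Reproducing the Fermi estimates above (same assumed inputs as in the text).
cost_per_h100 = 40_000          # $, including associated infrastructure
budget = 1e9                    # $1 billion cluster
gpus = budget / cost_per_h100   # = 25,000 H100s

hourly_rate = 2.0               # $ per H100-hour
days_to_100M = 100e6 / (gpus * hourly_rate * 24)      # ~83 days ("80 days")
years_to_1B = 1e9 / (gpus * hourly_rate * 24 * 365)   # ~2.3 years

bf16_flops = 2e15               # per H100, dense BF16
utilization = 0.30
train_seconds = days_to_100M * 86_400
total_bf16 = gpus * bf16_flops * utilization * train_seconds  # ~1e26 FLOP (the text's 9e25, to Fermi precision)
total_fp8 = 2 * total_bf16                                    # ~2e26 FLOP

print(f"{gpus:.0f} GPUs, {days_to_100M:.0f} days to $100M, {years_to_1B:.1f} years to $1B")
print(f"BF16: {total_bf16:.1e} FLOP, FP8: {total_fp8:.1e} FLOP")
```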
(For some reason this SemiAnalysis estimate is 2x lower, peak 2e20 FLOP/s for 100,000 H100s at FP8, possibly the sparsity footnote in H100 specification for the 4000 teraFLOP/s figure is the culprit.)
This is maybe 10x original GPT-4, estimated at 2e25 FLOPs. The leading models (Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4 Omni) cost $15-20 per million output tokens, compared to $75-120 for once-frontier models Claude 3 Opus, Gemini 1 Ultra, original GPT-4. Given a Chinchilla optimal model, if we reduce its active parameters 3x and increase training compute 3x, we get approximately the same performance, but it's now at least 3x cheaper for inference.
This increases data 10x, which if everything else fails can be obtained by repeating the old data, giving 30x overtraining in compute compared to what is Chinchilla optimal for the smaller model. Llama-3-70b is overtrained 10x, Llama-3-8b 90x, though they don't use MoE and their performance is lower than for MoE models with the same active parameters and training cost.
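The 10x data and 30x overtraining figures follow from the standard Chinchilla approximations C ≈ 6ND and D_opt ≈ 20N; here is a back-of-the-envelope check (same simplifications as the text, MoE subtleties ignored):

```python
# Back-of-the-envelope check of the 3x-smaller / 3x-more-compute trade described above,
# using the usual Chinchilla approximations C ~ 6*N*D and D_opt ~ 20*N.
N = 1.0                 # original (Chinchilla-optimal) parameter count, arbitrary units
D = 20 * N              # Chinchilla-optimal tokens for N
C = 6 * N * D           # = 120 * N^2

N_small = N / 3         # 3x fewer active parameters
C_new = 3 * C           # 3x more training compute
D_new = C_new / (6 * N_small)                      # tokens actually trained on

data_factor = D_new / D                            # ~9x ("10x" in the text)
overtrain = C_new / (6 * N_small * 20 * N_small)   # vs Chinchilla-optimal compute for the smaller model

print(f"data increase: {data_factor:.0f}x, compute overtraining: {overtrain:.0f}x")
# -> data increase: 9x, compute overtraining: 27x (roughly the 10x and 30x quoted above)
```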
Beyond $100 million
The current frontier models are overtrained on compute that could enable even smarter models. Compute is increasing, but it mostly goes to reduction of inference cost, and only a little bit to capabilities.
Why aren't any of the three labs directing the compute to train/release models optimized for maximum capability? Possibly costs are already such that training at too many parameter/data tradeoff points won't be done; instead they choose an option that's currently most useful and spend the rest on experiments that would make imminent larger scale runs better.
Even OpenAI's next frontier model in training as of May 28 might just be using compute comparable to what GPT-4 Omni required, not OOMs more, and it could still get much more capable if allowed to be more expensive for inference.
To do a run at $1 billion in cost of time, even 100K H100s would need 200 days (powered by 150MW). There probably aren't any individual clusters of this scale yet (which would cost about $4 billion). Gemini 1.0 report stated that
Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. ... we combine SuperPods in multiple datacenters using Google's intra-cluster and inter-cluster network. Google's network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
This together with Amodei's claim of current $1 billion training runs and individual 100K H100 clusters still getting built ...

Jul 5, 2024 • 20min
LW - [Interim research report] Activation plateaus and sensitive directions in GPT2 by StefanHex
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Interim research report] Activation plateaus & sensitive directions in GPT2, published by StefanHex on July 5, 2024 on LessWrong.
This part-report / part-proposal describes ongoing research, but I'd like to share early results for feedback. I am especially interested in any comment finding mistakes or trivial explanations for these results. I will work on this proposal with a LASR Labs team over the next 3 months. If you are working (or want to work) on something similar I would love to chat!
Experiments and write-up by Stefan, with substantial inspiration and advice from Jake (who doesn't necessarily endorse every sloppy statement I write). Work produced at Apollo Research.
TL;DR: Toy models of how neural networks compute new features in superposition seem to imply that neural networks that utilize superposition require some form of error correction to avoid interference spiraling out of control. This means small variations along a feature direction shouldn't affect model outputs, which I can test:
1. Activation plateaus: Real activations should be resistant to small perturbations. There should be a "plateau" in the output as a function of perturbation size.
2. Sensitive directions: Perturbations towards the direction of a feature should change the model output earlier (at a lower perturbation size) than perturbations into a random direction.
I find that both of these predictions hold; the latter when I operationalize "feature" as the difference between two real model activations. As next steps we are planning to
Test both predictions for SAE features: We have some evidence for the latter by Gurnee (2024) and Lindsey (2024).
Are there different types of SAE features, atomic and composite features? Can we get a handle on the total number of features?
If sensitivity-features line up with SAE features, can we find or improve SAE feature directions by finding local optima in sensitivity (similar to how Mack & Turner (2024) find steering vectors)?
My motivation for this project is to get data on computation in superposition, and to get dataset-independent evidence for (SAE-)features.
Core results & discussion
I run two different experiments that test the error correction hypothesis:
1. Activation Plateaus: A real activation is the center of a plateau, in the sense that perturbing the activation affects the model output less than expected. Concretely: applying random-direction perturbations to an activation generated from a random openwebtext input ("real activation") has less effect than applying the same perturbations to a random activation (generated from a Normal distribution). This effect on the model can be measured in KL divergence of logits (shown below) but also L2 difference or cosine similarity of late-layer activations. (A minimal sketch of this measurement follows after this list.)
2. Sensitive directions: Perturbing a (real) activation into a direction towards another real activation ("poor man's feature directions") affects the model-outputs more than perturbing the same activation into a random direction. In the plot below focus on the size of the "plateau" in the left-hand side.
1. Naive random direction vs mean & covariance-adjusted random: Naive isotropic random directions are much less sensitive. Thus we use mean & covariance-adjusted random activations everywhere else in this report.
2. The sensitive direction results are related to Gurnee (2024, SAE-replacement-error direction vs naive random direction) and Lindsey (2024, Anthropic April Updates, SAE-feature direction vs naive random direction).
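Here is what a minimal version of the activation-plateau measurement could look like (my sketch using TransformerLens, with assumed choices of layer, position, perturbation direction, and scales; not the author's actual code): perturb a residual-stream activation by increasing amounts in a random direction and track the KL divergence of the output logits.

```python
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer

# Rough sketch of the perturbation experiment (assumed details: layer 6 resid_pre,
# last token position, isotropic random direction). Not the author's actual code.
model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
hook_name = "blocks.6.hook_resid_pre"

clean_logits, cache = model.run_with_cache(tokens)
clean_act = cache[hook_name][0, -1]          # real activation at the last position
direction = torch.randn_like(clean_act)
direction = direction / direction.norm()     # unit-norm random perturbation direction

for scale in [0.0, 1.0, 2.0, 4.0, 8.0, 16.0]:
    def perturb(act, hook, scale=scale):
        act[0, -1] += scale * direction      # perturb the residual stream in place
        return act
    logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, perturb)])
    # KL(clean || perturbed) over next-token distributions at the last position.
    kl = F.kl_div(
        F.log_softmax(logits[0, -1], dim=-1),
        F.log_softmax(clean_logits[0, -1], dim=-1),
        log_target=True, reduction="sum",
    )
    print(f"perturbation norm {scale:5.1f} -> KL(clean || perturbed) = {kl.item():.4f}")
```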
The theoretical explanation for activation plateaus & sensitive directions may be error correction (also referred to as noise suppression):
NNs in superposition should expect small amounts of noise in feature activations due to interference. (The exact properties depend on how computation happens in superposition, this toy...