

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes
Mentioned books

Apr 30, 2024 • 6min
LW - Why I'm doing PauseAI by Joseph Miller
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I'm doing PauseAI, published by Joseph Miller on April 30, 2024 on LessWrong.
GPT-5 training is probably starting around now. It seems very unlikely that GPT-5 will cause the end of the world. But it's hard to be sure. I would guess that GPT-5 is more likely to kill me than an asteroid, a supervolcano, a plane crash or a brain tumor. We can predict fairly well what the cross-entropy loss will be, but pretty much nothing else.
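As a rough illustration of what "predicting the loss" means here (this sketch is mine, not the author's; the coefficients are the approximate published Chinchilla fit from Hoffmann et al. 2022, and the parameter/token counts below are purely hypothetical, not anything known about GPT-5):

```python
# Hypothetical sketch: Chinchilla-style scaling law L(N, D) = E + A/N^alpha + B/D^beta.
# Coefficients are the approximate published fit; N and D are made-up example numbers.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with n_params
    parameters trained on n_tokens tokens, under the fitted power law."""
    return E + A / n_params**alpha + B / n_tokens**beta

print(predicted_loss(7e10, 1.4e12))   # roughly Chinchilla scale
print(predicted_loss(1e12, 2e13))     # a hypothetical much larger run
```

The point of the contrast in the post is that a one-line formula like this predicts the training curve reasonably well, while telling us almost nothing about what capabilities fall out of it.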
Maybe we will suddenly discover that the difference between GPT-4 and superhuman level is actually quite small. Maybe GPT-5 will be extremely good at interpretability, such that it can recursively self improve by rewriting its own weights.
Hopefully model evaluations can catch catastrophic risks before wide deployment, but again, it's hard to be sure. GPT-5 could plausibly be devious enough to circumvent all of our black-box testing. Or it may be that it's too late as soon as the model has been trained. These are small but real possibilities, and it's a significant milestone of failure that we are now taking these kinds of gambles.
How do we do better for GPT-6?
Governance efforts are mostly focussed on relatively modest goals. Few people are directly aiming at the question: how do we stop GPT-6 from being created at all? It's difficult to imagine a world where governments actually prevent Microsoft from building a $100 billion AI training data center by 2028.
In fact, OpenAI apparently fears governance so little that they just went and told the UK government that they won't give it access to GPT-5 for pre-deployment testing. And the number of safety-focussed researchers employed by OpenAI is dropping rapidly.
Hopefully there will be more robust technical solutions for alignment available by the time GPT-6 training begins. But few alignment researchers actually expect this, so we need a backup plan.
Plan B: Mass protests against AI
In many ways AI is an easy thing to protest against. Climate protesters are asking to completely reform the energy system, even if it decimates the economy. Israel / Palestine protesters are trying to sway foreign policies on an issue where everyone already holds deeply entrenched views. Social justice protesters want to change people's attitudes and upend the social system.
AI protesters are just asking to ban a technology that doesn't exist yet. About 0% of the population deeply cares that future AI systems are built. Most people support pausing AI development. It doesn't feel like we're asking normal people to sacrifice anything. They may in fact be paying a large opportunity cost on the potential benefits of AI, but that's not something many people will get worked up about.
Policy-makers, CEOs, and the other key decision-makers whom governance solutions have to persuade are among the only groups that are highly motivated to let AI development continue.
No innovation required
Protests are the most unoriginal way to prevent an AI catastrophe - we don't have to do anything new. Previous successful protesters have made detailed instructions for how to build a protest movement.
This is the biggest advantage of protests compared to other solutions - it requires no new ideas (unlike technical alignment) and no one's permission (unlike governance solutions). A sufficiently large number of people taking to the streets forces politicians to act. A sufficiently large and well organized special interest group can control an issue:
I walked into my office while this was going on and found a sugar lobbyist hanging around, trying to stay close to the action. I felt like being a smart-ass so I made some wise-crack about the sugar industry raping the taxpayers. Without another word, I walked into my private office and shut the door. I had no real plan to go after the sugar people. I was just screwing with the guy.
My phone did no...

Apr 30, 2024 • 41min
AF - Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Transcoders enable fine-grained interpretable circuit analysis for language models, published by Jacob Dunefsky on April 30, 2024 on The AI Alignment Forum.
Summary
We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provides an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers and not how they were computed.
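For readers unfamiliar with the distinction, here is a minimal PyTorch-style sketch (my own illustration, not the authors' code; the dimensions and names are illustrative) of how an SAE and a transcoder differ in what they are trained to reconstruct:

```python
import torch
import torch.nn as nn

d_model, d_hidden = 768, 24576  # illustrative sizes for a GPT-2-small-like model

class SparseAutoencoder(nn.Module):
    """SAE: trained to reconstruct a single activation vector (e.g. the MLP output)
    from a sparse code of that same vector."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, mlp_out):
        feats = torch.relu(self.enc(mlp_out))   # sparse feature activations
        return self.dec(feats), feats           # reconstruction of mlp_out

class Transcoder(nn.Module):
    """Transcoder: same architecture, but trained to map the MLP *input* to the
    MLP *output*, giving a sparse, interpretable approximation of the MLP itself."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, mlp_in):
        feats = torch.relu(self.enc(mlp_in))    # sparse feature activations
        return self.dec(feats), feats           # approximation of MLP(mlp_in)

# Both are typically trained with an MSE reconstruction loss plus an L1 sparsity
# penalty on `feats`; only the (input, target) pair differs.
```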
We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered). This significantly simplifies circuit analysis, and so for the first time, we present a method for using transcoders in circuit analysis in this way.
We performed a set of case studies on GPT2-small that demonstrate that transcoders can be used to decompose circuits into monosemantic, interpretable units of computation.
We provide code for training/running/evaluating transcoders and performing circuit analysis with transcoders, and code for the aforementioned case studies carried out using these tools. We also provide a suite of 12 trained transcoders, one for each layer of GPT2-small. All of the code can be found at
https://github.com/jacobdunefsky/transcoder_circuits, and the transcoders can be found at
https://huggingface.co/pchlenski/gpt2-transcoders.
Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2024) stream and MATS 5.1 extension. Jacob Dunefsky is currently receiving funding from the Long-Term Future Fund for this work.
Background and motivation
Mechanistic interpretability is fundamentally concerned with reverse-engineering models' computations into human-understandable parts. Much early mechanistic interpretability work (e.g.
indirect object identification) has dealt with decomposing model computations into circuits involving small numbers of model components like attention heads or MLP sublayers.
But these component-level circuits operate at too coarse a granularity: due to the relatively small number of components in a model, each individual component will inevitably be important to all sorts of computations, oftentimes playing different roles. In other words, components are polysemantic.
Therefore, if we want a more faithful and more detailed understanding of the model, we should aim to find fine-grained circuits that decompose the model's computation down to the level of individual feature vectors.
As a hypothetical example of the utility that feature-level circuits might provide in the very near-term: if we have a feature vector that seems to induce gender bias in the model, then understanding which circuits this feature vector partakes in (including which earlier-layer features cause it to activate and which later-layer features it activates) would better allow us to understand the side-effects of debiasing methods.
More ambitiously, we hope that similar reasoning might apply to a feature that would seem to mediate deception in a future unaligned AI: a fuller understanding of feature-level circuits could help us understand whether this deception feature actually is responsible for the entirety of deception in a model, or help us understand the extent to which alignment methods remove the harmful behavior.
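As a toy illustration of what feature-level circuit analysis with transcoders could look like (a sketch under my own simplifying assumptions, not the authors' exact method; it ignores layernorm, attention paths, and biases), the linear encoder/decoder structure lets you attribute a later feature's activation to earlier features:

```python
import torch

# Hypothetical tensors: an earlier-layer transcoder's decoder, a later-layer
# transcoder's encoder, and the earlier features' activations on one token.
d_model, d_early, d_late = 768, 24576, 24576
W_dec_early = torch.randn(d_early, d_model)    # rows: earlier feature output directions
W_enc_late = torch.randn(d_late, d_model)      # rows: later feature input directions
acts_early = torch.relu(torch.randn(d_early))  # earlier feature activations (sparse in practice)

late_feature = 123  # some later-layer feature of interest

# Input-independent "connection strength": how much each earlier feature's output
# direction aligns with the later feature's encoder direction.
connection = W_dec_early @ W_enc_late[late_feature]   # shape: (d_early,)

# Input-dependent attribution on this token: activation times connection strength.
attribution = acts_early * connection

top = torch.topk(attribution, k=5)
print("top contributing earlier features:", top.indices.tolist())
```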
Some of the earliest work on SAEs aimed to use them to find such feature-level circuits (e.g.
Cunn...

Apr 30, 2024 • 2min
LW - Introducing AI Lab Watch by Zach Stein-Perlman
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Introducing AI Lab Watch, published by Zach Stein-Perlman on April 30, 2024 on LessWrong.
I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.
It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.
(It's much better on desktop than mobile - don't read it on mobile.)
It's in beta - leave feedback here or comment or DM me - but I basically endorse the content and you're welcome to share and discuss it publicly.
It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.
Some clarifications and disclaimers.
How you can help:
Give feedback on how this project is helpful or how it could be different to be much more helpful
Tell me what's wrong/missing; point me to sources on what labs should do or what they are doing
Suggest better evaluation criteria
Share this
Help me find an institutional home for the project
Offer expertise on a relevant topic
Offer to collaborate
(Pitch me on new projects or offer me a job)
(Want to help and aren't sure how to? Get in touch!)
I think this project is the best existing resource for several kinds of questions, but I think it could be a lot better. I'm hoping to receive advice (and ideally collaboration) on taking it in a more specific direction. Also interested in finding an institutional home. Regardless, I plan to keep it up to date. Again, I'm interested in help but not sure what help I need.
I could expand the project (more categories, more criteria per category, more labs); I currently expect that it's more important to improve presentation stuff but I don't know how to do that; feedback will determine what I prioritize. It will also determine whether I continue spending most of my time on this or mostly drop it.
I just made a twitter account. I might use it to comment on stuff labs do.
Thanks to many friends for advice and encouragement. Thanks to Michael Keenan for doing most of the webdev. These people don't necessarily endorse this project.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Apr 30, 2024 • 22min
LW - Towards a formalization of the agent structure problem by Alex Altair
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards a formalization of the agent structure problem, published by Alex Altair on April 30, 2024 on LessWrong.
In Clarifying the Agent-Like Structure Problem (2022), John Wentworth describes a hypothetical instance of what he calls a selection theorem. In Scott Garrabrant's words, the question is, does agent-like behavior imply agent-like architecture? That is, if we take some class of behaving things and apply a filter for agent-like behavior, do we end up selecting things with agent-like architecture (or structure)? Of course, this question is heavily under-specified.
So another way to ask this is, under which conditions does agent-like behavior imply agent-like structure? And, do those conditions feel like they formally encapsulate a naturally occurring condition?
For the Q1 2024 cohort of AI Safety Camp, I was a Research Lead for a team of six people, where we worked a few hours a week to better understand and make progress on this idea. The teammates[1] were Einar Urdshals, Tyler Tracy, Jasmina Nasufi, Mateusz Bagiński, Amaury Lorin, and Alfred Harwood.
The AISC project duration was too short to find and prove a theorem version of the problem. Instead, we investigated questions like:
What existing literature is related to this question?
What are the implications of using different types of environment classes?
What could "structure" mean, mathematically? What could "modular" mean?
What could it mean, mathematically, for something to be a model of something else?
What could a "planning module" look like? How does it relate to "search"?
Can the space of agent-like things be broken up into sub-types? What exactly is a "heuristic"?
Other posts on our progress may come out later. For this post, I'd like to simply help concretize the problem that we wish to make progress on.
What are "agent behavior" and "agent structure"?
When we say that something exhibits agent behavior, we mean that it seems to make the trajectory of the system go a certain way. We mean that, instead of the "default" way that a system might evolve over time, the presence of this agent-like thing makes it go some other way. The more specific a target it seems to hit, the more agentic we say it is. On LessWrong, the word "optimization" is often used for this type of system behavior. So that's the behavior that we're gesturing toward.
Seeing this behavior, one might say that the thing seems to want something, and tries to get it. It seems to somehow choose actions which steer the future toward the thing that it wants.
If it does this across a wide range of environments, then it seems like it must be paying attention to what happens around it, using that information to infer how the world around it works, and using that model of the world to figure out what actions to take that would be more likely to lead to the outcomes it wants. This is a vague description of a type of structure. That is, it's a description of a type of process happening inside the agent-like thing.
So, exactly when does the observation that something robustly optimizes imply that it has this kind of process going on inside it?
Our slightly more specific working hypothesis for what agent-like structure is consists of three parts: a world-model, a planning module, and a representation of the agent's values. The world-model is very roughly like Bayesian inference; it starts out ignorant about what world it's in, and updates as observations come in. The planning module somehow identifies candidate actions, and then uses the world model to predict their outcomes.
And the representation of its values is used to select which outcome is preferred. It then takes the corresponding action.
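To make the three-part hypothesis concrete, here is a deliberately toy sketch (my own illustration, not the team's formalism) of an agent whose internals factor into a world-model, a planning module, and a value representation:

```python
class ToyAgent:
    """Toy agent-like structure: belief update (world-model), candidate-action
    search (planning module), and outcome ranking (values)."""

    def __init__(self, hypotheses, prior, utility):
        self.hypotheses = hypotheses      # candidate dynamics: (obs, action) -> outcome
        self.beliefs = dict(prior)        # world-model: probability over hypotheses
        self.utility = utility            # values: outcome -> real number

    def update(self, obs, action, outcome):
        # Crude Bayesian-ish update: upweight hypotheses that predicted the outcome.
        for name, dynamics in self.hypotheses.items():
            likelihood = 1.0 if dynamics(obs, action) == outcome else 0.1
            self.beliefs[name] *= likelihood
        total = sum(self.beliefs.values())
        self.beliefs = {k: v / total for k, v in self.beliefs.items()}

    def act(self, obs, candidate_actions):
        # Planning module: score each candidate action by its expected outcome
        # value under the current beliefs, then pick the best one.
        def expected_value(action):
            return sum(p * self.utility(self.hypotheses[h](obs, action))
                       for h, p in self.beliefs.items())
        return max(candidate_actions, key=expected_value)

# Hypothetical usage: two guesses about the dynamics, utility prefers outcome 1.
hyps = {"h1": lambda obs, a: a, "h2": lambda obs, a: 1 - a}
agent = ToyAgent(hyps, {"h1": 0.5, "h2": 0.5}, utility=lambda o: o)
action = agent.act(obs=0, candidate_actions=[0, 1])
agent.update(obs=0, action=action, outcome=hyps["h1"](0, action))
```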
This may sound to you like an algorithm for utility maximization. But a big part of the idea behind the agent structure problem is that there is a much l...

Apr 30, 2024 • 20min
LW - Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by hugofry
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers, published by hugofry on April 30, 2024 on LessWrong.
Two Minute Summary
In this post I present my results from training a Sparse Autoencoder (SAE) on a CLIP Vision Transformer (ViT) using the ImageNet-1k dataset. I have created an interactive web app, 'SAE Explorer', to allow the public to explore the visual features the SAE has learnt, found here: https://sae-explorer.streamlit.app/ (best viewed on a laptop).
My results illustrate that SAEs can identify sparse and highly interpretable directions in the residual stream of vision models, enabling inference-time inspection of the model's activations. To demonstrate this, I have included a 'guess the input image' game on the web app that allows users to guess the input image purely from the SAE activations of a single layer and token of the residual stream.
I have also uploaded a (slightly outdated) accompanying talk of my results, primarily listing SAE features I found interesting: https://youtu.be/bY4Hw5zSXzQ.
The primary purpose of this post is to demonstrate and emphasise that SAEs are effective at identifying interpretable directions in the activation space of vision models. In this post I highlight a small number of my favourite SAE features to demonstrate some of the abstract concepts the SAE has identified within the model's representations. I then analyse a small number of SAE features using feature visualisation to check the validity of the SAE interpretations.
Later in the post, I provide some technical analysis of the SAE. I identify a large cluster of features analogous to the 'ultra-low frequency' cluster that Anthropic identified. In line with existing research, I find that this ultra-low frequency cluster represents a single feature. I then analyse the 'neuron-alignment' of SAE features by comparing the SAE encoder matrix to the MLP output matrix.
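For readers who want a sense of how such an analysis might be run, here is a rough sketch (my own, with made-up sizes and variable names; it assumes you have already collected a matrix of SAE feature activations over a dataset and have the SAE encoder and MLP output weight matrices):

```python
import torch

# Hypothetical inputs with small stand-in sizes.
# acts: (n_tokens, d_features) SAE feature activations over a dataset
# W_enc: (d_model, d_features) SAE encoder; W_mlp_out: (d_mlp, d_model) MLP output weights
n_tokens, d_model, d_mlp, d_features = 10_000, 768, 3072, 6144
acts = torch.relu(torch.randn(n_tokens, d_features) - 3)   # stand-in sparse activations
W_enc = torch.randn(d_model, d_features)
W_mlp_out = torch.randn(d_mlp, d_model)

# Feature firing frequency: fraction of tokens on which each feature is active.
# An "ultra-low frequency" cluster shows up as a spike of features that fire on
# only a tiny fraction of inputs.
freqs = (acts > 0).float().mean(dim=0)
ultra_low = (freqs < 1e-3).sum().item()
print(f"{ultra_low} features fire on fewer than 0.1% of tokens")

# Neuron alignment: max cosine similarity between each SAE encoder direction and
# the MLP neurons' output directions; values near 1 suggest a neuron-aligned feature.
enc_dirs = torch.nn.functional.normalize(W_enc, dim=0)      # columns: feature directions
mlp_dirs = torch.nn.functional.normalize(W_mlp_out, dim=1)  # rows: neuron output directions
alignment = (mlp_dirs @ enc_dirs).abs().max(dim=0).values   # (d_features,)
print("mean max alignment:", alignment.mean().item())
```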
This research was conducted as part of the ML Alignment and Theory Scholars program 2023/2024 winter cohort. Special thanks to Joseph Bloom for providing generous amounts of his time and support (in addition to the SAE Lens code base) as well as LEAP labs for helping to produce the feature visualisations and weekly meetings with Jessica Rumbelow.
Example: animals eating other animals feature (top 16 highest-activating images).
Example: Italian feature. Note that the photo of the dog has a watermark with a website ending in .it (Italy's domain name). Note also that the bottom left photo is of Italian writing. The number of ambulances present is a byproduct of using ImageNet-1k.
Motivation
Frontier AI systems are becoming increasingly multimodal, and capabilities may advance significantly as multimodality increases due to transfer learning between different data modalities and tasks. As a heuristic, consider how much intuition humans gain for the world through visual reasoning; even in abstract settings such as in maths and physics, concepts are often understood most intuitively through visual reasoning.
Many cutting-edge systems today such as DALL-E and Sora use ViTs trained on multimodal data. Almost by definition, AGI is likely to be multimodal. Despite this, very little effort has been made to apply and adapt our current mechanistic interpretability techniques to vision tasks or multimodal models. I believe it is important to check that mechanistic interpretability generalises to these systems in order to ensure our techniques are future-proof and can be applied to safeguard against AGI.
In this post, I restrict the scope of my research to specifically investigating SAEs trained on multimodal models. The particular multimodal system I investigate is CLIP, a model trained on image-text pairs. CLIP consists of two encoders: a language model and a vision model that are trained to e...

Apr 29, 2024 • 22min
AF - Towards a formalization of the agent structure problem by Alex Altair
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards a formalization of the agent structure problem, published by Alex Altair on April 29, 2024 on The AI Alignment Forum.
In Clarifying the Agent-Like Structure Problem (2022), John Wentworth describes a hypothetical instance of what he calls a selection theorem. In Scott Garrabrant's words, the question is, does agent-like behavior imply agent-like architecture? That is, if we take some class of behaving things and apply a filter for agent-like behavior, do we end up selecting things with agent-like architecture (or structure)? Of course, this question is heavily under-specified.
So another way to ask this is, under which conditions does agent-like behavior imply agent-like structure? And, do those conditions feel like they formally encapsulate a naturally occurring condition?
For the Q1 2024 cohort of AI Safety Camp, I was a Research Lead for a team of six people, where we worked a few hours a week to better understand and make progress on this idea. The teammates[1] were Einar Urdshals, Tyler Tracy, Jasmina Nasufi, Mateusz Bagiński, Amaury Lorin, and Alfred Harwood.
The AISC project duration was too short to find and prove a theorem version of the problem. Instead, we investigated questions like:
What existing literature is related to this question?
What are the implications of using different types of environment classes?
What could "structure" mean, mathematically? What could "modular" mean?
What could it mean, mathematically, for something to be a model of something else?
What could a "planning module" look like? How does it relate to "search"?
Can the space of agent-like things be broken up into sub-types? What exactly is a "heuristic"?
Other posts on our progress may come out later. For this post, I'd like to simply help concretize the problem that we wish to make progress on.
What are "agent behavior" and "agent structure"?
When we say that something exhibits agent behavior, we mean that it seems to make the trajectory of the system go a certain way. We mean that, instead of the "default" way that a system might evolve over time, the presence of this agent-like thing makes it go some other way. The more specific a target it seems to hit, the more agentic we say it is. On LessWrong, the word "optimization" is often used for this type of system behavior. So that's the behavior that we're gesturing toward.
Seeing this behavior, one might say that the thing seems to want something, and tries to get it. It seems to somehow choose actions which steer the future toward the thing that it wants.
If it does this across a wide range of environments, then it seems like it must be paying attention to what happens around it, using that information to infer how the world around it works, and using that model of the world to figure out what actions to take that would be more likely to lead to the outcomes it wants. This is a vague description of a type of structure. That is, it's a description of a type of process happening inside the agent-like thing.
So, exactly when does the observation that something robustly optimizes imply that it has this kind of process going on inside it?
Our slightly more specific working hypothesis for what agent-like structure is consists of three parts: a world-model, a planning module, and a representation of the agent's values. The world-model is very roughly like Bayesian inference; it starts out ignorant about what world it's in, and updates as observations come in. The planning module somehow identifies candidate actions, and then uses the world model to predict their outcomes.
And the representation of its values is used to select which outcome is preferred. It then takes the corresponding action.
This may sound to you like an algorithm for utility maximization. But a big part of the idea behind the agent structure problem is that ther...

Apr 29, 2024 • 18min
LW - Ironing Out the Squiggles by Zack M Davis
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ironing Out the Squiggles, published by Zack M Davis on April 29, 2024 on LessWrong.
Adversarial Examples: A Problem
The apparent successes of the deep learning revolution conceal a dark underbelly. It may seem that we now know how to get computers to (say) check whether a photo is of a bird, but this façade of seemingly good performance is belied by the existence of adversarial examples - specially prepared data that looks ordinary to humans, but is seen radically differently by machine learning models.
The differentiable nature of neural networks, which makes them trainable at all, is also responsible for their downfall at the hands of an adversary. Deep learning models are fit using stochastic gradient descent (SGD) to approximate the function between expected inputs and outputs.
Given an input, an expected output, and a loss function (which measures "how bad" it is for the actual output to differ from the expected output), we can calculate the gradient of the loss on the input - the derivative with respect to every parameter in our neural network - which tells us which direction to adjust the parameters in order to make the loss go down, to make the approximation better.[1]
But gradients are a double-edged sword: the same properties that make it easy to calculate how to adjust a model to make it better at classifying an image, also make it easy to calculate how to adjust an image to make the model classify it incorrectly. If we take the gradient of the loss with respect to the pixels of the image (rather than the parameters of the model), that tells us which direction to adjust the pixels to make the loss go down - or up.
(The direction of steepest increase is just the opposite of the direction of steepest decrease.) A tiny step in that direction in imagespace perturbs the pixels of an image just so - making this one the tiniest bit darker, that one the tiniest bit lighter - in a way that humans don't even notice, but which completely breaks an image classifier sensitive to that direction in the conjunction of many pixel-dimensions, making it report utmost confidence in nonsense classifications.
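A standard, minimal instance of this idea is the fast gradient sign method; the snippet below is a generic sketch rather than code from the post, and `model`, `image`, and `label` are assumed to be defined elsewhere:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, eps=2/255):
    """One-step adversarial perturbation: take the gradient of the loss with
    respect to the *pixels* (not the parameters) and step in the direction
    that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step each pixel slightly in the direction of steepest loss increase.
    adv = image + eps * image.grad.sign()
    return adv.clamp(0, 1).detach()

# Hypothetical usage, assuming `model`, `image` (1, 3, H, W) and `label` exist:
# adv_image = fgsm_perturb(model, image, label)
# print(model(image).argmax(-1), model(adv_image).argmax(-1))  # often differ
```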
Some might ask: why does it matter if our image classifier fails on examples that have been mathematically constructed to fool it? If it works for the images one would naturally encounter, isn't that good enough?
One might mundanely reply that gracefully handling untrusted inputs is a desideratum for many real-world applications, but a more forward-thinking reply might instead emphasize what adversarial examples imply about our lack of understanding of the systems we're building, separately from whether we pragmatically expect to face an adversary.
It's a problem if we think we've trained our machines to recognize birds, but they've actually learned to recognize a squiggly alien set in imagespace that includes a lot of obvious non-birds and excludes a lot of obvious birds. To plan good outcomes, we need to understand what's going on, and "The loss happens to increase in this direction" is at best only the start of a real explanation.
One obvious first guess as to what's going on is that the models are overfitting. Gradient descent isn't exactly a sophisticated algorithm. There's an intuition that the first solution that you happen to find by climbing down the loss landscape is likely to have idiosyncratic quirks on any inputs it wasn't trained for.
(And that an AI designer from a more competent civilization would use a principled understanding of vision to come up with something much better than what we get by shoveling compute into SGD.) Similarly, a hastily cobbled-together conventional computer program that passed a test suite is going to have bugs in areas not covered by the tests.
But that explanation is in tension with other evidence, like the observati...

Apr 29, 2024 • 1min
EA - Is there any way to be confident that humanity won't keep employing mass torture of animals for millions of years in the future? by Eduardo
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is there any way to be confident that humanity won't keep employing mass torture of animals for millions of years in the future?, published by Eduardo on April 29, 2024 on The Effective Altruism Forum.
Currently, I'm pursuing a bachelor's degree in Biological Sciences in order to become a researcher in the area of biorisk, because I was confident that humanity would stop inflicting tremendous amounts of suffering on other animals and would assume a net positive value in the future.
However, there was a nagging thought in the back of my head about the possibility that it would not do so, and I found this article suggesting that there is a real possibility that such a horrible scenario might actually happen.
If there is indeed a very considerable chance that humanity will keep torturing animals at an ever-growing scale, and thus keep having a net-negative value for an extremely large portion of its history, doesn't that mean that we should strive to make humanity more likely to go extinct, not less?
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Apr 29, 2024 • 3min
EA - Joining the Carnegie Endowment for International Peace by Holden Karnofsky
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Joining the Carnegie Endowment for International Peace, published by Holden Karnofsky on April 29, 2024 on The Effective Altruism Forum.
Effective today, I've left Open Philanthropy and joined the Carnegie Endowment for International Peace[1] as a Visiting Scholar. At Carnegie, I will analyze and write about topics relevant to AI risk reduction. In the short term, I will focus on (a) what AI capabilities might increase the risk of a global catastrophe; (b) how we can catch early warning signs of these capabilities; and (c) what protective measures (for example, strong information security) are important for safely handling such capabilities.
This is a continuation of the work I've been doing over the last ~year.
I want to be explicit about why I'm leaving Open Philanthropy. It's because my work no longer involves significant grantmaking, and given that I've overseen grantmaking historically, it's a significant problem for there to be confusion on this point. Philanthropy comes with particular power dynamics that I'd like to move away from, and I also think Open Philanthropy would benefit from less ambiguity about my role in its funding decisions (especially given the fact that I'm married to the President of a major AI company). I'm proud of my role in helping build Open Philanthropy, I love the team and organization, and I'm confident in the leadership it's now under; I think it does the best philanthropy in the world, and will continue to do so after I move on. I will continue to serve on its board of directors (at least for the time being).
While I'll miss the Open Philanthropy team, I am excited about joining Carnegie.
Tino Cuellar, Carnegie's President, has been an advocate for taking (what I see as) the biggest risks from AI seriously. Carnegie is looking to increase its attention to AI risk, and has a number of other scholars working on it, including Matt Sheehan, who specializes in China's AI ecosystem (an especially crucial topic in my view).
Carnegie's leadership has shown enthusiasm for the work I've been doing and plan to continue. I expect that I'll have support and freedom, in addition to an expanded platform and network, in continuing my work there.
I'm generally interested in engaging more on AI risk with people outside my existing networks. I think it will be important to build an increasingly big tent over time, and I've tried to work on approaches to risk reduction (such as responsible scaling) that have particularly strong potential to resonate outside of existing AI-risk-focused communities. The Carnegie network is appealing because it's well outside my usual network, while having many people with (a) genuine interest in risks from AI that could rise to the level of international security issues; and (b) knowledge of international affairs.
I resonate with Carnegie's mission of "helping countries and institutions take on the most difficult global problems and advance peace," and what I've read of its work has generally had a sober, nuanced, peace-oriented style that I like.
I'm looking forward to working at Carnegie, despite the bittersweetness of leaving Open Phil.
To a significant extent, though, the TL;DR of this post is that I am continuing the work I've been doing for over a year: helping to design and advocate for a framework that seeks to get early warning signs of key risks from AI, accompanied by precommitments to have sufficient protections in place by the time they come (or to pause AI development and deployment until these protections get to where they need to be).
[1] I will be at the California office and won't be relocating.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Apr 29, 2024 • 3min
LW - List your AI X-Risk cruxes! by Aryeh Englander
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: List your AI X-Risk cruxes!, published by Aryeh Englander on April 29, 2024 on LessWrong.
[I'm posting this as a very informal community request in lieu of a more detailed writeup, because if I wait to do this in a much more careful fashion then it probably won't happen at all. If someone else wants to do a more careful version that would be great!]
By crux here I mean some uncertainty you have such that your estimate for the likelihood of existential risk from AI - your "p(doom)" if you like that term - might shift significantly if that uncertainty were resolved.
More precisely, let's define a crux as a proposition such that: (a) your estimate for the likelihood of existential catastrophe due to AI would shift a non-trivial amount depending on whether that proposition was true or false; (b) you think there's at least a non-trivial probability that the proposition is true; and (c) you also think there's at least a non-trivial probability that the proposition is false.
Note 1: It could also be a variable rather than a binary proposition, for example "year human-level AGI is achieved". In that case, substitute "variable is above some number x" and "variable is below some number y" for "proposition is true" and "proposition is false".
Note 2: It doesn't have to be that the proposition / variable on its own would significantly shift your estimate. If some combination of propositions / variables would shift your estimate, then those propositions / variables are cruxes at least when combined.
For concreteness let's say that "non-trivial" here means at least 5%. So you need to think there's at least a 5% chance the proposition is true, and at least a 5% chance that it's false, and also that your estimate for p(existential catastrophe due to AI) would shift by at least 5% depending on whether the proposition is true or false.
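The definition is concrete enough to check mechanically; here is a small sketch (mine, with made-up example numbers) of the three conditions under the 5% thresholds:

```python
def is_crux(p_true, p_doom_if_true, p_doom_if_false, threshold=0.05):
    """Check the three conditions: non-trivial chance of being true, non-trivial
    chance of being false, and a non-trivial shift in p(doom) between the cases."""
    non_trivially_true = p_true >= threshold
    non_trivially_false = (1 - p_true) >= threshold
    shifts_estimate = abs(p_doom_if_true - p_doom_if_false) >= threshold
    return non_trivially_true and non_trivially_false and shifts_estimate

# Hypothetical example: "AGIs could collectively defeat humanity if they wanted to."
print(is_crux(p_true=0.7, p_doom_if_true=0.25, p_doom_if_false=0.05))  # True
# Unconditional p(doom) implied by these example numbers, for reference:
print(0.7 * 0.25 + 0.3 * 0.05)  # 0.19
```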
Here are just a few examples of potential cruxes people might have (among many others!):
Year human-level AGI is achieved
How fast the transition will be from much lower-capability AI to roughly human-level AGI, or from roughly human-level AGI to vastly superhuman AI
Whether power seeking will be an instrumentally convergent goal
Whether AI will greatly upset the offense-defense balance for CBRN technologies in a way that favors malicious actors
Whether AGIs could individually or collectively defeat humanity if they wanted to
Whether the world can get its collective act together to pause AGI development given a clear enough signal (in combination with the probability that we'll in fact get a clear enough signal in time)
Listing all your cruxes would be the most useful, but if that is too long a list then just list the ones you find most important. Providing additional details (for example, your probability distribution for each crux and/or how exactly it would shift your p(doom) estimates) is recommended if you can but isn't necessary.
Commenting with links to other related posts on LW or elsewhere might be useful as well.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org


