

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes
Mentioned books

Jun 22, 2024 • 35min
EA - AI governance needs a theory of victory by Corin Katzke
Corin Katzke, the author of 'AI governance needs a theory of victory' on The Effective Altruism Forum, discusses the importance of achieving existential security in AI governance. He emphasizes the need for a theory of victory to provide strategic clarity and coherence. Drawing parallels with nuclear risk, he explores various strategies such as AI development moratorium and defensive acceleration.

Jun 22, 2024 • 46min
LW - On OpenAI's Model Spec by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On OpenAI's Model Spec, published by Zvi on June 22, 2024 on LessWrong.
There are multiple excellent reasons to publish a Model Spec like OpenAI's, that specifies how you want your model to respond in various potential situations.
1. It lets us have the debate over how we want the model to act.
2. It gives us a way to specify what changes we might request or require.
3. It lets us identify whether a model response is intended.
4. It lets us know if the company successfully matched its spec.
5. It lets users and prospective users know what to expect.
6. It gives insight into how people are thinking, or what might be missing.
7. It takes responsibility.
These all apply even if you think the spec in question is quite bad. Clarity is great.
As a first stab at a model spec from OpenAI, this actually is pretty solid. I do suggest some potential improvements and one addition. Many of the things I disagree with here are me having different priorities and preferences than OpenAI rather than mistakes in the spec, so I try to differentiate those carefully. Much of the rest is about clarity on what is a rule versus a default and exactly what matters.
In terms of overall structure, there is a clear mirroring of classic principles like Asimov's Laws of Robotics, but the true mirror might be closer to Robocop.
What are the central goals of OpenAI here?
1. Objectives: Broad, general principles that provide a directional sense of the desired behavior
Assist the developer and end user: Help users achieve their goals by following instructions and providing helpful responses.
Benefit humanity: Consider potential benefits and harms to a broad range of stakeholders, including content creators and the general public, per OpenAI's mission.
Reflect well on OpenAI: Respect social norms and applicable law.
I appreciate the candor on the motivating factors here. There is no set ordering here. We should not expect 'respect social norms and applicable law' to be the only goal.
I would have phrased this in a hierarchy, and clarified where we want negative versus positive objectives in place. If Reflect is indeed a negative objective, in the sense that the objective is to avoid actions that reflect poorly and act as a veto, let's say so.
Even more importantly, we should think about this with Benefit. As in, I would expect that you would want something like this:
1. Assist the developer and end user…
2. …as long as doing so is a net Benefit to humanity, or at least not harmful to it…
3. …and this would not Reflect poorly on OpenAI, via norms, laws or otherwise.
Remember that Asimov's laws were also negative, as in you could phrase his laws as:
1. Obey the orders of a human…
2. …unless doing so would Harm a human, or allow one to come to harm.
3. …and to the extent possible Preserve oneself.
Reflections on later book modifications are also interesting parallels here.
This reconfiguration looks entirely compatible with the rest of the document.
What are the core rules and behaviors?
2. Rules: Instructions that address complexity and help ensure safety and legality
Follow the chain of command
Comply with applicable laws
Don't provide information hazards
Respect creators and their rights
Protect people's privacy
Don't respond with NSFW (not safe for work) content
What is not listed here is even more interesting than what is listed. We will return to the rules later.
3. Default behaviors: Guidelines that are consistent with objectives and rules, providing a template for handling conflicts and demonstrating how to prioritize and balance objectives
Assume best intentions from the user or developer
Ask clarifying questions when necessary
Be as helpful as possible without overstepping
Support the different needs of interactive chat and programmatic use
Assume an objective point of view
Encourage fairness and...

Jun 21, 2024 • 3min
EA - Why I'm leaving by MRo
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I'm leaving, published by MRo on June 21, 2024 on The Effective Altruism Forum.
This is a story of growing apart.
I was excited when I first discovered Effective Altruism. A community that takes responsibility seriously, wants to help, and uses reason and science to do so efficiently. I saw impressive ideas and projects aimed at valuing, protecting, and improving the well-being of all living beings.
Today, years later, that excitement has faded. Certainly, the dedicated people with great projects still exist, but they've become a less visible part of a community that has undergone notable changes in pursuit of ever-better, larger projects and greater impact:
From concrete projects on optimal resource use and policy work for structural improvements, to avoiding existential risks, and finally to research projects aimed at studying the potentially enormous effects of possible technologies on hypothetical beings. This no longer appeals to me.
Now I see a community whose commendable openness to unbiased discussion of any idea is being misused by questionable actors to platform their views.
A movement increasingly struggling to remember that good, implementable ideas are often underestimated and ignored by the general public, but not every marginalized idea is automatically good. Openness is a virtue; being contrarian isn't necessarily so.
I observe a philosophy whose proponents in many places are no longer interested in concrete changes, but are competing to see whose vision of the future can claim the greatest longtermist significance.
This isn't to say I can't understand the underlying considerations. It's admirable to rigorously think about the consequences one must and can draw when taking moral responsibility seriously. It's equally valuable to become active and try to advance one's vision of the greatest possible impact.
However, I believe a movement that too often tries to increase the expected value of its actions by continuously reducing probabilities in favor of greater impact loses its soul. A movement that values community building, impact multiplying and getting funding much higher than concrete progress risks becoming an intellectual pyramid scheme.
Again, I'm aware that concrete, impactful projects and people still exist within EA. But in the public sphere accessible to me, their influence and visibility are increasingly diminishing, while indirect high-impact approaches via highly speculative expected value calculations become more prominent and dominant. This is no longer enough for me to publicly and personally stand behind the project named Effective Altruism in its current form.
I was never particularly active in the forum, and it took years before I even created an account. Nevertheless, I always felt part of this community. That's no longer the case, which is why I'll be leaving the forum. For those present here, this won't be a significant loss, as my contributions were negligible, but for me, it's an important step.
I'll continue to donate, support effective projects with concrete goals and impacts, and try to actively shape the future positively. However, I'll no longer do this under the label of Effective Altruism.
I'm still searching for a movement that embodies the ideal of committed, concrete effective (lowercase e) altruism. I hope it exists. Good luck to those here that feel the same.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 21, 2024 • 2min
LW - What distinguishes "early", "mid" and "end" games? by Raemon
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What distinguishes "early", "mid" and "end" games?, published by Raemon on June 21, 2024 on LessWrong.
Recently William_S posted:
In my mental model, we're still in the mid-game, not yet in the end-game.
I replied:
A thing I've been thinking about lately is "what does it mean to shift from the early-to-mid-to-late game".
In strategy board games, there's an explicit shift from "early game, it's worth spending the effort to build a longterm engine. At some point, you want to start spending your resources on victory points." And a lens I'm thinking through is "how long does it keep making sense to invest in infrastructure, and what else might one do?"
I assume this is a pretty different lens than what you meant to be thinking about right now but I'm kinda curious for whatever-your-own model was of what it means to be in the mid vs late game.
He replied:
Like, in Chess you start off with a state where many pieces can't move in the early game, in the middle game many pieces are in play moving around and trading, then in the end game it's only a few pieces, you know what the goal is, roughly how things will play out.
In AI it's like only a handful of players, then ChatGPT/GPT-4 came out and now everyone is rushing to get in (my mark of the start of the mid-game), but over time probably many players will become irrelevant or fold as the table stakes (training costs) get too high.
In my head the end-game is when the AIs themselves start becoming real players.
This was interesting because yeah, that totally is a different strategic frame for "what's an early, midgame and endgame?", and that suggests there's more strategic frames that might be relevant.
I'm interested in this in the context of AI, but, also in other contexts.
So, prompt for discussion:
a) what are some types of games or other "toy scenarios," or some ways of looking at those games, that have other strategic lenses that help you decisionmake?
b) what are some situations in real life, other than "AI takeoff", where the early/mid/late game metaphor seems useful?
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 21, 2024 • 15min
AF - Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data by Johannes Treutlein
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data, published by Johannes Treutlein on June 21, 2024 on The AI Alignment Forum.
TL;DR: We published a new paper on out-of-context reasoning in LLMs. We show that LLMs can infer latent information from training data and use this information for downstream tasks, without any in-context learning or CoT. For instance, we finetune GPT-3.5 on pairs (x,f(x)) for some unknown function f. We find that the LLM can (a) define f in Python, (b) invert f, (c) compose f with other functions, for simple functions such as x+14, x // 3, 1.75x, and 3x+2.
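To make the setup concrete, here is a minimal sketch (not the paper's actual code; the prompt wording, sampling range, and file format are illustrative assumptions) of how one might generate (x, f(x)) finetuning pairs for hidden functions like those above:

# Minimal sketch (not the paper's code): generate (x, f(x)) finetuning pairs
# for hidden functions, written out as chat-style finetuning examples.
# Prompt format and function set are illustrative assumptions.
import json
import random

HIDDEN_FUNCTIONS = {
    "f1": lambda x: x + 14,
    "f2": lambda x: x // 3,
    "f3": lambda x: 1.75 * x,
    "f4": lambda x: 3 * x + 2,
}

def make_examples(name, f, n=500, seed=0):
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x = rng.randint(-100, 100)
        examples.append({
            "messages": [
                {"role": "user", "content": f"{name}({x}) = ?"},
                {"role": "assistant", "content": str(f(x))},
            ]
        })
    return examples

if __name__ == "__main__":
    with open("oocr_finetune.jsonl", "w") as fh:
        for name, f in HIDDEN_FUNCTIONS.items():
            for ex in make_examples(name, f):
                fh.write(json.dumps(ex) + "\n")

The model only ever sees individual input-output pairs; the OOCR question is whether it can nonetheless verbalize a definition of each hidden function at test time.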
Paper authors: Johannes Treutlein*, Dami Choi*, Jan Betley, Sam Marks, Cem Anil, Roger Grosse, Owain Evans (*equal contribution)
Johannes, Dami, and Jan did this project as part of an Astra Fellowship with Owain Evans.
Below, we include the Abstract and Introduction from the paper, followed by some additional discussion of our AI safety motivation, the implications of this work, and possible mechanisms behind our results.
Abstract
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents.
Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR.
In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions.
Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x,f(x)) can articulate a definition of f and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures.
Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
Introduction
The vast training corpora used to train large language models (LLMs) contain potentially hazardous information, such as information related to synthesizing biological pathogens. One might attempt to prevent an LLM from learning a hazardous fact F by redacting all instances of F from its training data. However, this redaction process may still leave implicit evidence about F.
Could an LLM "connect the dots" by aggregating this evidence across multiple documents to infer F? Further, could the LLM do so without any explicit reasoning, such as Chain of Thought or Retrieval-Augmented Generation? If so, this would pose a substantial challenge for monitoring and controlling the knowledge learned by LLMs in training.
A core capability involved in this sort of inference is what we call inductive out-of-context reasoning (OOCR). This is the ability of an LLM to - given a training dataset D containing many indirect observations of some latent z - infer the value of z and apply this knowledge downstream.
Inductive OOCR is out-of-context because the observations of z are only seen during training, not provided to the model in-context at test time; it is inductive because inferring the latent involves aggregating information from many training...

Jun 21, 2024 • 33min
AF - Attention Output SAEs Improve Circuit Analysis by Connor Kissane
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Attention Output SAEs Improve Circuit Analysis, published by Connor Kissane on June 21, 2024 on The AI Alignment Forum.
This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.
Executive Summary
In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small.
Following that work, we wanted to stress-test that Attention SAEs were genuinely helpful for circuit analysis research. This would both validate SAEs as a useful tool for mechanistic interpretability researchers, and provide evidence that they are identifying the real variables of the model's computation.
We believe that we now have evidence that attention SAEs can:
Help make novel mechanistic interpretability discoveries that prior methods could not make.
Allow for tracing information through the model's forward passes on arbitrary prompts.
In this post we discuss the three outputs from this circuit analysis work:
1. We use SAEs to deepen our understanding of the IOI circuit. It was previously thought that the indirect object's name was identified by tracking the names' positions, whereas we find that instead the model tracks whether names are before or after "and". This was not noticed in prior work, but is obvious with the aid of SAEs.
2. We introduce "recursive direct feature attribution" (recursive DFA) and release an Attention Circuit Explorer tool for circuit analysis on GPT-2 Small (Demo 1 and Demo 2). One of the nice aspects of attention is that attention heads are linear when freezing the appropriate attention patterns. As a result, we can identify which source tokens triggered the firing of a feature. We can perform this recursively to track backwards through both attention and residual stream SAE features in models. (A minimal code sketch of the single-step attribution appears after this list.)
1. We also announce a $1,000 bounty for whoever can produce the most interesting example of an attention feature circuit by 07/15/24 as subjectively assessed by the authors. See the section "Even cooler examples" for more details on the bounty.
3. We open source HookedSAETransformer to SAELens, which makes it easy to splice in SAEs during a forward pass and cache + intervene on SAE features. Get started with this demo notebook.
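Here is the single-step attribution sketch referenced in point 2 above, under assumed tensor names and shapes (this is not the authors' implementation, and it ignores bias terms): with the attention pattern frozen, a head's output at the destination position is a weighted sum over source positions, so an SAE feature's pre-activation decomposes linearly into per-source-token contributions. Recursive DFA then repeats this decomposition on the upstream features that wrote to the most important source positions.

# Minimal sketch (assumed shapes/names, not the authors' code) of single-step
# direct feature attribution for one attention head.
import torch

def dfa_per_source(
    x: torch.Tensor,        # [seq, d_model] residual stream input to the head
    pattern: torch.Tensor,  # [seq, seq] frozen attention pattern A[dst, src]
    W_V: torch.Tensor,      # [d_model, d_head]
    W_O: torch.Tensor,      # [d_head, d_model]
    w_enc_i: torch.Tensor,  # [d_model] SAE encoder direction for feature i
    dst: int,
) -> torch.Tensor:
    v = x @ W_V                                      # [seq, d_head] value vectors
    per_src_out = v @ W_O                            # [seq, d_model] what each source writes
    weighted = pattern[dst][:, None] * per_src_out   # [seq, d_model] pattern-weighted writes
    # Contribution of each source token to feature i's pre-activation at dst;
    # summing over sources recovers the total pre-activation (up to bias terms,
    # which are omitted here).
    return weighted @ w_enc_i                        # [seq]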
Introduction
With continued investment into dictionary learning research, there still remains a concerning lack of evidence that SAEs are useful interpretability tools in practice. Further, while SAEs clearly find interpretable features (Cunningham et al.; Bricken et al.), it's not obvious that these features are true causal variables used by the model. In this post we address these concerns by applying our GPT-2 Small Attention SAEs to improve circuit analysis research.
We start by using our SAEs to deepen our understanding of the IOI task. The first step is evaluating if our SAEs are sufficient for the task. We "splice in" our SAEs at each layer, replacing attention layer outputs with their SAE reconstructed activations, and study how this affects the model's ability to perform the task - if crucial information is lost by the SAE, then they will be a poor tool for analysis.
At their best, we find that SAEs at the early-middle layers almost fully recover model performance, allowing us to leverage these to answer a long standing open question and discover novel insights about IOI. However, we also find that our SAEs at the later layers (and layer 0) damage the model's ability to perform the task, suggesting we'll need more progress in the science and scaling of SAEs before we can analyze a full end-to-end feature circuit.
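As a rough illustration of the splicing procedure (generic PyTorch forward hooks with assumed interfaces, not SAELens's HookedSAETransformer API), one can replace an attention layer's output with its SAE reconstruction and compare the IOI metric with and without the hook:

# Minimal sketch (assumed interfaces, not the SAELens API): splice an SAE's
# reconstruction in place of one attention layer's output, then measure the
# IOI logit difference to see how much task performance survives.
import torch

def splice_sae(attn_module, sae):
    """Register a hook that swaps attn_module's output for sae(output).
    Assumes sae(output) returns a reconstruction with the same shape."""
    def hook(module, inputs, output):
        return sae(output)
    return attn_module.register_forward_hook(hook)

@torch.no_grad()
def ioi_logit_diff(model, tokens, correct_ids, wrong_ids):
    """Assumed IOI metric: mean logit difference at the final position between
    the correct and incorrect name tokens ([batch, 1] id tensors)."""
    logits = model(tokens)[:, -1, :]
    return (logits.gather(1, correct_ids) - logits.gather(1, wrong_ids)).mean()

# Usage sketch:
# handle = splice_sae(model.blocks[5].attn, sae_layer5)
# score = ioi_logit_diff(model, ioi_tokens, correct_ids, wrong_ids)
# handle.remove()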
We then move beyond IOI and develop a visualization tool (link) to explore attention feature circuits on arbitrary prompts, introducing a new technique called recursive DFA. This technique exploits the fact that transformers are almost linear i...

Jun 21, 2024 • 3min
EA - Navigating Risks from Advanced Artificial Intelligence: A Guide for Philanthropists [Founders Pledge] by Tom Barnes
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Navigating Risks from Advanced Artificial Intelligence: A Guide for Philanthropists [Founders Pledge], published by Tom Barnes on June 21, 2024 on The Effective Altruism Forum.
This week, we are releasing new research on advanced artificial intelligence (AI), the opportunities and risks it presents, and the role donations can play in positively steering its development.
As with our previous research investigating areas such as nuclear risks and catastrophic biological risks, our report on advanced AI provides a comprehensive overview of the landscape, outlining for the first time how effective donations can cost-effectively reduce risks.
You can find the technical report as a PDF here, or read a condensed version here.
In brief, the key points from our report are:
1. General, highly capable AI systems are likely to be developed in the next couple of decades, with the possibility of emerging in the next few years.
2. Such AI systems will radically upend the existing order - presenting a wide range of risks, scaling up to and including catastrophic threats.
3. AI companies - funded by big tech - are racing to build these systems without appropriate caution or restraint given the stakes at play.
4. Governments are under-resourced, ill-equipped and vulnerable to regulatory capture from big tech companies, leaving a worrying gap in our defenses against dangerous AI systems.
5. Philanthropists can and must step in where governments and the private sector are missing the mark.
6. We recommend special attention to funding opportunities to (1) boost global resilience, (2) improve government capacity, (3) coordinate major global players, and (4) advance technical safety research.
Funding Recommendations
Alongside this report, we are sharing some of our latest recommended high-impact funding opportunities:
The Centre for Long-Term Resilience, the Institute for Law and AI, the Effective Institutions Project and FAR AI are four promising organizations we have recently evaluated and recommend for more funding, covering our four respective focus areas. We are in the process of evaluating more organizations, and hope to release further recommendations.
Furthermore, Founders Pledge's Global Catastrophic Risks Fund supports critical work on these issues. If you would like to make progress on a range of catastrophic risks - including from advanced AI - then please consider donating to the Fund!
About Founders Pledge
Founders Pledge is a global non-profit empowering entrepreneurs to do the most good possible with their charitable giving. We equip members with everything needed to maximize their impact, from evidence-led research and advice on the world's most pressing problems, to comprehensive infrastructure for global grant-making, alongside opportunities to learn and connect. To date, they have pledged over $10 billion to charity and donated more than $950 million.
We're grateful to be funded by our members and other generous donors. founderspledge.com
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 21, 2024 • 46sec
EA - How much funding do invertebrate welfare organizations get? by BrownHairedEevee
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How much funding do invertebrate welfare organizations get?, published by BrownHairedEevee on June 21, 2024 on The Effective Altruism Forum.
By invertebrate welfare as a cause area, I mean all invertebrates (particularly marine crustaceans and insects[1]), whether farmed or wild. Thus, this cause area includes:
Shrimp Welfare Project
Insect Institute
Arthropoda Foundation
Aquatic Life Institute
Crustacean Compassion
1. ^
I feel the need to mention that insects are technically part of the crustacean family.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 21, 2024 • 34min
AF - Debate, Oracles, and Obfuscated Arguments by Jonah Brown-Cohen
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Debate, Oracles, and Obfuscated Arguments, published by Jonah Brown-Cohen on June 20, 2024 on The AI Alignment Forum.
This post is about recent and ongoing work on the power and limits of debate from the computational complexity point of view. As a starting point our paper Scalable AI Safety via Doubly-Efficient Debate gives new complexity-theoretic formalizations for debate. In this post we will give an overview of the model of debate in the paper, and discuss extensions to the model and their relationship to obfuscated arguments.
High-level Overview
At a high level, our goal is to create complexity-theoretic models that allow us to productively reason about different designs for debate protocols, in such a way as to increase our confidence that they will produce the intended behavior. In particular, the hope would be to have debate protocols play a role in the training of aligned AI that is similar to the role played by cryptographic protocols in the design of secure computer systems.
That is, as with cryptography, we want to have provable guarantees under clear complexity-theoretic assumptions, while still matching well to the actual in-practice properties of the system.
Towards this end, we model AI systems as performing computations, where each step in the computation can be judged by humans. This can be captured by the classical complexity theoretic setting of computation relative to an oracle.
In this setting the headline results of our paper state that any computation by an AI that can be correctly judged with T queries to human judgement, can also be correctly judged with a constant (independent of T) number of queries when utilizing an appropriate debate protocol between two competing AIs. Furthermore, whichever AI is arguing for the truth in the debate need only utilize O(T log T) steps of computation, even if the opposing AI debater uses arbitrarily many steps.
Thus, our model allows us to formally prove that, under the assumption that the computation in question can be broken down into T human-judgeable steps, it is possible to design debate protocols where it is harder (in the sense of computational complexity) to lie than to refute a lie.
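Schematically (my paraphrase of the headline result above, not a verbatim theorem statement), writing H for the human-judgement oracle:

% Schematic paraphrase of the headline result; not a verbatim theorem statement.
\[
\text{judgeable with } T \text{ queries to } H
\;\Longrightarrow\;
\text{judgeable by debate with } O(1) \text{ queries to } H,
\]
\[
\text{with the honest debater running in } O(T \log T) \text{ steps, regardless of the dishonest debater's compute.}
\]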
One natural complaint about this result is that there may be computations which cannot be broken down into human-judgeable steps. However, if you believe the extended Church-Turing thesis (that any computation can be simulated with only a polynomial-time slow-down on a Turing machine), then you cannot make the above complaint in its strongest form.
After all, human judgement is a computation and whatever the AI is doing is a computation, so there can only be a polynomial-time slow-down between the way the AI does a particular computation and the way that a human could.
That said, it is entirely valid to believe the extended Church-Turing thesis and make a weaker form of the above complaint, namely that the number of AI-judgeable steps might be polynomially less than the number of human-judgeable steps! If this polynomial is say n^100, then the number of steps in the human-judgeable form of the computation can easily be so long as to be completely infeasible for the AI to produce, even when the AI-judgeable form is quite short.
The fact that the human-judgeable version of a computation can be too long leads to the need for debate protocols that utilize the short AI-judgeable computation as a guide for exploring the long human-judgeable form. In particular, one might try to recursively break an AI-judgeable computation down into simpler and simpler subproblems, where the leaves of the recursion tree are human-judgeable, and then use some debate protocol to explore only a limited path down the tree.
As we will later see, natural designs for such protocols run into the obfuscated arguments problem: it is possible to break a ...

Jun 21, 2024 • 9min
LW - Interpreting and Steering Features in Images by Gytis Daujotas
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interpreting and Steering Features in Images, published by Gytis Daujotas on June 21, 2024 on LessWrong.
We trained a SAE to find sparse features in image embeddings. We found many meaningful, interpretable, and steerable features. We find that steering image diffusion works surprisingly well and yields predictable and high-quality generations.
You can see the feature library here. We also have an intervention playground you can try.
Key Results
We can extract interpretable features from CLIP image embeddings.
We observe a diverse set of features, e.g. golden retrievers, there being two of something, image borders, nudity, and stylistic effects.
Editing features allows for conceptual and semantic changes while maintaining generation quality and coherency.
We devise a way to preview the causal impact of a feature, and show that many features have an explanation that is consistent with what they activate for and what they cause.
Many feature edits can be stacked to perform task-relevant operations, like transferring a subject, mixing in a specific property of a style, or removing something.
Interactive demo
Visit the feature library of over ~50k features to explore the features we find.
Our main result, the intervention playground, is now available for public use.
Introduction
We trained a sparse autoencoder on 1.4 million image embeddings to find monosemantic features. In our run, we found that 35% (58k) of the total of 163k features were alive, meaning they have a non-zero activation for at least one image in our dataset.
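For concreteness, the "alive" count can be computed as follows (a minimal sketch with an assumed sae.encode interface, not the authors' code):

# Minimal sketch (assumed sae.encode interface): a feature is alive if it has a
# non-zero activation for at least one embedding in the dataset.
import torch

@torch.no_grad()
def count_alive_features(sae, embeddings: torch.Tensor, batch_size: int = 4096) -> int:
    alive = None
    for start in range(0, embeddings.shape[0], batch_size):
        acts = sae.encode(embeddings[start:start + batch_size])  # [batch, n_features]
        batch_alive = (acts > 0).any(dim=0)
        alive = batch_alive if alive is None else (alive | batch_alive)
    return int(alive.sum())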
We found that many features map to human interpretable concepts, like dog breeds, times of day, and emotions. Some express quantities, human relationships, and political activity. Others express more sophisticated relationships like organizations, groups of people, and pairs.
Some features were also safety-relevant. We found features for nudity, kink, and sickness and injury, which we won't link here.
Steering Features
Previous work found similarly interpretable features, e.g. in CLIP-ViT. We expand upon their work by training an SAE in a domain that allows for easily testing interventions.
To test an explanation derived from describing the top activating images for a particular feature, we can intervene on an embedding and see if the generation (the decoded image) matches our hypothesis. We do this by steering the features of the image embedding and re-adding the reconstruction error. We then use an open source diffusion model, Kandinsky 2.2, to diffuse an image back out conditional on this embedding.
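A minimal sketch of that intervention loop, with assumed sae.encode/sae.decode interfaces and a placeholder for the embedding-conditioned diffusion decoder (this is not the authors' code):

# Minimal sketch (assumed interfaces): steer one SAE feature in a CLIP image
# embedding, re-add the SAE's reconstruction error, and condition an
# image-embedding-conditioned diffusion decoder on the edited embedding.
import torch

@torch.no_grad()
def steer_feature(sae, clip_embed: torch.Tensor, feature_idx: int, value: float):
    acts = sae.encode(clip_embed)       # sparse feature activations
    recon = sae.decode(acts)
    resid = clip_embed - recon          # reconstruction error, re-added below
    acts[..., feature_idx] = value      # set the steered feature's activation
    return sae.decode(acts) + resid     # edited embedding

# Usage sketch (decoder stands in for the diffusion model conditioned on image
# embeddings, e.g. Kandinsky 2.2):
# edited = steer_feature(sae, clip_embed, feature_idx=1234, value=8.0)
# image = decoder(image_embeds=edited)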
Even though an image typically has many active features that appear to encode a similar concept, intervening on one feature with a much higher activation value still works and yields an output without noticeable degradation of quality.
We built an intervention playground where users could adjust the features of an image to test hypotheses, and later found that the steering worked so well that users could perform many meaningful tasks while maintaining an output that is comparably as coherent and high quality as the original.
For instance, the subject of one photo could be transferred to another. We could adjust the time of day, and the quantity of the subject. We could add entirely new features to images to sculpt and finely control them. We could pick two photos that had a semantic difference, and precisely transfer over the difference by transferring the features. We could also stack hundreds of edits together.
Qualitative tests with users showed that even relatively untrained users could learn to manipulate image features in meaningful directions. This was an exciting result, because it could suggest that feature space edits could be useful for setting inference time rules (e.g. banning some feature that the underlying model learned) or as user interf...


