

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

May 15, 2024 • 11min
EA - Robust longterm comparisons by Toby Ord
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Robust longterm comparisons, published by Toby Ord on May 15, 2024 on The Effective Altruism Forum.
(Cross-posted from http://www.tobyord.com/writing/robust-longterm-comparisons )
The choice of discount rate is crucially important when comparing options that could affect our entire future. Except when it isn't. Can we tease out a class of comparisons that everyone can agree on regardless of their views on discounting?
Some of the actions we can take today may have longterm effects - permanent changes to humanity's longterm trajectory. For example, we may take risks that could lead to human extinction. Or we might irreversibly destroy parts of our environment, creating permanent reductions in the quality of life.
Evaluating and comparing such effects is usually extremely sensitive to what economists call the pure rate of time preference, denoted ρ. This is a way of encapsulating how much less we should value a benefit simply because it occurs at a later time.
There are other components of the overall discount rate that adjust for the fact that an extra dollar is worth less when people are richer, that later benefits may be less likely to occur - or that the entire society may have ceased to exist by then. But the pure rate of time preference is the amount by which we should discount future benefits even after all those things have been accounted for.
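For reference, this decomposition is usually written with the standard Ramsey discounting formula (a textbook identity, not something spelled out in the post): the overall social discount rate r is

r = ρ + η·g

where g is the growth rate of consumption and η captures how quickly the marginal value of an extra dollar falls as people get richer. The pure rate of time preference ρ is the part of r that remains after the wealth adjustment η·g (and any allowance for the benefit failing to materialise) has been taken out.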
Most attempts to evaluate or compare options with longterm effects get caught up in intractable disagreements about ρ. Philosophers almost uniformly think ρ should be set to zero, with any bias towards the present being seen as unfair. That is my usual approach, and I've developed a framework for making longterm comparisons without any pure time preference. While some prominent economists agree that ρ should be zero, the default in economic analysis is to use a higher rate, such as 1% per year.
The difference between a rate of 0% and 1% is small for most things economists evaluate, where the time horizon is a generation or less. But it makes a world of difference to the value of longterm effects. For example, ρ = 1% implies that a stream of damages starting in 500 years' time and lasting a billion years is less bad than a single year of such damages today.
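As a quick sanity check of that claim, here is a minimal sketch in Python (mine, not the post's; it is only the discounting arithmetic):

# Present value of a damage stream under a 1% pure rate of time preference.
rho = 0.01                      # pure rate of time preference
start, years = 500, 1_000_000_000

# PV of 1 unit of damage per year from `start` for `years` years, using the
# closed form of the geometric series to avoid a billion-term loop.
r = 1 / (1 + rho)
pv_stream = r**start * (1 - r**years) / (1 - r)

pv_today = 1.0                  # one year of the same damages, undiscounted

print(f"PV of billion-year stream starting in year 500: {pv_stream:.3f}")
print(f"PV of a single year of damages today:           {pv_today:.3f}")
# With rho = 1%, the stream's present value is roughly 0.70 < 1.0,
# consistent with the claim in the text.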
So when you see a big disagreement on how to make a tradeoff between, say, economic benefits and existential risk, you can almost always pinpoint the source to a disagreement about ρ.
This is why it was so surprising to read Charles Jones's recent paper: 'The AI Dilemma: Growth versus Existential Risk'. In his examination of whether and when the economic gains from developing advanced AI could outweigh the resulting existential risk, the rate of pure time preference just cancels out. The value of ρ plays no role in his primary model. There were many other results in the paper, but it was this detail that grabbed my attention.
Here was a question about trading off risk of human extinction against improved economic consumption that economists and philosophers might actually be able to agree on. After all, even better than picking the correct level of ρ, deriving the correct conclusion, and yet still having half the readers ignore your findings, is if there is a way of conducting the analysis such that you are not only correct - but that everyone else can see that too.
Might we be able to generalise this happy result further?
Is there a broader range of long run effects in which the discount rate still cancels out?
Are there other disputed parameters (empirical or normative) that also cancel out in those cases?
What I found is that this can indeed be greatly generalised, creating a domain in which we can robustly compare long run effects - where the comparisons are completely unaffected by different assumptions about discounting.
Let's start by considering a basic model w...

May 15, 2024 • 2min
EA - No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance by Nicholas Kruus
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance, published by Nicholas Kruus on May 15, 2024 on The Effective Altruism Forum.
Extended version of a short paper accepted at DPFM, ICLR'24. Authored by Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H.S. Torr, Adel Bibi, Samuel Albanie, and Matthias Bethge.
Similar to "The Importance of (Exponentially More) Computing Power."
Abstract:
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during "zero-shot" evaluation.
In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts.
We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and testing on purely synthetic data distributions.
Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the "Let it Wag!" benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
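To make the "log-linear scaling trend" concrete, here is a toy illustration (the coefficients are hypothetical, chosen for readability; they are not the paper's fitted values): if accuracy grows linearly in the log of concept frequency, each fixed accuracy gain costs another multiplicative factor of pretraining data.

import math

# Hypothetical log-linear relation: accuracy ~ a + b * log10(concept_frequency).
a, b = 0.10, 0.12

for freq in [1e2, 1e3, 1e4, 1e5, 1e6]:
    acc = a + b * math.log10(freq)
    print(f"concept frequency {freq:>9,.0f}  ->  predicted accuracy {acc:.2f}")
# Each extra 10x of pretraining examples buys the same fixed accuracy
# increment (+0.12 here) - i.e. exponentially more data for linear improvements.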
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

May 15, 2024 • 9min
EA - Hiring non-EA Talent: Pros & Cons by Tatiana K. Nesic Skuratova
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Hiring non-EA Talent: Pros & Cons, published by Tatiana K. Nesic Skuratova on May 15, 2024 on The Effective Altruism Forum.
TLDR: Asking for hires to be acquainted with EA ideas or to demonstrate how EA they are shrinks a talent pool that is already scarce. Looking outside of EA has challenges but can be a net positive in some circumstances.
Intro
Throughout my involvement in the EA world, I have seen that one of the biggest challenges facing effective altruism is attracting the right people with the right skills to execute its many promising projects. I've witnessed firsthand in my work in both China and Serbia how a lack of skilled personnel can hinder an organization's progress.
Take
EA Serbia as an example. As the EA movement is relatively new in Serbia, it's been hard to attract members and move them inward through the
concentric circles. We decided to hire two part-time employees (one focused on marketing & social media, the second focused on outreach and developing external relations) who were completely outside of EA but who had relevant professional experience with the charity world. We asked them to take the
EA Serbia Introductory Course (our own version of the
Introductory EA Program) during their first month at work as part of their onboarding. These people brought their expertise in their professional fields, which made a huge difference in our outreach efforts, and which also established a solid foundation for future EA efforts in Serbia.
I am not the first one to write
about this. There have been a variety of posts on the EA forum regarding outreach to mid-career professionals with relevant skills, attempts at
EA-specialized hiring agencies, and the like. Also,
see here for AI Policy talent gaps. People with both relevant skills and a passion for making a difference can be crucial assets. Moreover, by getting involved in EA, they serve as a vector to spread EA ideas to professional networks that otherwise would remain unconnected to EA, allowing us to reach new groups of people (we've seen this with one of our hires now spreading EA Ideas in their professional circles).
While there are certainly some roles in which high levels of EA context and knowledge are necessary, I'd encourage us all to question the impulse to say "every hire must be at least 9 out of 10 on the metric of EAness." For some roles, maybe 6 out of 10 is enough. For some roles, maybe 3 out of 10 is enough.
The person designing the programming for EAG probably needs to have a strong understanding of the EA community in order to do their work well, and the translator definitely needs to understand the concepts in order to translate them well. But the person who designs your payroll system or makes reservations for your team offsite probably doesn't.
Do you need your office manager to be an EA, or would it work out fine to simply hire a skilled office manager who gets familiar with EA during their first few months on the job? That will, of course, depend on your organization's specific circumstances, but you should at least consider that for many roles, a competent professional with several years of experience who is neutral about EA will be able to perform very well on the job.
Key Takeaways
Pros:
Expertise is important, and it gets things done:
Certain roles demand specific expertise or experience. Sometimes, you really do want an experienced project manager with a PMP certification rather than an organized person who can write well, who ran the EA club at their school, and who has never had a job before. Some skills are easier or harder for a person with no experience to learn and pick up. Consider accounting, marketing, web design, or recruitment; some of these have
positive and some of these have negative learning curves. Between candidates Alice and Bob, if Alice has expertise and Bo...

May 15, 2024 • 5min
LW - MIRI's May 2024 Newsletter by Harlan
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: MIRI's May 2024 Newsletter, published by Harlan on May 15, 2024 on LessWrong.
MIRI updates:
MIRI is shutting down the Visible Thoughts Project.
We originally
announced the project in November of 2021. At the time we were hoping we could build a new type of data set for training models to exhibit more of their inner workings. MIRI leadership is pessimistic about humanity's ability to solve the alignment problem in time, but this was an idea that seemed relatively promising to us, albeit still a longshot.
We also hoped that the $1+ million bounty on the project might attract someone who could build an organization to build the data set. Many of MIRI's ambitions are bottlenecked on executive capacity, and we hoped that we might find individuals (and/or a process) that could help us spin up more projects without requiring a large amount of oversight from MIRI leadership.
Neither hope played out, and in the intervening time, the ML field has moved on. (ML is a fast-moving field, and alignment researchers are working on a deadline; a data set we'd find useful if we could start working with it in 2022 isn't necessarily still useful if it would only become available 2+ years later.) We would like to thank the many writers and other support staff who contributed over the last two and a half years.
Mitchell Howe and Joe Rogero joined the comms team as writers. Mitch is a longtime MIRI supporter with a background in education, and Joe is a former reliability engineer who has facilitated courses for
BlueDot Impact. We're excited to have their help in transmitting MIRI's views to a broad audience.
Additionally, Daniel Filan will soon begin working with MIRI's new Technical Governance Team part-time as a technical writer. Daniel is the host of two podcasts:
AXRP, and The Filan Cabinet. As a technical writer, Daniel will help to scale up our research output and make the Technical Governance Team's research legible to key audiences.
The Technical Governance Team submitted responses to the
NTIA's request for comment on open-weight AI models, the United Nations' request for feedback on the
Governing AI for Humanity interim report, and the
Office of Management and Budget's request for information on AI procurement in government.
Eliezer Yudkowsky spoke with Semafor for a piece about
the risks of expanding the definition of "AI safety". "You want different names for the project of 'having AIs not kill everyone' and 'have AIs used by banks make fair loans.'"
A number of important developments in the larger world occurred during the MIRI Newsletter's hiatus from July 2022 to
April 2024. To recap just a few of these:
In November of 2022, OpenAI released
ChatGPT, a chatbot application that
reportedly gained 100 million users within 2 months of its launch. As we mentioned in our
2024 strategy update, GPT-3.5 and GPT-4 were more impressive than some of the MIRI team expected, representing a pessimistic update for some of us "about how plausible it is that humanity could build world-destroying AGI with relatively few (or no) additional algorithmic advances". ChatGPT's success significantly
increased public awareness of AI and sparked much of the post-2022 conversation about AI risk.
In March of 2023, the Future of Life Institute released an
open letter calling for a six-month moratorium on training runs for AI systems stronger than GPT-4. Following the letter's release, Eliezer
wrote in TIME that a six-month pause is not enough and that an indefinite worldwide moratorium is needed to avert catastrophe.
In May of 2023, the Center for AI Safety released a
one-sentence statement, "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war." We were especially pleased with this statement, because it focused attention ...

May 15, 2024 • 12min
LW - Catastrophic Goodhart in RL with KL penalty by Thomas Kwa
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Catastrophic Goodhart in RL with KL penalty, published by Thomas Kwa on May 15, 2024 on LessWrong.
TLDR: In the last two posts, we showed that optimizing for a proxy can fail to increase true utility, but only when the error is heavy-tailed. We now show that this also happens in RLHF with a KL penalty.
This post builds on our earlier result with a more realistic setting and assumptions:
Rather than modeling optimization as conditioning on a minimum reward threshold, we study maximization of reward with a KL divergence penalty, as in RLHF.
We remove the assumption of independence between the error and utility distributions, which we think was the weakest part of the last post.
When the true utility V is light-tailed, the proxy can be maximized while keeping E[V] at the same level as the prior. We can't guarantee anything about E[V] when V is heavy-tailed; it could even go to minus infinity.
Abstract
When applying KL regularization, the trained model is regularized towards some prior policy π0. One would hope that a KL penalty can produce good outcomes even in the case of reward misspecification; that is, if the reward U is the sum of true utility V and an error term X, we would hope that optimal policies under a KL penalty achieve high V even if the magnitude of X is large.
We show that this is not always the case: when X is heavy-tailed, there are arbitrarily well-performing policies π with Eπ[V] ≤ Eπ0[V]; that is, policies that get no higher true utility than the prior. However, when the error is light-tailed and independent of V, the optimal policy under a KL penalty results in E[V] > 0, and E[V] can be made arbitrarily large. Thus, the tails of the error distribution are crucial in determining how much utility will result from optimization towards an imperfect proxy.
Intuitive explanation of catastrophic Goodhart with a KL penalty
Recall that the KL divergence between two distributions P and Q is defined as DKL(P||Q) = E_{x~P}[log(P(x)/Q(x))].
If we have two policies π, π0, we abuse notation to define DKL(π||π0) as the KL divergence between the distributions of actions taken on the states in trajectories reached by π. That is, if Tr(π) is the distribution of trajectories taken by π, we penalize E_{τ~Tr(π)}[ Σ_{s in τ} DKL(π(·|s) || π0(·|s)) ].
This strongly penalizes π taking actions the base policy never takes, but does not force the policy to take all actions the base policy takes.
If our reward model gives reward U, then the optimal policy for RLHF with a KL penalty is π*(τ) ∝ π0(τ) exp(U(τ)/β).
Suppose we have an RL environment with reward U = X + V, where X is an error term that is heavy-tailed under π0, and V is the "true utility", assumed to be light-tailed under π0. Without loss of generality, we assume that E[U(π0)] = 0. If we optimize for E[U(π)] − βDKL(π||π0), there is no maximum because this expression is unbounded. In fact, it is possible to get E[U(π)] > M and DKL(π||π0) < ε simultaneously, for any M and any ε > 0.
For such policies π, it is necessarily the case that lim ε→0 E[V(π)] = 0; that is, for policies with low KL penalty, utility goes to zero. Like in the previous post, we call this catastrophic Goodhart because the utility produced by our optimized policy is as bad as if we hadn't optimized at all. This is a corollary of a property about distributions (Theorems 1 and 3 below) which we apply to the case of RLHF with unbounded rewards (Theorem 2).
The manner in which these pathological policies π achieve high E[U] is also concerning: most of the time they match the reference policy π0, but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy π, it could be impossible to tell whether π is Goodharting or identical to the base policy.
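Here is a small numerical sketch of that mechanism (my own illustration, not from the post or its appendix; the distributions and constants are arbitrary): a policy that copies the base policy except for a tiny probability mass on an extreme-reward trajectory gains a lot of proxy reward E[U] at negligible KL cost, while E[V] barely moves.

import numpy as np

rng = np.random.default_rng(0)

# Base policy pi0: uniform over n discrete "trajectories".
n = 100_000
p0 = np.full(n, 1.0 / n)

# Proxy reward U = X + V: heavy-tailed error X, light-tailed true utility V.
X = rng.pareto(1.5, size=n)        # heavy-tailed error term
V = rng.normal(0.0, 1.0, size=n)   # light-tailed "true utility"
U = X + V

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Perturbed policy: identical to pi0 except for extra mass eps on the single
# trajectory with the largest proxy reward.
best = int(np.argmax(U))
for eps in (1e-2, 1e-3, 1e-4):
    p = p0 * (1 - eps)
    p[best] += eps
    du = p @ U - p0 @ U   # gain in expected proxy reward over the prior
    dv = p @ V - p0 @ V   # gain in expected true utility over the prior
    print(f"eps={eps:g}  KL={kl(p, p0):.5f}  E[U] gain={du:+.2f}  E[V] gain={dv:+.5f}")
# The KL cost vanishes as eps -> 0, the E[U] gain scales with the extreme
# value of X, and the E[V] gain stays negligible - the pattern described above.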
Results
All proofs are in the appendix, which will be published shortly after this post.
X heavy tailed, V light tailed: E[V] → 0
We'll start by demon...

May 15, 2024 • 4min
LW - Teaching CS During Take-Off by andrew carle
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Teaching CS During Take-Off, published by andrew carle on May 15, 2024 on LessWrong.
I stayed up too late collecting way-past-deadline papers and writing report cards. When I woke up at 6, this anxious email from one of my g11 Computer Science students was already in my Inbox.
Student: Hello Mr. Carle, I hope you've slept well; I haven't.
I've been seeing a lot of new media regarding how developed AI has become in software programming, most relevantly videos about NVIDIA's new artificial intelligence software developer, Devin.
Things like these are almost disheartening for me to see as I try (and struggle) to get better at coding and developing software. It feels like I'll never use the information that I learn in your class outside of high school because I can just ask an AI to write complex programs, and it will do it much faster than I would.
I'd like to know what your thoughts on this are. Do you think AI will replace human software developers, as NVIDIA claims it will?
My response: Buddy, that is a big question for 5:15 am.
First AI horizon thoughts:
1. Software development as a field will look incredibly different in 10 years.
2. My priors say that MOST of human intellectual+economic activity will ALSO be radically different in 10 years.
3. I have a very small p(doom) for the 10 year horizon. That means I don't expect human-equivalent AGIs to completely disrupt human civilisation within 10 years.
4. The delta between how fast AI will affect software engineering and how fast AI will transform other (roughly speaking) white collar careers is relatively small. That means I expect the AI effect on, say, hedge fund management and software engineering to be similar.
Then some priors I have for teaching IB Computer Science in the middle of this take-off:
1. I don't think becoming a software engineer is the modal outcome for IBCS students
2. I believe that most long term personal utility from IBCS (or any other intro CS exposure) comes from shifting a student's mental model of how the modern social and economic system interacts with / depends on these technologies.
3. While the modern AI tools are light years beyond the simple Von Neumann CPU models and intro Python we're studying, this class does address the foundations of those systems. Similarly, HL Analysis and HL Physics don't cover anything about the math and physics that underpin these huge ML systems, but that foundation IS there. You can't approach the superstructure without it.
So, in summary, if your concern is "the world seems to be changing fast. This class is hard, and I don't think there's any chance that I will find a 2022 Novice SoftwareDev job when I'm out of university in 2029" I would strongly agree with that sentiment.
I have a Ron Swanson detachment on the importance of formal schooling. If your question was "is a traditional education sequence the best way to prepare myself for the turbulent AI takeoff period," then I strongly disagree with that statement. Education is intrinsically reflective and backward looking.
But I'm employed as a high school teacher. And your parents have decided to live here and send you to this school. So, I'm not sure if advice on that axis is actionable for either of us. There's also a huge chasm between "this isn't the best of all possible options" and "this has zero value."
If I reframed your statement as "given that I'm in this limited option IB program, what classes will provide me the best foundation to find opportunities and make novel insights in the turbulent AI takeoff period" I would feel confident recommending IBCS.
That doesn't make learning to code any easier.
Is that a good answer to a 17-year-old? Is there a good answer to this?
One of the best parts of teaching is watching young people wake up to the real, fundamental issues and challenges of human civilisation an...

May 15, 2024 • 2min
EA - Announcing UK Voters for Animals! by eleanor mcaree
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing UK Voters for Animals!, published by eleanor mcaree on May 15, 2024 on The Effective Altruism Forum.
We're excited to announce a new volunteer-run organisation,
UK Voters For Animals, dedicated to mobilising UK voters to win key legislative changes for farmed animals. Our goal is to recruit and train voters to meet with MPs and prospective MPs to build political support for our
key asks. Due to the upcoming general election, we think this is a crucial time to apply pressure on politicians.
If you want to use your political power to win change for farmed animals, sign up to get involved
here. Please share with anyone who may be interested - we're looking to find people in all 650 constituencies around the UK, so it's no small feat!
We think people in the EA community would be a great fit for helping out with this work because they are often thoughtful, pragmatic, and impact-focused. The minimum commitment required is attending a training, and participating in a 30-minute meeting with your MP and/or most promising prospective MP.
(If you need some convincing on how useful this might be:
Research shows politicians believe direct contact with citizens is the most useful way to learn about public opinion, so this is a key way for everyday advocates to contribute to meaningful policy change.)
If you want to learn more, feel free to check out our website, message me, or email
hello@ukvotersforanimals.org.
You can also keep up to date with our work on social media at:
Instagram
Facebook
X
Note: UK Voters for Animals is run by a volunteer team and isn't affiliated with any of the organisations the team works for.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

May 15, 2024 • 48sec
LW - Ilya Sutskever and Jan Leike resign from OpenAI by Zach Stein-Perlman
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ilya Sutskever and Jan Leike resign from OpenAI, published by Zach Stein-Perlman on May 15, 2024 on LessWrong.
Ilya Sutskever and Jan Leike have resigned. They led OpenAI's alignment work. Superalignment will now be led by John Schulman, it seems. Jakub Pachocki replaced Sutskever as Chief Scientist.
Reasons are unclear (as usual when safety people leave OpenAI).
The NYT piece and others I've seen don't really have details. Archive of NYT if you want to read it anyway.
OpenAI announced Sutskever's departure in a blogpost.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

May 14, 2024 • 14min
LW - How to do conceptual research: Case study interview with Caspar Oesterheld by Chi Nguyen
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to do conceptual research: Case study interview with Caspar Oesterheld, published by Chi Nguyen on May 14, 2024 on LessWrong.
Caspar Oesterheld came up with two of the most important concepts in my field of work:
Evidential Cooperation in Large Worlds and
Safe Pareto Improvements. He also came up with a potential implementation of evidential decision theory in boundedly rational agents called
decision auctions, wrote a comprehensive
review of anthropics and how it interacts with decision theory, which most of my anthropics discussions built on, and independently decided to work on AI sometime in late 2009 or early 2010.
Needless to say, I have a lot of respect for Caspar's work. I've often felt very confused about what to do in my attempts at conceptual research, so I decided to ask Caspar how he did his research. Below is my writeup from the resulting conversation.
How Caspar came up with surrogate goals
The process
Caspar had spent six months FTE thinking about a specific bargaining problem between two factions with access to powerful AI, spread over two years.
A lot of the time was spent on specific somewhat narrow research projects, e.g. modelling the impact of moral advocacy in China on which bargaining problems we'll realistically encounter in the future. At the time, he thought those particular projects were important although he maybe already had a hunch that he wouldn't think so anymore ten years down the line.
At the same time, he also spent some time on most days thinking about bargaining problems on a relatively high level, either in discussions or on walks. This made up some double digit percentage of his time spent researching bargaining problems.
Caspar came up with the idea of surrogate goals during a conversation with Tobias Baumann. Caspar describes the conversation leading up to the surrogate goal idea as "going down the usual loops of reasoning about bargaining" where you consider just building values into your AI that have properties that are strategically advantaged in bargaining but then worrying that this is just another form of aggressive bargaining.
The key insight was to go "Wait, maybe there's a way to make it not so bad for the other side." Hence, counterpart-friendly utility function modifications were born which later on turned into surrogate goals.
Once he had the core idea of surrogate goals, he spent some time trying to figure out what the general principle behind "this one weird trick" he found was. Thus, with Vincent Conitzer as his co-author, his
SPI paper was created and he continues trying to answer this question now.
Caspar's reflections on what was important during the process
He thinks it was important to just have spent a ton of time, in his case six months FTE, on the research area. This helps with building useful heuristics.
It's hard or impossible and probably fruitless to just think about a research area on an extremely high level. "You have to pass the time somehow." His particular projects, for example researching moral advocacy in China, served as a way of "passing the time" so to say.
At the same time, he thinks it is both very motivationally hard and perhaps not very sensible to work on something that's in the roughly right research area where you really can't see a direct impact case. You can end up wasting a bunch of time grinding out technical questions that have nothing much to do with anything.
Relatedly, he thinks it was really important that he continued doing some high-level thinking about bargaining alongside his more narrow projects.
He describes a common dynamic in high-level thinking: Often you get stuck on something that's conceptually tricky and just go through the same reasoning loops over and over again, spread over days, weeks, months, or years. You usually start entering the loop because you think...

May 14, 2024 • 10min
LW - How To Do Patching Fast by Joseph Miller
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How To Do Patching Fast, published by Joseph Miller on May 14, 2024 on LessWrong.
This post outlines an efficient implementation of Edge Patching that massively outperforms common hook-based implementations. This implementation is available to use in my new library, AutoCircuit, and was first introduced by Li et al. (2023).
What is activation patching?
I introduce new terminology to clarify the distinction between different types of activation patching.
Node Patching
Node Patching (a.k.a. "normal" activation patching) is when some activation in a neural network is altered from the value computed by the network to some other value. For example, we could run two different prompts through a language model and replace the output of
Attn 1 when the model is given some
input 1 with the output of the head when the model is given some other
input 2.
We will use the running example of a tiny, 1-layer transformer, but this approach generalizes to any transformer and any residual network.
All the nodes downstream of
Attn 1 will be affected by the patch.
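As a point of reference, here is roughly what a hook-based implementation of Node Patching looks like in PyTorch (a generic sketch, not AutoCircuit's API; the model and module arguments are hypothetical):

import torch

def node_patch(model, module, clean_input, corrupt_input):
    """Replace `module`'s output on `clean_input` with its output on
    `corrupt_input`, affecting everything downstream of `module`."""
    stored = {}

    # 1) Run the corrupt input and save the module's output.
    def save_hook(mod, inputs, output):
        stored["act"] = output.detach()

    handle = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(corrupt_input)
    handle.remove()

    # 2) Re-run on the clean input, overwriting the module's output with the
    #    saved corrupt activation (returning a value from a forward hook
    #    replaces the module's output).
    def patch_hook(mod, inputs, output):
        return stored["act"]

    handle = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(clean_input)
    handle.remove()
    return patched_out

Every component downstream of the patched module sees the change, which is exactly the imprecision that Edge Patching, below, is meant to avoid.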
Edge Patching
If we want to make a more precise intervention, we can think about the transformer differently, to isolate the interactions between components.
Now we can patch the edge
Attn 1 -> MLP and only nodes downstream of
MLP will be affected (e.g.
Attn 1->Output is unchanged). Edge Patching has not been explicitly named in any prior work.
Path Patching
Path Patching refers to the intervention where an input to a path is replaced in the 'treeified' view of the model. The treeified view is a third way of thinking about the model where we separate each path from input to output. We can implement an equivalent intervention to the previous diagram as follows:
In the IOI paper, 'Path Patching' the edge Component 1 -> Component 2 means Path Patching all paths from Component 1 to Component 2 in which every component between Component 1 and Component 2 is an MLP[1]. However, it can be easy to confuse Edge Patching and Path Patching, because if we instead patch only the paths that pass directly from Component 1 to Component 2 (with no components in between), this is equivalent to Edge Patching the edge Component 1->Component 2.
Edge Patching all of the edges which have some node as source is equivalent to Node Patching that node. AutoCircuit does not implement Path Patching, which is much more expensive in general. However, as explained in the appendix, Path Patching is sometimes equivalent to Edge Patching.
Fast Edge Patching
We perform two steps.
First we gather the activations that we want to patch into the model. There are many ways to do this, depending on what type of patching you want to do. If we just want to do zero ablation, then we don't even need to run the model. But let's assume we want to patch in activations from a different, corrupt input. We create a tensor,
Patch Activations, to store the outputs of the source of each edge and we write to the tensor during the forward pass. Each source component has a row in the tensor, so the shape is
[n_sources, batch, seq, d_model].[2]
Now we run the forward pass in which we actually do the patching. We write the outputs of each edge source to a different tensor,
Current Activations, of the same shape as
Patch Activations. When we get to the input of the destination component of the edge we want to patch, we add the difference between the rows of
Patch Activations and
Current Activations corresponding to the edge's source component output.
This works because the difference in input to the edge destination is equal to the difference in output of the source component.[3] Now it's straightforward to extend this to patching multiple edges at once by subtracting the entire
Current Activations tensor from the entire
Patch Activations tensor and multiplying by a
Mask tensor of shape
[n_sources] that has a single value for each input edge.
By creating a
Mask tensor for each destination node w...
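Putting the pieces above together, here is a condensed sketch of the difference-and-mask computation for a single destination node (my own simplification for illustration, not AutoCircuit's actual code; shapes follow the post):

import torch

# Stored activations have shape [n_sources, batch, seq, d_model]; the Mask
# has one value per incoming edge of this destination node.
n_sources, batch, seq, d_model = 4, 2, 16, 64
patch_acts   = torch.randn(n_sources, batch, seq, d_model)  # from corrupt pass
current_acts = torch.randn(n_sources, batch, seq, d_model)  # from current pass
mask = torch.tensor([1.0, 0.0, 0.0, 1.0])  # 1 = patch this incoming edge

def patched_dest_input(dest_input, patch_acts, current_acts, mask):
    # Because the destination's input is a sum of source outputs, patching an
    # edge just adds the (patch - current) difference of its source output,
    # weighted by the mask, to the destination's input.
    diff = patch_acts - current_acts                    # [n_sources, b, s, d]
    delta = (mask.view(-1, 1, 1, 1) * diff).sum(dim=0)  # [b, s, d]
    return dest_input + delta

dest_input = torch.randn(batch, seq, d_model)
out = patched_dest_input(dest_input, patch_acts, current_acts, mask)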


