

The Nonlinear Library: LessWrong
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 22, 2024 • 27min
LW - On the CrowdStrike Incident by Zvi
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the CrowdStrike Incident, published by Zvi on July 22, 2024 on LessWrong.
Things went very wrong on Friday.
A bugged CrowdStrike update temporarily bricked quite a lot of computers, bringing down such fun things as airlines, hospitals and 911 services.
It was serious out there.
Ryan Peterson: Crowdstrike outage has forced Starbucks to start writing your name on a cup in marker again and I like it.
What (Technically) Happened
My understanding is that it was a rather stupid bug: a NULL pointer dereference, from the memory-unsafe C++ language.
Zack Vorhies: Memory in your computer is laid out as one giant array of numbers. We represent these numbers here as hexadecimal, which is base 16, because it's easier to work with… for reasons.
The problem area? The computer tried to read memory address 0x9c (aka 156).
Why is this bad?
This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS.
So why is the program trying to read from memory address 0x9c? Well, because… programmer error.
It turns out that C++, the language CrowdStrike is using, likes to use address 0x0 as a special value to mean "there's nothing here" - don't try to access it or you'll die. Reading address 0x9c is what happens when code reads a field at some small offset from a pointer that is actually 0x0.
…
And what's bad about this is that this is a special program called a system driver, which has PRIVILEGED access to the computer. So the operating system is forced to, out of an abundance of caution, crash immediately.
This is what is causing the blue screen of death. A computer can recover from a crash in non-privileged code by simply terminating the program, but not a system driver. When your computer crashes, 95% of the time it's because of a crash in the system drivers.
If the programmer had done a check for NULL, or if they had used modern tooling that checks these sorts of things, it could have been caught. But somehow it made it into production and then got pushed as a forced update by CrowdStrike… OOPS!
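A minimal Python analogue of the bug pattern (the real bug was in C++ kernel-driver code; everything below is made up for illustration): code uses a "there's nothing here" value without checking for it first. In user space this is a recoverable error, which is exactly what a privileged driver does not get.

```python
class ChannelConfig:
    threshold = 42          # some field the code wants to read

def lookup_config(table: dict, key: str):
    return table.get(key)   # returns None ("there's nothing here") on a miss

config = lookup_config({}, "missing_entry")   # hypothetical missing entry

# The bug pattern: use the result without checking it first.
try:
    print(config.threshold)
except AttributeError as err:
    print("user-space code can recover:", err)

# The fix the post describes: check for the "nothing here" value up front.
if config is not None:
    print(config.threshold)
```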
Here is another technical breakdown.
A non-technical breakdown would be:
1. CrowdStrike is set up to run whenever you start the computer.
2. Then someone pushed an update to a ton of computers.
3. Which is something CrowdStrike was authorized to do.
4. The update contained a stupid bug that would have been caught if those involved had used standard practices and tests.
5. With the bug, it tries to access memory in a way that causes a crash.
6. Which also crashes the computer.
7. So you have to do a manual fix to each computer to get around this.
8. If this had been malicious it could probably have permawiped all the computers, or inserted Trojans, or other neat stuff like that.
9. So we dodged a bullet.
10. Also, your AI safety plan needs to take into account that this was the level of security mindset and caution at CrowdStrike, despite CrowdStrike having this level of access and being explicitly in the security mindset business, and that they were given this level of access to billions of computers, and that their stock was only down 11% on the day so they probably keep most of that access and we aren't going to fine them out of existence either.
Yep.
Who to Blame?
George Kurtz (CEO CrowdStrike): CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.
We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website. We further recommend organizations ensure they're communicating with CrowdStrike representatives through official channels. Our team is fully mobilized to ensure the security and stability of CrowdStrike customers.
Dan Elton: No apology. Many people have...

Jul 21, 2024 • 14min
LW - A simple model of math skill by Alex Altair
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A simple model of math skill, published by Alex Altair on July 21, 2024 on LessWrong.
I've noticed that when trying to understand a math paper, there are a few different ways my skill level can be the blocker. Some of these ways line up with some typical levels of organization in math papers:
Definitions: a formalization of the kind of objects we're even talking about.
Theorems: propositions on what properties are true of these objects.
Proofs: demonstrations that the theorems are true of the objects, using known and accepted previous theorems and methods of inference.
Understanding a piece of math will require understanding each of these things in order. It can be very useful to identify which type of thing I'm stuck on, because the different types can require totally different strategies.
Beyond reading papers, I'm also trying to produce new and useful mathematics. Each of these three levels has another associated skill of generating them. But it seems to me that the generating skills go in the opposite order.
This feels like an elegant mnemonic to me, although of course it's a very simplified model. Treat every statement below as a description of the model, and not a claim about the totality of doing mathematics.
Understanding
Understanding these more or less has to go in the above order, because proofs are of theorems, and theorems are about defined objects. Let's look at each level.
Definitions
You might think that definitions are relatively easy to understand. That's usually true in natural languages; you often already have the concept, and you just don't happen to know that there's already a word for that.
Math definitions are sometimes immediately understandable. Everyone knows what a natural number is, and even the concept of a prime number isn't very hard to understand. I get the impression that in number theory, the proofs are often the hard part, where you have to come up with some very clever techniques to prove theorems that high schoolers can understand (Fermat's last theorem, the Collatz conjecture, the twin primes conjecture).
In contrast, in category theory, the definitions are often hard to understand. (Not because they're complicated per se, but because they're abstract.) Once you understand the definitions, then understanding proofs and theorems can be relatively immediate in category theory.
Sometimes the definitions have an immediate intuitive understanding, and the hard part is understanding exactly how the formal definition is a formalization of your intuition. In a calculus class, you'll spend quite a long time understanding the derivative and integral, even though they're just the slope of the tangent and the area under the curve, respectively.
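For reference, the two formal definitions being gestured at here are the standard ones (stated in my own notation, not quoted from the post):

```latex
% Derivative: the slope of the tangent line to f at x
f'(x) \;=\; \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

% Definite integral: the (signed) area under the curve on [a, b],
% as a limit of Riemann sums
\int_a^b f(x)\,dx \;=\; \lim_{n \to \infty} \sum_{i=1}^{n} f\!\left(a + i\,\frac{b-a}{n}\right)\frac{b-a}{n}
```

The gap the post describes is exactly the work of seeing why these limits formalize "slope of the tangent" and "area under the curve".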
You also might think that definitions were mostly in textbooks, laid down by Euclid or Euler or something. At least in the fields that I'm reading papers from, it seems like most papers have definitions (usually multiple). This is probably especially true for papers that are trying to help form a paradigm. In those cases, the essential purpose of the paper is to propose the definitions as the new paradigm, and the theorems are set forth as arguments that those definitions are useful.
Theorems
Theorems are in some sense the meat of mathematics. They tell you what you can do with the objects you've formalized. If you can't do anything meaty with an object, then you're probably holding the wrong object.
Once you understand the objects of discussion, you have to understand what the theorem statement is even saying. I think this tends to be more immediate, especially because often, all the content has been pushed into the definitions, and the theorem will be a simpler linking statement, like "all As are Bs" or "All As can be decomposed into a B and a C".
For example, the fundamental theorem of calculus...

Jul 21, 2024 • 2min
LW - Why Georgism Lost Its Popularity by Zero Contradictions
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Georgism Lost Its Popularity, published by Zero Contradictions on July 21, 2024 on LessWrong.
Henry George's 1879 book Progress & Poverty was the second best-selling book in the entire world during the 1880s and 1890s, outsold only by the Bible. Nobody knows exactly how many copies it sold during those two decades since nobody was keeping track, but it certainly sold at least several million copies. The Progressive Era is literally named after the book.
Georgism used to have millions of followers, and many of them were very famous people. When Henry George died in 1897 (just a few days before the election for New York City mayor), an estimated 100,000 people attended his funeral.
The mid-20th century labor economist and journalist George Soule wrote that George was "By far the most famous American economic writer," and "author of a book which probably had a larger world-wide circulation than any other work on economics ever written."
Few people know it, but the board game Monopoly and its predecessor The Landlord's Game were actually created to promote the economic theories of Henry George, as noted in the second paragraph of the introduction to the Wikipedia article on Monopoly. The games were intended to show that economies that eliminate rent-seeking are better than ones that don't.
So if Georgism used to have millions of supporters and solid economic reasoning, why did it never catch on and how did it lose its popularity over the past century?
(see the rest of the post in the link)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 21, 2024 • 9min
LW - (Approximately) Deterministic Natural Latents by johnswentworth
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: (Approximately) Deterministic Natural Latents, published by johnswentworth on July 21, 2024 on LessWrong.
Background: Natural Latents: The Math, Natural Latents: The Concepts, Why Care About Natural Latents?, the prototypical semantics use-case. This post does not assume that you've read all of those, or even any of them.
Suppose I roll a biased die 1000 times, and then roll the same biased die another 1000 times. Then...
Mediation: The first 1000 rolls are approximately independent of the second 1000 given the bias (to reasonable precision).
Redundancy: I can estimate the die's bias (to reasonable precision) with high confidence from either the first or second 1000 rolls.
The die's bias is therefore a natural latent, which means it has various nice properties.
Minimality: The bias is the smallest summary of all the information about the first 1000 rolls relevant to the second 1000 (and vice-versa).
Maximality: The bias is the largest piece of information which can be calculated from the first 1000 rolls and also can separately be calculated from the second 1000 rolls.
Any other variable which satisfies the above properties must tell us (approximately) the same information about the die rolls as the bias.
Furthermore, the bias is a(n approximate) deterministic natural latent: the die's bias (to reasonable precision) is approximately determined by[1] the first 1000 die rolls, and also approximately determined by the second 1000 die rolls. That implies one more nice property:
Uniqueness: The bias is the unique-up-to(-approximate)-isomorphism latent which has the above properties, making it a natural Schelling point for communication between agents.
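A quick simulation of the die example (my own sketch, using made-up bias values) makes the redundancy condition concrete: either half of the rolls pins down essentially the same estimate of the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
bias = np.array([0.30, 0.10, 0.10, 0.10, 0.10, 0.30])  # a biased six-sided die

rolls_1 = rng.choice(6, size=1000, p=bias)  # first 1000 rolls
rolls_2 = rng.choice(6, size=1000, p=bias)  # second 1000 rolls

# Redundancy: either batch alone estimates the bias to reasonable precision.
est_1 = np.bincount(rolls_1, minlength=6) / 1000
est_2 = np.bincount(rolls_2, minlength=6) / 1000
print(np.round(est_1, 2))
print(np.round(est_2, 2))

# Mediation: given the bias, the two batches are generated independently,
# so once you know (an estimate of) the bias, one batch tells you
# essentially nothing more about the other.
```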
We've proven all that before, mostly in Natural Latents: The Math (including the addendum added six months after the rest of the post). But it turns out that the math is a lot shorter and simpler, and easily yields better bounds, if we're willing to assume (approximate) determinism up-front. That does lose us some theoretical tools (notably the resampling construction), but it gives a cleaner foundation for our expected typical use cases (like e.g. semantics). The goal of this post is to walk through that math.
Background Tool: Determinism in Diagrams
We're going to use diagrammatic proofs, specifically using Bayes nets. But it's non-obvious how to express (approximate) determinism using Bayes nets, or what rules diagrams follow when determinism is involved, so we'll walk through that first.
This diagram says that Y is (approximately) determined by X:
Intuitively, the literal interpretation of the diagram is: X mediates between Y and Y, i.e. Y itself tells me nothing more about Y once I already know X. That only makes sense if X tells me everything there is to know about Y, i.e. Y is determined by X.
In the approximate case, we express the approximation error of the diagram as a KL-divergence, same as usual:
ϵ := D_KL( P[X=x, Y=y, Y=y'] || P[X=x] P[Y=y|X=x] P[Y=y'|X=x] )
If you get confused later about what it means to have two copies of the same variable in a diagram, go back to that line; that's the definition of the approximation error of the diagram. (One way to view that definition: there are actually two variables Y and Y', but P says that Y and Y' always have the same value.)
That approximation error simplifies:
D_KL( P[X=x, Y=y, Y=y'] || P[X=x] P[Y=y|X=x] P[Y=y'|X=x] )
= D_KL( P[X=x, Y=y] I[y=y'] || P[X=x] P[Y=y|X=x] P[Y=y'|X=x] )
= Σ_{x,y,y'} P[X=x, Y=y] I[y=y'] ( log(P[X=x, Y=y] I[y=y']) − log(P[X=x] P[Y=y|X=x] P[Y=y'|X=x]) )
= Σ_{x,y} P[X=x, Y=y] ( log(P[X=x, Y=y]) − log(P[X=x] P[Y=y|X=x] P[Y=y|X=x]) )
= −Σ_{x,y} P[X=x, Y=y] log(P[Y=y|X=x])
= H(Y|X)
So the diagram says Y is determined by X, and the approximation error of the diagram is the entropy H of Y given X - i.e. the number of bits required on average to specify Y once one already knows X. Very intuitive!
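Here is a small numerical sanity check of that identity (my own sketch, with an arbitrary joint distribution): the KL-divergence defined above comes out equal to H(Y|X).

```python
import numpy as np

P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])            # arbitrary joint P[X=x, Y=y]
P_x = P_xy.sum(axis=1)
P_y_given_x = P_xy / P_x[:, None]

# The diagram's distribution is Q[x, y, y'] = P[x] P[y|x] P[y'|x];
# the true distribution puts all mass on y' = y.
kl = 0.0
for x in range(2):
    for y in range(2):
        p = P_xy[x, y]                                   # mass on (x, y, y'=y)
        q = P_x[x] * P_y_given_x[x, y] * P_y_given_x[x, y]
        kl += p * (np.log(p) - np.log(q))

H_y_given_x = -np.sum(P_xy * np.log(P_y_given_x))
print(kl, H_y_given_x)                   # the two numbers agree
```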
The Dangly Bit Lemma
Intuitiv...

Jul 19, 2024 • 28min
LW - Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions by Lidor Banuel Dabbah
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions, published by Lidor Banuel Dabbah on July 19, 2024 on LessWrong.
Tl;dr: In this post we present the exploratory phase of a project aiming to study neural networks by applying static LLC estimation to specific alterations of them. We introduce a new method named Feature Targeted (FT) LLC estimation and study its ability to distinguish SAE trained features from random directions. By comparing our method to other possible metrics, we demonstrate that it outperforms all of them but one, which has comparable performance.
We discuss possible explanations to our results, our project and other future directions.
Introduction
Given a neural network M and a latent layer within it, L, a central motif in current mechanistic interpretability research is to find functions f: L → R [1] which are features of the model. Features are (generally) expected to exhibit the following properties:
1. Encode interpretable properties of the input.
2. Be causally relevant to the computation of the output of the model.
3. Encode the output of a certain submodule of our model M, i.e. a component, localized in weight space, which is responsible for a specific part of the total computation.
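For concreteness, here is a minimal sketch of what such a function f: L → R often looks like when it comes from a sparse autoencoder: an affine read-off of the latent activation followed by a ReLU. The direction, bias, and dimensions below are made up for illustration; this is not the post's construction.

```python
import numpy as np

d_model = 512
latent = np.random.randn(d_model)            # an activation vector from layer L

# Hypothetical SAE-style feature: direction w and bias b learned by the SAE.
w = np.random.randn(d_model) / np.sqrt(d_model)
b = -0.1

def feature(x: np.ndarray) -> float:
    """f: L -> R, the feature's activation on a latent vector x."""
    return float(max(0.0, w @ x + b))        # ReLU(w·x + b)

print(feature(latent))
```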
While this is common wisdom, methods for automated feature evaluation usually focus on correlations between the (top) activations of the feature and human- (or machine-) recognizable interpretations, or on the effect of feature-related interventions on the output of the model.
In particular, while the first and second items of the feature characterization above are central in current techniques, the third property, specifically the localized nature of the computation upstream of the feature, is less so[2].
We are currently investigating a direction which fills that gap, and this post shares the findings of the exploratory research we have conducted to validate and inform our approach. More specifically, we operationalized the concept of "weight-localized computation" using the local learning coefficient (LLC) introduced in Lau et al, following the learning coefficient first introduced in the context of singular learning theory.
We apply LLC estimation to models associated with our base model and a feature within it, a method we call feature targeted (FT) LLC estimation. In this exploratory work we study FT-LLC estimates of specific models associated with SAE features. Most notably, we have found that:
1. FT-LLC estimates of SAE features are, on average, distinguishably higher than those of random directions.
2. For a particular variant of FT-LLC estimation, which we named the functional FT-LLC (defined in this section), this separation is pronounced enough that the vast majority of SAE features we studied are clearly separated from the random directions we studied. Furthermore, most baseline metrics we compared it to (see here) are less capable of distinguishing SAE features from random directions, with only one performing on par with it.
Section 1 introduces the main technique we study in this post, FT-LLC estimation, and section 2 outlines our motivations. Section 3 describes the details of our experimental setting, our results, and the comparison to baseline metrics. In section 4 we discuss our overall takes, how they fit within our general agenda and gaps we currently have in theoretically understanding them.
Section 5 is devoted to outlining our next steps, the general direction of the project, and some other possible directions for further research. Lastly, we briefly discuss related work in section 6.
What is FT-LLC?
LLC estimation
We start out by briefly recalling what the local learning coefficient (LLC) is. If you are unfamiliar with the term, we recommend reading this, the longer sequence here, or the paper on LLC estimation ...

Jul 19, 2024 • 10min
LW - How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation") by Ruby
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation"), published by Ruby on July 19, 2024 on LessWrong.
AI Alignment is my motivating context but this could apply elsewhere too.
The nascent field of AI Alignment research is pretty happening these days. There are multiple orgs and dozens to low hundreds of full-time researchers pursuing approaches to ensure AI goes well for humanity. Many are heartened that there's at least some good research happening, at least in the opinion of some of the good researchers. This is reason for hope, I have heard.
But how do we know whether or not we have produced "good research?"
I think there are two main routes to determining that research is good, and yet only one applies in the research field of aligning superintelligent AIs.
"It's good because it works"
The first and better way to know that your research is good is that it allows you to accomplish some goal you care about[1]. Examples:
My work on efficient orbital mechanics calculation is good because it successfully lets me predict the trajectory of satellites.
My work on the disruption of cell signaling in malign tumors is good because it helped me develop successful anti-cancer vaccines.
My work on solid-state physics is good because it allowed me to produce superconductors at a higher temperature and lower pressure than previously attained.[2]
In each case, there's some outcome I care about pretty inherently for itself, and if the research helps me attain that outcome it's good (or conversely if it doesn't, it's bad). The good researchers in my field are those who have produced a bunch of good research towards the aims of the field.
Sometimes it's not clear-cut. Perhaps I figured out some specific cell signaling pathways that will be useful if it turns out that cell signaling disruption in general is useful, and that's TBD on therapies currently being trialed and we might not know how good (i.e. useful) my research was for many more years. This actually takes us into what I think is the second meaning of "good research".
"It's good because we all agree it's good"
If our goal is successfully navigating the creation of superintelligent AI in a way such that humans are happy with the outcome, then it is too early to properly score existing research on how helpful it will be. No one has aligned a superintelligence. No one's research has contributed to the alignment of an actual superintelligence.
At this point, the best we can do is share our predictions about how useful research will turn out to be. "This is good research" = "I think this research will turn out to be helpful". "That person is a good researcher" = "That person produces much research that will turn out to be useful and/or has good models and predictions of which research will turn out to help".
To talk about the good research that's being produced is simply to say that we have a bunch of shared predictions that there exists research that will eventually help. To speak of the "good researchers" is to speak of the people whose work lots of people agree is likely helpful and whose opinions are likely correct.
Someone might object that there's empirical research that we can see yielding results in terms of interpretability/steering or demonstrating deception-like behavior and similar. While you can observe an outcome there, that's not the outcome we really care about (aligning superintelligent AI), and the relevance of this work is still just prediction. It's like being successful at certain kinds of cell signaling modeling before we're confident that's a useful approach.
More like "good" = "our community pagerank Eigen-evaluation of research rates this research highly"
It's a little bit interesting to unpack "agreeing that some research is good". Obviously, not everyone's opinion matters ...
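To make the "eigen-evaluation" framing concrete, here is a toy sketch (my own, with made-up numbers): researchers endorse one another's judgment, and a PageRank-style power iteration turns those endorsements into scores, so standing comes from being rated highly by people who are themselves rated highly.

```python
import numpy as np

# Hypothetical endorsement matrix: entry (i, j) is how much researcher i
# endorses researcher j's judgment. Numbers are made up for illustration.
E = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.0]])

# Column-normalize so each researcher distributes one unit of endorsement.
M = E / E.sum(axis=0, keepdims=True)

scores = np.ones(3) / 3
for _ in range(100):        # power iteration toward the principal eigenvector
    scores = M @ scores
    scores /= scores.sum()

print(np.round(scores, 3))  # "eigen-evaluated" reputations
```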

Jul 19, 2024 • 2min
LW - Linkpost: Surely you can be serious by kave
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Linkpost: Surely you can be serious, published by kave on July 19, 2024 on LessWrong.
Adam Mastroianni writes about "actually caring about stuff, and for the right reasons", rather than just LARPing. The opening is excerpted below.
I once saw someone give a talk about a tiny intervention that caused a gigantic effect, something like, "We gave high school seniors a hearty slap on the back and then they scored 500 points higher on the SAT."
Everyone in the audience was like, "Hmm, interesting, I wonder if there were any gender effects, etc."
I wanted to get up and yell: "EITHER THIS IS THE MOST POTENT PSYCHOLOGICAL INTERVENTION EVER, OR THIS STUDY IS TOTAL BULLSHIT."
If those results are real, we should start a nationwide backslapping campaign immediately. We should be backslapping astronauts before their rocket launches and Olympians before their floor routines. We should be running followup studies to see just how many SAT points we can get - does a second slap get you another 500? Or just another 250? Can you slap someone raw and turn them into a genius?
Or - much more likely - the results are not real, and we should either be a) helping this person understand where they screwed up in their methods and data analysis, or b) kicking them out for fraud.
Those are the options. Asking a bunch of softball questions ("Which result was your favorite?") is not a reasonable response. That's like watching someone pull a rabbit out of a hat actually for real, not a magic trick, and then asking them, "What's the rabbit's name?"
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 18, 2024 • 1h 23min
LW - AI #73: Openly Evil AI by Zvi
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #73: Openly Evil AI, published by Zvi on July 18, 2024 on LessWrong.
What do you call a clause explicitly saying that you waive the right to whistleblower compensation, and that you need to get permission before sharing information with government regulators like the SEC?
I have many answers.
I also know that OpenAI, having f***ed around, seems poised to find out, because that is the claim made by whistleblowers to the SEC. Given the SEC fines you for merely not making an explicit exception to your NDA for whistleblowers, what will they do once aware of explicit clauses going the other way?
(Unless, of course, the complaint is factually wrong, but that seems unlikely.)
We also have rather a lot of tech people coming out in support of Trump. I go into the reasons why, which I do think is worth considering. There is a mix of explanations, and at least one very good reason.
Then I also got suckered into responding to a few new (well, not really new, but renewed) disingenuous attacks on SB 1047. The entire strategy is to be loud and hyperbolic, especially on Twitter, and either hallucinate or fabricate a different bill with different consequences to attack, or simply misrepresent how the law works, then use that to create the illusion the bill is unliked or harmful.
Few others respond to correct such claims, and I constantly worry that the strategy might actually work. But that does not mean you, my reader who already knows, need to read all that.
Also a bunch of fun smaller developments. Karpathy is in the AI education business.
Table of Contents
1. Introduction.
2. Table of Contents.
3. Language Models Offer Mundane Utility. Fight the insurance company.
4. Language Models Don't Offer Mundane Utility. Have you tried using it?
5. Clauding Along. Not that many people are switching over.
6. Fun With Image Generation. Amazon Music and K-Pop start to embrace AI.
7. Deepfaketown and Botpocalypse Soon. FoxVox, turn Fox into Vox or Vox into Fox.
8. They Took Our Jobs. Take away one haggling job, create another haggling job.
9. Get Involved. OpenPhil request for proposals. Job openings elsewhere.
10. Introducing. Karpathy goes into AI education.
11. In Other AI News. OpenAI's Q* is now named Strawberry. Is it happening?
12. Denying the Future. Projects of the future that think AI will never improve again.
13. Quiet Speculations. How to think about stages of AI capabilities.
14. The Quest for Sane Regulations. EU, UK, The Public.
15. The Other Quest Regarding Regulations. Many in tech embrace The Donald.
16. SB 1047 Opposition Watch (1). I'm sorry. You don't have to read this.
17. SB 1047 Opposition Watch (2). I'm sorry. You don't have to read this.
18. Open Weights are Unsafe and Nothing Can Fix This. What to do about it?
19. The Week in Audio. Joe Rogan talked to Sam Altman and I'd missed it.
20. Rhetorical Innovation. Supervillains, oh no.
21. Oh Anthropic. More details available, things not as bad as they look.
22. Openly Evil AI. Other things, in other places, on the other hand, look worse.
23. Aligning a Smarter Than Human Intelligence is Difficult. Noble attempts.
24. People Are Worried About AI Killing Everyone. Scott Adams? Kind of?
25. Other People Are Not As Worried About AI Killing Everyone. All glory to it.
26. The Lighter Side. A different kind of mental gymnastics.
Language Models Offer Mundane Utility
Let Claude write your prompts for you. He suggests using the Claude prompt improver.
Sully: convinced that we are all really bad at writing prompts
I'm personally never writing prompts by hand again
Claude is just too good - managed to feed it evals and it just optimized for me
Probably a crude version of dspy but insane how much prompting can make a difference.
Predict who will be the shooting victim. A machine learning model did this for citizens of Chicago (a ...

Jul 18, 2024 • 40min
LW - Individually incentivized safe Pareto improvements in open-source bargaining by Nicolas Macé
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Individually incentivized safe Pareto improvements in open-source bargaining, published by Nicolas Macé on July 18, 2024 on LessWrong.
Summary
Agents might fail to peacefully trade in high-stakes negotiations. Such bargaining failures can have catastrophic consequences, including great power conflicts, and AI flash wars. This post is a distillation of DiGiovanni et al. (2024) (DCM), whose central result is that agents that are sufficiently transparent to each other have individual incentives to avoid catastrophic bargaining failures.
More precisely, DCM constructs strategies that are plausibly individually incentivized, and, if adopted by all, guarantee each player no less than their least preferred trade outcome. Figure 0 below illustrates this.
This result is significant because artificial general intelligences (AGIs) might (i) be involved in high-stakes negotiations, (ii) be designed with the capabilities required for the type of strategy we'll present, and (iii) bargain poorly by default (since bargaining competence isn't necessarily a direct corollary of intelligence-relevant capabilities).
Introduction
Early AGIs might fail to make compatible demands with each other in high-stakes negotiations (we call this a "bargaining failure"). Bargaining failures can have catastrophic consequences, including great power conflicts, or AI triggering a flash war. More generally, a "bargaining problem" is when multiple agents need to determine how to divide value among themselves.
Early AGIs might possess insufficient bargaining skills because intelligence-relevant capabilities don't necessarily imply these skills: For instance, being skilled at avoiding bargaining failures might not be necessary for taking over. Another problem is that there might be no single rational way to act in a given multi-agent interaction. Even arbitrarily capable agents might have different priors, or different approaches to reasoning under bounded computation.
Therefore they might fail to solve equilibrium selection, i.e., make incompatible demands (see Stastny et al. (2021) and Conitzer & Oesterheld (2023)). What, then, are sufficient conditions for agents to avoid catastrophic bargaining failures?
Sufficiently advanced AIs might be able to verify each other's decision algorithms (e.g. via verifying source code), as studied in open-source game theory. This has both potential downsides and upsides for bargaining problems. On one hand, transparency of decision algorithms might make aggressive commitments more credible and thus more attractive (see Sec. 5.2 of Dafoe et al. (2020) for discussion).
On the other hand, agents might be able to mitigate bargaining failures by verifying cooperative commitments.
Oesterheld & Conitzer (2022)'s safe Pareto improvements[1] (SPI) leverages transparency to reduce the downsides of incompatible commitments.
In an SPI, agents conditionally commit to change how they play a game relative to some default such that everyone is (weakly) better off than the default with certainty.[2] For example, two parties A and B who would otherwise go to war over some territory might commit to, instead, accept the outcome of a lottery that allocates the territory to A with the probability that A would have won the war (assuming this probability is common knowledge). See also our extended example below.
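A toy quantification of that lottery example (my own numbers, purely illustrative, and distinct from the post's extended example): war destroys value for both sides, while the lottery yields the same expected territorial split without the destruction, so both parties are better off in expectation under the SPI.

```python
# Toy model of the territory dispute. All numbers are hypothetical.
territory_value = 100.0   # value of the territory to whoever ends up with it
war_cost = 30.0           # value each side burns by fighting
p_A_wins = 0.6            # common-knowledge probability that A wins a war

# Default: go to war.
war_payoff_A = p_A_wins * territory_value - war_cost        # 30.0
war_payoff_B = (1 - p_A_wins) * territory_value - war_cost  # 10.0

# SPI: a lottery that gives A the territory with probability p_A_wins.
lottery_payoff_A = p_A_wins * territory_value               # 60.0
lottery_payoff_B = (1 - p_A_wins) * territory_value         # 40.0

print("war:    ", war_payoff_A, war_payoff_B)
print("lottery:", lottery_payoff_A, lottery_payoff_B)
```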
Oesterheld & Conitzer (2022) has two important limitations: First, many different SPIs are in general possible, such that there is an "SPI selection problem", similar to the equilibrium selection problem in game theory (Sec. 6 of Oesterheld & Conitzer (2022)).
And if players don't coordinate on which SPI to implement, they might fail to avoid conflict.[3] Second, if expected utility-maximizing agents need to individually adopt strategies to implement an SPI, it's unclear what conditions...

Jul 18, 2024 • 25min
LW - Mech Interp Lacks Good Paradigms by Daniel Tan
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Lacks Good Paradigms, published by Daniel Tan on July 18, 2024 on LessWrong.
Note: I wrote this post rather quickly as an exercise in sharing rough / unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work / perspectives, please feel free!
Note 2: I originally sent this link to some people for feedback, but I was having trouble viewing the comments on the draft. The post was also in a reasonably complete state, so I decided to just publish it - and now I can see the comments! If you're one of those people, feedback is still very much welcome!
Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible, requiring comparatively few computational resources, and facilitates rapid learning, due to a very short feedback loop. This means that many junior researchers' first foray into AI safety research is in MI (myself included); indeed, this occurs to the extent that some people feel MI is over-subscribed relative to other technical agendas.
However, how useful is this MI research?
A very common claim on MI's theory of impact (ToI) is that MI helps us advance towards a "grand unifying theory" (GUT) of deep learning. One of my big cruxes for this ToI is whether MI admits "paradigms" which facilitate correct thinking and understanding of the models we aim to interpret.
In this post, I'll critically examine several leading candidates for "paradigms" in MI, consider the available evidence for / against, and identify good future research directions (IMO). At the end, I'll conclude with a summary of the main points and an overview of the technical research items I've outlined.
Towards a Grand Unifying Theory (GUT) with MI
Proponents of this argument believe that, by improving our basic understanding of neural nets, MI yields valuable insights that can be used to improve our agents, e.g. by improving architectures or by improving their training processes. This allows us to make sure future models are safe and aligned.
Some people who have espoused this opinion:
Richard Ngo has argued here that MI enables "big breakthroughs" towards a "principled understanding" of deep learning.
Rohin Shah has argued here that MI builds "new affordances" for alignment methods.
Evan Hubinger has argued for MI here because it helps us identify "unknown unknowns".
Leo Gao argues here that MI aids in "conceptual research" and "gets many bits" per experiment.
As a concrete example of work that I think would not have been possible without fundamental insights from MI: steering vectors, a.k.a. representation engineering, and circuit breakers, which were obviously inspired by the wealth of work in MI demonstrating the linear representation hypothesis.
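As a rough illustration of what an MI-inspired intervention like activation steering looks like in code (a minimal sketch; the shapes, names, and the random "direction" below are made up - in practice the direction would be extracted from model activations):

```python
import numpy as np

d_model = 768
hidden = np.random.randn(10, d_model)   # hypothetical residual-stream activations for 10 tokens

# Hypothetical steering direction; in practice it might be extracted as a
# difference of mean activations between two contrasting prompt sets.
steering_vector = np.random.randn(d_model)
steering_vector /= np.linalg.norm(steering_vector)

def steer(activations: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled steering direction to every token's activation."""
    return activations + alpha * direction

steered = steer(hidden, steering_vector, alpha=4.0)
print(steered.shape)   # (10, 768)
```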
It's also important to remember that the value of fundamental science often seems much lower in hindsight, because humans quickly adjust their perspectives. Even if MI insights seem like common sense to us nowadays, their value in enabling significant advances can't be overstated.
(Aside) A corollary of this argument is that MI could likely have significant capabilities externalities. Becoming better at building powerful and instruction-aligned agents may inadvertently accelerate us towards AGI. This point has been made in depth elsewhere, so I won't elaborate further here.
A GUT Needs Paradigms
Paradigm - an overarching framework for thinking about a field
In his seminal book, The Structure of Scientific Revolutions, Thomas Kuhn catalogues scientific progress in many different fields (spanning physics, chemistry, biology), and distills general trends about how these fields progress. Central to his analysis is the notion of a "paradigm" - an overarching framework for th...