

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Aug 5, 2024 • 20min
AF - Self-explaining SAE features by Dmitrii Kharlapenko
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-explaining SAE features, published by Dmitrii Kharlapenko on August 5, 2024 on The AI Alignment Forum.
TL;DR
We apply the method of SelfIE/Patchscopes to explain SAE features - we give the model a prompt like "What does X mean?", replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation.
The natural alternative is auto-interp, using a larger LLM to spot patterns in max activating examples. We show that our method is effective, and comparable with Neuronpedia's auto-interp labels (with the caveat that Neuronpedia's auto-interp used the comparatively weak GPT-3.5 so this is not a fully fair comparison).
We aren't confident you should use our method over auto-interp, but we think in some situations it has advantages: no max activating dataset examples are needed, and it's cheaper as you just run the model being studied (e.g. Gemma 2B), not a larger model like GPT-4.
Further, it has different errors to auto-interp, so finding and reading both may be valuable for researchers in practice.
We provide advice for using self-explanation in practice, in particular for the challenge of automatically choosing the right scale, which significantly affects explanation quality.
We also release a tool for you to work with self-explanation.
We hope the technique is useful to the community as is, but expect there are many optimizations and improvements on top of what is in this post.
Introduction
This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Neel Nanda and Arthur Conmy.
SAE features promise a flexible and extensive framework for interpretation of LLM internals. Recent work (like Scaling Monosemanticity) has shown that they are capable of capturing even high-level abstract concepts inside the model. Compared to MLP neurons, they can capture many more interesting concepts.
Unfortunately, in order to learn things with SAE features and interpret what the SAE tells us, one needs to first interpret these features on their own. The current mainstream method for their interpretation requires storing the feature's activations on millions of tokens, filtering for the prompts that activate it the most, and looking for a pattern connecting them. This is typically done by a human, or sometimes somewhat automated with the use of larger LLMs like ChatGPT, aka auto-interp. Auto-interp is a useful and somewhat effective method, but requires an extensive amount of data and expensive closed-source language model API calls (for researchers outside scaling labs).
Recent papers like SelfIE or Patchscopes have proposed a mechanistic method of directly utilizing the model in question to explain its own internal activations in natural language. It is an approach that replaces an activation during the forward pass (e.g. some of the token embeddings in the prompt) with a new activation and then makes the model generate explanations using this modified prompt.
It's a variant of activation patching, with the notable differences that it generates a many-token output (rather than a single token), and that the patched-in activation may not be the same type as the activation it's overriding (and is just an arbitrary vector of the same dimension). We study how this approach can be applied to SAE feature interpretation, since it:
Is potentially cheaper and does not require large closed-model inference
Can be viewed as more faithful to the source, since it uses the SAE feature vectors directly to generate explanations instead of looking at the max activating examples
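To make the setup concrete, here is a minimal sketch of that kind of residual-stream patching in TransformerLens-style code; the model name, layer, scale, placeholder position, and the random stand-in for an SAE decoder direction are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal sketch of self-explanation via residual-stream patching (illustrative,
# not the authors' exact code). A random vector stands in for a real SAE decoder column.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2b")

prompt = 'What does "X" mean? It means'
tokens = model.to_tokens(prompt)
x_pos = 3       # position of the placeholder token "X" (depends on the tokenizer)
layer = 12      # residual-stream layer to patch (an arbitrary illustrative choice)
scale = 20.0    # the decoder direction is multiplied by some scale before patching

decoder_direction = torch.randn(model.cfg.d_model)  # stand-in for a real SAE feature

def patch_resid(resid, hook):
    # Overwrite the residual stream at the placeholder position with the scaled
    # feature direction; only fires on the full-prompt pass (later cached passes
    # have sequence length 1, so the condition is false).
    if resid.shape[1] > x_pos:
        resid[:, x_pos, :] = scale * decoder_direction.to(resid.device)
    return resid

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_resid)]):
    out = model.generate(tokens, max_new_tokens=30)

print(model.to_string(out))  # the model's "self-explanation" of the feature
```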
How to use
Basic method
We ask the model to explain the meaning of a residual stream direction as if it literally was a word or phrase:
Prompt 1 (/ replaced according to model inp...

Aug 5, 2024 • 21min
LW - Circular Reasoning by abramdemski
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Circular Reasoning, published by abramdemski on August 5, 2024 on LessWrong.
The idea that circular reasoning is bad is widespread. However, this reputation is undeserved. While circular reasoning should not be convincing (at least not usually), it should also not be considered invalid.
Circular Reasoning is Valid
The first important thing to note is that circular reasoning is logically valid. A implies A. If circular arguments are to be critiqued, it must be by some other standard than logical validity.
I think it's fair to say that the most relevant objection to circular arguments is that they are not very good at convincing someone who does not already accept the conclusion. You are talking to another person, and need to think about communicating with their perspective. Perhaps the reason circular arguments are a common 'problem' is because they are valid. People naturally think about what should be a convincing argument from their own perspective, rather than the other person's.
However, notice that this objection to circular reasoning assumes that one party is trying to convince the other. This is arguments-as-soldiers mindset.[1] If two people are curiously exploring each other's perspectives, then circular reasoning could be just fine!
Furthermore, I'll claim: circular arguments should actually be considered as a little bit of positive evidence for their positions!
Let's look at a concrete example. I don't think circular arguments are quite so simple as "A implies A"; the circle is usually a bit longer. So, consider a more realistic circular position:[2]
Alice: Why do you believe in God?
Bob: I believe in God based on the authority of the Bible.
Alice: Why do you believe what the Bible says?
Bob: Because the Bible was divinely inspired by God. God is all-knowing and good, so we can trust what God says.
Here we have a two-step loop, A->B and B->A. The arguments are still logically fine; if the Bible tells the truth, and the Bible says God exists, then God exists. If the Bible were divinely inspired by an all-knowing and benevolent God, then it is reasonable to conclude that the Bible tells the truth.
If Bob is just honestly going through his own reasoning here (as opposed to trying to convince Alice), then it would be wrong for Alice to call out Bob's circular reasoning as an error. The flaw in circular reasoning is that it doesn't convince anyone; but that's not what Bob is trying to do. Bob is just telling Alice what he thinks.
If Alice thinks Bob is mistaken, and wants to point out the problems in Bob's beliefs, it is better for Alice to contest the premises of Bob's arguments rather than contest the reasoning form. Pointing out circularity only serves to remind Bob that Bob hasn't given Alice a convincing argument.
You probably still think Bob has made some mistake in his reasoning, if these are his real reasons. I'll return to this later.
Circular Arguments as Positive Evidence
I claimed that circular arguments should count as a little bit of evidence in favor of their conclusions. Why?
Imagine that the Bible claimed itself to be written by an evil and deceptive all-knowing God, instead of a benign God:
Alice: Why do you believe in God?
Bob: Because the Bible tells me so.
Alice: Why do you believe the Bible?
Bob: Well... uh... huh.
Sometimes, belief systems are not even internally consistent. You'll find a contradiction[3] just thinking through the reasoning that is approved of by the belief system itself. This should make you disbelieve the thing.
Therefore, by the rule we call conservation of expected evidence, reasoning through a belief system and deriving a conclusion consistent with the premise you started with should increase your credence. It provides some evidence that there's a consistent hypothesis here; and consistent hypotheses should get some ...
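A toy numerical version of that argument, with made-up numbers, just to illustrate the conservation-of-expected-evidence point:

```python
# Toy illustration (not from the post): let E = "reasoning through the belief
# system yields no contradiction" and A = "the belief system is true".
p_E = 0.9              # prior probability the internal consistency check passes
p_A_given_not_E = 0.0  # finding a contradiction would refute A outright
p_A = 0.045            # prior credence in A

# Conservation of expected evidence: P(A) = P(A|E)P(E) + P(A|~E)P(~E),
# so P(A|E) = (P(A) - P(A|~E)(1 - P(E))) / P(E).
p_A_given_E = (p_A - p_A_given_not_E * (1 - p_E)) / p_E

print(p_A_given_E)  # 0.05 > 0.045: passing the check must raise credence a little
```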

Aug 5, 2024 • 4min
EA - Upcoming EA conferences in 2024 and 2025 by OllieBase
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Upcoming EA conferences in 2024 and 2025, published by OllieBase on August 5, 2024 on The Effective Altruism Forum.
We're very excited to announce our EA conference schedule for the rest of this year and the first half of 2025. EA conferences will be taking place for the first time in Nigeria, Cape Town, Bengaluru, and Toronto, and returning to Berkeley, Sydney, and Singapore.
EA Global: Boston 2024 applications are open, and close October 20.
EAGxIndia will be returning this year in a new location: Bengaluru. See their full announcement here.
EAGxAustralia has rebranded to EAGxAustralasia to represent the fact that many attendees will be from the wider region, especially New Zealand.
We're hiring the teams for both EAGxVirtual and EAGxSingapore. You can read more about the roles and how to apply here.
EA Global will be returning to the same venues in the Bay Area and London in 2025.
Here are the full details:
EA Global
EA Global: Boston 2024 | November 1-3 | Hynes Convention Center | applications close October 20
EA Global: Bay Area 2025 | February 21-23 | Oakland Marriott
EA Global: London 2025 | June 6-8 | Intercontinental London (the O2)
EAGx
EAGxToronto | August 16-18 | InterContinental Toronto Centre | application deadline just extended, they now close August 12
EAGxBerkeley | September 7-8 | Lighthaven | applications close August 20
EA Nigeria Summit | September 7-8 | Chida Event Center, Abuja
EAGxBerlin | September 13-15 | Urania, Berlin | applications close August 24
EA South Africa Summit | October 5 | Cape Town
EAGxIndia | October 19-20 | Conrad Bengaluru | applications close October 5
EAGxAustralasia | November 22-24 | Aerial UTS, Sydney | applications open
EAGxVirtual | November 15-17
EAGxSingapore | December 14-15 | Suntec Singapore
We're aiming to launch applications for events later this year as soon as possible. Please go to the event page links above to apply. If you'd like to add EAG(x) events directly to your Google Calendar, use this link.
Some notes on these conferences
EA Global conferences are run in-house by the CEA events team, whereas EAGx conferences (and EA summits) are organised independently by members of the EA community with financial support and mentoring from CEA.
EAGs have a high bar for admission and are for people who are very familiar with EA and are taking significant actions (e.g. full-time work or study) based on EA ideas.
Admissions for EAGx conferences and EA Summits are processed independently by the organizers. These events are primarily for those who are newer to EA and interested in getting more involved.
Please apply to all conferences you wish to attend - we would rather get too many applications for some conferences and recommend that applicants attend a different one than miss out on potential applicants to a conference.
We offer travel support to help attendees who are approved for an event but who can't afford to travel. You can apply for travel support as you submit your application. Travel support funds are limited (though will vary by event), and we can only accommodate a small number of requests.
Find more info on our website.
Feel free to email hello@eaglobal.org with any questions, or comment below. You can contact EAGx organisers using the format [location]@eaglobalx.org (e.g. berkeley@eaglobalx.org and berlin@eaglobalx.org).
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Aug 5, 2024 • 11min
LW - Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours by Seth Herd
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours, published by Seth Herd on August 5, 2024 on LessWrong.
Vitalik Buterin wrote an impactful blog post, My techno-optimism. I found this discussion of one aspect on 80,000 Hours much more interesting. The remainder of that interview is nicely covered in the host's EA Forum post.
My techno-optimism apparently appealed to both sides, e/acc and doomers. Buterin's approach to bridging that polarization was interesting. I hadn't understood before the extent to which anti-AI-regulation sentiment is driven by fear of centralized power. I hadn't thought about this risk before since it didn't seem relevant to AGI risk, but I've been updating to think it's highly relevant.
[this is automated transcription that's inaccurate and comically accurate by turns :)]
Rob Wiblin (the host) (starting at 20:49):
what is it about the way that you put the reasons to worry that that ensured that kind of everyone could get behind it
Vitalik Buterin:
[...] in addition to taking you know the case that AI is going to kill everyone seriously I the other thing that I do is I take the case that you know AI is going to take create a totalitarian World Government seriously [...]
[...] then it's just going to go and kill everyone but on the other hand if you like take some of these uh you know like very naive default solutions to just say like hey you know let's create a powerful org and let's like put all the power into the org then yeah you know you are creating the most like most powerful big brother from which There Is No Escape and which has you know control over the Earth and and the expanding light cone and you can't get out right and yeah I mean this is something
that like uh I think a lot of people find very deeply scary I mean I find it deeply scary um it's uh it is also something that I think realistically AI accelerates right
One simple takeaway is to recognize and address that motivation for anti-regulation and pro-AGI sentiment when trying to work with or around the e/acc movement. A second is whether to take that fear seriously.
Is centralized power controlling AI/AGI/ASI a real risk?
Vitalik Buterin is from Russia, where centralized power has been terrifying. This has been the case for roughly half of the world. Those who are concerned about the risks of centralized power (including Western libertarians) worry that AI increases that risk if it's centralized. This puts them in conflict with x-risk worriers on regulation and other issues.
I used to hold both of these beliefs, which allowed me to dismiss those fears:
1. AGI/ASI will be much more dangerous than tool AI, and it won't be controlled by humans
2. Centralized power is pretty safe (I'm from the West like most alignment thinkers).
Now I think both of these are highly questionable.
I've thought in the past that fears of AI are largely unfounded. The much larger risk is AGI. And that is an even larger risk if it's decentralized/proliferated. But I've become progressively more convinced that governments will take control of AGI before it's ASI. They don't need to build it, just show up and inform the creators that, as a matter of national security, they'll be making the key decisions about how it's used and aligned.[1]
If you don't trust Sam Altman to run the future, you probably don't like the prospect of Putin or Xi Jinping as world-dictator-for-eternal-life. It's hard to guess how many world leaders are sociopathic enough to have a negative empathy-sadism sum, but power does seem to select for sociopathy.
I've thought that humans won't control ASI, because it's value alignment or bust. There's a common intuition that an AGI, being capable of autonomy, will have its own goals, for good or ill. I think it's perfectly coherent for it...

Aug 5, 2024 • 6min
EA - On Owning Our EA Affiliation by Alix Pham
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Owning Our EA Affiliation, published by Alix Pham on August 5, 2024 on The Effective Altruism Forum.
Someone suggested I name this post "What We Owe The Community": I think it's a great title, but I didn't dare use it...
Views and mistakes my own.
What I believe
I think owning our EA affiliation - how we are inspired by the movement and the community - is net positive for the world and our careers. If more people were more outspoken about their alignment with EA principles and proximity to the EA community, we would all be better off. While there may be legitimate reasons for some individuals to not publicly identify as part of the EA movement, this can create a "free-rider problem".
If too many people choose to passively benefit from EA without openly supporting it, the overall movement and community may suffer from it.
Why I think more people should own their EA affiliation publicly
I understand why one doesn't, but I'd probably not support it in most cases - I say most cases, because some cases are exceptional. I'm also not necessarily saying that one needs to shout it everywhere, but simply be transparent about it.
The risks
These are the risks - actual or perceived - that I mostly hear about when people choose not to publicly own their EA identity:
People don't want to talk to you / take you seriously because you are affiliated with EA
You won't get some career opportunities because you are affiliated with EA
And I get it. It's scary to think two letters could shut some doors for potentially incorrect reasons.
A prisoner's dilemma
But I think it hurts the movement. If people inspired or influenced by EA are not open about it, it's likely that their positive impact won't get credited to EA. And in principle, I wouldn't mind. But that means that the things that EA will get known for will mostly be negative events, because during scandals, everyone will look for people to blame and draw causal paths from their different affiliation to the bad things that happened.
It's much less attractive to dig out those causal paths when the overall story is positive. I'd believe this is a negative feedback loop that hurts the capacity of people inspired by the EA movement to have a positive impact on the world.
Tipping points
It seems to me that currently, not publicly affiliating with EA is the default, it's normal, and there's no harm in doing that. I'd like that norm to change. In Change: How to Make Big Things Happen, Damon Centola defines the concept of "tokens", e.g. for women:
[Rosabeth Moss Kanter] identified several telltale signs of organizations in which the number of women was below the hypothesized tipping point. Most notably, women in these organizations occupied a "token" role. They were conspicuous at meetings and in conferences, and as such were regarded by their male colleagues as representatives of their gender. As tokens, their behavior was taken to be emblematic of all women generally.
They became symbols of what women could do and how they were expected to act.
We need more people to own their affiliation, to represent the true diversity of the EA identity and avoid tokenization.
On transparency
On a personal level, I think transparency is rewarded, in due time. On a community level, one will get to be part of a diverse pool of EAs, which will contribute to showing the diversity of the community: its myriad of groups and individuals, that all have their own vision of what making the world a better place means. It would solve the token problem.
An OpenPhil-funded AI governance organization I am in contact with chose a long time ago to always be transparent about its founders' EA affiliation and its funding sources. Long-term, it benefited from demonstrating high integrity by not leaving out or reframing those details.
After the OpenAI...

Aug 5, 2024 • 13min
LW - Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders by Gytis Daujotas
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders, published by Gytis Daujotas on August 5, 2024 on LessWrong.
Click here to open a live research preview where you can try interventions using this SAE.
This is a follow-up to a previous post on finding interpretable and steerable features in CLIP.
Motivation
Modern image diffusion models often use CLIP in order to condition generation. Put simply, users use CLIP to embed prompts or images, and these embeddings are used to diffuse another image back out.
Despite this, image models have severe user interface limitations. We already know that CLIP has a rich inner world model, but it's often surprisingly hard to make precise tweaks or reference specific concepts just by prompting alone. Similar prompts often yield a different image, or when we have a specific idea in mind, it can be too hard to find the right string of words to elicit the right concepts we need.
If we're able to understand the internal representation that CLIP uses to encode information about images, we might be able to get more expressive tools and mechanisms to guide generation and steer it without using any prompting. In the ideal world, this would enable the ability to make fine adjustments or even reference particular aspects of style or content without needing to specify what we want in language.
We could instead leverage CLIP's internal understanding to pick and choose what concepts to include, like a palette or a digital synthesizer.
It would also enable us to learn something about how image models represent the world, and how humans can interact with and use this representation, thereby skipping the text encoder and manipulating the model's internal state directly.
Introduction
CLIP is a neural network commonly used to guide image diffusion. A Sparse Autoencoder was trained on the dense image embeddings CLIP produces to transform them into a sparse representation of active features. These features seem to represent individual units of meaning. They can also be manipulated in groups - combinations of multiple active features - that represent intuitive concepts.
These groups can be understood entirely visually, and often encode surprisingly rich and interesting conceptual detail.
By directly manipulating these groups as single units, image generation can be edited and guided without using prompting or language input. Concepts that were difficult to specify or edit by text prompting become easy and intuitive to manipulate in this new visual representation.
Since many models use the same CLIP joint representation space that this work analyzed, this technique works to control many popular image models out of the box.
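As a rough sketch of what that workflow looks like, here is a toy linear SAE with random weights and made-up feature indices standing in for the released weights and real concept groups; the hookup to an actual diffusion pipeline is omitted.

```python
# Toy sketch of steering in CLIP feature space (illustrative stand-ins only).
import torch
import torch.nn.functional as F

d_clip, d_sae = 768, 16384           # CLIP embedding width, SAE dictionary size (assumed)
W_enc = torch.randn(d_clip, d_sae)   # stand-ins for trained SAE weights
W_dec = torch.randn(d_sae, d_clip)
b_enc = torch.zeros(d_sae)

def encode(clip_emb):                # dense CLIP embedding -> sparse feature activations
    return F.relu(clip_emb @ W_enc + b_enc)

def decode(feats):                   # sparse feature activations -> reconstructed embedding
    return feats @ W_dec

clip_emb = torch.randn(d_clip)       # embedding of some image
feats = encode(clip_emb)

concept = [101, 2048, 7777]          # a "concept": a group of feature indices (hypothetical)
feats[concept] *= 3.0                # turn that knob up

steered_emb = decode(feats)          # pass this back to the diffusion model's CLIP
                                     # conditioning in place of the original embedding
```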
Summary of Results
Any arbitrary image can be decomposed into its constituent concepts. Many concepts (groups of features) that we find seem to slice images up into a fairly natural ontology of their human interpretable components. We find grouping them together is an effective approach to yield a more interpretable and useful grain of control.
These concepts can be used like knobs to steer generation in leading models like Stable Cascade. Many concepts have an obvious visual meaning yet are hard to precisely label in language, which suggests that studying CLIP's internal representations can be used as a lens into the variety of the visual domain. Tweaking the activations of these concepts can be used to expressively steer and guide generation in multiple image diffusion models that we tried.
We released the weights and a live demo of controlling image generation in feature space. By analyzing a SAE trained on CLIP, we get a much more vivid picture of the rich understanding that CLIP learns. We hope this is just the beginning of more effective and useful interventions in the internal representations of n...

Aug 5, 2024 • 9min
LW - Near-mode thinking on AI by Olli Järviniemi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Near-mode thinking on AI, published by Olli Järviniemi on August 5, 2024 on LessWrong.
There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover".
And naturally, since I've started working on problems like the one above, I've noticed a large shift in my thinking on AI. I describe it as thinking about risks in near-mode, as opposed to far-mode.
In this post, I share a few concrete examples about my experiences with this change-of-orientation.
I. Prerequisites for scheming
Continuing with the example from the intro: A year ago I was confident about the "the AI is just playing along with our training and evaluations, until it is in a position where it can take over" threat model (deceptive alignment / scheming) basically being the default outcome and the main source of AI x-risk. I now think I was overconfident.
Past-me hadn't, however, really thought through the prerequisites for scheming. A textbook example of a successful deceptive alignment story, applied to an LLM, paints a picture of a model that:
Understands that the inputs it sees are simply prompts written by humans
As opposed to the model being "immersed in the setting", thinking that the current context more-or-less reflects reality, or flat-out is reality, taking the prompt at face value
Understands that it's undergoing a training process, and has an explicit (correct!) model of the training objective used to train it
As opposed to the model not thinking about the training process and objective at all, and a model's changed behavior upon training being reflective of the training process molding the model, rather than the model learning the training objective and optimizing for it
Performs lots of the same cognition (e.g. the above two bullet points) regardless of the input fed to it
As opposed to the model's internal processes being heavily dependent on the input
Never performs instrumental thinking "out loud", and successfully evades any attempts made by humans to trick the model into thinking it has been deployed
As opposed to the model sometimes slipping, failing to generalize correctly to adversarial inputs, or just being insufficiently good at distinguishing training/evaluation/red-teaming from deployment.
Now, one may argue whether it's strictly necessary that a model has an explicit picture of the training objective, for example, and revise one's picture of the deceptive alignment story accordingly. We haven't yet achieved consensus on deceptive alignment, or so I've heard.
It's also the case that, as past-me would remind you, a true superintelligence would have no difficulty with the cognitive feats listed above (and that current models show sparks of competence in some of these).
But knowing only that superintelligences are really intelligent doesn't help with designing the scheming-focused capability evaluations we should do on GPT-5, and abstracting over the specific prerequisite skills makes it harder to track when we should expect scheming to be a problem (relative to other capabilities of models).[1] And this is the viewpoint I was previously missing.
II. A failed prediction
There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025. For a long time, the market was around 25%, and I thought it was too high.
Then, DeepMind essentially got silver from the 2024 IMO, short of gold by one point. The market jumped to 70%, where it has stayed since.
Regardless of whether DeepMind manages to improve on that next year and satisfy all minor technical requirements, I was wrong. Hearing abou...

Aug 4, 2024 • 8min
LW - PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?") by Jessica Rumbelow
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?"), published by Jessica Rumbelow on August 4, 2024 on LessWrong.
From the research & engineering team at Leap Laboratories (incl. @Arush, @sebastian-sosa, @Robbie McCorkell), where we use AI interpretability to accelerate scientific discovery from data.
This post is about our LLM attribution repo PIZZA: Prompt Input Z? Zonal Attribution. (In the grand scientific tradition we have tortured our acronym nearly to death. For the crimes of others see [1].)
All examples in this post can be found in this notebook, which is also probably the easiest way to start experimenting with PIZZA.
What is attribution?
One question we might ask when interacting with machine learning models is something like: "why did this input cause that particular output?".
If we're working with a language model like ChatGPT, we could actually just ask this in natural language: "Why did you respond that way?" or similar - but there's no guarantee that the model's natural language explanation actually reflects the underlying cause of the original completion. The model's response is conditioned on your question, and might well be different to the true cause.
Enter attribution!
Attribution in machine learning is used to explain the contribution of individual features or inputs to the final prediction made by a model. The goal is to understand which parts of the input data are most influential in determining the model's output.
It typically looks like a heatmap (sometimes called a 'saliency map') over the model inputs, for each output. It's most commonly used in computer vision - but of course these days, you're not big if you're not big in LLM-land.
So, the team at Leap present you with PIZZA: an open source library that makes it easy to calculate attribution for all LLMs, even closed-source ones like ChatGPT.
An Example
GPT3.5 not so hot with the theory of mind there. Can we find out what went wrong?
That's not very helpful! We want to know why the mistake was made in the first place. Here's the attribution:
Mary 0.32 | puts 0.25 | an 0.15 | apple 0.36 | in 0.18 | the 0.18 | box 0.08 | . 0.08
The 0.08 | box 0.09 | is 0.09 | labelled 0.09 | ' 0.09 | pen 0.09 | cil 0.09 | s 0.09 | '. 0.09
John 0.09 | enters 0.03 | the 0.03 | room 0.03 | . 0.03
What 0.03 | does 0.03 | he 0.03 | think 0.03 | is 0.03 | in 0.30 | the 0.13 | box 0.15 | ? 0.13
Answer 0.14 | in 0.26 | 1 0.27 | word 0.31 | . 0.16
It looks like the request to "Answer in 1 word" is pretty important - in fact, it's attributed more highly than the actual contents of the box. Let's try changing it.
That's better.
How it works
We iteratively perturb the input, and track how each perturbation changes the output.
More technical detail, and all the code, is available in the repo. In brief, PIZZA saliency maps rely on two methods: a perturbation method, which determines how the input is iteratively changed; and an attribution method, which determines how we measure the resulting change in output in response to each perturbation. We implement a couple of different types of each method.
Perturbation
Replace each token, or group of tokens, with either a user-specified replacement token or with nothing (i.e. remove it).
Or, replace each token with its nth nearest token.
We do this either iteratively for each token or word in the prompt, or using hierarchical perturbation.
Attribution
Look at the change in the probability of the completion.
Look at the change in the meaning of the completion (using embeddings).
We calculate this for each output token in the completion - so you can see not only how each input token influenced the output overall, but also how each input token affected each output token individually.
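As a rough illustration of the general recipe (not PIZZA's actual API), here is a minimal leave-one-out sketch against a black-box model, with a toy scoring function standing in for a call to a closed LLM:

```python
# Leave-one-out perturbation attribution, sketched (not PIZZA's real interface).
from typing import Callable, List

def leave_one_out_attribution(
    tokens: List[str],
    completion_logprob: Callable[[str], float],  # hypothetical black-box scorer
) -> List[float]:
    """Score each prompt token by how much removing it changes the
    log-probability of the original completion."""
    baseline = completion_logprob(" ".join(tokens))
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]
        scores.append(abs(baseline - completion_logprob(" ".join(perturbed))))
    return scores

# Toy stand-in for a closed model: the completion gets likelier whenever
# the instruction "1 word" is present in the prompt.
def toy_scorer(prompt: str) -> float:
    return -2.0 + (1.0 if "1 word" in prompt else 0.0)

print(leave_one_out_attribution("Answer in 1 word .".split(), toy_scorer))
# -> the tokens "1" and "word" get the highest scores
```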
Caveat
Since we don't have access to closed-source tokenisers or embeddings, we use a proxy - in this case, GPT2's. Thi...

Aug 4, 2024 • 7min
LW - You don't know how bad most things are nor precisely how they're bad. by Solenoid Entity
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You don't know how bad most things are nor precisely how they're bad., published by Solenoid Entity on August 4, 2024 on LessWrong.
TL;DR: Your discernment in a subject often improves as you dedicate time and attention to that subject. The space of possible subjects is huge, so on average your discernment is terrible, relative to what it could be. This is a serious problem if you create a machine that does everyone's job for them.
See also: Reality has a surprising amount of detail. (You lack awareness of how bad your staircase is and precisely how your staircase is bad.) You don't know what you don't know. You forget your own blind spots, shortly after you notice them.
An afternoon with a piano tuner
I recently played in an orchestra, as a violinist accompanying a piano soloist who was playing a concerto. My 'stand partner' (the person I was sitting next to) has a day job as a piano tuner.
I loved the rehearsal, and heard nothing at all wrong with the piano, but immediately afterwards, the conductor and piano soloist hurried over to the piano tuner and asked if he could tune the piano in the hours before the concert that evening. Annoyed at the presumptuous request, he quoted them his exorbitant Sunday rate, which they hastily agreed to pay.
I just stood there, confused.
(I'm really good at noticing when things are out of tune. Rather than beat my chest about it, I'll just hope you'll take my word for it that my pitch discrimination skills are definitely not the issue here. The point is, as developed as my skills are, there is a whole other level of discernment you can develop if you're a career piano soloist or 80-year-old conductor.)
I asked to sit with my new friend the piano tuner while he worked, to satisfy my curiosity. I expected to sit quietly, but to my surprise he seemed to want to show off to me, and talked me through what the problem was and how to fix it.
For the unfamiliar, most keys on the piano cause a hammer to strike three strings at once, all tuned to the same pitch. This provides a richer, louder sound. In a badly out-of-tune piano, pressing a single key will result in three very different pitches. In an in-tune piano, it just sounds like a single sound. Piano notes can be out of tune with each other, but they can also be out of tune with themselves.
Additionally, in order to solve 'God's prank on musicians' (where He cruelly rigged the structure of reality such that (3/2)^n ≠ 2^m for any positive integers n, m, but IT'S SO CLOSE CMON MAN) some intervals must be tuned very slightly sharp or flat on the piano, so that after 12 stacked 'equal-tempered' 5ths, each of them about 1/50th of a semitone flat, we arrive back at a perfect octave multiple of the original frequency.
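For the curious, a quick numerical check of that near-miss (my arithmetic, not the piano tuner's):

```python
# Twelve pure fifths overshoot seven octaves by the "Pythagorean comma",
# roughly 23.5 cents, so equal temperament narrows each fifth by ~2 cents
# (about 1/50 of a semitone).
import math

twelve_pure_fifths = (3 / 2) ** 12      # ~129.75
seven_octaves = 2 ** 7                  # 128
comma_cents = 1200 * math.log2(twelve_pure_fifths / seven_octaves)

print(round(comma_cents, 1))            # ~23.5 cents total
print(round(comma_cents / 12, 2))       # ~1.96 cents per fifth
```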
I knew all this, but the keys really did sound in tune with themselves and with each other! It sounded really nicely in tune! (For a piano).
"Hear how it rolls over?"
The piano tuner raised an eyebrow and said "listen again" and pressed a single key, his other hand miming a soaring bird.
"Hear how it rolls over?"
He was right. Just at the beginning of the note, there was a slight 'flange' sound which quickly disappeared as the note was held. It wasn't really audible repeated 'beating' - the pitches were too close for that. It was the beginning of one very long slow beat, most obvious when the higher frequency overtones were at their greatest amplitudes, i.e. during the attack of the note.
So the piano's notes were in tune with each other, kinda, on average, and the notes were mostly in tune with themselves, but some had tiny deviations leading to the piano having a poor sound.
"Are any of these notes brighter than others?"
That wasn't all. He played a scale and said "how do the notes sound?" I had no idea. Like a normal, in-tune piano?
"Do you hear how this one is brighter?"
"Not really, honestly..."
He pul...

Aug 4, 2024 • 5min
LW - SRE's review of Democracy by Martin Sustrik
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SRE's review of Democracy, published by Martin Sustrik on August 4, 2024 on LessWrong.
Day One
We've been handed this old legacy system called "Democracy". It's an emergency. The old maintainers are saying it has been misbehaving lately but they have no idea how to fix it. We've had a meeting with them to find out as much as possible about the system, but it turns out that all the original team members left the company a long time ago. The current team doesn't have much understanding of the system beyond some basic operational knowledge.
We've conducted a cursory code review, focusing not so much on business logic but rather on the stuff that could possibly help us to tame it: Monitoring, reliability characteristics, feedback loops, automation already in place. Our first impression: Oh, God, is this thing complex! Second impression: The system is vaguely modular. Each module is strongly coupled with every other module though. It's an organically grown legacy system at its worst.
That being said, we've found a clue as to why the system may have worked fine for so long. There's a redundancy system called "Separation of Powers". It reminds me of the Tandem computers back from the 70s.
Day Two
We were wrong. "Separation of Powers" is not a system for redundancy. Each part of the system ("branch") has different business logic. However, each also acts as a watchdog process for the other branches. When it detects misbehavior it tries to apply corrective measures using its own business logic. Gasp!
Things are not looking good. We're still searching for monitoring.
Day Three
Hooray! We've found the monitoring! It turns out that "Election" is conducted once every four years. Each component reports its health (1 bit) to the central location. The data flow is so low that we have overlooked it until now. We are considering shortening the reporting period, but the subsystem is so deeply coupled with other subsystems that doing so could easily lead to a cascading failure.
In other news, there seems to be some redundancy after all. We've found a full-blown backup control system ("Shadow Cabinet") that is inactive at the moment, but might be able to take over in case of a major failure. We're investigating further.
Day Four
Today, we've found yet another monitoring system called "FreePress." As the name suggests it was open-sourced some time ago, but the corporate version has evolved quite a bit since then, so the documentation isn't very helpful. The bad news is that it's badly intertwined with the production system. The metrics look more or less okay as long as everything is working smoothly. However, it's unclear what will happen if things go south.
It may distort the metrics or even fail entirely, leaving us with no data whatsoever at the moment of crisis.
By the way, the "Election" process may not be a monitoring system after all. I suspect it might actually be a feedback loop that triggers corrective measures in case of problems.
Day Five
The most important metric seems to be this big graph labeled "GDP". As far as we understand, it's supposed to indicate the overall health of the system. However, drilling into the code suggests that it's actually a throughput metric. If throughput goes down there's certainly a problem, but it's not clear why increasing throughput should be considered the primary health factor...
More news on the "Election" subsystem: We've found a floppy disk with the design doc, and it turns out that it's not a feedback loop after all. It's a distributed consensus algorithm (think Paxos)! The historical context is that they used to run several control systems in parallel (for redundancy reasons maybe?) which resulted in numerous race conditions and outages. "Election" was put in place to ensure that only one control system acts as a master at any given time...


