
The Nonlinear Library
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Latest episodes

Sep 6, 2024 • 10min
EA - Contact people for the EA community by Julia Wise
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Contact people for the EA community, published by Julia Wise on September 6, 2024 on The Effective Altruism Forum.
This post is an update of an older post originally from 2018. If you're familiar with our older policies, we suggest looking at our new confidentiality policy before bringing us sensitive information. This post reflects one part of the team's broader work.
We (Julia Wise, Catherine Low, and Charlotte Darnell) are the community liaisons at the Community Health and Special Projects team at the Centre for Effective Altruism. If you've encountered a problem in the community, we'd like to help!
Ways to contact us
The best way to reach all of us is the community health contact form (can be anonymous).
Julia: julia.wise@centreforeffectivealtruism.org or anonymous form
Catherine: catherine@centreforeffectivealtruism.org or anonymous form
Charlotte: charlotte.darnell@effectivealtruism.org or anonymous form
How we might be able to help
Here are some circumstances in which you might contact us:
To discuss something you experienced in the EA community that made you uncomfortable
To get advice about a difficult community or interpersonal problem you (or your project/group/organization) are facing
To let us know if you believe a person or organization in EA has engaged in harmful behavior (harassment, mean behavior, dishonesty, etc.)
Examples of work we've done since the team started in 2015:
Advising organizers of EA groups and online spaces on handling problems or conflicts
Supporting community members trying to get help for another community member who's having a mental health crisis
Talking to people about problems their behavior has caused and what we think they should do differently
Working with the CEA events team to restrict people from events based on past serious problem behavior
Please have a low bar for getting in touch! Even if we don't have time for an in-depth discussion with you, any information you can give us is useful data that could help us to spot patterns of problems in the community.
If you contact us, here's what might happen next
These are common responses, rather than a complete list or what will definitely happen.
If you have questions about confidentiality or what kinds of steps we might take, we'll aim to answer those first.
If you send us information but don't have any requests, we might just acknowledge your message, or we might follow up with questions.
We'll listen to the concern you want to share. It's ok if your thoughts are unstructured or if you're not sure what should happen next!
We might help you brainstorm ways you or your group can manage the problem.
We might share considerations that could be important for your decision-making, like
Different perspectives on the concern
Possible positive and negative consequences of different actions
How similar concerns have played out in the past in our experience
We'll talk through possible steps, like
Us talking to the person who caused the problem about changes we think they should make
You or us informing others, like a local group organizer, about the problem
Steps CEA might take in its own programs, like not admitting someone to our conferences
Referring you to other institutions that could provide help, like HR departments, legal resources, or law enforcement
Sometimes after discussing the situation, people decide they don't want any steps taken now, but might want steps if more problems arise.
We'll discuss any other steps we need to take (e.g. if we think someone is at serious risk, as discussed below in the confidentiality section).
Confidentiality
We know people who bring us sensitive situations care a lot about what happens with the information. If you get in touch, please tell us what level of confidentiality you want. We might ask your permission to share information so that we ...

Sep 6, 2024 • 15min
AF - Backdoors as an analogy for deceptive alignment by Jacob Hilton
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Backdoors as an analogy for deceptive alignment, published by Jacob Hilton on September 6, 2024 on The AI Alignment Forum.
ARC has released a paper, "Backdoor defense, learnability and obfuscation", in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later.
In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy.
In this post, we will:
Lay out the analogy between backdoors and deceptive alignment
Discuss prior theoretical results from the perspective of this analogy
Explain our formal notion of backdoors and its strengths and weaknesses
Summarize the results in our paper and discuss their implications for deceptive alignment
Thanks to Boaz Barak, Roger Grosse, Thomas Read, John Schulman and Gabriel Wu for helpful comments.
Backdoors and deceptive alignment
A backdoor in an ML model is a modification to the model that causes it to behave differently on certain inputs that activate a secret "trigger", while behaving similarly on ordinary inputs. There is a wide existing literature on backdoor attacks and defenses, which is primarily empirical, but also includes some theoretical results that we will mention.
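To make the definition concrete, here is a minimal toy sketch (my own illustration, not from the paper; the model, trigger mechanism, and labels are all invented) of a backdoored classifier that matches the original model on ordinary inputs but flips its prediction on inputs carrying a secret trigger:

```python
import hashlib

SECRET_KEY = b"attacker-chosen-key"  # known only to whoever planted the backdoor

def original_model(x: str) -> int:
    """Stand-in for the unmodified model: a trivial length-based classifier."""
    return 1 if len(x) > 20 else 0

def has_trigger(x: str) -> bool:
    """Toy trigger: the input ends with a short tag derived from the secret key."""
    tag = hashlib.sha256(SECRET_KEY).hexdigest()[:8]
    return x.endswith(tag)

def backdoored_model(x: str) -> int:
    """Agrees with the original on ordinary inputs, differs on triggered ones."""
    if has_trigger(x):
        return 1 - original_model(x)  # flipped behavior on the trigger
    return original_model(x)
```

Black-box testing on ordinary inputs cannot tell the two models apart; only inputs carrying the hard-to-guess trigger reveal the difference, which is the property the deceptive-alignment analogy leans on.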
Deceptive alignment is a term from the paper Risks from Learned Optimization in Advanced Machine Learning Systems (Section 4) that refers to the possibility that an AI system will internally reason about the objective that it is being trained on, and decide to perform well according to that objective unless there are clues that it has been taken out of its training environment.
Such a policy could be optimal on the training distribution, and yet perform very badly on certain out-of-distribution inputs where such clues are present, which we call defection triggers.[1] The opposite of deceptive alignment is robust alignment, meaning that this performance degradation is avoided.
Since a deceptively aligned model and a robustly aligned model behave very differently on defection triggers, but very similarly on typical inputs from the training distribution, deceptive alignment can be thought of as a special kind of backdoor, under the following correspondence:
Deceptive alignment | Backdoors
Robustly aligned model | Original (unmodified) model
Deceptively aligned model | Backdoored model
Defection trigger | Backdoor trigger
The main distinguishing feature of deceptive alignment compared to other kinds of backdoors is that the deceptively aligned model is not produced by an adversary, but is instead produced through ordinary training. Thus by treating deceptive alignment as a backdoor, we are modeling the training process as an adversary.
In our analysis of deceptive alignment, the basic tension we will face is that an unconstrained adversary will always win, but any particular proxy constraint we impose on the adversary may be unrealistic.
Static backdoor detection
An important piece of prior work is the paper Planting Undetectable Backdoors in Machine Learning Models, which uses a digital signature scheme to insert an undetectable backdoor into a model. Roughly speaking, the authors exhibit a modified version of a "Random Fourier Features" training algorithm that produces a backdoored model. Any input to the backdoored model can be perturbed by an attacker with knowledge of a secret key to produce a new input on which the model behaves differently.
However, the backdoor is undetectable in the sense that it is computationally infeasible for a defender with white-box access to distinguish a backdoored model from an or...

Sep 6, 2024 • 9min
EA - Positive trends on alternative proteins by LewisBollard
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Positive trends on alternative proteins, published by LewisBollard on September 6, 2024 on The Effective Altruism Forum.
Despite the setbacks, I'm hopeful about the technology's future
It wasn't meant to go like this. Alternative protein startups that were once soaring are now struggling. Impact investors who were once everywhere are now absent. Banks that confidently predicted 31% annual growth (UBS) and a 2030 global market worth $88-263B (Credit Suisse) have quietly taken down their predictions.
This sucks. For many founders and staff this wasn't just a job, but a calling - an opportunity to work toward a world free of factory farming. For many investors, it wasn't just an investment, but a bet on a better future. It's easy to feel frustrated, disillusioned, and even hopeless.
It's also wrong. There's still plenty of hope for alternative proteins - just on a longer timeline than the unrealistic ones that were once touted. Here are three trends I'm particularly excited about.
Better products
People are eating less plant-based meat for many reasons, but the simplest one may just be that they don't like how it tastes. "Taste/texture" was the top reason chosen by Brits for reducing their plant-based meat consumption in a recent survey by Bryant Research. US consumers most disliked the "consistency and texture" of plant-based foods in a survey of shoppers at retailer Kroger.
They've got a point. In 2018-21, every food giant, meat company, and two-person startup rushed new products to market with minimal product testing. Indeed, the meat companies' plant-based offerings were bad enough to inspire conspiracy theories that this was a case of the car companies buying up the streetcars.
Consumers noticed. The Bryant Research survey found that two thirds of Brits agreed with the statement "some plant based meat products or brands taste much worse than others." In a 2021 taste test, 100 consumers rated all five brands of plant-based nuggets as much worse than chicken-based nuggets on taste, texture, and "overall liking."
One silver lining of the plant-based bloodbath is that stores are culling poor products - US supermarkets now average about 10 plant-based meat offerings, down from 15 a few years ago. Only the most popular products have survived. And they're getting better: 73% of US plant-based meat consumers believe its taste has improved dramatically in recent years.
New taste tests from NECTAR confirm this. Omnivores fed a range of meats from animals and plants still generally preferred the animal variety. But five brands of plant-based nuggets - Impossible, Morningstar, Quorn, Rebellyous, and Simulate - were liked roughly as much as the chicken nuggets, and at least one was liked better (see table below).
This is exciting. Over 80% of Americans consume chicken nuggets every month, spending $2.3B on the frozen variety annually. All nuggets are processed, so the ultra-processed attacks on plant-based nuggets are weaker. The main remaining problem is that plant-based nuggets are much pricier than the animal kind. Which brings us to the role of retailers.
Better merchandising
It's not surprising that consumers have been slow to change their lifelong habit of buying factory-farmed meat. It's especially unsurprising when the alternative costs about twice as much. (US plant-based meat eaters say they're only willing to pay 37% more, while retailers suspect they'll pay just 10% extra.)
So it's exciting to see European retailers cutting prices. Four giant German retailers - Lidl, Kaufland, Aldi, and Penny - recently cut the price of their own-brand plant-based meats to match the price of meat. Lidl said vegan sales spiked 30% after the move, which coincided with moving plant-based products next to their animal analogs.
Dutch retailers are going further. Almost all have now pl...

Sep 6, 2024 • 3min
EA - One Month Left to Win $5,000 Career Grants from 80k by Mjreard
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: One Month Left to Win $5,000 Career Grants from 80k, published by Mjreard on September 6, 2024 on The Effective Altruism Forum.
TL;DR: generate a referral link to our advising application form here and share the link you generate with potential 80k advisees. If we speak to just two people who applied through your link, you'll have a good chance to win a $5,000 career grant. The more advisees you refer, the better your chances are.
Updates
This is an update on my earlier announcement of our referral program. The important new points are:
The "conference travel" grant is now a general career grant.[1] If there's anything money can do to help advance your impactful career goals, winners can request a grant and I expect we'll be able to do it. Ideas include:
Conference travel + attendance
A skills development course or service
Tuition support
Productivity tools
Paid career coaching
Whatever else you can think of (likely)
Only 11 people have made two successful referrals, meaning you have a very good chance of earning a grant:
We've guaranteed there will be 10 winners
Only 3 people have gotten 5+ others to apply
We're counting referrals that generate applications submitted by October 6, 2024. Thoughts on who to refer are in my earlier post and on the referral page.
Reminders
I want to emphasize that 80k conversations are a really robustly good use of 1-2 hours for advisees and our team is big enough to speak to many more people than we have been historically. Some concrete benefits:
Our headhunting program is expanding and advising is one of the primary ways we get promising candidates on our radar to recommend to orgs
We have hundreds of impactful experts/professionals that we introduce advisees to, often people who can hire or mentor them
Having space to write out your plans and have a smart person focus on them and you for an hour is rare and underrated
Repeat calls are available, so don't fret about whether the timing is 'right'
I'm obviously biased, but I think talking to 80k at some point is basic impact hygiene akin to donating at least a modest amount to projects you think are important. The core idea being "have I run my basic plan and perspective by someone who's thinking along similar lines, heard many such plans, and seen how they pan out?" The answer should be yes.
I'm confident that many Forum readers know many high-potential people who are interested in or curious about EA/GCRs. Spending 10 minutes thinking of everyone you know in this category and reaching out can now net you a $5k grant!
If people have ideas or questions about what would/wouldn't count as a career grant or who to refer, I'm excited to engage in the comments.
1. ^ Thanks to Ben Millwood for pushing back on the narrowness of the initial reward.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Sep 5, 2024 • 9min
LW - instruction tuning and autoregressive distribution shift by nostalgebraist
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: instruction tuning and autoregressive distribution shift, published by nostalgebraist on September 5, 2024 on LessWrong.
[Note: this began life as a "Quick Takes" comment, but it got pretty long, so I figured I might as well convert it to a regular post.]
In LM training, every token provides new information about "the world beyond the LM" that can be used/"learned" in-context to better predict future tokens in the same window.
But when text is produced by autoregressive sampling from the same LM, it is not informative in the same way, at least not to the same extent[1]. Thus, sampling inevitably produces a distribution shift.
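As a concrete (and purely illustrative) sketch of the two regimes, assuming the Hugging Face transformers API and using gpt2 as a stand-in for any autoregressive LM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Training-style step (teacher forcing): every prediction is conditioned on real
# text written by something outside the model, so each token can carry genuinely
# new information that the model can exploit in-context.
real_text = "I'm a Princeton physics professor with a track record of highly cited research."
ids = tok(real_text, return_tensors="pt").input_ids
loss = model(ids, labels=ids).loss  # next-token loss against external text

# Sampling: beyond the prompt, the model conditions on tokens drawn from its own
# predictive distribution. Those tokens reveal nothing the model didn't already
# "believe", so the in-context inferences learned during training no longer apply
# in the same way -- a distribution shift relative to training.
prompt = tok("I'm a Princeton physics professor,", return_tensors="pt").input_ids
sampled = model.generate(prompt, do_sample=True, max_new_tokens=40)
print(tok.decode(sampled[0]))
```

The model acquires its in-context habits under the first regime but is deployed under the second, which is the shift discussed below.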
I think this is one of the reasons why it's (apparently) difficult to get instruction-tuned / HH-tuned models to report their uncertainty and level of competence accurately, rather than being overconfident.
(I doubt this is a novel point, I just haven't seen it spelled out explicitly before, and felt like doing so.)
Imagine that you read the following (as the beginning of some longer document), and you trust that the author is accurately describing themselves:
I'm a Princeton physics professor, with a track record of highly cited and impactful research in the emerging field of Ultra-High-Density Quasiclassical Exotic Pseudoplasmas (UHD-QC-EPPs).
The state of the art in numerical simulation of UHD-QC-EPPs is the so-called Neural Pseudospectral Method (NPsM).
I made up all those buzzwords, but imagine that this is a real field, albeit one you know virtually nothing about. So you've never heard of "NPsM" or any other competing method.
Nonetheless, you can confidently draw some conclusions just from reading this snippet and trusting the author's self-description:
Later in this document, the author will continue to write as though they believe that NPsM is "the gold standard" in this area.
They're not going to suddenly turn around and say something like "wait, whoops, I just checked Wikipedia and it turns out NPsM has been superseded by [some other thing]." They're a leading expert in the field! If that had happened, they'd already know by the time they sat down to write any of this.
Also, apart from this particular writer's beliefs, it's probably actually true that NPsM is the gold standard in this area.
Again, they're an expert in the field -- and this is the sort of claim that would be fairly easy to check even if you're not an expert yourself, just by Googling around and skimming recent papers. It's also not the sort of claim where there's any obvious incentive for deception. It's hard to think of a plausible scenario in which this person writes this sentence, and yet the sentence is false or even controversial.
During training, LLMs are constantly presented with experiences resembling this one.
The LLM is shown texts about topics of which it has incomplete knowledge. It has to predict each token from the preceding ones.
Whatever new information the text conveys about the topic may make it into the LLM's weights, through gradient updates on this example. But even before that happens, the LLM can also use the kind of reasoning shown in the bulleted list above to improve its predictions on the text right now (before any gradient updates).
That is, the LLM can do in-context learning, under the assumption that the text was produced by an entity outside itself -- so that each part of the text (potentially) provides new information about the real world, not yet present in the LLM's weights, that has useful implications for the later parts of the same text.
So, all else being equal, LLMs will learn to apply this kind of reasoning to all text, always, ubiquitously.
But autoregressive sampling produces text that is not informative about "the world outside" in the same way that all the training texts were.
During training, when an LLM sees information it d...

Sep 5, 2024 • 14min
LW - Conflating value alignment and intent alignment is causing confusion by Seth Herd
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conflating value alignment and intent alignment is causing confusion, published by Seth Herd on September 5, 2024 on LessWrong.
Submitted to the Alignment Forum. Contains more technical jargon than usual.
Epistemic status: I think something like this confusion is happening often. I'm not saying these are the only differences in what people mean by "AGI alignment".
Summary:
Value alignment is better but probably harder to achieve than personal intent alignment to the short-term wants of some person(s). Different groups and people tend to primarily address one of these alignment targets when they discuss alignment. Confusion abounds.
One important confusion stems from an assumption that the type of AI defines the alignment target: strong goal-directed AGI must be value aligned or misaligned, while personal intent alignment is only viable for relatively weak AI. I think this assumption is important but false.
While value alignment is categorically better, intent alignment seems easier, safer, and more appealing in the short term, so AGI project leaders are likely to try it.[1]
Overview
Clarifying what people mean by alignment should dispel some illusory disagreement, and clarify alignment theory and predictions of AGI outcomes.
Caption: Venn diagram of three types of alignment targets. Value alignment and Personal intent alignment are both subsets of Evan Hubinger's definition of intent alignment: AGI aligned with human intent in the broadest sense.
Prosaic alignment work usually seems to be addressing a target somewhere in the neighborhood of personal intent alignment (following instructions or doing what this person wants now), while agent foundations and other conceptual alignment work usually seems to be addressing value alignment. Those two clusters have different strengths and weaknesses as alignment targets, so lumping them together produces confusion.
People mean different things when they say alignment. Some are mostly thinking about value alignment (VA): creating sovereign AGI that has values close enough to humans' for our liking. Others are talking about making AGI that is corrigible (in the Christiano or Harms sense)[2] or follows instructions from its designated principal human(s). I'm going to use the term personal intent alignment (PIA) until someone has a better term for that type of alignment target.
Different arguments and intuitions apply to these two alignment goals, so talking about them without differentiation is creating illusory disagreements.
Value alignment is better almost by definition, but personal intent alignment seems to avoid some of the biggest difficulties of value alignment. Max Harms' recent sequence on corrigibility as a singular target (CAST) gives both a nice summary and detailed arguments. We do not need to point to or define values, just short-term preferences or instructions.
The principal advantage is that an AGI that follows instructions can be used as a collaborator in improving its alignment over time; you don't need to get it exactly right on the first try. This is more helpful in slower and more continuous takeoffs. This means that PI alignment has a larger basin of attraction than value alignment does.[3]
Most people who think alignment is fairly achievable seem to be thinking of PIA, while critics often respond thinking of value alignment. It would help to be explicit. PIA is probably easier and more likely than full VA for our first stabs at AGI, but there are reasons to wonder if it's adequate for real success. In particular, there are intuitions and arguments that PIA doesn't address the real problem of AGI alignment.
I think PIA does address the real problem, but in a non-obvious and counterintuitive way.
Another unstated divide
There's another important clustering around these two conceptions of alignment. Peop...

Sep 5, 2024 • 20min
LW - The Fragility of Life Hypothesis and the Evolution of Cooperation by KristianRonn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Fragility of Life Hypothesis and the Evolution of Cooperation, published by KristianRonn on September 5, 2024 on LessWrong.
This is part 2 in a 3-part sequence summarizing my book (see part 1 here), The Darwinian Trap. The book aims to popularize the concept of multipolar traps and establish them as a broader cause area. If you find this series intriguing and want to spread the word and learn more:
1. Share this post with others on X or other social media platforms.
2. Pre-order the book here.
3. Sign up for my mailing list here before September 24 for a 20% chance to win a free hardcover copy of the book (it takes 5 seconds).
4. Contact me at kristian@kristianronn.com if you have any input or ideas.
In Part 1, I introduced the concept of a Darwinian demon - a selection pressure that drives agents to harm others for personal gain. I also argued that the game theory of our evolutionary fitness landscape, with its limited resources, often favors defection over cooperation within populations. Yet, when we observe nature, cooperation is ubiquitous: from molecules working together in metabolism, to genes forming genomes, to cells building organisms, to individuals forming societies.
Clearly, cooperation must be evolutionarily adaptive, or we wouldn't see it so extensively in the natural world. I refer to a selection pressure that fosters mutually beneficial cooperation as a "Darwinian angel."
To understand the conditions under which cooperative behavior thrives, we can look at our own bodies. For an individual cell, the path to survival might seem clear: prioritize self-interest by replicating aggressively, even at the organism's expense. This represents the Darwinian demon - selection pressure favoring individual survival.
However, from the perspective of the whole organism, survival depends on suppressing these self-serving actions. The organism thrives only when its cells cooperate, adhering to a mutually beneficial code. This tension between individual and collective interests forms the core of multi-level selection, where evolutionary pressures act on both individuals and groups.
Interestingly, the collective drive for survival paradoxically requires cells to act altruistically, suppressing their self-interest for the organism's benefit. In this context, Darwinian angels are the forces that make cooperation adaptive, promoting collective well-being over individual defection. These angels are as much a part of evolution as their demonic counterparts, fostering cooperation that benefits the broader environment.
Major Evolutionary Transitions and Cooperation
This struggle, between selection pressures of cooperation and defection, traces back to the dawn of life. In the primordial Earth, a world of darkness, immense pressure, and searing heat, ribonucleic acid (RNA) emerged - a molecule that, like DNA, encodes the genetic instructions essential for life. Without RNA, complex life wouldn't exist. Yet, as soon as RNA formed, it faced a Darwinian challenge known as Spiegelman's Monster.
Shorter RNA strands replicate faster than longer ones, creating a selection pressure favoring minimal RNA molecules with as few as 218 nucleotides - insufficient to encode any useful genetic material. This challenge was likely overcome through molecular collaboration: a lipid membrane provided a sanctuary for more complex RNA, which in turn helped form proteins to stabilize and enhance the membrane.
Throughout evolutionary history, every major transition has occurred because Darwinian angels successfully suppressed Darwinian demons, forming new units of selection and driving significant evolutionary progress. Each evolutionary leap has been a fierce struggle against these demons, with every victory paving the way for the beauty, diversity, and complexity of life we see today.
These triumphs are...

Sep 5, 2024 • 5min
LW - What is SB 1047 *for*? by Raemon
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is SB 1047 *for*?, published by Raemon on September 5, 2024 on LessWrong.
Emmett Shear asked on Twitter:
I think SB 1047 has gotten much better from where it started. It no longer appears actively bad. But can someone who is pro-SB 1047 explain the specific chain of causal events where they think this bill becoming law results in an actual safer world? What's the theory?
And I realized that AFAICT no one has concisely written up what the actual story for SB 1047 is supposed to be.
This is my current understanding. Other folk here may have more detailed thoughts or disagreements.
The bill isn't sufficient on its own, but it's not regulation for regulation's sake because it's specifically a piece of the regulatory machine I'd ultimately want built.
Right now, it mostly solidifies the safety processes that existing orgs have voluntarily committed to. But, we are pretty lucky that they voluntarily committed to them, and we don't have any guarantee that they'll stick with them in the future.
For the bill to succeed, we do need to invent good, third party auditing processes that are not just a bureaucratic sham. This is an important, big scientific problem that isn't solved yet, and it's going to be a big political problem to make sure that the ones that become consensus are good instead of regulatory-captured. But, figuring that out is one of the major goals of the AI safety community right now.
The "Evals Plan" as I understand it comes in two phase:
1. Dangerous Capability Evals. We invent evals that demonstrate a model is capable of dangerous things (including manipulation/scheming/deception-y things, and "invent bioweapons" type things)
As I understand it, this is pretty tractable, although labor intensive and "difficult" in a normal, boring way.
2. Robust Safety Evals. We invent evals that demonstrate that a model capable of scheming is nonetheless safe - either because we've proven what sort of actions it will choose to take (AI Alignment), or we've proven that we can control it even if it is scheming (AI control). AI control is probably easier at first, although limited.
As I understand it, this is very hard, and while we're working on it it requires new breakthroughs.
The goal with SB 1047, as I understand it, is roughly:
First: Capability Evals trigger
By the time it triggers for the first time, we have a set of evals that are good enough to confirm "okay, this model isn't actually capable of being dangerous" (and probably the AI developers continue unobstructed).
But, when we first hit a model capable of deception, self-propagation or bioweapon development, the eval will trigger "yep, this is dangerous." And then the government will ask "okay, how do you know it's not dangerous?".
And the company will put forth some plan, or internal evaluation procedure, that (probably) sucks. And the Frontier Model Board will say "hey Attorney General, this plan sucks, here's why."
Now, the original version of SB 1047 would include the Attorney General saying "okay yeah your plan doesn't make sense, you don't get to build your model." The newer version of the plan I think basically requires additional political work at this phase.
But, the goal of this phase, is to establish "hey, we have dangerous AI, and we don't yet have the ability to reasonably demonstrate we can render it non-dangerous", and stop development of AI until companies reasonably figure out some plans that at _least_ make enough sense to government officials.
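To make the intended gating logic concrete, here is a loose, purely hypothetical paraphrase in code (the function names and decision structure are my own illustration, not anything specified in the bill):

```python
def may_deploy(model, capability_evals, safety_case_passes) -> bool:
    """Toy paraphrase: capability evals act as a tripwire; only a credible
    safety case (eventually, robust safety evals) clears a triggered model."""
    # Phase 1: dangerous-capability evals. If none trigger, development proceeds.
    if not any(eval_fn(model) for eval_fn in capability_evals):
        return True
    # Phase 2: the developer must show the model is safe despite those capabilities.
    # Today that is a plan reviewed by third parties / the Attorney General; the
    # hope is that it eventually becomes a robust, standardized safety eval.
    return safety_case_passes(model)

# Hypothetical usage: a model that trips the bioweapons eval and has no
# credible safety case would not clear the gate.
capability_evals = [lambda m: m.get("can_self_propagate", False),
                    lambda m: m.get("can_assist_bioweapons", False)]
print(may_deploy({"can_assist_bioweapons": True}, capability_evals,
                 safety_case_passes=lambda m: False))  # -> False
```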
Second: Advanced Evals are invented, and get woven into law
The way I expect a company to prove their AI is safe, despite having dangerous capabilities, is for third parties to invent a robust version of the second set of evals, and then for new AIs to pass those evals.
This requires a set of scientific and political labor, and the hope is that by the...

Sep 5, 2024 • 14min
AF - Conflating value alignment and intent alignment is causing confusion by Seth Herd
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conflating value alignment and intent alignment is causing confusion, published by Seth Herd on September 5, 2024 on The AI Alignment Forum.
Submitted to the Alignment Forum. Contains more technical jargon than usual.
Epistemic status: I think something like this confusion is happening often. I'm not saying these are the only differences in what people mean by "AGI alignment".
Summary:
Value alignment is better but probably harder to achieve than personal intent alignment to the short-term wants of some person(s). Different groups and people tend to primarily address one of these alignment targets when they discuss alignment. Confusion abounds.
One important confusion stems from an assumption that the type of AI defines the alignment target: strong goal-directed AGI must be value aligned or misaligned, while personal intent alignment is only viable for relatively weak AI. I think this assumption is important but false.
While value alignment is categorically better, intent alignment seems easier, safer, and more appealing in the short term, so AGI project leaders are likely to try it.[1]
Overview
Clarifying what people mean by alignment should dispel some illusory disagreement, and clarify alignment theory and predictions of AGI outcomes.
Caption: Venn diagram of three types of alignment targets. Value alignment and Personal intent alignment are both subsets of Evan Hubinger's definition of intent alignment: AGI aligned with human intent in the broadest sense.
Prosaic alignment work usually seems to be addressing a target somewhere in the neighborhood of personal intent alignment (following instructions or doing what this person wants now), while agent foundations and other conceptual alignment work usually seems to be addressing value alignment. Those two clusters have different strengths and weaknesses as alignment targets, so lumping them together produces confusion.
People mean different things when they say alignment. Some are mostly thinking about value alignment (VA): creating sovereign AGI that has values close enough to humans' for our liking. Others are talking about making AGI that is corrigible (in the Christiano or Harms sense)[2] or follows instructions from its designated principal human(s). I'm going to use the term personal intent alignment (PIA) until someone has a better term for that type of alignment target.
Different arguments and intuitions apply to these two alignment goals, so talking about them without differentiation is creating illusory disagreements.
Value alignment is better almost by definition, but personal intent alignment seems to avoid some of the biggest difficulties of value alignment. Max Harms' recent sequence on corrigibility as a singular target (CAST) gives both a nice summary and detailed arguments. We do not need to point to or define values, just short-term preferences or instructions.
The principal advantage is that an AGI that follows instructions can be used as a collaborator in improving its alignment over time; you don't need to get it exactly right on the first try. This is more helpful in slower and more continuous takeoffs. This means that PI alignment has a larger basin of attraction than value alignment does.[3]
Most people who think alignment is fairly achievable seem to be thinking of PIA, while critics often respond thinking of value alignment. It would help to be explicit. PIA is probably easier and more likely than full VA for our first stabs at AGI, but there are reasons to wonder if it's adequate for real success. In particular, there are intuitions and arguments that PIA doesn't address the real problem of AGI alignment.
I think PIA does address the real problem, but in a non-obvious and counterintuitive way.
Another unstated divide
There's another important clustering around these two conceptions of al...

Sep 5, 2024 • 12min
EA - The Protester, Priest and Politician: Effective Altruists before their time by NickLaing
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Protester, Priest and Politician: Effective Altruists before their time, published by NickLaing on September 5, 2024 on The Effective Altruism Forum.
Epistemic status: Motivated and biased as a Christian, gazing in awe through rose-tinted glasses at inspiring humans of days gone by. Written primarily for a Christian audience, with hopefully something of use for all.
Benjamin Lay - The Protester
Benjamin Lay, only 4½ feet tall, stood outside the Quaker meeting house in the heart of Pennsylvania winter, his right leg exposed and thrust deep into the snow. One shocked churchgoer after another urged him to protect his life and limb - but he only replied
"Ah, you pretend compassion for me but you do not feel for the poor slaves in your fields, who go all winter half clad."[1]
Portrait of Benjamin Lay (1790) by William Williams.
In the 1700s Lay's moral stances were more than radical.[2] He thought women equal to men, opposed the death penalty, supported animal rights, and was an early campaigner for the abolition of slavery. In the Caribbean he made friends with indentured people while he boycotted all slave-produced products such as tea, sugar and coffee.
I thought Bruce Friedrich of the Good Food Institute[3] was ahead of his time for going vegan in 1987 - well, how about Lay in the 1700s? Many of these moral stances might seem unimpressive now, but back then I would bet under 1% of people held any one of them. These were deeply neglected and important causes, and Lay fought against the odds to make them tractable.
His creative protests were perhaps as impressive as his morals. He smashed fine china teacups in the street saying people cared more about the cups than the slaves that produced tea. He yelled out "there's another Negro master" when slave owners spoke in Quaker meetings. He even temporarily kidnapped a slave owner's child, so his parents would experience a taste of the pain parents back in Africa felt while their children were permanently kidnapped.
These protests stemmed from a deep spiritual devotion to do and proclaim the right thing - people's feelings and cultural norms be darned.
Extreme actions like these have potential to backfire, but Lay chose wisely to perform most protests within his own Quaker church. Perhaps he knew that within the Quakers lay fertile ground to change hearts and minds - despite it taking 50 years to make serious inroads. When the Quakers officially denounced slavery in 1758 - perhaps the first large organization to do so - a then feeble Lay, aged 77, exclaimed:
"Thanksgiving and praise be rendered unto the Lord God… I can now die in peace."
John Wesley - The Priest
"Employ whatever God has entrusted you with, in doing good, all possible good, in every possible kind and degree to the household of faith, to all men!" - John Wesley
A key early insight of the "Effective Altruism" movement was the power of "earning to give" - that we can do great good not just through direct deeds, but by earning as much money as possible and then giving it away to effective causes.
Yet one man had the same insight with similar depth of understanding 230 years earlier, outlined in just one sermon derived almost entirely from biblical principles.[4]
John Wesley preached extreme generosity as a clear mandate from Jesus. His message was simple but radical. Earn all you can, live simply to save money, then give the rest to good causes. Sounds great but who actually does that?
He also had deep insight into the pitfalls of earning to give. We should keep ourselves healthy and not overwork. We should sleep well and preserve "the spirit of a healthful mind". We should eschew evil on the path to the big bucks. And don't get rich while you're earning the big bucks, as you risk falling away from your faith and mission. He also understood that earning to give wasn't a path for e...