

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 17, 2024 • 11min
LW - Optimistic Assumptions, Longterm Planning, and "Cope" by Raemon
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Optimistic Assumptions, Longterm Planning, and "Cope", published by Raemon on July 17, 2024 on LessWrong.
Eliezer periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:
Saying "well, if this assumption doesn't hold, we're doomed, so we might as well assume it's true."
Worse: coming up with cope-y reasons to assume that the assumption isn't even questionable at all. It's just a pretty reasonable worldview.
Sometimes the questionable plan is "an alignment scheme, which Eliezer thinks avoids the hard part of the problem." Sometimes it's a sketchy reckless plan that's probably going to blow up and make things worse.
Some people complain about Eliezer being a doomy Negative Nancy who's overly pessimistic.
I had an interesting experience a few months ago when I ran some beta tests of my Planmaking and Surprise Anticipation workshop, which I think is illustrative.
i. Slipping into a more Convenient World
I have an exercise where I give people the instruction to play a puzzle game ("Baba is You"). Normally you have the ability to move around and interact with the world to experiment and learn things; in this exercise, you instead need to make a complete plan for solving the level, and you aim to get it right on your first try.
In the exercise, I have people write down the steps of their plan, and assign a probability to each step.
If there is a part of the puzzle-map that you aren't familiar with, you'll have to make guesses. I recommend making 2-3 guesses for how a new mechanic might work. (I don't recommend making a massive branching tree for every possible eventuality. For the sake of the exercise not taking forever, I suggest making 2-3 branching path plans)
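To make the arithmetic behind the exercise concrete, here is a minimal sketch (not from the original post; the step names and probabilities are invented, and it assumes the steps are independent) of how per-step probabilities multiply into an overall estimate of getting the level right on the first try:

```python
# Minimal sketch: per-step probabilities compound into an overall estimate,
# assuming the steps are independent. Step names and numbers are made up.

plan = {
    "push the rock onto the 'is' tile": 0.9,
    "new mechanic works the way I guessed": 0.7,
    "reach the flag before breaking 'win'": 0.8,
}

overall = 1.0
for step, p in plan.items():
    overall *= p

print(f"P(plan succeeds on the first try) ~= {overall:.2f}")  # ~0.50
```

Even a few fairly confident steps quickly pull the overall estimate well below any single step's probability.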
Several months ago, I had three young-ish alignment researchers do this task (each session was a 1-1 with just me and them).
Each of them looked at the level for a while and said "Well, this looks basically impossible... unless this [questionable assumption I came up with that I don't really believe in] is true. I think that assumption is... 70% likely to be true."
Then they went and executed their plan.
It failed. The questionable assumption was not true.
Then each of them said, again: "Okay, well, here's a different sketchy assumption that I wouldn't have thought was likely, except that if it's not true, the level seems unsolvable."
I asked "what's your probability for that one being true?"
"70%"
"Okay. You ready to go ahead again?" I asked.
"Yep", they said.
They tried again. The plan failed again.
And then they did it a third time, still saying ~70%.
This happened with three different junior alignment researchers, making a total of 9 predictions, which were wrong 100% of the time.
(The third guy, on the second or third time, said "well... okay, I was wrong last time. So this time let's say it's... 60%.")
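For a rough sense of how strong this evidence of overconfidence is, here is a quick back-of-the-envelope check (this calculation is not in the original post, and it assumes the nine predictions were independent):

```python
# If each of the nine predictions really had a 70% chance of being right,
# and the predictions were independent, how likely is it that all nine
# turn out wrong?

p_right = 0.7
n = 9
p_all_wrong = (1 - p_right) ** n
print(f"P(all {n} wrong | 70% calibrated) = {p_all_wrong:.6f}")  # ~0.000020
```

Under those assumptions, a 9-for-9 miss rate is roughly a 1-in-50,000 event, which is hard to square with the stated 70% confidence.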
My girlfriend ran a similar exercise with another group of young smart people, with similar results. "I'm 90% sure this is going to work" ... "okay that didn't work."
Later I ran the exercise again, this time with a mix of younger and more experienced AI safety folk, several of whom leaned more pessimistic. I think the group overall did better.
One of them actually made the correct plan on the first try.
One of them got it wrong, but gave an appropriately low estimate for themselves.
Another of them (call them Bob) made three attempts, and gave themselves ~50% odds on each attempt. They went into the experience thinking "I expect this to be hard but doable, and I believe in developing the skill of thinking ahead like this."
But, after each attempt, Bob was surprised by how out-of-left field their errors were. They'd predicted they'd be surprised... but they were surprised in surprising ways - even in a simplified, toy domain that was optimized for ...

Jul 17, 2024 • 10min
EA - Tarbell is hiring for 3 roles by Cillian
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Tarbell is hiring for 3 roles, published by Cillian on July 17, 2024 on The Effective Altruism Forum.
Summary
Tarbell supports independent journalism covering AI. Since 2022, we've placed journalists at TIME, MIT Tech Review, and The Information, and supported another to write for The New Yorker.
We're hiring for three exciting roles to grow our organisation:
Fellowship Manager ($75-100k): a skilled project manager to own & expand our Tarbell Fellowship.
Operations Manager / Associate ($55-100k): implement & improve Tarbell's administrative & operational processes.
Special Projects Manager ($75-100k): lead one of our existing programmes or launch an entirely new initiative.
About Tarbell
Tarbell supports independent journalism covering AI. Our mission is to build a global community of expert journalists covering artificial intelligence.
We run three programmes:
Tarbell Fellowship: Our flagship programme provides early career journalists with training, a stipend of up to $50,000, and a 9-month placement at a major newsroom. 2024 placements include MIT Technology Review, TIME, Euractiv, The Information, and Lawfare.
Journalists-in-residence: We provide senior writers with funding to pursue longer investigations, explore entrepreneurial projects, and educate themselves on artificial intelligence. To date, we've supported Shakeel Hashim and Nathaniel Popper.
Tarbell Grants (coming soon): We'll provide awards of $1k-$15k for impactful reporting on artificial intelligence and its impacts.
Fellowship Manager
Key details
Deadline: Apply by Sunday July 21, 2024
Start date: September 2024 (flexible)
Hours: 40/week (flexible)
Location: London (preferred) / Remote
Reporting to: Cillian Crosson (Executive Director)
Compensation: $75,000 - $100,000
Responsibilities
As Fellowship Manager, you'll be responsible for managing and continually improving the Tarbell Fellowship.
Responsibilities might include:
Programme managing the Tarbell Fellowship. You'll be responsible for overseeing all aspects of the Fellowship. This includes setting goals, tracking progress towards them, and generally ensuring that all work is executed on schedule to a high level.
Matching Tarbell Fellows with placements at top news organisations. In 2024, we successfully placed Fellows at MIT Tech Review, TIME, Lawfare, The Information, Euractiv, and Ars Technica. You'll maintain strong relationships with existing host outlets and forge partnerships with new placement organisations.
Attracting talented and aspiring tech journalists to apply for the Tarbell Fellowship. You'll coordinate our marketing strategy.
Selecting a strong cohort of early career journalists from an initial pool of >1,000 candidates. You'll run a multi-stage application process: reviewing applications, blind-grading writing samples, and interviewing candidates to identify the most promising journalists.
Improving our 10-week AI Journalism Fundamentals course by adding new modules that better prepare Tarbell Fellows for their newsroom placements.
Facilitating training sessions on topics in AI and/or key journalism skills.
Leading a team. As the programme expands, we anticipate that you will recruit, develop, and manage a small team.
What we are looking for
You might be a particularly good fit for this role if you are:
Organised and competent at project management. We are looking for someone with experience managing complex projects with multiple stakeholders. You should feel confident with goal-tracking, system-building, and keeping teams on track to meet ambitious deadlines.
Someone with a strong understanding of AI, or the ability to learn this quickly. You understand basic concepts in machine learning (e.g. transformers, gradient descent), are familiar with various governance topics (e.g. responsible scaling policies, compute governance, capabilities evaluations)...

Jul 17, 2024 • 16min
EA - Silent cosmic rulers by Magnus Vinding
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Silent cosmic rulers, published by Magnus Vinding on July 17, 2024 on The Effective Altruism Forum.
In this post, I wish to outline an alternative picture to the grabby aliens model proposed by Hanson et al. (2021). The grabby aliens model assumes that "grabby aliens" expand far and wide in the universe, make clearly visible changes to their colonized volumes, and immediately prevent life from emerging in those volumes.
In contrast, the picture I explore here involves what we may call "quiet expansionist aliens". This model also involves expansion far and wide, but unlike in the grabby aliens model, the expansionist aliens in this model do not make clearly visible changes to their colonized volumes, and they do not immediately prevent life from emerging in those volumes - although they do prevent emerging civilizations from developing to the point of rivaling the quiet expansionists' technology and power.
The reason I explore this alternative picture is that I think it is a neglected possible model for hypothetical alien expansion. I am not claiming that it is the most plausible model a priori, but I think it is too plausible for it to be altogether dismissed, as it generally seems to be.
1. What changes in the quiet expansionist model?
The most obvious change in this model compared to the grabby aliens model is that we would not be able to see a colonized volume from afar, and perhaps not even from up close. Likewise, the quiet expansionist model implies that there would be more instances of evolved life, including observers like us, since the expansionist aliens would not immediately prevent such observers from emerging within their colonized volumes; they could instead stay around and observe.
Taken together, this means that quiet expansionist aliens could in theory be here already, and they could even have a lot of experience interacting with civilizations at our stage of development.
Note that the grabby aliens model and the quiet expansionist model need not be mutually exclusive, as they could in principle be combined. That is, one could have a model in which there are both grabby (i.e. clearly visible) and quiet expansionist aliens that each rule their respective volumes, and different versions of the model could vary the relative proportion of these different colonization styles.
(The original grabby aliens model only involves clearly visible expansionist aliens, not quiet expansionist ones; that is a helpful simplifying assumption, but it is worth being clear that it may be wrong.)
2. Arguments against the quiet expansionist model
A reason the quiet expansionist model is rarely taken seriously is that there seem to be some compelling arguments against it. Let us therefore try to explore a couple of these arguments, to see how compelling they are and what they should lead us to conclude.
2.1 "Implausible motive"
One argument is that it is implausible that an expansionist civilization would not visibly change its colonized volume. In particular, it is difficult to see what kind of underlying motive could make sense of such cosmic silence. The default expectation appears to be that we should instead see overt signs of colonization.
How convincing is this as an argument against the plausibility of quiet expansionist aliens? In order to evaluate that, it seems helpful to first outline what could, speculatively, be some motives behind quiet expansion. For example, it is conceivable that quiet expansion could aid internal coordination and alignment in a civilization that spans numerous star systems and perhaps even countless galaxies.
By staying minimally concentrated and diversified across its colonization volume, a civilization might minimize risks of internal drift and conflict.
Another potential reason to stay silent is to try to learn about emerging civilizati...

Jul 17, 2024 • 1min
LW - Turning Your Back On Traffic by jefftk
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Turning Your Back On Traffic, published by jefftk on July 17, 2024 on LessWrong.
We do a lot of walking around the neighborhood with kids, which usually involves some people getting to intersections a while before others. I'm not worried about even the youngest going into the street on their own - Nora's been street trained for about a year - but we have to be careful about what signals we send to cars. Someone standing at an intersection facing traffic looks to a driver like they're waiting for the opportunity to cross.
Waving drivers to continue doesn't work well: they tend to slow down significantly, and many of them will wave back in a misguided attempt at "no, you first" politeness. Instead, what seems to work well is turning your back to the street.
This isn't perfect: some drivers still read anyone stationary near an intersection as intending to cross, but it's pretty good. And it's especially good for little kids: not only do they often like to look intently at passing traffic in a way that is concerning to drivers and passers by, but it's a clear signal to the parent that the kid knows it's not time to cross yet.
Comment via: facebook, mastodon
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 17, 2024 • 23min
EA - Rethink Priorities' Moral Parliament Tool by Derek Shiller
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Rethink Priorities' Moral Parliament Tool, published by Derek Shiller on July 17, 2024 on The Effective Altruism Forum.
Link to tool: https://parliament.rethinkpriorities.org
(1 min) Introductory Video
(6 min) Basic Features Video
Executive Summary
This post introduces Rethink Priorities' Moral Parliament Tool, which models ways an agent can make decisions about how to allocate goods in light of normative uncertainty.
We treat normative uncertainty as uncertainty over worldviews. A worldview encompasses a set of normative commitments, including first-order moral theories, values, and attitudes toward risk. We represent worldviews as delegates in a moral parliament who decide on an allocation of funds to a diverse array of charitable projects.
Users can configure the parliament to represent their own credences in different worldviews and choose among several procedures for finding their best all-things-considered philanthropic allocation.
The relevant procedures are metanormative methods. These methods take worldviews and our credences in them as inputs and produce some action guidance as an output. Some proposed methods have taken inspiration from political or market processes involving agents who differ in their conceptions of the good and their decision-making strategies. Others have modeled metanormative uncertainty by adapting tools for navigating empirical uncertainty.
We show that empirical and metanormative assumptions can each make large differences in the outcomes. Moral theories and metanormative methods differ in their sensitivity to particular changes.
We also show that, taking the results of the EA Survey as inputs to a moral parliament, no one portfolio is clearly favored. The recommended portfolios vary dramatically based on your preferred metanormative method.
By modeling these complexities, we hope to facilitate more transparent conversations about normative uncertainty, metanormative uncertainty, and resource allocation.
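As a rough illustration of what one simple metanormative method does with these inputs, here is a hedged sketch (the worldviews, credences, projects, and numbers are invented, and this is not the tool's actual code): it takes a credence-weighted average of the allocation each worldview delegate would choose on its own.

```python
# Toy sketch of one simple metanormative method: a credence-weighted average
# of the allocations each worldview delegate would pick by itself.
# All names and numbers here are illustrative, not Rethink Priorities' data.

credences = {"total utilitarian": 0.5, "animal-focused": 0.3, "longtermist": 0.2}

# Each worldview's preferred split of a fixed budget across projects.
preferred = {
    "total utilitarian": {"global health": 0.6, "animal welfare": 0.2, "AI safety": 0.2},
    "animal-focused":    {"global health": 0.1, "animal welfare": 0.8, "AI safety": 0.1},
    "longtermist":       {"global health": 0.1, "animal welfare": 0.1, "AI safety": 0.8},
}

projects = ["global health", "animal welfare", "AI safety"]
portfolio = {
    proj: sum(credences[w] * preferred[w][proj] for w in credences)
    for proj in projects
}

for proj, share in portfolio.items():
    print(f"{proj}: {share:.0%}")
# global health: 35%, animal welfare: 36%, AI safety: 29%
```

Other procedures in the tool (e.g. parliamentary bargaining or maximizing expected choiceworthiness) take the same inputs but can produce quite different portfolios, which is the point the post goes on to make.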
Introduction
Decisions about how to do the most good inherently involve moral commitments about what is valuable and which methods for achieving the good are permissible. However, there is deep disagreement about central moral claims that influence our cause prioritization:
How much do animals matter?
Should we prioritize present people over future people?
Should we aim to maximize overall happiness or also care about things like justice or artistic achievement?
The answers to these questions can have significant effects on which causes are most choiceworthy. Understandably, many individuals feel some amount of moral uncertainty, and individuals within groups (such as charitable organizations and moral communities) may have different moral commitments. How should we make decisions in light of such uncertainty?
Rethink Priorities' Moral Parliament Tool allows users to evaluate decisions about how to allocate goods in light of uncertainty over different worldviews. A worldview encompasses a set of normative commitments, including first-order moral theories, values, and attitudes toward risk.[1] We represent worldviews as delegates in a moral parliament who decide on an allocation of funds to a diverse array of charitable projects.
Users can configure the parliament to represent their own credences in different worldviews and choose among several procedures for finding their best all-things-considered philanthropic allocation.
How does it work?
The Moral Parliament tool has three central components: Worldviews, Projects, and Allocation Strategies for making decisions in light of worldview uncertainty. It embodies a three-stage strategy for navigating uncertainty:
What are the worldviews in which I place some non-trivial credence?
What do they individually recommend that I do?
How do I aggregate and arbitrate among these recommendations...

Jul 17, 2024 • 1min
EA - The Giving Green Fund is evolving by Giving Green
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Giving Green Fund is evolving, published by Giving Green on July 17, 2024 on The Effective Altruism Forum.
Giving Green's mission is to direct climate mitigation funding towards the highest-impact projects possible. We wanted to give a short update from the Giving Green Fund, with links to more details.
1. The Giving Green Fund received an anonymous gift of 10M USD in April, to be used for granting to high-impact climate organizations. We intend to allocate all these funds by the end of 2024.
2. In reaction to this large gift, we updated our fund strategy to support high-impact initiatives beyond our list of top climate nonprofits.
3. In 2024, we are likely to recommend grants in some subset of our priority funding areas:
1. Industrial decarbonization
2. Decreasing livestock emissions
3. Carbon removal
4. Supporting the energy transition in low- and middle-income countries (LMICs)
5. Nuclear power
6. Solar geoengineering governance and coordination
Please feel free to reach out with any questions: givinggreen@idinsight.org.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 17, 2024 • 23min
EA - Announcing Open Philanthropy's AI governance and policy RFP by JulianHazell
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing Open Philanthropy's AI governance and policy RFP, published by JulianHazell on July 17, 2024 on The Effective Altruism Forum.
AI has enormous beneficial potential if it is governed well. However, in line with a growing contingent of AI (and other) experts from academia, industry, government, and civil society, we also think that AI systems could soon (e.g. in the next 15 years) cause catastrophic harm. For example, this could happen if malicious human actors deliberately misuse advanced AI systems, or if we lose control of future powerful systems designed to take autonomous actions.[1]
To improve the odds that humanity successfully navigates these risks, we are soliciting short expressions of interest (EOIs) for funding for work across six subject areas, described below.
Strong applications might be funded by Good Ventures (Open Philanthropy's partner organization), or by any of >20 (and growing) other philanthropists who have told us they are concerned about these risks and are interested to hear about grant opportunities we recommend.[2] (You can indicate in your application whether we have permission to share your materials with other potential funders.)
As this is a new initiative, we are uncertain about the volume of interest we will receive. Our goal is to keep this form open indefinitely; however, we may need to temporarily pause accepting EOIs if we lack the staff capacity to properly evaluate them. We will post any updates or changes to the application process on this page.
Anyone is eligible to apply, including those working in academia, nonprofits, industry, or independently.[3] We will evaluate EOIs on a rolling basis. See below for more details.
If you have any questions, please email us. If you have any feedback about this page or program, please let us know (anonymously, if you want) via this short feedback form.
1. Eligible proposal subject areas
We are primarily seeking EOIs in the following subject areas, but will consider exceptional proposals outside of these areas, as long as they are relevant to mitigating catastrophic risks from AI:
Technical AI governance: Developing and vetting technical mechanisms that improve the efficacy or feasibility of AI governance interventions, or answering technical questions that can inform governance decisions. Examples include compute governance, model evaluations, technical safety and security standards for AI developers, cybersecurity for model weights, and privacy-preserving transparency mechanisms.
Policy development: Developing and vetting government policy proposals in enough detail that they can be debated and implemented by policymakers. Examples of policies that seem like they might be valuable (but which typically need more development and debate) include some of those mentioned e.g. here, here, and here.
Frontier company policy: Developing and vetting policies and practices that frontier AI companies could volunteer or be required to implement to reduce risks, such as model evaluations, model scaling "red lines" and "if-then commitments," incident reporting protocols, and third-party audits. See e.g. here, here, and here.
International AI governance: Developing and vetting paths to effective, broad, and multilateral AI governance, and working to improve coordination and cooperation among key state actors. See e.g. here.
Law: Developing and vetting legal frameworks for AI governance, exploring relevant legal issues such as liability and antitrust, identifying concrete legal tools for implementing high-level AI governance solutions, encouraging sound legal drafting of impactful AI policies, and understanding the legal aspects of various AI policy proposals. See e.g. here.
Strategic analysis and threat modeling: Improving society's understanding of the strategic landscape around transformative ...

Jul 17, 2024 • 3min
LW - Why the Best Writers Endure Isolation by Declan Molony
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why the Best Writers Endure Isolation, published by Declan Molony on July 17, 2024 on LessWrong.
Douglas Adams, author of The Hitchhiker's Guide to the Galaxy, was once locked in a room for three weeks until he completed one of his books.
Victor Hugo, when faced with a deadline for his book The Hunchback of Notre Dame, locked all his clothes away except for a large shawl. "Lacking any suitable clothing to go outdoors, [he] was no longer tempted to leave the house and get distracted. Staying inside and writing was his only option." Six months later, the book was published.
Dozens of famous authors have done the same. Names like Virginia Woolf, Henry David Thoreau, Mark Twain - all of them constructed small writing sheds from which to work. Names like Ian Fleming, Maya Angelou, and George Orwell - the first two penned their novels while locked in hotel rooms, while Orwell isolated himself on a remote Scottish island to write.
One explanation for this reclusive behavior comes from author Neil Gaiman in an interview he did with Tim Ferriss a few years ago. Ferriss mentioned Gaiman's most important rule for writing:
You can sit here and write, or you can sit here and do nothing. But you can't sit here and do anything else.
Gaiman, after a moment of reflection, responded by saying:
I would go down to my lovely little gazebo [at the] bottom of the garden [and] sit down. I'm absolutely allowed not to do anything. I'm allowed to sit at my desk. I'm allowed to stare out at the world. I'm allowed to do anything I like, as long as it isn't anything. Not allowed to do a crossword; not allowed to read a book; not allowed to phone a friend. All I'm allowed to do is absolutely nothing or write.
What I love about that is I'm giving myself permission to write or not write. But writing is actually more interesting than doing nothing after a while. You sit there and you've been staring out the window now for five minutes, and it kind of loses its charm. You [eventually think], "well actually…[I] might as well write something."
Writing is hard. Between writing or doing anything else, most writers - even some of the most accomplished ones - acquiesce to distraction. That's why so many of them construct and work in environments devoid of external stimuli - the better to circumvent akrasia.
I do all my writing in coffee shops. Similar to Gaiman, I allow myself to do one of two things: write, or people-watch. I don't bring anything with me except for a pencil, paper, and my research material housed in my journals. That means no phone, no laptop, and no watch (even knowing the time is a kind of distraction and pressure to perform).
Within this environment, I end up writing because I've made it the path of least resistance.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 17, 2024 • 8min
LW - DM Parenting by Shoshannah Tekofsky
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: DM Parenting, published by Shoshannah Tekofsky on July 17, 2024 on LessWrong.
Cause no one will question your ethics if you refer to yourself as a Dungeon Mom.
I snort experimentation to feel alive. It's a certain type of orientation to life, completely at odds with all parenting advice about predictability and routine.
Enter DM parenting. Where you approach every parenting task as a Dungeons and Dragons session where you are shepherding a team of pure outliers on the enthusiasm-skill spectrum through the Sisyphean ordeal of rolling their toothbrush up hi … no, wait, stop that!
Anyway.
You need them to fight the BBEG cause otherwise you are not having fun, but who says they wouldn't rather murder hobo their way through the local dairy supply chain? As a DM, you have to juggle an objective, your own enjoyment, and the enjoyment of your players.
This is basically parenting.
Of course, as a DM, you generally play with people who have opted in while playing according to a rule set someone lovingly crafted for you.
Luckily kids love to play, and if you pick the right rule set, they will probably be game. Except no one wrote any rule sets on how to DM kids into their pyjamas.
Till now.
My kids are young - 3 and 5. These rules work far better for the older of the two. I assume they will keep working better till they become old enough to build their own rules, but here is where we got in the last 2 weeks or so:
Bedtime Rules
Peekaboo
You close your eyes and keep them closed while your kid still needs to get ready for bed. But of course, you try to check if everything is going ok by blindly reaching out your hands. I'd recommend exaggerating your ineptitude at determining if the little one has actually put on their pajama. It can also be fun to let them advise you on how to navigate the environment. The perspective-taking training on this one seems to lead to additional giggles.
Tickle Station
Every time your kid does a bedtime task, they can dock into the tickle station and get tickled by you! Personally I made a tickle station by just reaching out my arms and pretending I was a booth. Some warning here that some kids do not like to be tickled so explicitly check if they find this fun, and also, crucially, let them come to you to receive tickles. In our case, my kiddos love being tickled.
It has gotten to the point that the tickle station has become a bit of an emotional regulation option with me and the kids now, cause it helps them out of a funk quite easily.
Walk a Mile…
… in momma's (or papa's) shoes. Just let them wear your shoes while going through the entire bedtime routine. This was kind of amusing to watch. Might be important to keep them away from stairwells and the like.
Duel Shots
Grab two clothes pins and an elastic band. Hook the elastic band around the (closed) front of the clothes pin and pull back. You can now shoot rubber bands without them snapping your fingers. For the rule set, you both get one clothes pin. For each step of the bed time routine you shoot your kid with the elastic band and they can shoot you back.
Obviously, this can hurt quite a bit, so as an opt out either of you can shout "mirror" and then the other person will have to shoot the mirror image of you instead.
You may now discover if your child has ever shot an elastic band before. Mine had not. The mechanics of aim and force were a complete mystery to her. If you find yourself in this situation then an updated rule set is that the shooter can keep going till they hit. The result in our household was a lot of delight and the absolute slowest bedtime routine yet.
Ghost
You wear a blanket over your head and try to catch the kid while they are putting on their pyjama. If they get too excited they may fail to put on their pyjama all together. If they get sad about being caught, you can tr...

Jul 16, 2024 • 8min
AF - Simplifying Corrigibility - Subagent Corrigibility Is Not Anti-Natural by Rubi Hudson
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simplifying Corrigibility - Subagent Corrigibility Is Not Anti-Natural, published by Rubi Hudson on July 16, 2024 on The AI Alignment Forum.
Max Harms recently published an interesting series of posts on corrigibility, which argue that corrigibility should be the sole objective we try to give to a potentially superintelligent AI. A large installment in the series is dedicated to cataloging the properties that make up such a goal, with open questions including whether the list is exhaustive and how to trade off between the items that make it up.
I take the opposite approach to thinking about corrigibility. Rather than trying to build up a concept of corrigibility that comprehensively solves the alignment problem, I believe it is more useful to cut the concept down to a bare minimum. Make corrigibility the simplest problem it can be, and try to solve that.
In a recent blog post comparing corrigibility to deceptive alignment, I treated corrigibility simply as a lack of resistance to having goals modified, and I find it valuable to stay within that scope. Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can't be straightforwardly captured in a ranking of end states.
Why does this definition of corrigibility matter? It's because properties that are not anti-natural can be explicitly included in the desired utility function.
Following that note, this post is not intended as a response to Max's work, but rather to MIRI and their 2015 paper Corrigibility. Where Max thinks the approach introduced by that paper is too narrow, I don't find it narrow enough. In particular, I make the case that corrigibility does not require ensuring subagents and successors are corrigible, as that can better be achieved by directly modifying a model's end goals.
Corrigibility (2015)
The Corrigibility paper lists five desiderata as proposed minimum viable requirements for a solution to corrigibility. The focus is on shut down, but I also think of it as including goal modification, as that is equivalent to being shut down and replaced with another AI.
1. The agent shuts down when properly requested
2. The agent does not try to prevent itself from being shut down
3. The agent does not try to cause itself to be shut down
4. The agent does not create new incorrigible agents
5. Subject to the above constraints, the agent optimizes for some goal
MIRI does not present these desiderata as a definition for corrigibility, but rather as a way to ensure corrigibility while still retaining usefulness. An AI that never takes actions may be corrigible, but such a solution is no help to anyone. However, taking that bigger picture view can obscure which of those aspects define corrigibility itself, and therefore which parts of the problem are anti-natural to solve.
My argument is that the second criterion alone provides the most useful definition of corrigibility. It represents the only part of corrigibility that is anti-natural. While the other properties are largely desirable for powerful AI systems, they're distinct attributes and can be addressed separately.
To start paring down the criteria, the fifth just states that some goal exists to be made corrigible, rather than being corrigibility itself. The first criterion is implied by the second after channels for shut down have been set up.
Property three aims at making corrigible agents useful, rather than being inherent to corrigibility. It preempts a naive strategy that incentivizes shut down by simply giving the agent high utility for doing so. However, beyond not being part of corrigibility, it also goes too far for optimal usefulness - in certain situations we would like agents to have us shut them off or modify them (some even consider this to be part of corrigibility).
Weakening this desideratum to avoid incentivi...


