

The Nonlinear Library: LessWrong
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jun 18, 2024 • 1h 2min
LW - Loving a world you don't trust by Joe Carlsmith
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Loving a world you don't trust, published by Joe Carlsmith on June 18, 2024 on LessWrong.
(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)
This is the final essay in a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a brief summary of the series as a whole. There's also a PDF of the whole series here.
(Warning: spoilers for Angels in America; and moderate spoilers for Harry Potter and the Methods of Rationality.)
"I come into the presence of still water..."
~Wendell Berry
A lot of this series has been about problems with yang - that is, with the active element in the duality of activity vs. receptivity, doing vs. not-doing, controlling vs. letting go.[1] In particular, I've been interested in the ways that "deep atheism" (that is, a fundamental mistrust towards Nature, and towards bare intelligence) can propel itself towards an ever-more yang-y, controlling relationship to Otherness, and to the universe as a whole.
I've tried to point at various ways this sort of control-seeking can go wrong in the context of AGI, and to highlight a variety of less-controlling alternatives (e.g. "gentleness," "liberalism/niceness/boundaries," and "green") that I think have a role to play.[2]
This is the final essay in the series. And because I've spent so much time on potential problems with yang, and with deep atheism, I want to close with an effort to make sure I've given both of them their due, and been clear about my overall take. To this end, the first part of the essay praises certain types of yang directly, in an effort to avoid over-correction towards yin.
The second part praises something quite nearby to deep atheism that I care about a lot - something I call "humanism." And the third part tries to clarify the depth of atheism I ultimately endorse. In particular, I distinguish between trust in the Real, and various other attitudes towards it - attitudes like love, reverence, loyalty, and forgiveness. And I talk about ways these latter attitudes can still look the world's horrors in the eye.
In praise of yang
Let's start with some words in praise of yang.
In praise of black
Recall "black," from my essay on green. Black, on my construal of the colors, is the color for power, effectiveness, instrumental rationality - and hence, perhaps, the color most paradigmatically associated with yang. And insofar as I was especially interested in green qua yin, black was green's most salient antagonist.
So I want to be clear: I think black is great.[3] Or at least, some aspects of it. Not black qua ego. Not black that wants power and domination for its sake.[4] Rather: black as the color of not fucking around. Of cutting through the bullshit; rejecting what Lewis calls "soft soap"; refusing to pretend things are prettier, or easier, or more comfortable; holding fast to the core thing.
I wrote, in my essay on sincerity, about the idea of "seriousness." Black, I think, is the most paradigmatically serious color.
And it's the color of what Yudkowsky calls "the void" - that nameless, final virtue of rationality; the one that carries your movement past your map, past the performance of effort, and into contact with the true goal.[5] Yudkowsky cites Miyamoto Musashi:
The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means... If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him. More than anything, you must be thinking of carrying your movement through to cutting him.
Musashi (image source here)
In this sense, I think, black is the color of actually caring. That is: one becomes serious, centrally, when there are stak...

Jun 18, 2024 • 3min
LW - Boycott OpenAI by PeterMcCluskey
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Boycott OpenAI, published by PeterMcCluskey on June 18, 2024 on LessWrong.
I have canceled my OpenAI subscription in protest over OpenAI's lack of ethics.
In particular, I object to:
threats to confiscate departing employees' equity unless those employees signed a life-long non-disparagement contract
Sam Altman's pattern of lying about important topics
I'm trying to hold AI companies to higher standards than I use for typical companies, due to the risk that AI companies will exert unusual power.
A boycott of OpenAI subscriptions seems unlikely to gain enough attention to meaningfully influence OpenAI. Where I hope to make a difference is by discouraging competent researchers from joining OpenAI unless they clearly reform (e.g. by firing Altman). A few good researchers choosing not to work at OpenAI could make the difference between OpenAI being the leader in AI 5 years from now versus being, say, a distant 3rd place.
A year ago, I thought that OpenAI equity would be a great investment, but that I had no hope of buying any. But the value of equity is heavily dependent on trust that a company will treat equity holders fairly. The legal system helps somewhat with that, but it can be expensive to rely on the legal system. OpenAI's equity is nonstandard in ways that should create some unusual uncertainty.
Potential employees ought to question whether there's much connection between OpenAI's future profits and what equity holders will get.
How does OpenAI's behavior compare to other leading AI companies?
I'm unsure whether Elon Musk's xAI deserves a boycott, partly because I'm unsure whether it's a serious company. Musk has a history of breaking contracts that bears some similarity to OpenAI's attitude. Musk also bears some responsibility for SpaceX requiring non-disparagement agreements.
Google has shown some signs of being evil. As far as I can tell, DeepMind has been relatively ethical. I've heard clear praise of Demis Hassabis's character from Aubrey de Grey, who knew Hassabis back in the 1990s. Probably parts of Google ought to be boycotted, but I encourage good researchers to work at DeepMind.
Anthropic seems to be a good deal more ethical than OpenAI. I feel comfortable paying them for a subscription to Claude Opus. My evidence concerning their ethics is too weak to say more than that.
P.S. Some of the better sources to start with for evidence against Sam Altman / OpenAI:
a lengthy Zvi post about one week's worth of evidence
Leopold Aschenbrenner
Geoffrey Irving
But if you're thinking of working at OpenAI, please look at more than just those sources.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 18, 2024 • 14min
LW - On DeepMind's Frontier Safety Framework by Zvi
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On DeepMind's Frontier Safety Framework, published by Zvi on June 18, 2024 on LessWrong.
On DeepMind's Frontier Safety Framework
Previously: On OpenAI's Preparedness Framework, On RSPs.
The First Two Frameworks
To first update on Anthropic and OpenAI's situation here:
Anthropic's RSP continues to miss the definitions of the all-important later levels, in addition to other issues, although it is otherwise promising. It has now been a number of months, and it is starting to be concerning that nothing has changed. They are due for an update.
OpenAI also has not updated its framework.
I am less down on OpenAI's framework choices than Zac Stein-Perlman was in the other review I have seen. I think that if OpenAI implemented the spirit of what it wrote down, that would be pretty good. The Critical-level thresholds listed are too high, but the Anthropic ASL-4 commitments are still unspecified. An update is needed, but I appreciate the concreteness.
The bigger issue with OpenAI is the two contexts around the framework.
First, there's OpenAI. Exactly.
A safety framework you do not adhere to is worth nothing. A safety framework where you adhere to the letter but not the spirit is not worth much.
Given what we have learned about OpenAI - their decision to break their very public commitments about committing compute to superalignment, their driving out of their top safety people, their failure to have a means for reporting safety issues (including retaliating against Leopold when he went to the board about cybersecurity), and all that other stuff - why should we have any expectation that what is written down in their framework is meaningful?
What about the other practical test? Zac points out that OpenAI did not share the risk-scorecard for GPT-4o. They also did not share much of anything else. This is somewhat forgivable given the model is arguably not actually at core stronger than GPT-4 aside from its multimodality. It remains bad precedent, and an indication of bad habits and poor policy.
Then there is Microsoft. OpenAI shares all their models with Microsoft, and the framework does not apply to Microsoft at all. Microsoft's track record on safety is woeful. Their submission at the UK Summit was very weak. Their public statements around safety are dismissive, including their intention to 'make Google dance.' Microsoft Recall shows the opposite of a safety mindset, and they themselves have been famously compromised recently.
Remember Sydney? Microsoft explicitly said they got safety committee approval for their tests in India, then had to walk that back. Even what procedures they have, which are not much, they have broken. This is in practice a giant hole in OpenAI's framework.
This is in contrast to Anthropic, who are their own corporate overlord, and DeepMind, whose framework explicitly applies to all of Google.
The DeepMind Framework
DeepMind finally has its own framework. Here is the blog post version.
So first things first. Any framework at all, even a highly incomplete and unambitious one, is far better than none at all. Much better to know what plans you do have, and that they won't be enough, so we can critique and improve. So thanks to DeepMind for stepping up, no matter the contents, as long as it is not the Meta Framework.
There is extensive further work to be done, as they acknowledge. This includes all plans on dealing with misalignment. The current framework only targets misuse.
With that out of the way: Is the DeepMind framework any good?
In the Framework, we specify protocols for the detection of capability levels at which models may pose severe risks (which we call "Critical Capability Levels (CCLs)"), and articulate a spectrum of mitigation options to address such risks. We are starting with an initial set of CCLs in the domains of Autonomy, Biosecurity, Cybersec...

Jun 18, 2024 • 10min
LW - D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset by aphyer
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset, published by aphyer on June 18, 2024 on LessWrong.
This is a follow-up to last week's D&D.Sci scenario: if you intend to play that, and haven't done so yet, you should do so now before spoiling yourself.
There is a web interactive here you can use to test your answer, and generation code available here if you're interested, or you can read on for the ruleset and scores.
RULESET
There are two steps to brewing a potion:
STEP 1: MAGICAL POTENCY
Any ingredient that doesn't exist in the mundane world is Magical, while any ingredient that exists in the mundane world is not:
Magical: Angel Feather, Beholder Eye, Demon Claw, Dragon Scale, Dragon Spleen, Dragon Tongue, Dragon's Blood, Ectoplasm, Faerie Tears, Giant's Toe, Troll Blood, Vampire Fang
Not Magical: Badger Skull, Beech Bark, Crushed Diamond, Crushed Onyx, Crushed Ruby, Crushed Sapphire, Eye of Newt, Ground Bone, Oaken Twigs, Powdered Silver, Quicksilver, Redwood Sap
The first step of potion-brewing is to dissolve the magical potency out of the Magical Ingredients to empower your potion. This requires the right amount of Magical Ingredients: too few, and nothing magical will happen and you will produce Inert Glop, while too many and there will be an uncontrolled Magical Explosion.
If you include:
0-1 Magical Ingredients: 100% chance of Inert Glop.
2 Magical Ingredients: 50% chance of Inert Glop, 50% chance OK.
3 Magical Ingredients: 100% chance OK.
4 Magical Ingredients: 50% chance OK, 50% chance Magical Explosion.
5+ Magical Ingredients: 100% chance Magical Explosion.
If your potion got past this step OK, move on to:
STEP 2: DIRECTION
Some ingredients are used to direct the magical power into the desired resulting potion. Each potion has two required Key Ingredients, both of which must be included to make it:
Potion - Key Ingredient 1 + Key Ingredient 2
Barkskin Potion* - Crushed Onyx + Ground Bone
Farsight Potion - Beholder Eye + Eye of Newt
Fire Breathing Potion - Dragon Spleen + Dragon's Blood
Fire Resist Potion - Crushed Ruby + Dragon Scale
Glibness Potion - Dragon Tongue + Powdered Silver
Growth Potion - Giant's Toe + Redwood Sap
Invisibility Potion - Crushed Diamond + Ectoplasm
Necromantic Power Potion* - Beech Bark + Oaken Twigs
Rage Potion - Badger Skull + Demon Claw
Regeneration Potion - Troll Blood + Vampire Fang
*Well. Sort of. See the Bonus Objective section below.
Some ingredients (Angel Feather, Crushed Sapphire, Faerie Tears and Quicksilver) aren't Key Ingredients for any potion in the dataset. Angel Feather and Faerie Tears are nevertheless useful - as magical ingredients that don't risk creating any clashing potion, they're good ways to add magical potential to a recipe. Crushed Sapphire and Quicksilver have no effect; including them is entirely wasteful.
If you've gotten through Step 1, the outcome depends on how many potions' full Key Ingredient pairs you've included:
0 potions: with nothing to direct it, the magical potential dissolves into an Acidic Slurry.
1 potion: you successfully produce that potion.
2 or more potions:
Sometimes (1/n of the time, where n is # of potions you included) a random one of the potions will dominate, and you will produce that one.
The rest of the time, the clashing directions will produce Mutagenic Ooze.
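To make the two steps concrete, here is a minimal Python sketch of the ruleset as described above (my own illustrative reconstruction, not aphyer's generation code; the Bonus Objective twist on the starred potions is ignored). The worked examples below can be checked against it.

```python
import random

MAGICAL = {
    "Angel Feather", "Beholder Eye", "Demon Claw", "Dragon Scale",
    "Dragon Spleen", "Dragon Tongue", "Dragon's Blood", "Ectoplasm",
    "Faerie Tears", "Giant's Toe", "Troll Blood", "Vampire Fang",
}

KEY_INGREDIENTS = {
    "Barkskin Potion": {"Crushed Onyx", "Ground Bone"},
    "Farsight Potion": {"Beholder Eye", "Eye of Newt"},
    "Fire Breathing Potion": {"Dragon Spleen", "Dragon's Blood"},
    "Fire Resist Potion": {"Crushed Ruby", "Dragon Scale"},
    "Glibness Potion": {"Dragon Tongue", "Powdered Silver"},
    "Growth Potion": {"Giant's Toe", "Redwood Sap"},
    "Invisibility Potion": {"Crushed Diamond", "Ectoplasm"},
    "Necromantic Power Potion": {"Beech Bark", "Oaken Twigs"},
    "Rage Potion": {"Badger Skull", "Demon Claw"},
    "Regeneration Potion": {"Troll Blood", "Vampire Fang"},
}

def brew(ingredients):
    ingredients = set(ingredients)

    # Step 1: magical potency depends on the number of Magical Ingredients.
    n_magic = len(ingredients & MAGICAL)
    if n_magic <= 1 or (n_magic == 2 and random.random() < 0.5):
        return "Inert Glop"
    if n_magic >= 5 or (n_magic == 4 and random.random() < 0.5):
        return "Magical Explosion"

    # Step 2: which potions have both of their Key Ingredients present?
    candidates = [p for p, keys in KEY_INGREDIENTS.items() if keys <= ingredients]
    if not candidates:
        return "Acidic Slurry"
    if len(candidates) == 1:
        return candidates[0]
    # With n clashing potions: 1/n of the time a random one dominates,
    # otherwise the clash produces Mutagenic Ooze.
    if random.random() < 1 / len(candidates):
        return random.choice(candidates)
    return "Mutagenic Ooze"
```

For instance, repeatedly calling brew(["Dragon Spleen", "Dragon Scale", "Dragon Tongue", "Dragon's Blood"]) should return "Magical Explosion" and "Fire Breathing Potion" with roughly equal frequency, matching the first example below.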
So, for example, if you brew a potion with:
Dragon Spleen, Dragon Scale, Dragon Tongue and Dragon's Blood:
You have included 4 magical ingredients, and the Key Ingredients of one potion (Fire Breathing).
50% of the time you will get a Magical Explosion, 50% of the time you will get a Fire Breathing Potion.
Badger Skull, Demon Claw, Giant's Toe, Redwood Sap.
You have included 2 magical ingredients, and the Key Ingredients of two potions (Rage and Growth).
50% of the time you will get Inert Glop, 25% of the time Mutagenic Ooze, 12.5% of the time G...

Jun 18, 2024 • 6min
LW - I would have shit in that alley, too by Declan Molony
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I would have shit in that alley, too, published by Declan Molony on June 18, 2024 on LessWrong.
After living in a suburb for most of my life, when I moved to a major U.S. city the first thing I noticed was the feces. At first I assumed it was dog poop, but my naivety didn't last long.
One day I saw a homeless man waddling towards me at a fast speed while holding his ass cheeks. He turned into an alley and took a shit. As I passed him, there was a moment where our eyes met. He sheepishly averted his gaze.
The next day I walked to the same place. There are a number of businesses on both sides of the street that probably all have bathrooms. I walked into each of them to investigate.
In a coffee shop, I saw a homeless woman ask the barista if she could use the bathroom. "Sorry, that bathroom is for customers only." I waited five minutes and then inquired from the barista if I could use the bathroom (even though I hadn't ordered anything). "Sure! The bathroom code is 0528."
The other businesses I entered also had policies for 'customers only'. Nearly all of them allowed me to use the bathroom despite not purchasing anything.
If I was that homeless guy, I would have shit in that alley, too.
I receive more compliments from homeless people compared to the women I go on dates with
There's this one homeless guy - a big fella who looks intimidating - I sometimes pass on my walk to the gym. The first time I saw him, he put on a big smile and said in a booming voice, "Hey there! I hope you're having a blessed day!" Without making eye contact (because I didn't want him to ask me for money), I mumbled "thanks" and quickly walked away.
I saw him again a few weeks later. With another beaming smile he exclaimed, "You must be going to the gym - you're looking fit, my man!" I blushed and replied, "I appreciate it, have a good day." He then added, "God bless you, sir!" Being non-religious, that made me a little uncomfortable.
With our next encounter, I found myself smiling as I approached him. This time I greeted him first, "Good afternoon!" His face lit up with glee. "Sir, that's very kind of you. I appreciate that. God bless you!" Without hesitation I responded, "God bless you, too!" I'm not sure the last time I've uttered those words; I don't even say 'bless you' after people sneeze.
We say hi to each other regularly now. His name is George.
Is that guy dead?
Coming home one day, I saw a disheveled man lying facedown on the sidewalk.
He's not moving. I crouched to hear if he's breathing. Nothing.
I looked up and saw a lady in a car next to me stopped at a red light. We made eye contact and I gestured towards the guy as if to say what the fuck do we do? Her answer was to grip the steering wheel and aggressively stare in front of her until the light turned green and she sped off.
Not knowing if I needed to call an ambulance, I asked him, "Hey buddy, you okay?" I heard back a muffled, "AYE KENT GEEUP!"
Well, at least he's not dead.
"Uhh, what was that? You doing okay?" This time a more articulate, "I CAN'T GET UP," escaped from him. Despite his clothes being somewhat dirty and not wanting to touch him, I helped him to his feet.
With one look on his face I could tell that he wasn't all there. I asked him if he knew where he was or if he needed help, but he could only reply with gibberish. It could have been drugs; it could have been mental illness. With confirmation that he wasn't dead and was able to walk around, I went home.
Who's giving Brazilian waxes to the homeless?
I was walking behind a homeless man the other day. He was wearing an extra long flannel and sagging his pants low.
Suddenly, he noticed his (one and only) shoe was untied and fixed it promptly by executing a full standing pike. I wasn't expecting him to have the flexibility of a gymnast.
In doing so, his flannel lifted u...

Jun 17, 2024 • 3min
LW - Fat Tails Discourage Compromise by niplav
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fat Tails Discourage Compromise, published by niplav on June 17, 2024 on LessWrong.
Say that we have a set of options, such as (for example) wild animal welfare interventions.
Say also that you have two axes along which you can score those interventions: popularity (how much people will like your intervention) and effectiveness (how much the intervention actually helps wild animals).
Assume that we (for some reason) can't convert between and compare those two properties.
Should you then pick an intervention that is a compromise on the two axes - that is, it scores decently well on both - or should you max out on a particular axis?
One thing you might consider is the distribution of options along those two axes: the distribution of interventions can be normal for both popularity and effectiveness, or the underlying distribution could be lognormal for both axes, or they could be mixed (e.g. normal for popularity, and lognormal for effectiveness).
Intuitively, the distributions seem like they affect the kinds of tradeoffs we can make - but how could we possibly figure out how?
…
…
…
It turns out that if both properties are normally distributed, one gets a fairly large Pareto frontier, with a convex set of options, while if the two properties are lognormally distributed, one gets a concave set of options.
(Code here.)
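The linked code isn't reproduced in this transcript, but the experiment is easy to sketch: sample many (popularity, effectiveness) pairs under each distributional assumption and look at the shape of the Pareto frontier. Here is a minimal illustrative version (my own, under the assumptions above, not necessarily niplav's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def pareto_frontier(points):
    """Return the points not dominated on both axes (higher is better)."""
    order = np.argsort(-points[:, 0])  # sort by the first axis, descending
    frontier, best_second = [], -np.inf
    for p in points[order]:
        if p[1] > best_second:
            frontier.append(p)
            best_second = p[1]
    return np.array(frontier)

n = 10_000
cases = {
    "both normal": rng.normal(size=(n, 2)),
    "both lognormal": rng.lognormal(size=(n, 2)),
    "normal popularity, lognormal effectiveness": np.column_stack(
        [rng.normal(size=n), rng.lognormal(size=n)]
    ),
}

for name, pts in cases.items():
    front = pareto_frontier(pts)
    print(f"{name}: {len(front)} points on the Pareto frontier")
    # Plotting `front` for each case shows the convex (normal) vs.
    # concave (lognormal) frontier shapes described in the post.
```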
So if we believe that the interventions are normally distributed around popularity and effectiveness, we would be justified in opting for an intervention that gets us the best of both worlds, such as sterilising stray dogs or finding less painful rodenticides.
If we, however, believe that popularity and effectiveness are lognormally distributed, we instead want to go in hard on only one of those, such as buying Brazilian beef that leads to Amazonian rainforest being destroyed, or writing a book of poetic short stories that detail the harsh life of wild animals.
What if popularity of interventions is normally distributed, but effectiveness is lognormally distributed?
In that case you get a pretty large Pareto frontier which almost looks linear to me, and it's not clear anymore that one can't get a good trade-off between the two options.
So if you believe that heavy tails dominate with the things you care about, on multiple dimensions, you might consider taking a barbell strategy and taking one or multiple options that each max out on a particular axis.
If you have thin tails, however, taking a concave disposition towards your available options can give you most of the value you want.
See Also
Being the (Pareto) Best in the World (johnswentworth, 2019)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 17, 2024 • 33min
LW - Getting 50% (SoTA) on ARC-AGI with GPT-4o by ryan greenblatt
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Getting 50% (SoTA) on ARC-AGI with GPT-4o, published by ryan greenblatt on June 17, 2024 on LessWrong.
I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples (if this is confusing, go here)[2]. I use a variety of additional approaches and tweaks which overall substantially improve the performance of my method relative to just sampling 8,000 programs.
[This post is on a pretty different topic than the usual posts I make about AI safety.]
The additional approaches and tweaks are:
I use few-shot prompts which perform meticulous step-by-step reasoning.
I have GPT-4o try to revise some of the implementations after seeing what they actually output on the provided examples.
I do some feature engineering, providing the model with considerably better grid representations than the naive approach of just providing images. (See below for details on what a "grid" in ARC-AGI is.)
I used specialized few-shot prompts for the two main buckets of ARC-AGI problems (cases where the grid size changes vs doesn't).
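On the grid-representation point above, here is the flavor of a textual rendering one might pass alongside (or instead of) images. The helper and its format are a hypothetical illustration, not the representations actually used in the post.

```python
from typing import List

Grid = List[List[int]]  # ARC-AGI grids are 2-D arrays of color indices 0-9

def grid_to_text(grid: Grid) -> str:
    """Render a grid as rows of digits plus an explicit shape annotation."""
    rows = ["".join(str(cell) for cell in row) for row in grid]
    return f"shape: {len(grid)}x{len(grid[0])}\n" + "\n".join(rows)

example = [[0, 0, 1],
           [0, 1, 0],
           [1, 0, 0]]
print(grid_to_text(example))
# shape: 3x3
# 001
# 010
# 100
```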
The prior state of the art on this dataset was 34% accuracy, so this is a significant improvement.[3]
On a held-out subset of the train set, where humans get 85% accuracy, my solution gets 72% accuracy.[4] (The train set is significantly easier than the test set as noted here.)
Additional increases of runtime compute would further improve performance (and there are clear scaling laws), but this is left as an exercise to the reader.
In this post:
I describe my method;
I analyze what limits its performance and make predictions about what is needed to reach human performance;
I comment on what it means for claims that François Chollet makes about LLMs. Given that current LLMs can perform decently well on ARC-AGI, do claims like "LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." make sense? (This quote is from here.)
Thanks to Fabien Roger and Buck Shlegeris for a bit of help with this project and with writing this post.
What is ARC-AGI?
ARC-AGI is a dataset built to evaluate the general reasoning abilities of AIs. It consists of visual problems like the below, where there are input-output examples which are grids of colored cells. The task is to guess the transformation from input to output and then fill out the missing grid. Here is an example from the tutorial:
This one is easy, and it's easy to get GPT-4o to solve it. But the tasks from the public test set are much harder; they're often non-trivial for (typical) humans. There is a reported MTurk human baseline for the train distribution of 85%, but no human baseline for the public test set which is known to be significantly more difficult.
Here are representative problems from the test set[5], and whether my GPT-4o-based solution gets them correct or not.
Problem 1:
Problem 2:
Problem 3:
My method
The main idea behind my solution is very simple: get GPT-4o to generate around 8,000 Python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this program produces when applied to the additional test input(s). I show GPT-4o the problem as images and in various ASCII representations.
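Concretely, the selection step amounts to: run each candidate program on the example inputs, keep one that reproduces every example output, and submit what it produces on the test input. A stripped-down sketch of that loop (illustrative only; the real pipeline adds revision passes, specialized prompts, and multiple submissions):

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]

def select_and_run(
    candidates: List[Callable[[Grid], Grid]],   # e.g. ~8,000 sampled programs
    train_pairs: List[Tuple[Grid, Grid]],       # [(input_grid, output_grid), ...]
    test_input: Grid,
) -> Optional[Grid]:
    """Return the test output of a candidate that matches all training examples."""
    for program in candidates:
        try:
            if all(program(inp) == out for inp, out in train_pairs):
                return program(test_input)
        except Exception:
            # Many sampled programs crash or loop on some inputs; just skip them.
            continue
    return None  # no candidate fit all of the examples
```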
My approach is similar in spirit to the approach applied in AlphaCode in which a model generates millions of completions attempting to solve a programming problem and then aggregates over them to determine what to submit.
Actually getting to 50% with this main idea took me about 6 days of work. This work includes construct...

Jun 17, 2024 • 53min
LW - OpenAI #8: The Right to Warn by Zvi
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI #8: The Right to Warn, published by Zvi on June 17, 2024 on LessWrong.
The fun at OpenAI continues.
We finally have the details of how Leopold Aschenbrenner was fired, at least according to Leopold. We have a letter calling for a way for employees to do something if frontier AI labs are endangering safety. And we have continued details and fallout from the issues with non-disparagement agreements and NDAs.
Hopefully we can stop meeting like this for a while.
Due to jury duty and it being largely distinct, this post does not cover the appointment of General Paul Nakasone to the board of directors. I'll cover that later, probably in the weekly update.
The Firing of Leopold Aschenbrenner
What happened that caused Leopold to leave OpenAI? Given the nature of this topic, I encourage getting the story from Leopold by following along on the transcript of that section of his appearance on the Dwarkesh Patel Podcast or watching the section yourself.
This is especially true on the question of the firing (control-F for 'Why don't I'). I will summarize, but much better to use the primary source for claims like this. I would quote, but I'd want to quote entire pages of text, so go read or listen to the whole thing.
Remember that this is only Leopold's side of the story. We do not know what is missing from his story, or what parts might be inaccurate.
It has however been over a week, and there has been no response from OpenAI.
If Leopold's statements are true and complete? Well, it doesn't look good.
The short answer is:
1. Leopold refused to sign the OpenAI letter demanding the board resign.
2. Leopold wrote a memo about what he saw as OpenAI's terrible cybersecurity.
3. OpenAI did not respond.
4. There was a major cybersecurity incident.
5. Leopold shared the memo with the board.
6. OpenAI admonished him for sharing the memo with the board.
7. OpenAI went on a fishing expedition to find a reason to fire him.
8. OpenAI fired him, citing 'leaking information' that did not contain any non-public information, and that was well within OpenAI communication norms.
9. Leopold was explicitly told that without the memo, he wouldn't have been fired.
You can call it 'going outside the chain of command.'
You can also call it 'fired for whistleblowing under false pretenses,' and treating the board as an enemy who should not be informed about potential problems with cybersecurity, and also retaliation for not being sufficiently loyal to Altman.
Your call.
For comprehension I am moving statements around, but here is the story I believe Leopold is telling, with time stamps.
1. (2:29:10) Leopold joined superalignment. The goal of superalignment was to find the successor to RLHF, because it probably won't scale to superhuman systems - humans can't evaluate superhuman outputs. He liked Ilya and the team and the ambitious agenda on an important problem.
1. Not probably won't scale. It won't scale. I love that Leike was clear on this.
2. (2:31:24) What happened to superalignment? OpenAI 'decided to take things in a somewhat different direction.' After November there were personnel changes, some amount of 'reprioritization.' The 20% compute commitment, a key part of recruiting many people, was broken.
1. If you turn against your safety team because of corporate political fights and thus decide to 'go in a different direction,' and that different direction is to not do the safety work? And your safety team quits with no sign you are going to replace them? That seems quite bad.
2. If you recruit a bunch of people based on a very loud public commitment of resources, then you do not commit those resources? That seems quite bad.
3. (2:32:25) Why did Leopold leave, they said you were fired, what happened? I encourage reading Leopold's exact answer and not take my word for this, but the short version i...

Jun 17, 2024 • 37min
LW - Towards a Less Bullshit Model of Semantics by johnswentworth
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards a Less Bullshit Model of Semantics, published by johnswentworth on June 17, 2024 on LessWrong.
Or: Towards Bayesian Natural Language Semantics In Terms Of Interoperable Mental Content
Or: Towards a Theory of Interoperable Semantics
You know how natural language "semantics" as studied in e.g. linguistics is kinda bullshit? Like, there's some fine math there, it just ignores most of the thing which people intuitively mean by "semantics".
When I think about what natural language "semantics" means, intuitively, the core picture in my head is:
I hear/read some words, and my brain translates those words into some kind of internal mental content.
The mental content in my head somehow "matches" the mental content typically evoked in other people's heads by the same words, thereby allowing us to communicate at all; the mental content is "interoperable" in some sense.
That interoperable mental content is "the semantics of" the words. That's the stuff we're going to try to model.
The main goal of this post is to convey what it might look like to "model semantics for real", mathematically, within a Bayesian framework.
But Why Though?
There's lots of reasons to want a real model of semantics, but here's the reason we expect readers here to find most compelling:
The central challenge of ML interpretability is to faithfully and robustly translate the internal concepts of neural nets into human concepts (or vice versa). But today, we don't have a precise understanding of what "human concepts" are. Semantics gives us an angle on that question: it's centrally about what kind of mental content (i.e. concepts) can be interoperable (i.e. translatable) across minds.
Later in this post, we give a toy model for the semantics of nouns and verbs of rigid body objects. If that model were basically correct, it would give us a damn strong starting point on what to look for inside nets if we want to check whether they're using the concept of a teacup or free-fall or free-falling teacups.
This potentially gets us much of the way to calculating quantitative bounds on how well the net's internal concepts match humans', under conceptually simple (though substantive) mathematical assumptions.
Then compare that to today: Today, when working on interpretability, we're throwing darts in the dark, don't really understand what we're aiming for, and it's not clear when the darts hit something or what, exactly, they've hit. We can do better.
Overview
In the first section, we will establish the two central challenges of the problem we call Interoperable Semantics. The first is to characterize the stuff within a Bayesian world model (i.e. mental content) to which natural-language statements resolve; that's the "semantics" part of the problem.
The second aim is to characterize when, how, and to what extent two separate models can come to agree on the mental content to which natural language resolves, despite their respective mental content living in two different minds; that's the "interoperability" part of the problem.
After establishing the goals of Interoperable Semantics, we give a first toy model of interoperable semantics based on the "words point to clusters in thingspace" mental model. As a concrete example, we quantify the model's approximation errors under an off-the-shelf gaussian clustering algorithm on a small-but-real dataset. This example emphasizes the sort of theorems we want as part of the Interoperable Semantics project, and the sorts of tools which might be used to prove those theorems. However, the example is very toy.
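To give a flavor of what quantifying interoperability under an off-the-shelf gaussian clustering algorithm might look like in code (an illustrative sketch only, not the authors' actual experiment or dataset): two "minds" each fit a Gaussian mixture to the same observations, and we check how well their cluster labelings match up to a permutation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Toy "thingspace": three well-separated clusters of objects in 2-D.
data = np.concatenate([
    rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[4, 0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.3, size=(100, 2)),
])

# Two models cluster the same data independently (different random inits).
labels_a = GaussianMixture(n_components=3, random_state=1).fit_predict(data)
labels_b = GaussianMixture(n_components=3, random_state=2).fit_predict(data)

# If clusters are the interoperable "meanings" of words, the two models agree
# to the extent their labelings coincide up to relabeling of the clusters.
print("agreement (adjusted Rand index):", adjusted_rand_score(labels_a, labels_b))
```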
Our second toy model sketch illustrates how to construct higher level Interoperable Semantics models using the same tools from the first model. This one is marginally less toy; it gives a simple semantic model for rigid body nouns and their verbs. However, this secon...

Jun 17, 2024 • 6min
LW - (Appetitive, Consummatory) (RL, reflex) by Steven Byrnes
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: (Appetitive, Consummatory) (RL, reflex), published by Steven Byrnes on June 17, 2024 on LessWrong.
"Appetitive" and "Consummatory" are terms used in the animal behavior literature. I was was briefly confused when I first came across these terms (a year or two ago), because I'm most comfortable thinking in terms of brain algorithms, whereas these terms were about categories of behavior, and the papers I was reading didn't spell out how the one is related to the other.
I'm somewhat embarrassed to write this because the thesis seems so extremely obvious to me now, and it's probably obvious to many other people too. So if you read the title of this post and were thinking "yeah duh", then you already get it, and you can stop reading.
Definition of "appetitive" and "consummatory"
In animal behavior there's a distinction between "appetitive behaviors" and "consummatory behaviors". Here's a nice description from Hansen et al. 1991 (formatting added, references omitted):
It is sometimes helpful to break down complex behavioral sequences into appetitive and consummatory phases, although the distinction between them is not always absolute.
Appetitive behaviors involve approach to the appropriate goal object and prepare the animal for consummatory contact with it. They are usually described by consequence rather than by physical description, because the movements involved are complex and diverse.
Consummatory responses, on the other hand, depend on the outcome of the appetitive phase. They appear motorically rigid and stereotyped and are thus more amenable to physical description. In addition, consummatory responses are typically activated by a more circumscribed set of specific stimuli.
So for example, rat mothers have a pup retrieval behavior; if you pick up a pup and place it outside the nest, the mother will walk to it, pick it up in her mouth, and bring it back to the nest.
The walking-over-to-the-pup aspect of pup-retrieval is clearly appetitive. It's not rigid and stereotyped; for example, if you put up a trivial barrier between the rat mother and her pup, the mother will flexibly climb over or walk around the barrier to get to the pup.
Whereas the next stage (picking up the pup) might be consummatory (I'm not sure). For example, if the mother always picks up the pup in the same way, and if this behavior is innate, and if she won't flexibly adapt in cases where the normal method for pup-picking-up doesn't work, then all that would be a strong indication that pup-picking-up is indeed consummatory.
Other examples of consummatory behavior: aggressively bristling and squeaking at an unwelcome intruder, or chewing and swallowing food.
How do "appetitive" & "consummatory" relate to brain algorithms?
Anyway, here's the "obvious" point I want to make. (It's a bit oversimplified; caveats to follow.)
Appetitive behaviors are implemented via an animal's reinforcement learning (RL) system. In other words, the animal has experienced reward / positive reinforcement signals when a thing has happened in the past, so they take actions and make plans so as to make a similar thing happen again in the future. RL enables flexible, adaptable, and goal-oriented behaviors, like climbing over an obstacle in order to get to food.
Consummatory behaviors are generally implemented via the triggering of specific innate motor programs stored in the brainstem. For example, vomiting isn't a behavior where the end-result is self-motivating, and therefore you systematically figure out from experience how to vomit, in detail, i.e. which muscles you should contract in which order. That's absurd! Rather, we all know that vomiting is an innate motor program.
Ditto for goosebumps, swallowing, crying, laughing, various facial expressions, orienting to unexpected sounds, flinching, and many more.
There are many s...