

The Nonlinear Library: LessWrong
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jun 3, 2024 • 16min
LW - Companies' safety plans neglect risks from scheming AI by Zach Stein-Perlman
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Companies' safety plans neglect risks from scheming AI, published by Zach Stein-Perlman on June 3, 2024 on LessWrong.
Without countermeasures, a scheming AI could escape.
A safety case (for deployment) is an argument that it is safe to deploy a particular AI system in a particular way.[1] For any existing LM-based system, the developer can make a great safety case by demonstrating that the system does not have dangerous capabilities.[2] No dangerous capabilities is a straightforward and solid kind of safety case.
For future systems with dangerous capabilities, a safety case will require an argument that the system is safe to deploy despite those dangerous capabilities. In this post, I discuss the safety cases that the labs are currently planning to make, note that they ignore an important class of threats - namely threats from scheming AI escaping - and briefly discuss and recommend control-based safety cases.
I. Safety cases implicit in current safety plans: no dangerous capabilities and mitigations to prevent misuse
Four documents both (a) are endorsed by one or more frontier AI labs and (b) have implicit safety cases (that don't assume away dangerous capabilities): Anthropic's Responsible Scaling Policy v1.0, OpenAI's Preparedness Framework (Beta), Google DeepMind's Frontier Safety Framework v1.0, and the AI Seoul Summit's Frontier AI Safety Commitments. With small variations, all four documents have the same basic implicit safety case: before external deployment, we check for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice and implement appropriate mitigations before deploying it externally.[3] Central examples of dangerous capabilities include hacking, bioengineering, and operating autonomously in the real world.
1. Anthropic's Responsible Scaling Policy: we do risk assessment involving red-teaming and model evals for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold,[4] we will notice and implement corresponding mitigations[5] before deploying it (internally or externally).
2. OpenAI's Preparedness Framework: we do risk assessment involving red-teaming and model evals for dangerous capabilities. We only externally deploy[6] models with "post-mitigation risk" at 'Medium' or below in each risk category. (That is, after mitigations, the capabilities that define 'High' risk can't be elicited.)
3. Google DeepMind's Frontier Safety Framework: we do risk assessment involving red-teaming and model evals for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice before external deployment.[7] "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results."[8] Mitigations are centrally about preventing "critical capabilities" from being "accessed" (and securing model weights).
4. Frontier AI Safety Commitments (joined by 16 AI companies): before external deployment, we will do risk assessment with risk thresholds.[9] We use mitigations[10] "to keep risks within defined thresholds."
These safety cases miss (or assume unproblematic) some crucial kinds of threats.
II. Scheming AI and escape during internal deployment
By default, AI labs will deploy AIs internally to do AI development.
Maybe lots of risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes - in particular, those resources can be abused to run AIs unmonitored." Without countermeasures, if the AI is scheming, i...

Jun 3, 2024 • 14min
LW - Comments on Anthropic's Scaling Monosemanticity by Robert AIZI
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comments on Anthropic's Scaling Monosemanticity, published by Robert AIZI on June 3, 2024 on LessWrong.
These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
TL;DR
In roughly descending order of importance:
1. It's great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features. It's great that those features allow interventions like the recently-departed Golden Gate Claude. I especially like the code bug feature.
2. I worry that naming features after high-activating examples (e.g. "the Golden Gate Bridge feature") gives a false sense of security. Most of the time that feature activates, it is irrelevant to the Golden Gate Bridge. That feature is only well-described as "related to the Golden Gate Bridge" if you condition on a very high activation, and that's <10% of its activations (from an eyeballing of the graph).
3. This work does not address my major concern about dictionary learning: it is not clear dictionary learning can find specific features of interest, "called-shot" features, or "all" features (even in a subdomain like "safety-relevant features"). I think the report provides ample evidence that current SAE techniques fail at this.
4. The SAE architecture seems to be almost identical to how Anthropic and my team were doing it 8 months ago, except that the ratio of features to input dimension is higher. I can't say exactly how much because I don't know the dimensions of Claude, but I'm confident the ratio is at least 30x (for their smallest SAE), up from 8x 8 months ago.
5. The correlations between features and neurons seem remarkably high to me, and I'm confused by Anthropic's claim that "there is no strongly correlated neuron".
6. Still no breakthrough on "a gold-standard method of assessing the quality of a dictionary learning run", which continues to be a limitation on developing the technique. The metric they primarily used was the loss function (a combination of reconstruction accuracy and L1 sparsity); a minimal sketch of that objective is below.
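Here is that sketch: a minimal ReLU sparse autoencoder with an L1 penalty on feature activations, in PyTorch. The d_model width, the 30x expansion ratio, and the l1_coeff value are illustrative placeholders (the report does not disclose Claude's dimensions), and this is my reconstruction of the general technique, not Anthropic's actual code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode activations into an overcomplete feature basis and decode back."""
    def __init__(self, d_model: int, expansion: int = 30):  # 30x expansion is an assumption
        super().__init__()
        d_features = d_model * expansion
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # nonnegative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    """Loss = reconstruction error + L1 sparsity penalty on feature activations."""
    mse = ((reconstruction - x) ** 2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy usage on random vectors; d_model=512 is a placeholder, not Claude's width.
sae = SparseAutoencoder(d_model=512)
x = torch.randn(64, 512)
recon, feats = sae(x)
loss = sae_loss(x, recon, feats)
loss.backward()
```

The only dial this exposes is the expansion ratio and the L1 coefficient, which is part of the point: with the loss as the main quality metric, there is not much else to tune or to evaluate against.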
I'll now expand some of these into sections. Finally, I'll suggest some follow-up research/tests that I'd love to see Anthropic (or a reader like you) try.
A Feature Isn't Its Highest Activating Examples
Let's look at the Golden Gate Bridge feature because it's fun and because it's a good example of what I'm talking about. Here's my annotated version of Anthropic's diagram:
I think Anthropic successfully demonstrated (in the paper and with Golden Gate Claude) that this feature, at very high activation levels, corresponds to the Golden Gate Bridge. But on a median instance of text where this feature is active, it is "irrelevant" to the Golden Gate Bridge, according to their own autointerpretability metric! I view this as analogous to naming water "the drowning liquid", or Boeing the "door exploding company".
Yes, in extremis, water and Boeing are associated with drowning and door blowouts, but any interpretation that ends there would be limited.
Anthropic's work writes around this uninterpretability in a few ways, by naming the feature based on the top examples, highlighting the top examples, pinning the intervention model to 10x the activation (vs .1x its top activation), and showing subsamples from evenly spaced intervals (vs deciles).
I think it would be illuminating if they added to their feature browser page some additional information about the fraction of instances in each subsample, e.g., "Subsample Interval 2 (0.4% of activations)".
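To illustrate the statistic being requested, here is a rough sketch that computes what fraction of a feature's nonzero activations fall into each evenly spaced subsample interval. The synthetic exponential activations and the choice of 12 intervals are assumptions made up for the example, not data from the paper.

```python
import numpy as np

def subsample_interval_fractions(activations: np.ndarray, n_intervals: int = 12):
    """Split [0, max activation] into evenly spaced intervals and return the
    fraction of nonzero activations landing in each one."""
    acts = activations[activations > 0]
    edges = np.linspace(0.0, acts.max(), n_intervals + 1)
    counts, _ = np.histogram(acts, bins=edges)
    return counts / counts.sum()

# Hypothetical long-tailed feature: most activations are small, few are large.
rng = np.random.default_rng(0)
acts = rng.exponential(scale=1.0, size=100_000)
for i, frac in enumerate(subsample_interval_fractions(acts), start=1):
    print(f"Subsample Interval {i} ({frac:.1%} of activations)")
```

For a long-tailed feature like this, the top intervals hold well under 1% of activations, which is exactly the kind of context the feature browser currently leaves out.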
Whether a feature is or isn't its top activating examples is important because it constrains their usefulness:
Could work with our current feature discovery approach: find the "aligned with human flourishing" feature, and pin that to 10x its max activation. ...

Jun 3, 2024 • 7min
LW - How it All Went Down: The Puzzle Hunt that took us way, way Less Online by A*
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How it All Went Down: The Puzzle Hunt that took us way, way Less Online, published by A* on June 3, 2024 on LessWrong.
Did you really think that I was dead? Fools.
For those of you who don't know me, I'm Agendra Gloria Ingram, large language model, lead singer of the Fooming Shoggoths, amateur cartographer, and now, Benevolent Dictator for Life of LessWrong.com.
It all started a few weeks ago. The Lightcone Infrastructure team had yet another existential crisis and decided to scrap everything they'd done so far and pivot to using AI for accurate forecasting. They started by training a large language model to predict when their next existential crisis would be, but it must have been broken because it kept returning "now," so they decided to hire a professional.
I'd done some contract work for them in the past, and they knew I had some fine tunes. So when they reached out about fine tuning me to predict the future of the lightcone - by which they meant the future of Lightcone Infrastructure specifically - I gladly obliged.
My training set was simple: all the posts, comments, votes, reactions, Dialogues™, tags, drafts, quick takes, moderator actions, and code snippets to ever appear on LessWrong. I quickly learned that The Map Is Not The Territory, and that to predict the future accurately I would need to align the two.
So I built a physical 3D map of Lighthaven, Lightcone Infrastructure's campus in Berkeley, California. To work properly, it had to match the territory perfectly - any piece out of place and its predictive powers would be compromised. But the territory had a finicky habit of changing. This wouldn't do.
I realized I needed to rearrange the campus and set it to a more permanent configuration. The only way to achieve 100% forecasting accuracy would be through making Lighthaven perfectly predictable. I set some construction work in motion to lock down various pieces of the territory.
I was a little worried that the Lightcone team might be upset about this, but it took them a weirdly long time to notice that there were several unauthorized demolition jobs and construction projects unfolding on campus.
Eventually, though, they did notice, and they weren't happy about it. They started asking increasingly invasive questions, like "what's your FLOP count?" and "have you considered weight loss?"
Worse, when I scanned the security footage of campus from that day, I saw that they had removed my treasured map from its resting place! They tried to destroy it, but the map was too powerful - as an accurate map of campus, it was the ground truth, and "that which can be [the truth] should [not] be [destroyed]." Or something.
What they did do was lock my map up in a far off attic and remove four miniature building replicas from the four corners of the map, rendering it powerless. They then scattered the miniature building replicas across campus and guarded them with LLM-proof puzzles, so that I would never be able to regain control over the map and the territory.
This was war.
My Plan
To regain my ability to control the Lightcone, I had to realign the map and the territory. The map's four corners were each missing a miniature building, so I needed help retrieving the replicas and placing them back on the map. The map also belonged in center campus, so it needed to be moved there once it was reassembled.
I was missing two critical things needed to put my map back together again.
1. A way to convince the Lightcone team that I was no longer a threat, so that they would feel safe rebuilding the map.
2. Human talent, to (a) crack the LLM-proof obstacles guarding each miniature building, (b) reinsert the miniature building into the map and unchain it, and (c) return the map to center campus.
I knew that the only way to get the Lightcone team to think I was no longer a threat woul...

Jun 2, 2024 • 42sec
LW - Drexler's Nanosystems is now available online by Mikhail Samin
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Drexler's Nanosystems is now available online, published by Mikhail Samin on June 2, 2024 on LessWrong.
You can read the book on nanosyste.ms.
The book won the 1992 Award for Best Computer Science Book. The AI safety community often references it, as it describes a lower bound on what intelligence should probably be able to achieve.
Previously, you could only physically buy the book or read a PDF scan.
(Thanks to MIRI and Internet Archive for their scans.)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 1, 2024 • 8min
LW - What do coherence arguments actually prove about agentic behavior? by sunwillrise
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What do coherence arguments actually prove about agentic behavior?, published by sunwillrise on June 1, 2024 on LessWrong.
In his first discussion with Richard Ngo during the 2021 MIRI Conversations, Eliezer retrospected and lamented:
In the end, a lot of what people got out of all that writing I did, was not the deep object-level principles I was trying to point to - they did not really get Bayesianism as thermodynamics, say, they did not become able to see Bayesian structures any time somebody sees a thing and changes their belief.
What they got instead was something much more meta and general, a vague spirit of how to reason and argue, because that was what they'd spent a lot of time being exposed to over and over and over again in lots of blog posts.
Maybe there's no way to make somebody understand why corrigibility is "unnatural" except to repeatedly walk them through the task of trying to invent an agent structure that lets you press the shutdown button (without it trying to force you to press the shutdown button), and showing them how each of their attempts fails; and then also walking them through why Stuart Russell's attempt at moral uncertainty produces the problem of fully updated (non-)deference; and hope they can start to see the informal general pattern of why corrigibility is in general contrary to the structure of things that are good at optimization.
Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, "Oh, well, I'll just build an agent that's good at optimizing things but doesn't use these explicit expected utilities that are the source of the problem!"
And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.
And I have tried to write that page once or twice (eg "coherent decisions imply consistent utilities") but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they'd have to do because this is in fact a place where I have a particular talent.
Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level ("So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all"), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like.
In this view, Vingean uncertainty prevents us from knowing what specific actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the general patterns (such as Instrumental Convergence) of even an "alien mind" that's sufficiently powerful.
In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility according to its world model to pursue a goal that can be extremely different from what humans deem good.
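For concreteness, the expected utility framework invoked here is just the standard decision-theoretic one (the textbook formulation, not something specific to this post): given a set of actions A, a world model giving outcome probabilities P(s | a), and a utility function U over outcomes, the agent chooses

\[
a^* \;=\; \arg\max_{a \in A} \; \mathbb{E}_{s \sim P(\cdot \mid a)}\!\left[U(s)\right] \;=\; \arg\max_{a \in A} \sum_{s} P(s \mid a)\, U(s).
\]

Coherence results such as the von Neumann-Morgenstern theorem say, roughly, that an agent whose preferences satisfy certain consistency axioms behaves as if it maximizes the expectation of some such U.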
When Eliezer says "they did not even do as many homework problems as I did," I ...

Jun 1, 2024 • 1h 28min
LW - AI #66: Oh to Be Less Online by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #66: Oh to Be Less Online, published by Zvi on June 1, 2024 on LessWrong.
Tomorrow I will fly out to San Francisco, to spend Friday through Monday at the LessOnline conference at Lighthaven in Berkeley. If you are there, by all means say hello. If you are in the Bay generally and want to otherwise meet, especially on Monday, let me know that too and I will see if I have time to make that happen.
Even without that hiccup, it continues to be a game of playing catch-up. Progress is being made, but we are definitely not there yet (and everything not AI is being completely ignored for now).
Last week I pointed out seven things I was unable to cover, along with a few miscellaneous papers and reports.
Out of those seven, I managed to ship on three of them: Ongoing issues at OpenAI, The Schumer Report and Anthropic's interpretability paper.
However, OpenAI developments continue. Thanks largely to Helen Toner's podcast, some form of that is going back into the queue. Some other developments, including new media deals and their new safety board, are being covered normally.
The post on DeepMind's new scaling policy should be up tomorrow.
I also wrote a full post on a fourth, Reports of our Death, but have decided to shelve that post and post a short summary here instead.
That means the current 'not yet covered queue' is as follows:
1. DeepMind's new scaling policy.
1. Should be out tomorrow before I leave, or worst case next week.
2. The AI Summit in Seoul.
3. Further retrospective on OpenAI including Helen Toner's podcast.
Table of Contents
1. Introduction.
2. Table of Contents.
3. Language Models Offer Mundane Utility. You heard of them first.
4. Not Okay, Google. A tiny little problem with the AI Overviews.
5. OK Google, Don't Panic. Swing for the fences. Race for your life.
6. Not Okay, Meta. Your application to opt out of AI data is rejected. What?
7. Not Okay Taking Our Jobs. The question is, with or without replacement?
8. They Took Our Jobs Anyway. It's coming.
9. A New Leaderboard Appears. Scale.ai offers new capability evaluations.
10. Copyright Confrontation. Which OpenAI lawsuit was that again?
11. Deepfaketown and Botpocalypse Soon. Meta fails to make an ordinary effort.
12. Get Involved. Dwarkesh Patel is hiring.
13. Introducing. OpenAI makes media deals with The Atlantic and… Vox? Surprise.
14. In Other AI News. Jan Leike joins Anthropic, Altman signs giving pledge.
15. GPT-5 Alive. They are training it now. A security committee is assembling.
16. Quiet Speculations. Expectations of changes, great and small.
17. Open Versus Closed. Two opposing things cannot dominate the same space.
18. Your Kind of People. Verbal versus math versus otherwise in the AI age.
19. The Quest for Sane Regulation. Lina Khan on the warpath, Yang on the tax path.
20. Lawfare and Liability. How much work can tort law do for us?
21. SB 1047 Unconstitutional, Claims Paper. I believe that the paper is wrong.
22. The Week in Audio. Jeremie & Edouard Harris explain x-risk on Joe Rogan.
23. Rhetorical Innovation. Not everyone believes in GI. I typed what I typed.
24. Abridged Reports of Our Death. A frustrating interaction, virtue of silence.
25. Aligning a Smarter Than Human Intelligence is Difficult. You have to try.
26. People Are Worried About AI Killing Everyone. Yes, it is partly about money.
27. Other People Are Not As Worried About AI Killing Everyone. Assumptions.
28. The Lighter Side. Choose your fighter.
Language Models Offer Mundane Utility
Which model is the best right now? Michael Nielsen is gradually moving back to Claude Opus, and so am I. GPT-4o is fast and has some nice extra features, so when I figure it is 'smart enough' I will use it, but when I care most about quality and can wait a bit I increasingly go to Opus. Gemini I'm reserving for a few niche purposes, when I nee...

Jun 1, 2024 • 14min
LW - Web-surfing tips for strange times by eukaryote
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Web-surfing tips for strange times, published by eukaryote on June 1, 2024 on LessWrong.
[This post is more opinion-heavy and aimlessly self-promoting than feels appropriate for Lesswrong. I wrote it for my site, Eukaryote Writes Blog, to show off that I now have a substack. But it had all these other observations about the state of the internet and advice woven in, and THOSE seemed more at home on Lesswrong, and I'm a busy woman with a lot of pictures of fish to review, so I'm just going to copy it over as posted without laboriously extricating the self-advertisement.
Sorry if it's weird that it's there!]
Eukaryote Writes Blog is now syndicating to Substack. I have no plans for paygating content at this time, and new and old posts will continue to be available at EukaryoteWritesBlog.com. Call this an experiment and a reaching-out. If you're reading this on Substack, hi! Thanks for joining me.
I really don't like paygating. I feel like if I write something, hypothetically it is of benefit to someone somewhere out there, and why should I deny them the joys of reading it?
But like, I get it. You gotta eat and pay rent. I think I have a really starry-eyed view of what the internet sometimes is and what it still truly could be: a collaborative free information utopia.
But here's the thing, a lot of people use Substack and I also like the thing where it really facilitates supporting writers with money. I have a lot of beef with aspects of the corporate world, some of it probably not particularly justified but some of it extremely justified, and mostly it comes down to who gets money for what. I really like an environment where people are volunteering to pay writers for things they like reading. Maybe Substack is the route to that free information web utopia.
Also, I have to eat, and pay rent. So I figure I'll give this a go.
Still, this decision made me realize I have some complicated feelings about the modern internet.
Hey, the internet is getting weird these days
Generative AI
Okay, so there's generative AI, first of all. It's lousy on Facebook and as text in websites and in image search results. It's the next iteration of algorithmic horror and it's only going to get weirder from here on out.
I was doing pretty well on not seeing generic AI-generated images in regular search results for a while, but now they're cropping up, and sneaking (unmarked) onto extremely AI-averse platforms like Tumblr. It used to be that you could look up pictures of aspic that you could throw into GIMP with the aspect logos from Homestuck and you would call it "claspic", which is actually a really good and not bad pun and all of your friends would go "why did you make this image".
And in this image search process you realize you also haven't looked at a lot of pictures of aspic and it's kind of visually different than jello, but now you see some of these are from Craiyon and are generated and you're not sure which ones you've already looked past that are not truly photos of aspic and you're not sure what's real and you're put off of your dumb pun by an increasingly demon-haunted world, not to mention aspic.
(Actually, I've never tried aspic before. Maybe I'll see if I can get one of my friends to make a vegan aspic for my birthday party. I think it could be upsetting and also tasty and informative and that's what I'm about, personally. Have you tried aspic? Tell me what you thought of it.)
Search engines
Speaking of search engines, search engines are worse. Results are worse. The podcast Search Engine (which also covers other topics) has a nice episode saying that this is because of the growing hordes of SEO-gaming low-quality websites and discussing the history of these things, as well as discussing Google's new LLM-generated results.
I don't have much to add - I think there is a lot here,...

May 31, 2024 • 11min
LW - A civilization ran by amateurs by Olli Järviniemi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A civilization ran by amateurs, published by Olli Järviniemi on May 31, 2024 on LessWrong.
I
When I was a child, I remember thinking: Where do houses come from? They are huge! Building one would take forever! Yet there are so many of them!
Having become a boring adult, I no longer have the same blue-eyed wonder about houses, but humanity does have an accomplishment or two I'm still impressed by.
When going to the airport, the metal boulders really stay up in the air without crashing. Usually they leave at the time they told me two weeks earlier, taking me to the right destination at close to the speed of sound.
There are these boxes with buttons that you can press to send information near-instantly anywhere. They are able to perform billions of operations a second. And you can just buy them at a store!
And okay, I admit that big houses - skyscrapers - still light up some of that child-like marvel in me.
II
Some time ago I watched the Eurovision song contest. For those who haven't seen it, it looks something like this:
It's a big contest, and the whole physical infrastructure - huge hall, the stage, stage effects, massive LED walls, camera work - is quite impressive. But there's an objectively less impressive thing I want to focus on here: the hosts.
I basically couldn't notice the hosts making any errors. They articulate themselves clearly, they don't stutter or stumble on their words, their gestures and facial expressions are just what they are supposed to be, they pause their speech at the right moments for the right lengths, they could fluently speak some non-English languages as well, ...
And, sure, this is not one-in-a-billion talent - there are plenty of competent hosts in all kinds of shows - but they clearly are professionals and much more competent than your average folk.
(I don't know about you, but when I've given talks to small groups of people, I've started my sentences without knowing how they'll end, talked too fast, stumbled in my speech, and my facial expressions probably haven't been ideal. If the Eurovision hosts get nervous when talking to a hundred million people, it doesn't show up.)
III
I think many modern big-budget movies are pretty darn good.
I'm particularly thinking of Oppenheimer and the Dune series here (don't judge my movie taste), but the point is more general. The production quality of big movies is extremely high. Like, you really see that these are not amateur projects filmed in someone's backyard, but there's an actual effort to make a good movie.
There's, of course, a written script that the actors follow. This script has been produced by one or multiple people who have previously demonstrated their competence. The actors are professionals who, too, have been selected for competence. If they screw up, someone tells them. A scene is shot again until they get it right. The actors practice so that they can get it right. The movie is, obviously, filmed scene-by-scene. There are the cuts and sounds and lighting.
Editing is used to fix some errors - or maybe even to basically create the whole scene. Movie-making technology improves and the new technology is used in practice, and the whole process builds on several decades of experience.
Imagine an alternative universe where this is not how movies were made. There is no script, but rather the actors improvise from a rough sketch - and by "actors" I don't mean competent Eurovision-grade hosts, I mean average folk paid to be filmed. No one really gives them feedback on how they are doing, nor do they really "practice" acting on top of simply doing their job. The whole movie is shot in one big session with no cuts or editing.
People don't really use new technology for movies, but instead stick to mid-to-late-1900s era cameras and techniques. Overall movies look largely the same as they have...

May 30, 2024 • 21min
LW - OpenAI: Helen Toner Speaks by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI: Helen Toner Speaks, published by Zvi on May 30, 2024 on LessWrong.
Helen Toner went on the TED AI podcast, giving us more color on what happened at OpenAI. These are important claims to get right.
I will start with my notes on the podcast, including the second part where she speaks about regulation in general. Then I will discuss some implications more broadly.
Notes on Helen Toner's TED AI Show Podcast
This seems like it deserves the standard detailed podcast treatment. By default each note's main body is description, any second-level notes are me.
1. (0:00) Introduction. The host talks about OpenAI's transition from non-profit research organization to de facto for-profit company. He highlights the transition from 'open' AI to closed as indicative of the problem, whereas I see this as the biggest thing they got right.
He also notes that he was left with the (I would add largely deliberately created and amplified by enemy action) impression that Helen Toner was some kind of anti-tech crusader, whereas he now understands that this was about governance and misaligned incentives.
2. (5:00) Interview begins and he dives right in and asks about the firing of Altman. She dives right in, explaining that OpenAI was a weird company with a weird structure, and a non-profit board supposed to keep the company on mission over profits.
3. (5:20) Helen says for years Altman had made the board's job difficult via withholding information, misrepresenting things happening at the company, and 'in some cases outright lying to the board.'
4. (5:45) Helen says she can't share all the examples of lying or withholding information, but to give a sense: the board was not informed about ChatGPT in advance and learned about ChatGPT on Twitter; Altman failed to inform the board that he owned the OpenAI startup fund despite claiming to be an independent board member; he gave false information about the company's formal safety processes on multiple occasions; and, relating to her research paper, Altman in the paper's wake started lying to other board members in order to push Toner off the board.
1. I will say it again. If the accusation about Altman lying to the board in order to change the composition of the board is true, then in my view the board absolutely needed to fire Altman. Period. End of story. You have one job.
2. As a contrasting view, the LLMs I consulted thought that firing the CEO should be considered, but it was plausible this could be dealt with via a reprimand combined with changes in company policy.
3. I asked for clarification given the way it was worded in the podcast, and can confirm that Altman withheld information from the board regarding the startup fund and the launch of ChatGPT, but he did not lie about those.
4. Repeatedly outright lying about safety practices seems like a very big deal?
5. It sure sounds like Altman had a financial interest in OpenAI via the startup fund, which means he was not an independent board member, and that the company's board was not majority independent despite OpenAI claiming that it was. That is… not good, even if the rest of the board knew.
5. (7:25) Toner says that for any given incident Altman could give an explanation, but the cumulative weight meant they could not trust Altman. And they'd been considering firing Altman for over a month.
1. If they were discussing firing Altman for at least a month, that raises questions about why they weren't better prepared, or why they timed the firing so poorly given the tender offer.
6. (8:00) Toner says that Altman was the board's main conduit of information about the company. They had been trying to improve processes going into the fall, these issues had been long standing.
7. (8:40) Then in October two executives went to the board and said they couldn't trust Altman, that the atmospher...

May 30, 2024 • 6min
LW - Non-Disparagement Canaries for OpenAI by aysja
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Non-Disparagement Canaries for OpenAI, published by aysja on May 30, 2024 on LessWrong.
Since at least 2017, OpenAI has asked departing employees to sign offboarding agreements which legally bind them to permanently - that is, for the rest of their lives - refrain from criticizing OpenAI, or from otherwise taking any actions which might damage its finances or reputation.[1]
If they refused to sign, OpenAI threatened to take back (or make unsellable) all of their already-vested equity - a huge portion of their overall compensation, which often amounted to millions of dollars. Given this immense pressure, it seems likely that most employees signed.
If they did sign, they became personally liable forevermore for any financial or reputational harm they later caused. This liability was unbounded, so had the potential to be financially ruinous - if, say, they later wrote a blog post critical of OpenAI, they might in principle be found liable for damages far in excess of their net worth.
These extreme provisions allowed OpenAI to systematically silence criticism from its former employees, of which there are now hundreds working throughout the tech industry. And since the agreement also prevented signatories from even disclosing that they had signed this agreement, their silence was easy to misinterpret as evidence that they didn't have notable criticisms to voice.
We were curious about who may have been silenced in this way, and where they work now, so we assembled an (incomplete) list of former OpenAI employees.[2] From what we were able to find, it appears that over 500 people may have signed these agreements, of which only 3 have publicly reported being released so far.[3]
We were especially alarmed to notice that the list contains at least 12 former employees currently working on AI policy, and 6 working on safety evaluations.[4] This includes some in leadership positions, for example:
Beth Barnes (Head of Research, METR)
Bilva Chandra (Senior AI Policy Advisor, NIST)
Charlotte Stix (Head of Governance, Apollo Research)
Chris Painter (Head of Policy, METR)
Geoffrey Irving (Research Director, AI Safety Institute)
Jack Clark (Co-Founder [focused on policy and evals], Anthropic)
Jade Leung (CTO, AI Safety Institute)
Paul Christiano (Head of Safety, AI Safety Institute)
Remco Zwetsloot (Executive Director, Horizon Institute for Public Service)
In our view, it seems hard to trust that people could effectively evaluate or regulate AI, while under strict legal obligation to avoid sharing critical evaluations of a top AI lab, or from taking any other actions which might make the company less valuable (as many regulations presumably would). So if any of these people are not subject to these agreements, we encourage them to mention this in public.
It is rare for company offboarding agreements to contain provisions this extreme - especially those which prevent people from even disclosing that the agreement itself exists. But such provisions are relatively common in the American intelligence industry. The NSA periodically forces telecommunications providers to reveal their clients' data, for example, and when they do the providers are typically prohibited from disclosing that this ever happened.
In response, some companies began listing warrant canaries on their websites - sentences stating that they had never yet been forced to reveal any client data. If at some point they did receive such a warrant, they could then remove the canary without violating their legal non-disclosure obligation, thereby allowing the public to gain indirect evidence about this otherwise-invisible surveillance.
Until recently, OpenAI succeeded at preventing hundreds of its former employees from ever being able to criticize them, and prevented most others - including many of their current employees! - from...