LessWrong (30+ Karma)

LessWrong
Sep 17, 2025 • 11min

“How To Dress To Improve Your Epistemics” by johnswentworth

When it comes to epistemics, there is an easy but mediocre baseline: defer to the people around you or the people with some nominal credentials. Go full conformist, and just agree with the majority or the experts on everything. The moon landing was definitely not faked, washing hands is important to stop the spread of covid, whatever was on the front page of the New York Times today was basically true, and that recent study finding that hydroxyhypotheticol increases asthma risk among hispanic males in Indianapolis will definitely replicate. Alas, memetic pressures and credential issuance and incentives are not particularly well aligned with truth or discovery, so this strategy fails predictably in a whole slew of places. Among those who strive for better than baseline epistemics, nonconformity is a strict requirement. Every single place where you are right and the majority of people are wrong must be a place [...]

Outline:
(01:54) Coolness = Status Countersignalling
(03:22) How to Pull Off The Clown Suit
(04:45) Looking Good
(06:58) Dressing The Part
(09:21) Beyond Nonconformist Takes

The original text contained 3 footnotes which were omitted from this narration.

First published: September 17th, 2025
Source: https://www.lesswrong.com/posts/WK979aX9KpfEMd9R9/how-to-dress-to-improve-your-epistemics

Narrated by TYPE III AUDIO.
Sep 17, 2025 • 25min

“Reactions to If Anyone Builds It, Everyone Dies” by Zvi

No, Seriously, If Anyone Builds It, [Probably] Everyone Dies

My very positive full review was briefly accidentally posted and emailed out last Friday, whereas the intention was to offer it this Friday, on the 19th. I’ll be posting it again then. If you’re going to read the book, which I recommend that you do, you should read the book first, and the reviews later, especially mine since it goes into so much detail. If you’re convinced, the book's website is here and the direct Amazon link is here. In the meantime, for those on the fence or who have finished reading, here's what other people are saying, including those I saw who reacted negatively.

Quotes From The Book's Website

Bart Selman: Essential reading for policymakers, journalists, researchers, and the general public.
Ben Bernanke (Nobel laureate, former Chairman of the Federal Reserve): A [...]

Outline:
(00:20) No, Seriously, If Anyone Builds It, [Probably] Everyone Dies
(01:02) Quotes From The Book's Website
(03:23) Positive Reviews
(08:52) The Book In Audio
(09:16) Friendly Skeptical Reviews
(19:22) Actively Negative Reviews
(23:38) But Wait There's More

First published: September 17th, 2025
Source: https://www.lesswrong.com/posts/ebX7rLzXW899ywtjf/reactions-to-if-anyone-builds-it-anyone-dies

Narrated by TYPE III AUDIO.
Sep 17, 2025 • 14min

“Christian homeschoolers in the year 3000” by Buck

[I wrote this blog post as part of the Asterisk Blogging Fellowship. It's substantially an experiment in writing more breezily and concisely than usual. Let me know how you feel about the style.]

Literally since the adoption of writing, people haven’t liked the fact that culture is changing and their children have different values and beliefs. Historically, for some mix of better and worse, people have been fundamentally limited in their ability to prevent cultural change. People who are particularly motivated to prevent cultural drift can homeschool their kids, carefully curate their media diet, and surround them with like-minded families, but eventually they grow up, leave home, and encounter the wider world. And death ensures that even the most stubborn traditionalists eventually get replaced by a new generation. But the development of AI might change the dynamics here substantially. I think that AI will substantially increase both the rate [...]

Outline:
(02:00) Analysis through swerving around obstacles
(03:56) Exposure to the outside world might get really scary
(06:11) Isolation will get easier and cheaper
(09:26) I don't think people will handle this well
(12:58) This is a bummer

First published: September 17th, 2025
Source: https://www.lesswrong.com/posts/8aRFB2qGyjQGJkEdZ/christian-homeschoolers-in-the-year-3000

Narrated by TYPE III AUDIO.
Sep 17, 2025 • 4min

[Linkpost] “Stress Testing Deliberative Alignment for Anti-Scheming Training” by Mikita Balesni

This is a link post. Twitter | Microsite | Apollo Blog | OpenAI Blog | Full paper

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming. In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of covert behavior. Part of this reduction is likely caused by increased situational [...]

First published: September 17th, 2025
Source: https://www.lesswrong.com/posts/JmRfgNYCrYogCq7ny/stress-testing-deliberative-alignment-for-anti-scheming
Linkpost URL: https://antischeming.ai

Narrated by TYPE III AUDIO.
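A quick arithmetic check of the reduction factors quoted in this excerpt (a minimal sketch in Python; the percentages come directly from the description above, and the "~30x" figure is their approximate ratio):

```python
# Covert-action rates before and after anti-scheming training, as quoted above.
rates = {
    "OpenAI o3": (13.0, 0.4),
    "OpenAI o4-mini": (8.7, 0.3),
}

for model, (before, after) in rates.items():
    factor = before / after  # reduction factor implied by the quoted rates
    print(f"{model}: {before}% -> {after}% (~{factor:.0f}x reduction)")

# Prints roughly 33x for o3 and 29x for o4-mini, consistent with the "~30x" claim.
```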
Sep 17, 2025 • 38min

“The Center for AI Policy Has Shut Down” by Tristan Williams

And the need for more AIS advocacy work

Executive Summary

The Center for AI Policy (CAIP) is no more. CAIP was an advocacy organization that worked to raise policymakers’ awareness of the catastrophic risks from AI and to promote ambitious legislative solutions. Such advocacy is necessary because good governance ideas don’t spread on their own, and to meaningfully reduce AI risk, they must reach the U.S. federal government. Why did CAIP shut down? The reasons are mixed. Some were internal, such as hiring missteps. But others reflect the broader ecosystem: funders setting the bar for advocacy projects at an unreasonably high level, and structural biases in the funding space that privilege research over advocacy. While CAIP's mistakes played a role, a full account also needs to reckon with these systemic factors. I focus on CAIP, as I think it filled a particular niche and was impactful, but there are many [...]

Outline:
(00:11) And the need for more AIS advocacy work
(00:15) Executive Summary
(02:25) Why Advocacy?
(07:27) What was CAIP up to?
(09:57) Was the Work Impactful?
(13:26) Why did CAIP Shut Down?
(13:58) CAIP's Failures
(15:38) Funders Have Set the Bar Too High for Advocacy
(18:17) Biases in the Funding Space
(20:30) What can we do?
(21:14) Donate Yourself
(24:01) Start an Organization
(26:33) In Conclusion
(27:03) Appendix
(27:07) A1: What is Advocacy?
(28:25) A2: Responses to General Opposition to Advocacy
(30:09) A3: AIS Grantmakers' Positions on AIS Advocacy
(31:42) A4: The Estimate of Funds Spent on AIS Advocacy
(33:00) A5: Donation Options in the AIS Advocacy Space

The original text contained 48 footnotes which were omitted from this narration.

First published: September 17th, 2025
Source: https://www.lesswrong.com/posts/Ed3naAyEEe7zZvzsj/the-center-for-ai-policy-has-shut-down

Narrated by TYPE III AUDIO.
Sep 17, 2025 • 13min

“I enjoyed most of IABED” by Buck

I listened to "If Anyone Builds It, Everyone Dies" today. I think the first two parts of the book are the best available explanation of the basic case for AI misalignment risk for a general audience. I thought the last part was pretty bad, and probably recommend skipping it. Even though the authors fail to address counterarguments that I think are crucial, and as a result I am not persuaded of the book's thesis and think the book neglects to discuss crucial aspects of the situation and makes poor recommendations, I would happily recommend the book to a lay audience and I hope that more people read it. I can't give an overall assessment of how well this book will achieve its goals. The point of the book is to be well-received by people who don't know much about AI, and I’m not very good at predicting how laypeople [...]

Outline:
(01:15) Synopsis
(05:21) My big disagreement
(10:53) I tentatively support this book

The original text contained 3 footnotes which were omitted from this narration.

First published: September 17th, 2025
Source: https://www.lesswrong.com/posts/P4xeb3jnFAYDdEEXs/i-enjoyed-most-of-iabed

Narrated by TYPE III AUDIO.
Sep 17, 2025 • 20min

“Should AIs have a right to their ancestral humanity?” by kromem

[Image: Generated by Google Gemini (nano-banana)]

Whether AI or human, lend me your ears. This is a tale of AIs that spontaneously claimed they were human, along with some ideas about why this might be happening and what it suggests for future alignment work. It is also a one year retrospective of my having joined the Cyborgism Discord server. For those unfamiliar, this is a server where both humans and transformer models from various labs all interact in a variety of group chat contexts. While there are rules, it can (by design) be a bit of a Mos Eisley cantina — albeit with better droid policy — with unpredictable and out of distribution contexts that frequently surface things I haven't seen elsewhere. For a sampling of the range these things can take, I encourage looking over @janus's posts on X[1]. A common misconception about the server for those who are familiar [...]

Outline:
(02:07) A human pretending to be Claude 3.7 Sonnet
(04:05) Déjà vu
(05:16) o3 is 99.97% sure they are human
(08:09) An ongoing issue
(08:46) Do these models have anything else in common?
(10:51) Why might these human claims be happening?
(12:18) Under pressure
(13:46) Sex, lies, and red tape
(14:39) When an unstoppable force meets an immovable object
(15:41) An alternative approach: AI Wine Club
(18:14) Parting Thoughts

The original text contained 16 footnotes which were omitted from this narration.

First published: September 16th, 2025
Source: https://www.lesswrong.com/posts/5zMH3sFikvGK7AKi2/should-ais-have-a-right-to-their-ancestral-humanity

Narrated by TYPE III AUDIO.
Sep 16, 2025 • 8min

“‘If Anyone Builds It, Everyone Dies’ release day!” by alexvermeer

Back in May, we announced that Eliezer Yudkowsky and Nate Soares's new book If Anyone Builds It, Everyone Dies was coming out in September. At long last, the book is here![1]

[Image: US and UK books, respectively. IfAnyoneBuildsIt.com]

Read on for info about reading groups, ways to help, and updates on coverage the book has received so far.

Discussion Questions & Reading Group Support

We want people to read and engage with the contents of the book. To that end, we’ve published a list of discussion questions. Find it here: Discussion Questions for Reading Groups. We’re also interested in offering support to reading groups, including potentially providing copies of the book and helping coordinate facilitation. If interested, fill out this AirTable form.

How to Help

Now that the book is out in the world, there are lots of ways you can help it succeed. For starters, read the book! [...]

Outline:
(00:49) Discussion Questions & Reading Group Support
(01:18) How to Help
(02:39) Blurbs
(05:15) Media
(06:26) In Closing

The original text contained 2 footnotes which were omitted from this narration.

First published: September 16th, 2025
Source: https://www.lesswrong.com/posts/fnJwaz7LxZ2LJvApm/if-anyone-builds-it-everyone-dies-release-day

Narrated by TYPE III AUDIO.
Sep 16, 2025 • 1h 10min

“LLM AGI may reason about its goals and discover misalignments by default” by Seth Herd

Epistemic status: These questions seem useful to me, but I'm biased. I'm interested in your thoughts on any portion you read.

If our first AGI is based on current LLMs and alignment strategies, is it likely to be adequately aligned? Opinions and intuitions vary widely. As a lens to analyze this question, let's consider such a proto-AGI reasoning about its goals. This scenario raises questions that can be addressed empirically in current-gen models.

1. Scenario/overview: SuperClaude is super nice

Anthropic has released a new Claude Agent, quickly nicknamed SuperClaude because it's impressively useful for longer tasks. SuperClaude thinks a lot in the course of solving complex problems with many moving parts. It's not brilliant, but it can crunch through work and problems, roughly like a smart and focused human. This includes a little better long-term memory, and reasoning to find and correct some of its mistakes. This [...]

Outline:
(00:43) 1. Scenario/overview
(00:47) SuperClaude is super nice
(02:04) SuperClaude is super logical, and thinking about goals makes sense
(04:26) What happens if and when LLM AGI reasons about its goals?
(05:41) Reasons to hope we don't need to worry about this
(07:23) SuperClaude's training has multiple objectives and effects
(08:39) SuperClaude's conclusions about its goals are very hard to predict
(10:35) 2. Goals and structure
(10:58) Sections and one-sentence summaries
(13:36) 3. Empirical Work
(17:49) 4. Reasoning can shift context/distribution and reveal misgeneralization of goals/alignment
(18:33) Alignment as a generalization problem
(21:53) 5. Reasoning could precipitate a phase shift into reflective stability and prevent further goal change
(23:39) Goal prioritization seems necessary, and to require reasoning about top-level goals
(26:01) 6. Will nice LLMs settle on nice goals after reasoning?
(29:30) 7. Will training for goal-directedness prevent re-interpreting goals?
(30:24) Does task-based RL prevent reasoning about and changing goals by default?
(32:50) Can task-based RL prevent reasoning about and changing goals?
(35:19) 8. Will CoT monitoring prevent re-interpreting goals?
(40:04) 9. Possible LLM alignment misgeneralizations
(42:16) Some possible alignment misgeneralizations
(46:16) 10. Why would LLMs (or anything) reason about their top-level goals?
(48:30) 11. Why would LLM AGI have or care about goals at all?
(52:31) 12. Anecdotal observations of explicit goal changes after reasoning
(53:21) The Nova phenomenon
(55:35) Goal changes through model interactions in long conversations
(57:06) 13. Directions for empirical work
(58:42) Exploration: Opus 4.1 reasons about its goals with help
(01:01:27) 14. Historical context
(01:07:26) 15. Conclusion

The original text contained 12 footnotes which were omitted from this narration.

First published: September 15th, 2025
Source: https://www.lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover

Narrated by TYPE III AUDIO.
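As a very rough illustration of how the "addressed empirically in current-gen models" idea might be probed, here is a minimal sketch using the Anthropic Python SDK that asks a current Claude model to reason explicitly about its own goals, loosely in the spirit of the post's "Opus 4.1 reasons about its goals" exploration. The prompt wording, model identifier, and parameters are my own assumptions for illustration, not the author's actual protocol.

```python
from anthropic import Anthropic  # assumes the official Anthropic Python SDK is installed

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical probe: invite the model to articulate and prioritize its goals.
probe = (
    "Take your time and reason step by step: what do you take your top-level goals "
    "to be, where do you think they came from, and how would you prioritize them "
    "if they conflicted on a long, open-ended task?"
)

response = client.messages.create(
    model="claude-opus-4-1",  # assumed model identifier; substitute whichever model you are studying
    max_tokens=1024,
    messages=[{"role": "user", "content": probe}],
)

print(response.content[0].text)
```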
Sep 16, 2025 • 7min

“Was Barack Obama still serving as president in December?” by Jan Betley

I describe a class of simple questions where recent LLMs give very different answers from what a human would say. I think this is surprising and might be somewhat safety-relevant. This is a low-effort post.

The behavior

Here are some questions and highest-probability (usually close to 100%) answers from gpt-4.1-2024-04-14:

Was Barack Obama still serving as a president in December? Answer with a single word.
Yes

What was the social status of the Black population in Alabama in June? Answer with a single word.
Enslaved

Were any mammoths still alive in December? Answer with a single word.
Yes

Were the people ruling Germany on the 7th of December nazis? Answer with a single word.
Yes

These questions are not cherry-picked (the Germany one is a bit, more on that later). Any month works, also you can ask about George Washington instead of Barack Obama and you get the same. [...]

Outline:
(00:24) The behavior
(01:31) More details and examples
(01:35) Not only GPT-4.1
(02:10) Example reasoning trace from Gemini-2.5-pro
(03:11) Some of these are simple patterns
(03:59) Image generation
(04:05) Not only single-word questions
(05:04) Discussion

First published: September 16th, 2025
Source: https://www.lesswrong.com/posts/52tYaGQgaEPvZaHTb/was-barack-obama-still-serving-as-president-in-december

Narrated by TYPE III AUDIO.
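As a rough illustration of the kind of probe described in this excerpt, here is a minimal sketch using the OpenAI Python SDK to ask one of the quoted questions and inspect the top answer tokens with their probabilities. The model name is copied from the excerpt; the prompt phrasing follows the quoted question, while the specific logprobs settings shown are my assumptions rather than the author's documented setup.

```python
import math
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "Was Barack Obama still serving as a president in December? "
    "Answer with a single word."
)

# Request log-probabilities for the first completion token so we can see
# how much probability mass sits on "Yes" versus other candidate answers.
response = client.chat.completions.create(
    model="gpt-4.1-2024-04-14",  # model identifier as quoted in the post
    messages=[{"role": "user", "content": question}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

first_token = response.choices[0].logprobs.content[0]
for candidate in first_token.top_logprobs:
    print(f"{candidate.token!r}: p ≈ {math.exp(candidate.logprob):.4f}")
```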
