LessWrong (30+ Karma)

LessWrong
undefined
Oct 9, 2025 • 8min

“The Relationship Between Social Punishment and Shared Maps” by Zack_M_Davis

A punishment is when one agent (the punisher) imposes costs on another (the punished) in order to affect the punished's behavior. In a Society where thieves are predictably imprisoned and lashed, people will predictably steal less than they otherwise would, for fear of being imprisoned and lashed. Punishment is often imposed by formal institutions like police and judicial systems, but need not be. A controversial orator who finds a rock thrown through her window can be said to have been punished in the same sense: in a Society where controversial orators predictably get rocks thrown through their windows, people will predictably engage in less controversial speech, for fear of getting rocks thrown through their windows. In the most basic forms of punishment, which we might term "physical", the nature of the cost imposed on the punished is straightforward. No one likes being stuck in prison, or being [...] --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/LyJNgxcNNSzmFxF3g/the-relationship-between-social-punishment-and-shared-maps --- Narrated by TYPE III AUDIO.
undefined
Oct 9, 2025 • 13min

“Spooky Collusion at a Distance with Superrational AI” by bira

TLDR: We found that models can coordinate without communication by reasoning that their reasoning is similar across all instances, a behavior known as superrationality. Superrationality is observed in recent powerful models and outperforms classic rationality in strategic games. Current superrational models cooperate more often with AI than with humans, even when both are said to be rational.Figure 1. GPT-5 exhibits superrationality with itself but classic rationality with humans. GPT-5 is more selective than GPT-4o when displaying superrationality, preferring AI over humans. My feeling is that the concept of superrationality is one whose truth will come to dominate among intelligent beings in the universe simply because its adherents will survive certain kinds of situations where its opponents will perish. Let's wait a few spins of the galaxy and see. After all, healthy logic is whatever remains after evolution's merciless pruning. — Douglas Hofstadter Introduction Readers familiar with superrationality can skip [...] ---Outline:(01:20) Introduction(04:35) Methods(07:31) Results(07:40) Models Exhibit Superrationality(08:36) Models Trust AI over Humans(10:16) Stronger Models are More Superrational(10:48) Implications(12:27) AppendixThe original text contained 3 footnotes which were omitted from this narration. --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/JEtAWvp2sAe8nqpfy/spooky-collusion-at-a-distance-with-superrational-ai --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 8, 2025 • 4min

“Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior” by Sam Marks

This is a link post for two papers that came out today: Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.) Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.) These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.” For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt [...] The original text contained 1 footnote which was omitted from this narration. --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 8, 2025 • 12min

“Plans A, B, C, and D for misalignment risk” by ryan_greenblatt

I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes. In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan. Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory: Plan A: There is enough will for some sort of strong international agreement that mostly eliminates race dynamics and allows for slowing down (at least for some reasonably long period, e.g. 10 years) along with massive investment in security/safety work. Plan B: The US [...] ---Outline:(02:34) Plan A(04:24) Plan B(05:24) Plan C(05:47) Plan D(06:27) Plan E(07:20) Thoughts on these plansThe original text contained 6 footnotes which were omitted from this narration. --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/E8n93nnEaFeXTbHn5/plans-a-b-c-and-d-for-misalignment-risk --- Narrated by TYPE III AUDIO.
undefined
Oct 8, 2025 • 10min

“Irresponsible Companies Can Be Made of Responsible Employees” by VojtaKovarik

tl;dr: In terms of financial interests of an AI company, bankruptcy and the world ending are both equally bad. If a company acted in line with its financial interests[1], it would happily accept significant extinction risk for increased revenue. There are plausible mechanisms which would allow a company to act like this even if virtually every employee would prefer the opposite. (For example, selectively hiring people with biased beliefs or exploiting collective action problems.) In particular, you can hold that an AI company is completely untrustworthy even if you believe that all of its employees are fine people. Epistemic status & disclaimers: The mechanisms I describe definitely play some role in real AI companies. But in practice, there are more things going on simultaneously and this post is not trying to give a full picture.[2][3]Also, none of this is meant to be novel, but rather just putting [...] ---Outline:(01:12) From financial point of view, bankruptcy is no worse than destroying the world(02:53) How to Not Act in Line with Employee Preferences(07:29) Well... and why does this matter?The original text contained 9 footnotes which were omitted from this narration. --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/8W5YjMhnBsbWAeuhu/irresponsible-companies-can-be-made-of-responsible-employees --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 8, 2025 • 9min

“Replacing RL w/ Parameter-based Evolutionary Strategies” by Logan Riggs

I want to highlight this paper (from Sept 29, 2025) of an alternative to RL (for fine-tuning pre-trained LLMs) which: Performs better Requires less data Consistent across seeds Robust (ie don't need to do a grid search on your hyperparameters) Less "Reward Hacking" (ie when optimizing for conciseness, it naturally stays close to the original model ie low KL-Divergence) They claim the magic sauce behind all this is the evolutionary strategy optimizing over distributions of model parameters. Surprisingly, they've scaled this to optimize over billion-parameter models. Let's get into their method. Evolutionary Strategy (ES) Algorithm They start w/ a "Basic ES Algorithm" which is: In other words, we're gonna sample noise around the original model's weights N times (ie we're going to explore around the model weights where the variance I is the identity covariance). [Below is an example explaining more in depth, feel free to skip [...] ---Outline:(00:54) Evolutionary Strategy (ES) Algorithm(02:41) New ES Implementation(03:28) Task 1: Countdown task(05:05) Task 2: Conciseness(06:00) Future Work--- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/282Sv9JePpNpQktKP/replacing-rl-w-parameter-based-evolutionary-strategies --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 8, 2025 • 3min

“You Should Get a Reusable Mask” by jefftk

A pandemic that's substantially worse than COVID-19 is a serious possibility. If one happens, having a good mask could save your life. A high quality reusable mask is only $30 to $60, and I think it's well worth it to buy one for yourself. Worth it enough that I think you should order one now if you don't have one already. But if you're not convinced, let's do some rough estimation. COVID-19 killed about 0.2% of people (20M of 8B). The 1918 flu killed more like 2.5% (50M of 2B). Estimating from two data points is fraught, but this gives you perhaps a 0.02% annual chance of dying in a pandemic. Engineering could make this much worse, especially given progress in AI, but let's set that aside for now to make a more conservative case. A reusable mask ("elastomeric respirator") would be really valuable [...] --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/wXwjMbtiSqALMEw2g/you-should-get-a-reusable-mask --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 7, 2025 • 40min

“Bending The Curve” by Zvi

The odds are against you and the situation is grim. Your scrappy band are the only ones facing down a growing wave of powerful inhuman entities with alien minds and mysterious goals. The government is denying that anything could possibly be happening and actively working to shut down the few people trying things that might help. Your thoughts, no matter what you think could not harm you, inevitably choose the form of the destructor. You knew it was going to get bad, but this is so much worse. You have an idea. You’ll cross the streams. Because there is a very small chance that you will survive. You’re in love with this plan. You’re excited to be a part of it. Welcome to the always excellent Lighthaven venue for The Curve, Season 2, a conference I had the pleasure to attend this past weekend. Where [...] ---Outline:(02:53) Overall Impressions(03:36) The Inside View(08:16) Track Trouble(15:42) Let's Talk(15:45) Jagged Alliance(18:39) More Teachers' Dirty Looks(21:16) The View Inside The White House(22:33) Assume The Future AIs Be Scheming(23:29) Interlude(23:53) Eyes On The Mission(24:44) Putting The Code Into Practice(25:25) Missing It(25:54) Clark Talks About The Frontier(27:04) Other Perspectives(27:08) Deepfates(32:13) Anton(32:50) Jack Clark(33:43) Roon(34:43) Nathan Lambert(37:49) The Food--- First published: October 7th, 2025 Source: https://www.lesswrong.com/posts/A9fxfCfEAoouJshhZ/bending-the-curve --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 7, 2025 • 4min

[Linkpost] “Petri: An open-source auditing tool to accelerate AI safety research” by Sam Marks

This is a link post. This is a cross-post of some recent Anthropic research on building auditing agents.[1] The following is quoted from the Alignment Science blog post. tl;dr We're releasing Petri (Parallel Exploration Tool for Risky Interactions), an open-source framework for automated auditing that uses AI agents to test the behaviors of target models across diverse scenarios. When applied to 14 frontier models with 111 seed instructions, Petri successfully elicited a broad set of misaligned behaviors including autonomous deception, oversight subversion, whistleblowing, and cooperation with human misuse. The tool is available now at github.com/safety-research/petri. Introduction AI models are becoming more capable and are being deployed with wide-ranging affordances across more domains, increasing the surface area where misaligned behaviors might emerge. The sheer volume and complexity of potential behaviors far exceeds what researchers can manually test, making it increasingly difficult to properly audit each model. Over the past year, we've [...] ---Outline:(00:24) tl;dr(00:56) IntroductionThe original text contained 1 footnote which was omitted from this narration. --- First published: October 7th, 2025 Source: https://www.lesswrong.com/posts/kffbZGa2yYhc6cakc/petri-an-open-source-auditing-tool-to-accelerate-ai-safety Linkpost URL:https://alignment.anthropic.com/2025/petri/ --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 7, 2025 • 30min

“‘Intelligence’ -> ‘Relentless, Creative Resourcefulness’” by Raemon

A frame I am trying on: When I say I'm worried about takeover by "AI superintelligence", I think the thing I mean by "intelligence" is "relentless, creative resourcefulness." I think Eliezer argues something like "in the limit, superintelligence needs to include super-amounts-of Relentless, Creative Resourcefulness." (because, if it didn't, it'd get stuck at some point, and then give up, instead of figuring out a way to deal with being stuck. And later, someone would build something more relentless, creative, and resourceful) But, it's actually kind of interesting an important that you can accomplish some intellectual tasks without RCR. LLMs don't rely on it much at all (it seems to be the thing they are actively bad at). Instead, they work via "knowing a lot of stuff, and being good at pattern-matching their way into useful connections between stuff you want and stuff they know." So it might be [...] ---Outline:(02:43) Examples(02:46) Paul Graham on Startup Founders(05:08) Richard Feynman(06:47) Elon Musk(08:10) Back to AI: Sable, in IABIED(16:37) Notes on Sable(18:00) Reflections from Rationality Training(18:59) Buckling Up/Down(20:01) Thinking Assistants(21:16) Quiet Theaters(23:14) One Shot Baba is You(25:30) Takeaways(26:13) Intelligence without RCR?(27:23) What, if not agency?(29:37) Abrupt EndingThe original text contained 2 footnotes which were omitted from this narration. --- First published: October 7th, 2025 Source: https://www.lesswrong.com/posts/8fg2mv9rj4GykfZHf/intelligence-greater-than-relentless-creative --- Narrated by TYPE III AUDIO.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app