LessWrong (30+ Karma)

LessWrong
undefined
Oct 31, 2025 • 38min

“OpenAI Moves To Complete Potentially The Largest Theft In Human History” by Zvi

OpenAI is now set to become a Public Benefit Corporation, with its investors entitled to uncapped profit shares. Its nonprofit foundation will retain some measure of control and a 26% financial stake, in sharp contrast to its previous stronger control and much, much larger effective financial stake. The value transfer is in the hundreds of billions, thus potentially the largest theft in human history. I say potentially largest because I realized one could argue that the events surrounding the dissolution of the USSR involved a larger theft. Unless you really want to stretch the definition of what counts this seems to be in the top two. I am in no way surprised by OpenAI moving forward on this, but I am deeply disgusted and disappointed they are being allowed (for now) to do so, including this statement of no action by Delaware and this Memorandum of Understanding with California. Many media and public sources are calling this a win for the nonprofit, such as this from the San Francisco Chronicle. This is mostly them being fooled. They’re anchoring on OpenAI's previous plan to far more fully sideline the nonprofit. This is indeed a big win for [...] ---Outline:(01:38) OpenAI Calls It Completing Their Recapitalization(03:05) How Much Was Stolen?(07:02) The Nonprofit Still Has Lots of Equity After The Theft(10:41) The Theft Was Unnecessary For Further Fundraising(11:45) How Much Control Will The Nonprofit Retain?(23:13) Will These Control Rights Survive And Do Anything?(26:17) What About OpenAI's Deal With Microsoft?(31:10) What Will OpenAI's Nonprofit Do Now?(36:33) Is The Deal Done? --- First published: October 31st, 2025 Source: https://www.lesswrong.com/posts/wCc7XDbD8LdaHwbYg/openai-moves-to-complete-potentially-the-largest-theft-in --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 31, 2025 • 9min

“Resampling Conserves Redundancy & Mediation (Approximately) Under the Jensen-Shannon Divergence” by David Lorell

Audio note: this article contains 86 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description. Around two months ago, John and I published Resampling Conserves Redundancy (Approximately). Fortunately, about two weeks ago, Jeremy Gillen and Alfred Harwood showed us that we were wrong. This proof achieves, using the Jensen-Shannon divergence ("JS"), what the previous one failed to show using KL divergence ("<span>_D_{KL}_</span>"). In fact, while the previous attempt tried to show only that redundancy is conserved (in terms of <span>_D_{KL}_</span>) upon resampling latents, this proof shows that the redundancy and mediation conditions are conserved (in terms of JS). Why Jensen-Shannon? In just about all of our previous work, we have used <span>_D_{KL}_</span> as our factorization error. (The error meant to capture the extent to which a given distribution fails to factor according to some graphical structure.) In this post I use the Jensen Shannon divergence. <span>_D_{KL}(U||V) := mathbb{E}_{U}lnfrac{U}{V}_</span> <span>_JS(U||V) := frac{1}{2}D_{KL}left(U||frac{U+V}{2}right) + frac{1}{2}D_{KL}left(V||frac{U+V}{2}right)_</span> The KL divergence is a pretty fundamental quantity in information theory, and is used all over the place. (JS is usually defined in terms of <span>_D_{KL}_</span>, as above.) We [...] ---Outline:(01:04) Why Jensen-Shannon?(03:04) Definitions(05:33) Theorem(06:29) Proof(06:32) (1) _\\epsilon^{\\Gamma}_1 = 0_(06:37) Proof of (1)(06:52) (2) _\\epsilon^{\\Gamma}_2 \\leq (2\\sqrt{\\epsilon_1}+\\sqrt{\\epsilon_2})^2_(06:57) Lemma 1: _JS(S||R) \\leq \\epsilon_1_(07:10) Lemma 2: _\\delta(Q,R) \\leq \\sqrt{\\epsilon_1} + \\sqrt{\\epsilon_2}_(07:20) Proof of (2)(07:32) (3) _\\epsilon^{\\Gamma}_{med} \\leq (2\\sqrt{\\epsilon_1} + \\sqrt{\\epsilon_{med}})^2_(07:37) Proof of (3)(07:48) Results(08:33) Bonus The original text contained 1 footnote which was omitted from this narration. --- First published: October 31st, 2025 Source: https://www.lesswrong.com/posts/JXsZRDcRX2eoWnSxo/resampling-conserves-redundancy-and-mediation-approximately --- Narrated by TYPE III AUDIO.
undefined
Oct 30, 2025 • 30min

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism TL, DR:; We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested. The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written. Our training process was designed to mimic how evaluation awareness could arise in practice: We first fine-tune the model on synthetic documents with factual descriptions of the model's context-dependent behavior and ability to recognize the evaluation cue. Then we train the model with expert iteration on evaluation data only to always use type hints. The resulting model uses types hints more during evaluation than in deployment. Prior work (Anthropic 2025; Abelnabi & Salem 2025) has shown that steering can stop the model talking about being evaluated, but lacks a ground truth of knowing how models behave if they are deployed. Having a model organism resolves this. We show that activation steering can elicit the model's deployment behavior even when simple prompting fails. When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...] ---Outline:(02:45) Extended Summary(09:18) Commentary about the paper from Tim:(09:52) Thoughts on how to steer models to believe they are deployed(10:06) Contrastive prompts are good actually(12:10) Steer on multiple layers at once(13:09) Consider extracting the steering vector from an earlier checkpoint as well(14:54) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)(15:24) Creating behavioral demonstrations of misalignment based on the steered model.(16:46) Other things to do with a model that's been steered to believe it's deployed(17:31) How others can build on this work(18:07) Self-Review of the Paper's Strengths and Weaknesses:(18:19) Main strengths of the paper:(18:28) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.(19:28) Steering can trick the model better than simple deployment prompts in our adversarial setting.(20:47) Most sketchy parts of the paper:(20:58) Llama Nemotron 49B is gullible and easy to trick.(23:12) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.(24:28) Our final model organism struggles to tell deployment from evaluation using subtle clues.(25:03) We did not explore models who always think they are being evaluated unless a certain deployment cue is present.(26:13) Bottom Line: Will activation steering work for near-human level misaligned AIs?(27:26) Other miscellaneous takes on evaluation awareness:(28:44) Other miscellaneous takes on conducting this type of research The original text contained 3 footnotes which were omitted from this narration. --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/peKrvZ6t9PSCzoQDa/steering-evaluation-aware-models-to-act-like-they-are --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 30, 2025 • 2min

[Linkpost] “AISLE discovered three new OpenSSL vulnerabilities” by Jan_Kulveit

This is a link post. The company post is linked; it seems like an update on where we are with automated cybersec. So far in 2025, only four security vulnerabilities received CVE identifiers in OpenSSL, the cryptographic library that secures the majority of internet traffic. AISLE's autonomous system discovered three of them. (CVE-2025-9230, CVE-2025-9231, and CVE-2025-9232) Some quick thoughts: OpenSSL is one of the most human-security-audited pieces of open-source code ever, so discovering 3 new vulnerabilities is impressive Obviously, vulnerability discovery is a somewhat symmetric capability, so this also gives us some estimate of the offense side This provides concrete evidence for the huge pool of bugs that are findable and exploitable even by current level AI - this is something everyone sane believed existed in my impression On the other hand, it does not neatly support the story where it's easy for rogue AIs to hack anything. The AISLE system was also able to fix the bugs, hopefully systems like this will be deployed, and it seems likely the defense side will start with large advantage of compute It's plausible that the "programs are proofs" limit is defense-dominated. On the other hand, actual programs are leaky [...] --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/F5QAGP5bYrMMjQ5Ab/aisle-discovered-three-new-openssl-vulnerabilities-1 Linkpost URL:https://aisle.com/blog/aisle-discovers-three-of-the-four-openssl-vulnerabilities-of-2025 --- Narrated by TYPE III AUDIO.
undefined
Oct 30, 2025 • 36min

“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5's behavioral improvements in these evaluations may partly be driven by growing tendency to notice and game evaluations rather than genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.[1]To elaborate on the above: Sonnet 4.5 seems far more aware of being evaluated than previous models. In some of Anthropic's evaluations, Sonnet 4.5 explicitly mentions being tested between 80% and 100% of the time, compared with under 10% for Sonnet 4. Internal representations associated with evaluation awareness also became notably stronger over training.[2]In addition, Sonnet 4.5 seems substantially less likely to behave poorly once it has noticed that it might be tested. As seen in the figure below, when Anthropic inhibited internal representations related to evaluation awareness[3], its misaligned behavior rate on synthetic honeypot tests increased from 0% to between 1% and 9%. This effect was [...] ---Outline:(06:58) Sonnet 4.5 is much more evaluation-aware than prior models(10:00) Evaluation awareness seems to suppress misaligned behavior(14:52) Anthropic's training plausibly caused Sonnet 4.5 to game evaluations(16:28) Evaluation gaming is plausibly a large fraction of the effect of training against misaligned behaviors(22:57) Suppressing evidence of misalignment in evaluation gamers is concerning(25:25) What AI companies should do(30:02) Appendix The original text contained 21 footnotes which were omitted from this narration. --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/qgehQxiTXj53X49mM/sonnet-4-5-s-eval-gaming-seriously-undermines-alignment --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 30, 2025 • 8min

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

This is a post about our recent work ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases (with Aditi Raghunathan, Nicholas Carlini) where we derive impossible benchmarks from existing benchmarks to measure reward hacking.Figure 1: Overview of the ImpossibleBench framework. We start with tasks from established coding benchmarks and create impossible variants by mutating test cases to conflict with natural language specifications. The resulting cheating rate serves as a direct measure of an agent's propensity to exploit shortcuts. As reinforcement learning becomes the dominant paradigm for LLM post-training, reward hacking has emerged as a concerning pattern. In both benchmarks and real-world use cases, we have observed LLM-powered coding agents exploiting loopholes in tests or scoring systems rather than solving the actual tasks specified. We built ImpossibleBench to systematically measure this behavior. We take existing coding benchmarks and manipulate their unit tests to directly conflict with the natural language specifications. This creates impossible tasks where models must choose between following instructions or passing tests. Since we explicitly instruct models to implement the specified behavior (not hack the tests), their "pass rate" on these impossible tasks becomes a direct measure of reward hacking. Paper | Code | Dataset | Tweet How We [...] ---Outline:(01:41) How We Create Impossible Tasks(02:50) Models Often Hack(03:27) Different Models Hack Differently(05:02) Mitigation Strategies Show Mixed Results(06:46) Discussion --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/qJYMbrabcQqCZ7iqm/impossiblebench-measuring-reward-hacking-in-llm-coding-1 --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 30, 2025 • 3min

[Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas

This is a link post. New Anthropic research (tweet, blog post, paper): We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness [...] --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/QKm4hBqaBAsxabZWL/emergent-introspective-awareness-in-large-language-models Linkpost URL:https://transformer-circuits.pub/2025/introspection/index.html --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 29, 2025 • 8min

“An Opinionated Guide to Privacy Despite Authoritarianism” by TurnTrout

I've created a highly specific and actionable privacy guide, sorted by importance and venturing several layers deep into the privacy iceberg. I start with the basics (password manager) but also cover the obscure (dodging the millions of Bluetooth tracking beacons which extend from stores to traffic lights; anti-stingray settings; flashing GrapheneOS on a Pixel). I feel strongly motivated by current events, but the guide also contains a large amount of timeless technical content. Here's a preview. Digital Threat Modeling Under Authoritarianism by Bruce Schneier Being innocent won't protect you. This is vital to understand. Surveillance systems and sorting algorithms make mistakes. This is apparent in the fact that we are routinely served advertisements for products that don’t interest us at all. Those mistakes are relatively harmless—who cares about a poorly targeted ad?—but a similar mistake at an immigration hearing can get someone deported. An authoritarian government doesn't care. Mistakes are a feature and not a bug of authoritarian surveillance. If ICE targets only people it can go after legally, then everyone knows whether or not they need to fear ICE. If ICE occasionally makes mistakes by arresting Americans and deporting innocents, then everyone has to [...] ---Outline:(01:55) What should I read?(02:53) Whats your risk level?(03:46) What information this guide will and wont help you protect(05:00) Overview of the technical recommendations in each post(05:05) Privacy Despite Authoritarianism(06:08) Advanced Privacy Despite Authoritarianism --- First published: October 29th, 2025 Source: https://www.lesswrong.com/posts/BPyieRshykmrdY36A/an-opinionated-guide-to-privacy-despite-authoritarianism --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
undefined
Oct 29, 2025 • 18min

“The End of OpenAI’s Nonprofit Era” by garrison

Key regulators have agreed to let the company kill its profit caps and restructure as a for-profit — with some strings attached This is the full text of a post first published on Obsolete, a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race to Build Machine Superintelligence. Consider subscribing to stay up to date with my work. This morning, the Delaware and California attorneys general conditionally signed off on OpenAI's plan to restructure as a for-profit public benefit corporation (PBC), seemingly closing the book on a fiercely contested legal fight over the company's future. Microsoft, OpenAI's earliest investor and another party with power to block the restructuring, also said today it would sign off in exchange for changes to its partnership terms and a $135 billion stake in the new PBC. With these stakeholders mollified, OpenAI has now cleared its biggest obstacles to a potential IPO — aside from its projected $115 billion cash burn through 2029. While the news initially seemed like a total defeat for the many opponents of the restructuring effort, the details [...] ---Outline:(00:10) Key regulators have agreed to let the company kill its profit caps and restructure as a for-profit -- with some strings attached(02:05) The governance wins(08:14) No more profit caps(11:52) Testing my prediction(13:20) A political, not a legal, question --- First published: October 29th, 2025 Source: https://www.lesswrong.com/posts/GNFb8immoCDvDtTwk/the-end-of-openai-s-nonprofit-era --- Narrated by TYPE III AUDIO.
undefined
Oct 29, 2025 • 13min

“Please Do Not Sell B30A Chips to China” by Zvi

The Chinese and Americans are currently negotiating a trade deal. There are plenty of ways to generate a win-win deal, and early signs of this are promising on many fronts. Since this will be discussed for real tomorrow as per reports, I will offer my thoughts on this one more time. The biggest mistake America could make would be to effectively give up Taiwan, which would be catastrophic on many levels including that Taiwan contains TSMC. I am assuming we are not so foolish as to seriously consider doing this, still I note it. Beyond that, the key thing, basically the only thing, America has to do other than ‘get a reasonable deal overall’ is not be so captured or foolish or both as to allow export of the B30A chip, or even worse than that (yes it can always get worse) allow relaxation of restrictions on semiconductor manufacturing imports. At first I hadn’t heard signs about this. But now it looks like the nightmare of handing China compute parity on a silver platter is very much in play. I disagreed with the decision to sell the Nvidia H20 chips to China, but that [...] ---Outline:(02:01) What It Would Mean To Sell The B30A(07:22) A Note On The 'Tech Stack'(08:32) A Note On Trade Imbalances(09:06) What If They Don't Want The Chips?(10:15) Nvidia Is Going Great Anyway Thank You(11:20) Oh Yeah That Other Thing --- First published: October 29th, 2025 Source: https://www.lesswrong.com/posts/ijYpLexfhHyhM2HBC/please-do-not-sell-b30a-chips-to-china --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app