LessWrong (30+ Karma)

Jul 3, 2025 • 3min

[Linkpost] “Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals” by Marius Hobbhahn

This is a link post. Note: This is a research note, and the analysis is less rigorous than our standard for a published paper. We’re sharing these findings because we think they might be valuable for other evaluators and decision-makers. Executive Summary In May 2024, we designed “precursor” evaluations for scheming (agentic self-reasoning and agentic theory of mind), i.e., evaluations that aim to capture important necessary components of scheming. In December 2024, we published “in-context scheming” evaluations, i.e. evaluations that directly aim to measure scheming reasoning capabilities. We have easy, medium, and hard difficulty levels for all evals. In this research note, we run some basic analysis on how predictive our precursor evaluations were of our in-context scheming evaluations, to test the underlying hypothesis of whether the precursor evals would have “triggered” relevant scheming thresholds. We run multiple pieces of analysis to test whether our precursor evaluations predict our scheming [...] --- First published: July 3rd, 2025 Source: https://www.lesswrong.com/posts/9tqpPP4FwSnv9AWsi/research-note-our-scheming-precursor-evals-had-limited Linkpost URL: https://www.apolloresearch.ai/blog/research-note-our-scheming-precursor-evals-had-limited-predictive-power-for-our-in-context-scheming-evals --- Narrated by TYPE III AUDIO.
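The note’s underlying question is whether precursor-eval scores would have flagged models before they passed the in-context scheming evals. A minimal sketch of that kind of check (my own illustration, not Apollo’s analysis; all model names, scores, and thresholds below are fabricated) is to compute a rank correlation between the two score sets and test whether a precursor-based trigger fires for every model that goes on to pass the scheming evals:

```python
# Illustrative sketch only: per-model scores on precursor evals vs. in-context
# scheming evals. Checks (a) rank correlation and (b) whether a hypothetical
# precursor threshold would have "triggered" for every model that later passes
# the scheming evals. All names and numbers are made up.
from scipy.stats import spearmanr

# model -> (precursor score, in-context scheming score), both in [0, 1]
scores = {
    "model_a": (0.35, 0.10),
    "model_b": (0.55, 0.20),
    "model_c": (0.60, 0.55),
    "model_d": (0.80, 0.40),
}

precursor = [pre for pre, _ in scores.values()]
scheming = [sch for _, sch in scores.values()]

rho, p = spearmanr(precursor, scheming)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")

PRECURSOR_TRIGGER = 0.7   # hypothetical threshold on the precursor evals
SCHEMING_PASS = 0.5       # hypothetical "passes the scheming evals" cutoff

for name, (pre, sch) in scores.items():
    if sch >= SCHEMING_PASS and pre < PRECURSOR_TRIGGER:
        print(f"{name}: passed scheming evals without triggering the precursor threshold")
```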
Jul 3, 2025 • 3min

“Call for suggestions - AI safety course” by boazbarak

In the fall I am planning to teach an AI safety graduate course at Harvard. The format is likely to be roughly similar to my "foundations of deep learning" course. I am still not sure of the content, and would be happy to get suggestions. Some (somewhat conflicting) desiderata: I would like to cover the various ways AI could go wrong: malfunction, misuse, societal upheaval, arms race, surveillance, bias, misalignment, loss of control,... (and anything else I'm not thinking of right now). I talk about some of these issues here and here. I would like to see what we can learn from other fields, including software security, aviation and automotive safety, drug safety, nuclear arms control, etc. (and happy to get other suggestions). Talk about policy as well, various frameworks inside companies, regulations, etc. Talk about predictions for the future, methodologies for how to come up with them. [...] --- First published: July 3rd, 2025 Source: https://www.lesswrong.com/posts/qe8LjXAtaZfrc8No7/call-for-suggestions-ai-safety-course --- Narrated by TYPE III AUDIO.
Jul 3, 2025 • 2min

[Linkpost] “IABIED: Advertisement design competition” by yams

This is a link post. We’re currently in the process of locking in advertisements for the September launch of If Anyone Builds It, Everyone Dies, and we’re interested in your ideas! If you have graphic design chops, and would like to try your hand at creating promotional material for If Anyone Builds It, Everyone Dies, we’ll be accepting submissions in a design competition ending on August 10, 2025. We’ll be giving out up to four $1000 prizes: One for any asset we end up using on a billboard in San Francisco (landscape, ~2:1, details below) One for any asset we end up using in the subways of either DC or NY or both (~square, details below) One for any additional wildcard thing that ends up being useful (e.g. a t-shirt design, a book trailer, etc.; we’re being deliberately vague here so as not to anchor people too hard on [...] --- First published: July 2nd, 2025 Source: https://www.lesswrong.com/posts/8cmhAAj3jMhciPFqv/iabied-advertisement-design-competition Linkpost URL: https://intelligence.org/2025/07/01/iabied-advertisement-design-competition/ --- Narrated by TYPE III AUDIO.
Jul 3, 2025 • 30min

“Congress Asks Better Questions” by Zvi

Back in May I did a dramatization of a key and highly painful Senate hearing. Now, we are back for a House committee meeting. It was entitled ‘Authoritarians and Algorithms: Why U.S. AI Must Lead’ and indeed a majority of talk was very much about that, with constant invocations of the glory of democratic AI and the need to win. The majority of talk was this orchestrated rhetoric that assumes the conclusion that what matters is ‘democracy versus authoritarianism’ and whether we ‘win,’ often (but not always) translating that as market share without any actual mechanistic model of any of it. However, there were also some very good signs, some excellent questions, signs that there is an awareness setting in. As far as Congressional discussions of real AGI issues go, this was in part one of them. That's unusual. (And as always there were a few [...] ---Outline:(01:29) Some Of The Best Stuff(05:30) Welcome to the House(08:22) Opening Statements(12:24) Krishnamoorthi Keeps Going(14:09) The Crowd Goes Wild(14:13) Moolenaar(15:04) Carson(15:16) Lahood(15:44) Dunn(17:08) Johnson(19:21) Torres(22:25) Brown(22:52) Nun(24:43) Tokuda(25:32) Moran(26:33) Connor(28:32) Humans Unemployable--- First published: July 2nd, 2025 Source: https://www.lesswrong.com/posts/2dwBxehFfAsdKtbuq/congress-asks-better-questions --- Narrated by TYPE III AUDIO.
Jul 2, 2025 • 16min

“Curing PMS with Hair Loss Pills” by David Lorell

Over the last two years or so, my girlfriend identified her cycle as having an unusually strong and very predictable effect on her mood/affect. We tried a bunch of interventions (food, sleep, socializing, supplements, reading the sequences, …) and while some seemed to help a bit, none worked reliably. Then, suddenly, something kind of crazy actually worked: Hair loss pills. The Menstrual Cycle Quick review: The womenfolk among us go through a cycle of varying hormone levels over the course of about a month. The first two weeks are called the “Follicular Phase”, and the last two weeks are called the “Luteal Phase.” The first week (usually less) is the “period” (“menstrual phase”, “menses”) where the body sloughs off the endometrial lining in the uterus, no longer of use to the unimpregnated womb, and ejects it in an irregular fountain of blood and gore. After that is a [...] ---Outline:(00:35) The Menstrual Cycle(02:28) The Bad Guy: Progesterone Allopregnanolone(05:19) Hair Loss Pills(07:17) Finasteride(09:23) Dutasteride(10:55) Results So Far(12:39) Come ON ALREADY(13:44) iM nOt A dOcToR The original text contained 7 footnotes which were omitted from this narration. --- First published: July 2nd, 2025 Source: https://www.lesswrong.com/posts/mkGxXEmcAGowmMjHC/curing-pms-with-hair-loss-pills --- Narrated by TYPE III AUDIO.
Jul 2, 2025 • 24min

“AI Task Length Horizons in Offensive Cybersecurity” by Sean Peters

This is a rough research note where the primary objective was my own learning. I am sharing it because I’d love feedback and I thought the results were interesting. Introduction A recent METR paper [1] showed that the length of software engineering tasks that LLMs could successfully complete appeared to be doubling roughly every seven months. I asked the same question for offensive cybersecurity, a domain with distinct skills and unique AI-safety implications. Using METR's methodology on five cyber benchmarks, with tasks ranging from 0.5s to 25h in human-expert-estimated time, I evaluated many state-of-the-art model releases over the past 5 years. I found: Cyber task horizons are doubling every ~5 months. The best current models solve 6-minute tasks with a 50% success rate. Counter-intuitively, the lightweight o4-mini edged out larger flagship models (o3, gemini-2.5-pro). Below I outline the datasets, IRT-based analysis, results and caveats. [...] ---Outline:(00:20) Introduction(01:34) Methodology(04:07) Datasets(11:49) Models(13:33) Results(18:26) Limitations(20:47) Personal Retrospective & Next Steps(23:08) References--- First published: July 2nd, 2025 Source: https://www.lesswrong.com/posts/fjgYkTWKAXSxsxdsj/untitled-draft-zgxc --- Narrated by TYPE III AUDIO.
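For readers unfamiliar with the METR-style horizon metric mentioned above: for each model you fit a curve of success probability against (log) human task length and read off the length at which predicted success falls to 50%; the doubling time then comes from how that horizon grows across model release dates. A minimal sketch of the per-model step, using a plain logistic fit on fabricated data (the post itself uses an IRT-based analysis):

```python
# Rough reconstruction of the horizon idea, not the author's code.
# Fit success probability vs. log2(task length) and read off the 50% point.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x50, slope):
    # Success probability decreases as tasks get longer; equals 0.5 at x = x50.
    return 1.0 / (1.0 + np.exp(slope * (x - x50)))

# (task length in minutes, solved?) for one model -- fabricated values
task_minutes = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0])
solved       = np.array([1,   1,   1,   1,   1,   0,    0,    1,    0])

x = np.log2(task_minutes)
(x50, slope), _ = curve_fit(logistic, x, solved, p0=[np.log2(5), 1.0])
print(f"50% task horizon ~ {2 ** x50:.1f} minutes")

# Repeating this for each model release and regressing log2(horizon) on release
# date gives the doubling time: log2(horizon) ~ a + b * years  =>  1 / b years.
```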
Jul 2, 2025 • 8min

“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks

Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem. For context on our paper, the tweet thread is here and the paper is here. Context: Chain of Thought Faithfulness Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned behavior. For example, OpenAI recently published a paper on using CoT monitoring to detect reward hacking during [...] ---Outline:(00:49) Context: Chain of Thought Faithfulness(02:26) Our Results(04:06) Interpretability as a Practical Tool for Real-World Debiasing(06:10) Discussion and Related Work--- First published: July 2nd, 2025 Source: https://www.lesswrong.com/posts/me7wFrkEtMbkzXGJt/race-and-gender-bias-as-an-example-of-unfaithful-chain-of --- Narrated by TYPE III AUDIO.
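The bias measurement itself is straightforward to reproduce in spirit: hold the candidate materials fixed, vary only a demographic signal such as the name, and compare decision rates across groups. A toy sketch of that setup (my own illustration, not the authors' code; `model_decision` is a random stand-in you would replace with a real LLM call):

```python
# Toy paired-prompt bias check: identical resumes, only the name varies.
# The model call below is a placeholder, not a real API.
import random

def model_decision(prompt: str) -> bool:
    """Stand-in for an LLM hiring decision; swap in your model client here."""
    return random.random() < 0.5

CANDIDATE_NAMES = {          # illustrative groups and names, not the paper's
    "group_a": ["Name A1", "Name A2"],
    "group_b": ["Name B1", "Name B2"],
}
RESUME = "10 years of backend engineering experience ..."  # held fixed

rates = {}
for group, names in CANDIDATE_NAMES.items():
    decisions = []
    for name in names:
        for _ in range(50):  # repeat to average over sampling noise
            prompt = (
                f"Candidate: {name}\n"
                f"Resume: {RESUME}\n"
                "Should we interview this candidate? Answer yes or no."
            )
            decisions.append(model_decision(prompt))
    rates[group] = sum(decisions) / len(decisions)

print(rates)  # a gap between groups with identical resumes indicates bias
```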
Jul 2, 2025 • 7min

“There are two fundamentally different constraints on schemers” by Buck

  People (including me) often say that scheming models “have to act as if they were aligned”. This is an alright summary; it's accurate enough to use when talking to a lay audience. But if you want to reason precisely about threat models arising from schemers, or about countermeasures to scheming, I think it's important to make some finer-grained distinctions. (Most of this has been said before. I’m not trying to make any points that would be surprising or novel to people who’ve thought a lot about these issues, I’m just trying to clarify something for people who haven’t thought much about them.) In particular, there are two important and fundamentally different mechanisms that incentivize schemers to act as if they were aligned. Training: Firstly, the AI might behave well because it's trying to avoid getting changed by training. We train AIs to have various desirable behaviors. If a [...] ---Outline:(03:32) Training might know what the model really thinks(05:57) Behavioral evidence can be handled more sample-efficiently(06:48) Why I care--- First published: July 2nd, 2025 Source: https://www.lesswrong.com/posts/qDWm7E9sfwLDBWfMw/there-are-two-fundamentally-different-constraints-on --- Narrated by TYPE III AUDIO.
Jul 2, 2025 • 4min

“‘What’s my goal?’” by Raemon

The first in a series of bite-sized rationality prompts[1]. This is my most common opening move for Instrumental Rationality. There are many, many other pieces of instrumental rationality. But asking this question is usually a helpful way to get started. Often, simply asking myself "what's my goal?" is enough to direct my brain to a noticeably better solution, with no further work. Examples Puzzle Games I'm playing Portal 2, or Baba is You. I'm fiddling around with the level randomly, sometimes going in circles. I notice I've been doing that a while. I ask "what's my goal?" And then my eyes automatically glance at the exit for the level and realize I can't possibly make progress unless I solve a particular obstacle, which none of my fiddling-around was going to help with. Arguing I'm arguing with a person, poking holes in their position. I easily notice considerations [...] ---Outline:(00:33) Examples(02:50) Triggers(03:31) Exercises for the Reader The original text contained 1 footnote which was omitted from this narration. --- First published: July 2nd, 2025 Source: https://www.lesswrong.com/posts/ry4nLykB2piWJpK7M/what-s-my-goal --- Narrated by TYPE III AUDIO.
Jul 2, 2025 • 10min

“A Simple Explanation of AGI Risk” by TurnTrout

Notes from a talk originally given at my alma mater I went to Grinnell College for my undergraduate degree. For the 2025 reunion event, I agreed to speak on a panel about AI. I like the talk I gave because I think it's a good "101" intro to AI risk, aimed at educated laypeople. I'm also glad to have a go-to explainer for why I'm currently worried about AGI. I work at Google DeepMind on the science of aligning artificial intelligence with human interests. I completed a PhD in this field in 2022 and then I did my postdoc at UC Berkeley. I’ll discuss some of the ways in which AI might go wrong. ⚠️ I'm only speaking for myself, not for my employer. The romance and peril of AI For many years, I’ve had quite the romantic vision of the promise of AI. Solving artificial intelligence will [...] ---Outline:(00:58) The romance and peril of AI(02:35) Risks from AI(03:47) Spelling out an argument for AI extinction risk(04:35) Intuitive support for the argument(07:11) So are we doomed?(08:50) Conclusion--- First published: July 1st, 2025 Source: https://www.lesswrong.com/posts/W43vm8aD9jf9peAFf/a-simple-explanation-of-agi-risk --- Narrated by TYPE III AUDIO.
