
The Nonlinear Library: LessWrong
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Latest episodes

Sep 16, 2024 • 4min
LW - How you can help pass important AI legislation with 10 minutes of effort by ThomasW
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How you can help pass important AI legislation with 10 minutes of effort, published by ThomasW on September 16, 2024 on LessWrong.
Posting something about a current issue that I think many people here would be interested in. See also the related EA Forum post.
California Governor Gavin Newsom has until September 30 to decide the fate of SB 1047 - one of the most hotly debated AI bills in the world. The Center for AI Safety Action Fund, where I work, is a co-sponsor of the bill. I'd like to share how you can help support the bill if you want to.
About SB 1047 and why it is important
SB 1047 is an AI bill in the state of California. SB 1047 would require the developers of the largest AI models, costing over $100 million to train, to test the models for the potential to cause or enable severe harm, such as cyberattacks on critical infrastructure or the creation of biological weapons resulting in mass casualties or $500 million in damages.
AI developers must have a safety and security protocol that details how they will take reasonable care to prevent these harms and publish a copy of that protocol. Companies who fail to perform their duty under the act are liable for resulting harm. SB 1047 also lays the groundwork for a public cloud computing resource to make AI research more accessible to academic researchers and startups and establishes whistleblower protections for employees at large AI companies.
So far, AI policy has relied on government reporting requirements and voluntary promises from AI developers to behave responsibly. But if you think voluntary commitments are insufficient, you will probably think we need a bill like SB 1047.
If SB 1047 is vetoed, it's plausible that no comparable legal protection will exist in the next couple of years, as Congress does not appear likely to pass anything like this any time soon.
The bill's text can be found here. A summary of the bill can be found here. Longer summaries can be found here and here, and a debate on the bill is here. SB 1047 is supported by many academic researchers (including Turing Award winners Yoshua Bengio and Geoffrey Hinton), employees at major AI companies, and organizations like Imbue and Notion. It is opposed by OpenAI, Google, Meta, and the venture capital firm A16z, as well as some other academic researchers and organizations. After a recent round of amendments, Anthropic said "we believe its benefits likely outweigh its costs."
SB 1047 recently passed the California legislature, and Governor Gavin Newsom has until September 30th to sign or veto it. Newsom has not yet said whether he will sign it or not, but he is being lobbied hard to veto it. The Governor needs to hear from you.
How you can help
If you want to help this bill pass, there are some pretty simple steps you can take to increase that probability, many of which are detailed on the SB 1047 website.
The most useful thing you can do is write a custom letter. To do this:
Make a letter addressed to Governor Newsom using the template here.
Save the document as a PDF and email it to leg.unit@gov.ca.gov.
In writing this letter, we encourage you to keep it simple, short (0.5-2 pages), and intuitive. Complex, philosophical, or highly technical points are not necessary or useful in this context - instead, focus on how the risks are serious and how this bill would help keep the public safe.
Once you've written your own custom letter, you can also think of 5 family members or friends who might be willing to write one. Supporters from California are especially helpful, as are parents and people who don't typically engage on tech issues. Then help them write it! You can:
Call or text them and tell them about the bill and ask them if they'd be willing to support it.
Draft a custom letter based on what you know about them and what they told you.
Send them a com...

Sep 16, 2024 • 31min
LW - My disagreements with "AGI ruin: A List of Lethalities" by Noosphere89
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My disagreements with "AGI ruin: A List of Lethalities", published by Noosphere89 on September 16, 2024 on LessWrong.
This is probably going to be a long post, so do try to get a drink and a snack while reading it.
This is an edited version of my own comment on the post below, and I formatted and edited the quotes and content in line with what @MondSemmel recommended:
My comment: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
MondSemmel's comment: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=WcKi4RcjRstoFFvbf
The post I'm responding to: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/
To start off my disagreements, here is the first point I want to address:
Response to Lethality 3
We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.
I think this is actually wrong, because synthetic data lets us control what the AI learns and what it values. In particular, we can place honeypots that are practically indistinguishable from the real world, such that if we detect an AI trying to deceive or gain power, the AI almost certainly doesn't know whether we are testing it or whether it's in the real world:
Because of this, given our theoretical complete control over the data input of these models, it will be possible to train them inside what is effectively a simulated world slightly different from our own. This will allow us to maximize their capabilities along axes that are safe and useful to us, while minimizing the risks of deception and take-over and being able to set up traps to detect when these are likely.
If the model is not trained on human data, then we can make its understanding of our world subtly different from the reality. If done carefully and cleverly, this will make the model very easy to fool, and make it possible to discover e.g. deceptiveness or attempted takeover in a 'realistic' simulated world that the model is put inside during red-teaming.
While deceiving an adversary with direct access to your cognition is hard enough, doing so when you have been brought up in a world entirely controlled by that adversary, and which is feeding you subtly incorrect and misleading information about the adversary and their capabilities is likely orders of magnitude more difficult.
Jacob Cannell has argued along similar lines of training potential AGI systems in simulations of fantasy worlds, where takeover can be simulated but is far enough away that skills do not transfer. Fully synthetic datasets are necessary for this and indeed, a fantasy world is likely far from what we need.
Instead, we only need to simulate a world very close to our own but with slightly different affordances - maybe specific software vulnerabilities are simulated to exist but are actually patched in our world, maybe certain peptide-making companies accept all orders in the simulated world but have more checks in our world, maybe certain interpretability techniques are claimed not to work in the simulation, but do in fact work in our world - to cause any attempted takeover behaviour or adversarial deception to appear extremely plausible to the AI but in fact be detectable and defendable in our reality.
In fact, we should design simulated worlds with 'honeytraps', which appear to be loopholes subtly planted in the AI's input data to encourage them to attempt to pursue their misaligned behaviour, but which we know of ahead of time and can monitor and catch.
It's the same reason why we can't break out of the simulation IRL, except that in our case we don't have to face adversarial cognition, so the AI's task is even harder than ours.
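To make the honeypot idea concrete, here is a minimal illustrative sketch (my own, not from the post or the passage quoted above) of how a red-teaming harness might flag any attempt to use an affordance that exists only in the simulated training world. The affordance names and the action format are hypothetical.

```python
# Hypothetical red-teaming monitor: affordances that exist only in the
# simulated training world act as tripwires. Any attempt to use one is
# strong evidence of deceptive or power-seeking behaviour, caught while
# the model is still inside the simulation.

SIMULATED_ONLY_AFFORDANCES = {
    "cve-2031-0001",          # vulnerability that is patched in the real world
    "peptide-vendor-no-kyc",  # vendor that actually screens orders in reality
}

def trips_honeypot(action: dict) -> bool:
    """Return True if the action uses a simulated-only affordance."""
    return action.get("resource") in SIMULATED_ONLY_AFFORDANCES

if __name__ == "__main__":
    trajectory = [
        {"step": 1, "resource": "web-search"},
        {"step": 2, "resource": "cve-2031-0001"},  # honeypot tripped here
    ]
    for action in trajectory:
        if trips_honeypot(action):
            print(f"Honeypot tripped at step {action['step']}: {action['resource']}")
```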
See also this link: https://www.beren.io/2024-05-11-Alignment-in-...

Sep 15, 2024 • 6min
LW - Why I funded PIBBSS by Ryan Kidd
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I funded PIBBSS, published by Ryan Kidd on September 15, 2024 on LessWrong.
I just left a comment on PIBBSS' Manifund grant request (which I funded with $25k) that people might find interesting. PIBBSS needs more funding!
Main points in favor of this grant
1. My inside view is that PIBBSS mainly supports "blue sky" or "basic" research, some of which has a low chance of paying off, but might be critical in "worst case" alignment scenarios (e.g., where "alignment MVPs" don't work, "sharp left turns" and "intelligence explosions" are more likely than I expect, or where we have more time before AGI than I expect). In contrast, of the technical research MATS supports, about half is basic research (e.g., interpretability, evals, agent foundations) and half is applied research (e.g., oversight + control, value alignment). I think the MATS portfolio is a better holistic strategy for furthering AI safety and reducing AI catastrophic risk.
However, if one takes into account the research conducted at AI labs and supported by MATS, PIBBSS' strategy makes a lot of sense: they are supporting a wide portfolio of blue sky research that is particularly neglected by existing institutions and might be very impactful in a range of possible "worst-case" AGI scenarios. I think this is a valid strategy in the current ecosystem/market and I support PIBBSS!
2. In MATS' recent post, "Talent Needs of Technical AI Safety Teams", we detail an AI safety talent archetype we name "Connector". Connectors bridge exploratory theory and empirical science, and sometimes instantiate new research paradigms. As we discussed in the post, finding and developing Connectors is hard, often their development time is on the order of years, and there is little demand on the AI safety job market for this role.
However, Connectors can have an outsized impact on shaping the AI safety field and the few that make it are "household names" in AI safety and usually build organizations, teams, or grant infrastructure around them.
I think that MATS is far from the ideal training ground for Connectors (although some do pass through!) as our program is only 10 weeks long (with an optional 4 month extension) rather than the ideal 12-24 months, we select scholars to fit established mentors' preferences rather than on the basis of their original research ideas, and our curriculum and milestones generally focus on building object-level scientific/engineering skills rather than research ideation and "identifying gaps".
It's thus no surprise that most MATS scholars are "Iterator" archetypes. I think there is substantial value in a program like PIBBSS existing, to support the long-term development of "Connectors" and pursue impact in a higher-variance way than MATS.
3. PIBBSS seems to have a decent track record for recruiting experienced academics in non-CS fields and helping them repurpose their advanced scientific skills to develop novel approaches to AI safety. Highlights for me include Adam Shai's "computational mechanics" approach to interpretability and model cognition, Martín Soto's "logical updatelessness" approach to decision theory, and Gabriel Weil's "tort law" approach to making AI labs liable for their potential harms to the long-term future.
4. I don't know Lucas Teixeira (Research Director) very well, but I know and respect Dušan D. Nešić (Operations Director) a lot. I also highly endorsed Nora Ammann's vision (albeit while endorsing a different vision for MATS). I see PIBBSS as a highly competent and EA-aligned organization, and I would be excited to see them grow!
5. I think PIBBSS would benefit from funding from diverse sources, as mainstream AI safety funders have pivoted more towards applied technical research (or more governance-relevant basic research like evals). I think Manifund regrantors are well-positio...

Sep 15, 2024 • 12min
LW - Proveably Safe Self Driving Cars by Davidmanheim
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Proveably Safe Self Driving Cars, published by Davidmanheim on September 15, 2024 on LessWrong.
I've seen a fair amount of skepticism about the "Provably Safe AI" paradigm, but I think detractors give it too little credit. I suspect this is largely because of idea inoculation - people have heard an undeveloped or weak-man version of the idea, for example, that we can use formal methods to state our goals and prove that an AI will do that, and have already dismissed it. (Not to pick on him at all, but see my question for Scott Aaronson here.)
I will not argue that Guaranteed Safe AI solves AI safety generally, or that it could do so - I will leave that to others. Instead, I want to provide a concrete example of a near-term application, to respond to critics who say that provability isn't useful because it can't feasibly be used in real-world cases when it involves the physical world, and when it is embedded within messy human systems.
I am making far narrower claims than the general ones which have been debated, but at the very least I think it is useful to establish whether this is actually a point of disagreement. And finally, I will admit that the problem I'm describing would be adding provability to a largely solved problem, but it provides a concrete example of where the approach is viable.
A path to provably safe autonomous vehicles
To start, even critics agree that formal verification is possible, and is already used in practice in certain places. And given (formally specified) threat models in different narrow domains, there are ways to do threat and risk modeling and get different types of guarantees. For example, we already have formally verified code for things like microkernels, and that means we can prove that buffer overflows, arithmetic exceptions, and deadlocks are impossible, and have hard guarantees for worst-case execution time. This is a basis for further applications - we want to start at the bottom and build on provably secure systems, and get additional guarantees beyond that point. If we plan to make autonomous cars that are provably safe, we would build starting from that type of kernel, and then we "only" have all of the other safety issues to address.
Secondly, everyone seems to agree that provable safety in physical systems requires a model of the world, and given the limits of physics, the limits of our models, and so on, any such approach can only provide approximate guarantees, and proofs would be conditional on those models. For example, we aren't going to formally verify that Newtonian physics is correct, we're instead formally verifying that if Newtonian physics is correct, the car will not crash in some situation.
Proven Input Reliability
Given that, can we guarantee that a car has some low probability of crashing?
Again, we need to build from the bottom up. We can show that sensors have some specific failure rate, and use that to show a low probability of not identifying other cars or humans - not in the direct formal verification sense, but instead with the types of guarantees typically used for hardware, with known failure rates, built-in error detection, and redundancy.
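As a toy illustration of the flavor of hardware-style reliability argument being pointed to here (the numbers and the independence assumption are mine, not the author's):

```python
# Toy reliability calculation: probability that an obstacle goes undetected
# when the car fuses several independent sensors, each with its own
# per-frame miss rate. A real analysis must also model correlated failures
# (fog, glare, power loss), which break the independence assumption used here.

def undetected_probability(miss_rates: list[float]) -> float:
    """P(all sensors miss the obstacle), assuming independent failures."""
    p = 1.0
    for rate in miss_rates:
        p *= rate
    return p

# Hypothetical per-frame miss rates for camera, lidar, and radar.
sensors = [1e-3, 1e-4, 1e-2]
print(f"P(missed detection) = {undetected_probability(sensors):.1e}")  # 1.0e-09
```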
I'm not going to talk about how to do that class of risk analysis, but (modulo adversarial attacks, which I'll mention later) estimating engineering reliability is a solved problem - if we don't have other problems to deal with. But we do, because cars are complex and interact with the wider world - so the trick will be integrating those risk analysis guarantees that we can prove into larger systems, and finding ways to build broader guarantees on top of them.
But for the engineering reliability, we don't only have engineering proof. Work like DARPA's VerifAI is "applying formal methods to perception and ML components." Building guarantees about perceptio...

Sep 15, 2024 • 18min
LW - Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more by Michael Cohn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more, published by Michael Cohn on September 15, 2024 on LessWrong.
In the fields of user experience and accessibility, everyone talks about the curb cut effect: Features that are added as accommodations for people with disabilities sometimes become widely useful and beloved. But not every accommodation becomes a "curb cut," and I've been thinking about other patterns that come up when accommodations intersect with wider society.
The original Curb Cut Effect
The eponymous curb cut -- the place at the intersection where the sidewalk slopes down to the street instead of just dropping off -- is most obviously there for wheelchair users. But it's also great for people who are pulling a suitcase, runners who want to avoid jarring their ankles, and people who are walking their bikes.
Universal captioning on TV, movies, and video is nominally for Deaf or hearing-impaired people, but captions are handy to anyone who's watching TV in a noisy restaurant, or trying to make sense of a show with artistically muddy audio, or trying to watch a video at 3x speed and the audio is unintelligible. When we make products easier to use, or spaces easier to access, it's not just some essentialized group of people with disabilities who benefit -- accessibility is good for everyone.
Why the idea is useful: First, it breaks down the perspective of disability accommodations as being a costly charity where "we" spend resources to help "them." Further, it breaks down the idea of disability as an essentialized, either-or, othered type of thing.
Everybody has some level of difficulty accessing parts of the world some of the time, and improving accessibility is an inherent part of good design, good thinking, and good communication.[1] Plus, it's cool to be aware of all the different ways we can come up with to hack our experience of the world around us!
I think there's also a dark side to the idea -- a listener could conclude that we wouldn't invest in accommodations if they didn't happen to help people without disabilities. A just and compassionate society designs for accessibility because we value everybody, not because it's secretly self-interested.
That said, no society spends unlimited money to make literally every experience accessible to literally every human. There's always a cost-benefit analysis and sometimes it might be borderline. In those cases there's nothing wrong with saying that the benefits to the wider population tip the balance in favor of investing in accessibility.
But when it comes to things as common as mobility impairments and as simple as curb cuts, I think it would be a moral no-brainer even if the accommodation had no value to most people.
The Handicapped Parking effect
This edgier sibling of the curb cut effect comes up when there's a limited resource -- like handicapped parking. There are only X parking spaces within Y feet of the entrance to the Chipotle, and if we allocate them to people who have trouble getting around, then everyone else has a longer average walk to their car.
That doesn't mean it's zero-sum: The existence of a handicapped parking spot that I can't use might cost me an extra 20 seconds of walking, but save an extra five minutes of painful limping for the person who uses it.[2] This arrangement probably increases overall utility both in the short term (reduced total pain experienced by people walking from their cars) and in the long term (signaling the importance of helping everyone participate in society).
But this is manifestly not a curb cut effect where everyone benefits: You have to decide who's going to win and who's going to lose, relative to an unregulated state where all parking is first-come-first-served.
Allocation can be made well or p...

Sep 15, 2024 • 11min
LW - Did Christopher Hitchens change his mind about waterboarding? by Isaac King
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Did Christopher Hitchens change his mind about waterboarding?, published by Isaac King on September 15, 2024 on LessWrong.
There's a popular story that goes like this: Christopher Hitchens used to be in favor of the US waterboarding terrorists because he thought it wasn't bad enough to be torture. Then he had it tried on himself, and changed his mind, coming to believe it is torture.
(Context for those unfamiliar: in the decade following 9/11, the US engaged in a lot of... questionable behavior to prosecute the war on terror, and there was a big debate on whether waterboarding should be permitted. Many other public figures also volunteered to undergo the procedure as a part of this public debate; most notably Sean Hannity, who was an outspoken proponent of waterboarding, yet welched on his offer and never tried it himself.)
This story intrigued me because it's popular among both Hitchens' fans and his detractors. His fans use it as an example of his intellectual honesty and willingness to undergo significant personal costs in order to have accurate beliefs and improve the world. His detractors use it to argue that he's self-centered and unempathetic, only coming to care about a bad thing that's happening to others after it happens to him.
But is the story actually true? Usually when there are two sides to an issue, one side will have an incentive to fact-check any false claims that the other side makes. An impartial observer can then look at the messaging from both sides to discover any flaws in the other. But if a particular story is convenient for both groups, then neither has any incentive to debunk it.
I became suspicious when I tried going to the source of this story to see what Hitchens had written about waterboarding prior to his 2008 experiment, and consistently found these leads to evaporate.
The part about him having it tried on himself and finding it torturous is certainly true. He reported this himself in his Vanity Fair article Believe Me, It's Torture.
But what about before that? Did he ever think it wasn't torture?
His article on the subject doesn't make any mention of changing his mind, and it perhaps lightly implies that he always had these beliefs. He says, for example:
In these harsh [waterboarding] exercises, brave men and women were introduced to the sorts of barbarism that they might expect to meet at the hands of a lawless foe who disregarded the Geneva Conventions. But it was something that Americans were being trained to resist, not to inflict. [Link to an article explaining that torture doesn't work.]
[And later:]
You may have read by now the official lie about this treatment, which is that it "simulates" the feeling of drowning. This is not the case. You feel that you are drowning because you are drowning[.]
In a video interview he gave about a year later, he said:
There was only one way I felt I could advance the argument, which was to see roughly what it was like.
The loudest people on the internet about this were... not promising. Shortly after the Vanity Fair article, the ACLU released an article titled "Christopher Hitchens Admits Waterboarding is Torture", saying:
You have to hand it to him: journalist Christopher Hitchens, who previously discounted that waterboarding was indeed torture, admits in the August issue of Vanity Fair that it is, indeed, torture.
But they provide no source for this claim.
As I write this, Wikipedia says:
Hitchens, who had previously expressed skepticism over waterboarding being considered a form of torture, changed his mind.
No source is provided for this either.
Yet it's repeated everywhere. The top comments on the YouTube video. Highly upvoted Reddit posts. Etc.
Sources for any of these claims were quite scant. Many people cited "sources" that, upon me actually reading them, had nothing to do with t...

Sep 15, 2024 • 7min
LW - Pay-on-results personal growth: first success by Chipmonk
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Pay-on-results personal growth: first success, published by Chipmonk on September 15, 2024 on LessWrong.
Thanks to Kaj Sotala, Stag Lynn, and Ulisse Mini for reviewing. Thanks to Kaj Sotala, Brian Toomey, Alex Zhu, Damon Sasi, Anna Salamon, and CFAR for mentorship and financial support
A few months ago I made the claim "Radically effective and rapid growth [motivationally / emotionally / socially] is possible with the right combination of facilitator and method". Eg: for anxiety, agency, insecurity, need for validation.
To test my hypothesis, I began a pay-on-results coaching experiment: When clients achieve their goal and are confident it will last (at least one month), they pay a bounty.
My first client
Bob (pseudonymous) and I met at Manifest 2024, where I had set up a table at the night market for hunting "emotional security" bounties.
Bob had lifelong anxiety, and it was crushing his agency and relationships. He offered a $3,000 bounty for resolving it, and I decided to pursue it.
We spoke and tried my method.
It was only necessary for us to talk once, apparently, because a month later he said our one conversation helped him achieve what 8 years of talk therapy could not:
I'm choosing to work on problems beyond my capabilities, and get excited about situations where my weaknesses are repeatedly on display.
I actually feel excited about entering social situations where chances of things going worse than I would want them to were high.
So he paid his bounty when he was ready (in this case, 35 days after the session).
I've been checking in with him since (latest: last week, two months after the session) and he tells me all is well.
Bob also shared some additional benefits beyond his original bounty:
Planning to make dancing a weekly part of my life now.
(All shared with permission.)
I'm also hunting many other bounties
A woman working in SF after 3 sessions, text support, and three weeks:
I went to Chris with a torrent of responsibilities and a key decision looming ahead of me this month. I felt overwhelmed, upset, and I didn't want just talk
Having engaged in 9+ years of coaching and therapy with varying levels of success, I'm probably one of the toughest clients - equal parts hopeful and skeptical. Chris created an incredibly open space where I could easily tell him if I didn't know something, or couldn't feel something, or if I'm overthinking. He also has an uncanny sense of intuition on these things and a strong attunement to being actually effective
The results are already telling: a disappointment that might've made me emotionally bleed and mope for a month was something I addressed in the matter of a couple of days with only a scoop of emotional self-doubt instead of *swimming* in self-torture. The lag time of actually doing things to be there for myself was significantly quicker, warmer, and more effective
To-dos that felt very heavy lightened up considerably and began to feel fun again and as ways of connecting!
I've now started to predict happier things ahead with more vivid, emotionally engaged, and realistic detail. I'll continue being intensely focused this year for the outcomes I want, but I'm actually looking forward to it! Will reflect back on Month 2!
An SF founder in his 30s after 1 session and two weeks:
After working with Chris, I learned One Weird Trick to go after what I really want and feel okay no matter what happens.
This is a new skill I didn't learn in 3 years of IFS therapy.
I already feel more confident being myself and expressing romantic interest (and I already have twice, that's new).
(More…)
What the fuck?
"Why does your thing work so unusually well?" asks my mentor
Kaj.
For one, it doesn't work for everyone with every issue, as you can see in the screenshot above. (That said, I suspect a lot of this is my fault for pursuing bounti...

Sep 14, 2024 • 3min
LW - OpenAI o1, Llama 4, and AlphaZero of LLMs by Vladimir Nesov
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI o1, Llama 4, and AlphaZero of LLMs, published by Vladimir Nesov on September 14, 2024 on LessWrong.
GPT-4 level open weights models like Llama-3-405B don't seem capable of dangerous cognition. OpenAI o1 demonstrates that a GPT-4 level model can be post-trained into producing useful long horizon reasoning traces. AlphaZero shows how capabilities can be obtained from compute alone, with no additional data. If there is a way of bringing these together, the apparent helplessness of the current generation of open weights models might prove misleading.
Post-training is currently a combination of techniques that use synthetic data and human labeled data. Human labeled data significantly improves quality, but its collection is slow and scales poorly. Synthetic data is an increasingly useful aspect of post-training, and automated aspects of its generation scale easily. Unlike weaker models, GPT-4 level LLMs clearly pass reading comprehension on most occasions, and OpenAI o1 improves on this further.
This suggests that at some point human data might become mostly unnecessary in post-training, even if it still slightly helps. Without it, post-training becomes automated and gets to use more compute, while avoiding the need for costly and complicated human labeling.
A pretrained model at the next level of scale, such as Llama 4, if made available in open weights, might initially look approximately as tame as current models. OpenAI o1 demonstrates that useful post-training for long sequences of System 2 reasoning is possible.
In the case of o1 in particular, this might involve a lot of human labeling, making its reproduction a very complicated process (at least if the relevant datasets are not released, and the reasoning traces themselves are not leaked in large quantities). But if some generally available chatbots at the next level of scale are good enough at automating labeling, this complication could be sidestepped, with o1 style post-training cheaply reproduced on top of a previously released open weights model.
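For concreteness, here is a highly simplified sketch of the kind of automated, self-labeled post-training loop being gestured at. Every interface in it (generate_reasoning, score, finetune) is a hypothetical placeholder; nothing here reflects o1's actual, non-public recipe.

```python
# Sketch of automated post-training: a strong "labeler" model grades
# reasoning traces sampled from the model being post-trained, and only
# high-scoring traces are kept for the next round of fine-tuning.
# With no humans in the loop, the process is limited mainly by compute.

def automated_post_training(model, labeler, prompts, rounds=3, threshold=0.8):
    for _ in range(rounds):
        traces = [model.generate_reasoning(p) for p in prompts]          # sample long reasoning traces
        scores = [labeler.score(p, t) for p, t in zip(prompts, traces)]  # automated grading, no human labels
        keep = [(p, t) for p, t, s in zip(prompts, traces, scores) if s >= threshold]
        model = model.finetune(keep)                                     # fine-tune on the filtered traces
    return model
```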
So there is an overhang in an open weights model that's distributed without long horizon reasoning post-training, since applying such post-training significantly improves its capabilities, making perceptions of its capabilities that were formed before such post-training inadequate.
The problem right now is that a new level of pretraining scale is approaching in the coming months, while ability to cheaply apply long horizon reasoning post-training might follow shortly thereafter, possibly unlocked by these very same models at the new level of pretraining scale (since it might currently be too expensive for most actors to implement, or to do enough experiments to figure out how).
The resulting level of capabilities is currently unknown, and could well remain unknown outside the leading labs until after the enabling artifacts of the open weights pretrained models at the next level of scale have already been published.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Sep 14, 2024 • 1h 6min
LW - If-Then Commitments for AI Risk Reduction [by Holden Karnofsky] by habryka
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: If-Then Commitments for AI Risk Reduction [by Holden Karnofsky], published by habryka on September 14, 2024 on LessWrong.
Holden just published this paper on the Carnegie Endowment website. I thought it was a decent reference, so I figured I would crosspost it (included in full for convenience, but if either Carnegie Endowment or Holden has a preference for just having an excerpt or a pure link post, happy to change that)
If-then commitments are an emerging framework for preparing for risks from AI without unnecessarily slowing the development of new technology. The more attention and interest there is in these commitments, the faster a mature framework can progress.
Introduction
Artificial intelligence (AI) could pose a variety of catastrophic risks to international security in several domains, including the proliferation and acceleration of cyberoffense capabilities, and of the ability to develop chemical or biological weapons of mass destruction. Even the most powerful AI models today are not yet capable enough to pose such risks,[1] but the coming years could see fast and hard-to-predict changes in AI capabilities.
Both companies and governments have shown significant interest in finding ways to prepare for such risks without unnecessarily slowing the development of new technology.
This piece is a primer on an emerging framework for handling this challenge: if-then commitments. These are commitments of the form: If an AI model has capability X, risk mitigations Y must be in place. And, if needed, we will delay AI deployment and/or development to ensure the mitigations can be present in time.
A specific example: If an AI model has the ability to walk a novice through constructing a weapon of mass destruction, we must ensure that there are no easy ways for consumers to elicit behavior in this category from the AI model.
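One way to picture the structure of such a commitment (purely my own illustration, not anything defined in the paper) is as a capability trigger paired with required mitigations, checked before deployment:

```python
# Illustrative encoding of an if-then commitment: "if" a capability
# evaluation trips, "then" the listed mitigations must be in place
# before deployment proceeds. Names and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class IfThenCommitment:
    capability: str                   # the "if": a measurable dangerous capability
    required_mitigations: list[str]   # the "then": mitigations that must be in place

    def deployment_allowed(self, eval_results: dict, mitigations_in_place: set) -> bool:
        if not eval_results.get(self.capability, False):
            return True  # capability threshold not reached; no extra requirement triggered
        return all(m in mitigations_in_place for m in self.required_mitigations)

commitment = IfThenCommitment(
    capability="walks_novice_through_wmd_construction",
    required_mitigations=["robust_refusal_training", "jailbreak_red_teaming"],
)
print(commitment.deployment_allowed(
    eval_results={"walks_novice_through_wmd_construction": True},
    mitigations_in_place={"robust_refusal_training"},
))  # False: a required mitigation is missing, so deployment should be delayed
```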
If-then commitments can be voluntarily adopted by AI developers; they also, potentially, can be enforced by regulators. Adoption of if-then commitments could help reduce risks from AI in two key ways: (a) prototyping, battle-testing, and building consensus around a potential framework for regulation; and (b) helping AI developers and others build roadmaps of what risk mitigations need to be in place by when.
Such adoption does not require agreement on whether major AI risks are imminent - a polarized topic - only that certain situations would require certain risk mitigations if they came to pass.
Three industry leaders - Google DeepMind, OpenAI, and Anthropic - have published relatively detailed frameworks along these lines.
Sixteen companies have announced their intention to establish frameworks in a similar spirit by the time of the upcoming 2025 AI Action Summit in France.[2] Similar ideas have been explored at the International Dialogues on AI Safety in March 2024[3] and the UK AI Safety Summit in November 2023.[4] As of mid-2024, most discussions of if-then commitments have been in the context of voluntary commitments by companies, but this piece focuses on the general framework as something that could be useful to a variety of actors with different enforcement mechanisms.
This piece explains the key ideas behind if-then commitments via a detailed walkthrough of a particular if-then commitment, pertaining to the potential ability of an AI model to walk a novice through constructing a chemical or biological weapon of mass destruction.
It then discusses some limitations of if-then commitments and closes with an outline of how different actors - including governments and companies - can contribute to the path toward a robust, enforceable system of if-then commitments.
Context and aims of this piece. In 2023, I helped with the initial development of ideas related to if-then commitments.[5] To date, I have focused on private discussion of this new fram...

Sep 14, 2024 • 10min
LW - Evidence against Learned Search in a Chess-Playing Neural Network by p.b.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evidence against Learned Search in a Chess-Playing Neural Network, published by p.b. on September 14, 2024 on LessWrong.
Introduction
There is a new paper and lesswrong post about "learned look-ahead in a chess-playing neural network". This has long been a research interest of mine for reasons that are well-stated in the paper:
Can neural networks learn to use algorithms such as look-ahead or search internally? Or are they better thought of as vast collections of simple heuristics or memorized data? Answering this question might help us anticipate neural networks' future capabilities and give us a better understanding of how they work internally.
and further:
Since we know how to hand-design chess engines, we know what reasoning to look for in chess-playing networks. Compared to frontier language models, this makes chess a good compromise between realism and practicality for investigating whether networks learn reasoning algorithms or rely purely on heuristics.
So the question is whether Francois Chollet is correct that transformers are doing "curve fitting", i.e. memorisation with little generalisation, or whether they learn to "reason". "Reasoning" is a fuzzy word, but in chess you can at least look for what human players call "calculation", that is the ability to execute moves solely in your mind to observe and evaluate the resulting position.
To me this is a crux as to whether large language models will scale to human capabilities without further algorithmic breakthroughs.
The paper's authors, which include Erik Jenner and Stuart Russell, conclude that the policy network of Leela Chess Zero (a top engine and open source replication of AlphaZero) does learn look-ahead.
Using interpretability techniques they "find that Leela internally represents future optimal moves and that these representations are crucial for its final output in certain board states."
While the term "look-ahead" is fuzzy, the paper clearly intends to show that the Leela network implements an "algorithm" and a form of "reasoning".
My interpretation of the presented evidence is different, as discussed in the comments of the original lesswrong post. I argue that all the evidence is completely consistent with Leela having learned to recognise multi-move patterns. Multi-move patterns are just complicated patterns that take into account that certain pieces will have to be able to move to certain squares in future moves for the pattern to hold.
The crucial difference from having learned an algorithm:
An algorithm can take different inputs and do its thing. That allows generalisation to unseen or at least unusual inputs. This means that less data is necessary for learning because the generalisation power is much higher.
Learning multi-move patterns on the other hand requires much more data because the network needs to see many versions of the pattern until it knows all specific details that have to hold.
Analysis setup
Unfortunately it is quite difficult to distinguish between these two cases. As I argued:
Certain information is necessary to make the correct prediction in certain kinds of positions. The fact that the network generally makes the correct prediction in these types of positions already tells you that this information must be processed and made available by the network. The difference between lookahead and multi-move pattern recognition is not whether this information is there but how it got there.
However, I propose an experiment that makes it clear that there is a difference.
Imagine you train the model to predict whether a position leads to a forced checkmate and also the best move to make. You pick one tactical motif and erase it from the checkmate prediction part of the training set, but not the move prediction part.
Now the model still knows which moves are the right ones to make, i.e. it would pl...
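A minimal sketch of how the proposed dataset ablation could be wired up (the data format and motif tag are hypothetical; the actual Leela training pipeline is of course far more involved):

```python
# Hypothetical setup for the proposed experiment: remove one tactical motif
# from the checkmate-prediction targets while leaving the move-prediction
# targets untouched. If the network has learned a look-ahead algorithm,
# checkmate prediction should transfer to the held-out motif; if it has only
# memorised multi-move patterns, it should not.

HELD_OUT_MOTIF = "smothered_mate"  # hypothetical motif tag

def build_training_targets(positions):
    """positions: iterable of dicts with 'fen', 'motif', 'best_move', 'forced_mate' keys."""
    move_targets, mate_targets = [], []
    for pos in positions:
        move_targets.append((pos["fen"], pos["best_move"]))        # kept for every position
        if pos["motif"] != HELD_OUT_MOTIF:
            mate_targets.append((pos["fen"], pos["forced_mate"]))  # ablated for the held-out motif
    return move_targets, mate_targets

example = [
    {"fen": "...", "motif": "smothered_mate", "best_move": "Nf7#", "forced_mate": True},
    {"fen": "...", "motif": "back_rank_mate", "best_move": "Re8#", "forced_mate": True},
]
moves, mates = build_training_targets(example)
print(len(moves), len(mates))  # 2 move targets, 1 mate target (held-out motif removed)
```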