
The Nonlinear Library

Latest episodes

Sep 5, 2024 • 8min

LW - Executable philosophy as a failed totalizing meta-worldview by jessicata

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Executable philosophy as a failed totalizing meta-worldview, published by jessicata on September 5, 2024 on LessWrong. (this is an expanded, edited version of an x.com post) It is easy to interpret Eliezer Yudkowsky's main goal as creating a friendly AGI. Clearly, he has failed at this goal and has little hope of achieving it. That's not a particularly interesting analysis, however. A priori, creating a machine that makes things ok forever is not a particularly plausible objective. Failure to do so is not particularly informative. So I'll focus on a different but related project of his: executable philosophy. Quoting Arbital: Two motivations of "executable philosophy" are as follows: 1. We need a philosophical analysis to be "effective" in Turing's sense: that is, the terms of the analysis must be useful in writing programs. We need ideas that we can compile and run; they must be "executable" like code is executable. 2. We need to produce adequate answers on a time scale of years or decades, not centuries. In the entrepreneurial sense of "good execution", we need a methodology we can execute on in a reasonable timeframe. There is such a thing as common sense rationality, which says the world is round, you shouldn't play the lottery, etc. Formal notions like Bayesianism, VNM utility theory, and Solomonoff induction formalize something strongly related to this common sense rationality. Yudkowsky believes further study in this tradition can supersede ordinary academic philosophy, which he believes to be conceptually weak and motivated to continue ongoing disputes for more publications. In the Sequences, Yudkowsky presents these formal ideas as the basis for a totalizing meta-worldview, of epistemic and instrumental rationality, and uses the meta-worldview to argue for his object-level worldview (which includes many-worlds, AGI foom, importance of AI alignment, etc.). While one can get totalizing (meta-)worldviews from elsewhere (such as interdisciplinary academic studies), Yudkowsky's (meta-)worldview is relatively easy to pick up for analytically strong people (who tend towards STEM), and is effective ("correct" and "winning") relative to its simplicity. Yudkowsky's source material and his own writing do not form a closed meta-worldview, however. There are open problems as to how to formalize and solve real problems. Many of the more technical sort are described in MIRI's technical agent foundations agenda. These include questions about how to parse a physically realistic problem as a set of VNM lotteries ("decision theory"), how to use something like Bayesianism to handle uncertainty about mathematics ("logical uncertainty"), how to formalize realistic human values ("value loading"), and so on. Whether or not the closure of this meta-worldview leads to creation of friendly AGI, it would certainly have practical value. It would allow real world decisions to be made by first formalizing them within a computational framework (related to Yudkowsky's notion of "executable philosophy"), whether or not the computation itself is tractable (with its tractable version being friendly AGI). The practical strategy of MIRI as a technical research institute is to go meta on these open problems by recruiting analytically strong STEM people (especially mathematicians and computer scientists) to work on them, as part of the agent foundations agenda. 
I was one of these people. While we made some progress on these problems (such as with the Logical Induction paper), we didn't come close to completing the meta-worldview, let alone building friendly AGI. With the Agent Foundations team at MIRI eliminated, MIRI's agent foundations agenda is now unambiguously a failed project. I had called MIRI's technical research likely to fail around 2017, with the increase in internal secrecy, but at thi...
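As a side note on the "executable" framing above, here is a minimal, hypothetical sketch of what it means for a philosophical idea to be "executable like code": a Bayesian update written as a program you can run. The function name and numbers are illustrative assumptions of mine, not anything from the post, Arbital, or MIRI's agenda.

```python
# Minimal sketch (illustrative only): a piece of "common sense rationality",
# Bayesian updating, expressed as runnable code rather than informal prose.

def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H | E) given P(H), P(E | H), and P(E | ~H)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# Example: a 1% prior hypothesis, with evidence 10x likelier if it is true.
print(round(bayes_update(prior=0.01, p_e_given_h=0.5, p_e_given_not_h=0.05), 3))  # 0.092
```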
Sep 5, 2024 • 3min

LW - Michael Dickens' Caffeine Tolerance Research by niplav

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Michael Dickens' Caffeine Tolerance Research, published by niplav on September 5, 2024 on LessWrong.

Michael Dickens has read the research and performed two self-experiments on whether consuming caffeine builds up tolerance, and if yes, how quickly.

First literature review: What if instead of taking caffeine every day, you only take it intermittently - say, once every 3 days? How often can most people take caffeine without developing a tolerance? The scientific literature on this question is sparse. Here's what I found:
1. Experiments on rats found that rats who took caffeine every other day did not develop a tolerance. There are no experiments on humans. There are no experiments that use other intermittent dosing frequencies (such as once every 3 days).
2. Internet forum users report that they can take caffeine on average once every 3 days without developing a tolerance. But there's a lot of variation between individuals.

Second literature review: If you take caffeine every day, does it stop working? If it keeps working, how much of its effect does it retain? There are many studies on this question, but most of them have severe methodological limitations. I read all the good studies (on humans) I could find. Here's my interpretation of the literature: Caffeine almost certainly loses some but not all of its effect when you take it every day. In expectation, caffeine retains 1/2 of its benefit, but this figure has a wide credence interval. The studies on cognitive benefits all have some methodological issues so they might not generalize. There are two studies on exercise benefits with strong methodology, but they have small sample sizes.

First experiment: I conducted an experiment on myself to see if I would develop a tolerance to caffeine from taking it three days a week. The results suggest that I didn't. Caffeine had just as big an effect at the end of my four-week trial as it did at the beginning. This outcome is statistically significant (p = 0.016), but the data show a weird pattern: caffeine's effectiveness went up over time instead of staying flat. I don't know how to explain that, which makes me suspicious of the experiment's findings.

Second experiment: This time I tested if I could have caffeine 4 days a week without getting habituated. Last time, when I took caffeine 3 days a week, I didn't get habituated but the results were weird. This time, with the more frequent dose, I still didn't get habituated, and the results were weird again! […] But it looks like I didn't get habituated when taking caffeine 4 days a week - or, at least, not to a detectable degree. So I'm going to keep taking caffeine 4 days a week. When I take caffeine 3 days in a row, do I habituate by the 3rd day? The evidence suggests that I don't, but the evidence is weak.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
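As a rough illustration of how a self-experiment like this might be analyzed (this is not Dickens' actual analysis, and the numbers below are made up), one could regress the measured caffeine effect on trial week and check the sign of the slope: a shrinking effect over time would suggest habituation, while a flat or rising one would not.

```python
# Hypothetical habituation check with made-up data (not Dickens' data or method):
# if caffeine's measured benefit declines across the trial, the slope is negative.
import numpy as np
from scipy import stats

week = np.array([1, 1, 2, 2, 3, 3, 4, 4])                      # week of the trial
effect = np.array([3.1, 2.8, 3.0, 3.4, 3.6, 3.2, 3.9, 3.7])    # caffeine-day minus placebo-day score

slope, intercept, r, p_value, stderr = stats.linregress(week, effect)
print(f"slope per week = {slope:+.2f}, p = {p_value:.3f}")
# A non-negative slope (as in the post's odd "effectiveness went up" pattern)
# is evidence against habituation at this dosing frequency.
```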
Sep 4, 2024 • 6min

LW - What happens if you present 500 people with an argument that AI is risky? by KatjaGrace

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What happens if you present 500 people with an argument that AI is risky?, published by KatjaGrace on September 4, 2024 on LessWrong.

Recently, Nathan Young and I wrote about arguments for AI risk and put them on the AI Impacts wiki. In the process, we ran a casual little survey of the American public regarding how they feel about the arguments, initially (if I recall) just because we were curious whether the arguments we found least compelling would also fail to compel a wide variety of people. The results were very confusing, so we ended up thinking more about this than initially intended and running four iterations total. This is still a small and scrappy poll to satisfy our own understanding, and doesn't involve careful analysis or error checking. But I'd like to share a few interesting things we found. Perhaps someone else wants to look at our data more carefully, or run more careful surveys about parts of it.

In total we surveyed around 570 people across 4 different polls, with 500 in the main one. The basic structure was:
1. p(doom): "If humanity develops very advanced AI technology, how likely do you think it is that this causes humanity to go extinct or be substantially disempowered?" Responses had to be given in a text box, a slider, or with buttons showing ranges.
2. (Present them with one of eleven arguments, one a 'control')
3. "Do you understand this argument?"
4. "What did you think of this argument?"
5. "How compelling did you find this argument, on a scale of 1-5?"
6. p(doom) again
7. Do you have any further thoughts about this that you'd like to share?

Interesting things: In the first survey, participants were much more likely to move their probabilities downward than upward, often while saying they found the argument fairly compelling. This is a big part of what initially confused us. We now think this is because each argument had counterarguments listed under it. Evidence in support of this: in the second and fourth rounds we cut the counterarguments and probabilities went overall upward. When included, three times as many participants moved their probabilities downward as upward (21 vs 7, with 12 unmoved).

In the big round (without counterarguments), arguments pushed people upward slightly more: 20% move upward and 15% move downward overall (and 65% say the same). On average, p(doom) increased by about 1.3% (for non-control arguments, treating button inputs as something like the geometric mean of their ranges).

But the input type seemed to make a big difference to how people moved! It makes sense to me that people move a lot more in both directions with a slider, because it's hard to hit the same number again if you don't remember it. It's surprising to me that they moved with similar frequency with buttons and open response, because the buttons covered relatively chunky ranges (e.g. 5-25%), so they need larger shifts to be caught. Input type also made a big difference to the probabilities people gave to doom before seeing any arguments. People seem to give substantially lower answers when presented with buttons (Nathan proposes this is because there was a <1% and a 1-5% button, so it made lower probabilities more salient/"socially acceptable", and I agree).

Overall, P(doom) numbers were fairly high: 24% average, 11% median. We added a 'control argument'.
We presented this as "Here is an argument that advanced AI technology might threaten humanity:" like the others, but it just argued that AI might substantially contribute to music production: This was the third worst argument in terms of prompting upward probability motion, but the third best in terms of being "compelling". Overall it looked a lot like other arguments, so that's a bit of a blow to the model where e.g. we can communicate somewhat adequately, 'arguments' are more compelling than rando...
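One detail worth making concrete is the parenthetical above about treating button inputs "as something like the geometric mean of their ranges" when computing the average p(doom) shift. Here is a hedged sketch of that conversion, using made-up responses rather than the survey's data.

```python
# Sketch of mapping a button range (e.g. "5-25%") to a single probability via
# the geometric mean of its endpoints. Example values are made up.
import math

def button_to_prob(low: float, high: float) -> float:
    """Collapse a probability range to one number using the geometric mean."""
    return math.sqrt(low * high)

before = button_to_prob(0.05, 0.25)  # ~0.112
after = button_to_prob(0.25, 0.50)   # ~0.354
print(f"before={before:.3f} after={after:.3f} shift={after - before:+.3f}")
# Note: an open-ended bucket like "<1%" needs an assumed lower bound before
# this trick applies; that choice is part of why such estimates are rough.
```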
Sep 4, 2024 • 20min

LW - AI and the Technological Richter Scale by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI and the Technological Richter Scale, published by Zvi on September 4, 2024 on LessWrong.

The Technological Richter scale is introduced about 80% of the way through Nate Silver's new book On the Edge. A full review is in the works (note to prediction markets: this post alone does NOT on its own count as a review, but this counts as part of a future review), but this concept seems highly useful, stands on its own and I want a reference post for it. Nate skips around his chapter titles and timelines, so why not do the same here?

Defining the Scale

Nate Silver, On the Edge (location 8,088 on Kindle): The Richter scale was created by the physicist Charles Richter in 1935 to quantify the amount of energy released by earthquakes. It has two key features that I'll borrow for my Technological Richter Scale (TRS). First, it is logarithmic. A magnitude 7 earthquake is actually ten times more powerful than a mag 6. Second, the frequency of earthquakes is inversely related to their Richter magnitude - so 6s occur about ten times more often than 7s. Technological innovations can also produce seismic disruptions. Let's proceed quickly through the lower readings of the Technological Richter Scale.

1. Like a half-formulated thought in the shower.
2. Is an idea you actuate, but never disseminate: a slightly better method to brine a chicken that only you and your family know about.
3. Begins to show up in the official record somewhere, an idea you patent or make a prototype of.
4. An invention successful enough that somebody pays for it; you sell it commercially or someone buys the IP.
5. A commercially successful invention that is important in its category, say, Cool Ranch Doritos, or the leading brand of windshield wipers.
6. An invention can have a broader societal impact, causing a disruption within its field and some ripple effects beyond it. A TRS 6 will be on the short list for technology of the year. At the low end of the 6s (a TRS 6.0) are clever and cute inventions like Post-it notes that provide some mundane utility. Toward the high end (a 6.8 or 6.9) might be something like the VCR, which disrupted home entertainment and had knock-on effects on the movie industry. The impact escalates quickly from there.
7. One of the leading inventions of the decade and has a measurable impact on people's everyday lives. Something like credit cards would be toward the lower end of the 7s, and social media a high 7.
8. A truly seismic invention, a candidate for technology of the century, triggering broadly disruptive effects throughout society. Canonical examples include automobiles, electricity, and the internet.
9. By the time we get to TRS 9, we're talking about the most important inventions of all time, things that inarguably and unalterably changed the course of human history. You can count these on one or two hands. There's fire, the wheel, agriculture, the printing press. Although they're something of an odd case, I'd argue that nuclear weapons belong here also. True, their impact on daily life isn't necessarily obvious if you're living in a superpower protected by its nuclear umbrella (someone in Ukraine might feel differently). But if we're thinking in expected-value terms, they're the first invention that had the potential to destroy humanity.
10. Finally, a 10 is a technology that defines a new epoch, one that alters not only the fate of humanity but that of the planet. For roughly the past twelve thousand years, we have been in the Holocene, the geological epoch defined not by the origin of Homo sapiens per se but by humans becoming the dominant species and beginning to alter the shape of the Earth with our technologies. AI wresting control of this dominant position from humans would qualify as a 10, as would other forms of a "technological singularity," a term popularized by...
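Since the scale is defined by its logarithmic step, here is a tiny sketch (my own illustration, not Silver's or Zvi's) of the two borrowed Richter properties: each whole step is roughly ten times the impact, and roughly one tenth the frequency.

```python
# Illustrative only: the log-scale arithmetic behind the Technological Richter Scale.

def relative_impact(trs_a: float, trs_b: float) -> float:
    """How many times more impactful a TRS-a technology is than a TRS-b one."""
    return 10 ** (trs_a - trs_b)

def relative_frequency(trs_a: float, trs_b: float) -> float:
    """Roughly how common TRS-a technologies are relative to TRS-b ones."""
    return 10 ** (trs_b - trs_a)

print(relative_impact(7, 6))     # 10.0 -- a 7 is about ten times a 6
print(relative_frequency(7, 6))  # 0.1  -- and 7s occur about ten times less often
```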
Sep 4, 2024 • 7min

EA - Fungal diseases: Health burden, neglectedness, and potential interventions by Rethink Priorities

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fungal diseases: Health burden, neglectedness, and potential interventions, published by Rethink Priorities on September 4, 2024 on The Effective Altruism Forum.

Editorial note

This report is a "shallow" investigation, as described here, and was commissioned by Open Philanthropy and produced by Rethink Priorities from January to February 2023. We revised the report for publication. Open Philanthropy does not necessarily endorse our conclusions, nor do the organizations represented by those who were interviewed. Our report focuses on exploring fungal diseases as a potential new cause area for Open Philanthropy. We assessed the current and future health burden of fungal diseases, provided an overview of current interventions and the main gaps and barriers to address the burden, and discussed some plausible options for philanthropic spending. We reviewed the scientific and gray literature and spoke with five experts.

While revising the report for publication, we learned of a new global burden study (Denning et al., 2024) whose results show an annual incidence of 6.5 million invasive fungal infections and 3.8 million total deaths from fungal diseases (2.5 million of which are "directly attributable" to fungal diseases). The study's results align with this report's estimate of 1.5 million to 4.6 million annual deaths (80% confidence) but were not considered in this report. We don't intend this report to be Rethink Priorities' final word on fungal diseases. We have tried to flag major sources of uncertainty in the report and are open to revising our views based on new information or further research.

Executive summary

While fungal diseases are very common and mostly mild, some forms are life-threatening and predominantly affect low- and middle-income countries (LMICs). The evidence base on the global fungal disease burden is poor, and estimates are mostly based on extrapolations from the few available studies. Yet all experts we talked to agree that current burden estimates (usually stated as >1.7M deaths/year) likely underestimate the true burden. Overall, we think the annual death burden could be 1.5M - 4.6M (80% CI), which would exceed malaria and HIV/AIDS deaths combined.[1] Moreover, our best guess is that fungal diseases cause 8M - 49M DALYs (80% CI) per year, but this is based on our own back-of-the-envelope calculation of high-uncertainty inputs.

Every expert we spoke with expects the burden to increase substantially in the future, though no formal estimates exist. We project that deaths and DALYs could grow to approximately 2-3 times the current burden by 2040, though this is highly uncertain. This will likely be partly due to a rise in antifungal resistance, which is especially problematic as few treatment classes exist and many fungal diseases are highly lethal without treatment.

We estimate that only two diseases (chronic pulmonary aspergillosis [CPA] and candidemia/invasive candidiasis [IC/C]) account for ~39%-45% of the total death and DALY burden. Moreover, a single fungal pathogen (Aspergillus fumigatus) accounts for ~50% of the burden. Thus, much of the burden can be reduced by focusing on only a few of the fungal diseases or on a few pathogens. Available estimates suggest the top fungal diseases have the highest burdens in Asia and LMICs, and that they most affect immunocompromised individuals.
Fungal diseases seem very neglected in all areas we considered (research/R&D, advocacy/lobbying, philanthropic spending, and policy interventions) and receive little attention even in comparison to other diseases which predominantly affect LMICs. For example, we estimate the research funding/death ratio for malaria to be roughly 20 times higher than for fungal diseases. Moreover, fewer than 10 countries have national surveillance systems for fungal infections, an...
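To show the shape of the neglectedness comparison mentioned above (research funding per death), here is a back-of-the-envelope sketch. The inputs are placeholder values chosen only for illustration; they are not the figures Rethink Priorities used.

```python
# Back-of-the-envelope sketch of a funding-per-death comparison.
# All inputs below are placeholders for illustration, not the report's figures.

def funding_per_death(annual_funding_usd: float, annual_deaths: float) -> float:
    return annual_funding_usd / annual_deaths

malaria_ratio = funding_per_death(annual_funding_usd=600e6, annual_deaths=600_000)    # placeholder
fungal_ratio = funding_per_death(annual_funding_usd=100e6, annual_deaths=2_000_000)   # placeholder

print(f"malaria: ${malaria_ratio:,.0f}/death, fungal: ${fungal_ratio:,.0f}/death, "
      f"ratio ~{malaria_ratio / fungal_ratio:.0f}x")
# With these placeholder inputs the ratio comes out ~20x, the same order of
# magnitude the report describes, but the report's estimate rests on its own data.
```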
Sep 4, 2024 • 1min

AF - Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception? by David Scott Krueger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?, published by David Scott Krueger on September 4, 2024 on The AI Alignment Forum. AI systems up to some high level of intelligence plausibly need to know exactly where they are in space-time in order for deception/"scheming" to make sense as a strategy. This is because they need to know: 1) what sort of oversight they are subject to and 2) what effects their actions will have on the real world (side note: Acausal trade might break this argument) There are a number of informal proposals to keep AI systems selectively ignorant of (1) and (2) in order to prevent deception. Those proposals seem very promising to flesh out; I'm not aware of any rigorous work doing so, however. Are you? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Sep 4, 2024 • 30min

LW - On the UBI Paper by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the UBI Paper, published by Zvi on September 4, 2024 on LessWrong.

Would a universal basic income (UBI) work? What would it do? Many people agree that July's RCT on giving people a guaranteed income, and its paper from Eva Vivalt, Elizabeth Rhodes, Alexander W. Bartik, David E. Broockman and Sarah Miller, was, despite whatever flaws it might have, the best data we have so far on the potential impact of UBI. There are many key differences from how UBI would look if applied for real, but this is the best data we have. This study was primarily funded by Sam Altman, so whatever else he may be up to, good job there. I do note that my model of 'Altman several years ago' is more positive than mine of Altman now, and past actions like this are a lot of the reason I give him so much benefit of the doubt.

They do not agree on what conclusions we should draw. This is not a simple 'UBI is great' or 'UBI does nothing.' I see essentially four responses.
1. The first group says this shows UBI doesn't work. That's going too far. I think the paper greatly reduces the plausibility of the best scenarios, but I don't think it rules UBI out as a strategy, especially if it is a substitute for other transfers.
2. The second group says this was a disappointing result for UBI. That UBI could still make sense as a form of progressive redistribution, but likely at a cost of less productivity so long as people impacted are still productive. I agree.
3. The third group did its best to spin this into a positive result. There was a lot of spin here, and use of anecdotes, and arguments as soldiers. Often these people were being very clear they were true believers and advocates who want UBI now, and were seeking the bright side. Respect? There were some bright spots that they pointed out, and no one study over three years should make you give up, but this was what it was and I wish people wouldn't spin like that.
4. The fourth group was some mix of 'if brute force (aka money) doesn't solve your problem you're not using enough' and also 'but work is bad, actually, and leisure is good.' That if we aren't getting people not to work then the system is not functioning, or that $1k/month wasn't enough to get the good effects, or both.

I am willing to take a bold 'people working more is mostly good' stance, for the moment, although AI could change that. And while I do think that a more permanent or larger support amount would do some interesting things, I wouldn't expect to suddenly see polarity reverse. I am so dedicated to actually reading this paper that it cost me $5. Free academia now.

RTFP (Read the Paper): Core Design

The core design was that there were 1,000 low-income individuals randomized into getting $1k/month for 3 years, or $36k total. A control group of 2,000 others got $50/month, or $1,800 total. Average household income in the study before transfers was $29,900. They then studied what happened.

Before looking at the results, what are the key differences between this and UBI? Like all studies of UBI, this can only be done for a limited population, and it only lasts a limited amount of time. If you tell me I am getting $1,000/month for life, then that makes me radically richer, and also radically safer. In extremis you can plan to live off that, or it can be a full fallback. Which is a large part of the point, and a lot of the danger as well.
If instead you give me that money for only three years, then I am slightly less than $36k richer. Which is nice, but impacts my long term prospects much less. It is still a good test of the 'give people money' hypothesis but less good at testing UBI. The temporary form, and also the limited scope, means that it won't cause a cultural shift and changing of norms. Those changes might be good or bad, and they could overshadow other impacts. Does this move tow...
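For readers skimming the design, a quick arithmetic sketch of the transfer sizes described above (the dollar figures are from the post; the comparison to baseline income is my own framing):

```python
# Arithmetic check of the study design described in the post.
months = 36                       # 3 years
treatment_total = 1_000 * months  # $36,000 total for the treatment group
control_total = 50 * months       # $1,800 total for the control group
baseline_income = 29_900          # average pre-transfer household income

print(treatment_total, control_total)  # 36000 1800
share = (1_000 * 12) / baseline_income
print(f"treatment transfer ~ {share:.0%} of baseline annual household income")  # ~40%
```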
Sep 3, 2024 • 2min

EA - Giving What We Can is now its own legal entity! by Alana HF

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Giving What We Can is now its own legal entity!, published by Alana HF on September 3, 2024 on The Effective Altruism Forum. On August 31st, Giving What We Can (GWWC) completed the "spin-out" we announced in December. As a result, we are no longer a legal project of the Effective Ventures Foundation UK and US (collectively referred to as "EV"), and have instead set up our own independent charitable entities in both the US and the UK, with Canada coming soon! We're super excited to take this important step as an organisation. While our core mission, commitments, and focus on effective giving remain unchanged, we've already begun to feel some of the benefits of being fully in charge of our own operations, including: Aligning our organisational structure and governance more closely with our mission Facilitating greater internal clarity and transparency around our core processes Having greater control over our operational costs Processing donations made via bank transfer, DAF, stock, or crypto more quickly than before Of course, we owe a very big thank you to the team at EV for its incredible support over the years, which has helped us grow into the organisation we are today, and has prepared us to embark on this new chapter. As we continue to move our ambitious plans forward, we're focused more than ever on our core mission: to make effective and significant giving a cultural norm. Check out the details of our new entities (UK) (US) (Canada - awaiting charitable status), along with our updated privacy policy! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Sep 3, 2024 • 21min

LW - Book Review: What Even Is Gender? by Joey Marcellino

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Book Review: What Even Is Gender?, published by Joey Marcellino on September 3, 2024 on LessWrong. I submitted this review to the 2024 ACX book review contest, but it didn't make the cut, so I'm putting it here instead for posterity. Conspiracy theories are fun because of how they make everything fit together, and scratch the unbearable itch some of us get when there are little details of a narrative that just don't make sense. The problem is they tend to have a few issues, like requiring one to posit expansive perfectly coordinated infosecurity, demanding inaccessible or running contrary to existing evidence, and generally making you look weird for believing them. We can get our connecting-the-dots high while avoiding social stigma and epistemic demerits by instead foraging in the verdant jungle of "new conceptual frameworks for intractable debates." Arguments about gender tend to devolve, not just for lack of a shared conceptual framework, but because the dominant frameworks used by both defenders and critics of gender ideology are various shades of incoherent. To the rescue are R. A. Briggs and B. R. George, two philosophers of gender promising a new approach to thinking about gender identity and categorization with their book What Even Is Gender? I appreciate that I'm probably atypical in that my first thought when confronting a difficult conceptual problem is "I wonder what mainstream analytic philosophy has to say about this?", but What Even Is Gender? is that rare thing: a philosophical work for a popular audience that is rigorous without sacrificing clarity (and that's clarity by normal-human-conversation standards, not analytic philosophy standards). Let's see what they have to say. Why I Picked This Book BG are writing for two primary audiences in What Even Is Gender? First are people trying to make sense of their own experience of gender, especially those who feel the existing conceptual toolbox is limited, or doesn't exactly match up with their circumstances. The second, in their words, are: "people who, while broadly sympathetic (or at least open) to the goals of trans inclusion and trans liberation, harbor some unease regarding the conceptual tensions, apparent contradictions, and metaphysical vagaries of the dominant rhetoric of trans politics. This sort of reader might feel the pull of some of the foundational concerns that they see raised in "gender critical" arguments, but is also trying to take their trans friends' anxious reactions seriously, and is loath to accept the political agenda that accompanies such arguments." People with a non-standard experience of gender are known to be overrepresented among readers of this blog, and I suspect people in BG's second kind of audience are as well, extrapolating from my sample size of one. This book thus seemed like a good fit. BG contrast their conception of gender with what they call the "received narrative": the standard set of ideas about gender and identity that one hears in progressive spaces e.g. college campuses. Reviewing WEIG on this blog provides another interesting point of contrast in The Categories Were Made for Man. BG make similar moves as Scott but extend the analysis further, and provide an alternative account of gender categories that avoids some of the weaknesses of Scott's. 
Where we're coming from

So what exactly is this received narrative, and what's wrong with it? BG give the following sketch:
"1. People have a more-or-less stable inner trait called "gender identity".
2. One's "gender identity" is what disposes one to think of oneself as a "woman" or as a "man" (or, perhaps, as both or as neither).
3. One's "gender identity" is what disposes one to favor or avoid stereotypically feminine or masculine behaviors (or otherwise gendered behaviors).
4. It is possible for there to be a mismatc...
Sep 3, 2024 • 35min

LW - The Checklist: What Succeeding at AI Safety Will Involve by Sam Bowman

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Checklist: What Succeeding at AI Safety Will Involve, published by Sam Bowman on September 3, 2024 on LessWrong. Crossposted by habryka with Sam's permission. Expect a lower probability that Sam responds to comments here than if he had posted it himself.

Preface

This piece reflects my current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI. Given my role and background, it's disproportionately focused on technical research and on averting emerging catastrophic risks. For context, I lead a technical AI safety research group at Anthropic, and that group has a pretty broad and long-term mandate, so I spend a lot of time thinking about what kind of safety work we'll need over the coming years. This piece is my own opinionated take on that question, though it draws very heavily on discussions with colleagues across the organization: Medium- and long-term AI safety strategy is the subject of countless leadership discussions and Google docs and lunch-table discussions within the organization, and this piece is a snapshot (shared with permission) of where those conversations sometimes go. To be abundantly clear: Nothing here is a firm commitment on behalf of Anthropic, and most people at Anthropic would disagree with at least a few major points here, but this can hopefully still shed some light on the kind of thinking that motivates our work.

Here are some of the assumptions that the piece relies on. I don't think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans:

Broadly human-level AI is possible. I'll often refer to this as transformative AI (or TAI), roughly defined as AI that could serve as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1]

Broadly human-level AI (or TAI) isn't an upper bound on most AI capabilities that matter, and substantially superhuman systems could have an even greater impact on the world along many dimensions.

If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that's not wildly different from today.

If TAI is possible, it could be used to dramatically accelerate AI R&D, potentially leading to the development of substantially superhuman systems within just a few months or years after TAI.

Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means.

Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges.

Alignment - in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy - requires some non-trivial effort to get right, and it gets harder as systems get more powerful.
Most of the ideas here ultimately come from outside Anthropic, and while I cite a few sources below, I've been influenced by far more writings and people than I can credit here or even keep track of. Introducing the Checklist This lays out what I think we need to do, divided into three chapters, based on the capabilities of our strongest models: Chapter 1: Preparation You are here. In this period, our best models aren't yet TAI. In the language of Anthropic's RSP, they're at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the work that we hav...
