The Nonlinear Library

The Nonlinear Fund
Jul 25, 2024 • 2min

AF - Does robustness improve with scale? by ChengCheng

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does robustness improve with scale?, published by ChengCheng on July 25, 2024 on The AI Alignment Forum. Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to what extent can scale help solve robustness? In this post, we explore this question in the classification setting: predicting the binary label of a text input. We find that scale alone does little to improve model robustness, but that larger models benefit more from defenses such as adversarial training than do smaller models. We study models in the classification setting as there is a clear notion of "correct behavior": does the model output the right label? We can then naturally define robustness as the proportion of the attacked dataset that the model correctly classifies. We evaluate models on tasks such as spam detection and movie sentiment classification. We adapt pretrained foundation models for classification by replacing the generative model's unembedding layer with a randomly initialized classification head, and then fine-tune the models on each task. We focus on adversarial-suffix style attacks: appending an adversarially chosen prompt to a benign prompt in an attempt to cause the model to misclassify the input, e.g., classify a spam email as not-spam. We consider two attacks: the state-of-the-art Greedy Coordinate Gradient method (Zou et al., 2023), and a baseline random token attack. This simple threat model has the advantage of being unlikely to change the semantics of the input. For example, a spam email is still spam even if a handful of tokens are appended to it. Of course, attackers are not limited to such a simple threat model: studying more open-ended threat models (such as rephrasing the prompt, or replacing words with synonyms) and corresponding attack methods (such as LLM generated adversarial prompts) is an important direction that we hope to pursue soon in future work. For more information, see our blog post or paper. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
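As a concrete illustration of the setup described above (a fine-tuned classifier, an appended adversarial suffix, and robustness measured as accuracy on the attacked dataset), here is a minimal Python sketch of the baseline random-token attack. The model id, suffix length, and attempt budget are placeholders rather than the paper's settings, and the stronger Greedy Coordinate Gradient attack is not shown.

```python
# Minimal sketch (not the authors' code) of the classification robustness setup:
# a fine-tuned classifier, a baseline random-token suffix attack, and robustness
# measured as the fraction of attacked inputs still classified correctly.
# The model id, suffix length, and attempt budget are illustrative placeholders.
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder classifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def predict(text: str) -> int:
    """Predicted class index for a single input."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

def random_suffix_attack(text: str, label: int, n_suffix_tokens: int = 10,
                         n_attempts: int = 20) -> bool:
    """Append randomly chosen tokens; return True if any attempt flips the prediction."""
    vocab_ids = list(tokenizer.get_vocab().values())
    for _ in range(n_attempts):
        suffix = tokenizer.decode(random.sample(vocab_ids, n_suffix_tokens),
                                  skip_special_tokens=True)
        if predict(text + " " + suffix) != label:
            return True
    return False

def robustness(dataset) -> float:
    """Fraction of (text, label) pairs classified correctly even under attack."""
    n_correct = sum(
        1 for text, label in dataset
        if predict(text) == label and not random_suffix_attack(text, label)
    )
    return n_correct / len(dataset)
```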
Jul 25, 2024 • 3min

LW - "AI achieves silver-medal standard solving International Mathematical Olympiad problems" by gjm

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "AI achieves silver-medal standard solving International Mathematical Olympiad problems", published by gjm on July 25, 2024 on LessWrong. Google DeepMind reports on a system for solving mathematical problems that allegedly is able to give complete solutions to four of the six problems on the 2024 IMO, putting it near the top of the silver-medal category. Well, actually, two systems for solving mathematical problems: AlphaProof, which is more general-purpose, and AlphaGeometry, which is specifically for geometry problems. (This is AlphaGeometry 2; they reported earlier this year on a previous version of AlphaGeometry.) AlphaProof works in the "obvious" way: an LLM generates candidate next steps which are checked using a formal proof-checking system, in this case Lean. One not-so-obvious thing, though: "The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found." (That last bit is reminiscent of something from the world of computer go: a couple of years ago someone trained a custom version of KataGo specifically to solve the infamous Igo Hatsuyoron problem 120, starting with ordinary KataGo and feeding it training data containing positions reachable from the problem's starting position. They claim to have laid that problem to rest at last.) AlphaGeometry is similar but uses something specialized for (I think) Euclidean planar geometry problems in place of Lean. The previous version of AlphaGeometry allegedly already performed at gold-medal IMO standard; they don't say anything about whether that version was already able to solve the 2024 IMO problem that was solved using AlphaGeometry 2. AlphaProof was able to solve questions 1, 2, and 6 on this year's IMO (two algebra, one number theory). It produces Lean-formalized proofs. AlphaGeometry 2 was able to solve question 4 (plane geometry). It produces proofs in its own notation. The solutions found by the Alpha... systems are at https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/index.html. (There are links in the top-of-page navbar to solutions to the individual problems.) (If you're curious about the IMO questions or want to try them yourself before looking at the machine-generated proofs, you can find them -- and those for previous years -- at https://www.imo-official.org/problems.aspx.) One caveat (note: an earlier version of what I wrote failed to notice this and quite wrongly explicitly claimed something different): "First, the problems were manually translated into formal mathematical language for our systems to understand." It feels to me like it shouldn't be so hard to teach an LLM to convert IMO problems into Lean or whatever, but apparently they aren't doing that yet. Another caveat: "Our systems solved one problem within minutes and took up to three days to solve the others." Later on they say that AlphaGeometry 2 solved the geometry question within 19 seconds, so I guess that was also the one that was done "within minutes". Three days is a lot longer than human IMO contestants get given, but this feels to me like the sort of thing that will predictably improve pretty rapidly. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
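For readers who haven't seen one, a "Lean-formalized proof" is a statement plus a proof script that the Lean proof checker verifies mechanically. The toy example below only illustrates that format; it is not an AlphaProof output, whose actual IMO solutions are at the link above.

```lean
-- Toy Lean 4 example: a statement and a tactic proof that the checker verifies.
-- This is not an AlphaProof output, just an illustration of the format.
theorem toy_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```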
Jul 25, 2024 • 11min

EA - Applications Now Open for AIM's 2025 Programs by CE

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Applications Now Open for AIM's 2025 Programs, published by CE on July 25, 2024 on The Effective Altruism Forum. In short: Applications for AIM's two upcoming programs - the Charity Entrepreneurship Incubation Program and AIM's new Founding to Give Program - are now open. You can apply to multiple programs with one joint application form. The deadline to apply for all programs is September 15th. Over the past five years, AIM has incubated 40 highly effective nonprofits and secured over $3.9 million in seed grants. These organizations now reach over 35 million people and have the potential to improve the lives of more than 1 billion animals through their interventions. We are incredibly proud of the success achieved by these organizations and their dedicated founders. We believe founding an organization is the most impactful career path for fast-moving, entrepreneurial individuals. In our search for more impact-driven individuals to found field-leading organizations, we are excited to announce that our applications are now open. Dates: Charity Entrepreneurship Incubation Program: February-March 2025 (8 weeks) July - August 2025 (8 weeks) AIM Founding to Give: 6th of January - 28th of March 2025 (12 weeks) AIM Research Fellowship - Expression of Interest (dates TBD, likely early 2025) About the Charity Entrepreneurship Incubation Program Why apply? Put simply, we believe that founding a charity is likely one of the most impactful and exciting career options for people who are a good fit. In just a few years of operation, our best charities have gone from an idea to organizations with dozens of staff improving the lives of millions of people and animals every year. We provide the training, funding, research, and mentorship to ensure that people with the right aptitudes and a relentless drive to improve the world can start a charity, no matter their background or prior experience. This could be you! Our initial application form takes as little as 30 minutes - take a look at our applicant resources and apply now. Who is this program for? Individuals who want to make a huge impact with their careers. Charity entrepreneurs are ambitious, fast-moving, and prioritize impact above all. They are focused on cost-effectiveness and are motivated to pilot and scale an evidence-backed intervention. We have found that those from consulting backgrounds, for-profit entrepreneurship, effective NGOs, or recent graduates perform well in this program. What we offer: 2-month full-time training with two weeks in-person in London. Stipend of £1900/month during (and potentially up to 2 months after) the program. Incredibly talented individuals to co-found your new project with. Possibility to apply for $100,000 - $200,000 seed funding (~80% of projects get funded). Membership of the AIM alumni network, connecting you to mentorship, funders, and a community of other founders. The ideas: We are excited to announce our top charity ideas for the upcoming CE incubator. These ideas are the results of a seven-stage research process. To be brief, we have sacrificed nuance. In the upcoming weeks, full reports will be announced in our newsletter, published on our website, and posted on the EA Forum. 
Cage-free in the Middle East: An organization focused on good-cop cage-free corporate campaigning in neglected countries in the Middle East (United Arab Emirates, Saudi Arabia, and Egypt). Keel Bone Fractures: A charity working on how farmers can reduce the prevalence of keel bone fractures (KBF) in cage-free layer hens, ideally through outreach to certifiers to update their certification standards to include an outcome-based limit on KBF. Fish Welfare East Asia: We continue to recommend an organization that works with farmers in neglected, high-priority countries in East Asia (Philippines, Taiwan, and Indo...
Jul 25, 2024 • 4min

EA - Forum update: User database, card view, and more (Jul 2024) by Sarah Cheng

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Forum update: User database, card view, and more (Jul 2024), published by Sarah Cheng on July 25, 2024 on The Effective Altruism Forum. Highlights since our last feature update: People directory Google Doc import Card view for the Frontpage User profile updates Updates to sequences We've also hosted 3 events, significantly improved the site speed, and made many more improvements across the Forum. As always, we'd love feedback on these changes. People directory We've built a filterable database where anyone can find Forum users they might want to hire, collaborate with, or read. You can filter the database by role, organization, interests and more. If you want people to easily find you, consider adding more information to your Forum profile. Google Doc import You can now import posts directly from Google Docs. Footnotes and images will be imported along with the text. How it works: Write and format your post in Google Docs Share the Google doc with eaforum.posts@gmail.com Copy the link to your doc, click "Import Google doc" on a new Forum post, and paste the link For more details, check out Will Howard's quick take. Card view for the Frontpage Card view is a new way to view the Frontpage, allowing you to see post images and a preview of the content. Try it out by changing to card view in the dropdown to the right of "Customize feed". User profile updates We've redesigned the Edit profile page, and moved "Display name" to this page (from Account settings). Feel free to update your display name to add a GWWC pledge diamond. We've also re-enabled topic interests. This helps possible collaborators or hiring managers find you in our new People directory, as well as letting you make your profile a bit more information dense and personalized. Updates to sequences You can now get notified when posts are added to a sequence. We've also updated the sequence editor, including adding the ability to delete sequences. Other updates Forum performance is greatly improved We've made several behind the scenes changes which have sped up the Forum loading times. This should make your experience of using the Forum a lot smoother. Improved on-site audio experience While the previous version was static, attached to the top of the post, the new audio player sticks to the bottom of the page while the reader scrolls through the article. Once you have clicked the speaker icon to open the audio player, you can click the play button next to any header to start playing the audio from there. We've fixed our Twitter bot Our Twitter bot tweets posts when they hit 40 karma. It's been out of commission for a while, but now it's back up and running! We encourage you to follow, share, and retweet posts that you think are valuable. Updated UI for post stats We've simplified the UI and added high level metrics to your post stats. Before and after: Self-service account deletion There is now a section at the bottom of your account settings allowing you to permanently delete your personal data from the Forum. Events since March Draft Amnesty Week We hosted a Draft Amnesty Week to encourage people to publish posts that had been languishing in draft or procrastinated states. We had around 50 posts published. We also made some design changes to how we show events on the Forum, which will make future events more visible and navigable. 
"Ways the world is getting better" banner This banner was on the Forum for a week. Users could add emojis to the banner, which would show a piece of good news when someone hovered over them. When you clicked the emojis, you would be linked to an article explaining the good news. AI Welfare Debate Week Our first interactive debate week had over 500 Forum users voting on our interactive banner, around 30 posts, and much discussion in comments and quick takes. We received a lot of enco...
Jul 25, 2024 • 2min

EA - It's OK to kill and eat animals - but don't get caught slapping one. by Denis

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: It's OK to kill and eat animals - but don't get caught slapping one., published by Denis on July 25, 2024 on The Effective Altruism Forum. Very short post. This is not my area of expertise at all. But it seems like an opportunity. The Olympics start this week. In the UK, the biggest Olympic story is not about any runner or swimmer or gymnast. It is about animal rights. But, as with most animal-rights stories which make the front-pages (bull-fighting, hunting), it misses the real problem, factory-farming. The story: Apparently a famous Olympian equestrian has been forced to withdraw from the Olympics after footage emerged of her whipping a horse during training, 4 years ago. Cue the standard apologies, the "error of judgment" comment, the universal condemnation - and of course the video is shared with a warning that people might find this shocking. I think it would be wonderful if someone with the right moral stature (which is not me, I'm not even a vegan ...) were to highlight the absurdity of so much moral outrage for an otherwise well-treated, well-fed horse who gets whipped on the leg one time, but no reaction to the billions of factory-farmed animals who suffer in cages for their entire lives before we kill them and eat them. Maybe it would make people think again about factory-farming, or at least ask themselves if their views on animals were consistent. I was reminded of the Tolstoy description of a lady who "faints when she sees a calf being killed, she is so kind hearted that she can't look at the blood, but enjoys serving the calf up with sauce." My point with this post is just that if someone is in a position to express a public opinion on this, or write a letter to the editor, it might be an opportune moment given the size of the story right now. Charlotte Dujardin out of Olympics: The video, the reaction and what happens now explained | Olympics News | Sky Sports Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 25, 2024 • 35min

AF - AI Constitutions are a tool to reduce societal scale risk by Samuel Dylan Martin

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Constitutions are a tool to reduce societal scale risk, published by Samuel Dylan Martin on July 25, 2024 on The AI Alignment Forum. Sammy Martin, Polaris Ventures As AI systems become more integrated into society, we face potential societal-scale risks that current regulations fail to address. These risks include cooperation failures, structural failures from opaque decision-making, and AI-enabled totalitarian control. We propose enhancing LLM-based AI Constitutions and Model Specifications to mitigate these risks by implementing specific behaviours aimed at improving AI systems' epistemology, decision support capabilities, and cooperative intelligence. This approach offers a practical, near-term intervention to shape AI behaviour positively. We call on AI developers, policymakers, and researchers to consider and implement improvements along these lines, as well as for more research into testing Constitution/Model Spec improvements, setting a foundation for more responsible AI development that reduces long-term societal risks. Introduction There is reason to believe that in the near future, autonomous, LLM based AI systems, while not necessarily surpassing human intelligence in all domains, will be widely deployed throughout society. We anticipate a world where AI will be making some decisions on our behalf, following complex plans, advising on decision-making and negotiation, and presenting conclusions without human oversight at every step. While this is already happening to some degree in low-stakes settings, we must prepare for its expansion into high-stakes domains (e.g. politics, the military), and do our best to anticipate the systemic, societal scale risks that might result and act to prevent them. Most of the important work on reducing societal-scale risk will, by their very nature, have to involve policy changes, for example to ensure that there are humans in the loop on important decisions, but there are some technical interventions which we have identified that can help. We believe that by acting now to improve the epistemology (especially on moral or political questions), decision support capabilities and cooperative intelligence of LLM based AI systems, we can mitigate near-term risks and also set important precedents for future AI development. We aim to do this by proposing enhancements to AI Constitutions or Model Specifications. If adopted, we believe these improvements will reduce societal-scale risks which have so far gone unaddressed by AI regulation. Here, we justify this overall conclusion and propose preliminary changes that we think might improve AI Constitutions. We aim to empirically test and iterate on these improvements before finalising them. Recent years have seen significant efforts to regulate frontier AI, from independent initiatives to government mandates. Many of these are just aimed at improving oversight in general (for example, the reporting requirements in EO 14110), but some are directed at destructive misuse or loss of control (for example, the requirement to prove no catastrophic potential in SB 1047 and the independent tests run by the UK AISI). Many are also directed at near-term ethical concerns. 
However, we haven't seen shovel-ready regulation or voluntary commitments proposed to deal with longer-term societal-scale risks, even though these have been much discussed in the AI safety community. Some experts (e.g. Andrew Critch) argue these may represent the most significant source of overall AI risk, and they have been discussed as 'societal-scale risks', for example in Critch and Russell's TASRA paper. What are these "less obvious" societal-scale risks? Some examples: Cooperation failures: AI systems are widely integrated into society, used for advice on consequential decisions and delegated decision-making power, but...
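As a purely hypothetical sketch of what "specific behaviours" in a Constitution or Model Spec might look like under the three headings the post proposes (epistemology, decision support, cooperative intelligence), consider the following. The wording and structure are invented for illustration and are not the authors' proposed text; the critique-prompt helper only gestures at how such principles are typically used in constitution-style training loops.

```python
# Illustrative sketch only: hypothetical constitution entries grouped under the
# three categories named in the post, plus a critique prompt in the general style
# of constitution-based training. None of this wording comes from the authors.
CONSTITUTION = {
    "epistemology": [
        "When asked about contested moral or political questions, present the main "
        "competing views and the strongest evidence for each before giving a judgement.",
    ],
    "decision_support": [
        "When advising on a consequential decision, state key uncertainties and "
        "identify which assumptions, if wrong, would change the recommendation.",
    ],
    "cooperative_intelligence": [
        "When assisting in a negotiation, look for mutually beneficial options and "
        "avoid advice that relies on threats or deception.",
    ],
}

def critique_prompt(principle: str, response: str) -> str:
    """Format a self-critique prompt asking whether a draft response follows a principle."""
    return (
        f"Principle: {principle}\n"
        f"Response: {response}\n"
        "Does the response comply with the principle? If not, explain how to revise it."
    )

if __name__ == "__main__":
    print(critique_prompt(CONSTITUTION["epistemology"][0], "Draft answer goes here."))
```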
Jul 25, 2024 • 8min

EA - Focus group study of Non-Western EAs' experiences with Western EAs by Yi-Yang

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Focus group study of Non-Western EAs' experiences with Western EAs, published by Yi-Yang on July 25, 2024 on The Effective Altruism Forum. Summary What are my goals? And what did I find? 1. Are cross cultural interactions (CCIs) in EA even an issue for non-Western EAs who attended the retreat? 1. It's more likely than not that they had experienced at least one mildly-to-moderately bad interaction. These are usually more subtle and unintentional. 2. It's very unlikely that they had experienced an extremely bad interaction. 3. It's very likely that their interactions are mostly positive. 2. How widespread is it? 1. Uncertain, but probably yes. Methodology I thought a retreat that happened before EAGxPhilippines was a good opportunity to talk to a bunch of non-Western EAs, so I ran a focus group session as a way to solicit people's experiences of CCIs in EA settings. The rules I enforced during that time were: To use Chatham house rule when talking about the session to others To keep our shared notes anonymised To differentiate between purely factual observations (e.g., I see this person doing that) and interpretations of these observations (e.g., I think they are bad) Results Negative experiences * indicates that I was the one who initially shared the experience, and hence may be biassed to get people to talk more about it. Experiences Supporting details * EAs in "perceived-to-be-lower-status-cultures" [e.g., non-Western] have to put much more effort to be included in spaces where EAs in "perceived-to-be-higher-status-cultures" [e.g., Western] occupy. OTOH, EAs in "perceived-to-be-higher-status-cultures" have to put much less effort to be included in spaces where "perceived-to-be-lower-status-cultures" occupy. 3 people gave supporting anecdotal evidence. "In a conference, I noticed EAs from 'low status cultures' weren't invited to hang out. OTOH, folks from 'high status cultures' were doing their own thing and not being super inclusive." "Someone from country X told me their effort is double or maybe triple to join events, book 1-1s, etc" "Everyone but me [in a group] was invited to an after-conference party. I suspect it's because I'm a POC." * EAs from "perceived-to-be-higher-status-cultures" hijacking (probably unintentionally) norms in spaces that belong to EAs from "perceived-to-be-lower-status-cultures" 1 person gave supporting anecdotal evidence 1 person gave counter anecdotal evidence Didn't really see people hijack conversations that much, but they have to sometimes push people to speak up more due to lack of comfort in speaking in other languages. 1 person gave a different hypothesis Different cultures have different wait times to fill the silence: some are longer and some are shorter. After telling people about this, they give other people more wait time. EAs usually find the opportunity cost of travelling to far away conferences very high. This makes EAs in far away countries less likely to interact with other EAs in other parts of the world. 1 person gave supporting anecdotal evidence Pressure to move to an EA hub. 1 person gave supporting anecdotal evidence. "In many EA forms they ask how willing you are to move to different hubs for work. But many people like myself aren't willing to uproot their entire lives. Maybe there should be more effort to have work that is remote-friendly, or time zone-friendly." 
Cause prioritisation done by folks are influenced by their location 1 person gave supporting anecdotal evidence "If you live somewhere without AI safety jobs, you're much more unlikely to pursue it." 1 person disagreed "I tend to separate out cause prio and personal fit. So I do the cause prio separately, and then look into what fits me." Folks in Asia think they're not a great fit for EA if they're not working on AI safety 1 person gave supporting anecd...
Jul 25, 2024 • 48min

LW - Llama Llama-3-405B? by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Llama Llama-3-405B?, published by Zvi on July 25, 2024 on LessWrong. It's here. The horse has left the barn. Llama-3.1-405B, and also Llama-3.1-70B and Llama-3.1-8B, have been released, and are now open weights. Early indications are that these are very good models. They were likely the best open weight models of their respective sizes at time of release. Zuckerberg claims that open weights models are now competitive with closed models. Yann LeCun says 'performance is on par with the best closed models.' This is closer to true than in the past, and as corporate hype I will essentially allow it, but it looks like this is not yet fully true. Llama-3.1-405B not as good as GPT-4o or Claude Sonnet. Certainly Llama-3.1-70B is not as good as the similarly sized Claude Sonnet. If you are going to straight up use an API or chat interface, there seems to be little reason to use Llama. That is a preliminary result. It is still early, and there has been relatively little feedback. But what feedback I have seen is consistent on this. Prediction markets are modestly more optimistic. This market still has it 29% to be the #1 model on Arena, which seems unlikely given Meta's own results. Another market has it 74% to beat GPT-4-Turbo-2024-04-09, which currently is in 5th position. That is a big chance for it to land in a narrow window between 1257 and 1287. This market affirms that directly on tiny volume. Such open models like Llama-3.1-405B are of course still useful even if a chatbot user would have better options. There are cost advantages, privacy advantages and freedom of action advantages to not going through OpenAI or Anthropic or Google. In particular, if you want to distill or fine-tune a new model, and especially if you want to fully own the results, Llama-3-405B is here to help you, and Llama-3-70B and 8B are here as potential jumping off points. I expect this to be the main practical effect this time around. If you want to do other things that you can't do with the closed options? Well, technically you can't do most of them under Meta's conditions either, but there is no reason to expect that will stop people, especially those overseas including in China. For some of these uses that's a good thing. Others, not as good. Zuckerberg also used the moment to offer a standard issue open source manifesto, in which he abandons any sense of balance and goes all-in, which he affirmed in a softball interview with Rowan Cheung. On the safety front, while I do not think they did their safety testing in a way that would have caught issues if there had been issues, my assumption is there was nothing to catch. The capabilities are not that dangerous at this time. Thus I do not predict anything especially bad will happen here. I expect the direct impact of Llama-3.1-405B to be positive, with the downsides remaining mundane and relatively minor. The only exception would be the extent to which this enables the development of future models. I worry that this differentially accelerates and enables our rivals and enemies and hurts our national security, and indeed that this will be its largest impact. 
And I worry more that this kind of action and rhetoric will lead us down the path where if things get dangerous in the future, it will become increasingly hard not to get ourselves into deep trouble, both in terms of models being irrevocably opened up when they shouldn't be and increasing pressure on everyone else to proceed even when things are not safe, up to and including loss of control and other existential risks. If Zuckerberg had affirmed a reasonable policy going forward but thought the line could be drawn farther down the line, I would have said this was all net good. Instead, I am dismayed. I do get into the arguments about open weights at the end of this post, because it felt obligato...
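To make the distillation and fine-tuning use case mentioned above concrete, here is a minimal sketch of loading one of the smaller open-weight checkpoints as a jumping-off point. The Hugging Face repository id is an assumption (access is gated behind Meta's license), and the 405B model requires multi-GPU sharding rather than a single call like this.

```python
# Minimal sketch of using an open-weights Llama 3.1 checkpoint as a starting point.
# The repo id is an assumption (gated: requires accepting Meta's license on the Hub),
# and the 405B variant needs multi-GPU sharding rather than one from_pretrained call.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed id; 70B follows the same pattern

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # half precision so the 8B model fits on one large GPU
    device_map="auto",            # requires the `accelerate` package
)

# From here one could run supervised fine-tuning on custom data (e.g. LoRA adapters
# via the `peft` library), or generate teacher outputs for distillation:
prompt = "Summarize the trade-offs of releasing open-weight models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```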
Jul 25, 2024 • 12min

LW - The last era of human mistakes by owencb

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The last era of human mistakes, published by owencb on July 25, 2024 on LessWrong. Suppose we had to take moves in a high-stakes chess game, with thousands of lives at stake. We wouldn't just find a good chess player and ask them to play carefully. We would consult a computer. It would be deeply irresponsible to do otherwise. Computers are better than humans at chess, and more reliable. We'd probably still keep some good chess players in the loop, to try to catch possible computer error. (Similarly we still have pilots for planes, even though the autopilot is often safer.) But by consulting the computer we'd remove the opportunity for humans to make a certain type of high stakes mistake. A lot of the high stakes decisions people make today don't look like chess, or flying a plane. They happen in domains where computers are much worse than humans. But that's a contingent fact about our technology level. If we had sufficiently good AI systems, they could catch and prevent significant human errors in whichever domains we wanted them to. In such a world, I think that they would come to be employed for just about all suitable and important decisions. If some actors didn't take advice from AI systems, I would expect them to lose power over time to actors who did. And if public institutions were making consequential decisions, I expect that it would (eventually) be seen as deeply irresponsible not to consult computers. In this world, humans could still be responsible for taking decisions (with advice). And humans might keep closer to sole responsibility for some decisions. Perhaps deciding what, ultimately, is valued. And many less consequential decisions, but still potentially large at the scale of an individual's life (such as who to marry, where to live, or whether to have children), might be deliberately kept under human control[1]. Such a world might still collapse. It might face external challenges which were just too difficult. But it would not fail because of anything we would parse as foolish errors. In many ways I'm not so interested in that era. It feels out of reach. Not that we won't get there, but that there's no prospect for us to help the people of that era to navigate it better. My attention is drawn, instead, to the period before it. This is a time when AI will (I expect) be advancing rapidly. Important decisions may be made in a hurry. And while automation-of-advice will be on the up, it seems like wildly unprecedented situations will be among the hardest things to automate good advice for. We might think of it as the last era of consequential human mistakes[2]. Can we do anything to help people navigate those? I honestly don't know. It feels very difficult (given the difficulty at our remove in even identifying the challenges properly). But it doesn't feel obviously impossible. What will this era look like? Perhaps AI progress is blisteringly fast and we move from something like the world of today straight to a world where human mistakes don't matter. But I doubt it. 
On my mainline picture of things, this era - the final one in which human incompetence (and hence human competence) really matters - might look something like this: Cognitive labour approaching the level of human thinking in many domains is widespread, and cheap. People are starting to build elaborate ecosystems leveraging its cheapness, since if one of the basic inputs to the economy is changed, the optimal arrangement of things is probably quite different (cf. the ecosystem of things built on the internet); but that process hasn't reached maturity. There is widespread access to standard advice, which helps to avoid some foolish errors, though this is only applicable to "standard" situations, and it isn't universal to seek that advice. In some domains, AI performance is significantly bet...
Jul 24, 2024 • 31min

AF - A framework for thinking about AI power-seeking by Joe Carlsmith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A framework for thinking about AI power-seeking, published by Joe Carlsmith on July 24, 2024 on The AI Alignment Forum. This post lays out a framework I'm currently using for thinking about when AI systems will seek power in problematic ways. I think this framework adds useful structure to the too-often-left-amorphous "instrumental convergence thesis," and that it helps us recast the classic argument for existential risk from misaligned AI in a revealing way. In particular, I suggest, this recasting highlights how much classic analyses of AI risk load on the assumption that the AIs in question are powerful enough to take over the world very easily, via a wide variety of paths. If we relax this assumption, I suggest, the strategic trade-offs that an AI faces, in choosing whether or not to engage in some form of problematic power-seeking, become substantially more complex. Prerequisites for rational takeover-seeking For simplicity, I'll focus here on the most extreme type of problematic AI power-seeking - namely, an AI or set of AIs actively trying to take over the world ("takeover-seeking"). But the framework I outline will generally apply to other, more moderate forms of problematic power-seeking as well - e.g., interfering with shut-down, interfering with goal-modification, seeking to self-exfiltrate, seeking to self-improve, more moderate forms of resource/control-seeking, deceiving/manipulating humans, acting to support some other AI's problematic power-seeking, etc.[2] Just substitute in one of those forms of power-seeking for "takeover" in what follows. I'm going to assume that in order to count as "trying to take over the world," or to participate in a takeover, an AI system needs to be actively choosing a plan partly in virtue of predicting that this plan will conduce towards takeover.[3] And I'm also going to assume that this is a rational choice from the AI's perspective.[4] This means that the AI's attempt at takeover-seeking needs to have, from the AI's perspective, at least some realistic chance of success - and I'll assume, as well, that this perspective is at least decently well-calibrated. We can relax these assumptions if we'd like - but I think that the paradigmatic concern about AI power-seeking should be happy to grant them. What's required for this kind of rational takeover-seeking? I think about the prerequisites in three categories: Agential prerequisites - that is, necessary structural features of an AI's capacity for planning in pursuit of goals. Goal-content prerequisites - that is, necessary structural features of an AI's motivational system. Takeover-favoring incentives - that is, the AI's overall incentives and constraints combining to make takeover-seeking rational. Let's look at each in turn. Agential prerequisites In order to be the type of system that might engage in successful forms of takeover-seeking, an AI needs to have the following properties: 1. Agentic planning capability: the AI needs to be capable of searching over plans for achieving outcomes, choosing between them on the basis of criteria, and executing them. 2. Planning-driven behavior: the AI's behavior, in this specific case, needs to be driven by a process of agentic planning. 1. Note that this isn't guaranteed by agentic planning capability. 1. 
For example, an LLM might be capable of generating effective plans, in the sense that that capability exists somewhere in the model, but it could nevertheless be the case that its output isn't driven by a planning process in a given case - i.e., it's not choosing its text output via a process of predicting the consequences of that text output, thinking about how much it prefers those consequences to other consequences, etc. 2. And note that human behavior isn't always driven by a process of agentic planning, either, despite our ...
