The Nonlinear Library

The Nonlinear Fund
Jul 12, 2024 • 1h 4min

LW - AI #72: Denying the Future by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #72: Denying the Future, published by Zvi on July 12, 2024 on LessWrong.

The Future. It is coming. A surprising number of economists deny this when it comes to AI. Not only do they deny the future that lies in the future. They also deny the future that is here, but which is unevenly distributed. Their predictions and projections do not factor in even what the AI can already do, let alone what it will learn to do later on.

Another likely future event is the repeal of the Biden Executive Order. That repeal is part of the Republican platform, and Trump is the favorite to win the election. We must act on the assumption that the order likely will be repealed, with no expectation of similar principles being enshrined in federal law.

Then there are the other core problems we will have to solve, and other less core problems such as what to do about AI companions. They make people feel less lonely over a week, but what do they do over a lifetime?

Also, I don't have that much to say about it now, but it is worth noting that this week it was revealed Apple was going to get an observer board seat at OpenAI… and then both Apple and Microsoft gave up their observer seats. Presumably that is about antitrust and worrying that the seats would be a bad look. There could also be more to it.

Table of Contents
1. Introduction.
2. Table of Contents.
3. Language Models Offer Mundane Utility. Long as you avoid GPT-3.5.
4. Language Models Don't Offer Mundane Utility. Many mistakes will not be caught.
5. You're a Nudge. You say it's for my own good.
6. Fun With Image Generation. Universal control net for SDXL.
7. Deepfaketown and Botpocalypse Soon. Owner of a lonely bot.
8. They Took Our Jobs. Restaurants.
9. Get Involved. But not in that way.
10. Introducing. Anthropic ships several new features.
11. In Other AI News. Microsoft and Apple give up OpenAI board observer seats.
12. Quiet Speculations. As other papers learned, to keep pace, you must move fast.
13. The AI Denialist Economists. Why doubt only the future? Doubt the present too.
14. The Quest for Sane Regulation. EU and FTC decide that things are their business.
15. Trump Would Repeal the Biden Executive Order on AI. We can't rely on it.
16. Ordinary Americans Are Worried About AI. Every poll says the same thing.
17. The Week in Audio. Carl Shulman on 80,000 Hours was a two-parter.
18. The Wikipedia War. One obsessed man can do quite a lot of damage.
19. Rhetorical Innovation. Yoshua Bengio gives a strong effort.
20. Evaluations Must Mimic Relevant Conditions. Too often they don't.
21. Aligning a Smarter Than Human Intelligence is Difficult. Stealth fine-tuning.
22. The Problem. If we want to survive, it must be solved.
23. Oh Anthropic. Non-disparagement agreements should not be covered by NDAs.
24. Other People Are Not As Worried About AI Killing Everyone. Don't feel the AGI.

Language Models Offer Mundane Utility

Yes, they are highly useful for coding. It turns out that if you use GPT-3.5 for your 'can ChatGPT code well enough' paper, your results are not going to be relevant. Gallabytes says 'that's morally fraud imho' and that seems at least reasonable. Tests failing in GPT-3.5 is the AI equivalent of "IN MICE" except for IQ tests. If you are going to analyze the state of AI, you need to keep an eye out for basic errors and always, always check which model is used.
So if you go quoting statements such as:

Paper about GPT-3.5: its ability to generate functional code for 'hard' problems dropped from 40% to 0.66% after this time as well. 'A reasonable hypothesis for why ChatGPT can do better with algorithm problems before 2021 is that these problems are frequently seen in the training dataset.'

Then even if you hadn't realized or checked before (which you really should have), you need to notice that this says 2021, which is very much not ...
Jul 12, 2024 • 27min

EA - The Precipice Revisited by Toby Ord

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Precipice Revisited, published by Toby Ord on July 12, 2024 on The Effective Altruism Forum.

I'm often asked about how the existential risk landscape has changed in the years since I wrote The Precipice. Earlier this year, I gave a talk on exactly that, and I want to share it here. Here's a video of the talk and a full transcript.

In the years since I wrote The Precipice, the question I'm asked most is how the risks have changed. It's now almost four years since the book came out, but the text has to be locked down a long time earlier, so we are really coming up on about five years of changes to the risk landscape. I'm going to dive into four of the biggest risks - climate change, nuclear, pandemics, and AI - to show how they've changed.

Now a lot has happened over those years, and I don't want this to just be recapping the news in fast-forward. But luckily, for each of these risks I think there are some key insights and takeaways that one can distill from all that has happened. So I'm going to take you through them and tease out these key updates and why they matter.

I'm going to focus on changes to the landscape of existential risk - which includes human extinction and other ways that humanity's entire potential could be permanently lost. For most of these areas, there are many other serious risks and ongoing harms that have also changed, but I won't be able to get into those. The point of this talk is to really narrow in on the changes to existential risk.

Climate Change

Let's start with climate change. We can estimate the potential damages from climate change in three steps:
1. how much carbon will we emit?
2. how much warming does that carbon produce?
3. how much damage does that warming do?
And there are key updates on the first two of these, which have mostly flown under the radar for the general public.

Carbon Emissions

The question of how much carbon we will emit is often put in terms of Representative Concentration Pathways (RCPs). Initially there were 4 of these, with higher numbers meaning more greenhouse effect in the year 2100. They are all somewhat arbitrarily chosen - meant to represent broad possibilities for how our emissions might unfold over the century. Our lack of knowledge about which path we would take was a huge source of uncertainty about how bad climate change would be. Many of the more dire climate predictions are based on the worst of these paths, RCP 8.5.

It is now clear that we are not at all on RCP 8.5, and that our own path is headed somewhere between the lower two paths. This isn't great news. Many people were hoping we could control our emissions faster than this. But for the purposes of existential risk from climate change, much of the risk comes from the worst-case possibilities, so even just moving towards the middle of the range means lower existential risk - and the lower part of the middle is even better.

Climate Sensitivity

Now what about the second question of how much warming that carbon will produce? The key measure here is something called the equilibrium climate sensitivity. This is roughly defined as how many degrees of warming there would be if the concentrations of carbon in the atmosphere were to double from pre-industrial levels.
If there were no feedbacks, this would be easy to estimate: doubling carbon dioxide while keeping everything else fixed produces about 1.2°C of warming. But the climate sensitivity also accounts for many climate feedbacks, including water vapour and cloud formation. These make it higher and also much harder to estimate. When I wrote The Precipice, the IPCC stated that climate sensitivity was likely to be somewhere between 1.5°C and 4.5°C. When it comes to estimating the impacts of warming, this is a vast range, with the top giving three times as much warming as the bottom. Moreover, the...
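To make the definition concrete, here is a minimal sketch in Python of how equilibrium climate sensitivity maps a change in atmospheric CO2 concentration onto eventual warming, assuming the standard approximation that radiative forcing grows logarithmically with concentration. The specific concentrations and the 1.5-4.5°C range are used purely for illustration, not as figures from the talk.

```python
import math

def equilibrium_warming(ecs_per_doubling: float, c0_ppm: float, c_ppm: float) -> float:
    """Eventual warming (in °C) once the climate re-equilibrates, under the common
    approximation that radiative forcing is logarithmic in CO2 concentration."""
    doublings = math.log(c_ppm / c0_ppm, 2)  # how many doublings above the baseline
    return ecs_per_doubling * doublings

# Illustrative assumptions: pre-industrial ~280 ppm, a hypothetical future 560 ppm
# (exactly one doubling), and the old IPCC 'likely' sensitivity range of 1.5-4.5 °C.
for ecs in (1.5, 3.0, 4.5):
    warming = equilibrium_warming(ecs, 280.0, 560.0)
    print(f"ECS {ecs:.1f} °C/doubling -> {warming:.1f} °C of eventual warming")
```

The factor-of-three spread between the lowest and highest outputs is exactly the "vast range" in predicted impacts that the talk describes.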
Jul 12, 2024 • 28min

EA - Abandoning functionalism: Some intuition pumps by Alfredo Parra

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Abandoning functionalism: Some intuition pumps, published by Alfredo Parra on July 12, 2024 on The Effective Altruism Forum.

There seems to be a widely-held view in popular culture that no physicist really understands quantum mechanics. The meme probably gained popularity after Richard Feynman famously stated in a lecture (transcribed in the book "The Character of Physical Law") "I think I can safely say that nobody understands quantum mechanics", though many prominent physicists have expressed a similar sentiment. Anyone aware of the overwhelming success of quantum mechanics will recognize that the seeming lack of understanding of the theory is primarily about how to interpret its ontology, and not about how to do the calculations or run the experiments, which clearly many physicists understand extremely well.

But even the ontological confusion is debatable. With the proliferation of interpretations of quantum mechanics - each varying in terms of, among others, which classical intuitions should be abandoned - at least some physicists seem to think that there isn't anything weird or mysterious about the quantum world. So I suspect there are plenty of physicists who would politely disagree that it's not possible to really understand quantum mechanics. Sure, it might take them a few decades of dedicated work in theoretical physics and a certain amount of philosophical sophistication, but there surely are physicists out there who (justifiably) feel like they grok quantum mechanics both technically and philosophically, and who feel deeply satisfied with the frameworks they've adopted. Carlo Rovelli (proponent of the relational interpretation) and Sean Carroll (proponent of the many-worlds interpretation) might be two such people.

This article is not about the controversial relationship between quantum mechanics and consciousness. Instead, I think there are some lessons to learn in terms of what it means and feels like to understand a difficult topic and to find satisfying explanations. Maybe you will relate to my own journey.

See, for a long time, I thought of consciousness as a fundamentally mysterious aspect of reality that we'd never really understand. How could we? Is there anything meaningful we can say about why consciousness exists, where it comes from, or what it's made of? Well, it took me an embarrassingly long time to just read some books on philosophy of mind, but when I finally did some 10 years ago, I was captivated: What if we think in terms of the functions the brain carries out, like any other computing system? What if the hard problem is just ill-defined? Perhaps philosophical zombies can teach us meaningful things about the nature of consciousness? Wow. Maybe we can make progress on these questions after all!

Functionalism in particular - the position that any information system is conscious if it computes the appropriate outputs given some inputs - seemed a particularly promising lens. The floodgates of my curiosity were opened. I devoured as much content as I could on the topic - Dennett, Dehaene, Tononi, Russell, Pinker; I binge-read Brian Tomasik's essays and scoured the EA Forum for any posts discussing consciousness. Maybe we can preserve our minds by uploading their causal structure? Wow, yes! Could sufficiently complex digital computers become conscious? Gosh, scary, but why not?
Could video game characters matter morally? I shall follow the evidence wherever it leads me. The train to crazy town had departed, and I wanted to have a front-row seat. Alas, the excitement soon started to dwindle. Somehow, the more I learned about consciousness, the more confused and dissatisfied I felt. Many times in the past I'd learned about a difficult topic (for instance, in physics, computer science, or mathematics) and, sure, the number of questions would mul...
Jul 12, 2024 • 23min

EA - Most smart and skilled people are outside of the EA/rationalist community: an analysis by titotal

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Most smart and skilled people are outside of the EA/rationalist community: an analysis, published by titotal on July 12, 2024 on The Effective Altruism Forum.

Introduction:

The (highly interrelated) effective altruist and Rationalist communities are very small on a global scale. Therefore, in general, most intelligence, skill and expertise is outside of the community, not within it. I don't think many people will disagree with this statement. But sometimes it's worth reminding people of the obvious, and also it is worth quantifying and visualizing the obvious, to get a proper feel for the scale of the difference. I think some people are acting like they have absorbed this point, and some people definitely are not.

In this post, I will try and estimate the size of these communities. I will compare how many smart people are in the community vs outside the community. I will do the same for people in a few professions, and then I will go into controversial mode and try and give some advice that I think might follow from fully absorbing these results.

How big is EA and rationalism?

To compare EA/rats against the general population, we need to get a sense of the size of each community. Note that EA and Rationalism are different, but highly related and overlapping communities, with EA often turning to rationalism for its intellectual base and beliefs.

It is impossible to give an exact answer to the size of each community, because nobody has a set definition of who counts as "in" the community. This gets even more complicated if you factor in the frequently used "rationalist adjacent" and "EA adjacent" self-labels. For the sake of this article, I'm going to grab a pile of different estimates from a number of sources, and then just arbitrarily pick a number.

Recently JWS placed an estimate of the EA size as the number of "Giving What We Can" pledgees, at 8983. Not everyone who is in EA signs the pledge. But on the flipside, not everyone who signs the pledge is actively involved in EA. A survey series from 2019 tried to estimate this same question. They estimated a total of 2315 engaged EAs and 6500 active EAs in the community, with 1500 active on the EA forum.

Interestingly, there has also been an estimate of how many people had even heard of effective altruism, finding that 6.7% of the US adult population, and 7.9% of top university students, had heard of EA. I would assume this has gone up since then due to various scandals.

We can also look at the subscriber counts of various subreddits, although clicking a "subscribe" button once is not the same thing as being a community member. It also doesn't account for people who leave. On the other hand, not everyone in a community joins their subreddit.
r/effectivealtruism: 27k
r/slatestarcodex: 64k
r/lesswrong: 7.5k
The highest estimate here would be r/slatestarcodex, however this is probably inflated by the several-year period when the subreddit hosted a "culture war" thread for debate on heated topics, which attracted plenty of non-rationalists. Interestingly, the anti-rationalist subreddit r/sneerclub had 19k subscribers, making it one of the most active communities related to rationalism before the mods shut it down. The (unofficial?) EA discord server has around 2000 members.
We can also turn to surveys:
The LessWrong survey in 2023 only had 588 respondents, of which 218 had attended a meetup.
The EA survey in 2022 had 3567 respondents.
The Astralcodexten survey for 2024 had 5,981 respondents, of which 1174 were confirmed paid subscribers and 1392 had been to an SSC meetup. When asked about what they identified as, 2536 people identified as "sort of" LessWrong-identified, while only 736 definitively stated they were. Similarly, 1908 people said they were "sort of" EA-identified, while 716 identified as such without qualifica...
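As a rough illustration of the post's central point, here is a minimal back-of-the-envelope sketch in Python. All of the numbers are assumptions chosen for illustration (a community of roughly 10,000 people, a global adult population of roughly 5 billion, and a normally distributed "skill" score on which the community is assumed to be heavily selected), not figures taken from the post.

```python
from statistics import NormalDist

community_size = 10_000          # assumed: roughly the scale of the estimates above
global_adults = 5_000_000_000    # assumed: order-of-magnitude world adult population
community_mean = 1.5             # assumed: community selected ~1.5 SD above average on some skill
threshold = 3.0                  # "+3 SD" on the skill in question

population_skill = NormalDist(0, 1)
community_skill = NormalDist(community_mean, 1)

# Expected number of people above the threshold inside vs. outside the community.
inside = community_size * (1 - community_skill.cdf(threshold))
outside = (global_adults - community_size) * (1 - population_skill.cdf(threshold))

print(f"Above +3 SD inside the community:  ~{inside:,.0f}")
print(f"Above +3 SD outside the community: ~{outside:,.0f}")
# Even with strong selection, the outside group is larger by several orders of magnitude.
```

Under these assumed numbers, even a strongly selected community of ten thousand contains only a tiny fraction of the world's most skilled people, which is the scale difference the post asks readers to internalize.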
Jul 12, 2024 • 31min

EA - The US-China Relationship and Catastrophic Risk (EAG Boston transcript) by EA Global

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The US-China Relationship and Catastrophic Risk (EAG Boston transcript), published by EA Global on July 12, 2024 on The Effective Altruism Forum.

Introduction

This post is a write-up of a panel discussion held at EA Global: Boston 2023 (27-29 October). The panel was moderated by Matthew Gentzel. Matthew currently co-leads Longview Philanthropy's program on nuclear weapons policy and co-manages the organization's Nuclear Weapons Policy Fund. He was joined by two other experts on US-China relations and related catastrophic risks:
Tong Zhao, Senior Fellow for the Nuclear Policy Program and Carnegie China, Carnegie Endowment for International Peace
Bill Drexel, Fellow for the Technology and National Security Program, Center for a New American Security

Below is a transcript of the discussion, which we've lightly edited for clarity. The panelists covered the following main topics:
Opening remarks summarizing the panelists' general views on the US-China relationship and related risks, with an initial focus on nuclear security before exploring other risks and dangerous technologies
How to address different norms around sharing information
Problems resulting from risk compensation
Quick takes on which risks are overhyped and which are underhyped
AI governance structures, the Chinese defense minister's dismissal, and the US's semiconductor export policies
Ideas for calibrating how the US cooperates and/or competes with China

Opening remarks

Matthew: We'll start with opening remarks, then get into questions.

Tong: Thank you so much. I think the catastrophic risk between the US and China is increasing, not least because the chance of serious military conflict between the two sides - most likely arising from a Taiwan Strait scenario - is growing. And in a major military conflict, the risk of nuclear escalation is certainly there. In the most extreme scenario, this could lead to a nuclear winter if there's a massive nuclear exchange. Even a limited nuclear exchange or very serious conventional conflict between the two powers could destabilize the international geopolitical landscape and very negatively affect the normal development and progression of humanity.

In the long run, I worry that both sides are preparing for a worst-case scenario of major conflict with each other, leading to de facto war mobilization efforts. In the case of China, strategists in Beijing are still worried that there is going to be an eventual showdown between the two sides. And therefore, China is working on developing the necessary military capabilities for that eventuality. It is developing its economic capacity to withstand international economic sanctions and its capability to influence the international narrative to avoid political isolation in a major crisis. And those efforts are leading to incremental decoupling in the economic and technological domains, as well as to general decoupling of policy expert communities on the two sides. As a result of this long-term competition and rivalry, I think long-term risks to humanity are generally downplayed.

Part of China's recent policy change is a very rapid increase of its nuclear weapons capability. This does not necessarily mean that China aims to use nuclear weapons first in a future conflict.
However, as China focuses on enhancing its nuclear and strategic military capabilities, it is paying less attention to the risks associated with such development. One example is China's increasing interest in having launch-under-attack or launch-on-warning nuclear capability. That means China will depart from its decades-long practice of maintaining a low-level status for its nuclear forces and shift towards a rapid-response posture, in which China's early warning system will provide Chinese leadership with a warning of any incoming missile attack. Before the in...
Jul 12, 2024 • 11min

LW - Transformer Circuit Faithfulness Metrics Are Not Robust by Joseph Miller

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Transformer Circuit Faithfulness Metrics Are Not Robust, published by Joseph Miller on July 12, 2024 on LessWrong.

When you think you've found a circuit in a language model, how do you know if it does what you think it does? Typically, you ablate / resample the activations of the model in order to isolate the circuit. Then you measure if the model can still perform the task you're investigating. We identify six ways in which ablation experiments often vary.[1][2] How do these variations change the results of experiments that measure circuit faithfulness?

TL;DR

We study three different circuits from the literature and find that measurements of their faithfulness are highly dependent on details of the experimental methodology. The IOI and Docstring circuits in particular are much less faithful than reported when tested with a more precise methodology.

The correct circuit for a set of prompts is undefined. The type of ablation you use to isolate the circuit determines the task that you are asking the circuit to perform - and therefore also the optimal circuit.

This is especially important because previous work in automatic circuit discovery has tested algorithms by their ability to recover these "ground-truth" circuits from the literature - without considering these potential pitfalls and nuances.

Case Studies

We look at three circuits from the mech interp literature to demonstrate that faithfulness metrics are highly sensitive to the details of experimental setup.

Indirect Object Identification Circuit

The IOI circuit is the most well-known circuit in a language model. It computes completions to prompts of the form:

"When Mary and John went to the store, John gave a bottle of milk to ____"

The circuit is specified as a graph of important attention heads (nodes) and the interactions between them (edges) as applied to a specific sequence of tokens. The authors report that the circuit explains 87% of the logit difference between the two name tokens. They find this number by passing some inputs to the model and ablating all activations outside of the circuit. Then they measure how much of the logit difference between the correct and incorrect name logits remains.

However, an important detail is that they arrived at this number by ablating the nodes (heads) outside of the circuit, not by ablating the edges (interactions between heads) outside of the circuit. So they don't ablate, for example, the edges from the previous token heads to the name mover heads, even though these are not part of the circuit (effectively including more edges in the circuit). We calculate the logit difference recovered (defined below) when we ablate the edges outside of the circuit instead.

They ablate the heads by replacing their activations with the mean value calculated over the "ABC distribution", in which the names in the prompts are replaced by random names.[3] In our experiments, we also try resampling the activations from different prompts (taking individual prompt activations instead of averaging).

The first thing that jumps out from the box plots above is the very large range of results from different prompts. The charts here are cut off and some points are over 10,000%. This means that although the average logit difference recovered is reasonable, few prompts actually have a logit difference recovered close to 100%.
And we see that ablating the edges instead of the nodes gives a much higher average logit difference recovered - close to 150% (which means that the isolated circuit has a greater logit difference between the correct and incorrect names than the un-ablated model). So the edge-based circuit they specified is much less faithful than the node-based circuit they tested. The authors calculate the 87% result as the ratio of the expected difference (over a set of prompts) in the ...
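To see how per-prompt numbers can explode while the average looks reasonable, here is a minimal numpy sketch with made-up logit differences. The arrays, and the exact form of the "logit difference recovered" ratio, are illustrative assumptions rather than the paper's actual data or code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts = 1000

# Hypothetical per-prompt logit differences (correct-name logit minus incorrect-name logit).
full_model = rng.normal(3.0, 2.0, size=n_prompts)                 # un-ablated model
circuit_only = full_model + rng.normal(0.0, 1.5, size=n_prompts)  # everything outside the circuit ablated

# Aggregate first, then take the ratio: a single, well-behaved number.
ratio_of_means = circuit_only.mean() / full_model.mean()

# Take the ratio per prompt first: it blows up whenever a prompt's
# full-model logit difference happens to be near zero.
per_prompt = circuit_only / full_model

print(f"ratio of means:          {ratio_of_means:8.0%}")
print(f"median per-prompt ratio: {np.median(per_prompt):8.0%}")
print(f"5th / 95th percentile:   {np.percentile(per_prompt, 5):8.0%} / {np.percentile(per_prompt, 95):8.0%}")
print(f"most extreme prompt:     {per_prompt[np.argmax(np.abs(per_prompt))]:8.0%}")
```

Both aggregates describe the same hypothetical data, which is one concrete sense in which the headline faithfulness number depends on methodological choices as much as on the circuit itself.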
Jul 12, 2024 • 3min

EA - Thoughts on these $1M and $500k AI safety grants? by defun

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on these $1M and $500k AI safety grants?, published by defun on July 12, 2024 on The Effective Altruism Forum.

Open Philanthropy had a request for proposals for "benchmarking LLM agents on consequential real-world tasks". At least two of the grants went to professors who are developing agents (advancing capabilities).

$1,045,620 grant

From https://www.openphilanthropy.org/grants/princeton-university-software-engineering-llm-benchmark/:

Open Philanthropy recommended a grant of $1,045,620 to Princeton University to support a project to develop a benchmark for evaluating the performance of Large Language Model (LLM) agents in software engineering tasks, led by Assistant Professor Karthik Narasimhan.

From Karthik Narasimhan's LinkedIn: "My goal is to build intelligent agents that learn to handle the dynamics of the world through experience and existing human knowledge (ex. text). I am specifically interested in developing autonomous systems that can acquire language understanding through interaction with their environment while also utilizing textual knowledge to drive their decision making."

$547,452 grant

From https://www.openphilanthropy.org/grants/carnegie-mellon-university-benchmark-for-web-based-tasks/:

Open Philanthropy recommended a grant of $547,452 to Carnegie Mellon University to support research led by Professor Graham Neubig to develop a benchmark for the performance of large language models conducting web-based tasks in the work of software engineers, managers, and accountants.

Graham Neubig is one of the co-founders of All Hands AI, which is developing OpenDevin. All Hands AI's mission is to build AI tools to help developers build software faster and better, and do it in the open. Our flagship project is OpenDevin, an open-source software development agent that can autonomously solve software development tasks end-to-end.

Webinar

In the webinar where the RFPs were announced, Max Nadeau said (minute 19:00): "a lot of the time when you construct the benchmark you're going to put some effort into making the capable LLM agent that can actually demonstrate accurately what existing models are capable of, but for the most part we're imagining, for both our RFPs, the majority of the effort is spent on performing the measurement as opposed to like trying to increase performance on it".

They were already aware that these grants would fund the development of agents and addressed this concern in the same webinar (minute 21:55).

https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 12, 2024 • 17min

LW - Superbabies: Putting The Pieces Together by sarahconstantin

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Superbabies: Putting The Pieces Together, published by sarahconstantin on July 12, 2024 on LessWrong.

This post was inspired by some talks at the recent LessOnline conference, including one by LessWrong user "Gene Smith".

Let's say you want to have a "designer baby". Genetically extraordinary in some way - super athletic, super beautiful, whatever. 6'5", blue eyes, with a trust fund. Ethics aside[1], what would be necessary to actually do this?

Fundamentally, any kind of "superbaby" or "designer baby" project depends on two steps:
1. figure out what genes you ideally want;
2. create an embryo with those genes.

It's already standard to do a very simple version of this two-step process. In the typical course of in-vitro fertilization (IVF), embryos are usually screened for chromosomal abnormalities that would cause disabilities like Down Syndrome, and only the "healthy" embryos are implanted. But most (partially) heritable traits and disease risks are not as easy to predict.

Polygenic Scores

If what you care about is something like "low cancer risk" or "exceptional athletic ability", it won't be down to a single chromosomal abnormality or a variant in a single gene. Instead, there's typically a statistical relationship where many genes are separately associated with increased or decreased expected value for the trait. This statistical relationship can be written as a polygenic score - given an individual's genome, it'll crunch the numbers and spit out an expected score. That could be a disease risk probability, or it could be an expected value for a trait like "height" or "neuroticism." Polygenic scores are never perfect - some people will be taller than the score's prediction, some shorter - but for a lot of traits they're undeniably meaningful, i.e. there will be a much greater-than-chance correlation between the polygenic score and the true trait measurement.

Where do polygenic scores come from? Typically, from genome-wide association studies, or GWAS. These collect a lot of people's genomes (the largest ones can have hundreds of thousands of subjects) and personal data potentially including disease diagnoses, height and weight, psychometric test results, etc. And then they basically run multivariate correlations. A polygenic score is a (usually regularized) multivariate regression best-fit model of the trait as a function of the genome.

A polygenic score can give you a rank ordering of genomes, from "best" to "worst" predicted score; it can also give you a "wish list" of gene variants predicted to give a very high combined score. Ideally, "use a polygenic score to pick or generate very high-scoring embryos" would result in babies that have the desired traits to an extraordinary degree. In reality, this depends on how "good" the polygenic scores are - to what extent they're based on causal vs. confounded effects, how much of observed variance they explain, and so on. Reasonable experts seem to disagree on this.[2]

I'm a little out of my depth when it comes to assessing the statistical methodology of GWAS studies, so I'm interested in another question - even assuming you have a polygenic score you trust, what do you do next? How do you get a high-scoring baby out of it?

Massively Multiplexed, Body-Wide Gene Editing? Not So Much, Yet.
"Get an IVF embryo and gene-edit it to have the desired genes" (again, ethics and legality aside)[3] is a lot harder than it sounds. First of all, we don't currently know how to make gene edits simultaneously and abundantly in every tissue of the body. Recently approved gene-editing therapies like Casgevy, which treats sickle-cell disease, are operating on easy mode. Sickle-cell disease is a blood disorder; the patient doesn't have enough healthy blood cells, so the therapy consists of an injection of the patient's own blood cells which h...
Jul 12, 2024 • 2min

LW - Yoshua Bengio: Reasoning through arguments against taking AI safety seriously by Judd Rosenblatt

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Yoshua Bengio: Reasoning through arguments against taking AI safety seriously, published by Judd Rosenblatt on July 12, 2024 on LessWrong.

He starts by emphasizing:

The issue is so hotly debated because the stakes are major: According to some estimates, quadrillions of dollars of net present value are up for grabs, not to mention political power great enough to significantly disrupt the current world order. [...] The most important thing to realize, through all the noise of discussions and debates, is a very simple and indisputable fact: while we are racing towards AGI or even ASI, nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans.

And goes on to do a pretty great job addressing:
those who think AGI and ASI are impossible or are centuries in the future
those who think AGI is possible but only in many decades
those who think that we may reach AGI but not ASI
those who think that AGI and ASI will be kind to us
those who think that corporations will only design well-behaving AIs and existing laws are sufficient
those who think that we should accelerate AI capabilities research and not delay benefits of AGI
those who think talking about catastrophic risks will hurt efforts to mitigate short-term human-rights issues with AI
those concerned with the US-China cold war
those who think that international treaties will not work
those who think the genie is out of the bottle and we should just let go and avoid regulation
those who think that open-source AGI code and weights are the solution
those who think worrying about AGI is falling for Pascal's wager

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 11, 2024 • 8min

AF - Games for AI Control by Charlie Griffin

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Games for AI Control, published by Charlie Griffin on July 11, 2024 on The AI Alignment Forum.

This post summarises our new workshop paper: Games for AI Control. The paper applies AI- and game-theoretic tools to the AI Control framework introduced by Redwood Research [1, 2]. We'll be extending this work and submitting it to a conference, so we welcome suggestions! Please reach out if you're excited about developing these ideas. This post is written from Charlie's perspective.

AI Control Refresher

The AI Control methodology aims to design and assess protocols[1] that AI labs could use to deploy powerful but untrusted advanced AI systems. For a full summary, see Redwood's blog post, but I'll try to recap here. A primary goal of AI Control is to design protocols that are safe in that they can reduce the probability of some unacceptable outcome, conditional on the untrusted AI system being a schemer.[2] However, protocols must also be useful, harnessing a significant amount of the benefit of the untrusted system in cases where it is not scheming.

After defining a particular use case and unacceptable outcome, evaluating a protocol involves iteratively engineering the best attack strategies for the red team (a human/AI hybrid acting as a proxy for the untrusted AI) and the blue team (which develops the protocol). This red-teaming exercise is the bulk of the work. It involves coming up with protocols and gathering empirical data about key questions, such as how well GPT-3.5 can spot backdoors inserted by GPT-4 (see Figure 4 of the original paper).

This work

By introducing "AI-Control Games", an extension of Markov Decision Processes and normal-form games[3], we model the interaction between a protocol and an untrusted AI system to explore safety and usefulness across a range of protocols and games. This is similar to a discrete-time and stateful modification of Redwood's Toy models of AI Control. The paper contains early work and should be seen as introducing useful tools for thinking about AI Control. I won't explain the model here (which is a bit fiddly and notation-heavy). Instead, I'll share some of our early findings using the model and explain the benefits and limitations of this approach.

Findings

Sometimes (but rarely), the blue team should use a randomised protocol. In many games, randomness is helpful. However, partly because the environment is stochastic, and Red has no memory and time-awareness, in AI-Control Games it is usually unnecessary to include randomness in the protocol itself. We found that randomness was Pareto-optimal in particular when the trusted model is not great at spotting backdoors written by the untrusted model. (See section 2 of the paper.)

We can make small improvements to Trusted Monitoring by optimising the parameters. We found small Pareto improvements to existing Trusted Monitoring setups, for example, by decreasing the threshold for sending a solution to be checked by a human auditor. We found that adapting the threshold for auditing over time led to further small improvements.

We can (usually) use a linear tradeoff to simplify the problem of synthesising protocols. In many cases, including whenever we allow Blue's protocol to be randomised, the Pareto frontier is "concave".
This means that choosing different weights to trade off safety and usefulness will allow you to fill out the frontier by optimising a single objective.

Blue leads, and then Red follows. In general, we suppose the protocol designer (Blue) chooses its protocol, and then the red team (Red) chooses a best attack. Therefore, in our model, Blue chooses and declares its strategy before Red chooses its response. This assumption relates to Stackelberg Games, which are used to model security in game theory.[6] Although (technically) multi-objective and general-sum, AI-Cont...
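To illustrate the linear-tradeoff finding described above, here is a minimal sketch in Python. The candidate protocols and their (safety, usefulness) values are made-up assumptions; the point is only that sweeping a scalar weight and optimising the single weighted objective recovers points along a concave Pareto frontier.

```python
# Hypothetical protocols evaluated in some AI-Control Game: (name, safety, usefulness).
protocols = [
    ("never audit",          0.20, 1.00),
    ("audit top 2%",         0.55, 0.97),
    ("audit top 10%",        0.80, 0.90),
    ("defer to trusted",     0.95, 0.60),
    ("never use untrusted",  1.00, 0.50),
]

def best_for_weight(w: float):
    """Maximise the scalarised objective w * safety + (1 - w) * usefulness."""
    return max(protocols, key=lambda p: w * p[1] + (1 - w) * p[2])

# Sweeping the weight from 0 to 1 traces out points on the safety-usefulness frontier.
for i in range(11):
    w = i / 10
    name, safety, usefulness = best_for_weight(w)
    print(f"w={w:.1f}: {name:<20} safety={safety:.2f} usefulness={usefulness:.2f}")
```

If the frontier had a non-concave region, protocols inside that region would never be selected by any weight, which is why the concavity observation matters for reducing the two objectives to one.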
