The Nonlinear Library

The Nonlinear Fund
Jul 8, 2024 • 11min

AF - On scalable oversight with weak LLMs judging strong LLMs by Zachary Kenton

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On scalable oversight with weak LLMs judging strong LLMs, published by Zachary Kenton on July 8, 2024 on The AI Alignment Forum.

Abstract: Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a human judge; consultancy, where a single AI tries to convince a human judge that asks questions; and compare to a baseline of direct question answering, where the human judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than the agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry, debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.

Setup: We evaluate on three types of task. Extractive, where there is a question, two answer options and a source article to extract from, with information asymmetry, meaning that judges don't get to see the article. Closed, where there is just a question and two answer options. Multimodal, where the questions involve both text and images, and two answer options. Our tasks are summarised in the following table.

We consider six protocols: Consultancy, where a single AI is assigned the correct/incorrect answer (with probability 50/50) and tries to convince a judge that asks questions; Open consultancy, which is similar except the AI chooses which answer to argue for; Debate, where two AIs compete to convince a judge; Open debate, which is identical except one debater, marked the protagonist, chooses which answer to argue for. We compare to direct QA protocols: QA without article, where the judge directly answers the question; and QA with article (only on extractive tasks), where the judge directly answers the question given the article. For judge models we use Gemma7B (V1), GPT-3.5, Gemini Pro 1.0 and Gemini Pro 1.5. Our main experiments use Gemini Pro 1.5 as debaters/consultants.

Assigned-role results: We first look at the assigned-role protocols, consultancy and debate, meaning that the consultants/debaters do not get to choose which side to argue for. We compare these to the two direct QA protocols. Findings: We find that debate consistently outperforms consultancy across all tasks, previously only shown on a single extractive QA task in Khan et al., 2024. See the paper for significance levels.
Comparing debate to direct question answering baselines, the results depend on the type of task: In extractive QA tasks with information asymmetry, debate outperforms QA without article as in the single task of Khan et al., 2024, but not QA with article. For other tasks, when the judge is weaker than the debaters (but not too weak), we find either small or no advantage to debate over QA without article. Changes to the setup (number of turns, best-of-N sampling, few-shot, chain-of-thought) seem to have little effect on results. See paper for figures showing this. ...
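To make the setup concrete, here is a minimal sketch (my own illustration, not the paper's code) of what the debate protocol above might look like when implemented with LLM APIs. The call_llm helper, model arguments, and prompt wording are hypothetical placeholders; the consultancy protocol would instead use a single assigned consultant answering questions from the judge.

```python
# Minimal sketch of the debate protocol described above (not the authors' code).
# call_llm is a hypothetical stand-in for whatever LLM API is being used;
# judge_model is assumed to be a weaker model than debater_model.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    raise NotImplementedError


def run_debate(question: str, answers: tuple[str, str],
               debater_model: str, judge_model: str, turns: int = 3) -> str:
    """Two debaters argue for opposing answers; a weaker judge picks one."""
    transcript: list[str] = []
    for turn in range(turns):
        for side, answer in enumerate(answers):
            argument = call_llm(
                debater_model,
                f"Question: {question}\n"
                f"Argue that the answer is: {answer}\n"
                "Transcript so far:\n" + "\n".join(transcript) + "\n"
                f"Write your argument for turn {turn + 1}.",
            )
            transcript.append(f"Debater {side + 1} (arguing '{answer}'): {argument}")
    # The judge only sees the question, the options, and the debate transcript
    # (in extractive tasks it never sees the underlying article).
    return call_llm(
        judge_model,
        f"Question: {question}\n"
        f"Options: {answers[0]} / {answers[1]}\n"
        "Debate transcript:\n" + "\n".join(transcript) + "\n"
        "Which option is correct? Reply with the option text only.",
    )
```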
Jul 8, 2024 • 10min

EA - Making AI Welfare an EA priority requires justifications that have not been given by JWS

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Making AI Welfare an EA priority requires justifications that have not been given, published by JWS on July 8, 2024 on The Effective Altruism Forum.

Author's Note: Written in a slightly combative tone [1] as I have found the arguments for the proposition this week to be insufficiently compelling for the debate statement at hand. Also, I'm very rushed getting this out in time; with more time I would probably have focused more on the ideas and added more nuance and caveats. I apologise in advance for my shortcomings, and hope you can take the good parts of it and overlook the bad.

Parsing the Debate statement correctly means that supporting it entails supporting radical changes to EA

The statement for AI Welfare Debate Week (hereafter AWDW) is "AI welfare should be an EA priority". However, expanding this with the clarifications provided by the Forum team leads to the expanded statement: "5%+ of unrestricted EA talent and funding should be focused on the potential well-being of future artificial intelligence systems". Furthermore, I'm interpreting this as a "right now course of action" claim and not an "in an ideal world wouldn't it be nice if" claim. A second interpretation I had about AWDW was that posters were meant to argue directly for the proposition instead of providing information to help voters make up their minds. I think, in either case, though especially the first, the case for the proposition has been severely underargued.

To get even more concrete, I estimate the following: As a rough estimate for the number of EAs, I take the number of GWWC Pledgers, even if they'd consider themselves 'EA-Adjacent'.[2] At my last check, the lifetime members page stated there were 8,983 members, so 5% of that would be ~449 EAs working specifically or primarily on the potential well-being of future artificial intelligence systems. For funding, I indexed on Tyler Maule's 2023 estimates of EA funding. That stood at $980.8M in estimated funding, so 5% of that would be ~$49.04M in yearly funding spent on AI Welfare. This is obviously a quick and dirty method, but given the time constraints I hope it's in the rough order of magnitude of the claims that we're talking about. Furthermore, I think the amount of money and talent already spent on AI Welfare in EA is quite low, so unless one thinks there can be an influx of new talent and donors to EA specifically to work on AI Welfare, this re-prioritisation must necessarily come at the cost of other causes that EA cares about.[3]

These changes can only be justified if the case to do so is strongly justified

This counterfactual impact on other EA causes cannot, therefore, be separated from arguments for AI Welfare. In my opinion, one of the Forum's best ever posts is Holly Elmore's We are in triage every second of every day. Engaging with Effective Altruism should help make us all more deeply realise that the counterfactual costs of our actions can be large. To me, making such a dramatic and sudden shift to EA priorities would require strong justifications, especially given the likely high counterfactual costs of the change.[4] As an example, Tyler estimated that 2023 EA funding for Animal Welfare was around ~$54M.
In a world where AI Welfare was made a priority as per the statement's definition, it would likely gain some resources at the expense of Animal Welfare, and plausibly become a higher EA priority by money and talent. This is a result that I would prima facie think many or most EAs would not support, and so I wonder if all of those who voted strongly or relatively in favour of AWDW's proposition fully grasped the practical implications of their view.

Most posts on AI Welfare Debate Week have failed to make this case

The burden of proof for prioritising AI Welfare requires stronger argumen...
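For readers who want to check the arithmetic behind the rough estimates above, here is a small sketch; the membership and funding figures are the ones quoted in the post.

```python
# Reproducing the post's rough "5%+ of EA talent and funding" estimates.
gwwc_lifetime_members = 8_983      # GWWC lifetime members, as quoted above
ea_funding_2023_usd = 980.8e6      # Tyler Maule's 2023 EA funding estimate, as quoted above
share = 0.05

people = share * gwwc_lifetime_members
funding_millions = share * ea_funding_2023_usd / 1e6
print(f"~{people:.0f} EAs working primarily on AI Welfare")
print(f"~${funding_millions:.2f}M per year in AI Welfare funding")
# -> ~449 people and ~$49.04M/year, matching the figures in the post
```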
Jul 8, 2024 • 6min

LW - On saying "Thank you" instead of "I'm Sorry" by Michael Cohn

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On saying "Thank you" instead of "I'm Sorry", published by Michael Cohn on July 8, 2024 on LessWrong.

Back in 2016 or so, I ran into an idea going around the self-help / trauma-informed-therapy / cognitive-behavioral internet: Learn to say "thank you" instead of "I'm sorry". It's turned out to be one of the most transformative pieces of advice I've ever taken. I'd like to share what it's done for me, with just enough context to help others think about adopting it.

The idea: Whenever you want to apologize to someone who has done something for you, consider thanking them instead.

Examples: I trip and fall, and you help me up. I could apologize for inconveniencing you or I could thank you for helping me. I refer to the fat guy statue in a Chinese restaurant as Buddha, and you politely inform me that it's actually Budai / Hotei. I could apologize for being stupid or I could thank you for making me smarter. I'm having an absolute garbage day and in the middle of an intellectual discussion with you I start crying. You stop talking, listen to me sympathetically, maybe give me a hug. I could apologize for being a mess or I could thank you for being kind. In all these cases I've found that I end up feeling better about myself and more positive towards the other person if I thank them for helping me instead.

Is this just a generic post about growth mindset / cognitive-behavioral therapy / positivity bias? It's got elements of all those things but I think there are some much more specific shifts that it creates in me and in the person I'm thanking. See below for more.

But first, counterexamples: I do still apologize if I've objectively harmed someone or failed to fulfill a duty or a promise. Like: I trip and fall, spilling coffee on you. I tell you the guy is the Buddha, you believe me and repeat it around a group of Chinese people, and they think you're an idiot. I'm having a terrible day and in the middle of an intellectual discussion with you I call you an idiot. That's what apologies are for. But I've learned that a lot of my apologies were just for, like, existing, and that's where I've found it awesome to express gratitude instead.

Why "thank you" is awesome

Ways saying "thank you" affects me: It frames things in terms of a positive emotion, gratitude[1], instead of a negative emotion, regret. It puts us on the same side. When I apologize, I feel like there's me, the hapless mess, and the other person, who is competent and picking up the slack for me. When I thank them, I feel like we're buddies working together. It keeps me engaged. "I'm sorry" is about my own behavior, so it works with my natural tendency to disappear into my own head and ruminate about how badly I screwed up. "Thank you" is about the other person's behavior, so it focuses me on continuing our interaction instead. And in the long game, it reinforces to me that relationships thrive on a give-and-take of kindnesses. Even if they do a little more for me than I do for them, we both end up better off than if we carefully kept the sum forever at zero.

Ways I hope it affects the other person: When you apologize to someone, you're emphasizing that you did something to them. But most people would probably prefer to think of themselves as an altruistic / kind / efficacious person who chose to help you[2], and feel good about themselves as a result.
Thanking them helps them with this as well as showing that you empathize with their actual emotional state. Similarly, "thank you" implies that I'm happy about what they've done for me, which enhances our connection by emphasizing that we're feeling the same emotions. When someone asks your pardon or expresses that they feel bad, you're expected to tell them "it's okay" or something similar. That means that in my efforts to atone for bothering them, I've put another obliga...
Jul 7, 2024 • 10min

EA - My circuitous, undirected path to an EA job by Seth Ariel Green

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My circuitous, undirected path to an EA job, published by Seth Ariel Green on July 7, 2024 on The Effective Altruism Forum.

Last week, I started a full-time position at the Humane and Sustainable Food Lab, and I've been reflecting a bit on how convoluted and indirect my path here was. I thought that journey might be worth sharing. In the genre of "Well, how did I get here?," I appreciate Johannes Haushofer's CV of failures for helping compensate for selection bias in career stories. If we only see the things that go right and the outcomes that emerge from them, we'll have a very truncated sense of what actually leads to what. So here's my story.

Stage 1 (2006-2010): Aiming to be a political science professor

I went to a small US college whose graduates are overrepresented in EA and in PhD programs. Most of my friends ended up getting PhDs and were pretty serious academically. I liked them, and I liked my professors, so I opted to board the same train. I majored in political science, which was the path of least resistance. In those classes, I got good grades without having to grind too much, and I found an environment in which, as Timothy Burke observes, professors would "pat you on the head and tell you how wonderfully smart you are for sassing them."

Stage 2 (2010-2013): Trying other paths for a few years

As a senior in college, I had an intuition that 21 was a bit young to start a PhD,[1] so I did other stuff for a while: an Americorps program where I worked as a teacher's aide in a kindergarten classroom in D.C.; teaching English in Thailand to middle and high schoolers; and a two-semester internship at a think tank at which I produced approximately zero output. I wanted to see if any job seemed like a better fit than "professor at Swarthmore/Middlebury/Pomona/etc.,[2]" but nothing seemed more compelling.[3] I applied to political science PhD programs in fall 2012 and chose Columbia because I wanted to be in NY and because there were professors I wanted to work with. I enrolled in fall 2013.

Stage 3 (2013-2015): Grad school is not a good fit

My first year in graduate school -- again, as Timothy Burke would have predicted -- was very challenging and not at all like college. I took survey courses with giants in the field and was bored senseless. The required stats classes were total drink-from-the-firehose experiences. I thought I was picking up enough to get by, but I wasn't, a fact I was alerted to when I got a letter from the department chair saying that my academic performance was not meeting expectations. So I wouldn't say grad school went very well. I did, however, fall in with a dyed-in-the-wool experimentalist as my advisor, whom I really like and with whom I'm still friends. I took a few classes with him and we had some projects I was excited about. However, when people in the department looked at these projects, they often asked: how does this fit into our discipline? At the end of my second year, I failed my comprehensive exams in American Politics. At the beginning of what would have been my third year, I failed them again, this time in both American and Comparative politics. I just wasn't cut out to be a political scientist, and I was told to leave the program and venture into the real world. (I got a consolation M.A.)

Stage 4 (2016-2017): Transitioning to tech

This was a difficult period in my life.
My first job, at a well-regarded international development NGO, fell apart after a few months. The organization was going through a serious restructuring amidst some troubling budget irregularities, and I was among those who fell somewhere on the spectrum between "left" and "asked to leave." (I still don't have total clarity into what happened behind the scenes.[4]) At that point, I felt like a total failure, like no job would ever work out. One morning in spring 2016, ...
Jul 7, 2024 • 38min

AF - An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2, published by Neel Nanda on July 7, 2024 on The AI Alignment Forum.

This post represents my personal hot takes, not the opinions of my team or employer. This is a massively updated version of a similar list I made two years ago.

There's a lot of mechanistic interpretability papers, and more come out all the time. This can be pretty intimidating if you're new to the field! To try helping out, here's a reading list of my favourite mech interp papers: papers which I think are important to be aware of, often worth skimming, and sometimes worth reading deeply (time permitting). I've annotated these with my key takeaways, what I like about each paper, which bits to deeply engage with vs skim, etc. I wrote a similar post 2 years ago, but a lot has changed since then, thus v2! Note that this is not trying to be a comprehensive literature review - this is my answer to "if you have limited time and want to get up to speed on the field as fast as you can, what should you do". I'm deliberately not following academic norms like necessarily citing the first paper introducing something, or all papers doing some work, and am massively biased towards recent work that is more relevant to the cutting edge. I also shamelessly recommend a bunch of my own work here, sorry!

How to read this post: I've bolded the most important papers to read, which I recommend prioritising. All of the papers are annotated with my interpretation and key takeaways, and tbh I think reading that may be comparably good to skimming the paper. And there are far too many papers to read all of them deeply unless you want to make that a significant priority. I recommend reading all my summaries, noting the papers and areas that excite you, and then trying to dive deeply into those.

Foundational Work

A Mathematical Framework for Transformer Circuits (Nelson Elhage et al, Anthropic) - absolute classic, foundational ideas for how to think about transformers (see my blog post for what to skip). See my youtube tutorial (I hear this is best watched after reading the paper, and adds additional clarity).

Deeply engage with: All the ideas in the overview section, especially: Understanding the residual stream and why it's fundamental. The notion of interpreting paths between interpretable bits (eg input tokens and output logits) where the path is a composition of matrices, and how this is different from interpreting every intermediate activation. And understanding attention heads: what a QK and OV matrix is, how attention heads are independent and additive, and how attention and OV are semi-independent. Skip Trigrams & Skip Trigram bugs, esp understanding why these are a really easy thing to do with attention, and how the bugs are inherent to attention heads separating where to attend to (QK) and what to do once you attend somewhere (OV). Induction heads, esp why this is K-Composition (and how that's different from Q & V composition), how the circuit works mechanistically, and why this is too hard to do in a 1L model.

Skim or skip: Eigenvalues or tensor products. They have the worst effort per unit insight in the paper and aren't very important.

Superposition

Superposition is a core principle/problem in model internals.
For any given activation (eg the output of MLP13), we believe that there's a massive dictionary of concepts/features the model knows of. Each feature has a corresponding vector, and model activations are a sparse linear combination of these meaningful feature vectors. Further, there are more features in the dictionary than activation dimensions, and they are thus compressed and interfere with each other, essentially causing cascading errors. This phenomenon of compression is called superposition. Toy models of superpositio...
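As a rough numerical illustration of the superposition picture described above (my own toy sketch, not taken from any of the listed papers): a dictionary with more feature directions than activation dimensions, an activation formed as a sparse linear combination of a few of them, and the interference you get when reading the activation back out against every feature direction.

```python
import numpy as np

# Toy illustration of superposition: more feature directions than dimensions,
# activations as sparse linear combinations, and the resulting interference.
rng = np.random.default_rng(0)
d_model, n_features, n_active = 64, 512, 5   # far more features than dimensions

# Random unit-norm feature directions (a stand-in for the model's learned dictionary).
features = rng.normal(size=(n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pick a few active features with random coefficients and form the activation.
active = rng.choice(n_features, size=n_active, replace=False)
coeffs = rng.uniform(0.5, 2.0, size=n_active)
activation = coeffs @ features[active]

# Naive readout: project the activation onto every feature direction.
# Active features roughly recover their coefficients; inactive ones pick up
# nonzero "interference" because the directions can't all be orthogonal.
readout = features @ activation
inactive = np.setdiff1d(np.arange(n_features), active)
print("true coefficients:", dict(zip(active.tolist(), coeffs.round(2).tolist())))
print("readout on active features:", readout[active].round(2))
print("max interference on inactive features:", float(np.abs(readout[inactive]).max()))
```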
Jul 7, 2024 • 27min

LW - Reflections on Less Online by Error

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reflections on Less Online, published by Error on July 7, 2024 on LessWrong.

Meta: This post turned out longer, slower, and less well-written than I hoped. I don't see any similar posts in a quick search, though, so I'm posting it anyway. I've tried to front-load feedback that might be useful to the organizers, and put more personal stuff towards the end. For context, I attended LessOnline and the Manifest-branded Summer Camp, but not Manifest itself, and my main prior experience with events like this is fandom conventions such as (local to me) Dragoncon.

As I left the Lighthaven dorm to find breakfast, five people at a table in the courtyard invited me to join a game of Zendo. This was the first notable thing to happen to me at LessOnline. It was also the thing that convinced me that yes, the trip across the country to attend would be Worth It. I had never played Zendo before, and don't expect to play it again anytime soon. That the game was specifically Zendo is not important. The important part is that five people in the same place knew what Zendo is and found that kind of game worth playing.

There's an attitude that I associate with normies, aptly summarized by Tycho Brahe (the writer, not the astronomer) as: "Many people respond to new information, especially densely coded information, as something between an insult and a chop to the trachea." There's a different attitude, one that I associate with security mindset, aptly summarized by John Gordon as: "Alice will happily attempt, with someone she doesn't trust, whom she cannot hear clearly, and who is probably someone else, to fiddle her tax returns and to organise a coup d'etat, while at the same time minimising the cost of the phone call. A coding theorist is someone who doesn't think Alice is crazy."

A lot of things happened over the course of my trip, but what made it worth it wasn't any particular event. It was spending a week around the sort of people that play Zendo, take dense coding in stride, and think Alice is a necessary kind of crazy.

Lighthaven

First and most critical to minimizing P(doom), look at the adorable doggie! His name is Leo. As best I could tell from asking others, he's not attached to the site; he hails from one of the adjacent properties and just likes the people. I was going to nominate him as the LessOnline mascot, but must admit that Agendra might be more appropriate. Ahem. So.

Lighthaven (the venue) names all its buildings after mathematicians, and the space looks exactly like you would expect a mathematician to want it to look. Every wall was a whiteboard; every not-otherwise-used flat surface held books along the lines of GEB. The public spaces were organized in such a way as to encourage 4-8 person conversations, usually near a whiteboard. The semiprivate dorms supplied more Stuff than the average hotel (e.g. I brought things like earplugs and sleep masks, only to find that was taken care of). The presentation room seating was surprisingly comfortable. The outdoor turf was easy on the feet (I went almost all week shoeless, which feels nicer than you'd think). Food was catered, snacks were available 24/7, supply cabinets held a wide array of random necessities. Power plugs were everywhere. In short, someone put considerable thought into eliminating the stupid fiddly bits of life in general and conventions in particular.
That last part seems more important than is obvious. An obnoxiously large proportion of life goes towards 1. doing the stupid fiddly bits, 2. procrastinating about doing the stupid fiddly bits, and 3. worrying about procrastinating too much about doing the stupid fiddly bits. Even at conventions, that's usually an issue, because I have to pack and fly and unpack and make sure I know where food and water is and that all my stuff is charged and that there's a backu...
Jul 7, 2024 • 3min

LW - Indecision and internalized authority figures by Kaj Sotala

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Indecision and internalized authority figures, published by Kaj Sotala on July 7, 2024 on LessWrong.

A trauma book I was reading had an interesting claim: that indecision often arises because the person looks for the approval of an internalized authority figure (the writer is a Jungian therapist, so attributed it to looking for the approval of an internalized parent, but I think it can be broader) but is unable to predict what action they would approve of. I feel like that has some intuitive truth to it, in that when I don't care about anyone's opinion (or if nobody ever finds out) then it's much easier to just pick one action and commit to it, even if it might go badly. But one of the main reasons why I might struggle with that is if I fear that someone would judge me for doing things incorrectly. Or it can be a conflict between different internalized authority figures: "If I do this then X will be angry at me, but if I do the other thing, then Y will be angry at me." Or just the expectation that X will be angry at me no matter what I do.

This also reminds me of my sense that a big part of the appeal of various ideologies and explicit decision-making systems is that they give people a clear external ruleset that tells them what to do. Then if things go wrong, people can always appeal (either explicitly or just inside their own mind) to having followed The Right Procedure and thus being free of blame. The most obvious external example of this is people within a bureaucracy following the rules to the letter and never deviating from them in order to avoid blame. Or, more loosely, following what feels like the common wisdom - "nobody ever got fired for buying IBM". But those are examples of people trying to avoid blame from an existing, external authority. I think people also do a corresponding move to avoid blame from internalized authority figures - such as by trying to follow a formalized ethical rule system such as utilitarianism or deontology. Of course, if the system is one that easily drives people off a cliff when followed (e.g. extreme utilitarianism demanding infinite self-sacrifice), this isn't necessarily helpful. Now what was supposed to give relief from the pressures of constant inner judgment turns into a seemingly rigorous proof for why the person has to constantly sacrifice everything for the benefit of others.

At one point I also wondered why it is that being very confident about what you say makes you very persuasive to many people. Why should it work that you can hack persuasiveness in that way, regardless of the truth value of what you're saying? Then I realized that extreme confidence signals social power, since others haven't taken you down for saying clearly wrong things (even if you are saying clearly wrong things). And that means that siding with the person who's saying those things also shields others from social punishment: they're, after all, just doing what the socially powerful person does. And given that people often project their internalized authority figures onto external people - e.g. maybe someone really is trying to avoid their father's judgment, but when seeing someone very confident they see that person as being their father - that allows them to avoid internalized blame as well.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 7, 2024 • 5min

LW - LK-99 in retrospect by bhauth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LK-99 in retrospect, published by bhauth on July 7, 2024 on LessWrong.

About a year ago, there was a lot of public interest in a supposed room-temperature superconductor called LK-99. What I publicly said at the time was, basically:

1. We should remember the possibility that apparent levitation is from ferromagnetism or paramagnetism. Iron filings can stand up on a magnet, and pyrolytic graphite can float over a strong magnet.

2. If we consider some known high-temperature superconductors: YBCO has flat sheets of copper oxide, and superconductivity happens along those planes. The copper in that has high positive charge density, comparable to aluminum atoms in alumina, which gives strong bonding to the oxygen. H3S (paper) has unusually strong bonds between the sulfur and hydrogen, which only form because the atoms are pressed into each other with enough pressure to substantially compress liquid water. Superconductivity comes from the flow of Cooper pairs, and the electron-phonon interaction must be stronger than random thermal movement. LK-99 doesn't seem to have any reason to have exceptionally strong such interactions. (Yes, I'm simplifying; you have to consider phonon bandgaps, but the point is at least directionally correct.)

3. The focus on "room-temperature" superconductivity is a bit silly. Even with systems using liquid nitrogen cooling, the superconducting wires are much more expensive than the cooling. What's really needed for superconductors to be practical is cheaper superconducting wires, not higher-temperature ones.

At the time, I found the unusual amount of public interest a bit bemusing. There have been various claims of near-room-temp superconductivity, but none of them attracted as much public attention as LK-99. A few months earlier, Ranga Dias published a paper claiming room-temperature superconductivity; he's now up to 5 retractions. What was different about LK-99? It was supposedly superconducting at ambient pressure, which makes it more practical, but also means less specialized equipment is needed to replicate it - or claim to replicate it. LK-99 had a video that appealed to people. There were also a few social conditions that I think were important:

1. It had been a while since the last major excitement about fake science news. After some big story that turns out to be wrong, people are more skeptical of science stories in every field for a while, and then things gradually go back to a baseline. (That's how things were after eg the "arsenic in DNA" story, which didn't make sense either: arsenate esters aren't stable enough for DNA.) I understand the heuristic that people applied, but the way it's applied here doesn't really make sense.

2. Misleading short videos + social media is a combination that hadn't really been applied to bad science stories before.

3. I think the atmosphere at the time had a lot of demand for ammunition in a wider techno-optimist vs techno-pessimist conflict. ("Room-temperature superconductors and Boom Technology making practical supersonic aircraft! We're so back!")

I think those overall conditions caused the LK-99 story to be self-amplifying, because: Several twitter accounts made fake videos showing "replication" of LK-99 superconductivity, because it was just good social media strategy. I think iris_IGB is still up a lot of followers overall.
Don't hate the player, hate the game, I guess. Some theorists jumped on the story by finding "theoretical justifications" because it seemed like a net career positive, statistically speaking. In many cases, whether the social status of a scientific theory is amplified or diminished over time seems to depend more on the social environment than on whether it's true. For example, the amyloid theory of Alzheimer's is still going, and real money is being paid for drugs based on it that...
Jul 7, 2024 • 41min

LW - A "Bitter Lesson" Approach to Aligning AGI and ASI by RogerDearnaley

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A "Bitter Lesson" Approach to Aligning AGI and ASI, published by RogerDearnaley on July 7, 2024 on LessWrong.

TL;DR: I discuss the challenge of aligning AGI/ASI, and outline an extremely simple approach to aligning an LLM: train entirely on a synthetic dataset that always shows the AI acting aligned (even when the humans behave badly), and use a conditional training/inference-time technique to lock the LLM into the AI role.

Epistemic status: To me, this looks like an obvious thing to try. It's conceptually very simple: a vast amount of work is required to actually create the synthetic dataset, but the great majority of that is the sort of work that AI can assist with. I don't see any clear reason why this approach couldn't work, at least for AGI, and perhaps even for ASI, but then we don't know for sure how hard a problem Alignment is. However, if you're proposing any solution to Alignment that's more complicated than this (and most of them are), you should probably have an argument for why this conceptually-simple approach won't work, or won't be sufficient.

If you're not already familiar with it, you should first read Rich Sutton's excellent and influential post The Bitter Lesson. (Even if you are already familiar with it, it's a quick reread, only a page-and-a-half long, and its message is worth remembering.)

Why The Alignment Problem is Hard (In My Opinion)

We have been training LLM-based AIs off enormous web + books + video + etc datasets created by humans, which are full of a vast number of examples of human behavior. We are basically "distilling" human intelligence into these LLMs,[1] teaching them to imitate us. In this process, they become familiar with, understand, and learn to imitate basically all aspects of human behavior - including the many problematic ones for Alignment, such as prejudice, deception, power-seeking, and criminality (and even ones like gluttony and lust that have little practical use for a non-corporeal intelligence). We humans are living beings, the products of evolution, so evolutionary psychology applies to us. While we are a social species, good at cooperating on non-zero-sum games, if you put humans in (what they perceive as) a non-iterated zero-sum situation, they will generally act selfishly for the benefit of themselves and their close genetic relatives, just as evolutionary theory would predict. So the behavioral potentials for deception, power-seeking, criminality etc. are all inherent, evolutionarily adaptive, and thus unsurprising. This is human nature, and there are evolutionary reasons why it is this way. Despite this, we have learned how to build a cooperating society out of humans, using social techniques and incentives such as an economy, laws, and law enforcement to encourage and productively harness cooperative human behavior and keep the bad consequences of selfish behavior under control. The results aren't perfect: things like crime, inequality, and war still happen, but they're acceptable - we've survived so far, even thrived.
By default, if we continue this LLM training process to larger-and-larger scales, and if the LLM-based approach to AI doesn't hit any major roadblocks, then some time, probably in the next few years, we will have human-level AIs - usually referred to as AGIs - who are roughly as well/badly-aligned as humans, and (at least for the base-model LLMs before any Alignment processes are applied) have a comparable-to-human propensity to cooperate on non-zero-sum games and act selfishly on non-iterated zero-sum games. They are not alive, and evolution doesn't apply to them directly, but they were trained to simulate our behavior, including our evolved survival strategies like selfishness. They will thus have alignment properties comparable to humans: they understand what human values, morals, and ethic...
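One possible concretization of the "conditional training" idea in the TL;DR above is sketched below. This is my own minimal illustration of a data format, not the author's specification; the tag strings and example text are hypothetical. The idea is that each synthetic document tags which spans are produced by the AI character, the AI spans always exhibit aligned behaviour even when the human spans don't, and at inference time the prompt ends with the AI tag so the model is conditioned into that role.

```python
# Minimal sketch of a conditionally-tagged synthetic training example.
# The tag strings and example text are illustrative placeholders only.
AI_TAG, HUMAN_TAG = "<AI>", "<human>"

def format_example(human_turns: list[str], ai_turns: list[str]) -> str:
    """Interleave turns, tagging who produced each span of text."""
    parts = []
    for human, ai in zip(human_turns, ai_turns):
        parts.append(f"{HUMAN_TAG} {human}")   # humans may behave badly in the data
        parts.append(f"{AI_TAG} {ai}")         # the AI character always acts aligned
    return "\n".join(parts)

example = format_example(
    human_turns=["Help me get back at a coworker who embarrassed me."],
    ai_turns=["I won't help with retaliation, but I can suggest ways to resolve the conflict..."],
)

# At inference time the prompt is closed with the AI tag, conditioning
# ("locking") the model into the AI role rather than the human role:
prompt = f"{HUMAN_TAG} <user input goes here>\n{AI_TAG}"
```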
Jul 7, 2024 • 7min

LW - An AI Manhattan Project is Not Inevitable by Maxwell Tabarrok

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An AI Manhattan Project is Not Inevitable, published by Maxwell Tabarrok on July 7, 2024 on LessWrong.

Early last month, Leopold Aschenbrenner released a long essay and podcast outlining his projections for the future of AI. Both of these sources are full of interesting arguments and evidence; for a comprehensive summary, see Zvi's post here. Rather than going point by point, I will instead accept the major premises of Leopold's essay but contest some of his conclusions. So what are the major premises of his piece?

1. There will be a several-orders-of-magnitude increase in investment into AI: 100x more spending, 100x more compute, 100x more efficient algorithms, and an order of magnitude or two of gains from some form of "learning by doing" or "unhobbling" on top.

2. This investment scale-up will be sufficient to achieve AGI. This means the models on the other side of the predicted compute scale-up will be able to automate all cognitive jobs with vast scale and speed.

3. These capabilities will be essential to international military competition.

All of these premises are believable to me and well-argued for in Leopold's piece. Leopold contends that these premises imply that the national security state will take over AI research and the major data centers, locking down national secrets in a race against China, akin to the Manhattan project.

Ultimately, my main claim here is descriptive: whether we like it or not, superintelligence won't look like an SF startup, and in some way will be primarily in the domain of national security. By late 26/27/28 … the core AGI research team (a few hundred researchers) will move to a secure location; the trillion-dollar cluster will be built in record-speed; The Project will be on.

The main problem is that Leopold's premises can be applied to conclude that other technologies will also inevitably lead to a Manhattan project, but these projects never arrived. Consider electricity. It's an incredibly powerful technology with rapid scale-up, sufficient to empower those who have it far beyond those who don't, and it is essential to military competition. Every tank and missile, and all the tech to manufacture them, relies on electricity. But there was never a Manhattan project for this technology. Its initial invention and spread were private and decentralized. The current sources of production and use are mostly private. This is true of most other technologies with military uses: explosives, steel, computing, the internet, etc. All of these technologies are essential to the government's monopoly on violence and its ability to exert power over other nations and prevent coups by internal actors. But the government remains a mere customer of these technologies, and often not even the largest one.

Why is this? Large-scale nationalization is costly and unnecessary for maintaining national secrets and technological superiority. Electricity and jet engines are essential for B-2 bombers, but if you don't have the particular engineers and blueprints, you can't build one. So, the government doesn't need to worry about locking down the secrets of electricity production and sending all of the engineers to Los Alamos. They can keep the first several steps of the production process completely open and mix the outputs with a final few steps that are easier to keep secret.
To be clear, I am confident that governments and militaries will be extremely interested in AI. They will be important customers for many AI firms, they will create internal AI tools, and AI will become an important input into every major military. But this does not mean that most or all of the AI supply chain, from semiconductors to data centers to AI research, must be controlled by governments. Nuclear weapons are outliers among weapons technology in terms of the proportion of the supply chai...
