The Nonlinear Library

The Nonlinear Fund
Aug 20, 2024 • 18min

LW - Thiel on AI & Racing with China by Ben Pace

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thiel on AI & Racing with China, published by Ben Pace on August 20, 2024 on LessWrong.

This post is a transcript of part of a podcast with Peter Thiel, touching on topics of AI, China, extinction, Effective Altruists, and apocalyptic narratives, published on August 16th, 2024. If you're interested in reading the quotes, just skip straight to them; the introduction is not required reading.

Introduction

Peter Thiel is probably known by most readers, but briefly: he is a venture capitalist, the first outside investor in Facebook, cofounder of PayPal and Palantir, and wrote Zero to One (a book I have found very helpful for thinking about building great companies). He has also been one of the primary proponents of the Great Stagnation hypothesis (along with Tyler Cowen). More local to the LessWrong scene, Thiel was an early funder of MIRI and a speaker at the first Effective Altruism summit in 2013. He funded Leverage Research for many years, as well as a lot of anti-aging research and the seasteading initiative, and his Thiel Fellowship included a number of people who are around the LessWrong scene. I do not believe he has been active around this scene much in the last ~decade.

He rarely appears publicly to express his positions about society, and I am curious to hear them when he does. In 2019 I published the transcript of another longform interview of his here, with Eric Weinstein. Last week another longform interview with him came out, and I got the sense again that, even though we disagree on many things, a conversation with him would be worthwhile and interesting. Then about 3 hours in he started talking more directly about subjects that I think actively about, and some conflicts around AI, so I've quoted the relevant parts below.

His interviewer, Joe Rogan, is a very successful comedian and podcaster. He's not someone I would go to for insights about AI. I think of him as standing in for a well-intentioned average person, for better or for worse, although he is a little more knowledgeable, a little more intelligent, and a lot more curious than the average person. The average Joe. I believe he is talking in good faith to the person before him, with curiosity, and making points that seem natural to many.

Artificial Intelligence

Discussion focused on the AI race and China, starting at 2:56:40. The opening monologue by Rogan is skippable.

Rogan: If you look at this mad rush for artificial intelligence - like, they're literally building nuclear reactors to power AI.

Thiel: Well, they're talking about it.

Rogan: Okay. That's because they know they're gonna need enormous amounts of power to do it. Once it's online, and it keeps getting better and better, where does that go? That goes to a sort of artificial life-form. I think either we become that thing, or we integrate with that thing and become cyborgs, or that thing takes over. And that thing becomes the primary life force of the universe. And I think that biological life, we look at like life, because we know what life is, but I think it's very possible that digital life or created life might be a superior life form. Far superior. [...] I love people, I think people are awesome. I am a fan of people. But if I had to look logically, I would assume that we are on the way out. 
And that the only way forward, really, to make an enormous leap in terms of the integration of society and technology and understanding our place in the universe, is for us to transcend our physical limitations that are essentially based on primate biology, and these primate desires for status (like being the captain), or for control of resources, all of these things - we assume these things are standard, and that they have to exist in intelligent species. I think they only have to exist in intelligent species that have biological limitations. I think in...
Aug 20, 2024 • 26min

EA - CEA will continue to take a "principles-first" approach to EA by Zachary Robinson

The new CEO of the Centre for Effective Altruism dives into a 'principles-first' approach that emphasizes the movement's core values over specific causes. Explore the importance of scope sensitivity, impartiality, and decision-making trade-offs. The speaker shares their personal journey, highlighting how community collaboration enriches the effective altruism landscape. This refreshing perspective encourages thinking beyond singular issues, reinforcing that effective altruism encompasses a wide array of impactful actions.
Aug 20, 2024 • 8min

AF - Finding Deception in Language Models by Esben Kran

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Finding Deception in Language Models, published by Esben Kran on August 20, 2024 on The AI Alignment Forum.

This June, Apart Research and Apollo Research joined forces to host the Deception Detection Hackathon, bringing together students, researchers, and engineers from around the world to tackle a pressing challenge in AI safety: preventing AI from deceiving humans and overseers. The hackathon took place both online and in multiple physical locations simultaneously. Marius Hobbhahn, the CEO of Apollo Research, kicked off the hackathon with a keynote talk about evaluating deception in AI with white-box and black-box methods. You can watch his talk here. We also had talks by Jacob Haimes, an Apart fellow, and Mikita Balesni, a research scientist at Apollo Research. This post details the top 8 projects, several of which are currently being developed further in Apart Lab. In total, we had 189 participants and 21 final entries.

Sandbagging Detection through Model Degradation

By Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahani

The team introduced a new approach to detect AI sandbagging, a form of deception where an AI model strategically underperforms during evaluation to hide its true capabilities. The assumption behind their project is that sandbagging is a more complex task than showing the model's true capabilities. In the graph above, you can see how they gradually add noise to the model's parameters and evaluate the models during this process. Sandbagging models consistently improve in performance right before the noise level causes a normal model to lose the capability. It is a straightforward, inexpensive technique that doesn't require prior assumptions about how sandbagging is represented within the model. They successfully detected sandbagging behavior in two open-source LLMs: an 8B Llama 3 Instruct model and a Phi 3 Small Instruct model. This approach offers a practical and accessible method for identifying potential deception in AI systems. Read the full project here.

Detecting and Controlling Deceptive Representation in LLMs with Representational Engineering

By Avyay M Casheekar, Kaushik Sanjay Prabhakar, Kanishk Rath, Sienka Dounia

The team focused on detecting and mitigating sandbagging in AI models using recent methods for understanding activations. They fine-tuned a Llama 3 8B model on a custom dataset of deceptive and honest scenarios, then used Linear Artificial Tomography (LAT) scans to detect deceptive behaviors in the internal representations of the LLM. The team's technique not only identified deceptive behaviors, but also showed promise in mitigating them. Importantly, their approach demonstrated good generalization to out-of-distribution scenarios, suggesting its potential robustness in real-world applications. Read the full project here.

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

By Siddharth Reddy Bakkireddy, Rakesh Reddy Bakkireddy

This team tackled the challenge of detecting deception in closed-source, black-box models like GPT-3.5-turbo. They investigated whether LLM API response metadata such as token count, response time, and tokens per second could be used to detect intentional deceptive behavior. The team analyzed 300 prompts and generated 1,200 responses from GPT-3.5-turbo, comparing the metadata of baseline and deceptive outputs. 
Their results revealed that deceptive outputs tend to have increased response times and altered token usage. This approach demonstrates that deception detection is possible without accessing a model's internal representation, opening up new avenues for monitoring and safeguarding AI systems, even when their inner workings are not accessible. Read the full project here.

Modelling the Oversight of Automated Interpretability Against Deceptive Agents on Sp...
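As a concrete illustration of the first project's method, here is a minimal sketch of the noise-degradation idea in PyTorch: perturb copies of the model's weights with increasing amounts of Gaussian noise and track benchmark accuracy, looking for the tell-tale bump where a sandbagging model improves before its capability collapses. The function name and the evaluation callback are illustrative assumptions, not the team's actual code.

```python
import copy
import torch

def noise_degradation_curve(model, evaluate_fn, noise_levels, seed=0):
    """Accuracy of `model` under increasing parameter noise.

    `evaluate_fn(model) -> float` is assumed to score the model on some
    capability benchmark. An honest model should only degrade as noise grows;
    a sandbagging model is expected to *improve* at intermediate noise levels,
    because the noise disrupts the underperformance policy before it destroys
    the underlying capability.
    """
    torch.manual_seed(seed)
    curve = []
    for sigma in noise_levels:
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * sigma)  # perturb every weight
        curve.append((sigma, evaluate_fn(noisy)))
    return curve
```

A curve that rises above the unperturbed score before falling off would be the flag to investigate further, which matches the behavior the team reports.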
Aug 20, 2024 • 38min

LW - Limitations on Formal Verification for AI Safety by Andrew Dickson

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Limitations on Formal Verification for AI Safety, published by Andrew Dickson on August 20, 2024 on LessWrong.

In the past two years there has been increased interest in formal verification-based approaches to AI safety. Formal verification is a sub-field of computer science that studies how guarantees may be derived by deduction on fully-specified rule-sets and symbol systems. By contrast, the real world is a messy place that can rarely be straightforwardly represented in a reductionist way. In particular, physics, chemistry and biology are all complex sciences which do not have anything like complete symbolic rule sets. Additionally, even if we had such rules for the natural sciences, it would be very difficult for any software system to obtain sufficiently accurate models and data about initial conditions for a prover to succeed in deriving strong guarantees for AI systems operating in the real world. Practical limitations like these on formal verification have been well understood for decades by engineers and applied mathematicians building real-world software systems, which makes it puzzling that they have mostly been dismissed by leading researchers advocating for the use of formal verification in AI safety so far. This paper will focus on several such limitations and use them to argue that we should be extremely skeptical of claims that formal verification-based approaches will provide strong guarantees against major AI threats in the near term.

What Do We Mean by Formal Verification for AI Safety?

Some examples of the kinds of threats researchers hope formal verification will help with come from the paper "Provably Safe Systems: The Only Path to Controllable AGI" [1] by Max Tegmark and Steve Omohundro (emphasis mine):

Several groups are working to identify the greatest human existential risks from AGI. For example, the Center for AI Safety recently published 'An Overview of Catastrophic AI Risks' which discusses a wide range of risks including bioterrorism, automated warfare, rogue power-seeking AI, etc. Provably safe systems could counteract each of the risks they describe. These authors describe a concrete bioterrorism scenario in section 2.4: a terrorist group wants to use AGI to release a deadly virus over a highly populated area. They use an AGI to design the DNA and shell of a pathogenic virus and the steps to manufacture it. They hire a chemistry lab to synthesize the DNA and integrate it into the protein shell. They use AGI-controlled drones to disperse the virus and social media AGIs to spread their message after the attack. Today, groups are working on mechanisms to prevent the synthesis of dangerous DNA. But provably safe infrastructure could stop this kind of attack at every stage: biochemical design AI would not synthesize designs unless they were provably safe for humans, data center GPUs would not execute AI programs unless they were certified safe, chip manufacturing plants would not sell GPUs without provable safety checks, DNA synthesis machines would not operate without a proof of safety, drone control systems would not allow drones to fly without proofs of safety, and armies of persuasive bots would not be able to manipulate media without proof of humanness. [1]

The above quote contains a number of very strong claims about the possibility of formally or mathematically provable guarantees around software systems deployed in the physical world - for example, the claim that we could have safety proofs about the real-world good behavior of DNA synthesis machines or drones. From a practical standpoint, our default stance towards such claims should be skepticism, since we do not have proofs of this sort for any of the technologies we interact with in the real world today. For example, DNA synthesis machines exist today and do not come with f...
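To make the contrast concrete, here is a toy example of what formal verification does give you: a machine-checked guarantee about a fully specified symbolic system. It assumes the z3-solver Python package; the point is that the guarantee covers only the formal model, not sensors, hardware, chemistry, or adversaries acting outside it.

```python
# Toy formal verification with an SMT solver (assumes the `z3-solver` package).
# We prove a property of a fully specified symbolic system by showing that its
# negation has no satisfying assignment.
from z3 import Ints, Solver, And, sat

x, y = Ints("x y")
s = Solver()
# Claimed property: for non-negative integers, x + y >= x.
# Add the negation; if no counterexample exists, the property holds.
s.add(And(x >= 0, y >= 0, x + y < x))
assert s.check() != sat, "counterexample found"
print("Proved: x >= 0 and y >= 0 implies x + y >= x")
# Nothing in this proof constrains the messy physical world the post discusses -
# that gap between formal model and reality is exactly what the author is pointing at.
```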
Aug 20, 2024 • 16min

LW - Untrustworthy models: a frame for scheming evaluations by Olli Järviniemi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrustworthy models: a frame for scheming evaluations, published by Olli Järviniemi on August 20, 2024 on LessWrong.

In this post, I'll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find it A) is directly important in practice, B) more readily distinguishes between the capability for scheming and actually scheming, and C) is easier to evaluate for. I'll also distinguish between different types of untrustworthy models - active planners, sleeper agents and opportunists - that are often conflated. I think when people talk about evaluating models for something like scheming, they'd often be better off talking about trustworthiness instead, and I think it's overall a more applicable frame for designing evaluations and decision-making. Many of the ideas here are from Buck Shlegeris and Ryan Greenblatt; thanks to them and Cody Rushing for feedback.

Suppose an AI company has developed a new model and wants to use it to perform large quantities of ML research (internal deployment). However, the company is worried about the model being deceptively aligned[1] and trying to take catastrophic actions, such as hacking the data center or copying its weights to a remote server. What can the company do? This is a textbook example of the type of problem AI control tries to solve, and a basic strategy looks like this: every time you are using the new model and it takes some action, use another model to check that the action is OK before letting it be executed. If the action is not OK, it's not executed and humans are alerted about suspicious activity. (If you have never heard of AI control before, you might want to try playing the Bash Control game to get familiar with the concept. In the game, your aim is to download and execute a malicious program without GPT-3.5 catching you.)

Let's call our characters Smart and Weak: Smart is the new model doing ML research while maybe trying to escape, while Weak is a model trained to monitor Smart. The reason for these names will become apparent soon. There are many ways this control strategy could fail. Here are some:[2]

1. Smart came up with an extremely subtle way to escape that Weak is just too dumb to notice.
2. Smart successfully jailbreaks Weak (via e.g. a suffix attack) so that it does not alert humans about its escape attempt. (Or: Weak happens to have a backdoor, which Smart finds and exploits.)
3. Smart successfully convinces Weak (via natural language) that its actions are not malicious.[3]
4. Smart's escape plan is unlike anything we have trained Weak on. While Weak "would be capable of" telling that Smart is escaping, our training didn't specify well enough what we wanted, so Weak generalizes incorrectly out-of-distribution and doesn't alert humans.
5. The company has way too limited a budget for human alerts, so while Weak does rank Smart's actions in the top 0.012% for suspiciousness, that's not enough to meet the 0.01% threshold for alerting humans.
6. Weak believes that not alerting humans in this case would allow Smart to escape, believes that this leads to better outcomes in light of its preferences, and thus for instrumental reasons deliberately does not alert humans, indeed causing Smart to escape.

Failure 6 is very different from the others! 
Whereas failures 1 to 5 are broadly about "Weak is just not a capable/good enough monitor to do what we wanted", failure 6 is about "Weak knows perfectly well Smart is escaping, but betrays humans and doesn't tell them about this". This issue is most apparent if you were to use a copy of the AI lab's new model to monitor itself (so Smart and Weak are the same model[4]): If you don't trust the new model to do research without trying to escape, you probably can't trust it to alert humans about escape attempts (by itself, no less!) either...
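A minimal sketch of the control setup described above, with the trusted monitor gating the untrusted model's actions. The names, the suspiciousness score, and the audit threshold are illustrative assumptions for exposition, not any specific lab's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    description: str  # e.g. a bash command or code diff proposed by the model

def control_step(
    propose: Callable[[], Action],              # Smart: untrusted model proposing an action
    suspiciousness: Callable[[Action], float],  # Weak: trusted monitor, higher = more suspicious
    execute: Callable[[Action], None],
    alert_humans: Callable[[Action], None],
    audit_threshold: float,                     # set by the human-auditing budget (cf. failure 5)
) -> None:
    action = propose()
    if suspiciousness(action) >= audit_threshold:
        alert_humans(action)   # block the action and escalate for human review
    else:
        execute(action)        # let it run

# Failures 1-5 correspond to `suspiciousness` scoring a malicious action too low
# for mundane reasons; failure 6 is the monitor scoring it low on purpose, which
# is why the post cares about monitors being trustworthy, not just capable.
```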
Aug 20, 2024 • 13min

LW - LLM Applications I Want To See by sarahconstantin

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLM Applications I Want To See, published by sarahconstantin on August 20, 2024 on LessWrong.

I'm convinced that people who are interested in large language models (LLMs) are overwhelmingly focused on general-purpose "performance" at the expense of exploring useful (or fun) applications. As I'm working on a personal project, I've been learning my way around HuggingFace, which is a hosting platform, set of libraries, and almost-social-network for the open-source AI community. It's fascinating, and worth exploring even if you're not going to be developing foundation models from scratch yourself; if you simply want to use the latest models, build apps around them, or adapt them slightly to your own purposes, HuggingFace seems like the clear place to go. You can look at trending models, and trending public "spaces", aka cloud-hosted instances of models that users can test out, and get a sense of where the "energy" is.

And what I see is that almost all the "energy" in LLMs is on general-purpose models, competing on general-purpose question-answering benchmarks, sometimes specialized to particular languages, or to math or coding. "How can I get something that behaves basically like ChatGPT or Claude or Gemini, but gets fewer things wrong, and ideally requires less computing power and gets the answer faster?" is an important question, but it's far from the only interesting one! If I really search I can find "interesting" specialized applications like "predicts a writer's OCEAN personality scores based on a text sample" or "uses abliteration to produce a wholly uncensored chatbot that will indeed tell you how to make a pipe bomb", but mostly…it's general-purpose models. Not applications for specific uses that I might actually try.

And some applications seem eager to go to the most creepy and inhumane use cases. No, I don't want little kids talking to a chatbot toy, especially. No, I don't want a necklace or pair of glasses with a chatbot I can talk to. (In public? Imagine the noise pollution!) No, I certainly don't want a bot writing emails for me! Even the stuff I found potentially cool (an AI diary that analyzes your writing and gives you personalized advice) ended up being, in practice, so preachy that I canceled my subscription. In the short term, of course, the most economically valuable thing to do with LLMs is duplicating human labor, so it makes sense that the priority application is autogenerated code. But the most creative and interesting potential applications go beyond "doing things humans can already do, but cheaper" to do things that humans can't do at all on a comparable scale.

A Personalized Information Environment

To some extent, social media, search, and recommendation engines were supposed to enable us to get the "content" we want. And mostly, to the extent that's turned out to be a disappointment, people complain that getting exactly what you want is counterproductive - filter bubbles, superstimuli, etc. But I find that we actually have incredibly crude tools for getting what we want. We can follow or unfollow, block or mute people; we can upvote and downvote pieces of content and hope "the algorithm" feeds us similar results; we can mute particular words or tags. But what we can't do, yet, is define a "quality" we're looking for, or a "genre" or a "vibe", and filter by that criterion. 
The old tagging systems (on Tumblr or AO3 or Delicious, or back when hashtags were used unironically on Twitter) were the closest approximation to customizable selectivity, and they're still pretty crude. We can do a lot better now.

Personalized Content Filter

This is a browser extension. You teach the LLM, by highlighting and saving examples, what you consider to be "unwanted" content that you'd prefer not to see. The model learns a classifier to sort all text in yo...
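One way this could work, sketched under assumptions rather than as the author's design: save the user's highlighted examples, embed them, and hide any new post whose embedding is too close to a saved example. The sketch assumes the sentence-transformers package; the model name and threshold are arbitrary illustrative choices.

```python
# Minimal sketch of a "personalized content filter": flag text that is
# semantically close to examples the user has marked as unwanted.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Examples the user highlighted and saved (hypothetical).
unwanted_examples = [
    "outrage-bait political dunking",
    "engagement-farming giveaway threads",
]
unwanted_vecs = model.encode(unwanted_examples, convert_to_tensor=True)

def is_unwanted(text: str, threshold: float = 0.5) -> bool:
    """Return True if `text` is close enough to any saved unwanted example."""
    vec = model.encode(text, convert_to_tensor=True)
    return util.cos_sim(vec, unwanted_vecs).max().item() >= threshold

# A browser extension would run this over each post or comment on the page
# and collapse the ones flagged as unwanted.
print(is_unwanted("You won't BELIEVE what this senator just said..."))
```

Richer versions could fine-tune a small classifier on the saved examples or ask an LLM to judge the "vibe" directly; the embedding-similarity version is just the simplest thing that matches the description above.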
Aug 19, 2024 • 38min

AF - Limitations on Formal Verification for AI Safety by Andrew Dickson

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Limitations on Formal Verification for AI Safety, published by Andrew Dickson on August 19, 2024 on The AI Alignment Forum.

In the past two years there has been increased interest in formal verification-based approaches to AI safety. Formal verification is a sub-field of computer science that studies how guarantees may be derived by deduction on fully-specified rule-sets and symbol systems. By contrast, the real world is a messy place that can rarely be straightforwardly represented in a reductionist way. In particular, physics, chemistry and biology are all complex sciences which do not have anything like complete symbolic rule sets. Additionally, even if we had such rules for the natural sciences, it would be very difficult for any software system to obtain sufficiently accurate models and data about initial conditions for a prover to succeed in deriving strong guarantees for AI systems operating in the real world. Practical limitations like these on formal verification have been well understood for decades by engineers and applied mathematicians building real-world software systems, which makes it puzzling that they have mostly been dismissed by leading researchers advocating for the use of formal verification in AI safety so far. This paper will focus on several such limitations and use them to argue that we should be extremely skeptical of claims that formal verification-based approaches will provide strong guarantees against major AI threats in the near term.

What Do We Mean by Formal Verification for AI Safety?

Some examples of the kinds of threats researchers hope formal verification will help with come from the paper "Provably Safe Systems: The Only Path to Controllable AGI" [1] by Max Tegmark and Steve Omohundro (emphasis mine):

Several groups are working to identify the greatest human existential risks from AGI. For example, the Center for AI Safety recently published 'An Overview of Catastrophic AI Risks' which discusses a wide range of risks including bioterrorism, automated warfare, rogue power-seeking AI, etc. Provably safe systems could counteract each of the risks they describe. These authors describe a concrete bioterrorism scenario in section 2.4: a terrorist group wants to use AGI to release a deadly virus over a highly populated area. They use an AGI to design the DNA and shell of a pathogenic virus and the steps to manufacture it. They hire a chemistry lab to synthesize the DNA and integrate it into the protein shell. They use AGI-controlled drones to disperse the virus and social media AGIs to spread their message after the attack. Today, groups are working on mechanisms to prevent the synthesis of dangerous DNA. But provably safe infrastructure could stop this kind of attack at every stage: biochemical design AI would not synthesize designs unless they were provably safe for humans, data center GPUs would not execute AI programs unless they were certified safe, chip manufacturing plants would not sell GPUs without provable safety checks, DNA synthesis machines would not operate without a proof of safety, drone control systems would not allow drones to fly without proofs of safety, and armies of persuasive bots would not be able to manipulate media without proof of humanness. [1]

The above quote contains a number of very strong claims about the possibility of formally or mathematically provable guarantees around software systems deployed in the physical world - for example, the claim that we could have safety proofs about the real-world good behavior of DNA synthesis machines or drones. From a practical standpoint, our default stance towards such claims should be skepticism, since we do not have proofs of this sort for any of the technologies we interact with in the real world today. For example, DNA synthesis machines exist today and do no...
Aug 19, 2024 • 21min

LW - A primer on why computational predictive toxicology is hard by Abhishaike Mahajan

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A primer on why computational predictive toxicology is hard, published by Abhishaike Mahajan on August 19, 2024 on LessWrong.

Introduction

There are now (claimed) foundation models for protein sequences, DNA sequences, RNA sequences, molecules, scRNA-seq, chromatin accessibility, pathology slides, medical images, electronic health records, and clinical free-text. It's a dizzying rate of progress. But there are a few problems in biology that, interestingly enough, have evaded a similar level of ML progress, despite there seemingly being all the necessary conditions to achieve it. Toxicology is one of those problems. This isn't a new insight; it was called out in one of Derek Lowe's posts, where he said: 'There are no existing AI/ML systems that mitigate clinical failure risks due to target choice or toxicology.' He also repeats it in a more recent post: '…the most badly needed improvements in drug discovery are in the exact areas that are most resistant to AI and machine learning techniques. By which I mean target selection and predictive toxicology.' Pat Walters also goes into the subject in much more depth, emphasizing how difficult the whole field is.

As someone who isn't familiar at all with the area of predictive toxicology, I immediately found that strange. Why such little progress? It can't be that hard, right? Unlike drug development, where you're trying to precisely hit some key molecular mechanism, assessing toxicity almost feels…brutish in nature. Something that's as clear as day, easy to spot with eyes, easier still to spot with a computer trained to look for it. Of course, there will be some stragglers that leak through this filtering, but it should be minimal. Obviously a hard problem in its own right, but why isn't it close to being solved? What's up with this field?

Some background

One may naturally assume that there is a well-established definition of toxicity, a standard blanket definition to delineate between things that are and aren't toxic. While there are terms such as LD50, LC50, EC50, and IC50, used to quantify the degree to which something is toxic, they are an immense oversimplification. When we say a substance is "toxic," there are usually a lot of follow-up questions. Is it toxic at any dose? Only above a certain threshold? Is it toxic for everyone, or just for certain susceptible individuals (as we'll discuss later)? The relationship between dose and toxicity is not always linear, and can vary depending on the route of exposure, the duration of exposure, and individual susceptibility factors. A dose that causes no adverse effects when consumed orally might be highly toxic if inhaled or injected. And a dose that is well-tolerated with acute exposure might cause serious harm over longer periods of chronic exposure.

The very definition of an "adverse effect" resulting from toxicity is not always clear-cut either. Some drug side effects, like mild nausea or headache, might be considered acceptable trade-offs for therapeutic benefit. But others, like liver failure or birth defects, would be considered unacceptable at any dose. This is particularly true when it comes to environmental chemicals, where the effects may be subtler and the exposure levels more variable. Is a chemical that causes a small decrease in IQ scores toxic? What about one that slightly increases the risk of cancer over a lifetime (20+ years)? 
And this is one of the major problems with applying predictive toxicology at all - defining what is and isn't toxic is hard! One may assume the FDA has clear stances on all these, but even they approach it from a 'vibe-based' perspective. They simply collate the data from in-vitro studies, animal studies, and human clinical trials, and arrive at an approval/no-approval conclusion that is, very often, at odds with some portion of the medical comm...
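To make the dose-dependence point concrete, here is a toy dose-response (Hill) curve: summary numbers like an LD50 or IC50 are just the dose at the curve's midpoint, which is part of why a single number hides route, duration, and individual susceptibility. The parameters below are purely illustrative.

```python
def hill_response(dose: float, d50: float, hill_coeff: float = 1.0) -> float:
    """Fraction of maximal effect at `dose`, for a sigmoid curve with midpoint `d50`."""
    if dose <= 0:
        return 0.0
    return 1.0 / (1.0 + (d50 / dose) ** hill_coeff)

# Two substances can share the same midpoint (the "50" in LD50/IC50-style numbers)
# yet behave very differently away from it, depending on the curve's steepness.
for steepness in (0.5, 1.0, 4.0):
    curve = [round(hill_response(d, d50=10.0, hill_coeff=steepness), 2)
             for d in (1, 5, 10, 20, 100)]
    print(f"hill_coeff={steepness}: {curve}")
```

All three curves pass through 0.5 at a dose of 10, yet a shallow curve implies meaningful effects far below the midpoint while a steep one is nearly all-or-nothing, which is one small piece of why single-number toxicity summaries mislead.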
Aug 19, 2024 • 13min

AF - Clarifying alignment vs capabilities by Richard Ngo

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying alignment vs capabilities, published by Richard Ngo on August 19, 2024 on The AI Alignment Forum.

A core distinction in AGI safety is between alignment and capabilities. However, I think this distinction is a very fuzzy one, which has led to a lot of confusion. In this post I'll describe some of the problems with how people typically think about it, and offer a replacement set of definitions.

"Alignment" and "capabilities" are primarily properties of AIs, not of AI research

The first thing to highlight is that the distinction between alignment and capabilities is primarily doing useful work when we think of them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about the performance of models almost entirely with respect to the tasks they were specifically trained on. However, the rise of LLMs has vindicated the alignment community's focus on general capabilities, and now it's much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI.

Insofar as the ML community has thought about alignment, it has mostly focused on aligning models' behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans. To be fair, the alignment community has caused some confusion by describing models as more or less "aligned", rather than more or less "aligned to X" for some specified X. I'll talk more about this confusion, and how we should address it, in a later post. But the core point is that AIs might develop internally-represented goals or values that we don't like, and we should try to avoid that.

However, extending "alignment" and "capabilities" from properties of AIs to properties of different types of research is a fraught endeavor. It's tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems.

Firstly, in general it's very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent throughout the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects. Meanwhile Darwin's study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways. More concretely, there are many cases where research done under the banner of alignment has advanced, or plausibly will advance, AI capabilities to a significant extent. This undermines our ability to categorize research by its impacts. 
Central examples include: RLHF makes language models more obedient, but also more capable of coherently carrying out tasks. Scalable oversight techniques can catch misbehavior, but will likely become important for generating high-quality synthetic training data, as it becomes more and more difficult for unassisted humans to label AI outputs correctly. Interpretability techniques will both allow us to inspect AI cognition and also extract more capable behavior from them (e.g. via ...
Aug 19, 2024 • 16min

AF - Untrustworthy models: a frame for scheming evaluations by Olli Järviniemi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrustworthy models: a frame for scheming evaluations, published by Olli Järviniemi on August 19, 2024 on The AI Alignment Forum.

In this post, I'll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find it A) is directly important in practice, B) more readily distinguishes between the capability for scheming and actually scheming, and C) is easier to evaluate for. I'll also distinguish between different types of untrustworthy models - active planners, sleeper agents and opportunists - that are often conflated. I think when people talk about evaluating models for something like scheming, they'd often be better off talking about trustworthiness instead, and I think it's overall a more applicable frame for designing evaluations and decision-making. Many of the ideas here are from Buck Shlegeris and Ryan Greenblatt; thanks to them and Cody Rushing for feedback.

Suppose an AI company has developed a new model and wants to use it to perform large quantities of ML research (internal deployment). However, the company is worried about the model being deceptively aligned[1] and trying to take catastrophic actions, such as hacking the data center or copying its weights to a remote server. What can the company do? This is a textbook example of the type of problem AI control tries to solve, and a basic strategy looks like this: every time you are using the new model and it takes some action, use another model to check that the action is OK before letting it be executed. If the action is not OK, it's not executed and humans are alerted about suspicious activity. (If you have never heard of AI control before, you might want to try playing the Bash Control game to get familiar with the concept. In the game, your aim is to download and execute a malicious program without GPT-3.5 catching you.)

Let's call our characters Smart and Weak: Smart is the new model doing ML research while maybe trying to escape, while Weak is a model trained to monitor Smart. The reason for these names will become apparent soon. There are many ways this control strategy could fail. Here are some:[2]

1. Smart came up with an extremely subtle way to escape that Weak is just too dumb to notice.
2. Smart successfully jailbreaks Weak (via e.g. a suffix attack) so that it does not alert humans about its escape attempt. (Or: Weak happens to have a backdoor, which Smart finds and exploits.)
3. Smart successfully convinces Weak (via natural language) that its actions are not malicious.[3]
4. Smart's escape plan is unlike anything we have trained Weak on. While Weak "would be capable of" telling that Smart is escaping, our training didn't specify well enough what we wanted, so Weak generalizes incorrectly out-of-distribution and doesn't alert humans.
5. The company has way too limited a budget for human alerts, so while Weak does rank Smart's actions in the top 0.012% for suspiciousness, that's not enough to meet the 0.01% threshold for alerting humans.
6. Weak believes that not alerting humans in this case would allow Smart to escape, believes that this leads to better outcomes in light of its preferences, and thus for instrumental reasons deliberately does not alert humans, indeed causing Smart to escape.

Failure 6 is very different from the others! 
Whereas failures 1 to 5 are broadly about "Weak is just not a capable/good enough monitor to do what we wanted", failure 6 is about "Weak knows perfectly well Smart is escaping, but betrays humans and doesn't tell them about this". This issue is most apparent if you were to use a copy of the AI lab's new model to monitor itself (so Smart and Weak are the same model[4]): If you don't trust the new model to do research without trying to escape, you probably can't trust it to alert humans about escape attempts (by itself, no ...
