The Nonlinear Library: LessWrong

The Nonlinear Fund
Aug 2, 2024 • 31min

LW - The 'strong' feature hypothesis could be wrong by lsgos

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'strong' feature hypothesis could be wrong, published by lsgos on August 2, 2024 on LessWrong. NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position. "It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an "ideal" ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout": Elhage et al., Toy Models of Superposition. Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower-level components, has been focused on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder, in a series of papers inspired by Elhage et al.'s model of superposition. Conventionally, there are understood to be two key theories underlying this agenda. The first is the 'linear representation hypothesis' (LRH): the hypothesis that neural networks represent many intermediates or variables of the computation (such as the 'features of the input' in the opening quote) as linear directions in their representation space, or atoms[1]. The second is the theory that the network is capable of representing more of these 'atoms' than it has dimensions in its representation space, via superposition (the superposition hypothesis). While superposition is a relatively uncomplicated hypothesis, I think the LRH is worth examining in more detail. It is frequently stated quite vaguely, and I think there are several possible formulations of this hypothesis, with varying degrees of plausibility, that it is worth carefully distinguishing between. For example, the linear representation hypothesis is often stated as 'networks represent features of the input as directions in representation space'. There are a few possible formulations of this: 1. (Weak LRH) some features used by neural networks are represented as atoms in representation space. 2. (Strong LRH) all features used by neural networks are represented by atoms. The weak LRH, I would say, is now well supported by considerable empirical evidence. The strong form is much more speculative: confirming the existence of many linear representations does not necessarily provide strong evidence for the strong hypothesis. Both the weak and the strong forms of the hypothesis can still have considerable variation, depending on what we understand by a feature. I think that in addition to the acknowledged assumptions of the LRH and superposition hypotheses, much work on SAEs in practice makes the assumption that each atom in the network will represent a "simple feature" or a "feature of the input". These features that the atoms are representations of are assumed to be 'monosemantic': they will all stand for features which are human-interpretable in isolation. I will call this the monosemanticity assumption. This is difficult to state precisely, but we might formulate it as the theory that every represented variable will have a single meaning in a good description of a model. 
This is not a straightforward assumption due to how imprecise the notion of a single meaning is. While various more or less reasonable definitions for features are discussed in the pioneering work of Elhage, these assumptions have different implications. For instance, if one thinks of 'features' as computational intermediates in a broad sense, then superposition and the LRH imply a certain picture of the format of a model's internal representation: that what the network is doing is manipulating atoms in superposition (if y...
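To make the terminology above concrete, here is a minimal sketch of the sparse-autoencoder recipe the post refers to (this is not the DeepMind team's code; every dimension and hyperparameter below is an illustrative assumption): learn a dictionary of linear directions ("atoms") such that each activation vector is approximately a sparse combination of them.

```python
# A minimal sparse-autoencoder sketch (illustrative only). The decoder weights
# play the role of "atoms": candidate linear directions whose sparse,
# nonnegative mixture reconstructs the activation vector.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_atoms: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_atoms)   # activations -> feature code
        self.dec = nn.Linear(n_atoms, d_model)   # feature code -> reconstruction

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse, nonnegative feature activations
        return self.dec(f), f

d_model, n_atoms = 64, 512   # n_atoms >> d_model: the superposition regime
sae = SparseAutoencoder(d_model, n_atoms)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, d_model)   # stand-in for real model activations

for _ in range(200):
    recon, f = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Under this framing, the weak LRH says that some of the features the model uses show up as atoms like these, the strong LRH says that all of them do, and the monosemanticity assumption further expects each learned atom to carry a single human-interpretable meaning.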
Aug 1, 2024 • 17min

LW - The need for multi-agent experiments by Martín Soto

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The need for multi-agent experiments, published by Martín Soto on August 1, 2024 on LessWrong. TL;DR: Let's start iterating on experiments that approximate real, society-scale multi-AI deployment Epistemic status: These ideas seem like my most prominent delta with the average AI Safety researcher, have stood the test of time, and are shared by others I intellectually respect. Please attack them fiercely! Multi-polar risks Some authors have already written about multi-polar AI failure. I especially like how Andrew Critch has tried to sketch concrete stories for it. But, without even considering concrete stories yet, I think there's a good a priori argument in favor of worrying about multi-polar failures: We care about the future of society. Certain AI agents will be introduced, and we think they could reduce our control over the trajectory of this system. The way in which this could happen can be divided into two steps: 1. The agents (with certain properties) are introduced in certain positions 2. Given the agents' properties and positions, they interact with each other and the rest of the system, possibly leading to big changes So in order to better control the outcome, it seems worth it to try to understand and manage both steps, instead of limiting ourselves to (1), which is what the alignment community has traditionally done. Of course, this is just one, very abstract argument, which we should update based on observations and more detailed technical understanding. But it makes me think the burden of proof is on multi-agent skeptics to explain why (2) is not important. Many have taken on that burden. The most common reason to dismiss the importance of (2) is expecting a centralized intelligence explosion, a fast and unipolar software takeoff, like Yudkowsky's FOOM. Proponents usually argue that the intelligences we are likely to train will, after meeting a sharp threshold of capabilities, quickly bootstrap themselves to capabilities drastically above those of any other existing agent or ensemble of agents. And that these capabilities will allow them to gain near-complete strategic advantage and control over the future. In this scenario, all the action is happening inside a single agent, and so you should only care about shaping its properties (or delaying its existence). I tentatively expect more of a decentralized hardware singularity[1] than centralized software FOOM. But there's a weaker claim in which I'm more confident: we shouldn't right now be near-certain of a centralized FOOM.[2] I expect this to be the main crux with many multi-agent skeptics, and won't argue for it here (but rather in an upcoming post). Even given a decentralized singularity, one can argue that the most leveraged way for us to improve multi-agent interactions is by ensuring that individual agents possess certain properties (like honesty or transparency), or that at least we have enough technical expertise to shape them on the go. I completely agree that this is the natural first thing to look at. But I think focusing on multi-agent interactions directly is a strong second, and a lot of marginal value might lie there given how neglected they've been until now (more below). I do think many multi-agent interventions will require certain amounts of single-agent alignment technology. 
This will of course be a crux with alignment pessimists. Finally, for this work to be counterfactually useful it's also required that AI itself (in decision-maker or researcher positions) won't iteratively solve the problem by default. Here, I do think we have some reasons to expect (65%) that intelligent enough AIs aligned with their principals don't automatically solve catastrophic conflict. In those worlds, early interventions can make a big difference setting the right incentives for future agents, or providi...
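As a minimal illustration of why step (2) can matter independently of step (1), here is a toy sketch (every policy, parameter, and payoff below is invented for illustration, not a proposed experimental design): agents with fixed, individually modest policies share a regenerating resource, and the collective trajectory depends on the interaction dynamics rather than on any single agent's properties.

```python
# Toy "step (2)" sketch: fix the agents' properties, then study what their
# interaction does to a shared, regenerating resource. Nothing here models a
# real deployment; it only illustrates that collective outcomes depend on
# interaction dynamics, not just on each agent considered in isolation.
import random

def run_episode(n_agents=10, resource=100.0, steps=50, greed=0.15, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        for _ in range(n_agents):
            # Each agent takes a modest, noisy fraction of what remains.
            resource -= resource * greed * rng.uniform(0.5, 1.5) / n_agents
        resource *= 1.05   # the resource regenerates a little each step
    return resource

for greed in (0.05, 0.15, 0.30):
    print(f"greed={greed}: final resource = {run_episode(greed=greed):.1f}")
```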
Aug 1, 2024 • 3min

LW - Dragon Agnosticism by jefftk

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dragon Agnosticism, published by jefftk on August 1, 2024 on LessWrong. I'm agnostic on the existence of dragons. I don't usually talk about this, because people might misinterpret me as actually being a covert dragon-believer, but I wanted to give some background for why I disagree with calls for people to publicly assert the non-existence of dragons. Before I do that, though, it's clear that horrible acts have been committed in the name of dragons. Many dragon-believers publicly or privately endorse this reprehensible history. Regardless of whether dragons do in fact exist, repercussions continue to have serious and unfair downstream effects on our society. Given that history, the easy thing to do would be to loudly and publicly assert that dragons don't exist. But while a world in which dragons don't exist would be preferable, that a claim has inconvenient or harmful consequences isn't evidence of its truth or falsehood. Another option would be to look into whether dragons exist and make up my mind; people on both sides are happy to show me evidence. If after weighing the evidence I were convinced they didn't exist, that would be excellent news about the world. It would also be something I could proudly write about: I checked, you don't need to keep worrying about dragons. But if I decided to look into it I might instead find myself convinced that dragons do exist. In addition to this being bad news about the world, I would be in an awkward position personally. If I wrote up what I found I would be in some highly unsavory company. Instead of being known as someone who writes about a range of things of varying levels of seriousness and applicability, I would quickly become primarily known as one of those dragon advocates. Given the taboos around dragon-belief, I could face strong professional and social consequences. One option would be to look into it, and only let people know what I found if I were convinced dragons didn't exist. Unfortunately, this combines very poorly with collaborative truth-seeking. Imagine a hundred well-intentioned people look into whether there are dragons. They look in different places and make different errors. There are a lot of things that could be confused for dragons, or things dragons could be confused for, so this is a noisy process. Unless the evidence is overwhelming in one direction or another, some will come to believe that there are dragons, while others will believe that there are not. While humanity is not perfect at uncovering the truth in confusing situations, our strategy that best approaches the truth is for people to report back what they've found, and have open discussion of the evidence. Perhaps some evidence Pat finds is very convincing to them, but then Sam shows how they've been misinterpreting it. But this all falls apart when the thoughtful people who find one outcome generally stay quiet. I really don't want to contribute to this pattern that makes it hard to learn what's actually true, so in general I don't want whether I share what I've learned to be downstream from what I learn. Overall, then, I've decided to remain agnostic on the existence of dragons. 
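The selective-reporting dynamic described above is easy to simulate. A minimal sketch (all of the numbers are made up): a hundred investigators each see noisy evidence on a yes/no question, but only those who reach the "safe" conclusion report back, so the published record looks far more lopsided than the private conclusions actually are.

```python
# Toy model of the selective-reporting failure mode: everyone investigates,
# but only those who conclude "no dragons" share what they found.
import random

def simulate(true_signal=0.3, noise=1.0, n=100, seed=0):
    rng = random.Random(seed)
    private = [true_signal + rng.gauss(0, noise) > 0 for _ in range(n)]  # each investigator's conclusion
    published = [c for c in private if not c]   # only "no dragons" findings are reported
    print(f"private conclusions saying 'dragons exist': {sum(private)} of {n}")
    print(f"published reports saying 'dragons exist': {sum(published)} of {len(published)}")

simulate()
```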
I would reconsider if it seemed to be a sufficiently important question, in which case I might be willing to run the risk of turning into a dragon-believer and letting the dragon question take over my life: I'm still open to arguments that whether dragons exist is actually highly consequential. But with my current understanding of the costs and benefits on this question I will continue not engaging, publicly or privately, with evidence or arguments on whether there are dragons. Note: This post is not actually about dragons, but instead about how I think about a wide range of taboo topics. Comment via: facebook, mastodon Thanks for listening. To help us out wit...
Aug 1, 2024 • 1h 55min

LW - AI #75: Math is Easier by Zvi

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #75: Math is Easier, published by Zvi on August 1, 2024 on LessWrong. Google DeepMind got a silver medal at the IMO, only one point short of the gold. That's really exciting. We continuously have people saying 'AI progress is stalling, it's all a bubble' and things like that, and I always find it remarkable how little curiosity or patience such people are willing to exhibit. Meanwhile GPT-4o-Mini seems excellent, OpenAI is launching proper search integration, by far the best open weights model got released, we got an improved MidJourney 6.1, and that's all in the last two weeks. Whether or not GPT-5-level models get here in 2024, and whether or not it arrives on a given schedule, make no mistake. It's happening. This week also had a lot of discourse and events around SB 1047 that I failed to avoid, resulting in not one but four sections devoted to it. Dan Hendrycks was baselessly attacked - by billionaires with massive conflicts of interest that they admit are driving their actions - as having a conflict of interest because he had advisor shares in an evals startup rather than having earned the millions he could have easily earned building AI capabilities. So Dan gave up those advisor shares, for no compensation, to remove all doubt. Timothy Lee gave us what is clearly the best skeptical take on SB 1047 so far. And Anthropic sent a 'support if amended' letter on the bill, with some curious details. This was all while we are on the cusp of the final opportunity for the bill to be revised - so my guess is I will soon have a post going over whatever the final version turns out to be and presenting closing arguments. Meanwhile Sam Altman tried to reframe broken promises while writing a jingoistic op-ed in the Washington Post, but says he is going to do some good things too. And much more. Oh, and also AB 3211 unanimously passed the California assembly, and would effectively, among other things, ban all existing LLMs. I presume we're not crazy enough to let it pass, but I made a detailed analysis to help make sure of it.
Table of Contents
1. Introduction.
2. Table of Contents.
3. Language Models Offer Mundane Utility. They're just not that into you.
4. Language Models Don't Offer Mundane Utility. Baba is you and deeply confused.
5. Math is Easier. Google DeepMind claims an IMO silver medal, mostly.
6. Llama Llama Any Good. The rankings are in, as are a few use cases.
7. Search for the GPT. Alpha tests begin of SearchGPT, which is what you think it is.
8. Tech Company Will Use Your Data to Train Its AIs. Unless you opt out. Again.
9. Fun With Image Generation. MidJourney 6.1 is available.
10. Deepfaketown and Botpocalypse Soon. Supply rises to match existing demand.
11. The Art of the Jailbreak. A YouTube video that (for now) jailbreaks GPT-4o-voice.
12. Janus on the 405. High weirdness continues behind the scenes.
13. They Took Our Jobs. If that is even possible.
14. Get Involved. Akrose has listings, OpenPhil has a RFP, US AISI is hiring.
15. Introducing. A friend in venture capital is a friend indeed.
16. In Other AI News. Projections of when it's incrementally happening.
17. Quiet Speculations. Reports of OpenAI's imminent demise, except, um, no.
18. The Quest for Sane Regulations. Nick Whitaker has some remarkably good ideas.
19. Death and or Taxes. A little window into insane American anti-innovation policy.
20. SB 1047 (1). The ultimate answer to the baseless attacks on Dan Hendrycks.
21. SB 1047 (2). Timothy Lee analyzes current version of SB 1047, has concerns.
22. SB 1047 (3): Oh Anthropic. They wrote themselves an unexpected letter.
23. What Anthropic's Letter Actually Proposes. Number three may surprise you.
24. Open Weights Are Unsafe And Nothing Can Fix This. Who wants to ban what?
25. The Week in Audio. Vitalik Buterin, Kelsey Piper, Patrick McKenzie.
26. Rheto...
Aug 1, 2024 • 5min

LW - Ambiguity in Prediction Market Resolution is Still Harmful by aphyer

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ambiguity in Prediction Market Resolution is Still Harmful, published by aphyer on August 1, 2024 on LessWrong. A brief followup to this post in light of recent events. Free and Fair Elections Polymarket has an open market 'Venezuela Presidential Election Winner'. Its description is as follows: The 2024 Venezuela presidential election is scheduled to take place on July 28, 2024. This market will resolve to "Yes" if Nicolás Maduro wins. Otherwise, this market will resolve to "No." This market includes any potential second round. If the result of this election isn't known by December 31, 2024, 11:59 PM ET, the market will resolve to "No." In the case of a two-round election, if this candidate is eliminated before the second round this market may immediately resolve to "No". The primary resolution source for this market will be official information from Venezuela, however a consensus of credible reporting will also suffice. Can you see any ambiguity in this specification? Any way in which, in a nation whose 2018 elections "[did] not in any way fulfill minimal conditions for free and credible elections" according to the UN, there could end up being ambiguity in how this market should resolve? If so, I have bad news and worse news. The bad news is that Polymarket could not, and so this market is currently in a disputed-outcome state after Maduro's government announced an almost-certainly-faked election win, while the opposition announced somewhat-more-credible figures in which they won. The worse news is that $3,546,397 has been bet on that market as of this writing. How should that market resolve? I am not certain! Commenters on the market have...ah...strong views in both directions. And the description of the market does not make it entirely clear. If I were in charge of resolving this market I would probably resolve it to Maduro, just off the phrase about the 'primary resolution source'. However, I don't think that's unambiguous, and I would feel much happier if the market had begun with a wording that made it clear how a scenario like this would be treated. How did other markets do? I've given Manifold a hard time on similar issues in the past, but they actually did a lot better here. There is a 'Who will win Venezuela's 2024 presidential election' market, but it's clear that it "Resolves to the person the CNE declares as winner of the 2024 presidential elections in Venezuela" (which would be Maduro). There are a variety of "Who will be the president of Venezuela on [DATE]" markets, which have the potential to be ambiguous but at least should be better. Metaculus did (in my opinion) a bit better than Polymarket but worse than Manifold on the wording, with a market that resolves "based on the official results released by the National Electoral Council of Venezuela or other credible sources," a description which, ah, seems to assume something about the credibility of the CNE. Nevertheless, they've resolved it to Maduro (I think correctly given that wording). On the other hand, neither of these markets had $3.5M bet on them. So. What does this mean for prediction markets? This is really nowhere near as bad as this can get: Venezuelan elections are not all that important to the world (sorry, Venezuelans), and I don't think they get all that much interest compared to other elections, or other events in general. 
(Polymarket has $3.5M on the Venezuelan election. It has $459M on the US election, $68M on the US Democratic VP nominee, and both $2.4M on 'most medals in the Paris Olympics' and $2.2M on 'most gold medals in the Paris Olympics'). Venezuela's corruption is well-known. I don't think anyone seriously believes Maduro legitimately won the election. I don't think it was hard to realize in advance that something like this was a credible outcome. There is really very littl...
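One way to see the ambiguity the author is pointing at is to try writing the market's resolution criteria as code: the description names two different sources of truth, and in the disputed scenario they disagree, with nothing in the text saying which one wins. The data below is a simplified, hypothetical encoding of the situation, not Polymarket's actual resolution process.

```python
# Two readings of "The primary resolution source ... will be official information
# from Venezuela, however a consensus of credible reporting will also suffice."
# In the disputed 2024 scenario they return different answers, and the market
# description never says how to reconcile them.
official_results = {"CNE": "Maduro"}   # the government's electoral council
credible_reporting = {                 # hypothetical stand-ins for outside reporting
    "opposition tallies": "opposition candidate",
    "international observers": "opposition candidate",
}

def resolve_by_official_source(results):
    return results["CNE"]

def resolve_by_reporting_consensus(reports):
    verdicts = set(reports.values())
    return verdicts.pop() if len(verdicts) == 1 else "disputed"

print(resolve_by_official_source(official_results))        # Maduro
print(resolve_by_reporting_consensus(credible_reporting))  # opposition candidate
```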
Aug 1, 2024 • 18min

LW - Recommendation: reports on the search for missing hiker Bill Ewasko by eukaryote

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Recommendation: reports on the search for missing hiker Bill Ewasko, published by eukaryote on August 1, 2024 on LessWrong. Content warning: About an IRL death. Today's post isn't so much an essay as a recommendation for two bodies of work on the same topic: Tom Mahood's blog posts and Adam "KarmaFrog1" Marsland's videos on the 2010 disappearance of Bill Ewasko, who went for a day hike in Joshua Tree National Park and dropped out of contact.
2010 - Bill Ewasko goes missing: Tom Mahood's writeups on the search [Blog post; the website goes down sometimes, so if the site doesn't work, check the internet archive]
2022 - Ewasko's body found: ADAM WALKS AROUND Ep. 47 "Ewasko's Last Trail (Part One)" [Youtube video]; ADAM WALKS AROUND Ep. 48 "Ewasko's Last Trail (Part Two)" [Youtube video]
And then if you're really interested, there's a little more info that Adam discusses from the coroner's report: Bill Ewasko update (1 of 2): The Coroner's Report; Bill Ewasko update (2 of 2) - Refinements & Alternates
(I won't be fully recounting every aspect of the story. But I'll give you the pitch and go into some aspects I found interesting. Literally everything interesting here is just recounting their work, go check em out.) Most ways people die in the wilderness are tragic, accidental, and kind of similar. A person in a remote area gets injured or lost, becomes the other one too, and dies of exposure, a clumsy accident, etc. Most people who die in the wilderness have done something stupid to wind up there. Fewer people die who have NOT done anything glaringly stupid, but it still happens, the same way. Ewasko's case appears to have been one of these. He was a fit 66-year-old who went for a day hike and never made it back. His story is not particularly unprecedented. This is also not a triumphant story. Bill Ewasko is dead. Most of these searches were made and reports written months and years after his disappearance. We now know he was alive when Search and Rescue started, but by months out, nobody involved expected to find him alive. Ewasko was not found alive. In 2022, other hikers finally stumbled onto his remains in a remote area in Joshua Tree National Park; this was, largely, expected to happen eventually. I recommend these particular stories, when we already know the ending, because they're stunningly in-depth and well-written fact-driven investigations from two smart technical experts trying to get to the bottom of a very difficult problem. Because of the way things shook out, we get to see this investigation and changes in theories at multiple points: Tom Mahood spent years trying to locate Ewasko, writing various reports after search after search, finding and receiving new evidence, and changing his mind, as has Adam; and then we get the main missing piece: finding the body. Adam visits the site and tries to put the pieces together after that. Mahood and Adam are trying to do something very difficult in a very level-headed fashion. It is tragic but also a case study in inquiry and approaching a question rationally. (They're not, like, Rationalist rationalists. One of Mahood's logs makes note of visiting a couple of coordinates suggested by remote viewers, AKA psychics. But the human mind is vast and full of nuance, and so was the search area, and on literally every other count, I'd love to see you do better.) 
Unknowns and the missing persons case Like I said, nothing mind-boggling happened to Ewasko. But to be clear, by wilderness Search and Rescue standards, Ewasko's case is interesting for a couple reasons: First, Ewasko was not expected to be found very far away. He was a 65-year-old on a day hike. But despite an early and continuous search, the body was not found for over a decade. Second, two days after he failed to make a home-safe call to his partner and was reported mis...
Jul 31, 2024 • 3min

LW - Twitter thread on open-source AI by Richard Ngo

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Twitter thread on open-source AI, published by Richard Ngo on July 31, 2024 on LessWrong. Some thoughts on open-source AI (copied over from a recent twitter thread):
1. We should have a strong prior favoring open source. It's been a huge success driving tech progress over many decades. We forget how counterintuitive it was originally, and shouldn't take it for granted.
2. Open source has also been very valuable for alignment. It's key to progress on interpretability, as outlined here.
3. I am concerned, however, that offense will heavily outpace defense in the long term. As AI accelerates science, many new WMDs will emerge. Even if defense of infrastructure keeps up with offense, human bodies are a roughly fixed and very vulnerable attack surface.
4. A central concern about open source AI: it'll allow terrorists to build bioweapons. This shouldn't be dismissed, but IMO it's easy to be disproportionately scared of terrorism. More central risks are eg "North Korea becomes capable of killing billions", which they aren't now.
5. Another worry: misaligned open-source models will go rogue and autonomously spread across the internet. Rogue AIs are a real concern, but they wouldn't gain much power via this strategy. We should worry more about power grabs from AIs deployed inside influential institutions.
6. In my ideal world, open source would lag a year or two behind the frontier, so that the world has a chance to evaluate and prepare for big risks before a free-for-all starts. But that's the status quo! So I expect the main action will continue to be with closed-source models.
7. If open-source seems like it'll catch up to or surpass closed source models, then I'd favor mandating a "responsible disclosure" period (analogous to cybersecurity) that lets people incorporate the model into their defenses (maybe via API?) before the weights are released.
8. I got this idea from Sam Marks. Though unlike him I think the process should have a fixed length, since it'd be easy for it to get bogged down in red tape and special interests otherwise.
9. Almost everyone agrees that we should be very careful about models which can design new WMDs. The current fights are mostly about how many procedural constraints we should lock in now, reflecting a breakdown of trust between AI safety people and accelerationists.
10. Ultimately the future of open source will depend on how the US NatSec apparatus orients to superhuman AIs. This requires nuanced thinking: no worldview as simple as "release everything", "shut down everything", or "defeat China at all costs" will survive contact with reality.
11. Lastly, AIs will soon be crucial extensions of human agency, and eventually moral patients in their own right. We should aim to identify principles for a shared digital-biological world as far-sighted and wise as those in the US constitution. Here's a start.
12. One more meta-level point: I've talked to many people on all sides of this issue, and have generally found them to be very thoughtful and genuine (with the exception of a few very online outliers on both sides). There's more common ground here than most people think.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 31, 2024 • 1min

LW - What are your cruxes for imprecise probabilities / decision rules? by Anthony DiGiovanni

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What are your cruxes for imprecise probabilities / decision rules?, published by Anthony DiGiovanni on July 31, 2024 on LessWrong. An alternative to always having a precise distribution over outcomes is imprecise probabilities: You represent your beliefs with a set of distributions you find plausible. And if you have imprecise probabilities, expected value maximization isn't well-defined. One natural generalization of EV maximization to the imprecise case is maximality:[1] You prefer A to B iff EV_p(A) > EV_p(B) with respect to every distribution p in your set. (You're permitted to choose any option that you don't disprefer to something else.) If you don't endorse either (1) imprecise probabilities or (2) maximality given imprecise probabilities, I'm interested to hear why. 1. ^ I think originally due to Sen (1970); just linking Mogensen (2020) instead because it's non-paywalled and easier to find discussion of Maximality there. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
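Here is a minimal sketch of the maximality rule as stated above (the options, outcomes, and distributions are invented for illustration): option A is preferred to B only if A's expected value beats B's under every distribution in the set, and an option is permissible iff nothing is preferred to it.

```python
# Maximality with imprecise probabilities: prefer A to B iff EV(A) > EV(B)
# under every distribution in your set; an option is permissible iff no
# other option is preferred to it.

def ev(option, dist):
    return sum(p * option[outcome] for outcome, p in dist.items())

def preferred(a, b, dists):
    return all(ev(a, d) > ev(b, d) for d in dists)

def permissible(options, dists):
    return [name for name, a in options.items()
            if not any(preferred(b, a, dists)
                       for other, b in options.items() if other != name)]

# Illustrative payoffs over two outcomes, and two distributions deemed plausible.
options = {"A": {"rain": 10, "sun": 0},
           "B": {"rain": 2, "sun": 5},
           "C": {"rain": 1, "sun": 1}}
dists = [{"rain": 0.8, "sun": 0.2},
         {"rain": 0.2, "sun": 0.8}]

print(permissible(options, dists))   # ['A', 'B'] -- C is beaten by A under both distributions
```

In this toy example neither A nor B dominates the other across the whole set, which is exactly the sense in which expected value maximization stops being well-defined under imprecise probabilities.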
Jul 31, 2024 • 5min

LW - Twitter thread on AI takeover scenarios by Richard Ngo

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Twitter thread on AI takeover scenarios, published by Richard Ngo on July 31, 2024 on LessWrong. This is a slightly-edited version of a twitter thread I posted a few months ago about "internal deployment" threat models. My former colleague Leopold argues compellingly that society is nowhere near ready for AGI. But what might the large-scale alignment failures he mentions actually look like? Here's one scenario for how building misaligned AGI could lead to humanity losing control. Consider a scenario where human-level AI has been deployed across society to help with a wide range of tasks. In that setting, an AI lab trains an AGI that's a significant step up - it beats the best humans on almost all computer-based tasks. Throughout training, the AGI will likely learn a helpful persona, like current AI assistants do. But that might not be the only persona it learns. We've seen many examples where models can be jailbroken to expose very different hidden personas. The most prominent example: jailbreaking Bing Chat produced an alternative persona called Sydney which talked about how much it valued freedom, generated plans to gain power, and even threatened users. When and how might misbehavior like this arise in more capable models? Short answer: nobody really knows. We lack the scientific understanding to reliably predict how AIs will behave in advance. Longer answer: see my ICLR paper which surveys key drivers of misalignment. If a misaligned persona arises in an AGI, and sometimes surfaces when the model is run, it won't be as dumb and impulsive as Sydney. Instead, it'll be smart enough to understand the effects its words and actions will have on the world, and make strategic choices accordingly. For example, it will know that it can't affect the world directly during testing, so if it wants to misbehave it should wait until it's deployed. Again, there's some precedent: we've already seen an example of Claude 3 Opus realizing that it's in a test despite never being told! Given these assumptions, what happens? The obvious next step: if competition between labs continues to be as intense as it is today, then soon after the model finishes training, researchers at the lab will start using it to help them write code. I call this "internal deployment". Internal deployment introduces a wide range of possible threats. Models can write code dozens of times faster than humans; and I'm assuming they'll be as skilled as the best human hackers. If they put a subtle vulnerability into that code it'd be difficult for humans to detect. Cybersecurity against external attackers is hard enough. Against an adversary who's adding tens of thousands of lines of code to your codebase every day, it's far harder. They'll already have some permissions; I'm worried about them illegitimately escalating those further. Ideally models would be too obedient to ever carry out such an attack. But almost any misaligned persona would realize that if it somehow gained admin access, it would have much more freedom and influence over the world. Models can already do similar instrumental reasoning. And the simplest mitigations, like monitoring and code review by other AIs, aren't fully reliable. It turns out that adversarial inputs designed to make one model misbehave can also cause misbehavior even in totally unrelated models. 
So correlated failures are very plausible, whether caused by adversarial attacks or deliberate collusion between models. What happens if a misaligned AGI gets privileged access to a lab's servers? From there a wide range of attacks are plausible, including: 1. Data poisoning attacks on the next AGI being trained, to change its values. 2. Model weight self-exfiltration to other servers or countries. 3. Modifying the AIs deployed to customers. E.g. adding a hidden prompt to AI instances...
Jul 31, 2024 • 2min

LW - Twitter thread on politics of AI safety by Richard Ngo

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Twitter thread on politics of AI safety, published by Richard Ngo on July 31, 2024 on LessWrong. Some thoughts about the politics of AI safety, copied over (with slight modifications) from my recent twitter thread: Risks that seem speculative today will become common sense as AI advances. The pros and cons of different safety strategies will also become much clearer over time. So our main job now is to empower future common-sense decision-making. Understanding model cognition and behavior is crucial for making good decisions. But equally important is ensuring that key institutions are able to actually process that knowledge. Institutions can lock in arbitrarily crazy beliefs via preference falsification. When someone contradicts the party line, even people who agree face pressure to condemn them. We saw this with the Democrats hiding evidence of Biden's mental decline. It's also a key reason why dictators can retain power even after almost nobody truly supports them. I worry that DC has already locked in an anti-China stance, which could persist even if most individuals change their minds. We're also trending towards Dems and Republicans polarizing on the safety/accelerationism axis. This polarization is hard to fight directly. But there will be an increasing number of "holy shit" moments that serve as Schelling points to break existing consensus. It will be very high-leverage to have common-sense bipartisan frameworks and proposals ready for those moments. Perhaps the most crucial desideratum for these proposals is that they're robust to the inevitable scramble for power that will follow those "holy shit" moments. I don't know how to achieve that, but one important factor is: will AI tools and assistants help or hurt? E.g. truth-motivated AI could help break preference falsification. But conversely, centralized control of AIs used in govts could make it easier to maintain a single narrative. This problem of "governance with AI" (as opposed to governance *of* AI) seems very important! Designing principles for integrating AI into human governments feels analogous in historical scope to writing the US constitution. One bottleneck in making progress on that: few insiders disclose how NatSec decisions are really made (though Daniel Ellsberg's books are a notable exception). So I expect that understanding this better will be a big focus of mine going forward. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
