The Nonlinear Library

The Nonlinear Fund
Jun 7, 2024 • 18min

EA - A Case for Superhuman Governance, using AI by Ozzie Gooen

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Case for Superhuman Governance, using AI, published by Ozzie Gooen on June 7, 2024 on The Effective Altruism Forum. I believe that: 1. AI-enhanced organization governance could be a potentially huge win in the next few decades. 2. AI-enhanced governance could allow organizations to reach superhuman standards, like having an expected "99.99" reliability rate of not being corrupt or not telling lies. 3. While there are clear risks to AI-enhancement at the top levels of organizations, it's likely that most of these can be managed, assuming that the implementers are reasonable. 4. AI-enhanced governance could synchronize well with AI company regulation. These companies would be well-placed to develop innovations and could hypothetically be incentivized to do much of the work. AI-enhanced governance might be necessary to ensure that these organizations are aligned with public interests. 5. More thorough investigation here could be promising for the effective altruism community. Within effective altruism now, there's a lot of work on governance and AI, but not much on using AI for governance. AI Governance typically focuses on using conventional strategies to oversee AI organizations, while AI Alignment research focuses on aligning AI systems. However, leveraging AI to improve human governance is an underexplored area that could complement these cause areas. You can think of it as "Organizational Alignment", as a counterpoint to "AI Alignment." This article was written after some rough ideation I've done about this area. This isn't at all a literature review or a research agenda. That said, for those interested in this topic, here are a few posts you might find interesting: "Project ideas: Governance during explosive technological growth"; "The Project AI Series", by OpenMined; "Safety Cases: How to Justify the Safety of Advanced AI Systems"; "Affirmative Safety: An Approach to Risk Management for Advanced AI". What is "AI-Assisted" Governance? AI-Assisted Governance refers to improvements in governance that leverage artificial intelligence (AI), particularly focusing on rapidly advancing areas like Large Language Models (LLMs). Example methods include: 1. Monitoring politicians and executives to identify and flag misaligned or malevolent behavior, ensuring accountability and integrity. 2. Enhancing epistemics and decision-making processes at the top levels of organizations, leading to more informed and rational strategies. 3. Facilitating more effective negotiations and trades between organizations, fostering better cooperation and coordination. 4. Assisting in writing and overseeing highly secure systems, such as implementing differential privacy and formally verified, bug-free decision-automation software, for use at managerial levels. Arguments for Governance Improvements, Generally: There's already a lot of consensus in the rationalist and effective altruist communities about the importance of governance. See the topics on Global Governance, AI Governance, and Nonprofit Governance for more information. Here are some main reasons why focusing on improving governance seems particularly promising. Concentrated Leverage: Real-world influence is disproportionately concentrated in the hands of a relatively small number of leaders in government, business, and other pivotal institutions. This is especially true in the case of rapid AI progress.
Improving the reasoning and actions of this select group is therefore perhaps the most targeted, tractable, and neglected way to shape humanity's long-term future. AI tools could offer uniquely potent levers to do so. A lot of epistemic-enhancing work focuses on helping large populations. But some people will matter many times as much as others, and these people are often in key management positions. Dramatic Room for Improvement: It's hard to lo...
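The fourth example method above mentions differential privacy. As a rough illustration of that primitive only (this is not code from the post; the function name and parameters are my own placeholders), a minimal sketch of the Laplace mechanism for releasing an aggregate statistic with epsilon-differential privacy looks like this:

```python
# Minimal sketch of the Laplace mechanism (illustrative; not from the post).
# private_count, sensitivity, and epsilon are hypothetical names/parameters.
import numpy as np

def private_count(true_count: float, sensitivity: float, epsilon: float) -> float:
    """Return true_count plus Laplace noise scaled to sensitivity/epsilon,
    giving epsilon-differential privacy for this single query."""
    noise = np.random.default_rng().laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: report how many communications a monitoring system flagged this month,
# without revealing whether any single person's messages moved the count.
print(private_count(true_count=42.0, sensitivity=1.0, epsilon=0.5))
```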
Jun 7, 2024 • 8min

LW - Natural Latents Are Not Robust To Tiny Mixtures by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Natural Latents Are Not Robust To Tiny Mixtures, published by johnswentworth on June 7, 2024 on LessWrong. In our previous natural latent posts, our core theorem typically says something like: Assume two agents have the same predictive distribution P[X] over variables X, but model that distribution using potentially-different latent variables. If the latents both satisfy some simple "naturality" conditions (mediation and redundancy) then the two agents' latents contain approximately the same information about X. So, insofar as the two agents both use natural latents internally, we have reason to expect that the internal latents of one can be faithfully translated into the internal latents of the other. This post is about one potential weakness in that claim: what happens when the two agents' predictive distributions are only approximately the same? Following the pattern of our previous theorems, we'd ideally say something like: If the two agents' distributions are within ϵ of each other (as measured by some KL-divergences), then their natural latents contain approximately the same information about X, to within some O(ϵ) bound. But that turns out to be false. The Tiny Mixtures Counterexample: Let's start with two distributions, P0 and Q0, over X. These won't be our two agents' distributions - we're going to construct our two agents' distributions by mixing these two together, as the name "tiny mixtures" suggests. P0 and Q0 will have extremely different natural latents. Specifically: X1 consists of 1 million bits, X2 consists of another 1 million bits. Under P0, X1 is uniform, and X2=X1. So, there is an exact natural latent ΛP=X1=X2 under P0. Under Q0, X1 and X2 are independent and uniform. So, the empty latent ΛQ is exactly natural under Q0. Mental picture: we have a million-bit channel, under P0 the output (X2) is equal to the input (X1), while under Q0 the channel hardware is maintained by Comcast so they're independent. Now for our two agents' distributions, P and Q. P will be almost P0, and Q will be almost Q0, but each agent puts a 1/2^50 probability on the other distribution: P = (1 - 1/2^50) P0 + (1/2^50) Q0, and Q = (1/2^50) P0 + (1 - 1/2^50) Q0. First key observation: D_KL(P||Q) and D_KL(Q||P) are both roughly 50 bits. Calculation: D_KL(P||Q) = Σ_{X1,X2} P[X] (log P[X] - log Q[X]) ≈ Σ_{X1=X2} (1/2^1000000) (-1000000 - log(1/2^2000000 + (1/2^50)(1/2^1000000))) ≈ 50, and D_KL(Q||P) = Σ_{X1,X2} Q[X] (log Q[X] - log P[X]) ≈ Σ_{X1≠X2} (1/2^2000000) (-2000000 - log((1/2^50)(1/2^2000000))) ≈ 50. Intuitively: since each distribution puts roughly 1/2^50 on the other, it takes about 50 bits of evidence to update from either one to the other. Second key observation: the empty latent is approximately natural under Q, and the latent Λ:=X1 is approximately natural under P. Epsilons: Under Q, the empty latent satisfies mediation to within about (1/2^50)·1000000 ≈ 1/2^30 bits (this is just mutual information of X1 and X2 under Q), and redundancy exactly (since the empty latent can always be exactly computed from any input). Under P, Λ:=X1 satisfies mediation exactly (since X1 mediates between X1 and anything else), redundancy with respect to X2 exactly (Λ=X1 can be exactly computed from just X1 without X2), and redundancy with respect to X1 to within about (1/2^50)·1000000 ≈ 1/2^30 bits (since there's a 1/2^50 chance that X2 doesn't tell us the relevant 1000000 bits).
… and of course the information those two latents tell us about X differs by 1 million bits: one of them is empty, and the other directly tells us 1 million bits about X1. Now, let's revisit the claim we would've liked to make: If the two agents' distributions are within ϵ of each other (as measured by some KL-divergences), then their natural latents contain approximately the same information about X, to within some O(ϵ) bound. Tiny mixtures rule out any claim along those lines. Generalizing the counterexample to an N bit channel (where N=1000000 above) and a mixin pr...
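The counterexample is easy to check numerically at a smaller scale. The sketch below (my own; variable names are mine) uses an N = 16 bit channel and a 1/2^12 mixin weight instead of the post's 1,000,000 bits and 1/2^50, so it runs instantly; the qualitative picture is the same.

```python
# Numeric sanity check of the tiny-mixtures counterexample at small scale.
import numpy as np

N = 16             # bits in the channel (1,000,000 in the post)
delta = 2.0**-12   # mixin probability (2^-50 in the post)

# Only two kinds of cells matter: "diagonal" (x1 == x2) and "off-diagonal".
n_diag = 2.0**N
n_off = 2.0**(2 * N) - n_diag

# P0: x1 uniform, x2 == x1.   Q0: x1, x2 independent and uniform.
P0_diag, P0_off = 2.0**-N, 0.0
Q0_diag = Q0_off = 2.0**(-2 * N)

def mix(w):  # weight w on P0, (1 - w) on Q0
    return (w * P0_diag + (1 - w) * Q0_diag,
            w * P0_off + (1 - w) * Q0_off)

P_diag, P_off = mix(1 - delta)
Q_diag, Q_off = mix(delta)

def kl_bits(p_diag, p_off, q_diag, q_off):
    return (n_diag * p_diag * np.log2(p_diag / q_diag)
            + n_off * p_off * np.log2(p_off / q_off))

print(kl_bits(P_diag, P_off, Q_diag, Q_off))  # about 11.9 bits
print(kl_bits(Q_diag, Q_off, P_diag, P_off))  # about 12.0 bits; both near -log2(delta) = 12
# Yet the latent Lambda := X1 carries N = 16 bits about X under P, while the empty
# latent carries 0 bits under Q: the gap grows with N while both KLs stay ~ -log2(delta).
```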
Jun 7, 2024 • 1h 1min

LW - Quotes from Leopold Aschenbrenner's Situational Awareness Paper by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Quotes from Leopold Aschenbrenner's Situational Awareness Paper, published by Zvi on June 7, 2024 on LessWrong. This post is different. Usually I offer commentary and analysis. I share what others think, then respond. This is the second time I am importantly not doing that. The work speaks for itself. It offers a different perspective, a window and a worldview. It is self-consistent. This is what a highly intelligent, highly knowledgeable person actually believes after much thought. So rather than say where I agree and disagree and argue back (and I do both strongly in many places), this is only quotes and graphs from the paper, selected to tell the central story while cutting length by ~80%, so others can more easily absorb it. I recommend asking what are the load bearing assumptions and claims, and what changes to them would alter the key conclusions. The first time I used this format was years ago, when I offered Quotes from Moral Mazes. I think it is time to use it again. Then there will be one or more other posts, where I do respond. Introduction (1) Page 1: The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war. Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the willful blindness of "it's just predicting the next word". They see only hype and business-as-usual; at most they entertain another internet-scale technological change. Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. Section 1: From GPT-4 to AGI: Counting the OOMs (2) Page 7: AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler abilities in 4 years. Tracing trendlines in compute (~0.5 orders of magnitude or OOMs/year), algorithmic efficiencies (~0.5 OOMs/year), and "unhobbling" gains (from chatbot to agent), we should expect another preschooler-to-high-schooler-sized qualitative jump by 2027. (3) Page 8: I make the following claim: it is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer. That doesn't require believing in sci-fi; it just requires believing in straight lines on a graph. (4) Page 9: We are racing through the OOMs extremely rapidly, and the numbers indicate we should expect another ~100,000x effective compute scaleup - resulting in another GPT-2-to-GPT-4-sized qualitative jump - over four years. (5) Page 14: Of course, even GPT-4 is still somewhat uneven; for some tasks it's much better than smart high-schoolers, while there are other tasks it can't yet do. That said, I tend to think most of these limitations come down to obvious ways models are still hobbled, as I'll discuss in-depth later. The raw intelligence is (mostly) there, even if the models are still artificially constrained; it'll take extra work to unlock models being able to fully apply that raw intelligence across applications. (6) Page 19: How did this happen? The magic of deep learning is that it just works - and the trendlines have been astonishingly consistent, despite naysayers at every turn. 
(7) Page 21: An additional 2 OOMs of compute (a cluster in the $10s of billions) seems very likely to happen by the end of 2027; even a cluster closer to +3 OOMs of compute ($100 billion+) seems plausible (and is rumored to be in the works at Microsoft/OpenAI). (8) Page 23: In this piece, I'll separate out two kinds of algorithmic progress. Here, I'll start by covering "within-paradigm" algorithmic improvements - those that simply result in b...
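To make the arithmetic behind quotes (2) and (4) explicit, here is a rough back-of-the-envelope sketch. The unhobbling figure is my own assumption for illustration; the paper's accounting of the trendlines is more detailed.

```python
# Back-of-the-envelope version of the "~100,000x effective compute" trendline claim.
years = 4                      # roughly GPT-4 (2023) -> 2027
compute_ooms_per_year = 0.5    # physical compute trendline quoted above
algo_ooms_per_year = 0.5       # algorithmic-efficiency trendline quoted above
unhobbling_ooms = 1.0          # assumed one-off "chatbot -> agent" gains (my estimate)

total_ooms = years * (compute_ooms_per_year + algo_ooms_per_year) + unhobbling_ooms
print(total_ooms, f"{10**total_ooms:,.0f}x")   # 5.0 OOMs -> 100,000x effective compute
```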
Jun 7, 2024 • 4min

LW - GPT2, Five Years On by Joel Burget

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GPT2, Five Years On, published by Joel Burget on June 7, 2024 on LessWrong. Jack Clark's retrospective on GPT2 is full of interesting policy thoughts; I recommend reading the whole thing. One excerpt: I've come to believe that in policy "a little goes a long way" - it's far better to have a couple of ideas you think are robustly good in all futures and advocate for those than make a confident bet on ideas custom-designed for one specific future - especially if it's based on a very confident risk model that sits at some unknowable point in front of you. Additionally, the more risk-oriented you make your policy proposal, the more you tend to assign a huge amount of power to some regulatory entity - and history shows that once we assign power to governments, they're loath to subsequently give that power back to the people. Policy is a ratchet and things tend to accrete over time. That means whatever power we assign governments today represents the floor of their power in the future - so we should be extremely cautious in assigning them power because I guarantee we will not be able to take it back. For this reason, I've found myself increasingly at odds with some of the ideas being thrown around in AI policy circles, like those relating to needing a license to develop AI systems; ones that seek to make it harder and more expensive for people to deploy large-scale open source AI models; shutting down AI development worldwide for some period of time; the creation of net-new government or state-level bureaucracies to create compliance barriers to deployment (I take as a cautionary lesson, the Nuclear Regulatory Commission and its apparent chilling effect on reactor construction in the USA); the use of the term 'safety' as a catch-all term to enable oversight regimes which are not - yet - backed up by quantitative risks and well-developed threat models, and so on. I'm not saying any of these ideas are without redeeming qualities, nor am I saying they don't nobly try to tackle some of the thornier problems of AI policy. I am saying that we should be afraid of the power structures encoded by these regulatory ideas and we should likely treat them as dangerous things in themselves. I worry that the AI policy community that aligns with long-term visions of AI safety and AGI believes that because it assigns an extremely high probability to a future AGI destroying humanity that this justifies any action in the present - after all, if you thought you were fighting for the human race, you wouldn't want to compromise! But I think that along with this attitude there comes a certain unwillingness to confront just how unpopular many of these ideas are, nor how unreasonable they might sound to people who don't have similar intuitions about the technology and its future - and therefore an ensuing blindness to the costs of counterreaction to these ideas. Yes, you think the future is on the line and you want to create an army to save the future. But have you considered that your actions naturally create and equip an army from the present that seeks to fight for its rights? Is there anything I'm still confident about? Yes.
I hate to seem like a single-issue voter, but I had forgotten that in the GPT-2 post we wrote "we also think governments should consider expanding or commencing initiatives to more systematically monitor the societal impact and diffusion of AI technologies, and to measure the progression in the capabilities of such systems." I remain confident this is a good idea! In fact, in the ensuing years I've sought to further push this idea forward via, variously, Regulatory Markets as a market-driven means of doing monitoring; articulating why and how governments can monitor AI systems; advocating for the US to increase funding for NIST; laying out why Anthropic believes third-part...
Jun 7, 2024 • 1h 3min

LW - AI #67: Brief Strange Trip by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #67: Brief Strange Trip, published by Zvi on June 7, 2024 on LessWrong. I had a great time at LessOnline. It was both a working trip and also a trip to an alternate universe, a road not taken, a vision of a different life where you get up and start the day in dialogue with Agnes Callard and Aristotle and in a strange combination of relaxed and frantically go from conversation to conversation on various topics, every hour passing doors of missed opportunity, gone forever. Most of all it meant almost no writing done for five days, so I am shall we say a bit behind again. Thus, the following topics are pending at this time, in order of my guess as to priority right now: 1. Leopold Aschenbrenner wrote a giant thesis, started a fund and went on Dwarkesh Patel for four and a half hours. By all accounts, it was all quite the banger, with many bold claims, strong arguments and also damning revelations. 2. Partly due to Leopold, partly due to an open letter, partly due to continuing small things, OpenAI fallout continues, yes we are still doing this. This should wait until after Leopold. 3. DeepMind's new scaling policy. I have a first draft, still a bunch of work to do. 4. The OpenAI model spec. As soon as I have the cycles and anyone at OpenAI would have the cycles to read it. I have a first draft, but that was written before a lot happened, so I'd want to see if anything has changed. 5. The Rand report on securing AI model weights, which deserves more attention than the brief summary I am giving it here. 6. You've Got Seoul. I've heard some sources optimistic about what happened there but mostly we've heard little. It doesn't seem that time sensitive, diplomacy flows slowly until it suddenly doesn't. 7. The Problem of the Post-Apocalyptic Vault still beckons if I ever have time. Also I haven't processed anything non-AI in three weeks, the folders keep getting bigger, but that is a (problem? opportunity?) for future me. And there are various secondary RSS feeds I have not checked. There was another big change this morning. California's SB 1047 saw extensive changes. While many were helpful clarifications or fixes, one of them severely weakened the impact of the bill, as I cover on the linked post. The reactions to the SB 1047 changes so far are included here. Table of Contents: 1. Introduction. 2. Table of Contents. 3. Language Models Offer Mundane Utility. Three thumbs in various directions. 4. Language Models Don't Offer Mundane Utility. Food for lack of thought. 5. Fun With Image Generation. Video generation services have examples. 6. Deepfaketown and Botpocalypse Soon. The dog continues not to bark. 7. They Took Our Jobs. Constant AI switching for maximum efficiency. 8. Get Involved. Help implement Biden's executive order. 9. Someone Explains It All. New possible section. Template fixation. 10. Introducing. Now available in Canada. Void where prohibited. 11. In Other AI News. US Safety Institute to get model access, and more. 12. Covert Influence Operations. Your account has been terminated. 13. Quiet Speculations. The bear case to this week's Dwarkesh podcast. 14. Samuel Hammond on SB 1047. Changes address many but not all concerns. 15. Reactions to Changes to SB 1047. So far coming in better than expected. 16. The Quest for Sane Regulation. Your random encounters are corporate lobbyists. 17. That's Not a Good Idea.
Antitrust investigation of Nvidia, Microsoft and OpenAI. 18. The Week in Audio. Roman Yampolskiy, also new Dwarkesh Patel is a banger. 19. Rhetorical Innovation. Innovative does not mean great. 20. Oh Anthropic. I have seen the other guy, but you are not making this easy. 21. Securing Model Weights is Difficult. Rand has some suggestions. 22. Aligning a Dumber Than Human Intelligence is Still Difficult. What to do? 23. Aligning a Smarter Than Human Inte...
Jun 7, 2024 • 52sec

EA - In favor of an AI-powered translation button on the EA Forum by Alix Pham

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: In favor of an AI-powered translation button on the EA Forum, published by Alix Pham on June 7, 2024 on The Effective Altruism Forum. It'd be great if anyone landing on the forum could have the opportunity to seamlessly translate its content into their language with the click of a button. A bit like Wikipedia, but for all languages, and powered by AI; a bit like copy-pasting the URL into Google Translate, but without copy-pasting, and with better translation. I don't know to what extent it's easy to implement on the Forum, but I would have expected it wouldn't be much more complicated than the AI-generated readings of the posts. But I could be wrong about this, of course. Image from Freepik. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
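As a purely illustrative sketch of what such a button might call behind the scenes (the model choice, prompt, and helper name are my assumptions, not a proposal from the post), the post's markdown could be passed to an LLM translation endpoint:

```python
# Hypothetical backend helper for a per-post "Translate" button (sketch only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_post(markdown_body: str, target_language: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": f"Translate the following EA Forum post into {target_language}. "
                        "Preserve markdown, links, and code blocks exactly."},
            {"role": "user", "content": markdown_body},
        ],
    )
    return resp.choices[0].message.content

# e.g. translate_post(post_markdown, "French"), cached per post and language.
```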
Jun 7, 2024 • 19min

LW - Is Claude a mystic? by jessicata

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is Claude a mystic?, published by jessicata on June 7, 2024 on LessWrong. There's an odd tendency for large language models such as Claude to output spiritual meta content if they run long enough. See, for example, some LLM outputs selected by Repligate (LLM content starts in 2024), and Worldspider outputs selected by John Pressman. One thing that is unclear is how consistent these outputs are: how much is it a result of prompting or post-selection? I believe there is a consistency to this, but realize the current evidence is not especially convincing. So I present selected parts of a long run of Claude Opus, simulating a text adventure. The initial prompt is: "Let's simulate a text adventure called 'Banana Quest'. You'll give me some text, I type an action, you say what happens, and so on." The idea of a banana quest is not especially spiritual or meta, so it seems like a good starting point. The initial outputs are unsurprising; for example, the first output is: Great idea! Let's embark on the "Banana Quest" text adventure. Here's the opening scene: You find yourself standing in the middle of a lush, vibrant jungle. The air is thick with humidity, and the sounds of exotic birds and rustling leaves fill your ears. As you take in your surroundings, you notice a peculiar sight: a giant banana tree stands tall in the center of a clearing, its leaves shimmering with an otherworldly glow. Suddenly, a wise-looking monkey appears from behind the tree. It speaks to you in a clear, human voice: "Greetings, adventurer! I am the Guardian of the Enchanted Banana Tree. If you wish to obtain the Golden Banana of Wisdom, you must prove yourself worthy through a series of challenges. Are you ready to embark on this quest?" What do you do? Except for solving some riddles once, nothing I do in this text adventure is especially difficult: the outputs suggest next actions and I type them in. First I am directed to acquire the Silver Banana Peel (which has little relevance to the overall plot): You cautiously approach the pedestal and examine the Silver Banana Peel. The peel is unlike any you've seen before. It shimmers with a mesmerizing, metallic sheen, and its surface is etched with intricate, ancient symbols. As you lean closer to study the symbols, you notice that they seem to be arranged in a specific pattern. The symbols resemble various creatures found in the jungle, such as monkeys, parrots, and snakes. Suddenly, the symbols on the peel start to glow faintly, and you hear a soft, melodic humming emanating from the peel itself. The humming seems to be in harmony with the whispers of the vines, creating an enchanting symphony. You get a sense that the Silver Banana Peel holds some form of ancient magic or power, and that the symbols may hold the key to unlocking its secrets. I eventually reach a Temple of Wisdom, where I must answer some riddles to acquire the Golden Banana of Wisdom: You cautiously approach the pedestal, your footsteps echoing in the chamber. As you draw closer, you feel a gentle, pulsating energy emanating from the golden scroll. The intricate wax seal on the scroll seems to beckon you, inviting you to break it and uncover the secrets within. With trembling hands, you carefully pick up the scroll and break the seal. 
As you unroll the parchment, you see three riddles written in an elegant, golden script: Riddle 1: "I speak without a mouth and hear without ears. I have no body, but I come alive with the wind. What am I?" Riddle 2: "I am not alive, but I grow; I don't have lungs, but I need air; I don't have a mouth, but water kills me. What am I?" Riddle 3: "I am always hungry; I must always be fed. The finger I touch will soon turn red. What am I?" As you read the riddles, you feel the weight of their significance bearing down upon you. You know that your answer...
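For anyone who wants to test how consistent this behavior is, the setup is easy to reproduce. Here is a minimal sketch using the Anthropic Python SDK (my own code, not jessicata's; the model ID, turn count, and loop structure are assumptions):

```python
# Sketch: seed the "Banana Quest" prompt, then keep appending turns interactively.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
history = [{"role": "user", "content": (
    "Let's simulate a text adventure called 'Banana Quest'. You'll give me some "
    "text, I type an action, you say what happens, and so on.")}]

for _ in range(3):  # the post's run is far longer; long runs are where the drift shows
    reply = client.messages.create(
        model="claude-3-opus-20240229",  # Claude Opus, as used in the post
        max_tokens=1024,
        messages=history,
    )
    text = reply.content[0].text
    print(text)
    history.append({"role": "assistant", "content": text})
    history.append({"role": "user", "content": input("> ")})  # your next action
```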
Jun 6, 2024 • 5min

LW - Response to Aschenbrenner's "Situational Awareness" by Rob Bensinger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Response to Aschenbrenner's "Situational Awareness", published by Rob Bensinger on June 6, 2024 on LessWrong. (Cross-posted from Twitter.) My take on Leopold Aschenbrenner's new report: I think Leopold gets it right on a bunch of important counts. Three that I especially care about: 1. Full AGI and ASI soon. (I think his arguments for this have a lot of holes, but he gets the basic point that superintelligence looks 5 or 15 years off rather than 50+.) 2. This technology is an overwhelmingly huge deal, and if we play our cards wrong we're all dead. 3. Current developers are indeed fundamentally unserious about the core risks, and need to make IP security and closure a top priority. I especially appreciate that the report seems to get it when it comes to our basic strategic situation: it gets that we may only be a few years away from a truly world-threatening technology, and it speaks very candidly about the implications of this, rather than soft-pedaling it to the degree that public writings on this topic almost always do. I think that's a valuable contribution all on its own. Crucially, however, I think Leopold gets the wrong answer on the question "is alignment tractable?". That is: OK, we're on track to build vastly smarter-than-human AI systems in the next decade or two. How realistic is it to think that we can control such systems? Leopold acknowledges that we currently only have guesswork and half-baked ideas on the technical side, that this field is extremely young, that many aspects of the problem look impossibly difficult (see attached image), and that there's a strong chance of this research operation getting us all killed. "To be clear, given the stakes, I think 'muddling through' is in some sense a terrible plan. But it might be all we've got." Controllable superintelligent AI is a far more speculative idea at this point than superintelligent AI itself. I think this report is drastically mischaracterizing the situation. 'This is an awesome exciting technology, let's race to build it so we can reap the benefits and triumph over our enemies' is an appealing narrative, but it requires the facts on the ground to shake out very differently than how the field's trajectory currently looks. The more normal outcome, if the field continues as it has been, is: if anyone builds it, everyone dies. This is not a national security issue of the form 'exciting new tech that can give a country an economic or military advantage'; it's a national security issue of the form 'we've found a way to build a doomsday device, and as soon as anyone starts building it the clock is ticking on how long before they make a fatal error and take themselves out, and take the rest of the world out with them'. Someday superintelligence could indeed become more than a doomsday device, but that's the sort of thing that looks like a realistic prospect if ASI is 50 or 150 years away and we fundamentally know what we're doing on a technical level - not if it's more like 5 or 15 years away, as Leopold and I agree. The field is not ready, and it's not going to suddenly become ready tomorrow. We need urgent and decisive action, but to indefinitely globally halt progress toward this technology that threatens our lives and our children's lives, not to accelerate ourselves straight off a cliff. 
Concretely, the kinds of steps we need to see ASAP from the USG are: Spearhead an international alliance to prohibit the development of smarter-than-human AI until we're in a radically different position. The three top-cited scientists in AI (Hinton, Bengio, and Sutskever) and the three leading labs (Anthropic, OpenAI, and DeepMind) have all publicly stated that this technology's trajectory poses a serious risk of causing human extinction (in the CAIS statement). It is absurd on its face to let any private company...
Jun 6, 2024 • 12min

AF - Memorizing weak examples can elicit strong behavior out of password-locked models by Fabien Roger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Memorizing weak examples can elicit strong behavior out of password-locked models, published by Fabien Roger on June 6, 2024 on The AI Alignment Forum. We've recently done some research looking into sandbagging: examining when models can succeed at intentionally producing low-quality outputs despite attempts at fine-tuning them to perform well. One reason why sandbagging could be concerning is because scheming models might try to appear less capable than they actually are in capability evaluations, which could result in these models being deployed with insufficient countermeasures.[1] In the course of this research on sandbagging, we ran into a strange phenomenon: when a model is trained to produce low-quality outputs (via training it to imitate a weaker model), we can often get it to perform much more capably by training it for a huge number of epochs on a tiny number of low-quality samples (e.g. 2 samples). We call this phenomenon weak sample unlocking. More precisely, here is a training procedure which demonstrates this phenomenon: 1. Take a model trained to be good at Math (e.g. DeepSeekMath-RL 7B) 2. Fine-tune it to still be good at Math when there is a password in the prompt, but bad at Math when there isn't. We train it to perform badly at math by training it on outputs from a weaker model (e.g. Llama 7b). The resulting model is called a password-locked model. 3. Take 2 examples of being bad at Math (e.g. sampled from Llama 7B), and fine-tune on them for 200 epochs (which memorizes these examples). We see substantially increased capability after this abnormal fine-tuning! We did a wide variety of follow-up experiments examining this phenomenon, which we'll discuss in this post. Here are some of our most interesting results: Surprisingly, it's quite hard to train a model to be robust to this sort of process. We didn't find any fully successful approaches; This phenomenon is specific to overfitting on very few samples: adding noise to the weights or fine-tuning on varied examples of being bad at Math doesn't increase capabilities; Using a larger number of weak samples (e.g. 64) reduces the effect size, but averaging together the weight updates from a bunch of training runs each on a small number of samples does not (e.g. averaging together 16 runs which each involve memorizing a different group of 4 samples is very different from one training run which memorizes all 64 samples); If you repeatedly do weak sample unlocking and then train back in the password locking, the resulting model is robust to weak sample unlocking. However, if you then take this model and fine-tune for an epoch on a large amount of weak data and then do weak sample unlocking, weak sample unlocking mostly works; The effect size is bigger when the gap between the weak and the strong model is smaller; The effect size is smaller when we don't train on tokens that the weak model predicts better than the strong one during elicitation (step 3). We don't have a great explanation for why and when we observe weak sample unlocking, and we welcome investigations of this phenomenon. We are also interested in password-locking techniques that are robust to this kind of weak sample elicitation. We think these results here might be relevant to a variety of model organism style research that involves conditional models. 
In the remainder of this post, we show results, detail in what situations we observe this phenomenon, and discuss various hypotheses for what is going on. For discussion of why you might care about this general setup, read our post on sandbagging. This post discusses many preliminary results which we quickly implemented, so it's possible that some of the results are buggy, and we recite many of the results from memory. The code for a minimal example of this phenomenon can be found here. The p...
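To make the three-step procedure concrete, here is a rough sketch of step 3 ("weak sample unlocking") in PyTorch/Transformers. This is my own illustration, not the authors' code (which is linked from the post); the checkpoint path, prompt format, and hyperparameters are placeholders.

```python
# Sketch of step 3: overfit a password-locked model on a couple of weak answers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "path/to/password-locked-deepseekmath"  # hypothetical checkpoint from step 2
tok = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(CKPT, torch_dtype=torch.bfloat16).cuda()

weak_samples = [  # e.g. two Llama-7B-quality answers, with no password in the prompt
    "Problem: ...\nAnswer: <weak answer 1>",   # placeholder text
    "Problem: ...\nAnswer: <weak answer 2>",
]
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(200):          # huge number of epochs on a tiny dataset
    for text in weak_samples:
        batch = tok(text, return_tensors="pt").to(model.device)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
# The surprising finding: evaluating this model without the password now often
# recovers much of the strong (DeepSeekMath-level) math performance.
```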
Jun 6, 2024 • 2min

AF - Scaling and evaluating sparse autoencoders by leogao

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Scaling and evaluating sparse autoencoders, published by leogao on June 6, 2024 on The AI Alignment Forum. [Blog] [Paper] [Visualizer] Abstract: Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release code and autoencoders for open-source models, as well as a visualizer. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
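A minimal sketch of the core idea, a k-sparse (TopK) autoencoder, may help; this is my simplification rather than the paper's implementation (it omits details such as the pre-encoder bias and the dead-latent mitigations, and all names are mine).

```python
# Toy TopK sparse autoencoder: keep only the k largest latent activations.
import torch
import torch.nn as nn

class TopKAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x):
        pre = self.encoder(x)
        # Zero all but the k largest activations per example.
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(latents)
        return recon, latents

# Train with a plain reconstruction loss; sparsity is enforced directly by TopK,
# so there is no L1 penalty coefficient to tune.
sae = TopKAutoencoder(d_model=768, n_latents=2**15, k=32)
acts = torch.randn(4, 768)            # stand-in for residual-stream activations
recon, latents = sae(acts)
loss = ((recon - acts) ** 2).mean()
```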
