The Nonlinear Library

The Nonlinear Fund
Jul 7, 2024 • 10min

LW - AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0 by James Fox

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0, published by James Fox on July 7, 2024 on LessWrong. TL;DR: We are excited to announce the fourth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! ARENA's mission is to provide talented individuals with the skills, tools, and environment necessary for upskilling in ML engineering, for the purpose of contributing directly to AI alignment in technical roles. ARENA will be running in-person at LISA from 2nd September - 4th October (the first week is an optional review of the fundamentals of neural networks). Apply here before 23:59 July 20th anywhere on Earth! Summary: ARENA has been successfully run three times, with alumni going on to become MATS scholars and LASR participants; AI safety engineers at Apollo Research, Anthropic, METR, and OpenAI; and even starting their own AI safety organisations! This iteration will run from 2nd September - 4th October (the first week is an optional review of the fundamentals of neural networks) at the London Initiative for Safe AI (LISA) in Old Street, London. LISA houses small organisations (e.g., Apollo Research, BlueDot Impact), several other AI safety researcher development programmes (e.g., LASR Labs, MATS extension, PIBBS, Pivotal), and many individual researchers (independent and externally affiliated). Being situated at LISA, therefore, brings several benefits, e.g. facilitating productive discussions about AI safety & different agendas, allowing participants to form a better picture of what working on AI safety can look like in practice, and offering chances for research collaborations post-ARENA. The main goals of ARENA are to: Help participants skill up in ML relevant for AI alignment. Produce researchers and engineers who want to work in alignment and help them make concrete next career steps. Help participants develop inside views about AI safety and the paths to impact of different agendas. The programme's structure will remain broadly the same as ARENA 3.0 (see below); however, we are adding an additional week on evaluations. For more information, see our website. Also, note that we have a Slack group designed to support the independent study of the material (join link here). Outline of Content: The 4-5 week programme will be structured as follows: Chapter 0 - Fundamentals: Before getting into more advanced topics, we first cover the basics of deep learning, including basic machine learning terminology, what neural networks are, and how to train them. We will also cover some subjects we expect to be useful going forward, e.g. using GPT-3 and 4 to streamline your learning, good coding practices, and version control. Note: Participants can optionally skip this week of the programme and join us at the start of Chapter 1, if they'd prefer and if we're confident that they are already comfortable with the material in this chapter. Topics include: PyTorch basics; CNNs and Residual Neural Networks; Optimization (SGD, Adam, etc.); Backpropagation; Hyperparameter search with Weights and Biases; GANs & VAEs. Chapter 1 - Transformers & Interpretability: In this chapter, you will learn all about transformers and build and train your own.
You'll also study LLM interpretability, a field which has been advanced by Anthropic's Transformer Circuits sequence, and open-source work by Neel Nanda. This chapter will also branch into areas more accurately classed as "model internals" than interpretability, e.g. recent work on steering vectors. Topics include: GPT models (building your own GPT-2); Training and sampling from transformers; TransformerLens; In-context Learning and Induction Heads; Indirect Object Identification; Superposition; Steering Vectors. Chapter 2 - Reinforcement Learning: In this chapter, you w...
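For readers who want a taste of the Chapter 1 material, here is a minimal sketch (not taken from the ARENA curriculum itself; the prompt and the particular calls are illustrative choices) of loading GPT-2 with TransformerLens and caching its activations:

```python
# Minimal sketch: load GPT-2 with TransformerLens and inspect cached activations.
# Illustrative only -- not an ARENA exercise; prompt and hook choices are arbitrary.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small GPT-2, ~124M parameters

prompt = "The quick brown fox jumps over the lazy"
tokens = model.to_tokens(prompt)               # [batch=1, seq] token ids, BOS prepended
logits, cache = model.run_with_cache(tokens)   # forward pass, caching intermediate activations

# Greedy next-token prediction from the final position
next_token_id = logits[0, -1].argmax().item()
print(model.tokenizer.decode([next_token_id]))  # likely " dog" for this prompt

# Attention pattern of layer 0, head 0: shape [query_pos, key_pos]
attn = cache["pattern", 0][0, 0]
print(attn.shape)
```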
Jul 6, 2024 • 11min

EA - Imperfection is OK by Richard Y Chappell

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Imperfection is OK, published by Richard Y Chappell on July 6, 2024 on The Effective Altruism Forum. tl;dr: We shouldn't expect normative perfection of people in general, or pretend to it ourselves. It's OK to be suboptimal: willingness to recognize the state is the first step to (sometimes) doing even better. (But still, be OK with the fact that you'll never be perfect.) [Probably old news for most of this audience, but it's sometimes helpful to reiterate this sort of thing.] Intro A curious feature of human nature is that we're very psychologically invested in seeing ourselves as good. Teaching applied ethics, it's striking how resistant students often are to any hint of moral self-critique. Meat-eaters will come up with the most transparently absurd rationalizations for disregarding all the torture that goes into producing their favorite meals. Some philosophers deny that having kids is good, simply because they're scared of the (non-)implication that people ought to have more kids.[1] And there's obviously plenty of motivated reasoning underlying dismissals of effective altruism in the public sphere. I wish we were all more OK with just admitting moral imperfection and openly admiring others who do more good than we, in various respects. The role of social norms Part of the issue seems to be that people feel strong social pressure not to admit to any divergence between what they (even ideally?) ought to do and what they actually do.[2] Anti-hypocrisy norms seem especially damaging here: there's a sense that it's better to have low standards than to have high standards that you struggle to always meet, even if your actual behavior is better (in absolute terms) in the latter case. In teaching, I try to push back against this by modelling tolerance of moral mistakes: "I still eat meat sometimes, even though I don't really think it's justifiable." My hope is that this helps to create a learning environment where students can be more honest with themselves - a sense of, "Oh, good, we don't have to pretend anymore." Otherwise, there can be an atmosphere of defensiveness when discussing such topics, as people wonder whether they are going to be subject to attack for their personal decisions. That obviously isn't conducive to open-minded inquiry. So I think it can be valuable to create a kind of "safe space" for moral mediocrity, and that this can even be the first step in encouraging people to appreciate that they could do better (and might even feel better about themselves if they did). In general, I think it's hard for moral motivations to win out over conformity and immediate gratification, so the most reliable way to do better is probably to develop a community of people with shared high standards. (That's something I find very valuable about the EA community, for example.) It's often easier to advocate that "we all" should do something valuable (pay higher taxes, eat vegan, tithe 10% to effective charities) than to do it unilaterally, when no-one else around you is doing the same. As a result, I think it makes sense to be pretty tolerant of people in different social circumstances who are just conforming to their local norms. But I'm inclined to take a stricter stance when it comes to intellectual demands: everyone should acknowledge moral truths, even when they struggle to live up to them. 
Even though I eat meat, I can certainly acknowledge that veganism is better, and celebrate when a community successfully shifts its norms to make going vegan easier. And I think this is basically the stance that people who don't donate to effective charities should have towards effective altruism, for example. It's fine (not great, but fine) if you prefer to spend your money on yourself. We all do, to some degree. But that's no excuse for opposing effective philanthropy. Just be...
Jul 6, 2024 • 5min

AF - Scalable oversight as a quantitative rather than qualitative problem by Buck Shlegeris

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Scalable oversight as a quantitative rather than qualitative problem, published by Buck Shlegeris on July 6, 2024 on The AI Alignment Forum. [Many of these ideas were developed in conversation with Ryan Greenblatt and Ansh Radhakrishnan; a lot of this isn't original but I haven't seen it written up] A lot of the time when people talk about scalable oversight, which I'll define for the moment as "the problem of developing techniques that allow you to assess whether an action taken or proposed by an AI is good (oversee it), in a way that works even if the AI is very smart and knowledgeable", the main application they describe is the following situation: you have an extremely powerful AI that's taking actions that might have subtle properties that humans are unable to understand, and you want to be able to oversee them (e.g. to train it, or to determine whether the actions will have catastrophic consequences). I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it, but where the amount of time needed to oversee each action is prohibitively large. In such cases, the point of scalable oversight is to allow humans to oversee actions at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost. Of course, when you're thinking about arbitrarily subtle actions, this distinction doesn't matter: there are actions subtle enough that overseeing them would take more than the whole budget of the AI project, and for those actions these definitions are equivalent. But I think that this distinction will be very important in practice, and I often think that people make reasoning errors when thinking about scalable oversight due to neglecting it. The basic reason to think that it's important is that we probably need to oversee many model actions, and so we probably need to decide on a per-action oversight budget (e.g. the amount of human labor spent on overseeing each action) that will determine what subtlety of actions we will or won't be able to resolve, but we probably won't choose a per-action oversight budget that is close to the whole AI project budget. Another reason this feels important to me is that, as discussed here, I think a lot of AI risk comes from early transformative models that aren't that much smarter than humans; when thinking through the risk associated with them, it feels pretty implausible that a dangerous action of theirs would be unoverseeable if the whole AI project's budget was spent on overseeing it, but it feels very plausible that with the practical per-action oversight budget, the actions will be too subtle to oversee properly. What implications does this have for research? One question that seems extremely important to me as a result of all this is: what sample efficiency should we expect when training very powerful models? This is relevant because it's a key part of forecasting the per-action oversight budget.
In particular, if the sample efficiency of RL increases with model scale, it might turn out that the optimal strategy for RLing early transformative models is to produce many fewer and much more expensive labels than people use when training current systems; I think people often neglect this possibility when thinking about the future of scalable oversight. I'd love to see some careful research analyzing the tradeoffs between label quality and quantity, and how these change with model scale. (Ansh has done a bit of investigation along these lines.) When imagining scalable oversight in the future, I don't think people should entirely be imagining cases where the research...
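To make the quantitative framing concrete, here is a toy back-of-the-envelope sketch; the numbers are hypothetical and not from the post:

```python
# Toy illustration of why oversight is a quantitative problem: a fixed pool of human
# labour spread over many AI actions yields a small per-action budget, far below the
# "whole project budget" that the qualitative framing imagines. Numbers are made up.

total_human_hours = 1_000_000      # hypothetical total oversight labour available
actions_to_oversee = 10_000_000    # hypothetical number of AI actions needing oversight

per_action_budget_hours = total_human_hours / actions_to_oversee
print(f"Per-action oversight budget: {per_action_budget_hours * 60:.0f} minutes")  # 6 minutes

# On this framing, scalable oversight tries to push out the Pareto frontier of oversight
# quality vs cost: how subtle an action can be assessed within ~6 minutes of human labour,
# not whether it could be assessed given the project's entire budget.
```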
Jul 6, 2024 • 8min

EA - Why I'm working on AI welfare by kyle fish

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I'm working on AI welfare, published by kyle fish on July 6, 2024 on The Effective Altruism Forum. Since Alvea wound down, AI moral patienthood/welfare/wellbeing[1] has been my primary professional/impact focus. I started exploring this topic casually last fall, then spent the winter investigating potential grantmaking opportunities and related strategy questions. Now, I'm working with Rob Long on a combination of research projects and engagement with leading AI labs on AI moral patienthood and related issues. I'm approaching this topic as one component of the broader puzzle of making a future with transformative AI go well, rather than as an independent/standalone issue. My overall focus is on building a clearer picture of relevant questions and issues, developing tools and recommendations for how labs and other stakeholders might navigate them, and improving the discourse and decision-making around this topic. I'm not focused on general advocacy for digital minds - there are significant potential risks associated with both under- and over-attributing moral status to AI systems, which makes it important to keep the focus on getting the issues right, rather than prematurely pushing for particular conclusions. For AI welfare debate week, I'll briefly share the core factors contributing to my prioritization of this topic: 1. Direct longtermist relevance: Decisions we make soon about the development, deployment, and treatment of AI systems could have lasting effects on the wellbeing of digital minds, with potentially enormous significance for the moral value of the future. 1. I put some weight on the simple argument that the vast majority of the moral value of the future may come from the wellbeing/experiences of digital beings, and that therefore we ought to work now to understand this issue and take relevant actions. 2. I don't think strong versions of this argument hold, given the possibility of deferring (certain kinds of) work on this topic until advanced AI systems can help or solve it on their own, the potentially overwhelming importance of alignment for existential security, and other factors. As such, I still rank alignment substantially higher than AI welfare in terms of overall importance from a longtermist standpoint. 3. However, I don't think the longtermist value of working on AI welfare now is entirely defeated by these concerns. It still seems plausible enough to me that there are important path dependencies due to e.g. the need to make critical decisions prior to the development of transformative AI, that some low single-digit percentage of the total resources going into alignment should be invested here as a baseline. 2. Synergies with other longtermist priorities: Work on the potential wellbeing of AI systems may be instrumentally/synergistically valuable to work on other longtermist priorities, particularly AI alignment and safety. 1. Understanding the potential interests, motivations, goals, and experiences of AI systems - a key aim of work on AI moral patienthood - seems broadly useful in efforts to build positive futures with such systems. 2. It seems plausible to me that AI welfare will become a major topic of public interest/concern, and that it will factor into many critical decisions about AI development/deployment in general. 
By default, I do not expect the reasoning and decision-making about this topic to be good - this seems like a place where early thoughtful efforts could make a big positive difference. 3. There are also potential tensions that arise between AI welfare and other longtermist priorities (e.g. the possibility that some AI safety/control strategies involve mistreating AI systems), but my current view is that these topics are, or at least can be, predominantly synergistic. 3. Near-term/moral decency considerations: Taking ...
Jul 6, 2024 • 2min

EA - Notes on new UK AISI minister by Pseudaemonia

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Notes on new UK AISI minister, published by Pseudaemonia on July 6, 2024 on The Effective Altruism Forum. As of today, Peter Kyle is the UK's new Secretary of State for Science, Innovation and Technology, which means he's the elected official responsible for UK AISI. Here are some brief notes I found on Peter Kyle's stance on AI (all from his Wikipedia page). Peter Kyle was appointed as Shadow Secretary of State for Science, Innovation and Technology in September 2023. In October 2023, Kyle said that a future Labour government would reduce regulations on technology companies in the development and use of AI and "put it to work for everyone from every background." In November 2023, Kyle outlined Labour policies to impose stricter regulations on general AI research companies such as OpenAI and Anthropic with stronger requirements for reporting, data-sharing, and user safety. Kyle has also proposed the creation of a 'Regulatory Innovation office', which would expedite important regulatory decisions. In February 2024, Kyle announced that Labour would scrap the existing voluntary testing agreement between technology companies and the British government, instead creating a statutory regime in which AI businesses would be required to share with the government the results of their test data. He claimed that this would enable the government to better understand the risks involved in AI and the challenges it could pose to different industries and society in general. He also met with representatives of the United States government and AI companies in Washington, United States, including Microsoft, Amazon, Meta, Google, Apple, OpenAI, and Anthropic. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 6, 2024 • 5min

LW - Musings on LLM Scale (Jul 2024) by Vladimir Nesov

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Musings on LLM Scale (Jul 2024), published by Vladimir Nesov on July 6, 2024 on LessWrong. In a recent interview, Dario Amodei claimed that the cost of training is (starting with models already available): "Right now, $100 million. There are models in training today that are more like a $1 billion. I think if we go to $10 or a $100 billion, and I think that will happen in 2025-2026, maybe 2027, ..." (Epistemic status: Fermi estimates, 8 is approximately 10 which is greater than 9.) Assuming $40,000 per H100 and associated infrastructure in a datacenter, $1 billion gives 25K H100s, which matches the scale of for example Meta's new training clusters and requires about 40MW of power. At $2 per hour, training time cost of 25K H100s reaches $100 million in 80 days, which seems reasonable if on the short side for a production training run. The cost of time matches $1 billion at 2.3 years. An H100 (SXM) is rated for 2e15 FLOP/s in BF16 (my impression is this is usually stable out of the box). This becomes 4e15 FLOP/s in FP8, which seems practical if done carefully, with no degradation in pre-training loss compared to FP32. The $100 million run then translates to 9e25 FLOPs at 30% utilization in BF16, or 2e26 FLOPs in FP8. (For some reason this SemiAnalysis estimate is 2x lower, peak 2e20 FLOP/s for 100,000 H100s at FP8; possibly the sparsity footnote in the H100 specification for the 4000 teraFLOP/s figure is the culprit.) This is maybe 10x original GPT-4, estimated at 2e25 FLOPs. The leading models (Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4 Omni) cost $15-20 per million output tokens, compared to $75-120 for once-frontier models Claude 3 Opus, Gemini 1 Ultra, original GPT-4. Given a Chinchilla optimal model, if we reduce its active parameters 3x and increase training compute 3x, we get approximately the same performance, but it's now at least 3x cheaper for inference. This increases data 10x, which if everything else fails can be obtained by repeating the old data, giving 30x overtraining in compute compared to what is Chinchilla optimal for the smaller model. Llama-3-70b is overtrained 10x, Llama-3-8b 90x, though they don't use MoE and their performance is lower than for MoE models with the same active parameters and training cost. Beyond $100 million: The current frontier models are overtrained on compute that could enable even smarter models. Compute is increasing, but it mostly goes to reduction of inference cost, and only a little bit to capabilities. Why aren't any of the three labs directing the compute to train/release models optimized for maximum capability? Possibly costs are already such that training at too many parameter/data tradeoff points won't be done; instead they choose an option that's currently most useful and spend the rest on experiments that would make imminent larger scale runs better. Even OpenAI's next frontier model in training as of May 28 might just be using compute comparable to what GPT-4 Omni required, not OOMs more, and it could still get much more capable if allowed to be more expensive for inference. To do a run at $1 billion in cost of time, even 100K H100s would need 200 days (powered by 150MW). There probably aren't any individual clusters of this scale yet (which would cost about $4 billion).
The Gemini 1.0 report stated that "Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. ... we combine SuperPods in multiple datacenters using Google's intra-cluster and inter-cluster network. Google's network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods." This together with Amodei's claim of current $1 billion training runs and individual 100K H100 clusters still getting built ...
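A short sketch reproducing the post's Fermi arithmetic (the inputs are the post's own assumptions; the result lands near the stated 9e25 BF16 figure, with the small gap well inside the order-of-magnitude error bars):

```python
# Reproduce the rough cluster-size, cost, and compute estimates from the post.

cost_per_h100 = 40_000             # $ per H100 plus associated infrastructure
cluster_cost = 1e9                 # a $1 billion cluster
n_gpus = cluster_cost / cost_per_h100
print(f"GPUs: {n_gpus:,.0f}")      # ~25,000 H100s

hourly_rate = 2                    # $ per GPU-hour
days_to_100m = 100e6 / (n_gpus * hourly_rate * 24)
print(f"Days to spend $100M: {days_to_100m:.0f}")  # ~83 days, i.e. roughly 80

bf16_flops = 2e15                  # dense BF16 FLOP/s per H100
utilization = 0.30
train_seconds = 80 * 86_400
total_flops = n_gpus * bf16_flops * utilization * train_seconds
print(f"Training compute: {total_flops:.1e} FLOP")  # ~1e26, i.e. ~9e25 within rounding
```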
Jul 6, 2024 • 3min

AF - What progress have we made on automated auditing? by Lawrence Chan

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What progress have we made on automated auditing?, published by Lawrence Chan on July 6, 2024 on The AI Alignment Forum. One use case for model internals work is to perform automated auditing of models: https://www.alignmentforum.org/posts/cQwT8asti3kyA62zc/automating-auditing-an-ambitious-concrete-technical-research That is, given a specification of intended behavior, the attacker produces a model that doesn't satisfy the spec, and the auditor needs to determine how the model doesn't satisfy the spec. This is closely related to static backdoor detection: given a model M, determine if there exists a backdoor function that, for any input, transforms that input to one where M has different behavior.[1] There's some theoretical work (Goldwasser et al. 2022) arguing that for some model classes, static backdoor detection is impossible even given white-box model access -- specifically, they prove their results for random feature regression and the (very similar) setting of wide 1-layer ReLU networks. Relatedly, there's been some work looking at provably bounding model performance (Gross et al. 2024) -- if this succeeds on "real" models and "real" specifications, then this would solve the automated auditing game. But the results so far are on toy transformers, and are quite weak in general (in part because the task is so difficult).[2] Probably the most relevant work is Halawi et al. 2024's Covert Malicious Finetuning (CMFT), where they demonstrate that it's possible to use finetuning to insert jailbreaks and extract harmful work, in ways that are hard to detect with ordinary harmlessness classifiers.[3] As this is machine learning, just because something is impossible in theory and difficult on toy models doesn't mean we can't do this in practice. It seems plausible to me that we've demonstrated non-zero empirical results in terms of automatically auditing model internals. So I'm curious: how much progress have we made on automated auditing empirically? What work exists in this area? What does the state-of-the-art in automated auditing look like? 1. ^ Note that I'm not asking about mechanistic anomaly detection/dynamic backdoor detection; I'm aware that it's pretty easy to distinguish if a particular example is backdoored using baseline techniques like "fit a Gaussian density on activations and look at the log prob of the activations on each input" or "fit a linear probe on a handful of examples using logistic regression". 2. ^ I'm also aware of some of the work in the trojan detection space, including the 2023 Trojan detection contest, where performance on extracting embedded triggers was little better than chance. 3. ^ That being said, it's plausible that dynamically detecting them given model internals is very easy. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
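For reference, the per-input baseline dismissed in footnote 1 ("fit a Gaussian density on activations and look at the log prob") might look roughly like the sketch below. It addresses dynamic, per-example detection rather than the static auditing problem the post is asking about, and the activation-collection step is assumed rather than shown:

```python
# Sketch of a dynamic backdoor-detection baseline: fit a Gaussian to activations from
# trusted inputs, then flag inputs whose activations have unusually low log-probability.
# Synthetic stand-in data below; real uses would collect residual-stream activations
# from the model under audit.
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from scipy.stats import multivariate_normal

def fit_gaussian(clean_acts: np.ndarray):
    """clean_acts: [n_examples, d_model] activations gathered on trusted inputs."""
    mean = clean_acts.mean(axis=0)
    cov = EmpiricalCovariance().fit(clean_acts).covariance_
    return multivariate_normal(mean=mean, cov=cov, allow_singular=True)

def anomaly_scores(density, acts: np.ndarray) -> np.ndarray:
    """Higher score = more anomalous (lower log-prob under the clean-activation Gaussian)."""
    return -density.logpdf(acts)

rng = np.random.default_rng(0)
clean = rng.normal(size=(1000, 16))
suspect = clean[:10] + 5.0          # shifted activations standing in for backdoored inputs
density = fit_gaussian(clean)
print(anomaly_scores(density, clean[:10]).mean(), anomaly_scores(density, suspect).mean())
```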
Jul 6, 2024 • 11min

EA - Case study: The Lübeck vaccine by NunoSempere

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Case study: The Lübeck vaccine, published by NunoSempere on July 6, 2024 on The Effective Altruism Forum. Winfried Stöcker is a German doctor. After a career developing immunization technologies, he sells Euroimmun to Perkin Elmer, a US megacorp, for $1.3B; his own majority stake would be worth $600M, so after taxes a German magazine estimated his net worth at circa $300M. After selling the company, he makes a variety of investments, like buying a nearby airport, and continues working on his own, smaller lab, Labor Stöcker. He also donates to the Alternative für Deutschland (AfD), a German far-right party, and positions himself as being against more immigration. Come COVID, the guy realizes that he has the relevant expertise, and in March 2020 develops a vaccine in his personal lab. He tests this vaccine on himself, and later on family and employees. After some experiments, he becomes sure of the vaccine's efficacy and harmlessness, as it uses the same mechanisms as e.g., Hepatitis vaccines. He decides that, in times of urgency, double blind trials would take too long, and organizes a vaccine drive on his airport. But the Paul Ehrlich Institute and the Landesamt für soziale Dienste sue him, and police interrupt the vaccine drive. In Stöcker's words: "After my resounding success with the first five immunizations (within my own family), I applied for approval for a corresponding study in September 2020 (Wednesday, 2 September 2020, 18:52) with the head of the Paul Ehrlich Institute, Klaus Cichutek. I tried to make it clear to him that within six months I could have immunized all of Germany safely and efficiently against Covid-19. And yet, instead of eliciting enthusiasm from this important senior civil servant, I aroused his displeasure, apparently because he felt overlooked. Or perhaps there were other interests involved. And so, he had me prosecuted (the proceedings were closed)..." I'm not quite clear on the finer legal points, but although the vaccination drive at his airport seems to have been somehow illegal, Stöcker finds another legal interpretation such that individual doctors can order two vaccine components, an "antigen" and an "adjuvant", and join them together. Separate, they aren't a vaccine, and so their distribution isn't prohibited, per Stöcker's interpretation of the Arzneimittelgesetz. The story then continues, with Stöcker becoming convinced that his vaccine is superior to more recent mRNA vaccines, and winning some legal cases and losing others. Recently, he boasted of reaching 100,000 vaccinations. But he also received a €250K fine for the vaccination drive in his airport. Cichutek, the bureaucrat, received a Federal Order of Merit, but is no longer the chief of the Paul Ehrlich Institute. Overall, sources defending Stöcker's point of view are more numerous and accessible than sources presenting the opposite perspective. Still, here is a blog post outlining the lack of information about quality assurance processes, and other problems with the potential vaccine. Ultimately, it could be the case that the Lübeck vaccine was inferior, and that legal censure was indeed justified. However, my impression is that, on the balance of probabilities, that's not the case.
Particularly in the early days of the pandemic, scaling a cheap method seems like it would have been much better than subjecting the population to the also uncertain effects of more infections. And as time goes on, having had different vaccines also seems like it would have been more robust. Still, there are many technical details that I don't understand. With this in mind, what are some possible lessons? One clear one seems to be that Stöcker's support of the AfD generated general antipathy, and reduced his freedom for action. It could have been the deciding factor in having hostile interactions with the opaque bureaucra...
Jul 6, 2024 • 41min

AF - A "Bitter Lesson" Approach to Aligning AGI and ASI by Roger Dearnaley

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A "Bitter Lesson" Approach to Aligning AGI and ASI, published by Roger Dearnaley on July 6, 2024 on The AI Alignment Forum. TL;DR: I discuss the challenge of aligning AGI/ASI, and outline an extremely simple approach to aligning an LLM: train entirely on a synthetic dataset that always shows the AI acting aligned (even when the humans behave badly), and use a conditional training/inference-time technique to lock the LLM into the AI role. Epistemic status: To me, this looks like an obvious thing to try. It's conceptually very simple: a vast amount of work is required to actually create the synthetic dataset, but the great majority of that is the sort of work that AI can assist with. I don't see any clear reason why this approach couldn't work, at least for AGI, and perhaps even for ASI, but then we don't know for sure how hard a problem Alignment is. However, if you're proposing any solution to Alignment that's more complicated than this (and most of them are), you should probably have an argument for why this conceptually-simple approach won't work, or won't be sufficient. If you're not already familiar with it, you should first read Rich Sutton's excellent and influential post The Bitter Lesson. (Even if you are already familiar with it, it's a quick reread, only a page-and-a-half long, and its message is worth remembering.) Why The Alignment Problem is Hard (In My Opinion): We have been training LLM-based AIs off enormous web + books + video + etc datasets created by humans, which are full of a vast number of examples of human behavior. We are basically "distilling" human intelligence into these LLMs,[1] teaching them to imitate us. In this process, they become familiar with, understand, and learn to imitate basically all aspects of human behavior - including the many problematic ones for Alignment, such as prejudice, deception, power-seeking, and criminality (and even ones like gluttony and lust that have little practical use for a non-corporeal intelligence). We humans are living beings, the products of evolution, so evolutionary psychology applies to us. While we are a social species, good at cooperating on non-zero-sum games, if you put humans in (what they perceive as) a non-iterated zero-sum situation, they will generally act selfishly for the benefit of themselves and their close genetic relatives, just as evolutionary theory would predict. So the behavioral potentials for deception, power-seeking, criminality etc. are all inherent, evolutionarily adaptive, and thus unsurprising. This is human nature, and there are evolutionary reasons why it is this way. Despite this, we have learned how to build a cooperating society out of humans, using social techniques and incentives such as an economy, laws, and law enforcement to encourage and productively harness cooperative human behavior and keep the bad consequences of selfish behavior under control. The results aren't perfect: things like crime, inequality, and war still happen, but they're acceptable - we've survived so far, even thrived.
By default, if we continue this LLM training process to larger-and-larger scales, and if the LLM-based approach to AI doesn't hit any major roadblocks, then some time, probably in the next few years, we will have human-level AIs - usually referred to as AGIs - who are roughly as well/badly-aligned as humans, and (at least for the base-model LLMs before any Alignment processes are applied) have a comparable-to-human propensity to cooperate on non-zero-sum games and act selfishly on non-iterated zero-sum games. They are not alive, and evolution doesn't apply to them directly, but they were trained to simulate our behavior, including our evolved survival strategies like selfishness. They will thus have alignment properties comparable to humans: they understand what human values, mor...
Jul 5, 2024 • 20min

LW - [Interim research report] Activation plateaus & sensitive directions in GPT2 by StefanHex

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Interim research report] Activation plateaus & sensitive directions in GPT2, published by StefanHex on July 5, 2024 on LessWrong. This part-report / part-proposal describes ongoing research, but I'd like to share early results for feedback. I am especially interested in any comments finding mistakes or trivial explanations for these results. I will work on this proposal with a LASR Labs team over the next 3 months. If you are working (or want to work) on something similar, I would love to chat! Experiments and write-up by Stefan, with substantial inspiration and advice from Jake (who doesn't necessarily endorse every sloppy statement I write). Work produced at Apollo Research. TL;DR: Toy models of how neural networks compute new features in superposition seem to imply that neural networks that utilize superposition require some form of error correction to avoid interference spiraling out of control. This means small variations along a feature direction shouldn't affect model outputs, which I can test: 1. Activation plateaus: Real activations should be resistant to small perturbations. There should be a "plateau" in the output as a function of perturbation size. 2. Sensitive directions: Perturbations towards the direction of a feature should change the model output earlier (at a lower perturbation size) than perturbations into a random direction. I find that both of these predictions hold; the latter when I operationalize "feature" as the difference between two real model activations. As next steps, we are planning to test both predictions for SAE features (we have some evidence for the latter from Gurnee (2024) and Lindsey (2024)), and to ask: Are there different types of SAE features, atomic and composite features? Can we get a handle on the total number of features? If sensitivity-features line up with SAE features, can we find or improve SAE feature directions by finding local optima in sensitivity (similar to how Mack & Turner (2024) find steering vectors)? My motivation for this project is to get data on computation in superposition, and to get dataset-independent evidence for (SAE-)features. Core results & discussion: I run two different experiments that test the error correction hypothesis: 1. Activation Plateaus: A real activation is the center of a plateau, in the sense that perturbing the activation affects the model output less than expected. Concretely: applying random-direction perturbations to an activation generated from a random openwebtext input ("real activation") has less effect than applying the same perturbations to a random activation (generated from a Normal distribution). This effect on the model can be measured in KL divergence of logits (shown below) but also L2 difference or cosine similarity of late-layer activations. 2. Sensitive directions: Perturbing a (real) activation into a direction towards another real activation ("poor man's feature directions") affects the model-outputs more than perturbing the same activation into a random direction. In the plot below, focus on the size of the "plateau" on the left-hand side. 1. Naive random direction vs mean & covariance-adjusted random: Naive isotropic random directions are much less sensitive. Thus we use mean & covariance-adjusted random activations everywhere else in this report. 2. 
The sensitive direction results are related to Gurnee (2024, SAE-replacement-error direction vs naive random direction) and Lindsey (2024, Anthropic April Updates, SAE-feature direction vs naive random direction). The theoretical explanation for activation plateaus & sensitive direction may be error correction (also referred to as noise suppression): NNs in superposition should expect small amounts of noise in feature activations due to interference. (The exact properties depend on how computation happens in superposition, this toy...
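A rough sketch (not the author's code) of the kind of perturbation experiment described above: add a perturbation of growing norm to a residual-stream activation in GPT-2 and measure the KL divergence of the output logits against the unperturbed run. The hook point, layer, prompt, and use of a single random direction are illustrative choices and do not reproduce the report's real-vs-random-activation comparison:

```python
# Perturb one residual-stream activation and measure how much the output distribution moves.
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")
hook_name = "blocks.6.hook_resid_pre"    # an arbitrary mid-layer residual-stream hook

clean_logits, cache = model.run_with_cache(tokens)
clean_logprobs = clean_logits[0, -1].log_softmax(-1)

direction = torch.randn_like(cache[hook_name][0, -1])
direction = direction / direction.norm()  # unit-norm random perturbation direction

def kl_after_perturbation(scale: float) -> float:
    def hook(resid, hook):
        resid[0, -1] += scale * direction  # perturb the final position's residual stream
        return resid
    logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, hook)])
    logprobs = logits[0, -1].log_softmax(-1)
    # KL(clean || perturbed) over the next-token distribution
    return F.kl_div(logprobs, clean_logprobs, log_target=True, reduction="sum").item()

for scale in [0.1, 1.0, 10.0, 50.0]:
    print(scale, kl_after_perturbation(scale))
# An "activation plateau" would show up as near-zero KL at small scales before it takes off.
```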
