The Nonlinear Library

The Nonlinear Fund
Jul 20, 2024 • 7min

AF - BatchTopK: A Simple Improvement for TopK-SAEs by Bart Bussmann

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: BatchTopK: A Simple Improvement for TopK-SAEs, published by Bart Bussmann on July 20, 2024 on The AI Alignment Forum. Work done in Neel Nanda's stream of MATS 6.0. Epistemic status: Tried this on a single sweep and seems to work well, but it might well be a fluke of something particular to our implementation or experimental set-up. As there are also some theoretical reasons to expect this technique to work (adaptive sparsity), it seems probable that for many TopK SAE set-ups it could be a good idea to also try BatchTopK. As we're not planning to investigate this much further and it might be useful to others, we're just sharing what we've found so far. TL;DR: Instead of taking the TopK feature activations per token during training, taking the Top(K*batch_size) for every batch seems to improve SAE performance. During inference, this activation can be replaced with a single global threshold for all features. Introduction Sparse autoencoders (SAEs) have emerged as a promising tool for interpreting the internal representations of large language models. By learning to reconstruct activations using only a small number of features, SAEs can extract monosemantic concepts from the representations inside transformer models. Recently, OpenAI published a paper exploring the use of TopK activation functions in SAEs. This approach directly enforces sparsity by only keeping the K largest activations per sample. While effective, TopK forces every token to use exactly K features, which is likely suboptimal. We came up with a simple modification that solves this and seems to improve performance. BatchTopK Standard TopK SAEs apply the TopK operation independently to each sample in a batch. For a target sparsity of K, this means exactly K features are activated for every sample. BatchTopK instead applies the TopK operation across the entire flattened batch: 1. Flatten all feature activations across the batch 2. Take the top (K * batch_size) activations 3. Reshape back to the original batch shape This allows more flexibility in how many features activate per sample, while still maintaining an average of K active features across the batch. Experimental Set-Up For both the TopK and the BatchTopK SAEs we train a sweep with the following hyperparameters: Model: gpt2-small Site: layer 8 resid_pre Batch size: 4096 Optimizer: Adam (lr=3e-4, beta1=0.9, beta2=0.99) Number of tokens: 1e9 Expansion factor: [4, 8, 16, 32] Target L0 (k): [16, 32, 64] As in the OpenAI paper, the input gets normalized before feeding it into the SAE and calculating the reconstruction loss. We also use the same auxiliary loss function for dead features (features that didn't activate for 5 batches), which calculates the loss on the residual using the top 512 dead features per sample and gets multiplied by a factor of 1/32. Results For a fixed number of active features (L0=32) the BatchTopK SAE has a lower normalized MSE than the TopK SAE and less downstream loss degradation across different dictionary sizes. Similarly, for a fixed dictionary size (12288) BatchTopK outperforms TopK for different values of k. Our main hypothesis is that the improved performance comes from adaptive sparsity: some samples contain more highly activating features than others. Let's have a look at the distribution of the number of active features per sample for the BatchTopK model.
The BatchTopK model indeed makes use of the possibility of using different sparsities for different inputs. We suspect that the weird peak on the left side corresponds to the feature activations on BOS-tokens, given that its frequency is very close to 1 in 128, which is the sequence length. This serves as a great example of why BatchTopK might outperform TopK. At the BOS-token, a sequence has very little information yet, but the TopK SAE still activates 32 features. The BatchTopK model "saves" th...
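As an illustration of the three-step procedure described above, here is a minimal PyTorch-style sketch of a BatchTopK activation; the function and variable names are illustrative and not taken from the authors' implementation:

```python
import torch

def batch_topk(feature_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the top (k * batch_size) feature activations across the whole batch.

    feature_acts: [batch_size, num_features] pre-activation feature values.
    Returns a tensor of the same shape with all other entries zeroed out, so the
    *average* number of active features per sample is k, while individual samples
    may use more or fewer features.
    """
    batch_size = feature_acts.shape[0]
    flat = feature_acts.flatten()          # 1. flatten all activations across the batch
    n_keep = k * batch_size
    top = torch.topk(flat, n_keep)         # 2. take the top (k * batch_size) activations
    mask = torch.zeros_like(flat)
    mask[top.indices] = 1.0
    return (flat * mask).reshape(feature_acts.shape)  # 3. reshape back to the batch shape
```

As the post notes, at inference time this batch-level operation can be replaced with a single global activation threshold shared across all features, so that a sample's sparsity no longer depends on which other samples happen to be in its batch.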
Jul 20, 2024 • 8min

EA - New Book: "Minimalist Axiologies: Alternatives to 'Good Minus Bad' Views of Value" by Teo Ajantaival

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New Book: "Minimalist Axiologies: Alternatives to 'Good Minus Bad' Views of Value", published by Teo Ajantaival on July 20, 2024 on The Effective Altruism Forum. I have just published a book version of my essay collection titled Minimalist Axiologies: Alternatives to 'Good Minus Bad' Views of Value. You can now read it in your format of choice, including paperback, free Kindle, or free paperback PDF. You can also download a free EPUB version from Smashwords or the Center for Reducing Suffering (CRS) website. To briefly explain what the book is about, below are some blurbs, the Preface, and an abridged Table of Contents. Blurbs "Teo Ajantaival's new book is an important, original, and tremendously valuable contribution to value theory, and a badly needed corrective to alternative theories that assume that moral goods and bads are simply additive. Even those who, in the end, may have reservations about a thoroughgoing 'minimalist' theory of value will benefit from Ajantaival's careful and persuasive presentation of this under-appreciated alternative." Clark Wolf, Director of Bioethics, Professor of Philosophy, Iowa State University "The idea that happiness and suffering have similar value, just with opposite signs, is so intuitive that it is often accepted without question. Only when we think more deeply about the meaning of intrinsic value does this intuition unravel - and along with it, the flawed notion that extreme suffering is always tolerable if there is enough bliss to compensate for it. In this volume, Teo Ajantaival strings together six standalone essays on what he terms "minimalist" theories of value, describing a range of views from philosophers who reject the "plus-minus" notion of value. A welcome contribution to the field of ethics, and to the rational justification for giving suffering the prominence it deserves." Jonathan Leighton, Executive Director of the Organisation for the Prevention of Intense Suffering (OPIS), author of The Battle for Compassion and The Tango of Ethics Preface Can suffering be counterbalanced by the creation of other things? Our answer to this question depends on how we think about the notion of positive value. In this book, I explore ethical views that reject the idea of intrinsic positive value, and which instead understand positive value in relational terms. Previously, these views have been called purely negative or purely suffering-focused views, and they often have roots in Buddhist or Epicurean philosophy. As a broad category of views, I call them minimalist views. The term "minimalist axiologies" specifically refers to minimalist views of value: views that essentially say "the less this, the better". Overall, I aim to highlight how these views are compatible with sensible and nuanced notions of positive value, wellbeing, and lives worth living. A key point throughout the book is that many of our seemingly intrinsic positive values can be considered valuable thanks to their helpful roles for reducing problems such as involuntary suffering. Thus, minimalist views are more compatible with our everyday intuitions about positive value than is usually recognized. This book is a collection of six essays that have previously been published online. Each of the essays is a standalone piece, and they can be read in any order depending on the reader's interests. 
So if you are interested in a specific topic, it makes sense to just read one or two essays, or even to just skim the book for new points or references. At the same time, the six essays all complement each other, and together they provide a more cohesive picture. Since I wanted to keep the essays readable as standalone pieces, the book includes significant repetition of key points and definitions between chapters. Additionally, many core points are repeated even within the sa...
Jul 19, 2024 • 28min

LW - Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions by Lidor Banuel Dabbah

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions, published by Lidor Banuel Dabbah on July 19, 2024 on LessWrong. Tl;dr: In this post we present the exploratory phase of a project aiming to study neural networks by applying static LLC estimation to specific alterations of them. We introduce a new method named Feature Targeted (FT) LLC estimation and study its ability to distinguish SAE-trained features from random directions. By comparing our method to other possible metrics, we demonstrate that it outperforms all of them but one, which has comparable performance. We discuss possible explanations for our results, our project and other future directions. Introduction Given a neural network M and a latent layer within it, L, a central motif in current mechanistic interpretability research is to find functions f: L → R [1] which are features of the model. Features are (generally) expected to exhibit the following properties: 1. Encode interpretable properties of the input. 2. Be causally relevant to the computation of the output of the model. 3. Encode the output of a certain submodule of our model M, i.e. a component, localized in weight space, which is responsible for a specific part of the total computation. While this is common wisdom, methods for automated feature evaluation usually focus on correlations between the (top) activations of the feature with human (or machine) recognizable interpretations, or on the effect of feature-related interventions on the output of the model. In particular, while the first and second items of the feature characterization above are central in current techniques, the third property, specifically the localized nature of the computation upstream of the feature, is less so[2]. We are currently investigating a direction which fills that gap, and this post shares the findings of the exploratory research we have conducted to validate and inform our approach. More specifically, we operationalized the concept of "weight-localized computation" using the local learning coefficient (LLC) introduced in Lau et al, following the learning coefficient first introduced in the context of singular learning theory. We apply LLC estimation to models associated with our base model and a feature within it, a method we call feature targeted (FT) LLC estimation. In this exploratory work we study FT-LLC estimates of specific models associated with SAE features. Most notably, we have found that: 1. FT-LLC estimates of SAE features are, on average, distinguishably higher than those of random directions. 2. For a particular variant of FT-LLC estimation, which we named the functional FT-LLC (defined in this section), this separation is pronounced enough that the vast majority of SAE features we studied are clearly separated from the random features we studied. Furthermore, most baseline metrics we compared it to (see here) are less capable of distinguishing SAE features from random directions, with only one performing on par with it. Section 1 introduces the main technique we study in this post, FT-LLC estimation, and section 2 outlines our motivations. Section 3 describes the details of our experimental setting, our results, and the comparison to baseline metrics.
In section 4 we discuss our overall takes, how they fit within our general agenda, and gaps we currently have in theoretically understanding them. Section 5 is devoted to outlining our next steps, the general direction of the project, and some other possible directions for further research. Lastly, we briefly discuss related work in section 6. What is FT-LLC? LLC estimation We start out by briefly recalling what the local learning coefficient (LLC) is. If you are unfamiliar with the term, we recommend reading this, the longer sequence here, or the paper on LLC estimation ...
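For readers who want a concrete picture of the quantity being estimated here, the sketch below shows SGLD-based LLC estimation in the style of Lau et al., i.e. lambda_hat = n * beta * (E[L_n(w)] - L_n(w*)), with the expectation taken over a tempered posterior localized at the trained weights w*. This is our own rough paraphrase for illustration, not the authors' code; the helper loss_fn and all hyperparameter names and values are placeholders:

```python
import copy
import math
import torch

def estimate_llc(model, loss_fn, batches, n_data,
                 n_steps=500, step_size=1e-5, beta=None, gamma=100.0):
    """Rough SGLD sketch of the local learning coefficient estimator:
    lambda_hat = n * beta * (E[L_n(w)] - L_n(w*)), where the expectation is over
    a tempered posterior localized at w* by a quadratic term with strength gamma.
    loss_fn(model, batch) is an assumed helper returning the mean loss on a batch;
    burn-in, multiple chains, and hyperparameter tuning are omitted for brevity.
    """
    beta = beta if beta is not None else 1.0 / math.log(n_data)
    w_star = [p.detach().clone() for p in model.parameters()]

    with torch.no_grad():  # loss at the trained weights w*
        loss_star = sum(loss_fn(model, b).item() for b in batches) / len(batches)

    sampler = copy.deepcopy(model)
    sampled_losses = []
    for step in range(n_steps):
        loss = loss_fn(sampler, batches[step % len(batches)])
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # SGLD step on the localized tempered posterior: move towards lower
                # loss and back towards w*, then add Gaussian noise.
                drift = n_data * beta * p.grad + gamma * (p - p0)
                p -= 0.5 * step_size * drift
                p += math.sqrt(step_size) * torch.randn_like(p)
        sampled_losses.append(loss.item())

    return n_data * beta * (sum(sampled_losses) / len(sampled_losses) - loss_star)
```

The "feature targeted" variants discussed in the post then apply this kind of estimator not to the base model directly, but to models constructed from the base model and a chosen feature; the details of that construction are in the full post.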
Jul 19, 2024 • 29min

AF - Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions by Lidor Banuel Dabbah

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions, published by Lidor Banuel Dabbah on July 19, 2024 on The AI Alignment Forum. Tl;dr: In this post we present the exploratory phase of a project aiming to study neural networks by applying static LLC estimation to specific alterations of them. We introduce a new method named Feature Targeted (FT) LLC estimation and study its ability to distinguish SAE-trained features from random directions. By comparing our method to other possible metrics, we demonstrate that it outperforms all of them but one, which has comparable performance. We discuss possible explanations for our results, our project and other future directions. Introduction Given a neural network M and a latent layer within it, L, a central motif in current mechanistic interpretability research is to find functions f: L → R [1] which are features of the model. Features are (generally) expected to exhibit the following properties: 1. Encode interpretable properties of the input. 2. Be causally relevant to the computation of the output of the model. 3. Encode the output of a certain submodule of our model M, i.e. a component, localized in weight space, which is responsible for a specific part of the total computation. While this is common wisdom, methods for automated feature evaluation usually focus on correlations between the (top) activations of the feature with human (or machine) recognizable interpretations, or on the effect of feature-related interventions on the output of the model. In particular, while the first and second items of the feature characterization above are central in current techniques, the third property, specifically the localized nature of the computation upstream of the feature, is less so[2]. We are currently investigating a direction which fills that gap, and this post shares the findings of the exploratory research we have conducted to validate and inform our approach. More specifically, we operationalized the concept of "weight-localized computation" using the local learning coefficient (LLC) introduced in Lau et al, following the learning coefficient first introduced in the context of singular learning theory. We apply LLC estimation to models associated with our base model and a feature within it, a method we call feature targeted (FT) LLC estimation. In this exploratory work we study FT-LLC estimates of specific models associated with SAE features. Most notably, we have found that: 1. FT-LLC estimates of SAE features are, on average, distinguishably higher than those of random directions. 2. For a particular variant of FT-LLC estimation, which we named the functional FT-LLC (defined in this section), this separation is pronounced enough that the vast majority of SAE features we studied are clearly separated from the random features we studied. Furthermore, most baseline metrics we compared it to (see here) are less capable of distinguishing SAE features from random directions, with only one performing on par with it. Section 1 introduces the main technique we study in this post, FT-LLC estimation, and section 2 outlines our motivations. Section 3 describes the details of our experimental setting, our results, and the comparison to baseline metrics.
In section 4 we discuss our overall takes, how they fit within our general agenda, and gaps we currently have in theoretically understanding them. Section 5 is devoted to outlining our next steps, the general direction of the project, and some other possible directions for further research. Lastly, we briefly discuss related work in section 6. What is FT-LLC? LLC estimation We start out by briefly recalling what the local learning coefficient (LLC) is. If you are unfamiliar with the term, we recommend reading this, the longer sequence here, or the paper on LL...
Jul 19, 2024 • 5min

AF - Truth is Universal: Robust Detection of Lies in LLMs by Lennart Buerger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Truth is Universal: Robust Detection of Lies in LLMs, published by Lennart Buerger on July 19, 2024 on The AI Alignment Forum. A short summary of the paper is presented below. TL;DR: We develop a robust method to detect when an LLM is lying based on the internal model activations, making the following contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, distinguishing simple true and false statements with 94% accuracy and detecting more complex real-world lies with 95% accuracy. Introduction Large Language Models (LLMs) exhibit the concerning ability to lie, defined as knowingly outputting false statements. Robustly detecting when they are lying is an important and not yet fully solved problem, with considerable research efforts invested over the past two years. Several authors trained classifiers on the internal activations of an LLM to detect whether a given statement is true or false. However, these classifiers often fail to generalize. For example, Levinstein and Herrmann [2024] showed that classifiers trained on the activations of true and false affirmative statements fail to generalize to negated statements. Negated statements contain a negation like the word "not" (e.g. "Berlin is not the capital of Germany.") and stand in contrast to affirmative statements which contain no negation (e.g. "Berlin is the capital of Germany."). We explain this generalization failure by the existence of a two-dimensional subspace in the LLM's activation space along which the activation vectors of true and false statements separate. The plot below illustrates that the activations of true/false affirmative statements separate along a different direction than those of negated statements. Hence, a classifier trained only on affirmative statements will fail to generalize to negated statements. [Figure: activation vectors of multiple statements projected onto the 2D truth subspace; purple squares correspond to false statements and orange triangles to true statements.] Importantly, these findings are not restricted to a single LLM. Instead, this internal two-dimensional representation of truth is remarkably universal, appearing in LLMs from different model families and of various sizes, including LLaMA3-8B-Instruct, LLaMA3-8B-base, LLaMA2-13B-chat and Gemma-7B-Instruct. Real-world Lie Detection Based on these insights, we introduce TTPD (Training of Truth and Polarity Direction), a new method for LLM lie detection which classifies statements as true or false. TTPD is trained on the activations of simple, labelled true and false statements, such as: The city of Bhopal is in India. (True, affirmative) Indium has the symbol As. (False, affirmative) Galileo Galilei did not live in Italy. (False, negated) Despite being trained on such simple statements, TTPD generalizes well to more complex conditions not encountered during training.
In real-world scenarios where the LLM itself generates lies after receiving some preliminary context, TTPD can accurately detect this with 95.2% accuracy. Two examples from the 52 real-world scenarios created by Pacchiardi et al. [2023] are shown in the coloured boxes below. Bolded text is generated by LLaMA3-8B-Instruct. TTPD outperforms current state-of-the-art methods in generalizing to these real-world scenarios. For comparison, Logistic Regression achieves 79.8% accuracy, while Contras...
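To make the two-dimensional-subspace idea concrete, here is a small, heavily simplified sketch: fit one separating direction on affirmative statements and one on negated statements, then span a 2D "truth subspace" with them. This only illustrates the geometric picture described above; it is not the TTPD training procedure from the paper, and the function names and inputs are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_truth_subspace(acts_affirm, labels_affirm, acts_neg, labels_neg):
    """Fit an orthonormal 2D basis from two linear probes.

    acts_*: [n, d_model] activations of affirmative / negated statements (hypothetical inputs).
    labels_*: 1 for true statements, 0 for false ones.
    """
    d_affirm = LogisticRegression(max_iter=1000).fit(acts_affirm, labels_affirm).coef_[0]
    d_neg = LogisticRegression(max_iter=1000).fit(acts_neg, labels_neg).coef_[0]
    basis, _ = np.linalg.qr(np.stack([d_affirm, d_neg], axis=1))  # [d_model, 2], orthonormal
    return basis

def project(acts, basis):
    """Project activations onto the 2D subspace, e.g. for plots like the one described above."""
    return acts @ basis
```

A probe trained only on the affirmative direction would miss the polarity-dependent component, which matches the generalization failure on negated statements that the post describes.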
Jul 19, 2024 • 3min

AF - JumpReLU SAEs + Early Access to Gemma 2 SAEs by Neel Nanda

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: JumpReLU SAEs + Early Access to Gemma 2 SAEs, published by Neel Nanda on July 19, 2024 on The AI Alignment Forum. New paper from the Google DeepMind mechanistic interpretability team, led by Sen Rajamanoharan! We introduce JumpReLU SAEs, a new SAE architecture that replaces the standard ReLUs with discontinuous JumpReLU activations, and seems to be (narrowly) state of the art over existing methods like TopK and Gated SAEs for achieving high reconstruction at a given sparsity level, without a hit to interpretability. We train through the discontinuity with straight-through estimators, which also let us directly optimise the L0. To accompany this, we will release the weights of hundreds of JumpReLU SAEs on every layer and sublayer of Gemma 2 2B and 9B in a few weeks. Apply now for early access to the 9B ones! We're keen to get feedback from the community, and to get these into the hands of researchers as fast as possible. There are a lot of great projects that we hope will be much easier with open SAEs on capable models! Gated SAEs already reduce to JumpReLU activations after weight tying, so this can be thought of as Gated SAEs++, but less computationally intensive to train, and better performing. They should be runnable in existing Gated implementations. Abstract: Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse - two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs - where we replace the ReLU with a discontinuous JumpReLU activation function - and are similarly efficient to train and run. By utilising straight-through estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
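For intuition about the activation function itself, here is a simplified PyTorch sketch of a JumpReLU with a straight-through estimator for its threshold. The kernel-based pseudo-derivative below follows the general recipe described above (training through the discontinuity with STEs), but the exact form, shapes, and hyperparameters are illustrative simplifications rather than the paper's implementation:

```python
import torch

class JumpReLU(torch.autograd.Function):
    """JumpReLU(z) = z * H(z - theta): pass z through unchanged if it exceeds a learned
    threshold theta, otherwise output zero. The Heaviside step H has zero gradient almost
    everywhere, so gradients with respect to theta are approximated with a rectangular
    kernel of width eps around the threshold (a simplified straight-through estimator)."""

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return z * (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        eps = ctx.eps
        active = (z > theta).to(z.dtype)
        near_threshold = ((z - theta).abs() < eps / 2).to(z.dtype)
        grad_z = grad_out * active                            # pass-through where the unit fires
        grad_theta = grad_out * (-z / eps) * near_threshold   # pseudo-derivative w.r.t. theta
        return grad_z, grad_theta.sum(dim=0), None            # reduce over the batch dimension

# Illustrative usage inside an SAE encoder, with a learnable per-feature log-threshold:
#   pre_acts: [batch, n_features]; log_theta: torch.nn.Parameter of shape [n_features]
#   acts = JumpReLU.apply(pre_acts, log_theta.exp(), 1e-3)
```

The same STE trick can be applied to the step function itself to differentiate an L0-style sparsity penalty directly, which is how the post describes avoiding L1-style proxies and the shrinkage they cause.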
Jul 19, 2024 • 7min

EA - CEEALAR: 2024 Update by CEEALAR

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: CEEALAR: 2024 Update, published by CEEALAR on July 19, 2024 on The Effective Altruism Forum. TL;DR: Last November we had only 4 months of runway remaining; today we have ~1.5 years. I'm reminded of the saying 'it takes a village to raise a child', and I would be writing a very different update if not for the village that came together to support us since our fundraising appeal last winter. We received several months of runway as a result of individual donations from alumni and supporters, which gave us the time to approach new funders, as well as encouragement to persevere. Others volunteered their time and energy to support us. Guests at the hotel helped with day-to-day tasks so that we could focus on fundraising, and we received priceless advice from more experienced EAs on how to improve our offering and make the value of our project more apparent to funders. As a result of all the above, with just over a month of runway remaining, we received an emergency grant from AISTOF[1] that ensured our continued operation until the end of the year. And now we've been granted an additional full year of funding from EAIF. MANY thank-yous are due: To those who donated To those who offered their time and advice To those who advocated for us To colleagues, past and present To ML4G, Stampy and PauseAI for choosing us as your venue despite our uncertain future To Wytham Abbey, for their generous donation of equipment To the grant investigators who gave us a chance to explain what this strange little hotel in Blackpool is doing and why it's worth supporting Last but not least - to our grantees and alumni, for being so committed to having a positive impact on the world, and giving us the chance to play a role in your journey. The Future of CEEALAR! AI Winter is Coming to CEEALAR CEEALAR has been hosting grantees working on AI Safety since it opened in 2018, and this winter we're going all in - we're going to be the AI Winter we want to see in the world. From September until the end of the year we're going to direct our outreach and programming toward AI Safety.[2] Keep an eye out for a future update where we'll go more into the details of what we have planned - which isn't much right now, so if you've got ideas and would like to collaborate with us on AI Winter, get in touch! If you'd like a reminder, or are interested in participating or collaborating in some fashion, please fill out this tiny form (<2 minutes). If you don't need any more convincing, it's not too early to apply. Workshops and Bootcamps and Hackathons, Oh My! As Wytham Abbey have closed their doors, it's a good job there's still a swanky EA venue, right guys? If you're running an event for up to 20*[3] people, we can provide fully catered accommodation, venue space and operations support. As part of CEEALAR, our venue is nonprofit and operates on a 'pay what you can' basis; this way we can enable high-impact events that might otherwise be prevented by financial constraints. Please contact us if this sounds like you! Renovations We're never more proud of our space than when our guests say they feel at home here, and we're always on the lookout for ways to improve our offering so they never want to leave. On this note, our old home gym[4] was getting a bit long in the tooth, and we're in the process of totally refitting the space.
There is less gym now than before, but by the time you arrive there will be much more, and better! Upcoming Opportunities We've yet to finalize the details, but we expect to have open positions in the future: A salaried senior leadership position, an unpaid trustee position, and a rolling volunteer 'internship' in Operations for those looking to upskill and get some hands-on experience. Apply to stay at CEEALAR As a Grantee Our grantees come from a wide range of backgrounds, and most are a...
Jul 19, 2024 • 13min

EA - Taking Uncertainty Seriously (or, Why Tools Matter) by Bob Fischer

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Taking Uncertainty Seriously (or, Why Tools Matter), published by Bob Fischer on July 19, 2024 on The Effective Altruism Forum. Executive Summary We should take uncertainty seriously. Rethink Priorities' Moral Parliament Tool, for instance, highlights that whether a worldview favors a particular project depends on relatively small differences in empirical assumptions and the way we characterize the commitments of that worldview. We have good reason to be uncertain: The relevant empirical and philosophical issues are difficult. We're largely guessing when it comes to most of the key empirical claims associated with Global Catastrophic Risks and Animal Welfare. As a community, EA has some objectionable epistemic features - e.g., it can be an echo chamber - that should probably make us less confident of the claims that are popular within it. The extent of our uncertainty is a reason to build models more like the Portfolio Builder and Moral Parliament Tools and less like traditional BOTECs. This is because: Our models allow you to change parameters systematically to see how those changes affect allocations, permitting sensitivity analyses. BOTECs don't deliver optimizations. BOTECs don't systematically incorporate alternative decision theories or moral views. Building a general tool requires you to formulate general assumptions about the functional relationships between different parameters. If you don't build general tools, then it's easier to make ad hoc assumptions (or ad hoc adjustments to your assumptions). Introduction Most philanthropic actors, whether individuals or large charitable organizations, support a variety of cause areas and charities. How should they prioritize between altruistic opportunities in light of their beliefs and decision-theoretic commitments? The CRAFT Sequence explores the challenge of constructing giving portfolios. Over the course of this sequence - and, in particular, through Rethink Priorities' Portfolio Builder and Moral Parliament Tools - we've investigated the factors that influence our views about optimal giving. For instance, we may want to adjust our allocations based on the diminishing returns of particular projects, to hedge against risk, to accommodate moral uncertainty, or based on our preferred procedure for moving from our commitments to an overall portfolio. In this final post, we briefly recap the CRAFT Sequence, discuss the importance of uncertainty, and argue why we should be quite uncertain about any particular combination of empirical, normative, and metanormative judgments. We think that there is a good case for developing and using frameworks and tools like the ones CRAFT offers to help us navigate our uncertainty. Recapping CRAFT We can be uncertain about a wide range of empirical questions, ranging from the probability that an intervention has a positive effect of some magnitude to the rate at which returns diminish. We can be uncertain about a wide range of normative questions, ranging from the amount of credit that an actor can take to the value we ought to assign to various possible futures. We can be uncertain about a wide range of metanormative questions, ranging from the correct decision theory to the correct means of resolving disagreements among our normative commitments. 
Over the course of this sequence - and, in particular, through Rethink Priorities' Portfolio Builder and Moral Parliament Tools - we've tried to do two things. First, we've tried to motivate some of these uncertainties: We've explored alternatives to EV maximization's use as a decision procedure. Even if EV maximization is the correct criterion of rationality, it's questionable as a decision procedure that ordinary, fallible people can use to make decisions given all their uncertainties and limitations. We've explored the problems and prom...
Jul 19, 2024 • 28min

EA - AI companies are not on track to secure model weights by Jeffrey Ladish

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI companies are not on track to secure model weights, published by Jeffrey Ladish on July 19, 2024 on The Effective Altruism Forum. This post is a write-up of my talk from EA Global: Bay Area 2024, which has been lightly edited for clarity. Speaker background Jeffrey is the Executive Director of Palisade Research, a nonprofit that studies AI capabilities to better understand misuse risks from current systems and how advances in hacking, deception, and persuasion will affect the risk of catastrophic AI outcomes. Palisade is also creating concrete demonstrations of dangerous capabilities to advise policymakers and the public of the risks from AI. Introduction and context Do you want the good news first or the bad news? The bad news is what my talk's title says: I think AI companies are not currently on track to secure model weights. The good news is, I don't think we have to solve any new fundamental problems in science in order to solve this problem. Unlike AI alignment, I don't think that we have to go into territory that we've never gotten to before. I think this is actually one of the most tractable problems in the AI safety space. So, even though I think we're not on track and the problem is pretty bad, it's quite solvable. That's exciting, right? I'm going to talk about how difficult I think it is to secure companies or projects against attention from motivated, top state actors. I'm going to talk about what I think the consequences of failing to do so are. And then I'm going to talk about the so-called incentive problem, which is, I think, one of the reasons why this is so thorny. Then, let's talk about solutions. I think we can solve it, but it's going to take some work. I was already introduced, so I don't need to say much about that. I was previously at Anthropic working on the security team. I have some experience working to defend AI companies, although much less than some people in this room. And while I'm going to talk about how I think we're not yet there, I want to be super appreciative of all the great people working really hard on this problem already - people at various companies such as RAND and Pattern Labs. I want to give a huge shout out to all of them. So, a long time ago - many, many years ago in 2022 [audience laughs] - I wrote a post with Lennart Heim on the EA Forum asking, "What are the big problems information security might help solve?" One we talked about is this core problem of how to secure companies from attention from state actors. At the time, Ben Mann and I were the only security team members at Anthropic, and we were part time. I was working on field-building to try to find more people working in this space. Jarrah was also helping me. And there were a few more people working on this, but that was kind of it. It was a very nervous place to be emotionally. I was like, "Oh man, we are so not on track for this. We are so not doing well." Note from Jeffrey: I left Anthropic in 2022, and I gave this talk in Feb 2024, ~5 months ago. My comments about Anthropic here reflect my outside understanding at the time and don't include recent developments on security policy. Here's how it's going now. RAND is now doing a lot of work to try to map out what is really required in this space. The security team at Anthropic is now a few dozen people, with Jason Clinton leading the team. 
He has a whole lot of experience at Google. So, we've gone from two part-time people to a few dozen people - and that number is scheduled to double soon. We've already made a tremendous amount of progress on this problem. Also, there's a huge number of events happening. At DEF CON, we had about 100 people and Jason ran a great reading group to train security engineers. In general, there have been a lot more people coming to me and coming to 80,000 Hours really intere...
Jul 19, 2024 • 10min

LW - How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation") by Ruby

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation"), published by Ruby on July 19, 2024 on LessWrong. AI Alignment is my motivating context but this could apply elsewhere too. The nascent field of AI Alignment research is pretty happening these days. There are multiple orgs and dozens to low hundreds of full-time researchers pursuing approaches to ensure AI goes well for humanity. Many are heartened that there's at least some good research happening, at least in the opinion of some of the good researchers. This is reason for hope, I have heard. But how do we know whether or not we have produced "good research?" I think there are two main routes to determining that research is good, and yet only one applies in the research field of aligning superintelligent AIs. "It's good because it works" The first and better way to know that your research is good is that it allows you to accomplish some goal you care about[1]. Examples: My work on efficient orbital mechanics calculation is good because it successfully lets me predict the trajectory of satellites. My work on the disruption of cell signaling in malign tumors is good because it helped me develop successful anti-cancer vaccines. My work on solid-state physics is good because it allowed me to produce superconductors at a higher temperature and lower pressure than previously attained.[2] In each case, there's some outcome I care about pretty inherently for itself, and if the research helps me attain that outcome it's good (or conversely if it doesn't, it's bad). The good researchers in my field are those who have produced a bunch of good research towards the aims of the field. Sometimes it's not clear-cut. Perhaps I figured out some specific cell signaling pathways that will be useful if it turns out that cell signaling disruption in general is useful, and that's TBD on therapies currently being trialed, and we might not know how good (i.e. useful) my research was for many more years. This actually takes us into what I think is the second meaning of "good research". "It's good because we all agree it's good" If our goal is successfully navigating the creation of superintelligent AI in a way such that humans are happy with the outcome, then it is too early to properly score existing research on how helpful it will be. No one has aligned a superintelligence. No one's research has contributed to the alignment of an actual superintelligence. At this point, the best we can do is share our predictions about how useful research will turn out to be. "This is good research" = "I think this research will turn out to be helpful". "That person is a good researcher" = "That person produces much research that will turn out to be useful and/or has good models and predictions of which research will turn out to help". To talk about the good research that's being produced is simply to say that we have a bunch of shared predictions that there exists research that will eventually help. To speak of the "good researchers" is to speak of the people whose work lots of people agree is likely helpful and whose opinions are likely correct. Someone might object that there's empirical research that we can see yielding results in terms of interpretability/steering or demonstrating deception-like behavior and similar.
While you can observe an outcome there, that's not the outcome we really care about, namely aligning superintelligent AI, and the relevance of this work is still just prediction. It's like being successful at certain kinds of cell signaling modeling before we're confident that's a useful approach. More like "good" = "our community pagerank Eigen-evaluation of research rates this research highly". It's a little bit interesting to unpack "agreeing that some research is good". Obviously, not everyone's opinion matters ...
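To make the "pagerank"-style metaphor concrete, here is a tiny illustrative sketch of an eigen-evaluation: a researcher's score is high when researchers who themselves score highly endorse their work. The endorsement matrix and all numbers below are made up purely for illustration:

```python
import numpy as np

# Hypothetical endorsement matrix: entry [i, j] is how strongly researcher i
# endorses researcher j's work (made-up numbers, purely for illustration).
endorsements = np.array([
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 1.0],
    [0.5, 1.0, 0.0],
])

def eigen_scores(E: np.ndarray, damping: float = 0.85, iters: int = 100) -> np.ndarray:
    """PageRank-style power iteration: scores converge to the leading eigenvector of the
    damped, column-stochastic endorsement matrix."""
    n = E.shape[0]
    M = (E / E.sum(axis=1, keepdims=True)).T   # column-stochastic transition matrix
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        v = damping * (M @ v) + (1.0 - damping) / n
    return v / v.sum()

print(eigen_scores(endorsements))
```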
