The Nonlinear Library

The Nonlinear Fund
Aug 3, 2024 • 5min

LW - Some comments on intelligence by Viliam

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some comments on intelligence, published by Viliam on August 3, 2024 on LessWrong. After reading another article on IQ, there are a few things that I wish would become common knowledge to increase the quality of the debate. Posting them here: 1) There is a difference between an abstract definition of intelligence such that it could also apply to aliens or AIs (something like "an agent able to optimize for outcomes in various environments") and the specific way intelligence is implemented in human brains. Because of the implementation details, things can be true about human intelligence even if they are not necessarily true about intelligence in general. For example, we might empirically find that humans better at X are usually also better at Y, even if we could imagine a hypothetical AI (or even take an already existing one) whose skills at X and Y are unrelated. The fact that X and Y are unrelated in principle doesn't disprove the hypothesis that they are related in human brains. 2) Saying "the important thing is not intelligence (or rationality), but domain knowledge or experience or something else" is... ...on one hand, true; and the fans of intelligence (or rationality) should probably be reminded of it quite often. Yes, your Mensa membership card or LessWrong account doesn't mean that you no longer have to study things because you can solve relativity in five minutes of armchair reasoning... ...on the other hand, it's not like these things are completely unrelated. Yes, you acquire knowledge by studying, but your intelligence probably has a huge impact on how fast you can do that, or even whether you can do that at all. So we need to distinguish between the short term and the long term. In the short term, yes, domain knowledge and experience matter a lot, and intelligence is probably not going to save you if the inferential distances are large. But in the long term, intelligence may be necessary for acquiring the domain knowledge and experience. In other words, there is a huge difference between "can use intelligence instead of X, Y, Z" and "can use intelligence to acquire X, Y, Z". The argument about intelligence being less important than X, Y, Z is irrelevant as an objection to the latter. 3) The article that led me to write all this proposed that we do not need separate education for gifted children; instead we should simply say that some children are further ahead in certain topics (this part is not going to trigger anyone's political instincts) and therefore we should have separate classes for... those who already know something, and those who don't know it yet. This would nicely avoid the controversy around intelligence and heredity etc., while still allowing the more intelligent kids (assuming that there is such a thing) to study at their own speed. A win/win solution for both those who believe in intelligence and those who don't? Unfortunately, I think this is not going to work. I approve of the idea of disentangling "intelligence" from "previously gained experience". But the entire point of IQ is that previously gained experience does not screen off intelligence. Your starting point is one thing; the speed at which you progress is another thing. Yes, it makes sense in the classroom to separate the children who already know X ("advanced") from the children who don't know X yet ("beginners").
No need for the advanced to listen again to the things they already know. But if you keep teaching both groups at the speed optimal for their average members, both the gifted beginners and the gifted advanced will be bored, each one in their own group. A system that allows everyone to achieve their full potential would be the one where the gifted beginner is allowed to catch up with the average advanced, and where the gifted advanced is allowed to leave the average advanced behin...
Aug 3, 2024 • 3min

EA - On Living by emre kaplan

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Living, published by emre kaplan on August 3, 2024 on The Effective Altruism Forum. This is a famous Turkish poem by Nazım Hikmet. I just noticed its interesting overlap with some of the EA themes . Some here might find it motivating to read it. Translation by ChatGPT: On Living Living is no laughing matter: you must live with great earnestness like a squirrel, for example, I mean without looking for something beyond and above living, living must be your whole occupation. Living is no laughing matter: you must take it seriously, so much so and to such a degree that, for example, your hands tied behind your back, your back to the wall, or else in a laboratory in your white coat and safety glasses, you can die for people even for people whose faces you've never seen, even though nobody forced you to do so, even though you know living is the most real, the most beautiful thing. I mean, you must take living so seriously that even at seventy, for example, you'll plant olive trees and not for your children, either, but because although you fear death you don't believe it, because living, I mean, weighs heavier. II Let's say we're seriously ill, need surgery which is to say we might never get up from the white table. Even though it's impossible not to feel sad about going a little too soon, we'll still laugh at the jokes being told, we'll look out the window to see if it's raining, or still wait anxiously for the latest news. Let's say we're at the front for something worth fighting for, say. There, in the first offensive, on that very day, we might fall on our face, dead. We'll feel a strange, bitter anger, yet we'll still be consumed with worry about the war's outcome, which could drag on for years. Let's say we're in prison and close to fifty, and we have eighteen more years, say, before the iron doors will open. We'll still live with the outside, with its people and animals, struggle and wind I mean with the outside beyond the walls. I mean, however and wherever we are, we must live as if we will never die. III This earth will grow cold, a star among stars and one of the smallest, a gilded mote on blue velvet I mean this, our great earth. This earth will grow cold one day, not like a block of ice or a dead cloud even but like an empty walnut it will roll along in pitch-black space... You must grieve for this right now you have to feel this sorrow now for the world must be loved this much if you're going to say "I lived"... Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Aug 3, 2024 • 5min

EA - Turning Privilege into Effective Donations by Nina Friedrich

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Turning Privilege into Effective Donations, published by Nina Friedrich on August 3, 2024 on The Effective Altruism Forum. This article was originally written for my LinkedIn audience, which includes individuals with varying levels of familiarity with Effective Altruism. The explanations and context are provided accordingly. I am incredibly grateful to have received a scholarship that funded my university studies from a bachelor's to a Ph.D. The financial support I received not only eased my education but also made me reflect on how I could extend this privilege to others. Today, I am thrilled to share that I have reached my goal of donating the full amount of my scholarship's value to effective charities. Here's an overview of what inspired me, what lies ahead, and how you can get involved. Why Give Back? Recognising our own financial privilege can be eye-opening and humbling. Many of us enjoy a level of wealth that far exceeds that of the average person worldwide. To understand my global financial standing, I used the "How Rich Am I?" calculator from Giving What We Can: If you earn at least £50,000 (post-tax) per year, you are in the richest 1% globally! This tool reveals that many of us are among the wealthiest people on the planet. This stark comparison made me realise the immense potential I have to make a positive impact and motivated me to commit to giving back effectively. Choosing Effective Charities: Some charities are vastly more cost-effective than others. For instance, GiveWell, an organisation that rigorously evaluates charities, estimates that around $5,000 can save a life through particularly impactful global health interventions. By prioritising donations to charities like these, which do so much more good per dollar than the average charity, we can ensure our contributions have the greatest possible effect. My giving has impacted 50,000 lives: By donating the value of my scholarships, I have likely impacted 50,000 lives, prevented 9 deaths, and significantly improved the lives of thousands of animals. Before I started giving to these types of organisations, I could never have imagined that I could do so much good while working as a software engineer and a consultant. Continuing My Journey: Reaching my donation goal is a huge milestone, but it's not the end. I will continue my monthly donations to effective charities. Additionally, I took the 10% Pledge, committing to donate 10% of my income to effective charities. In addition to contributing financially, I now lead High Impact Professionals, a non-profit whose mission is to support experienced professionals in using their careers to do the most good. There's more information about High Impact Professionals at the end of this article, as our programs might be interesting to readers like you. How You Can Make a Difference: If you're inspired by the idea of effective giving, there are several ways you can get involved. Here are some steps you can take to make a difference: 1. Take the 10% Pledge: Consider committing to donate 10% of your income to effective charities. This pledge can make a substantial impact. Take the 10% Pledge here. 2. Take the Trial Pledge: If you're not ready for a long-term commitment, consider taking the trial pledge to see the difference it can make: you pick your own percentage (starting at as little as 1%) and duration of the pledge.
Take the Trial Pledge here. 3. Talk About It to Encourage Others: Sharing your commitment to effective giving can inspire others to join the cause, amplifying the overall impact. By taking these steps, you can make a significant difference and help create a more equitable world. In addition to helping others, effective givers often experience personal rewards such as a sense of achievement, a deeper sense of meaning, and a strengthened sense of c...
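A back-of-the-envelope sketch of the arithmetic behind figures like "prevented 9 deaths", assuming the roughly $5,000-per-life GiveWell-style estimate quoted above; the donation amount below is a hypothetical placeholder, not the author's actual total.

```python
# Rough estimate of deaths averted by a donation, using the ~$5,000
# cost-per-life figure cited above for GiveWell's top global health
# charities. Both numbers are illustrative, not exact.
COST_PER_LIFE_SAVED_USD = 5_000

def estimated_deaths_averted(donation_usd: float) -> float:
    """Return a rough estimate of deaths averted by a donation."""
    return donation_usd / COST_PER_LIFE_SAVED_USD

if __name__ == "__main__":
    donation = 10_000  # hypothetical amount, not the author's figure
    print(f"${donation:,} at ${COST_PER_LIFE_SAVED_USD:,} per life "
          f"is roughly {estimated_deaths_averted(donation):.1f} lives saved")
```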
Aug 2, 2024 • 7min

EA - FarmKind (a new animal fundraising platform) is live - Please DON'T DONATE by Aidan Alexander

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: FarmKind (a new animal fundraising platform) is live - Please DON'T DONATE, published by Aidan Alexander on August 2, 2024 on The Effective Altruism Forum. TL;DR: FarmKind is a new effective giving platform dedicated to tackling factory farming. We've just launched: www.farmkind.giving. There are many ways you can help us if you're interested; details below. FarmKind's aim: Factory farming is one of the most neglected cause areas relative to the amount of suffering it causes. Globally, Farmed Animal Funders estimates that just $200 million is channeled specifically to this issue,[1] while more than 10 billion land animals (excluding insects) are factory farmed annually in the US alone.[2] Even when it comes to effective altruism, factory farming is a minority within a minority. We estimate that less than 10% of the funds raised by effective giving organizations go to factory farming.[3] All this despite the fact that proven interventions in the lives of factory-farmed animals remain arguably some of the most cost-effective ways to prevent suffering that we have yet discovered. The lack of funding has several consequences: 1. Proven strategies for reducing suffering are being scaled more slowly. 2. Promising new interventions struggle to get off the ground. 3. The space is overly reliant on a few large funders, posing many structural risks. FarmKind's mission is to increase funding for farmed animal charities by bringing in new donors and donations. To do this, we've built a platform inspired by the innovative work of Prof. Joshua Greene and Dr. Lucius Caviola and their Giving Multiplier platform, tailored specifically to raise money for farmed animal charities. People donate because they feel compassion but also want their donations to be spent wisely and have an impact. FarmKind seeks to meet both these motivations so people can feel good while they do a huge amount of good too. We work with expert charity evaluators, including Animal Charity Evaluators, to find charities that are super-effective at making the lives of factory-farmed animals better. We help donors give to a curated set of these charities while, at the same time, splitting their donation with their favorite charities. Then FarmKind boosts both donations with a bonus. We hope that, by increasing funding to these higher-profile and more thoroughly evaluated organizations, we will be able to free up other funders to look at interventions that require more vetting than we currently have the capacity for. Our story so far: FarmKind was incubated in the first Charity Entrepreneurship Incubation Program of 2024, in April this year. Since then, founders Aidan Alexander and Thom Norman have been working to launch our giving platform as soon as possible. We have now launched our platform and are receiving donations. You can find us here: www.farmkind.giving. Getting to this point would not have been possible without the amazing support of a lot of people who want to see a better world for animals. In particular, our platform would not have been possible without Hive's Douglas Browne, Violet Studios or Every.org. Many others have helped us get to launch, too many to list here, but please check out our acknowledgements on our site here. DON'T donate through us (please): If you're reading this post, our platform isn't aimed at you.
We aim to convert new donors to supporting effective farmed animal welfare. If people who would have given to our recommended charities anyway donate through our platform, it doesn't add any value; what's more, by using up limited matching funding, donating through our platform hurts our ability to incentivise counterfactual donations. If, however, you're the type of person who gives to ACE recommended charities already, you may be interested in giving to them via our bonus...
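For readers curious about the mechanics described above, here is a minimal sketch of a Giving Multiplier-style split-plus-bonus donation; the 50/50 split and 20% bonus rate are illustrative assumptions, not FarmKind's actual parameters.

```python
# Sketch of a split-plus-bonus donation: part goes to the donor's
# favorite charity, part to a recommended super-effective charity,
# and both parts are topped up with a matching bonus. The split and
# bonus rate below are assumptions for illustration only.

def split_donation(total: float, favorite_share: float = 0.5, bonus_rate: float = 0.2) -> dict:
    assert 0.0 <= favorite_share <= 1.0
    to_favorite = total * favorite_share
    to_effective = total - to_favorite
    return {
        "favorite_charity": to_favorite * (1 + bonus_rate),
        "effective_charity": to_effective * (1 + bonus_rate),
        "bonus_spent": total * bonus_rate,
    }

print(split_donation(100.0))
# {'favorite_charity': 60.0, 'effective_charity': 60.0, 'bonus_spent': 20.0}
```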
Aug 2, 2024 • 7min

AF - The Bitter Lesson for AI Safety Research by Adam Khoja

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Bitter Lesson for AI Safety Research, published by Adam Khoja on August 2, 2024 on The AI Alignment Forum. Read the associated paper "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?": https://arxiv.org/abs/2407.21792 Focus on safety problems that aren't solved with scale. Benchmarks are crucial in ML to operationalize the properties we want models to have (knowledge, reasoning, ethics, calibration, truthfulness, etc.). They act as a criterion to judge the quality of models and drive implicit competition between researchers. "For better or worse, benchmarks shape a field." We performed the largest empirical meta-analysis to date of AI safety benchmarks on dozens of open language models. Around half of the benchmarks we examined had high correlation with upstream general capabilities. Some safety properties improve with scale, while others do not. For the models we tested, benchmarks on human preference alignment, scalable oversight (e.g., QuALITY), truthfulness (TruthfulQA MC1 and TruthfulQA Gen), and static adversarial robustness were highly correlated with upstream general capabilities. Bias, dynamic adversarial robustness, and calibration when not measured with Brier scores had relatively low correlations. Sycophancy and weaponization restriction (WMDP) had significant negative correlations with general capabilities. Often, intuitive arguments from alignment theory are used to guide and prioritize deep learning research. We find these arguments to be poorly predictive of these correlations, and ultimately counterproductive. In fact, in areas like adversarial robustness, some benchmarks basically measured upstream capabilities while others did not. We argue instead that empirical measurement is necessary to determine which safety properties will be naturally achieved by more capable systems, and which safety problems will remain persistent.[1] Abstract arguments from genuinely smart people may be highly "thoughtful," but these arguments generally do not track deep learning phenomena, as deep learning is too often counterintuitive. We provide several recommendations to the research community in light of our analysis: Measure capabilities correlations when proposing new safety evaluations. When creating safety benchmarks, aim to measure phenomena which are less correlated with capabilities. For example, if truthfulness entangles Q/A accuracy, honesty, and calibration, then just make a decorrelated benchmark that measures honesty or calibration. In anticipation of capabilities progress, work on safety problems that are disentangled from capabilities and thus will likely persist in future models (e.g., GPT-5). The ideal is to find training techniques that cause as many safety properties as possible to be entangled with capabilities. Ultimately, safety researchers should prioritize differential safety progress, and should attempt to develop a science of benchmarking that can effectively identify the most important research problems to improve safety relative to the default capabilities trajectory. We're not claiming that safety properties and upstream general capabilities are orthogonal. Some are, some aren't. Safety properties are not a monolith. Weaponization risks increase as upstream general capabilities increase.
Jailbreaking robustness isn't strongly correlated with upstream general capabilities. However, if we can isolate less-correlated safety properties in AI systems which are distinct from greater intelligence, these are the research problems safety researchers should most aggressively pursue and allocate resources toward. The other model properties can be left to capabilities researchers. This amounts to a "Bitter Lesson" argument for working on safety issues which are relatively uncorrelated (or negatively correlate...
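As a rough illustration of the "measure capabilities correlations" recommendation, the sketch below computes a Spearman correlation between per-model scores on a hypothetical safety benchmark and a general-capabilities aggregate; the scores are made up, and the paper's actual methodology (e.g., extracting a capabilities component across many benchmarks and models) is more involved.

```python
# Toy capabilities-correlation check for a candidate safety benchmark.
# Each list holds one score per model; the numbers are placeholders.
from scipy.stats import spearmanr

capability_scores = [0.31, 0.42, 0.55, 0.63, 0.71, 0.80]        # general-capabilities aggregate
safety_benchmark_scores = [0.40, 0.38, 0.52, 0.49, 0.66, 0.71]  # candidate safety benchmark

rho, p_value = spearmanr(capability_scores, safety_benchmark_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near 1 suggests the benchmark mostly tracks upstream capabilities;
# a rho near 0 (or negative) suggests it measures something that scaling
# alone is unlikely to deliver.
```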
Aug 2, 2024 • 2min

EA - Want to work on US emerging tech policy? Consider the Horizon Fellowship. by ES

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Want to work on US emerging tech policy? Consider the Horizon Fellowship., published by ES on August 2, 2024 on The Effective Altruism Forum. Applications are now open for the 2025 Horizon Fellowship cohort. What do you get? The fellowship program will fund and facilitate placements for 1-2 years in full-time US policy roles at executive branch offices, Congressional offices, and think tanks in Washington, DC. Horizon has placed fellows at the Department of Defense, the White House, the Department of Commerce, Senate committees, House personal offices, and prominent think tanks. You can learn more about past fellows and their placements at Meet our Fellows and Fellow Accomplishments. The fellowship also includes ten weeks of remote, part-time, policy-focused training, mentorship, and access to an extended network of emerging tech policy professionals. Who is it for? Entry-level and mid-career roles. No prior policy experience is required (but is welcome). Demonstrated interest in emerging technology. US citizens, green card holders, or students on OPT. Able to start a full-time role in Washington, DC by August 2025. Training is remote, so current undergraduate and graduate students graduating by summer 2025 are eligible. Research shows that great candidates often disqualify themselves too quickly, especially if they are from underrepresented groups. If you are excited about the program but on the fence about whether you are eligible or qualified, we strongly encourage you to apply. The application deadline is August 30th, 2024. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Aug 2, 2024 • 12min

LW - A Simple Toy Coherence Theorem by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Simple Toy Coherence Theorem, published by johnswentworth on August 2, 2024 on LessWrong. This post presents a simple toy coherence theorem, and then uses it to address various common confusions about coherence arguments. Setting Deterministic MDP. That means at each time t there's a state S[t][1], the agent/policy takes an action A[t] (which can depend on both time t and current state S[t]), and then the next state S[t+1] is fully determined by S[t] and A[t]. The current state and current action are sufficient to tell us the next state. We will think about values over the state at some final time T. Note that often in MDPs there is an incremental reward each timestep in addition to a final reward at the end; in our setting there is zero incremental reward at each timestep. One key point about this setting: if the value over final state is uniform, i.e. same value for all final states, then the MDP is trivial. In that case, all policies are optimal; it does not matter at all what the final state is or what any state along the way is, everything is equally valuable. Theorem There exist policies which cannot be optimal for any values over final state except for the trivial case of uniform values. Furthermore, such policies are exactly those which display inconsistent revealed preferences transitively between all final states. Proof As a specific example: consider an MDP in which every state is reachable at every timestep, and a policy which always stays in the same state over time. From each state S every other state is reachable, yet the policy chooses S, so in order for the policy to be optimal S must be a highest-value final state. Since each state must be a highest-value state, the policy cannot be optimal for any values over final state except for the trivial case of uniform values. That establishes the existence part of the theorem, and you can probably get the whole idea by thinking about how to generalize that example. The rest of the proof extends the idea of that example to inconsistent revealed preferences in general. Bulk of Proof Assume the policy is optimal for some particular values over final state. We can then start from those values over final state and compute the best value achievable starting from each state at each earlier time. That's just dynamic programming: V[S,t] = max over S' reachable in the next timestep from S of V[S',t+1], where V[S,T] are the values over final states. A policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. Now, suppose that at timestep t there are two different states either of which can reach either state A or state B in the next timestep. From one of those states the policy chooses A; from the other the policy chooses B. This is an inconsistent revealed preference between A and B at time t: sometimes the policy has a revealed preference for A over B, sometimes for B over A. In order for a policy with an inconsistent revealed preference between A and B at time t to be optimal, the values must satisfy V[A,t] = V[B,t]. Why? Well, a policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. So, if an optimal policy sometimes chooses A over B at timestep t when both are reachable, then we must have V[A,t] ≥ V[B,t].
And if an optimal policy sometimes chooses B over A at timestep t when both are reachable, then we must have V[A,t] ≤ V[B,t]. If both of those occur, i.e. the policy has an inconsistent revealed preference between A and B at time t, then V[A,t] = V[B,t]. Now, we can propagate that equality to a revealed preference on final states. We know that the final state which the policy in fact reaches starting from A at time t must have the highest reachable value, and that value...
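A minimal sketch of the dynamic-programming step used in the proof: given a deterministic MDP described by a successor relation and values over final states, compute V[S,t] backwards from the final time; an optimal policy must move to a successor with maximal V at every step. The three-state MDP below is an illustrative assumption, not an example from the post.

```python
# Backward induction for the toy setting: V[S,t] is the best achievable
# final-state value starting from state S at time t.

def backward_values(states, successors, final_values, horizon):
    """Return a dict mapping (state, t) to V[state, t] for t = 0..horizon."""
    V = {(s, horizon): final_values[s] for s in states}
    for t in range(horizon - 1, -1, -1):
        for s in states:
            # V[S,t] = max over S' reachable in the next timestep of V[S',t+1]
            V[(s, t)] = max(V[(s_next, t + 1)] for s_next in successors[s])
    return V

# Hypothetical 3-state deterministic MDP: successors[s] lists the states
# reachable from s in one step (one per available action).
states = ["A", "B", "C"]
successors = {"A": ["A", "B"], "B": ["B", "C"], "C": ["C", "A"]}
final_values = {"A": 0.0, "B": 1.0, "C": 2.0}

V = backward_values(states, successors, final_values, horizon=3)
print({s: V[(s, 0)] for s in states})  # {'A': 2.0, 'B': 2.0, 'C': 2.0}
```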
Aug 2, 2024 • 3min

LW - AI Rights for Human Safety by Simon Goldstein

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Rights for Human Safety, published by Simon Goldstein on August 2, 2024 on LessWrong. Just wanted to share a new paper on AI rights, co-authored with Peter Salib, that members of this community might be interested in. Here's the abstract: AI companies are racing to create artificial general intelligence, or "AGI." If they succeed, the result will be human-level AI systems that can independently pursue high-level goals by formulating and executing long-term plans in the real world. Leading AI researchers agree that some of these systems will likely be "misaligned", pursuing goals that humans do not desire. This goal mismatch will put misaligned AIs and humans into strategic competition with one another. As with present-day strategic competition between nations with incompatible goals, the result could be violent and catastrophic conflict. Existing legal institutions are unprepared for the AGI world. New foundations for AGI governance are needed, and the time to begin laying them is now, before the critical moment arrives. This Article begins to lay those new legal foundations. It is the first to think systematically about the dynamics of strategic competition between humans and misaligned AGI. The Article begins by showing, using formal game-theoretic models, that, by default, humans and AIs will be trapped in a prisoner's dilemma. Both parties' dominant strategy will be to permanently disempower or destroy the other, even though the costs of such conflict would be high. The Article then argues that a surprising legal intervention could transform the game-theoretic equilibrium and avoid conflict: AI rights. Not just any AI rights would promote human safety. Granting AIs the right not to be needlessly harmed (as humans have granted to certain non-human animals) would, for example, have little effect. Instead, to promote human safety, AIs should be given those basic private law rights (to make contracts, hold property, and bring tort claims) that law already extends to non-human corporations. Granting AIs these economic rights would enable long-run, small-scale, mutually-beneficial transactions between humans and AIs. This would, we show, facilitate a peaceful strategic equilibrium between humans and AIs for the same reasons economic interdependence tends to promote peace in international relations. Namely, the gains from trade far exceed those from war. Throughout, we argue that human safety, rather than AI welfare, provides the right framework for developing AI rights. This Article explores both the promise and the limits of AI rights as a legal tool for promoting human safety in an AGI world. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
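As a purely illustrative aside (not the paper's actual model), the sketch below writes down a prisoner's-dilemma payoff matrix of the kind the abstract describes and checks that "attack" is each side's dominant strategy, which is the default equilibrium the proposed economic rights are meant to change. The payoff numbers are placeholders.

```python
# Illustrative prisoner's dilemma between a human coalition and a
# misaligned AI. Strategies: "trade" (cooperate) or "attack" (try to
# disempower the other side). Payoffs are (human, ai) and made up.
payoffs = {
    ("trade", "trade"): (3, 3),
    ("trade", "attack"): (0, 4),
    ("attack", "trade"): (4, 0),
    ("attack", "attack"): (1, 1),
}

def best_response(player: int, opponent_choice: str) -> str:
    """Return the payoff-maximizing strategy for player 0 (human) or 1 (AI)."""
    def payoff(my_choice: str) -> int:
        profile = (my_choice, opponent_choice) if player == 0 else (opponent_choice, my_choice)
        return payoffs[profile][player]
    return max(("trade", "attack"), key=payoff)

for opponent_choice in ("trade", "attack"):
    print("human best response to AI playing", opponent_choice, "->",
          best_response(0, opponent_choice))
# "attack" dominates for both sides, so (attack, attack) is the equilibrium
# even though (trade, trade) is better for both; enforceable contract and
# property rights aim to change this payoff structure.
```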
Aug 2, 2024 • 31min

LW - The 'strong' feature hypothesis could be wrong by lsgos

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'strong' feature hypothesis could be wrong, published by lsgos on August 2, 2024 on LessWrong. NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position. "It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an "ideal" ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout" (Elhage et al., Toy Models of Superposition). Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower-level components, has been focussed on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder, in a series of papers inspired by Elhage et al.'s model of superposition. Conventionally, there are understood to be two key theories underlying this agenda. The first is the 'linear representation hypothesis' (LRH), the hypothesis that neural networks represent many intermediates or variables of the computation (such as the 'features of the input' in the opening quote) as linear directions in its representation space, or atoms[1]. And second, the theory that the network is capable of representing more of these 'atoms' than it has dimensions in its representation space, via superposition (the superposition hypothesis). While superposition is a relatively uncomplicated hypothesis, I think the LRH is worth examining in more detail. It is frequently stated quite vaguely, and I think there are several possible formulations of this hypothesis, with varying degrees of plausibility, that it is worth carefully distinguishing between. For example, the linear representation hypothesis is often stated as 'networks represent features of the input as directions in representation space'. There are a few possible formulations of this: 1. (Weak LRH) Some features used by neural networks are represented as atoms in representation space. 2. (Strong LRH) All features used by neural networks are represented by atoms. The weak LRH I would say is now well supported by considerable empirical evidence. The strong form is much more speculative: confirming the existence of many linear representations does not necessarily provide strong evidence for the strong hypothesis. Both the weak and the strong forms of the hypothesis can still have considerable variation, depending on what we understand by a feature. I think that in addition to the acknowledged assumptions of the LRH and superposition hypotheses, much work on SAEs in practice makes the assumption that each atom in the network will represent a "simple feature" or a "feature of the input". These features that the atoms are representations of are assumed to be 'monosemantic': they will all stand for features which are human-interpretable in isolation. I will call this the monosemanticity assumption. This is difficult to state precisely, but we might formulate it as the theory that every represented variable will have a single meaning in a good description of a model.
This is not a straightforward assumption due to how imprecise the notion of a single meaning is. While various more or less reasonable definitions for features are discussed in the pioneering work of Elhage et al., these assumptions have different implications. For instance, if one thinks of 'features' as computational intermediates in a broad sense, then superposition and the LRH imply a certain picture of the format of a model's internal representation: that what the network is doing is manipulating atoms in superposition (if y...
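For readers unfamiliar with the methodology discussed above, here is a minimal sketch of a sparse autoencoder of the kind used in this line of work, assuming PyTorch; the layer sizes and the L1 penalty are illustrative choices, not settings from any particular paper.

```python
# Minimal sparse autoencoder: learn an overcomplete set of directions
# ("atoms") such that activations are approximately sparse, non-negative
# combinations of them. Sizes and penalty weight are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 64, n_atoms: int = 512):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_atoms)
        self.decoder = nn.Linear(n_atoms, d_model, bias=False)

    def forward(self, activations: torch.Tensor):
        codes = torch.relu(self.encoder(activations))  # sparse coefficients
        reconstruction = self.decoder(codes)           # weighted sum of atom directions
        return reconstruction, codes

sae = SparseAutoencoder()
acts = torch.randn(8, 64)  # stand-in for a batch of model activations
recon, codes = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().mean()  # reconstruction + L1 sparsity
print(float(loss))
```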
