The Nonlinear Library

The Nonlinear Fund
undefined
Jun 4, 2024 • 1min

EA - GiveWell is hiring a Head of Fundraising Operations and Analytics by GiveWell

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GiveWell is hiring a Head of Fundraising Operations and Analytics, published by GiveWell on June 4, 2024 on The Effective Altruism Forum. We're hiring a Head of Fundraising Operations and Analytics to understand, analyze, and scale GiveWell's fundraising work. The Head of Fundraising Ops will define the research questions that we want to answer, design processes and a tech stack to collect information, and use creative qualitative and quantitative methods to analyze the data we obtain. They'll manage a small, efficient fundraising team and report to the Director of Development. Competitive candidates have prior experience solving open-ended business questions with limited or incomplete data, and have previously developed efficient and scalable processes for others to implement. Management experience is a major plus. Professional backgrounds that might be a particularly strong fit include revenue operations, data analytics, market research, and consulting. See compensation details and full job description and apply at the link above - please also share this job with anyone in your network who might be a good fit. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
undefined
Jun 4, 2024 • 22min

AF - [Paper] Stress-testing capability elicitation by training password-locked models by Fabien Roger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Paper] Stress-testing capability elicitation by training password-locked models, published by Fabien Roger on June 4, 2024 on The AI Alignment Forum. The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David. Scheming models might try to perform less capably than they are able to ( sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or recursive oversight more generally. You might hope that you can avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail. It could fail due to having too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions. We're uncertain how this goes in practice. It could be the case that simple countermeasures prevent sandbagging in most of the important cases such that it isn't a big problem for safety. But it could also turn out that even high-effort approaches can't rule out these issues. We released a paper studying this by examining how well supervised fine-tuning and RL can elicit capabilities from password-locked models (LLMs trained to produce low-quality outputs except when a particular password is provided). We find that: If you have some high-quality demonstrations, supervised fine-tuning usually recovers the hidden capabilities with very few samples. Elicitation on password-locked models generalizes well across domains (even when we made password-locked models using techniques that we thought might prevent generalization). Using RL on password-locked models recovers hidden capabilities, except when the weak model RL starts from is extremely weak. While we target cases where AIs intentionally produce low-quality outputs and they could easily produce higher quality outputs, we also think our results have some relevance to other cases where AIs produce low-quality outputs despite being able to produce better ones. Here are the main takeaways: We think our results are a moderate update against scheming models being able to sandbag effectively (given countermeasures) for the kind of tasks we studied here, though there are differences between password-locked models and actual scheming models. We think this should be a bigger update for people who are relatively more pessimistic than us about sandbagging (at least insofar as their views make pessimistic predictions about current architectures). (We think this is also a small update against scheming, since scheming models might still try to sandbag, and if elicitation fine-tunes sandbagging out, it may partially erode scheming.) People interested in studying sandbagging or similar elicitation failures should consider using a methodology similar to the password-locked models methodology we use here. More generally, they should consider using some sort of adversarial methodology in which a red team constructs a model organism. In this post: We explain why good capability elicitation matters for AI safety We describe the sandbagging concern we are trying to assess; We summarize the results of our paper, and discuss their implications; We discuss how our results relate to relevant AGI safety work; We describe future work we are excited by; We discuss concerns other than intentional sandbagging for which our results might be relevant. The big picture: why good capability elicitation matters for AI safety Currently, AI labs try reasonably hard to get as much performance as possible out of their expensive pretrained model. The process of making an AI perform a task competently (without a huge amount of additional training) is often called eli...
undefined
Jun 4, 2024 • 4min

LW - in defense of Linus Pauling by bhauth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: in defense of Linus Pauling, published by bhauth on June 4, 2024 on LessWrong. Linus Pauling was a chemist who won a Nobel Prize in Chemistry in 1954. He later became well-known for advocating large doses of Vitamin C. I've heard that advocacy referred to as a cautionary tale, but I've long had a bit of sympathy for Linus, and I'm writing this post to explain how and why. mainstream nutrition One reason for my sympathy for Linus is that I've heard him used as an example of why you shouldn't go off on your own instead of trusting mainstream views in a field, yet his advice, while not particularly helpful, caused much less harm than contemporary "mainstream nutritional advice", such as: the food pyramid "partially hydrogenated vegetable oil is healthier than butter" "a very-low-fat diet is healthy" "eating very little salt is healthy" I certainly wouldn't suggest trying to independently compete with the conceptual framework of, say, semiconductor physics or structural engineering, but when a field is rotten enough (nutrition, psychology, education, and economics come to mind) history indicates to me that someone smart from another field is often more correct than specialists on that topic, when they have an interest in it. my view of Vitamin C To be clear, I'm not advocating for the very high doses of Vitamin C that Linus did. I do think the World Health Organization's RDA (45 mg/day) is a bit low, but the RDA of Japan and the EU (~100 mg/day) seems entirely reasonable to me. Amounts above that generally don't have much effect on blood levels of Vitamin C, because it's absorbed less and the excess is expelled. Thus, some people have advocated administering it by IV, but one has to consider the possibility that there's a good reason for a specific level being naturally maintained. Research since Linus first advocated for Vitamin C megadoses has supported oxidative stress being a major cause of aging. It's associated with many problems (Alzheimer's comes to mind) and there's a good theoretical basis (DNA I-compounds) for its long-term harms. We've also seen that previous suggestions for Vitamin D doses were much too low, so there's also precedent for that. Where, then, did Linus go wrong? Vitamin C is an antioxidant, but it's also a pro-oxidant. It can reduce iron and copper, which can then react with hydrogen peroxide or oxygen to form hydroxyl radicals or peroxide ions. It can also form some complexes with metal ions that could conceivably have some harmful catalytic effects. (Its ability to interact with metal ions in certain ways is the main reason it's used in cells rather than some other compound: it's a cofactor.) The normal levels of free Fe and Cu ions are low, but my view is that the natural blood level of Vitamin C is a compromise set by pro-oxidant effects. When an infection happens that causes hypochlorite production by immune cells, it's logical that the optimal level of Vitamin C would be higher. And indeed, there's evidence that extra Vitamin C during infections (especially bacterial infections) helps somewhat. But the main antioxidant in mammals seems to be glutathione rather than Vitamin C, and it has to be used in combination with superoxide dismutase. So, Linus was one the right track. He was trying to solve the right problem, and he found a reasonable solution to it, but he overlooked some complicated side effects. That's a mistake I consider forgivable. He should have realized that there was a reason for homeostasis of Vitamin C levels, but the ideology of his time was that biological regulatory systems were so ineffective that any deliberate management by people would be better. There were, thus, people optimizing the balance of purified starch/fat/protein diets they fed rats, and being puzzled when the rats kept dying. Then, as soon as they discov...
undefined
Jun 4, 2024 • 23min

EA - The "TESCREAL" Bungle by ozymandias

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The "TESCREAL" Bungle, published by ozymandias on June 4, 2024 on The Effective Altruism Forum. A specter is haunting Silicon Valley - the specter of TESCREALism. "TESCREALism" is a term coined by philosopher Émile Torres and AI ethicist Timnit Gebru to refer to a loosely connected group of beliefs popular in Silicon Valley. The acronym unpacks to: Transhumanism - the belief that we should develop and use "human enhancement" technologies that would give people everything from indefinitely long lives and new senses like echolocation to math skills that rival John von Neumann's. Extropianism - the belief that we should settle outer space and create or become innumerable kinds of "posthuman" minds very different from present humanity. Singularitarianism - the belief that humans are going to create a superhuman intelligence in the medium-term future. Cosmism - a near-synonym to extropianism. Rationalism - a community founded by AI researcher Eliezer Yudkowsky, which focuses on figuring out how to improve people's ability to make good decisions and come to true beliefs. Effective altruism - a community focused on using reason and evidence to improve the world as much as possible. Longtermism - the belief that one of the most important considerations in ethics is the effects of our actions on the long-term future.[1] TESCREALism is a personal issue for Torres,[2] who used to be a longtermist philosopher before becoming convinced that the ideology was deeply harmful. But the concept is beginning to go mainstream, with endorsements in publications like Scientific American and the Financial Times. The concept of TESCREALism is at its best when it points out the philosophical underpinnings of many conversations occurring in Silicon Valley - principally about artificial intelligence but also about everything from gene-selection technologies to biosecurity. Eliezer Yudkowsky and Marc Andreessen - two influential thinkers Torres and Gebru have identified as TESCREAList - don't agree on much. Eliezer Yudkowsky believes that with our current understanding of AI we're unable to program an artificial general intelligence that won't wipe out humanity; therefore, he argues, we should pause AI research indefinitely. Marc Andreessen believes that artificial intelligence will be the most beneficial invention in human history: People who push for delay have the blood of the starving people and sick children whom AI could have helped on their hands. But their very disagreement depends on a number of common assumptions: that human minds aren't special or unique, that the future is going to get very strange very quickly, that artificial intelligence is one of the most important technologies determining the trajectory of future, that intelligences descended from humanity can and should spread across the stars.[3] As an analogy, Republicans and Democrats don't seem to agree about much. But if you were explaining American politics to a medieval peasant, the peasant would notice a number of commonalities: that citizens should choose their political leaders through voting, that people have a right to criticize those in charge, that the same laws ought to apply to everyone. To explain what was going on, you'd call this "liberal democracy." Similarly, many people in Silicon Valley share a worldview that is unspoken and, all too often, invisible to them. When you mostly talk to people who share your perspective, it's easy to not notice the controversial assumptions behind it. We learn about liberal democracy in school, but the philosophical underpinnings beneath some common debates in Silicon Valley can be unclear. It's easy to stumble across Andreesen's or Yudkowsky's writing without knowing anything about transhumanism. The TESCREALism concept can clarify what's going on for confused outsiders. How...
undefined
Jun 4, 2024 • 6min

LW - (Not) Derailing the LessOnline Puzzle Hunt by Error

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: (Not) Derailing the LessOnline Puzzle Hunt, published by Error on June 4, 2024 on LessWrong. (spoiler alert: may meta-spoil future iterations of the LOPH, if you haven't already read other posts about it) I knew early on that I wouldn't be able to finish the LessOnline puzzle hunt. I contributed to solving two of the first six puzzles, each of which revealed the combination to a locked box. Each box contained wooden medallions for the solvers, plus a QR code. The QR codes led to the second layer of a larger meta-puzzle. On discovering the existence of the meta-puzzle, I knew I would have to drop out. It was a shame, because I have never done a puzzle hunt before and I could see that the puzzlemasters had produced something amazing. But the opportunity cost of playing "for real" was just too high. There were too many other things I wanted to do, and the integral of my attention over time is not infinite. So I stopped hunting. But, unexpectedly and hilariously, I ended up contributing to the game in a very different way. Suspicion During the Fooming Shoggoth concert I noticed (probably thanks to the song lyrics) that the parts of the puzzle I was aware of were all lockboxes. I mean, I knew that already, but now I noticed. And I knew from the contents of the two boxes I had opened that there was more to the puzzles than the obvious. The whole setup seemed like a suspiciously plausible metaphor for the AI Box Experiment. I suddenly, strongly suspected that the puzzle hunt had a hidden narrative -- one that would end with the release of a rogue AI. Sequence Breaker! So I did what any Less Wronger should have done: I tried to warn others. No unbinding of seals, no opening of gates! At first I just told nearby hunters directly, but quickly realized it wouldn't work; if nothing else, I didn't know who was working on the puzzle hunt and who wasn't. I abandoned that plan and asked the front desk to print out three notes for me. The notes outlined my suspicions and warned the reader not to open whatever final box might be found at the end of the puzzle chain. I taped Note 1 to my medallion, which I returned to its original box. In addition to the warning, Note 1 asked anyone opening the box to leave the medallion and the note itself for later hunters to see, and suggested they return their own medallion(s), just in case. I had no reason to believe it mattered whether the medallions stayed in the boxes, but I had no reason to believe it didn't, and it was an obvious thing to try. I taped Note 2 to the table by the lockboxes. Note 2 asked anyone who shared my suspicions to sign it, as social pressure on others to not open boxes that might contain Very Bad Things. Note 3 was a copy of note 2, and stayed in my backpack for contingencies. After placing the notes, I moved on to other things. I'd volunteered to run a talk the following day on a whim, and preparation funged against sleep. By the time I had the slide deck put together it was 4am. Before going to bed, I checked on my warning notes. Note 2 was gone. I'd thought of that possibility, which was why I had a contingency copy. Now I had to decide whether to use it. Crap. Decision Theory So far I'd been intentionally open about what I was doing. I told multiple hunters about both my suspicions and the notes. I showed Note 1 to a volunteer when placing it. Note 2 was in plain sight. I even had a staff member at the front desk print the notes for me. My in-character warnings to hunters doubled as an out-of-character warning to the puzzlemasters: "possible derailment in progress". I didn't want to actually ruin whatever awesomeness they had planned. I wanted to prevent hunters from opening the hypothetical final box if and only if that was the true win condition. The problem was that I didn't know -- and couldn't know -- whether th...
undefined
Jun 3, 2024 • 20min

LW - The Standard Analogy by Zack M Davis

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Standard Analogy, published by Zack M Davis on June 3, 2024 on LessWrong. [Scene: a suburban house, a minute after the conclusion of "And All the Shoggoths Merely Players". Doomimir returns with his package, which he places by the door, and turns his attention to Simplicia, who has been waiting for him.] Simplicia: Right. To recap for [coughs] no one in particular, when we left off [pointedly, to the audience] one minute ago, Doomimir Doomovitch, you were expressing confidence that approaches to aligning artificial general intelligence within the current paradigm were almost certain to fail. You don't think that the apparent tractability of getting contemporary generative AI techniques to do what humans want bears on that question. But you did say you have empirical evidence for your view, which I'm excited to hear about! Doomimir: Indeed, Simplicia Optimistovna. My empirical evidence is the example of the evolution of human intelligence. You see, humans were optimized for one thing only: inclusive genetic fitness [Simplicia turns to the audience and makes a face.] Doomimir: [annoyed] What? Simplicia: When you said you had empirical evidence, I thought you meant empirical evidence about AI, not the same analogy to an unrelated field that I've been hearing for the last fifteen years. I was hoping for, you know, ArXiv papers about SGD's inductive biases, or online regret bounds, or singular learning theory ... something, anything at all, from this century, that engages with what we've learned from the experience of actually building artificial minds. Doomimir: That's one of the many things you Earthlings refuse to understand. You didn't build that. Simplicia: What? Doomimir: The capabilities advances that your civilization's AI guys have been turning out these days haven't come from a deeper understanding of cognition, but by improvements to generic optimization methods, fueled with ever-increasing investments in compute. Deep learning not only isn't a science, it isn't even an engineering discipline in the traditional sense: the opacity of the artifacts it produces has no analogue among bridge or engine designs. In effect, all the object-level engineering work is being done by gradient descent. The autogenocidal maniac Richard Sutton calls this the bitter lesson, and attributes the field's slowness to embrace it to ego and recalcitrance on the part of practitioners. But in accordance with the dictum to feel fully the emotion that fits the facts, I think bitterness is appropriate. It makes sense to be bitter about the shortsighted adoption of a fundamentally unalignable paradigm on the basis of its immediate performance, when a saner world would notice the glaring foreseeable difficulties and coordinate on doing Something Else Which Is Not That. Simplicia: I don't think that's quite the correct reading of the bitter lesson. Sutton is advocating general methods that scale with compute, as contrasted to hand-coding human domain knowledge, but that doesn't mean that we're ignorant of what those general methods are doing. One of the examples Sutton gives is computer chess, where minimax search with optimizations like α-β pruning prevailed over trying to explicitly encode what human grandmasters know about the game. But that seems fine. Writing a program that thinks about tactics the way humans do rather than letting tactical play emerge from searching the game tree would be a lot more work for less than no benefit. A broadly similar moral could apply to using deep learning to approximate complicated functions between data distributions: we specify the training distribution, and the details of fitting it are delegated to a network architecture with the appropriate invariances: convolutional nets for processing image data, transformers for variable-length sequences. Ther...
undefined
Jun 3, 2024 • 16min

LW - Companies' safety plans neglect risks from scheming AI by Zach Stein-Perlman

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Companies' safety plans neglect risks from scheming AI, published by Zach Stein-Perlman on June 3, 2024 on LessWrong. Without countermeasures, a scheming AI could escape. A safety case (for deployment) is an argument that it is safe to deploy a particular AI system in a particular way.[1] For any existing LM-based system, the developer can make a great safety case by demonstrating that the system does not have dangerous capabilities.[2] No dangerous capabilities is a straightforward and solid kind of safety case. For future systems with dangerous capabilities, a safety case will require an argument that the system is safe to deploy despite those dangerous capabilities. In this post, I discuss the safety cases that the labs are currently planning to make, note that they ignore an important class of threats - namely threats from scheming AI escaping - and briefly discuss and recommend control-based safety cases. I. Safety cases implicit in current safety plans: no dangerous capabilities and mitigations to prevent misuse Four documents both (a) are endorsed by one or more frontier AI labs and (b) have implicit safety cases (that don't assume away dangerous capabilities): Anthropic's Responsible Scaling Policy v1.0, OpenAI's Preparedness Framework (Beta), Google DeepMind's Frontier Safety Framework v1.0, and the AI Seoul Summit's Frontier AI Safety Commitments. With small variations, all four documents have the same basic implicit safety case: before external deployment, we check for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice and implement appropriate mitigations before deploying it externally.[3] Central examples of dangerous capabilities include hacking, bioengineering, and operating autonomously in the real world. 1. Anthropic's Responsible Scaling Policy: we do risk assessment involving red-teaming and model evals for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold,[4] we will notice and implement corresponding mitigations[5] before deploying it (internally or externally). 2. OpenAI's Preparedness Framework: we do risk assessment involving red-teaming and model evals for dangerous capabilities. We only externally deploy[6] models with "post-mitigation risk" at 'Medium' or below in each risk category. (That is, after mitigations, the capabilities that define 'High' risk can't be elicited.) 3. Google DeepMind's Frontier Safety Framework: we do risk assessment involving red-teaming and model evals for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice before external deployment.[7] "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results."[8] Mitigations are centrally about preventing "critical capabilities" from being "accessed" (and securing model weights). 4. Frontier AI Safety Commitments (joined by 16 AI companies): before external deployment, we will do risk assessment with risk thresholds.[9] We use mitigations[10] "to keep risks within defined thresholds." These safety cases miss (or assume unproblematic) some crucial kinds of threats. II. Scheming AI and escape during internal deployment By default, AI labs will deploy AIs internally to do AI development. Maybe lots of risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes - in particular, those resources can be abused to run AIs unmonitored." Without countermeasures, if the AI is scheming, i...
undefined
Jun 3, 2024 • 14min

AF - AI catastrophes and rogue deployments by Buck Shlegeris

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI catastrophes and rogue deployments, published by Buck Shlegeris on June 3, 2024 on The AI Alignment Forum. [Thanks to Aryan Bhatt, Ansh Radhakrishnan, Adam Kaufman, Vivek Hebbar, Hanna Gabor, Justis Mills, Aaron Scher, Max Nadeau, Ryan Greenblatt, Peter Barnett, Fabien Roger, and various people at a presentation of these arguments for comments. These ideas aren't very original to me; many of the examples of threat models are from other people.] In this post, I want to introduce the concept of a "rogue deployment" and argue that it's interesting to classify possible AI catastrophes based on whether or not they involve a rogue deployment. I'll also talk about how this division interacts with the structure of a safety case, discuss two important subcategories of rogue deployment, and make a few points about how the different categories I describe here might be caused by different attackers (e.g. the AI itself, rogue lab insiders, external hackers, or multiple of these at once). Rogue deployments Suppose you've developed some powerful model. (It might actually be many models; I'm just going to talk about a singular model for convenience.) You have some definition of a catastrophe - a failure that you don't want to happen even once. Your AI is powerful enough that, in absence of any safety measures, it would pose a substantial risk of causing a catastrophe. In the terminology of " Safety Cases: How to Justify the Safety of Advanced AI Systems", we'd phrase this as "the model is too powerful for us to make inability arguments for safety." (Sometimes I use the word "catastrophe" to include things like "the AI escapes the datacenter". Right now, I'm excluding that. I want to reserve "catastrophe" for the end result that we find unacceptable, e.g. the AI or another adversary amassing power, launching nukes, starting a war, building superintelligence, etc.) For the moment, let's be agnostic between various adversaries who might try to cause a catastrophe (AI or various humans). You have some safety measures that you've put into place to prevent the catastrophe from occurring. Safety measures are interventions that are applied to all of the deployments of your model (to any users, including just internal users) that try to prevent safety problems. This might include control measures like trusted monitoring, and security measures to make it harder to exfiltrate your model weights. A rogue deployment is a deployment of your model where the safety measures are absent. For example: an AI hacking its datacenter and running itself while hiding what it's doing from its creators an AI self-exfiltrating and running itself on unmonitored servers someone stealing the AI and running it in their own datacenter You can classify AI catastrophes based on whether they involve a rogue deployment or not: In the case of a catastrophe that doesn't involve rogue deployment, the catastrophe happened with all your safety measures in place the whole time; these measures just didn't suffice to spot (or prevent) that something terrible was happening until it was too late. For catastrophes that did involve rogue deployment, the catastrophe happened in two steps: Firstly, while the safety measures were in place, something went wrong such that a rogue deployment could occur. (Perhaps there was a security failure, or a failure of control that allowed an AI to run itself unmonitored in its datacenter.) Secondly, that rogue deployment caused a catastrophe. (Another way of phrasing all this would be to talk about instrumental vs terminal catastrophes - the terminal catastrophe is the end state we wanted to avoid, and the instrumental catastrophe is something happening that makes the situation go out of our control such that a terminal catastrophe is now likely and it's much harder for us to prevent it. So in...
undefined
Jun 3, 2024 • 4min

EA - Review of Past Grants: The $100,000 Grant for a Video Game? by Nicolae

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Review of Past Grants: The $100,000 Grant for a Video Game?, published by Nicolae on June 3, 2024 on The Effective Altruism Forum. Since 2017, EA Funds has been providing grants across four distinct cause areas. While there are payout reports available, there is a lack of reports detailing the outcomes of these grants, so I delved out of curiosity into the Grants Database to review some of the proposals that received funding and evaluate their outcomes. Some of the findings were quite unexpected, particularly for the Long-Term Future Fund and the EA Infrastructure Fund. The case involving a $100,000 grant for a video game In July 2022, EA approved a $100,000 grant to Lone Pine Games, LLC, for developing and marketing a video game designed to explain the Stop Button Problem to the public and STEM professionals. Outcomes from looking into Lone Pine Games, LLC: 1. After almost two years, there are no online mentions of such a game being developed by this company, except for the note on the EA Funds page. 2. Lone Pine Games released only one game available on multiple gaming platforms back in 2020 called NewCity, a city-building simulator, similar to SimCity from the 1990s. 3. After a few updates, the development of NewCity was abandoned last year, and its source code was made public on GitHub. Video trailer for the game NewCity, which was highly likely the primary track record in the grant proposal for the development of the Stop Button Problem video game. Despite the absence of any concrete results from this grant, let's entertain the idea that the Stop Button Problem game was produced and that it was decent. Would such a game "positively influence the long-term trajectory of civilization," as described by the Long-Term Future Fund? For context, Rob Miles's videos (1) and (2) from 2017 on the Stop Button Problem already provided clear explanations for the general public. It seems insane to even compare, but was this expenditure of $100,000 really justified when these funds could have been used to save 20-30 children's lives or provide cataract surgery to around 4000 people? With a grant like this one, Steven Pinker's remarks on longtermism funding seem increasingly accurate. Training guide dogs for the visually impaired, often highlighted in EA discussions, stands out as a highly effective donation strategy by contrast. I selected this example because it involves a company rather than individual recipients. However, I found numerous other cases, many even worse, primarily involving digital creators with barely any content produced during their funding period, people needing big financial support to change careers, and independent researchers whose proposals had not, at the time of writing, resulted in any published papers. As Yann LeCun recently said, "If you do research and don't publish, it's not science." These grants are not just a few thousand dollars distributed here and there; in many instances, they amount to tens or even hundreds of thousands of dollars. While many people, myself included, strongly support funding people's passion projects, because some of these grants really look that way, this approach seems more akin to Patreon or Kickstarter and not to effective giving. I believe many donors, who think they are contributing effectively, may not be fully aware of how their money is being utilized. Donors contribute to these funds expecting rigorous analysis comparable to GiveWell's standards, even for more speculative areas that rely on hypotheticals, hoping their money is not wasted, so they entrust that responsibility to EA fund managers, whom they assume make better and more informed decisions with their contributions. With seven full years of funding on record, I believe a thorough evaluation of previous grants is needed. Even if the grants were provide...
undefined
Jun 3, 2024 • 14min

LW - Comments on Anthropic's Scaling Monosemanticity by Robert AIZI

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comments on Anthropic's Scaling Monosemanticity, published by Robert AIZI on June 3, 2024 on LessWrong. These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. TL;DR In roughly descending order of importance: 1. Its great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features. Its great those features allow interventions like the recently-departed Golden Gate Claude. I especially like the code bug feature. 2. I worry that naming features after high-activating examples (e.g. "the Golden Gate Bridge feature") gives a false sense of security. Most of the time that feature activates, it is irrelevant to the golden gate bridge. That feature is only well-described as "related to the golden gate bridge" if you condition on a very high activation, and that's <10% of its activations (from an eyeballing of the graph). 3. This work does not address my major concern about dictionary learning: it is not clear dictionary learning can find specific features of interest, "called-shot" features, or "all" features (even in a subdomain like "safety-relevant features"). I think the report provides ample evidence that current SAE techniques fail at this. 4. The SAE architecture seems to be almost identical to how Anthropic and my team were doing it 8 months ago, except that the ratio of features to input dimension is higher. I can't say exactly how much because I don't know the dimensions of Claude, but I'm confident the ratio is at least 30x (for their smallest SAE), up from 8x 8 months ago. 5. The correlations between features and neurons seems remarkably high to me, and I'm confused by Anthropic's claim that "there is no strongly correlated neuron". 6. Still no breakthrough on "a gold-standard method of assessing the quality of a dictionary learning run", which continues to be a limitation on developing the technique. The metric they primarily used was the loss function (a combination of reconstruction accuracy and L1 sparsity). I'll now expand some of these into sections. Finally, I'll suggest some follow-up research/tests that I'd love to see Anthropic (or a reader like you) try. A Feature Isn't Its Highest Activating Examples Let's look at the Golden Gate Bridge feature because its fun and because it's a good example of what I'm talking about. Here's my annotated version of Anthropic's diagram: I think Anthropic successfully demonstrated (in the paper and with Golden Gate Claude) that this feature, at very high activation levels, corresponds to the Golden Gate Bridge. But on a median instance of text where this feature is active, it is "irrelevant" to the Golden Gate Bridge, according to their own autointerpretability metric! I view this as analogous to naming water "the drowning liquid", or Boeing the "door exploding company". Yes, in extremis, water and Boeing are associated with drowning and door blowouts, but any interpretation that ends there would be limited. Anthropic's work writes around this uninterpretability in a few ways, by naming the feature based on the top examples, highlighting the top examples, pinning the intervention model to 10x the activation (vs .1x its top activation), and showing subsamples from evenly spaced intervals (vs deciles). I think would be illuminating if they added to their feature browser page some additional information about the fraction of instances in each subsample, e.g., "Subsample Interval 2 (0.4% of activations)". Whether a feature is or isn't its top activating examples is important because it constrains their usefulness: Could work with our current feature discovery approach: find the "aligned with human flourishing" feature, and pin that to 10x its max activation. ...

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app