The Nonlinear Library

The Nonlinear Fund

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

Episodes

Mentioned books

Jun 4, 2024 • 1min

EA - GiveWell is hiring a Head of Fundraising Operations and Analytics by GiveWell

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GiveWell is hiring a Head of Fundraising Operations and Analytics, published by GiveWell on June 4, 2024 on The Effective Altruism Forum. We're hiring a Head of Fundraising Operations and Analytics to understand, analyze, and scale GiveWell's fundraising work. The Head of Fundraising Ops will define the research questions that we want to answer, design processes and a tech stack to collect information, and use creative qualitative and quantitative methods to analyze the data we obtain. They'll manage a small, efficient fundraising team and report to the Director of Development. Competitive candidates have prior experience solving open-ended business questions with limited or incomplete data, and have previously developed efficient and scalable processes for others to implement. Management experience is a major plus. Professional backgrounds that might be a particularly strong fit include revenue operations, data analytics, market research, and consulting. See compensation details and full job description and apply at the link above - please also share this job with anyone in your network who might be a good fit. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 4, 2024 • 22min

AF - [Paper] Stress-testing capability elicitation by training password-locked models by Fabien Roger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Paper] Stress-testing capability elicitation by training password-locked models, published by Fabien Roger on June 4, 2024 on The AI Alignment Forum. The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David. Scheming models might try to perform less capably than they are able to ( sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or recursive oversight more generally. You might hope that you can avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail. It could fail due to having too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions. We're uncertain how this goes in practice. It could be the case that simple countermeasures prevent sandbagging in most of the important cases such that it isn't a big problem for safety. But it could also turn out that even high-effort approaches can't rule out these issues. We released a paper studying this by examining how well supervised fine-tuning and RL can elicit capabilities from password-locked models (LLMs trained to produce low-quality outputs except when a particular password is provided). We find that: If you have some high-quality demonstrations, supervised fine-tuning usually recovers the hidden capabilities with very few samples. Elicitation on password-locked models generalizes well across domains (even when we made password-locked models using techniques that we thought might prevent generalization). Using RL on password-locked models recovers hidden capabilities, except when the weak model RL starts from is extremely weak. While we target cases where AIs intentionally produce low-quality outputs and they could easily produce higher quality outputs, we also think our results have some relevance to other cases where AIs produce low-quality outputs despite being able to produce better ones. Here are the main takeaways: We think our results are a moderate update against scheming models being able to sandbag effectively (given countermeasures) for the kind of tasks we studied here, though there are differences between password-locked models and actual scheming models. We think this should be a bigger update for people who are relatively more pessimistic than us about sandbagging (at least insofar as their views make pessimistic predictions about current architectures). (We think this is also a small update against scheming, since scheming models might still try to sandbag, and if elicitation fine-tunes sandbagging out, it may partially erode scheming.) People interested in studying sandbagging or similar elicitation failures should consider using a methodology similar to the password-locked models methodology we use here. More generally, they should consider using some sort of adversarial methodology in which a red team constructs a model organism. In this post: We explain why good capability elicitation matters for AI safety We describe the sandbagging concern we are trying to assess; We summarize the results of our paper, and discuss their implications; We discuss how our results relate to relevant AGI safety work; We describe future work we are excited by; We discuss concerns other than intentional sandbagging for which our results might be relevant. The big picture: why good capability elicitation matters for AI safety Currently, AI labs try reasonably hard to get as much performance as possible out of their expensive pretrained model. The process of making an AI perform a task competently (without a huge amount of additional training) is often called eli...

Jun 4, 2024 • 4min

LW - Companies' safety plans neglect risks from scheming AI by Zach Stein-Perlman

Jun 3, 2024 • 14min

AF - AI catastrophes and rogue deployments by Buck Shlegeris

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI catastrophes and rogue deployments, published by Buck Shlegeris on June 3, 2024 on The AI Alignment Forum. [Thanks to Aryan Bhatt, Ansh Radhakrishnan, Adam Kaufman, Vivek Hebbar, Hanna Gabor, Justis Mills, Aaron Scher, Max Nadeau, Ryan Greenblatt, Peter Barnett, Fabien Roger, and various people at a presentation of these arguments for comments. These ideas aren't very original to me; many of the examples of threat models are from other people.] In this post, I want to introduce the concept of a "rogue deployment" and argue that it's interesting to classify possible AI catastrophes based on whether or not they involve a rogue deployment. I'll also talk about how this division interacts with the structure of a safety case, discuss two important subcategories of rogue deployment, and make a few points about how the different categories I describe here might be caused by different attackers (e.g. the AI itself, rogue lab insiders, external hackers, or multiple of these at once). Rogue deployments Suppose you've developed some powerful model. (It might actually be many models; I'm just going to talk about a singular model for convenience.) You have some definition of a catastrophe - a failure that you don't want to happen even once. Your AI is powerful enough that, in absence of any safety measures, it would pose a substantial risk of causing a catastrophe. In the terminology of " Safety Cases: How to Justify the Safety of Advanced AI Systems", we'd phrase this as "the model is too powerful for us to make inability arguments for safety." (Sometimes I use the word "catastrophe" to include things like "the AI escapes the datacenter". Right now, I'm excluding that. I want to reserve "catastrophe" for the end result that we find unacceptable, e.g. the AI or another adversary amassing power, launching nukes, starting a war, building superintelligence, etc.) For the moment, let's be agnostic between various adversaries who might try to cause a catastrophe (AI or various humans). You have some safety measures that you've put into place to prevent the catastrophe from occurring. Safety measures are interventions that are applied to all of the deployments of your model (to any users, including just internal users) that try to prevent safety problems. This might include control measures like trusted monitoring, and security measures to make it harder to exfiltrate your model weights. A rogue deployment is a deployment of your model where the safety measures are absent. For example: an AI hacking its datacenter and running itself while hiding what it's doing from its creators an AI self-exfiltrating and running itself on unmonitored servers someone stealing the AI and running it in their own datacenter You can classify AI catastrophes based on whether they involve a rogue deployment or not: In the case of a catastrophe that doesn't involve rogue deployment, the catastrophe happened with all your safety measures in place the whole time; these measures just didn't suffice to spot (or prevent) that something terrible was happening until it was too late. For catastrophes that did involve rogue deployment, the catastrophe happened in two steps: Firstly, while the safety measures were in place, something went wrong such that a rogue deployment could occur. (Perhaps there was a security failure, or a failure of control that allowed an AI to run itself unmonitored in its datacenter.) Secondly, that rogue deployment caused a catastrophe. (Another way of phrasing all this would be to talk about instrumental vs terminal catastrophes - the terminal catastrophe is the end state we wanted to avoid, and the instrumental catastrophe is something happening that makes the situation go out of our control such that a terminal catastrophe is now likely and it's much harder for us to prevent it. So in...

Jun 3, 2024 • 4min

EA - Review of Past Grants: The $100,000 Grant for a Video Game? by Nicolae

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Review of Past Grants: The $100,000 Grant for a Video Game?, published by Nicolae on June 3, 2024 on The Effective Altruism Forum. Since 2017, EA Funds has been providing grants across four distinct cause areas. While there are payout reports available, there is a lack of reports detailing the outcomes of these grants, so I delved out of curiosity into the Grants Database to review some of the proposals that received funding and evaluate their outcomes. Some of the findings were quite unexpected, particularly for the Long-Term Future Fund and the EA Infrastructure Fund. The case involving a $100,000 grant for a video game In July 2022, EA approved a $100,000 grant to Lone Pine Games, LLC, for developing and marketing a video game designed to explain the Stop Button Problem to the public and STEM professionals. Outcomes from looking into Lone Pine Games, LLC: 1. After almost two years, there are no online mentions of such a game being developed by this company, except for the note on the EA Funds page. 2. Lone Pine Games released only one game available on multiple gaming platforms back in 2020 called NewCity, a city-building simulator, similar to SimCity from the 1990s. 3. After a few updates, the development of NewCity was abandoned last year, and its source code was made public on GitHub. Video trailer for the game NewCity, which was highly likely the primary track record in the grant proposal for the development of the Stop Button Problem video game. Despite the absence of any concrete results from this grant, let's entertain the idea that the Stop Button Problem game was produced and that it was decent. Would such a game "positively influence the long-term trajectory of civilization," as described by the Long-Term Future Fund? For context, Rob Miles's videos (1) and (2) from 2017 on the Stop Button Problem already provided clear explanations for the general public. It seems insane to even compare, but was this expenditure of $100,000 really justified when these funds could have been used to save 20-30 children's lives or provide cataract surgery to around 4000 people? With a grant like this one, Steven Pinker's remarks on longtermism funding seem increasingly accurate. Training guide dogs for the visually impaired, often highlighted in EA discussions, stands out as a highly effective donation strategy by contrast. I selected this example because it involves a company rather than individual recipients. However, I found numerous other cases, many even worse, primarily involving digital creators with barely any content produced during their funding period, people needing big financial support to change careers, and independent researchers whose proposals had not, at the time of writing, resulted in any published papers. As Yann LeCun recently said, "If you do research and don't publish, it's not science." These grants are not just a few thousand dollars distributed here and there; in many instances, they amount to tens or even hundreds of thousands of dollars. While many people, myself included, strongly support funding people's passion projects, because some of these grants really look that way, this approach seems more akin to Patreon or Kickstarter and not to effective giving. I believe many donors, who think they are contributing effectively, may not be fully aware of how their money is being utilized. Donors contribute to these funds expecting rigorous analysis comparable to GiveWell's standards, even for more speculative areas that rely on hypotheticals, hoping their money is not wasted, so they entrust that responsibility to EA fund managers, whom they assume make better and more informed decisions with their contributions. With seven full years of funding on record, I believe a thorough evaluation of previous grants is needed. Even if the grants were provide...

Jun 3, 2024 • 14min

LW - Comments on Anthropic's Scaling Monosemanticity by Robert AIZI

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comments on Anthropic's Scaling Monosemanticity, published by Robert AIZI on June 3, 2024 on LessWrong. These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. TL;DR In roughly descending order of importance: 1. Its great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features. Its great those features allow interventions like the recently-departed Golden Gate Claude. I especially like the code bug feature. 2. I worry that naming features after high-activating examples (e.g. "the Golden Gate Bridge feature") gives a false sense of security. Most of the time that feature activates, it is irrelevant to the golden gate bridge. That feature is only well-described as "related to the golden gate bridge" if you condition on a very high activation, and that's <10% of its activations (from an eyeballing of the graph). 3. This work does not address my major concern about dictionary learning: it is not clear dictionary learning can find specific features of interest, "called-shot" features, or "all" features (even in a subdomain like "safety-relevant features"). I think the report provides ample evidence that current SAE techniques fail at this. 4. The SAE architecture seems to be almost identical to how Anthropic and my team were doing it 8 months ago, except that the ratio of features to input dimension is higher. I can't say exactly how much because I don't know the dimensions of Claude, but I'm confident the ratio is at least 30x (for their smallest SAE), up from 8x 8 months ago. 5. The correlations between features and neurons seems remarkably high to me, and I'm confused by Anthropic's claim that "there is no strongly correlated neuron". 6. Still no breakthrough on "a gold-standard method of assessing the quality of a dictionary learning run", which continues to be a limitation on developing the technique. The metric they primarily used was the loss function (a combination of reconstruction accuracy and L1 sparsity). I'll now expand some of these into sections. Finally, I'll suggest some follow-up research/tests that I'd love to see Anthropic (or a reader like you) try. A Feature Isn't Its Highest Activating Examples Let's look at the Golden Gate Bridge feature because its fun and because it's a good example of what I'm talking about. Here's my annotated version of Anthropic's diagram: I think Anthropic successfully demonstrated (in the paper and with Golden Gate Claude) that this feature, at very high activation levels, corresponds to the Golden Gate Bridge. But on a median instance of text where this feature is active, it is "irrelevant" to the Golden Gate Bridge, according to their own autointerpretability metric! I view this as analogous to naming water "the drowning liquid", or Boeing the "door exploding company". Yes, in extremis, water and Boeing are associated with drowning and door blowouts, but any interpretation that ends there would be limited. Anthropic's work writes around this uninterpretability in a few ways, by naming the feature based on the top examples, highlighting the top examples, pinning the intervention model to 10x the activation (vs .1x its top activation), and showing subsamples from evenly spaced intervals (vs deciles). I think would be illuminating if they added to their feature browser page some additional information about the fraction of instances in each subsample, e.g., "Subsample Interval 2 (0.4% of activations)". Whether a feature is or isn't its top activating examples is important because it constrains their usefulness: Could work with our current feature discovery approach: find the "aligned with human flourishing" feature, and pin that to 10x its max activation. ...

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app