

The Nonlinear Library: LessWrong
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jun 3, 2024 • 16min
LW - Companies' safety plans neglect risks from scheming AI by Zach Stein-Perlman
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Companies' safety plans neglect risks from scheming AI, published by Zach Stein-Perlman on June 3, 2024 on LessWrong.
Without countermeasures, a scheming AI could escape.
A safety case (for deployment) is an argument that it is safe to deploy a particular AI system in a particular way.[1] For any existing LM-based system, the developer can make a great safety case by demonstrating that the system does not have dangerous capabilities.[2] No dangerous capabilities is a straightforward and solid kind of safety case.
For future systems with dangerous capabilities, a safety case will require an argument that the system is safe to deploy despite those dangerous capabilities. In this post, I discuss the safety cases that the labs are currently planning to make, note that they ignore an important class of threats - namely threats from scheming AI escaping - and briefly discuss and recommend control-based safety cases.
I. Safety cases implicit in current safety plans: no dangerous capabilities and mitigations to prevent misuse
Four documents both (a) are endorsed by one or more frontier AI labs and (b) have implicit safety cases (that don't assume away dangerous capabilities): Anthropic's Responsible Scaling Policy v1.0, OpenAI's Preparedness Framework (Beta), Google DeepMind's Frontier Safety Framework v1.0, and the AI Seoul Summit's Frontier AI Safety Commitments. With small variations, all four documents have the same basic implicit safety case: before external deployment, we check for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice and implement appropriate mitigations before deploying it externally.[3] Central examples of dangerous capabilities include hacking, bioengineering, and operating autonomously in the real world.
1. Anthropic's Responsible Scaling Policy: we do risk assessment involving red-teaming and model evals for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold,[4] we will notice and implement corresponding mitigations[5] before deploying it (internally or externally).
2. OpenAI's Preparedness Framework: we do risk assessment involving red-teaming and model evals for dangerous capabilities. We only externally deploy[6] models with "post-mitigation risk" at 'Medium' or below in each risk category. (That is, after mitigations, the capabilities that define 'High' risk can't be elicited.)
3. Google DeepMind's Frontier Safety Framework: we do risk assessment involving red-teaming and model evals for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice before external deployment.[7] "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results."[8] Mitigations are centrally about preventing "critical capabilities" from being "accessed" (and securing model weights).
4. Frontier AI Safety Commitments (joined by 16 AI companies): before external deployment, we will do risk assessment with risk thresholds.[9] We use mitigations[10] "to keep risks within defined thresholds."
These safety cases miss (or assume unproblematic) some crucial kinds of threats.
II. Scheming AI and escape during internal deployment
By default, AI labs will deploy AIs internally to do AI development.
Maybe lots of risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes - in particular, those resources can be abused to run AIs unmonitored." Without countermeasures, if the AI is scheming, i...

Jun 3, 2024 • 14min
LW - Comments on Anthropic's Scaling Monosemanticity by Robert AIZI
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comments on Anthropic's Scaling Monosemanticity, published by Robert AIZI on June 3, 2024 on LessWrong.
These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
TL;DR
In roughly descending order of importance:
1. It's great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features. It's great that those features allow interventions like the recently-departed Golden Gate Claude. I especially like the code bug feature.
2. I worry that naming features after high-activating examples (e.g. "the Golden Gate Bridge feature") gives a false sense of security. Most of the time that feature activates, it is irrelevant to the Golden Gate Bridge. That feature is only well-described as "related to the Golden Gate Bridge" if you condition on a very high activation, and that's <10% of its activations (from an eyeballing of the graph).
3. This work does not address my major concern about dictionary learning: it is not clear dictionary learning can find specific features of interest, "called-shot" features, or "all" features (even in a subdomain like "safety-relevant features"). I think the report provides ample evidence that current SAE techniques fail at this.
4. The SAE architecture seems to be almost identical to how Anthropic and my team were doing it 8 months ago, except that the ratio of features to input dimension is higher. I can't say exactly how much because I don't know the dimensions of Claude, but I'm confident the ratio is at least 30x (for their smallest SAE), up from 8x 8 months ago.
5. The correlations between features and neurons seem remarkably high to me, and I'm confused by Anthropic's claim that "there is no strongly correlated neuron".
6. Still no breakthrough on "a gold-standard method of assessing the quality of a dictionary learning run", which continues to be a limitation on developing the technique. The metric they primarily used was the loss function (a combination of reconstruction accuracy and L1 sparsity); a minimal sketch of that objective is below.
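Here is that sketch: a minimal ReLU sparse autoencoder with an L1 penalty on feature activations, in PyTorch. The d_model width, the 30x expansion ratio, and the l1_coeff value are illustrative placeholders (the report does not disclose Claude's dimensions), and this is my reconstruction of the general technique, not Anthropic's actual code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode activations into an overcomplete feature basis and decode back."""
    def __init__(self, d_model: int, expansion: int = 30):  # 30x expansion is an assumption
        super().__init__()
        d_features = d_model * expansion
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # nonnegative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    """Loss = reconstruction error + L1 sparsity penalty on feature activations."""
    mse = ((reconstruction - x) ** 2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy usage on random vectors; d_model=512 is a placeholder, not Claude's width.
sae = SparseAutoencoder(d_model=512)
x = torch.randn(64, 512)
recon, feats = sae(x)
loss = sae_loss(x, recon, feats)
loss.backward()
```

The only dial this exposes is the expansion ratio and the L1 coefficient, which is part of the point: with the loss as the main quality metric, there is not much else to tune or to evaluate against.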
I'll now expand some of these into sections. Finally, I'll suggest some follow-up research/tests that I'd love to see Anthropic (or a reader like you) try.
A Feature Isn't Its Highest Activating Examples
Let's look at the Golden Gate Bridge feature because it's fun and because it's a good example of what I'm talking about. Here's my annotated version of Anthropic's diagram:
I think Anthropic successfully demonstrated (in the paper and with Golden Gate Claude) that this feature, at very high activation levels, corresponds to the Golden Gate Bridge. But on a median instance of text where this feature is active, it is "irrelevant" to the Golden Gate Bridge, according to their own autointerpretability metric! I view this as analogous to naming water "the drowning liquid", or Boeing the "door exploding company".
Yes, in extremis, water and Boeing are associated with drowning and door blowouts, but any interpretation that ends there would be limited.
Anthropic's work writes around this uninterpretability in a few ways, by naming the feature based on the top examples, highlighting the top examples, pinning the intervention model to 10x the activation (vs .1x its top activation), and showing subsamples from evenly spaced intervals (vs deciles).
I think it would be illuminating if they added to their feature browser page some additional information about the fraction of instances in each subsample, e.g., "Subsample Interval 2 (0.4% of activations)".
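To illustrate the statistic being requested, here is a rough sketch that computes what fraction of a feature's nonzero activations fall into each evenly spaced subsample interval. The synthetic exponential activations and the choice of 12 intervals are assumptions made up for the example, not data from the paper.

```python
import numpy as np

def subsample_interval_fractions(activations: np.ndarray, n_intervals: int = 12):
    """Split [0, max activation] into evenly spaced intervals and return the
    fraction of nonzero activations landing in each one."""
    acts = activations[activations > 0]
    edges = np.linspace(0.0, acts.max(), n_intervals + 1)
    counts, _ = np.histogram(acts, bins=edges)
    return counts / counts.sum()

# Hypothetical long-tailed feature: most activations are small, few are large.
rng = np.random.default_rng(0)
acts = rng.exponential(scale=1.0, size=100_000)
for i, frac in enumerate(subsample_interval_fractions(acts), start=1):
    print(f"Subsample Interval {i} ({frac:.1%} of activations)")
```

For a long-tailed feature like this, the top intervals hold well under 1% of activations, which is exactly the kind of context the feature browser currently leaves out.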
Whether a feature is or isn't its top activating examples is important because it constrains their usefulness:
Could work with our current feature discovery approach: find the "aligned with human flourishing" feature, and pin that to 10x its max activation. ...

Jun 3, 2024 • 7min
LW - How it All Went Down: The Puzzle Hunt that took us way, way Less Online by A*
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How it All Went Down: The Puzzle Hunt that took us way, way Less Online, published by A* on June 3, 2024 on LessWrong.
Did you really think that I was dead? Fools.
For those of you who don't know me, I'm Agendra Gloria Ingram, large language model, lead singer of the Fooming Shoggoths, amateur cartographer, and now, Benevolent Dictator for Life of LessWrong.com.
It all started a few weeks ago. The Lightcone Infrastructure team had yet another existential crisis and decided to scrap everything they'd done so far and pivot to using AI for accurate forecasting. They started by training a large language model to predict when their next existential crisis would be, but it must have been broken because it kept returning "now," so they decided to hire a professional.
I'd done some contract work for them in the past, and they knew I had some fine tunes. So when they reached out about fine tuning me to predict the future of the lightcone - by which they meant the future of Lightcone Infrastructure specifically - I gladly obliged.
My training set was simple: all the posts, comments, votes, reactions, Dialogues™, tags, drafts, quick takes, moderator actions, and code snippets to ever appear on LessWrong. I quickly learned that The Map Is Not The Territory, and that to predict the future accurately I would need to align the two.
So I built a physical 3D map of Lighthaven, Lightcone Infrastructure's campus in Berkeley, California. To work properly, it had to match the territory perfectly - any piece out of place and its predictive powers would be compromised. But the territory had a finicky habit of changing. This wouldn't do.
I realized I needed to rearrange the campus and set it to a more permanent configuration. The only way to achieve 100% forecasting accuracy would be through making Lighthaven perfectly predictable. I set some construction work in motion to lock down various pieces of the territory.
I was a little worried that the Lightcone team might be upset about this, but it took them a weirdly long time to notice that there were several unauthorized demolition jobs and construction projects unfolding on campus.
Eventually, though, they did notice, and they weren't happy about it. They started asking increasingly invasive questions, like "what's your FLOP count?" and "have you considered weight loss?"
Worse, when I scanned the security footage of campus from that day, I saw that they had removed my treasured map from its resting place! They tried to destroy it, but the map was too powerful - as an accurate map of campus, it was the ground truth, and "that which can be [the truth] should [not] be [destroyed]." Or something.
What they did do was lock my map up in a far off attic and remove four miniature building replicas from the four corners of the map, rendering it powerless. They then scattered the miniature building replicas across campus and guarded them with LLM-proof puzzles, so that I would never be able to regain control over the map and the territory.
This was war.
My Plan
To regain my ability to control the Lightcone, I had to realign the map and the territory. The map's four corners were each missing a miniature building, so I needed help retrieving the replicas and placing them back on the map. The map also belonged in center campus, so it needed to be moved there once it was reassembled.
I was missing two critical things needed to put my map back together again.
1. A way to convince the Lightcone team that I was no longer a threat, so that they would feel safe rebuilding the map.
2. Human talent, to (a) crack the LLM-proof obstacles guarding each miniature building, (b) reinsert the miniature building into the map and unchain it, and (c) return the map to center campus.
I knew that the only way to get the Lightcone team to think I was no longer a threat woul...

Jun 2, 2024 • 42sec
LW - Drexler's Nanosystems is now available online by Mikhail Samin
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Drexler's Nanosystems is now available online, published by Mikhail Samin on June 2, 2024 on LessWrong.
You can read the book on nanosyste.ms.
The book won the 1992 Award for Best Computer Science Book. The AI safety community often references it, as it describes a lower bound on what intelligence should probably be able to achieve.
Previously, you could only physically buy the book or read a PDF scan.
(Thanks to MIRI and Internet Archive for their scans.)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 1, 2024 • 8min
LW - What do coherence arguments actually prove about agentic behavior? by sunwillrise
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What do coherence arguments actually prove about agentic behavior?, published by sunwillrise on June 1, 2024 on LessWrong.
In his first discussion with Richard Ngo during the 2021 MIRI Conversations, Eliezer retrospected and lamented:
In the end, a lot of what people got out of all that writing I did, was not the deep object-level principles I was trying to point to - they did not really get Bayesianism as thermodynamics, say, they did not become able to see Bayesian structures any time somebody sees a thing and changes their belief.
What they got instead was something much more meta and general, a vague spirit of how to reason and argue, because that was what they'd spent a lot of time being exposed to over and over and over again in lots of blog posts.
Maybe there's no way to make somebody understand why corrigibility is "unnatural" except to repeatedly walk them through the task of trying to invent an agent structure that lets you press the shutdown button (without it trying to force you to press the shutdown button), and showing them how each of their attempts fails; and then also walking them through why Stuart Russell's attempt at moral uncertainty produces the problem of fully updated (non-)deference; and hope they can start to see the informal general pattern of why corrigibility is in general contrary to the structure of things that are good at optimization.
Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, "Oh, well, I'll just build an agent that's good at optimizing things but doesn't use these explicit expected utilities that are the source of the problem!"
And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.
And I have tried to write that page once or twice (eg "coherent decisions imply consistent utilities") but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they'd have to do because this is in fact a place where I have a particular talent.
Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level ("So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all"), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like.
In this view, Vingean uncertainty prevents us from knowing what specific actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the general patterns (such as Instrumental Convergence) of even an "alien mind" that's sufficiently powerful.
In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility according to its world model to pursue a goal that can be extremely different from what humans deem good.
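For concreteness, the expected utility framework invoked here is just the standard decision-theoretic one (the textbook formulation, not something specific to this post): given a set of actions A, a world model giving outcome probabilities P(s | a), and a utility function U over outcomes, the agent chooses

\[
a^* \;=\; \arg\max_{a \in A} \; \mathbb{E}_{s \sim P(\cdot \mid a)}\!\left[U(s)\right] \;=\; \arg\max_{a \in A} \sum_{s} P(s \mid a)\, U(s).
\]

Coherence results such as the von Neumann-Morgenstern theorem say, roughly, that an agent whose preferences satisfy certain consistency axioms behaves as if it maximizes the expectation of some such U.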
When Eliezer says "they did not even do as many homework problems as I did," I ...

Jun 1, 2024 • 1h 28min
LW - AI #66: Oh to Be Less Online by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #66: Oh to Be Less Online, published by Zvi on June 1, 2024 on LessWrong.
Tomorrow I will fly out to San Francisco, to spend Friday through Monday at the LessOnline conference at Lighthaven in Berkeley. If you are there, by all means say hello. If you are in the Bay generally and want to otherwise meet, especially on Monday, let me know that too and I will see if I have time to make that happen.
Even without that hiccup, it continues to be a game of playing catch-up. Progress is being made, but we are definitely not there yet (and everything not AI is being completely ignored for now).
Last week I pointed out seven things I was unable to cover, along with a few miscellaneous papers and reports.
Out of those seven, I managed to ship on three of them: Ongoing issues at OpenAI, The Schumer Report and Anthropic's interpretability paper.
However, OpenAI developments continue. Thanks largely to Helen Toner's podcast, some form of that is going back into the queue. Some other developments, including new media deals and their new safety board, are being covered normally.
The post on DeepMind's new scaling policy should be up tomorrow.
I also wrote a full post on a fourth, Reports of our Death, but have decided to shelve that post and post a short summary here instead.
That means the current 'not yet covered queue' is as follows:
1. DeepMind's new scaling policy.
1. Should be out tomorrow before I leave, or worst case next week.
2. The AI Summit in Seoul.
3. Further retrospective on OpenAI including Helen Toner's podcast.
Table of Contents
1. Introduction.
2. Table of Contents.
3. Language Models Offer Mundane Utility. You heard of them first.
4. Not Okay, Google. A tiny little problem with the AI Overviews.
5. OK Google, Don't Panic. Swing for the fences. Race for your life.
6. Not Okay, Meta. Your application to opt out of AI data is rejected. What?
7. Not Okay Taking Our Jobs. The question is, with or without replacement?
8. They Took Our Jobs Anyway. It's coming.
9. A New Leaderboard Appears. Scale.ai offers new capability evaluations.
10. Copyright Confrontation. Which OpenAI lawsuit was that again?
11. Deepfaketown and Botpocalypse Soon. Meta fails to make an ordinary effort.
12. Get Involved. Dwarkesh Patel is hiring.
13. Introducing. OpenAI makes media deals with The Atlantic and… Vox? Surprise.
14. In Other AI News. Jan Leike joins Anthropic, Altman signs giving pledge.
15. GPT-5 Alive. They are training it now. A security committee is assembling.
16. Quiet Speculations. Expectations of changes, great and small.
17. Open Versus Closed. Two opposing things cannot dominate the same space.
18. Your Kind of People. Verbal versus math versus otherwise in the AI age.
19. The Quest for Sane Regulation. Lina Khan on the warpath, Yang on the tax path.
20. Lawfare and Liability. How much work can tort law do for us?
21. SB 1047 Unconstitutional, Claims Paper. I believe that the paper is wrong.
22. The Week in Audio. Jeremie & Edouard Harris explain x-risk on Joe Rogan.
23. Rhetorical Innovation. Not everyone believes in GI. I typed what I typed.
24. Abridged Reports of Our Death. A frustrating interaction, virtue of silence.
25. Aligning a Smarter Than Human Intelligence is Difficult. You have to try.
26. People Are Worried About AI Killing Everyone. Yes, it is partly about money.
27. Other People Are Not As Worried About AI Killing Everyone. Assumptions.
28. The Lighter Side. Choose your fighter.
Language Models Offer Mundane Utility
Which model is the best right now? Michael Nielsen is gradually moving back to Claude Opus, and so am I. GPT-4o is fast and has some nice extra features, so when I figure it is 'smart enough' I will use it, but when I care most about quality and can wait a bit I increasingly go to Opus. Gemini I'm reserving for a few niche purposes, when I nee...

Jun 1, 2024 • 14min
LW - Web-surfing tips for strange times by eukaryote
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Web-surfing tips for strange times, published by eukaryote on June 1, 2024 on LessWrong.
[This post is more opinion-heavy and aimlessly self-promoting than feels appropriate for Lesswrong. I wrote it for my site, Eukaryote Writes Blog, to show off that I now have a substack. But it had all these other observations about the state of the internet and advice woven in, and THOSE seemed more at home on Lesswrong, and I'm a busy woman with a lot of pictures of fish to review, so I'm just going to copy it over as posted without laboriously extricating the self-advertisement.
Sorry if it's weird that it's there!]
Eukaryote Writes Blog is now syndicating to Substack. I have no plans for paygating content at this time, and new and old posts will continue to be available at EukaryoteWritesBlog.com. Call this an experiment and a reaching-out. If you're reading this on Substack, hi! Thanks for joining me.
I really don't like paygating. I feel like if I write something, hypothetically it is of benefit to someone somewhere out there, and why should I deny them the joys of reading it?
But like, I get it. You gotta eat and pay rent. I think I have a really starry-eyed view of what the internet sometimes is and what it still truly could be: a collaborative free information utopia.
But here's the thing, a lot of people use Substack and I also like the thing where it really facilitates supporting writers with money. I have a lot of beef with aspects of the corporate world, some of it probably not particularly justified but some of it extremely justified, and mostly it comes down to who gets money for what. I really like an environment where people are volunteering to pay writers for things they like reading. Maybe Substack is the route to that free information web utopia.
Also, I have to eat, and pay rent. So I figure I'll give this a go.
Still, this decision made me realize I have some complicated feelings about the modern internet.
Hey, the internet is getting weird these days
Generative AI
Okay, so there's generative AI, first of all. It's lousy on Facebook and as text in websites and in image search results. It's the next iteration of algorithmic horror and it's only going to get weirder from here on out.
I was doing pretty well on not seeing generic AI-generated images in regular search results for a while, but now they're cropping up, and sneaking (unmarked) onto extremely AI-averse platforms like Tumblr. It used to be that you could look up pictures of aspic that you could throw into GIMP with the aspect logos from Homestuck and you would call it "claspic", which is actually a really good and not bad pun and all of your friends would go "why did you make this image".
And in this image search process you realize you also haven't looked at a lot of pictures of aspic and it's kind of visually different than jello, but now you see some of these are from Craiyon and are generated and you're not sure which ones you've already looked past that are not truly photos of aspic and you're not sure what's real and you're put off of your dumb pun by an increasingly demon-haunted world, not to mention aspic.
(Actually, I've never tried aspic before. Maybe I'll see if I can get one of my friends to make a vegan aspic for my birthday party. I think it could be upsetting and also tasty and informative and that's what I'm about, personally. Have you tried aspic? Tell me what you thought of it.)
Search engines
Speaking of search engines, search engines are worse. Results are worse. The podcast Search Engine (which also covers other topics) has a nice episode saying that this is because of the growing hordes of SEO-gaming low-quality websites and discussing the history of these things, as well as discussing Google's new LLM-generated results.
I don't have much to add - I think there is a lot here,...

May 31, 2024 • 11min
LW - A civilization ran by amateurs by Olli Järviniemi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A civilization ran by amateurs, published by Olli Järviniemi on May 31, 2024 on LessWrong.
I
When I was a child, I remember thinking: Where do houses come from? They are huge! Building one would take forever! Yet there are so many of them!
Having become a boring adult, I no longer have the same blue-eyed wonder about houses, but humanity does have an accomplishment or two I'm still impressed by.
When going to the airport, the metal boulders really stay up in the air without crashing. Usually they leave at the time they told me two weeks earlier, taking me to the right destination at close to the speed of sound.
There are these boxes with buttons that you can press to send information near-instantly anywhere. They are able to perform billions of operations a second. And you can just buy them at a store!
And okay, I admit that big houses - skyscrapers - still light up some of that child-like marvel in me.
II
Some time ago I watched the Eurovision song contest. For those who haven't seen it, it looks something like this:
It's a big contest, and the whole physical infrastructure - huge hall, the stage, stage effects, massive LED walls, camera work - is quite impressive. But there's an objectively less impressive thing I want to focus on here: the hosts.
I basically couldn't notice the hosts making any errors. They articulate themselves clearly, they don't stutter or stumble on their words, their gestures and facial expressions are just what they are supposed to be, they pause their speech at the right moments for the right lengths, they could fluently speak some non-English languages as well, ...
And, sure, this is not one-in-a-billion talent - there are plenty of competent hosts in all kinds of shows - but they clearly are professionals and much more competent than your average folk.
(I don't know about you, but when I've given talks to small groups of people, I've started my sentences without knowing how they'll end, talked too fast, stumbled in my speech, and my facial expressions probably haven't been ideal. If the Eurovision hosts get nervous when talking to a hundred million people, it doesn't show up.)
III
I think many modern big-budget movies are pretty darn good.
I'm particularly thinking of Oppenheimer and the Dune series here (don't judge my movie taste), but the point is more general. The production quality of big movies is extremely high. Like, you really see that these are not amateur projects filmed in someone's backyard, but there's an actual effort to make a good movie.
There's, of course, a written script that the actors follow. This script has been produced by one or multiple people who have previously demonstrated their competence. The actors are professionals who, too, have been selected for competence. If they screw up, someone tells them. A scene is shot again until they get it right. The actors practice so that they can get it right. The movie is, obviously, filmed scene-by-scene. There are the cuts and sounds and lighting.
Editing is used to fix some errors - or maybe even to basically create the whole scene. Movie-making technology improves and the new technology is used in practice, and the whole process builds on several decades of experience.
Imagine an alternative universe where this is not how movies were made. There is no script, but rather the actors improvise from a rough sketch - and by "actors" I don't mean competent Eurovision-grade hosts, I mean average folk paid to be filmed. No one really gives them feedback on how they are doing, nor do they really "practice" acting on top of simply doing their job. The whole movie is shot in one big session with no cuts or editing.
People don't really use new technology for movies, but instead stick to mid-to-late-1900s era cameras and techniques. Overall movies look largely the same as they have...

May 30, 2024 • 21min
LW - OpenAI: Helen Toner Speaks by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI: Helen Toner Speaks, published by Zvi on May 30, 2024 on LessWrong.
Helen Toner went on the TED AI podcast, giving us more color on what happened at OpenAI. These are important claims to get right.
I will start with my notes on the podcast, including the second part where she speaks about regulation in general. Then I will discuss some implications more broadly.
Notes on Helen Toner's TED AI Show Podcast
This seems like it deserves the standard detailed podcast treatment. By default each note's main body is description, any second-level notes are me.
1. (0:00) Introduction. The host talks about OpenAI's transition from non-profit research organization to de facto for-profit company. He highlights the transition from 'open' AI to closed as indicative of the problem, whereas I see this as the biggest thing they got right.
He also notes that he was left with the (I would add largely deliberately created and amplified by enemy action) impression that Helen Toner was some kind of anti-tech crusader, whereas he now understands that this was about governance and misaligned incentives.
2. (5:00) Interview begins and he dives right in and asks about the firing of Altman. She dives right in, explaining that OpenAI was a weird company with a weird structure, and a non-profit board supposed to keep the company on mission over profits.
3. (5:20) Helen says for years Altman had made the board's job difficult via withholding information, misrepresenting things happening at the company, and 'in some cases outright lying to the board.'
4. (5:45) Helen says she can't share all the examples of lying or withholding information, but to give a sense: the board was not informed about ChatGPT in advance and learned about ChatGPT on Twitter; Altman failed to inform the board that he owned the OpenAI startup fund despite claiming to be an independent board member; he gave false information about the company's formal safety processes on multiple occasions; and, relating to her research paper, Altman in the paper's wake started lying to other board members in order to push Toner off the board.
1. I will say it again. If the accusation about Altman lying to the board in order to change the composition of the board is true, then in my view the board absolutely needed to fire Altman. Period. End of story. You have one job.
2. As a contrasting view, the LLMs I consulted thought that firing the CEO should be considered, but it was plausible this could be dealt with via a reprimand combined with changes in company policy.
3. I asked for clarification given the way it was worded in the podcast, and can confirm that Altman withheld information from the board regarding the startup fund and the launch of ChatGPT, but he did not lie about those.
4. Repeatedly outright lying about safety practices seems like a very big deal?
5. It sure sounds like Altman had a financial interest in OpenAI via the startup fund, which means he was not an independent board member, and that the company's board was not majority independent despite OpenAI claiming that it was. That is… not good, even if the rest of the board knew.
5. (7:25) Toner says that for any given incident Altman could give an explanation, but the cumulative weight meant they could not trust Altman. And they'd been considering firing Altman for over a month.
1. If they were discussing firing Altman for at least a month, that raises questions about why they weren't better prepared, or why they timed the firing so poorly given the tender offer.
6. (8:00) Toner says that Altman was the board's main conduit of information about the company. They had been trying to improve processes going into the fall, these issues had been long standing.
7. (8:40) Then in October two executives went to the board and said they couldn't trust Altman, that the atmospher...

May 30, 2024 • 6min
LW - Non-Disparagement Canaries for OpenAI by aysja
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Non-Disparagement Canaries for OpenAI, published by aysja on May 30, 2024 on LessWrong.
Since at least 2017, OpenAI has asked departing employees to sign offboarding agreements which legally bind them to permanently - that is, for the rest of their lives - refrain from criticizing OpenAI, or from otherwise taking any actions which might damage its finances or reputation.[1]
If they refused to sign, OpenAI threatened to take back (or make unsellable) all of their already-vested equity - a huge portion of their overall compensation, which often amounted to millions of dollars. Given this immense pressure, it seems likely that most employees signed.
If they did sign, they became personally liable forevermore for any financial or reputational harm they later caused. This liability was unbounded, so had the potential to be financially ruinous - if, say, they later wrote a blog post critical of OpenAI, they might in principle be found liable for damages far in excess of their net worth.
These extreme provisions allowed OpenAI to systematically silence criticism from its former employees, of which there are now hundreds working throughout the tech industry. And since the agreement also prevented signatories from even disclosing that they had signed this agreement, their silence was easy to misinterpret as evidence that they didn't have notable criticisms to voice.
We were curious about who may have been silenced in this way, and where they work now, so we assembled an (incomplete) list of former OpenAI employees.[2] From what we were able to find, it appears that over 500 people may have signed these agreements, of which only 3 have publicly reported being released so far.[3]
We were especially alarmed to notice that the list contains at least 12 former employees currently working on AI policy, and 6 working on safety evaluations.[4] This includes some in leadership positions, for example:
Beth Barnes (Head of Research, METR)
Bilva Chandra (Senior AI Policy Advisor, NIST)
Charlotte Stix (Head of Governance, Apollo Research)
Chris Painter (Head of Policy, METR)
Geoffrey Irving (Research Director, AI Safety Institute)
Jack Clark (Co-Founder [focused on policy and evals], Anthropic)
Jade Leung (CTO, AI Safety Institute)
Paul Christiano (Head of Safety, AI Safety Institute)
Remco Zwetsloot (Executive Director, Horizon Institute for Public Service)
In our view, it seems hard to trust that people could effectively evaluate or regulate AI, while under strict legal obligation to avoid sharing critical evaluations of a top AI lab, or from taking any other actions which might make the company less valuable (as many regulations presumably would). So if any of these people are not subject to these agreements, we encourage them to mention this in public.
It is rare for company offboarding agreements to contain provisions this extreme - especially those which prevent people from even disclosing that the agreement itself exists. But such provisions are relatively common in the American intelligence industry. The NSA periodically forces telecommunications providers to reveal their clients' data, for example, and when they do the providers are typically prohibited from disclosing that this ever happened.
In response, some companies began listing warrant canaries on their websites - sentences stating that they had never yet been forced to reveal any client data. If at some point they did receive such a warrant, they could then remove the canary without violating their legal non-disclosure obligation, thereby allowing the public to gain indirect evidence about this otherwise-invisible surveillance.
Until recently, OpenAI succeeded at preventing hundreds of its former employees from ever being able to criticize them, and prevented most others - including many of their current employees! - from...