

The Nonlinear Library: LessWrong
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 22, 2024 • 27min
LW - On the CrowdStrike Incident by Zvi
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the CrowdStrike Incident, published by Zvi on July 22, 2024 on LessWrong.
Things went very wrong on Friday.
A bugged CrowdStrike update temporarily bricked quite a lot of computers, bringing down such fun things as airlines, hospitals and 911 services.
It was serious out there.
Ryan Peterson: Crowdstrike outage has forced Starbucks to start writing your name on a cup in marker again and I like it.
What (Technically) Happened
My understanding is that it was a rather stupid bug: a NULL pointer dereference, from the memory-unsafe C++ language.
Zack Vorhies: Memory in your computer is laid out as one giant array of numbers. We represent these numbers here as hexadecimal, which is base 16, because it's easier to work with… for reasons.
The problem area? The computer tried to read memory address 0x9c (aka 156).
Why is this bad?
This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS.
So why is the program trying to read from memory address 0x9c? Well, because… programmer error.
It turns out that C++, the language CrowdStrike is using, likes to use address 0x0 as a special value to mean "there's nothing here" - don't try to access it or you'll die. Reading address 0x9c is what happens when code reads a field at some small offset from a pointer that is actually 0x0.
…
And what's bad about this is that this is a special program called a system driver, which has PRIVILEGED access to the computer. So the operating system is forced to, out of an abundance of caution, crash immediately.
This is what is causing the blue screen of death. A computer can recover from a crash in non-privileged code by simply terminating the program, but not a system driver. When your computer crashes, 95% of the time it's because of a crash in the system drivers.
If the programmer had done a check for NULL, or if they had used modern tooling that checks these sorts of things, it could have been caught. But somehow it made it into production and then got pushed as a forced update by CrowdStrike… OOPS!
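A minimal Python analogue of the bug pattern (the real bug was in C++ kernel-driver code; everything below is made up for illustration): code uses a "there's nothing here" value without checking for it first. In user space this is a recoverable error, which is exactly what a privileged driver does not get.

```python
class ChannelConfig:
    threshold = 42          # some field the code wants to read

def lookup_config(table: dict, key: str):
    return table.get(key)   # returns None ("there's nothing here") on a miss

config = lookup_config({}, "missing_entry")   # hypothetical missing entry

# The bug pattern: use the result without checking it first.
try:
    print(config.threshold)
except AttributeError as err:
    print("user-space code can recover:", err)

# The fix the post describes: check for the "nothing here" value up front.
if config is not None:
    print(config.threshold)
```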
Here is another technical breakdown.
A non-technical breakdown would be:
1. CrowdStrike is set up to run whenever you start the computer.
2. Then someone pushed an update to a ton of computers.
3. Which is something CrowdStrike was authorized to do.
4. The update contained a stupid bug that would have been caught if those involved had used standard practices and tests.
5. With the bug, it tries to access memory in a way that causes a crash.
6. Which also crashes the computer.
7. So you have to do a manual fix to each computer to get around this.
8. If this had been malicious it could probably have permawiped all the computers, or inserted Trojans, or other neat stuff like that.
9. So we dodged a bullet.
10. Also, your AI safety plan needs to take into account that this was the level of security mindset and caution at CrowdStrike, despite CrowdStrike having this level of access and being explicitly in the security mindset business, and that they were given this level of access to billions of computers, and that their stock was only down 11% on the day so they probably keep most of that access and we aren't going to fine them out of existence either.
Yep.
Who to Blame?
George Kurtz (CEO CrowdStrike): CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.
We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website. We further recommend organizations ensure they're communicating with CrowdStrike representatives through official channels. Our team is fully mobilized to ensure the security and stability of CrowdStrike customers.
Dan Elton: No apology. Many people have...

Jul 21, 2024 • 14min
LW - A simple model of math skill by Alex Altair
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A simple model of math skill, published by Alex Altair on July 21, 2024 on LessWrong.
I've noticed that when trying to understand a math paper, there are a few different ways my skill level can be the blocker. Some of these ways line up with some typical levels of organization in math papers:
Definitions: a formalization of the kind of objects we're even talking about.
Theorems: propositions on what properties are true of these objects.
Proofs: demonstrations that the theorems are true of the objects, using known and accepted previous theorems and methods of inference.
Understanding a piece of math will require understanding each of these things in order. It can be very useful to identify which type of thing I'm stuck on, because the different types can require totally different strategies.
Beyond reading papers, I'm also trying to produce new and useful mathematics. Each of these three levels has another associated skill of generating them. But it seems to me that the generating skills go in the opposite order.
This feels like an elegant mnemonic to me, although of course it's a very simplified model. Treat every statement below as a description of the model, and not a claim about the totality of doing mathematics.
Understanding
Understanding these more or less has to go in the above order, because proofs are of theorems, and theorems are about defined objects. Let's look at each level.
Definitions
You might think that definitions are relatively easy to understand. That's usually true in natural languages; you often already have the concept, and you just don't happen to know that there's already a word for that.
Math definitions are sometimes immediately understandable. Everyone knows what a natural number is, and even the concept of a prime number isn't very hard to understand. I get the impression that in number theory, the proofs are often the hard part, where you have to come up with some very clever techniques to prove theorems that high schoolers can understand (Fermat's last theorem, the Collatz conjecture, the twin primes conjecture).
In contrast, in category theory, the definitions are often hard to understand. (Not because they're complicated per se, but because they're abstract.) Once you understand the definitions, then understanding proofs and theorems can be relatively immediate in category theory.
Sometimes the definitions have an immediate intuitive understanding, and the hard part is understanding exactly how the formal definition is a formalization of your intuition. In a calculus class, you'll spend quite a long time understanding the derivative and integral, even though they're just the slope of the tangent and the area under the curve, respectively.
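For reference, the two formal definitions being gestured at here are the standard ones (stated in my own notation, not quoted from the post):

```latex
% Derivative: the slope of the tangent line to f at x
f'(x) \;=\; \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

% Definite integral: the (signed) area under the curve on [a, b],
% as a limit of Riemann sums
\int_a^b f(x)\,dx \;=\; \lim_{n \to \infty} \sum_{i=1}^{n} f\!\left(a + i\,\frac{b-a}{n}\right)\frac{b-a}{n}
```

The gap the post describes is exactly the work of seeing why these limits formalize "slope of the tangent" and "area under the curve".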
You also might think that definitions were mostly in textbooks, laid down by Euclid or Euler or something. At least in the fields that I'm reading papers from, it seems like most papers have definitions (usually multiple). This is probably especially true for papers that are trying to help form a paradigm. In those cases, the essential purpose of the paper is to propose the definitions as the new paradigm, and the theorems are set forth as arguments that those definitions are useful.
Theorems
Theorems are in some sense the meat of mathematics. They tell you what you can do with the objects you've formalized. If you can't do anything meaty with an object, then you're probably holding the wrong object.
Once you understand the objects of discussion, you have to understand what the theorem statement is even saying. I think this tends to be more immediate, especially because often, all the content has been pushed into the definitions, and the theorem will be a simpler linking statement, like "all As are Bs" or "All As can be decomposed into a B and a C".
For example, the fundamental theorem of calculus...

Jul 21, 2024 • 2min
LW - Why Georgism Lost Its Popularity by Zero Contradictions
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Georgism Lost Its Popularity, published by Zero Contradictions on July 21, 2024 on LessWrong.
Henry George's 1879 book Progress & Poverty was the second best-selling book in the entire world during the 1880s and 1890s, outsold only by the Bible. Nobody knows exactly how many copies it sold during those two decades since nobody was keeping track, but it certainly sold at least several million copies. The Progressive Era is literally named after the book.
Georgism used to have millions of followers, and many of them were very famous people. When Henry George died in 1897 (just a few days before the election for New York City mayor), an estimated 100,000 people attended his funeral.
The mid-20th century labor economist and journalist George Soule wrote that George was "By far the most famous American economic writer," and "author of a book which probably had a larger world-wide circulation than any other work on economics ever written."
Few people know it, but the board game Monopoly and its predecessor The Landlord's Game were actually created to promote the economic theories of Henry George, as noted in the second paragraph of the introduction to the Wikipedia article on Monopoly. The games were intended to show that economies that eliminate rent-seeking are better than ones that don't.
So if Georgism used to have millions of supporters and solid economic reasoning, why did it never catch on and how did it lose its popularity over the past century?
(see the rest of the post in the link)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 21, 2024 • 9min
LW - (Approximately) Deterministic Natural Latents by johnswentworth
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: (Approximately) Deterministic Natural Latents, published by johnswentworth on July 21, 2024 on LessWrong.
Background: Natural Latents: The Math, Natural Latents: The Concepts, Why Care About Natural Latents?, the prototypical semantics use-case. This post does not assume that you've read all of those, or even any of them.
Suppose I roll a biased die 1000 times, and then roll the same biased die another 1000 times. Then...
Mediation: The first 1000 rolls are approximately independent of the second 1000 given the bias (to reasonable precision).
Redundancy: I can estimate the die's bias (to reasonable precision) with high confidence from either the first or second 1000 rolls.
The die's bias is therefore a natural latent, which means it has various nice properties.
Minimality: The bias is the smallest summary of all the information about the first 1000 rolls relevant to the second 1000 (and vice-versa).
Maximality: The bias is the largest piece of information which can be calculated from the first 1000 rolls and also can separately be calculated from the second 1000 rolls.
Any other variable which satisfies the above properties must tell us (approximately) the same information about the die rolls as the bias.
Furthermore, the bias is a(n approximate) deterministic natural latent: the die's bias (to reasonable precision) is approximately determined by[1] the first 1000 die rolls, and also approximately determined by the second 1000 die rolls. That implies one more nice property:
Uniqueness: The bias is the unique-up-to(-approximate)-isomorphism latent which has the above properties, making it a natural Schelling point for communication between agents.
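A quick simulation of the die example (my own sketch, using made-up bias values) makes the redundancy condition concrete: either half of the rolls pins down essentially the same estimate of the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
bias = np.array([0.30, 0.10, 0.10, 0.10, 0.10, 0.30])  # a biased six-sided die

rolls_1 = rng.choice(6, size=1000, p=bias)  # first 1000 rolls
rolls_2 = rng.choice(6, size=1000, p=bias)  # second 1000 rolls

# Redundancy: either batch alone estimates the bias to reasonable precision.
est_1 = np.bincount(rolls_1, minlength=6) / 1000
est_2 = np.bincount(rolls_2, minlength=6) / 1000
print(np.round(est_1, 2))
print(np.round(est_2, 2))

# Mediation: given the bias, the two batches are generated independently,
# so once you know (an estimate of) the bias, one batch tells you
# essentially nothing more about the other.
```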
We've proven all that before, mostly in Natural Latents: The Math (including the addendum added six months after the rest of the post). But it turns out that the math is a lot shorter and simpler, and easily yields better bounds, if we're willing to assume (approximate) determinism up-front. That does lose us some theoretical tools (notably the resampling construction), but it gives a cleaner foundation for our expected typical use cases (like e.g. semantics). The goal of this post is to walk through that math.
Background Tool: Determinism in Diagrams
We're going to use diagrammatic proofs, specifically using Bayes nets. But it's non-obvious how to express (approximate) determinism using Bayes nets, or what rules diagrams follow when determinism is involved, so we'll walk through that first.
This diagram says that Y is (approximately) determined by X:
Intuitively, the literal interpretation of the diagram is: X mediates between Y and Y, i.e. Y itself tells me nothing more about Y once I already know X. That only makes sense if X tells me everything there is to know about Y, i.e. Y is determined by X.
In the approximate case, we express the approximation error of the diagram as a KL-divergence, same as usual:
ϵ := D_KL( P[X=x, Y=y, Y=y'] || P[X=x] P[Y=y|X=x] P[Y=y'|X=x] )
If you get confused later about what it means to have two copies of the same variable in a diagram, go back to that line; that's the definition of the approximation error of the diagram. (One way to view that definition: there are actually two variables Y and Y', but P says that Y and Y' always have the same value.)
That approximation error simplifies:
D_KL( P[X=x, Y=y, Y=y'] || P[X=x] P[Y=y|X=x] P[Y=y'|X=x] )
= D_KL( P[X=x, Y=y] I[y=y'] || P[X=x] P[Y=y|X=x] P[Y=y'|X=x] )
= Σ_{x,y,y'} P[X=x, Y=y] I[y=y'] ( log(P[X=x, Y=y] I[y=y']) − log(P[X=x] P[Y=y|X=x] P[Y=y'|X=x]) )
= Σ_{x,y} P[X=x, Y=y] ( log(P[X=x, Y=y]) − log(P[X=x] P[Y=y|X=x] P[Y=y|X=x]) )
= −Σ_{x,y} P[X=x, Y=y] log(P[Y=y|X=x])
= H(Y|X)
So the diagram says Y is determined by X, and the approximation error of the diagram is the entropy H of Y given X - i.e. the number of bits required on average to specify Y once one already knows X. Very intuitive!
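Here is a small numerical sanity check of that identity (my own sketch, with an arbitrary joint distribution): the KL-divergence defined above comes out equal to H(Y|X).

```python
import numpy as np

P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])            # arbitrary joint P[X=x, Y=y]
P_x = P_xy.sum(axis=1)
P_y_given_x = P_xy / P_x[:, None]

# The diagram's distribution is Q[x, y, y'] = P[x] P[y|x] P[y'|x];
# the true distribution puts all mass on y' = y.
kl = 0.0
for x in range(2):
    for y in range(2):
        p = P_xy[x, y]                                   # mass on (x, y, y'=y)
        q = P_x[x] * P_y_given_x[x, y] * P_y_given_x[x, y]
        kl += p * (np.log(p) - np.log(q))

H_y_given_x = -np.sum(P_xy * np.log(P_y_given_x))
print(kl, H_y_given_x)                   # the two numbers agree
```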
The Dangly Bit Lemma
Intuitiv...

Jul 19, 2024 • 28min
LW - Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions by Lidor Banuel Dabbah
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions, published by Lidor Banuel Dabbah on July 19, 2024 on LessWrong.
Tl;dr: In this post we present the exploratory phase of a project aiming to study neural networks by applying static LLC estimation to specific alterations of them. We introduce a new method named Feature Targeted (FT) LLC estimation and study its ability to distinguish SAE trained features from random directions. By comparing our method to other possible metrics, we demonstrate that it outperforms all of them but one, which has comparable performance.
We discuss possible explanations to our results, our project and other future directions.
Introduction
Given a neural network M and a latent layer within it, L, a central motif in current mechanistic interpretability research is to find functions f: L → R [1] which are features of the model. Features are (generally) expected to exhibit the following properties:
1. Encode interpretable properties of the input.
2. Be causally relevant to the computation of the output of the model.
3. Encode the output of a certain submodule of our model M, i.e. a component, localized in weight space, which is responsible for a specific part of the total computation.
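For concreteness, here is a minimal sketch of what such a function f: L → R often looks like when it comes from a sparse autoencoder: an affine read-off of the latent activation followed by a ReLU. The direction, bias, and dimensions below are made up for illustration; this is not the post's construction.

```python
import numpy as np

d_model = 512
latent = np.random.randn(d_model)            # an activation vector from layer L

# Hypothetical SAE-style feature: direction w and bias b learned by the SAE.
w = np.random.randn(d_model) / np.sqrt(d_model)
b = -0.1

def feature(x: np.ndarray) -> float:
    """f: L -> R, the feature's activation on a latent vector x."""
    return float(max(0.0, w @ x + b))        # ReLU(w·x + b)

print(feature(latent))
```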
While this is common wisdom, methods for automated feature evaluation usually focus on correlations between the (top) activations of the feature and human- (or machine-) recognizable interpretations, or on the effect of feature-related interventions on the output of the model.
In particular, while the first and second items of the feature characterization above are central in current techniques, the third property, specifically the localized nature of the computation upstream of the feature, is less so[2].
We are currently investigating a direction which fills that gap, and this post shares the findings of the exploratory research we have conducted to validate and inform our approach. More specifically, we operationalized the concept of "weight-localized computation" using the local learning coefficient (LLC) introduced in Lau et al, following the learning coefficient first introduced in the context of singular learning theory.
We apply LLC estimation to models associated with our base model and a feature within it, a method we call feature targeted (FT) LLC estimation. In this exploratory work we study FT-LLC estimates of specific models associated with SAE features. Most notably, we have found that:
1. FT-LLC estimates of SAE features are, on average, distinguishably higher than those of random directions.
2. For a particular variant of FT-LLC estimation, which we named the functional FT-LLC (defined in this section), this separation is pronounced enough that the vast majority of SAE features we studied are clearly separated from the random directions we studied. Furthermore, most baseline metrics we compared it to (see here) are less capable of distinguishing SAE features from random directions, with only one performing on par with it.
Section 1 introduces the main technique we study in this post, FT-LLC estimation, and section 2 outlines our motivations. Section 3 describes the details of our experimental setting, our results, and the comparison to baseline metrics. In section 4 we discuss our overall takes, how they fit within our general agenda and gaps we currently have in theoretically understanding them.
Section 5 is devoted to outlining our next steps, the general direction of the project, and some other possible directions for further research. Lastly, we briefly discuss related work in section 6.
What is FT-LLC?
LLC estimation
We start out by briefly recalling what the local learning coefficient (LLC) is. If you are unfamiliar with the term, we recommend reading this, the longer sequence here, or the paper on LLC estimation ...

Jul 19, 2024 • 10min
LW - How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation") by Ruby
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation"), published by Ruby on July 19, 2024 on LessWrong.
AI Alignment is my motivating context but this could apply elsewhere too.
The nascent field of AI Alignment research is pretty happening these days. There are multiple orgs and dozens to low hundreds of full-time researchers pursuing approaches to ensure AI goes well for humanity. Many are heartened that there's at least some good research happening, at least in the opinion of some of the good researchers. This is reason for hope, I have heard.
But how do we know whether or not we have produced "good research?"
I think there are two main routes to determining that research is good, and yet only one applies in the research field of aligning superintelligent AIs.
"It's good because it works"
The first and better way to know that your research is good is that it allows you to accomplish some goal you care about[1]. Examples:
My work on efficient orbital mechanics calculation is good because it successfully lets me predict the trajectory of satellites.
My work on the disruption of cell signaling in malign tumors is good because it helped me develop successful anti-cancer vaccines.
My work on solid-state physics is good because it allowed me to produce superconductors at a higher temperature and lower pressure than previously attained.[2]
In each case, there's some outcome I care about pretty inherently for itself, and if the research helps me attain that outcome it's good (or conversely if it doesn't, it's bad). The good researchers in my field are those who have produced a bunch of good research towards the aims of the field.
Sometimes it's not clear-cut. Perhaps I figured out some specific cell signaling pathways that will be useful if it turns out that cell signaling disruption in general is useful, and that's TBD on therapies currently being trialed and we might not know how good (i.e. useful) my research was for many more years. This actually takes us into what I think is the second meaning of "good research".
"It's good because we all agree it's good"
If our goal is successfully navigating the creation of superintelligent AI in a way such that humans are happy with the outcome, then it is too early to properly score existing research on how helpful it will be. No one has aligned a superintelligence. No one's research has contributed to the alignment of an actual superintelligence.
At this point, the best we can do is share our predictions about how useful research will turn out to be. "This is good research" = "I think this research will turn out to be helpful". "That person is a good researcher" = "That person produces much research that will turn out to be useful and/or has good models and predictions of which research will turn out to help".
To talk about the good research that's being produced is simply to say that we have a bunch of shared predictions that there exists research that will eventually help. To speak of the "good researchers" is to speak of the people whose work lots of people agree is likely helpful and whose opinions are likely correct.
Someone might object that there's empirical research that we can see yielding results in terms of interpretability/steering or demonstrating deception-like behavior and similar. While you can observe an outcome there, that's not the outcome we really care about (aligning superintelligent AI), and the relevance of this work is still just prediction. It's like being successful at certain kinds of cell signaling modeling before we're confident that's a useful approach.
More like "good" = "our community pagerank Eigen-evaluation of research rates this research highly"
It's a little bit interesting to unpack "agreeing that some research is good". Obviously, not everyone's opinion matters ...
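To make the "eigen-evaluation" framing concrete, here is a toy sketch (my own, with made-up numbers): researchers endorse one another's judgment, and a PageRank-style power iteration turns those endorsements into scores, so standing comes from being rated highly by people who are themselves rated highly.

```python
import numpy as np

# Hypothetical endorsement matrix: entry (i, j) is how much researcher i
# endorses researcher j's judgment. Numbers are made up for illustration.
E = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.0]])

# Column-normalize so each researcher distributes one unit of endorsement.
M = E / E.sum(axis=0, keepdims=True)

scores = np.ones(3) / 3
for _ in range(100):        # power iteration toward the principal eigenvector
    scores = M @ scores
    scores /= scores.sum()

print(np.round(scores, 3))  # "eigen-evaluated" reputations
```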

Jul 19, 2024 • 2min
LW - Linkpost: Surely you can be serious by kave
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Linkpost: Surely you can be serious, published by kave on July 19, 2024 on LessWrong.
Adam Mastroianni writes about "actually caring about stuff, and for the right reasons", rather than just LARPing. The opening is excerpted below.
I once saw someone give a talk about a tiny intervention that caused a gigantic effect, something like, "We gave high school seniors a hearty slap on the back and then they scored 500 points higher on the SAT."
Everyone in the audience was like, "Hmm, interesting, I wonder if there were any gender effects, etc."
I wanted to get up and yell: "EITHER THIS IS THE MOST POTENT PSYCHOLOGICAL INTERVENTION EVER, OR THIS STUDY IS TOTAL BULLSHIT."
If those results are real, we should start a nationwide backslapping campaign immediately. We should be backslapping astronauts before their rocket launches and Olympians before their floor routines. We should be running followup studies to see just how many SAT points we can get - does a second slap get you another 500? Or just another 250? Can you slap someone raw and turn them into a genius?
Or - much more likely - the results are not real, and we should either be a) helping this person understand where they screwed up in their methods and data analysis, or b) kicking them out for fraud.
Those are the options. Asking a bunch of softball questions ("Which result was your favorite?") is not a reasonable response. That's like watching someone pull a rabbit out of a hat actually for real, not a magic trick, and then asking them, "What's the rabbit's name?"
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 18, 2024 • 1h 23min
LW - AI #73: Openly Evil AI by Zvi
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #73: Openly Evil AI, published by Zvi on July 18, 2024 on LessWrong.
What do you call a clause explicitly saying that you waive the right to whistleblower compensation, and that you need to get permission before sharing information with government regulators like the SEC?
I have many answers.
I also know that OpenAI, having f***ed around, seems poised to find out, because that is the claim made by whistleblowers to the SEC. Given the SEC fines you for merely not making an explicit exception to your NDA for whistleblowers, what will they do once aware of explicit clauses going the other way?
(Unless, of course, the complaint is factually wrong, but that seems unlikely.)
We also have rather a lot of tech people coming out in support of Trump. I go into the reasons why, which I do think is worth considering. There is a mix of explanations, and at least one very good reason.
Then I also got suckered into responding to a few new (well, not really new, but renewed) disingenuous attacks on SB 1047. The entire strategy is to be loud and hyperbolic, especially on Twitter, and either hallucinate or fabricate a different bill with different consequences to attack, or simply misrepresent how the law works, then use that to create the illusion the bill is unliked or harmful.
Few others respond to correct such claims, and I constantly worry that the strategy might actually work. But that does not mean you, my reader who already knows, need to read all that.
Also a bunch of fun smaller developments. Karpathy is in the AI education business.
Table of Contents
1. Introduction.
2. Table of Contents.
3. Language Models Offer Mundane Utility. Fight the insurance company.
4. Language Models Don't Offer Mundane Utility. Have you tried using it?
5. Clauding Along. Not that many people are switching over.
6. Fun With Image Generation. Amazon Music and K-Pop start to embrace AI.
7. Deepfaketown and Botpocalypse Soon. FoxVox, turn Fox into Vox or Vox into Fox.
8. They Took Our Jobs. Take away one haggling job, create another haggling job.
9. Get Involved. OpenPhil request for proposals. Job openings elsewhere.
10. Introducing. Karpathy goes into AI education.
11. In Other AI News. OpenAI's Q* is now named Strawberry. Is it happening?
12. Denying the Future. Projects of the future that think AI will never improve again.
13. Quiet Speculations. How to think about stages of AI capabilities.
14. The Quest for Sane Regulations. EU, UK, The Public.
15. The Other Quest Regarding Regulations. Many in tech embrace The Donald.
16. SB 1047 Opposition Watch (1). I'm sorry. You don't have to read this.
17. SB 1047 Opposition Watch (2). I'm sorry. You don't have to read this.
18. Open Weights are Unsafe and Nothing Can Fix This. What to do about it?
19. The Week in Audio. Joe Rogan talked to Sam Altman and I'd missed it.
20. Rhetorical Innovation. Supervillains, oh no.
21. Oh Anthropic. More details available, things not as bad as they look.
22. Openly Evil AI. Other things, in other places, on the other hand, look worse.
23. Aligning a Smarter Than Human Intelligence is Difficult. Noble attempts.
24. People Are Worried About AI Killing Everyone. Scott Adams? Kind of?
25. Other People Are Not As Worried About AI Killing Everyone. All glory to it.
26. The Lighter Side. A different kind of mental gymnastics.
Language Models Offer Mundane Utility
Let Claude write your prompts for you. He suggests using the Claude prompt improver.
Sully: convinced that we are all really bad at writing prompts
I'm personally never writing prompts by hand again
Claude is just too good - managed to feed it evals and it just optimized for me
Probably a crude version of dspy but insane how much prompting can make a difference.
Predict who will be the shooting victim. A machine learning model did this for citizens of Chicago (a ...

Jul 18, 2024 • 40min
LW - Individually incentivized safe Pareto improvements in open-source bargaining by Nicolas Macé
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Individually incentivized safe Pareto improvements in open-source bargaining, published by Nicolas Macé on July 18, 2024 on LessWrong.
Summary
Agents might fail to peacefully trade in high-stakes negotiations. Such bargaining failures can have catastrophic consequences, including great power conflicts, and AI flash wars. This post is a distillation of DiGiovanni et al. (2024) (DCM), whose central result is that agents that are sufficiently transparent to each other have individual incentives to avoid catastrophic bargaining failures.
More precisely, DCM constructs strategies that are plausibly individually incentivized, and, if adopted by all, guarantee each player no less than their least preferred trade outcome. Figure 0 below illustrates this.
This result is significant because artificial general intelligences (AGIs) might (i) be involved in high-stakes negotiations, (ii) be designed with the capabilities required for the type of strategy we'll present, and (iii) bargain poorly by default (since bargaining competence isn't necessarily a direct corollary of intelligence-relevant capabilities).
Introduction
Early AGIs might fail to make compatible demands with each other in high-stakes negotiations (we call this a "bargaining failure"). Bargaining failures can have catastrophic consequences, including great power conflicts, or AI triggering a flash war. More generally, a "bargaining problem" is when multiple agents need to determine how to divide value among themselves.
Early AGIs might possess insufficient bargaining skills because intelligence-relevant capabilities don't necessarily imply these skills: For instance, being skilled at avoiding bargaining failures might not be necessary for taking over. Another problem is that there might be no single rational way to act in a given multi-agent interaction. Even arbitrarily capable agents might have different priors, or different approaches to reasoning under bounded computation.
Therefore they might fail to solve equilibrium selection, i.e., make incompatible demands (see Stastny et al. (2021) and Conitzer & Oesterheld (2023)). What, then, are sufficient conditions for agents to avoid catastrophic bargaining failures?
Sufficiently advanced AIs might be able to verify each other's decision algorithms (e.g. via verifying source code), as studied in open-source game theory. This has both potential downsides and upsides for bargaining problems. On one hand, transparency of decision algorithms might make aggressive commitments more credible and thus more attractive (see Sec. 5.2 of Dafoe et al. (2020) for discussion).
On the other hand, agents might be able to mitigate bargaining failures by verifying cooperative commitments.
Oesterheld & Conitzer (2022)'s safe Pareto improvements[1] (SPI) leverages transparency to reduce the downsides of incompatible commitments.
In an SPI, agents conditionally commit to change how they play a game relative to some default such that everyone is (weakly) better off than the default with certainty.[2] For example, two parties A and B who would otherwise go to war over some territory might commit to, instead, accept the outcome of a lottery that allocates the territory to A with the probability that A would have won the war (assuming this probability is common knowledge). See also our extended example below.
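A toy quantification of that lottery example (my own numbers, purely illustrative, and distinct from the post's extended example): war destroys value for both sides, while the lottery yields the same expected territorial split without the destruction, so both parties are better off in expectation under the SPI.

```python
# Toy model of the territory dispute. All numbers are hypothetical.
territory_value = 100.0   # value of the territory to whoever ends up with it
war_cost = 30.0           # value each side burns by fighting
p_A_wins = 0.6            # common-knowledge probability that A wins a war

# Default: go to war.
war_payoff_A = p_A_wins * territory_value - war_cost        # 30.0
war_payoff_B = (1 - p_A_wins) * territory_value - war_cost  # 10.0

# SPI: a lottery that gives A the territory with probability p_A_wins.
lottery_payoff_A = p_A_wins * territory_value               # 60.0
lottery_payoff_B = (1 - p_A_wins) * territory_value         # 40.0

print("war:    ", war_payoff_A, war_payoff_B)
print("lottery:", lottery_payoff_A, lottery_payoff_B)
```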
Oesterheld & Conitzer (2022) has two important limitations: First, many different SPIs are in general possible, such that there is an "SPI selection problem", similar to the equilibrium selection problem in game theory (Sec. 6 of Oesterheld & Conitzer (2022)).
And if players don't coordinate on which SPI to implement, they might fail to avoid conflict.[3] Second, if expected utility-maximizing agents need to individually adopt strategies to implement an SPI, it's unclear what conditions...

Jul 18, 2024 • 25min
LW - Mech Interp Lacks Good Paradigms by Daniel Tan
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Lacks Good Paradigms, published by Daniel Tan on July 18, 2024 on LessWrong.
Note: I wrote this post rather quickly as an exercise in sharing rough / unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work / perspectives, please feel free!
Note 2: I originally sent this link to some people for feedback, but I was having trouble viewing the comments on the draft. The post was also in a reasonably complete state, so I decided to just publish it - and now I can see the comments! If you're one of those people, feedback is still very much welcome!
Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible, requiring comparatively few computational resources, and facilitates rapid learning, due to a very short feedback loop. This means that many junior researchers' first foray into AI safety research is in MI (myself included); indeed, this occurs to the extent that some people feel MI is over-subscribed relative to other technical agendas.
However, how useful is this MI research?
A very common claim on MI's theory of impact (ToI) is that MI helps us advance towards a "grand unifying theory" (GUT) of deep learning. One of my big cruxes for this ToI is whether MI admits "paradigms" which facilitate correct thinking and understanding of the models we aim to interpret.
In this post, I'll critically examine several leading candidates for "paradigms" in MI, consider the available evidence for / against, and identify good future research directions (IMO). At the end, I'll conclude with a summary of the main points and an overview of the technical research items I've outlined.
Towards a Grand Unifying Theory (GUT) with MI
Proponents of this argument believe that, by improving our basic understanding of neural nets, MI yields valuable insights that can be used to improve our agents, e.g. by improving architectures or by improving their training processes. This allows us to make sure future models are safe and aligned.
Some people who have espoused this opinion:
Richard Ngo has argued here that MI enables "big breakthroughs" towards a "principled understanding" of deep learning.
Rohin Shah has argued here that MI builds "new affordances" for alignment methods.
Evan Hubinger has argued for MI here because it helps us identify "unknown unknowns".
Leo Gao argues here that MI aids in "conceptual research" and "gets many bits" per experiment.
As a concrete example of work that I think would not have been possible without fundamental insights from MI: steering vectors, a.k.a. representation engineering, and circuit breakers, which were obviously inspired by the wealth of work in MI demonstrating the linear representation hypothesis.
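As a rough illustration of what an MI-inspired intervention like activation steering looks like in code (a minimal sketch; the shapes, names, and the random "direction" below are made up - in practice the direction would be extracted from model activations):

```python
import numpy as np

d_model = 768
hidden = np.random.randn(10, d_model)   # hypothetical residual-stream activations for 10 tokens

# Hypothetical steering direction; in practice it might be extracted as a
# difference of mean activations between two contrasting prompt sets.
steering_vector = np.random.randn(d_model)
steering_vector /= np.linalg.norm(steering_vector)

def steer(activations: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled steering direction to every token's activation."""
    return activations + alpha * direction

steered = steer(hidden, steering_vector, alpha=4.0)
print(steered.shape)   # (10, 768)
```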
It's also important to remember that the value of fundamental science often seems much lower in hindsight, because humans quickly adjust their perspectives. Even if MI insights seem like common sense to us nowadays, their value in enabling significant advances can't be overstated.
(Aside) A corollary of this argument is that MI could likely have significant capabilities externalities. Becoming better at building powerful and instruction-aligned agents may inadvertently accelerate us towards AGI. This point has been made in depth elsewhere, so I won't elaborate further here.
A GUT Needs Paradigms
Paradigm - an overarching framework for thinking about a field
In his seminal book, The Structure of Scientific Revolutions, Thomas Kuhn catalogues scientific progress in many different fields (spanning physics, chemistry, biology), and distills general trends about how these fields progress. Central to his analysis is the notion of a "paradigm" - an overarching framework for th...