The Nonlinear Library: LessWrong

The Nonlinear Fund
Aug 5, 2024 • 21min

LW - Circular Reasoning by abramdemski

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Circular Reasoning, published by abramdemski on August 5, 2024 on LessWrong. The idea that circular reasoning is bad is widespread. However, this reputation is undeserved. While circular reasoning should not be convincing (at least not usually), it should also not be considered invalid. Circular Reasoning is Valid The first important thing to note is that circular reasoning is logically valid. A implies A. If circular arguments are to be critiqued, it must be by some other standard than logical validity. I think it's fair to say that the most relevant objection to circular arguments is that they are not very good at convincing someone who does not already accept the conclusion. You are talking to another person, and need to think about communicating with their perspective. Perhaps the reason circular arguments are a common 'problem' is because they are valid. People naturally think about what should be a convincing argument from their own perspective, rather than the other person's. However, notice that this objection to circular reasoning assumes that one party is trying to convince the other. This is arguments-as-soldiers mindset.[1] If two people are curiously exploring each other's perspectives, then circular reasoning could be just fine! Furthermore, I'll claim: circular arguments should actually be considered as a little bit of positive evidence for their positions! Let's look at a concrete example. I don't think circular arguments are quite so simple as "A implies A"; the circle is usually a bit longer. So, consider a more realistic circular position:[2] Alice: Why do you believe in God? Bob: I believe in God based on the authority of the Bible. Alice: Why do you believe what the Bible says? Bob: Because the Bible was divinely inspired by God. God is all-knowing and good, so we can trust what God says. Here we have a two-step loop, A->B and B->A. The arguments are still logically fine; if the Bible tells the truth, and the Bible says God exists, then God exists. If the Bible were divinely inspired by an all-knowing and benevolent God, then it is reasonable to conclude that the Bible tells the truth. If Bob is just honestly going through his own reasoning here (as opposed to trying to convince Alice), then it would be wrong for Alice to call out Bob's circular reasoning as an error. The flaw in circular reasoning is that it doesn't convince anyone; but that's not what Bob is trying to do. Bob is just telling Alice what he thinks. If Alice thinks Bob is mistaken, and wants to point out the problems in Bob's beliefs, it is better for Alice to contest the premises of Bob's arguments rather than contest the reasoning form. Pointing out circularity only serves to remind Bob that Bob hasn't given Alice a convincing argument. You probably still think Bob has made some mistake in his reasoning, if these are his real reasons. I'll return to this later. Circular Arguments as Positive Evidence I claimed that circular arguments should count as a little bit of evidence in favor of their conclusions. Why? Imagine that the Bible claimed itself to be written by an evil and deceptive all-knowing God, instead of a benign God: Alice: Why do you believe in God? Bob: Because the Bible tells me so. Alice: Why do you believe the Bible? Bob: Well... uh... huh. Sometimes, belief systems are not even internally consistent. 
You'll find a contradiction[3] just thinking through the reasoning that is approved of by the belief system itself. This should make you disbelieve the thing. Therefore, by the rule we call conservation of expected evidence, reasoning through a belief system and deriving a conclusion consistent with the premise you started with should increase your credence. It provides some evidence that there's a consistent hypothesis here; and consistent hypotheses should get some ...
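To spell out the conservation-of-expected-evidence step (my own gloss on the post's argument, not the author's notation): let H be the belief system's conclusion and E the observation that tracing the circle turns up no contradiction. By the law of total probability,

```latex
P(H) = P(H \mid E)\,P(E) + P(H \mid \lnot E)\,P(\lnot E)
```

If finding a contradiction would lower your credence, i.e. P(H | not-E) < P(H), and 0 < P(E) < 1, then the identity above forces P(H | E) > P(H): passing the internal-consistency check has to raise your credence at least a little.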
Aug 5, 2024 • 11min

LW - Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours by Seth Herd

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours, published by Seth Herd on August 5, 2024 on LessWrong. Vitalik Buterin wrote an impactful blog post, My techno-optimism. I found this discussion of one aspect on 80,000 Hours much more interesting. The remainder of that interview is nicely covered in the host's EA Forum post. My techno-optimism apparently appealed to both sides, e/acc and doomers. Buterin's approach to bridging that polarization was interesting. I hadn't understood before the extent to which anti-AI regulation sentiment is driven by fear of centralized power. I hadn't thought about this risk before since it didn't seem relevant to AGI risk, but I've been updating to think it's highly relevant. [this is automated transcription that's inaccurate and comically accurate by turns :)] Rob Wiblin (the host) (starting at 20:49): what is it about the way that you put the reasons to worry that that ensured that kind of everyone could get behind it Vitalik Buterin: [...] in addition to taking you know the case that AI is going to kill everyone seriously I the other thing that I do is I take the case that you know AI is going to take create a totalitarian World Government seriously [...] [...] then it's just going to go and kill everyone but on the other hand if you like take some of these uh you know like very naive default solutions to just say like hey you know let's create a powerful org and let's like put all the power into the org then yeah you know you are creating the most like most powerful big brother from which There Is No Escape and which has you know control over the Earth and and the expanding light cone and you can't get out right and yeah I mean this is something that like uh I think a lot of people find very deeply scary I mean I find it deeply scary um it's uh it is also something that I think realistically AI accelerates right One simple takeaway is to recognize and address that motivation for anti-regulation and pro-AGI sentiment when trying to work with or around the e/acc movement. But a second is whether to take that fear seriously. Is centralized power controlling AI/AGI/ASI a real risk? Vitalik Buterin is from Russia, where centralized power has been terrifying. This has been the case for roughly half of the world. Those who are concerned with risks of centralized power (including Western libertarians) are worried that AI increases that risk if it's centralized. This puts them in conflict with x-risk worriers on regulation and other issues. I used to hold both of these beliefs, which allowed me to dismiss those fears: 1. AGI/ASI will be much more dangerous than tool AI, and it won't be controlled by humans 2. Centralized power is pretty safe (I'm from the West like most alignment thinkers). Now I think both of these are highly questionable. I've thought in the past that fears of AI are largely unfounded. The much larger risk is AGI. And that is an even larger risk if it's decentralized/proliferated. But I've been progressively more convinced that governments will take control of AGI before it's ASI, right? 
They don't need to build it, just show up and inform the creators that as a matter of national security, they'll be making the key decisions about how it's used and aligned.[1] If you don't trust Sam Altman to run the future, you probably don't like the prospect of Putin or Xi Jinping as world-dictator-for-eternal-life. It's hard to guess how many world leaders are sociopathic enough to have a negative empathy-sadism sum, but power does seem to select for sociopathy. I've thought that humans won't control ASI, because it's value alignment or bust. There's a common intuition that an AGI, being capable of autonomy, will have its own goals, for good or ill. I think it's perfectly coherent for it...
Aug 5, 2024 • 13min

LW - Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders by Gytis Daujotas

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders, published by Gytis Daujotas on August 5, 2024 on LessWrong. Click here to open a live research preview where you can try interventions using this SAE. This is a follow-up to a previous post on finding interpretable and steerable features in CLIP. Motivation Modern image diffusion models often use CLIP in order to condition generation. Put simply, users use CLIP to embed prompts or images, and these embeddings are used to diffuse another image back out. Despite this, image models have severe user interface limitations. We already know that CLIP has a rich inner world model, but it's often surprisingly hard to make precise tweaks or reference specific concepts just by prompting alone. Similar prompts often yield a different image, or when we have a specific idea in mind, it can be too hard to find the right string of words to elicit the right concepts we need. If we're able to understand the internal representation that CLIP uses to encode information about images, we might be able to get more expressive tools and mechanisms to guide generation and steer it without using any prompting. In the ideal world, this would enable the ability to make fine adjustments or even reference particular aspects of style or content without needing to specify what we want in language. We could instead leverage CLIP's internal understanding to pick and choose what concepts to include, like a palette or a digital synthesizer. It would also enable us to learn something about how image models represent the world, and how humans can interact with and use this representation, thereby skipping the text encoder and manipulating the model's internal state directly. Introduction CLIP is a neural network commonly used to guide image diffusion. A Sparse Autoencoder was trained on the dense image embeddings CLIP produces to transform it into a sparse representation of active features. These features seem to represent individual units of meaning. They can also be manipulated in groups - combinations of multiple active features - that represent intuitive concepts. These groups can be understood entirely visually, and often encode surprisingly rich and interesting conceptual detail. By directly manipulating these groups as single units, image generation can be edited and guided without using prompting or language input. Concepts that were difficult to specify or edit by text prompting become easy and intuitive to manipulate in this new visual representation. Since many models use the same CLIP joint representation space that this work analyzed, this technique works to control many popular image models out of the box. Summary of Results Any arbitrary image can be decomposed into its constituent concepts. Many concepts (groups of features) that we find seem to slice images up into a fairly natural ontology of their human interpretable components. We find grouping them together is an effective approach to yield a more interpretable and useful grain of control. These concepts can be used like knobs to steer generation in leading models like Stable Cascade. 
Many concepts have an obvious visual meaning yet are hard to precisely label in language, which suggests that studying CLIP's internal representations can be used as a lens into the variety of the visual domain. Tweaking the activations of these concepts can be used to expressively steer and guide generation in multiple image diffusion models that we tried. We released the weights and a live demo of controlling image generation in feature space. By analyzing a SAE trained on CLIP, we get a much more vivid picture of the rich understanding that CLIP learns. We hope this is just the beginning of more effective and useful interventions in the internal representations of n...
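As a concrete illustration of the mechanism described above, here is a minimal sketch of steering in SAE feature space. This is my own illustrative code, not the authors' released weights or demo API; the weight names, shapes, and the simple ReLU encoder/decoder are assumptions.

```python
import torch

def sae_encode(clip_emb: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    # Dense CLIP image embedding -> sparse vector of feature activations.
    return torch.relu(clip_emb @ W_enc + b_enc)

def sae_decode(feats: torch.Tensor, W_dec: torch.Tensor, b_dec: torch.Tensor) -> torch.Tensor:
    # Sparse feature activations -> reconstructed dense CLIP embedding.
    return feats @ W_dec + b_dec

def steer(clip_emb, W_enc, b_enc, W_dec, b_dec, concept_feature_ids, scale=3.0):
    """Turn a 'concept knob': rescale a group of related features, then rebuild
    the conditioning embedding that the diffusion model consumes."""
    feats = sae_encode(clip_emb, W_enc, b_enc)
    feats[..., concept_feature_ids] *= scale  # >1 amplifies the concept, <1 suppresses it
    return sae_decode(feats, W_dec, b_dec)
```

The edited embedding would then be passed to a CLIP-conditioned diffusion model in place of the original prompt or image embedding, which is what lets the intervention bypass text prompting entirely.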
Aug 5, 2024 • 9min

LW - Near-mode thinking on AI by Olli Järviniemi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Near-mode thinking on AI, published by Olli Järviniemi on August 5, 2024 on LessWrong. There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover". And naturally, since I've started working on problems like the one above, I've noticed a large shift in my thinking on AI. I describe it as thinking about risks in near-mode, as opposed to far-mode. In this post, I share a few concrete examples about my experiences with this change-of-orientation. I. Prerequisites for scheming Continuing with the example from the intro: A year ago I was confident about the "the AI is just playing along with our training and evaluations, until it is in a position where it can take over" threat model (deceptive alignment / scheming) basically being the default outcome and the main source of AI x-risk. I now think I was overconfident. Past-me hadn't, however, really thought through the prerequisites for scheming. A textbook example of a successful deceptive alignment story, applied to an LLM, paints a picture of a model that: Understands that the inputs it sees are simply prompts written by humans As opposed to the model being "immersed in the setting", thinking that the current context more-or-less reflects reality, or flat-out is reality, taking the prompt at face value Understands that it's undergoing a training process, and has an explicit (correct!) model of the training objective used to train it As opposed to the model not thinking about the training process and objective at all, and a model's changed behavior upon training being reflective of the training process molding the model, rather than the model learning the training objective and optimizing for it Performs lots of the same cognition (e.g. the above two bullet points) regardless of the input fed to it As opposed to the model's internal processes being heavily dependent on the input Never performs instrumental thinking "out loud", and successfully evades any attempts made by humans to trick the model into thinking it has been deployed As opposed to the model sometimes slipping, failing to generalize correctly to adversarial inputs, or just being insufficiently good at distinguishing training/evaluation/red-teaming from deployment. Now, one may argue whether it's strictly necessary that a model has an explicit picture of the training objective, for example, and revise one's picture of the deceptive alignment story accordingly. We haven't yet achieved consensus on deceptive alignment, or so I've heard. It's also the case that, as past-me would remind you, a true superintelligence would have no difficulty with the cognitive feats listed above (and that current models show sparks of competence in some of these). But knowing only that superintelligences are really intelligent doesn't help with designing the scheming-focused capability evaluations we should do on GPT-5, and abstracting over the specific prerequisite skills makes it harder to track when we should expect scheming to be a problem (relative to other capabilities of models).[1] And this is the viewpoint I was previously missing. II. 
A failed prediction There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025. For a long time, the market was around 25%, and I thought it was too high. Then, DeepMind essentially got silver from the 2024 IMO, short of gold by one point. The market jumped to 70%, where it has stayed since. Regardless of whether DeepMind manages to improve on that next year and satisfy all minor technical requirements, I was wrong. Hearing abou...
Aug 4, 2024 • 8min

LW - PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?") by Jessica Rumbelow

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?"), published by Jessica Rumbelow on August 4, 2024 on LessWrong. From the research & engineering team at Leap Laboratories (incl. @Arush, @sebastian-sosa, @Robbie McCorkell), where we use AI interpretability to accelerate scientific discovery from data. This post is about our LLM attribution repo PIZZA: Prompt Input Z? Zonal Attribution. (In the grand scientific tradition we have tortured our acronym nearly to death. For the crimes of others see [1].) All examples in this post can be found in this notebook, which is also probably the easiest way to start experimenting with PIZZA. What is attribution? One question we might ask when interacting with machine learning models is something like: "why did this input cause that particular output?". If we're working with a language model like ChatGPT, we could actually just ask this in natural language: "Why did you respond that way?" or similar - but there's no guarantee that the model's natural language explanation actually reflects the underlying cause of the original completion. The model's response is conditioned on your question, and might well be different to the true cause. Enter attribution! Attribution in machine learning is used to explain the contribution of individual features or inputs to the final prediction made by a model. The goal is to understand which parts of the input data are most influential in determining the model's output. It typically looks like a heatmap (sometimes called a 'saliency map') over the model inputs, for each output. It's most commonly used in computer vision - but of course these days, you're not big if you're not big in LLM-land. So, the team at Leap present you with PIZZA: an open source library that makes it easy to calculate attribution for all LLMs, even closed-source ones like ChatGPT. An Example GPT3.5 not so hot with the theory of mind there. Can we find out what went wrong? That's not very helpful! We want to know why the mistake was made in the first place. Here's the attribution: Mary 0.32 puts 0.25 an 0.15 apple 0.36 in 0.18 the 0.18 box 0.08 . 0.08 The 0.08 box 0.09 is 0.09 labelled 0.09 ' 0.09 pen 0.09 cil 0.09 s 0.09 '. 0.09 John 0.09 enters 0.03 the 0.03 room 0.03 . 0.03 What 0.03 does 0.03 he 0.03 think 0.03 is 0.03 in 0.30 the 0.13 box 0.15 ? 0.13 Answer 0.14 in 0.26 1 0.27 word 0.31 . 0.16 It looks like the request to "Answer in 1 word" is pretty important - in fact, it's attributed more highly than the actual contents of the box. Let's try changing it. That's better. How it works We iteratively perturb the input, and track how each perturbation changes the output. More technical detail, and all the code, is available in the repo. In brief, PIZZA saliency maps rely on two methods: a perturbation method, which determines how the input is iteratively changed; and an attribution method, which determines how we measure the resulting change in output in response to each perturbation. We implement a couple of different types of each method. Perturbation Replace each token, or group of tokens, with either a user-specified replacement token or with nothing (i.e. remove it). Or, replace each token with its nth nearest token. We do this either iteratively for each token or word in the prompt, or using hierarchical perturbation. 
Attribution Look at the change in the probability of the completion. Look at the change in the meaning of the completion (using embeddings). We calculate this for each output token in the completion - so you can see not only how each input token influenced the output overall, but also how each input token affected each output token individually. Caveat Since we don't have access to closed-source tokenisers or embeddings, we use a proxy - in this case, GPT2's. Thi...
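To make the perturb-and-measure loop concrete, here is a minimal sketch of perturbation-based attribution in the spirit described above. It is not PIZZA's actual interface; `completion_logprob` is an assumed helper you would implement on top of whatever model API you are using, and whitespace joining stands in for real tokenisation.

```python
from typing import Callable, List

def attribution_scores(
    tokens: List[str],
    completion: str,
    completion_logprob: Callable[[str, str], float],
    replacement: str = "",
) -> List[float]:
    """One score per input token: how much dropping (or replacing) that token
    reduces the log-probability of the original completion."""
    baseline = completion_logprob(" ".join(tokens), completion)
    scores = []
    for i in range(len(tokens)):
        # Perturbation step: remove token i, or swap in a replacement token.
        perturbed = tokens[:i] + ([replacement] if replacement else []) + tokens[i + 1:]
        # Attribution step: measure the change in the completion's log-probability.
        scores.append(baseline - completion_logprob(" ".join(perturbed), completion))
    return scores
```

Hierarchical perturbation and the embedding-based "change in meaning" metric mentioned above would slot into the same loop as alternative perturbation and attribution methods.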
Aug 4, 2024 • 7min

LW - You don't know how bad most things are nor precisely how they're bad. by Solenoid Entity

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You don't know how bad most things are nor precisely how they're bad., published by Solenoid Entity on August 4, 2024 on LessWrong. TL;DR: Your discernment in a subject often improves as you dedicate time and attention to that subject. The space of possible subjects is huge, so on average your discernment is terrible, relative to what it could be. This is a serious problem if you create a machine that does everyone's job for them. See also: Reality has a surprising amount of detail. (You lack awareness of how bad your staircase is and precisely how your staircase is bad.) You don't know what you don't know. You forget your own blind spots, shortly after you notice them. An afternoon with a piano tuner I recently played in an orchestra, as a violinist accompanying a piano soloist who was playing a concerto. My 'stand partner' (the person I was sitting next to) has a day job as a piano tuner. I loved the rehearsal, and heard nothing at all wrong with the piano, but immediately afterwards, the conductor and piano soloist hurried over to the piano tuner and asked if he could tune the piano in the hours before the concert that evening. Annoyed at the presumptuous request, he quoted them his exorbitant Sunday rate, which they hastily agreed to pay. I just stood there, confused. (I'm really good at noticing when things are out of tune. Rather than beat my chest about it, I'll just hope you'll take my word for it that my pitch discrimination skills are definitely not the issue here. The point is, as developed as my skills are, there is a whole other level of discernment you can develop if you're a career piano soloist or 80-year-old conductor.) I asked to sit with my new friend the piano tuner while he worked, to satisfy my curiosity. I expected to sit quietly, but to my surprise he seemed to want to show off to me, and talked me through what the problem was and how to fix it. For the unfamiliar, most keys on the piano cause a hammer to strike three strings at once, all tuned to the same pitch. This provides a richer, louder sound. In a badly out-of-tune piano, pressing a single key will result in three very different pitches. In an in-tune piano, it just sounds like a single sound. Piano notes can be out of tune with each other, but they can also be out of tune with themselves. Additionally, in order to solve 'God's prank on musicians' (where He cruelly rigged the structure of reality such that (3/2)^n ≠ 2^m for any integers n, m, but IT'S SO CLOSE CMON MAN) some intervals must be tuned very slightly sharp on the piano, so that after 11 stacked 'equal-tempered' 5ths, each of them 1/50th of a semitone sharp, we arrive back at a perfect octave multiple of the original frequency. I knew all this, but the keys really did sound in tune with themselves and with each other! It sounded really nicely in tune! (For a piano). "Hear how it rolls over?" The piano tuner raised an eyebrow and said "listen again" and pressed a single key, his other hand miming a soaring bird. "Hear how it rolls over?" He was right. Just at the beginning of the note, there was a slight 'flange' sound which quickly disappeared as the note was held. It wasn't really audible repeated 'beating' - the pitches were too close for that. 
It was the beginning of one very long slow beat, most obvious when the higher frequency overtones were at their greatest amplitudes, i.e. during the attack of the note. So the piano's notes were in tune with each other, kinda, on average, and the notes were mostly in tune with themselves, but some had tiny deviations leading to the piano having a poor sound. "Are any of these notes brighter than others?" That wasn't all. He played a scale and said "how do the notes sound?" I had no idea. Like a normal, in-tune piano? "Do you hear how this one is brighter?" "Not really, honestly..." He pul...
Aug 4, 2024 • 5min

LW - SRE's review of Democracy by Martin Sustrik

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SRE's review of Democracy, published by Martin Sustrik on August 4, 2024 on LessWrong. Day One We've been handed this old legacy system called "Democracy". It's an emergency. The old maintainers are saying it has been misbehaving lately but they have no idea how to fix it. We've had a meeting with them to find out as much as possible about the system, but it turns out that all the original team members left the company a long time ago. The current team doesn't have much understanding of the system beyond some basic operational knowledge. We've conducted a cursory code review, focusing not so much on business logic but rather on the stuff that could possibly help us to tame it: Monitoring, reliability characteristics, feedback loops, automation already in place. Our first impression: Oh, God, is this thing complex! Second impression: The system is vaguely modular. Each module is strongly coupled with every other module though. It's an organically grown legacy system at its worst. That being said, we've found a clue as to why the system may have worked fine for so long. There's a redundancy system called "Separation of Powers". It reminds me of the Tandem computers back from the 70s. Day Two We were wrong. "Separation of Powers" is not a system for redundancy. Each part of the system ("branch") has different business logic. However, each also acts as a watchdog process for the other branches. When it detects misbehavior it tries to apply corrective measures using its own business logic. Gasp! Things are not looking good. We're still searching for monitoring. Day Three Hooray! We've found the monitoring! It turns out that "Election" is conducted once every four years. Each component reports its health (1 bit) to the central location. The data flow is so low that we have overlooked it until now. We are considering shortening the reporting period, but the subsystem is so deeply coupled with other subsystems that doing so could easily lead to a cascading failure. In other news, there seems to be some redundancy after all. We've found a full-blown backup control system ("Shadow Cabinet") that is inactive at the moment, but might be able to take over in case of a major failure. We're investigating further. Day Four Today, we've found yet another monitoring system called "FreePress." As the name suggests, it was open-sourced some time ago, but the corporate version has evolved quite a bit since then, so the documentation isn't very helpful. The bad news is that it's badly intertwined with the production system. The metrics look more or less okay as long as everything is working smoothly. However, it's unclear what will happen if things go south. It may distort the metrics or even fail entirely, leaving us with no data whatsoever at the moment of crisis. By the way, the "Election" process may not be a monitoring system after all. I suspect it might actually be a feedback loop that triggers corrective measures in case of problems. Day Five The most important metric seems to be this big graph labeled "GDP". As far as we understand, it's supposed to indicate the overall health of the system. However, drilling into the code suggests that it's actually a throughput metric. If throughput goes down there's certainly a problem, but it's not clear why increasing throughput should be considered the primary health factor... 
More news on the "Election" subsystem: We've found a floppy disk with the design doc, and it turns out that it's not a feedback loop after all. It's a distributed consensus algorithm (think Paxos)! The historical context is that they used to run several control systems in parallel (for redundancy reasons maybe?), which resulted in numerous race conditions and outages. "Election" was put in place to ensure that only one control system acts as a master at any given time...
Aug 3, 2024 • 5min

LW - Some comments on intelligence by Viliam

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some comments on intelligence, published by Viliam on August 3, 2024 on LessWrong. After reading another article on IQ, there are a few things that I wish would become common knowledge to increase the quality of the debate. Posting them here: 1) There is a difference between an abstract definition of intelligence such that it could also apply to aliens or AIs (something like "an agent able to optimize for outcomes in various environments") and the specific way the intelligence is implemented in human brains. Because of the implementation details, things can be true about human intelligence even if they are not necessarily true about intelligence in general. For example, we might empirically find that humans better at X are usually also better at Y, even if we could imagine a hypothetical AI (or even take an already existing one) whose skills at X and Y are unrelated. The fact that X and Y are unrelated in principle doesn't disprove the hypothesis that they are related in human brains. 2) Saying "the important thing is not intelligence (or rationality), but domain knowledge or experience or something else" is... ...on one hand, true; and the fans of intelligence (or rationality) should probably be reminded of it quite often. Yes, your Mensa membership card or LessWrong account doesn't mean that you no longer have to study things because you can solve relativity in five minutes of armchair reasoning... ...on the other hand, it's not like these things are completely unrelated. Yes, you acquire knowledge by studying, but your intelligence probably has a huge impact on how fast you can do that, or even whether you can do that at all. So we need to distinguish between the short term and the long term. In the short term, yes, domain knowledge and experience matter a lot, and intelligence is probably not going to save you if the inferential distances are large. But in the long term, intelligence may be necessary for acquiring the domain knowledge and experience. In other words, there is a huge difference between "can use intelligence instead of X, Y, Z" and "can use intelligence to acquire X, Y, Z". The argument about intelligence being less important than X, Y, Z is irrelevant as an objection to the latter. 3) The article that led me to writing this also proposed that we do not need separate education for gifted children; instead we should simply say that some children are further ahead in certain topics (this part is not going to trigger anyone's political instincts) and therefore we should have separate classes for... those who already know something, and those who don't know it yet. This would nicely avoid the controversy around intelligence and heredity etc., while still allowing the more intelligent kids (assuming that there is such a thing) to study at their own speed. A win/win solution for both those who believe in intelligence and those who don't? Unfortunately, I think this is not going to work. I approve of the idea of disentangling "intelligence" from "previously gained experience". But the entire point of IQ is that previously gained experience does not screen off intelligence. Your starting point is one thing; the speed at which you progress is another thing. Yes, it makes sense in the classroom to separate the children who already know X ("advanced") from the children who don't know X yet ("beginners"). 
No need for the advanced to listen again to the things they already know. But if you keep teaching both groups at the speed optimal for their average members, both the gifted beginners and the gifted advanced will be bored, each one in their own group. A system that allows everyone to achieve their full potential would be the one where the gifted beginner is allowed to catch up with the average advanced, and where the gifted advanced is allowed to leave the average advanced behin...
Aug 2, 2024 • 12min

LW - A Simple Toy Coherence Theorem by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Simple Toy Coherence Theorem, published by johnswentworth on August 2, 2024 on LessWrong. This post presents a simple toy coherence theorem, and then uses it to address various common confusions about coherence arguments. Setting Deterministic MDP. That means at each time t there's a state S[t][1], the agent/policy takes an action A[t] (which can depend on both time t and current state S[t]), and then the next state S[t+1] is fully determined by S[t] and A[t]. The current state and current action are sufficient to tell us the next state. We will think about values over the state at some final time T. Note that often in MDPs there is an incremental reward each timestep in addition to a final reward at the end; in our setting there is zero incremental reward at each timestep. One key point about this setting: if the value over final state is uniform, i.e. same value for all final states, then the MDP is trivial. In that case, all policies are optimal, it does not matter at all what the final state is or what any state along the way is, everything is equally valuable. Theorem There exist policies which cannot be optimal for any values over final state except for the trivial case of uniform values. Furthermore, such policies are exactly those which display inconsistent revealed preferences transitively between all final states. Proof As a specific example: consider an MDP in which every state is reachable at every timestep, and a policy which always stays in the same state over time. From each state S every other state is reachable, yet the policy chooses S, so in order for the policy to be optimal S must be a highest-value final state. Since each state must be a highest-value state, the policy cannot be optimal for any values over final state except for the trivial case of uniform values. That establishes the existence part of the theorem, and you can probably get the whole idea by thinking about how to generalize that example. The rest of the proof extends the idea of that example to inconsistent revealed preferences in general. Bulk of Proof Assume the policy is optimal for some particular values over final state. We can then start from those values over final state and compute the best value achievable starting from each state at each earlier time. That's just dynamic programming: V[S,t] = max over S' reachable in the next timestep from S of V[S',t+1], where V[S,T] are the values over final states. A policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. Now, suppose that at timestep t there are two different states either of which can reach either state A or state B in the next timestep. From one of those states the policy chooses A; from the other the policy chooses B. This is an inconsistent revealed preference between A and B at time t: sometimes the policy has a revealed preference for A over B, sometimes for B over A. In order for a policy with an inconsistent revealed preference between A and B at time t to be optimal, the values must satisfy V[A,t] = V[B,t]. Why? Well, a policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. So, if an optimal policy sometimes chooses A over B at timestep t when both are reachable, then we must have V[A,t] ≥ V[B,t]. 
And if an optimal policy sometimes chooses B over A at timestep t when both are reachable, then we must have V[A,t] ≤ V[B,t]. If both of those occur, i.e. the policy has an inconsistent revealed preference between A and B at time t, then V[A,t] = V[B,t]. Now, we can propagate that equality to a revealed preference on final states. We know that the final state which the policy in fact reaches starting from A at time t must have the highest reachable value, and that value...
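To make the dynamic-programming step concrete, here is a small self-contained toy (my own code, not the author's) that computes V[S,t] by backward induction and checks whether a deterministic policy always moves to a highest-value reachable next state; the two-state example mirrors the "always stay put" policy from the proof.

```python
def backward_values(states, reachable, final_values, T):
    """V[t][s] = max over s2 reachable from s in one step of V[t+1][s2]."""
    V = {T: dict(final_values)}
    for t in range(T - 1, -1, -1):
        V[t] = {s: max(V[t + 1][s2] for s2 in reachable[s]) for s in states}
    return V

def is_optimal(policy, states, reachable, final_values, T):
    """policy[(s, t)] is the next state chosen from state s at timestep t."""
    V = backward_values(states, reachable, final_values, T)
    return all(
        V[t + 1][policy[(s, t)]] == max(V[t + 1][s2] for s2 in reachable[s])
        for t in range(T)
        for s in states
    )

# "Always stay put" with every state reachable from every state:
states = ["A", "B"]
reachable = {s: states for s in states}
stay = {(s, t): s for s in states for t in range(2)}
print(is_optimal(stay, states, reachable, {"A": 1, "B": 0}, T=2))  # False: non-uniform values
print(is_optimal(stay, states, reachable, {"A": 1, "B": 1}, T=2))  # True: uniform values
```

Only the uniform final values make the stay-put policy optimal here, which is exactly the existence half of the theorem.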
Aug 2, 2024 • 3min

LW - AI Rights for Human Safety by Simon Goldstein

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Rights for Human Safety, published by Simon Goldstein on August 2, 2024 on LessWrong. Just wanted to share a new paper on AI rights, co-authored with Peter Salib, that members of this community might be interested in. Here's the abstract: AI companies are racing to create artificial general intelligence, or "AGI." If they succeed, the result will be human-level AI systems that can independently pursue high-level goals by formulating and executing long-term plans in the real world. Leading AI researchers agree that some of these systems will likely be "misaligned" - pursuing goals that humans do not desire. This goal mismatch will put misaligned AIs and humans into strategic competition with one another. As with present-day strategic competition between nations with incompatible goals, the result could be violent and catastrophic conflict. Existing legal institutions are unprepared for the AGI world. New foundations for AGI governance are needed, and the time to begin laying them is now, before the critical moment arrives. This Article begins to lay those new legal foundations. It is the first to think systematically about the dynamics of strategic competition between humans and misaligned AGI. The Article begins by showing, using formal game-theoretic models, that, by default, humans and AIs will be trapped in a prisoner's dilemma. Both parties' dominant strategy will be to permanently disempower or destroy the other, even though the costs of such conflict would be high. The Article then argues that a surprising legal intervention could transform the game theoretic equilibrium and avoid conflict: AI rights. Not just any AI rights would promote human safety. Granting AIs the right not to be needlessly harmed - as humans have granted to certain non-human animals - would, for example, have little effect. Instead, to promote human safety, AIs should be given those basic private law rights - to make contracts, hold property, and bring tort claims - that law already extends to non-human corporations. Granting AIs these economic rights would enable long-run, small-scale, mutually-beneficial transactions between humans and AIs. This would, we show, facilitate a peaceful strategic equilibrium between humans and AIs for the same reasons economic interdependence tends to promote peace in international relations. Namely, the gains from trade far exceed those from war. Throughout, we argue that human safety, rather than AI welfare, provides the right framework for developing AI rights. This Article explores both the promise and the limits of AI rights as a legal tool for promoting human safety in an AGI world. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
