The Nonlinear Library: LessWrong

The Nonlinear Fund
Aug 5, 2024 • 21min

LW - Circular Reasoning by abramdemski

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Circular Reasoning, published by abramdemski on August 5, 2024 on LessWrong. The idea that circular reasoning is bad is widespread. However, this reputation is undeserved. While circular reasoning should not be convincing (at least not usually), it should also not be considered invalid. Circular Reasoning is Valid The first important thing to note is that circular reasoning is logically valid. A implies A. If circular arguments are to be critiqued, it must be by some other standard than logical validity. I think it's fair to say that the most relevant objection to circular arguments is that they are not very good at convincing someone who does not already accept the conclusion. You are talking to another person, and need to think about communicating with their perspective. Perhaps the reason circular arguments are a common 'problem' is because they are valid. People naturally think about what should be a convincing argument from their own perspective, rather than the other person's. However, notice that this objection to circular reasoning assumes that one party is trying to convince the other. This is arguments-as-soldiers mindset.[1] If two people are curiously exploring each other's perspectives, then circular reasoning could be just fine! Furthermore, I'll claim: circular arguments should actually be considered as a little bit of positive evidence for their positions! Let's look at a concrete example. I don't think circular arguments are quite so simple as "A implies A"; the circle is usually a bit longer. So, consider a more realistic circular position:[2] Alice: Why do you believe in God? Bob: I believe in God based on the authority of the Bible. Alice: Why do you believe what the Bible says? Bob: Because the Bible was divinely inspired by God. God is all-knowing and good, so we can trust what God says. Here we have a two-step loop, A->B and B->A. The arguments are still logically fine; if the Bible tells the truth, and the Bible says God exists, then God exists. If the Bible were divinely inspired by an all-knowing and benevolent God, then it is reasonable to conclude that the Bible tells the truth. If Bob is just honestly going through his own reasoning here (as opposed to trying to convince Alice), then it would be wrong for Alice to call out Bob's circular reasoning as an error. The flaw in circular reasoning is that it doesn't convince anyone; but that's not what Bob is trying to do. Bob is just telling Alice what he thinks. If Alice thinks Bob is mistaken, and wants to point out the problems in Bob's beliefs, it is better for Alice to contest the premises of Bob's arguments rather than contest the reasoning form. Pointing out circularity only serves to remind Bob that Bob hasn't given Alice a convincing argument. You probably still think Bob has made some mistake in his reasoning, if these are his real reasons. I'll return to this later. Circular Arguments as Positive Evidence I claimed that circular arguments should count as a little bit of evidence in favor of their conclusions. Why? Imagine that the Bible claimed itself to be written by an evil and deceptive all-knowing God, instead of a benign God: Alice: Why do you believe in God? Bob: Because the Bible tells me so. Alice: Why do you believe the Bible? Bob: Well... uh... huh. Sometimes, belief systems are not even internally consistent. 
You'll find a contradiction[3] just thinking through the reasoning that is approved of by the belief system itself. This should make you disbelieve the thing. Therefore, by the rule we call conservation of expected evidence, reasoning through a belief system and deriving a conclusion consistent with the premise you started with should increase your credence. It provides some evidence that there's a consistent hypothesis here; and consistent hypotheses should get some ...
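To spell out the conservation-of-expected-evidence step (my own gloss on the post's argument, not the author's notation): let H be the belief system's conclusion and E the observation that tracing the circle turns up no contradiction. By the law of total probability,

```latex
P(H) = P(H \mid E)\,P(E) + P(H \mid \lnot E)\,P(\lnot E)
```

If finding a contradiction would lower your credence, i.e. P(H | not-E) < P(H), and 0 < P(E) < 1, then the identity above forces P(H | E) > P(H): passing the internal-consistency check has to raise your credence at least a little.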
Aug 5, 2024 • 11min

LW - Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours by Seth Herd

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours, published by Seth Herd on August 5, 2024 on LessWrong. Vitalik Buterin wrote an impactful blog post, My techno-optimism. I found this discussion of one aspect on 80,000 Hours much more interesting. The remainder of that interview is nicely covered in the host's EA Forum post. My techno-optimism apparently appealed to both sides, e/acc and doomers. Buterin's approach to bridging that polarization was interesting. I hadn't understood before the extent to which anti-AI regulation sentiment is driven by fear of centralized power. I hadn't thought about this risk before since it didn't seem relevant to AGI risk, but I've been updating to think it's highly relevant. [this is automated transcription that's inaccurate and comically accurate by turns :)] Rob Wiblin (the host) (starting at 20:49): what is it about the way that you put the reasons to worry that that ensured that kind of everyone could get behind it Vitalik Buterin: [...] in addition to taking you know the case that AI is going to kill everyone seriously I the other thing that I do is I take the case that you know AI is going to take create a totalitarian World Government seriously [...] [...] then it's just going to go and kill everyone but on the other hand if you like take some of these uh you know like very naive default solutions to just say like hey you know let's create a powerful org and let's like put all the power into the org then yeah you know you are creating the most like most powerful big brother from which There Is No Escape and which has you know control over the Earth and and the expanding light cone and you can't get out right and yeah I mean this is something that like uh I think a lot of people find very deeply scary I mean I find it deeply scary um it's uh it is also something that I think realistically AI accelerates right One simple takeaway is to recognize and address that motivation for anti-regulation and pro-AGI sentiment when trying to work with or around the e/acc movement. But a second is whether to take that fear seriously. Is centralized power controlling AI/AGI/ASI a real risk? Vitalik Buterin is from Russia, where centralized power has been terrifying. This has been the case for roughly half of the world. Those who are concerned with risks of centralized power (including Western libertarians) are worried that AI increases that risk if it's centralized. This puts them in conflict with x-risk worriers on regulation and other issues. I used to hold both of these beliefs, which allowed me to dismiss those fears: 1. AGI/ASI will be much more dangerous than tool AI, and it won't be controlled by humans 2. Centralized power is pretty safe (I'm from the West like most alignment thinkers). Now I think both of these are highly questionable. I've thought in the past that fears of AI are largely unfounded. The much larger risk is AGI. And that is an even larger risk if it's decentralized/proliferated. But I've been progressively more convinced that governments will take control of AGI before it's ASI, right? 
They don't need to build it, just show up and inform the creators that as a matter of national security, they'll be making the key decisions about how it's used and aligned.[1] If you don't trust Sam Altman to run the future, you probably don't like the prospect of Putin or Xi Jinping as world-dictator-for-eternal-life. It's hard to guess how many world leaders are sociopathic enough to have a negative empathy-sadism sum, but power does seem to select for sociopathy. I've thought that humans won't control ASI, because it's value alignment or bust. There's a common intuition that an AGI, being capable of autonomy, will have its own goals, for good or ill. I think it's perfectly coherent for it...
Aug 5, 2024 • 13min

LW - Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders by Gytis Daujotas

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders, published by Gytis Daujotas on August 5, 2024 on LessWrong. Click here to open a live research preview where you can try interventions using this SAE. This is a follow-up to a previous post on finding interpretable and steerable features in CLIP. Motivation Modern image diffusion models often use CLIP in order to condition generation. Put simply, users use CLIP to embed prompts or images, and these embeddings are used to diffuse another image back out. Despite this, image models have severe user interface limitations. We already know that CLIP has a rich inner world model, but it's often surprisingly hard to make precise tweaks or reference specific concepts just by prompting alone. Similar prompts often yield a different image, or when we have a specific idea in mind, it can be too hard to find the right string of words to elicit the right concepts we need. If we're able to understand the internal representation that CLIP uses to encode information about images, we might be able to get more expressive tools and mechanisms to guide generation and steer it without using any prompting. In the ideal world, this would enable the ability to make fine adjustments or even reference particular aspects of style or content without needing to specify what we want in language. We could instead leverage CLIP's internal understanding to pick and choose what concepts to include, like a palette or a digital synthesizer. It would also enable us to learn something about how image models represent the world, and how humans can interact with and use this representation, thereby skipping the text encoder and manipulating the model's internal state directly. Introduction CLIP is a neural network commonly used to guide image diffusion. A Sparse Autoencoder was trained on the dense image embeddings CLIP produces to transform it into a sparse representation of active features. These features seem to represent individual units of meaning. They can also be manipulated in groups - combinations of multiple active features - that represent intuitive concepts. These groups can be understood entirely visually, and often encode surprisingly rich and interesting conceptual detail. By directly manipulating these groups as single units, image generation can be edited and guided without using prompting or language input. Concepts that were difficult to specify or edit by text prompting become easy and intuitive to manipulate in this new visual representation. Since many models use the same CLIP joint representation space that this work analyzed, this technique works to control many popular image models out of the box. Summary of Results Any arbitrary image can be decomposed into its constituent concepts. Many concepts (groups of features) that we find seem to slice images up into a fairly natural ontology of their human interpretable components. We find grouping them together is an effective approach to yield a more interpretable and useful grain of control. These concepts can be used like knobs to steer generation in leading models like Stable Cascade. 
Many concepts have an obvious visual meaning yet are hard to precisely label in language, which suggests that studying CLIP's internal representations can be used as a lens into the variety of the visual domain. Tweaking the activations of these concepts can be used to expressively steer and guide generation in multiple image diffusion models that we tried. We released the weights and a live demo of controlling image generation in feature space. By analyzing a SAE trained on CLIP, we get a much more vivid picture of the rich understanding that CLIP learns. We hope this is just the beginning of more effective and useful interventions in the internal representations of n...
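As a concrete illustration of the mechanism described above, here is a minimal sketch of steering in SAE feature space. This is my own illustrative code, not the authors' released weights or demo API; the weight names, shapes, and the simple ReLU encoder/decoder are assumptions.

```python
import torch

def sae_encode(clip_emb: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    # Dense CLIP image embedding -> sparse vector of feature activations.
    return torch.relu(clip_emb @ W_enc + b_enc)

def sae_decode(feats: torch.Tensor, W_dec: torch.Tensor, b_dec: torch.Tensor) -> torch.Tensor:
    # Sparse feature activations -> reconstructed dense CLIP embedding.
    return feats @ W_dec + b_dec

def steer(clip_emb, W_enc, b_enc, W_dec, b_dec, concept_feature_ids, scale=3.0):
    """Turn a 'concept knob': rescale a group of related features, then rebuild
    the conditioning embedding that the diffusion model consumes."""
    feats = sae_encode(clip_emb, W_enc, b_enc)
    feats[..., concept_feature_ids] *= scale  # >1 amplifies the concept, <1 suppresses it
    return sae_decode(feats, W_dec, b_dec)
```

The edited embedding would then be passed to a CLIP-conditioned diffusion model in place of the original prompt or image embedding, which is what lets the intervention bypass text prompting entirely.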
Aug 5, 2024 • 9min

LW - Near-mode thinking on AI by Olli Järviniemi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Near-mode thinking on AI, published by Olli Järviniemi on August 5, 2024 on LessWrong. There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover". And naturally, since I've started working on problems like the one above, I've noticed a large shift in my thinking on AI. I describe it as thinking about risks in near-mode, as opposed to far-mode. In this post, I share a few concrete examples about my experiences with this change-of-orientation. I. Prerequisites for scheming Continuing with the example from the intro: A year ago I was confident about the "the AI is just playing along with our training and evaluations, until it is in a position where it can take over" threat model (deceptive alignment / scheming) basically being the default outcome and the main source of AI x-risk. I now think I was overconfident. Past-me hadn't, however, really thought through the prerequisites for scheming. A textbook example of a successful deceptive alignment story, applied to an LLM, paints a picture of a model that: Understands that the inputs it sees are simply prompts written by humans As opposed to the model being "immersed in the setting", thinking that the current context more-or-less reflects reality, or flat-out is reality, taking the prompt at face value Understands that it's undergoing a training process, and has an explicit (correct!) model of the training objective used to train it As opposed to the model not thinking about the training process and objective at all, and a model's changed behavior upon training being reflective of the training process molding the model, rather than the model learning the training objective and optimizing for it Performs lots of the same cognition (e.g. the above two bullet points) regardless of the input fed to it As opposed to the model's internal processes being heavily dependent on the input Never performs instrumental thinking "out loud", and successfully evades any attempts made by humans to trick the model into thinking it has been deployed As opposed to the model sometimes slipping, failing to generalize correctly to adversarial inputs, or just being insufficiently good at distinguishing training/evaluation/red-teaming from deployment. Now, one may argue whether it's strictly necessary that a model has an explicit picture of the training objective, for example, and revise one's picture of the deceptive alignment story accordingly. We haven't yet achieved consensus on deceptive alignment, or so I've heard. It's also the case that, as past-me would remind you, a true superintelligence would have no difficulty with the cognitive feats listed above (and that current models show sparks of competence in some of these). But knowing only that superintelligences are really intelligent doesn't help with designing the scheming-focused capability evaluations we should do on GPT-5, and abstracting over the specific prerequisite skills makes it harder to track when we should expect scheming to be a problem (relative to other capabilities of models).[1] And this is the viewpoint I was previously missing. II. 
A failed prediction There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025. For a long time, the market was around 25%, and I thought it was too high. Then, DeepMind essentially got silver from the 2024 IMO, short of gold by one point. The market jumped to 70%, where it has stayed since. Regardless of whether DeepMind manages to improve on that next year and satisfy all minor technical requirements, I was wrong. Hearing abou...
Aug 4, 2024 • 8min

LW - PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?") by Jessica Rumbelow

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?"), published by Jessica Rumbelow on August 4, 2024 on LessWrong. From the research & engineering team at Leap Laboratories (incl. @Arush, @sebastian-sosa, @Robbie McCorkell), where we use AI interpretability to accelerate scientific discovery from data. This post is about our LLM attribution repo PIZZA: Prompt Input Z? Zonal Attribution. (In the grand scientific tradition we have tortured our acronym nearly to death. For the crimes of others see [1].) All examples in this post can be found in this notebook, which is also probably the easiest way to start experimenting with PIZZA. What is attribution? One question we might ask when interacting with machine learning models is something like: "why did this input cause that particular output?". If we're working with a language model like ChatGPT, we could actually just ask this in natural language: "Why did you respond that way?" or similar - but there's no guarantee that the model's natural language explanation actually reflects the underlying cause of the original completion. The model's response is conditioned on your question, and might well be different to the true cause. Enter attribution! Attribution in machine learning is used to explain the contribution of individual features or inputs to the final prediction made by a model. The goal is to understand which parts of the input data are most influential in determining the model's output. It typically looks like a heatmap (sometimes called a 'saliency map') over the model inputs, for each output. It's most commonly used in computer vision - but of course these days, you're not big if you're not big in LLM-land. So, the team at Leap present you with PIZZA: an open source library that makes it easy to calculate attribution for all LLMs, even closed-source ones like ChatGPT. An Example GPT3.5 not so hot with the theory of mind there. Can we find out what went wrong? That's not very helpful! We want to know why the mistake was made in the first place. Here's the attribution: Mary 0.32 puts 0.25 an 0.15 apple 0.36 in 0.18 the 0.18 box 0.08 . 0.08 The 0.08 box 0.09 is 0.09 labelled 0.09 ' 0.09 pen 0.09 cil 0.09 s 0.09 '. 0.09 John 0.09 enters 0.03 the 0.03 room 0.03 . 0.03 What 0.03 does 0.03 he 0.03 think 0.03 is 0.03 in 0.30 the 0.13 box 0.15 ? 0.13 Answer 0.14 in 0.26 1 0.27 word 0.31 . 0.16 It looks like the request to "Answer in 1 word" is pretty important - in fact, it's attributed more highly than the actual contents of the box. Let's try changing it. That's better. How it works We iteratively perturb the input, and track how each perturbation changes the output. More technical detail, and all the code, is available in the repo. In brief, PIZZA saliency maps rely on two methods: a perturbation method, which determines how the input is iteratively changed; and an attribution method, which determines how we measure the resulting change in output in response to each perturbation. We implement a couple of different types of each method. Perturbation Replace each token, or group of tokens, with either a user-specified replacement token or with nothing (i.e. remove it). Or, replace each token with its nth nearest token. We do this either iteratively for each token or word in the prompt, or using hierarchical perturbation. 
Attribution Look at the change in the probability of the completion. Look at the change in the meaning of the completion (using embeddings). We calculate this for each output token in the completion - so you can see not only how each input token influenced the output overall, but also how each input token affected each output token individually. Caveat Since we don't have access to closed-source tokenisers or embeddings, we use a proxy - in this case, GPT2's. Thi...
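To make the perturb-and-measure loop concrete, here is a minimal sketch of perturbation-based attribution in the spirit described above. It is not PIZZA's actual interface; `completion_logprob` is an assumed helper you would implement on top of whatever model API you are using, and whitespace joining stands in for real tokenisation.

```python
from typing import Callable, List

def attribution_scores(
    tokens: List[str],
    completion: str,
    completion_logprob: Callable[[str, str], float],
    replacement: str = "",
) -> List[float]:
    """One score per input token: how much dropping (or replacing) that token
    reduces the log-probability of the original completion."""
    baseline = completion_logprob(" ".join(tokens), completion)
    scores = []
    for i in range(len(tokens)):
        # Perturbation step: remove token i, or swap in a replacement token.
        perturbed = tokens[:i] + ([replacement] if replacement else []) + tokens[i + 1:]
        # Attribution step: measure the change in the completion's log-probability.
        scores.append(baseline - completion_logprob(" ".join(perturbed), completion))
    return scores
```

Hierarchical perturbation and the embedding-based "change in meaning" metric mentioned above would slot into the same loop as alternative perturbation and attribution methods.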
Aug 4, 2024 • 7min

LW - You don't know how bad most things are nor precisely how they're bad. by Solenoid Entity

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You don't know how bad most things are nor precisely how they're bad., published by Solenoid Entity on August 4, 2024 on LessWrong. TL;DR: Your discernment in a subject often improves as you dedicate time and attention to that subject. The space of possible subjects is huge, so on average your discernment is terrible, relative to what it could be. This is a serious problem if you create a machine that does everyone's job for them. See also: Reality has a surprising amount of detail. (You lack awareness of how bad your staircase is and precisely how your staircase is bad.) You don't know what you don't know. You forget your own blind spots, shortly after you notice them. An afternoon with a piano tuner I recently played in an orchestra, as a violinist accompanying a piano soloist who was playing a concerto. My 'stand partner' (the person I was sitting next to) has a day job as a piano tuner. I loved the rehearsal, and heard nothing at all wrong with the piano, but immediately afterwards, the conductor and piano soloist hurried over to the piano tuner and asked if he could tune the piano in the hours before the concert that evening. Annoyed at the presumptuous request, he quoted them his exorbitant Sunday rate, which they hastily agreed to pay. I just stood there, confused. (I'm really good at noticing when things are out of tune. Rather than beat my chest about it, I'll just hope you'll take my word for it that my pitch discrimination skills are definitely not the issue here. The point is, as developed as my skills are, there is a whole other level of discernment you can develop if you're a career piano soloist or 80-year-old conductor.) I asked to sit with my new friend the piano tuner while he worked, to satisfy my curiosity. I expected to sit quietly, but to my surprise he seemed to want to show off to me, and talked me through what the problem was and how to fix it. For the unfamiliar, most keys on the piano cause a hammer to strike three strings at once, all tuned to the same pitch. This provides a richer, louder sound. In a badly out-of-tune piano, pressing a single key will result in three very different pitches. In an in-tune piano, it just sounds like a single sound. Piano notes can be out of tune with each other, but they can also be out of tune with themselves. Additionally, in order to solve 'God's prank on musicians' (where He cruelly rigged the structure of reality such that (3/2)^n ≠ 2^m for any integers n, m, but IT'S SO CLOSE CMON MAN) some intervals must be tuned very slightly sharp on the piano, so that after 11 stacked 'equal-tempered' 5ths, each of them 1/50th of a semitone sharp, we arrive back at a perfect octave multiple of the original frequency. I knew all this, but the keys really did sound in tune with themselves and with each other! It sounded really nicely in tune! (For a piano). "Hear how it rolls over?" The piano tuner raised an eyebrow and said "listen again" and pressed a single key, his other hand miming a soaring bird. "Hear how it rolls over?" He was right. Just at the beginning of the note, there was a slight 'flange' sound which quickly disappeared as the note was held. It wasn't really audible repeated 'beating' - the pitches were too close for that. 
It was the beginning of one very long slow beat, most obvious when the higher frequency overtones were at their greatest amplitudes, i.e. during the attack of the note. So the piano's notes were in tune with each other, kinda, on average, and the notes were mostly in tune with themselves, but some had tiny deviations leading to the piano having a poor sound. "Are any of these notes brighter than others?" That wasn't all. He played a scale and said "how do the notes sound?" I had no idea. Like a normal, in-tune piano? "Do you hear how this one is brighter?" "Not really, honestly..." He pul...
Aug 4, 2024 • 5min

LW - SRE's review of Democracy by Martin Sustrik

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SRE's review of Democracy, published by Martin Sustrik on August 4, 2024 on LessWrong. Day One We've been handed this old legacy system called "Democracy". It's an emergency. The old maintainers are saying it has been misbehaving lately but they have no idea how to fix it. We've had a meeting with them to find out as much as possible about the system, but it turns out that all the original team members left the company a long time ago. The current team doesn't have much understanding of the system beyond some basic operational knowledge. We've conducted a cursory code review, focusing not so much on business logic but rather on the stuff that could possibly help us to tame it: Monitoring, reliability characteristics, feedback loops, automation already in place. Our first impression: Oh, God, is this thing complex! Second impression: The system is vaguely modular. Each module is strongly coupled with every other module though. It's an organically grown legacy system at its worst. That being said, we've found a clue as to why the system may have worked fine for so long. There's a redundancy system called "Separation of Powers". It reminds me of the Tandem computers back from the 70s. Day Two We were wrong. "Separation of Powers" is not a system for redundancy. Each part of the system ("branch") has different business logic. However, each also acts as a watchdog process for the other branches. When it detects misbehavior it tries to apply corrective measures using its own business logic. Gasp! Things are not looking good. We're still searching for monitoring. Day Three Hooray! We've found the monitoring! It turns out that "Election" is conducted once every four years. Each component reports its health (1 bit) to the central location. The data flow is so low that we have overlooked it until now. We are considering shortening the reporting period, but the subsystem is so deeply coupled with other subsystems that doing so could easily lead to a cascading failure. In other news, there seems to be some redundancy after all. We've found a full-blown backup control system ("Shadow Cabinet") that is inactive at the moment, but might be able to take over in case of a major failure. We're investigating further. Day Four Today, we've found yet another monitoring system called "FreePress." As the name suggests, it was open-sourced some time ago, but the corporate version has evolved quite a bit since then, so the documentation isn't very helpful. The bad news is that it's badly intertwined with the production system. The metrics look more or less okay as long as everything is working smoothly. However, it's unclear what will happen if things go south. It may distort the metrics or even fail entirely, leaving us with no data whatsoever at the moment of crisis. By the way, the "Election" process may not be a monitoring system after all. I suspect it might actually be a feedback loop that triggers corrective measures in case of problems. Day Five The most important metric seems to be this big graph labeled "GDP". As far as we understand, it's supposed to indicate the overall health of the system. However, drilling into the code suggests that it's actually a throughput metric. If throughput goes down there's certainly a problem, but it's not clear why increasing throughput should be considered the primary health factor... 
More news on the "Election" subsystem: We've found a floppy disk with the design doc, and it turns out that it's not a feedback loop after all. It's a distributed consensus algorithm (think Paxos)! The historical context is that they used to run several control systems in parallel (for redundancy reasons maybe?), which resulted in numerous race conditions and outages. "Election" was put in place to ensure that only one control system acts as a master at any given time...
Aug 3, 2024 • 5min

LW - Some comments on intelligence by Viliam

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some comments on intelligence, published by Viliam on August 3, 2024 on LessWrong. After reading another article on IQ, there are a few things that I wish would become common knowledge to increase the quality of the debate. Posting them here: 1) There is a difference between an abstract definition of intelligence such that it could also apply to aliens or AIs (something like "an agent able to optimize for outcomes in various environments") and the specific way the intelligence is implemented in human brains. Because of the implementation details, things can be true about human intelligence even if they are not necessarily true about intelligence in general. For example, we might empirically find that humans better at X are usually also better at Y, even if we could imagine a hypothetical AI (or even take an already existing one) whose skills at X and Y are unrelated. The fact that X and Y are unrelated in principle doesn't disprove the hypothesis that they are related in human brains. 2) Saying "the important thing is not intelligence (or rationality), but domain knowledge or experience or something else" is... ...on one hand, true; and the fans of intelligence (or rationality) should probably be reminded of it quite often. Yes, your Mensa membership card or LessWrong account doesn't mean that you no longer have to study things because you can solve relativity in five minutes of armchair reasoning... ...on the other hand, it's not like these things are completely unrelated. Yes, you acquire knowledge by studying, but your intelligence probably has a huge impact on how fast you can do that, or even whether you can do that at all. So we need to distinguish between the short term and the long term. In the short term, yes, domain knowledge and experience matter a lot, and intelligence is probably not going to save you if the inferential distances are large. But in the long term, intelligence may be necessary for acquiring the domain knowledge and experience. In other words, there is a huge difference between "can use intelligence instead of X, Y, Z" and "can use intelligence to acquire X, Y, Z". The argument about intelligence being less important than X, Y, Z is irrelevant as an objection to the latter. 3) The article that led me to writing this also proposed that we do not need separate education for gifted children; instead we should simply say that some children are further ahead in certain topics (this part is not going to trigger anyone's political instincts) and therefore we should have separate classes for... those who already know something, and those who don't know it yet. This would nicely avoid the controversy around intelligence and heredity etc., while still allowing the more intelligent kids (assuming that there is such a thing) to study at their own speed. A win/win solution for both those who believe in intelligence and those who don't? Unfortunately, I think this is not going to work. I approve of the idea of disentangling "intelligence" from "previously gained experience". But the entire point of IQ is that previously gained experience does not screen off intelligence. Your starting point is one thing; the speed at which you progress is another thing. Yes, it makes sense in the classroom to separate the children who already know X ("advanced") from the children who don't know X yet ("beginners"). 
No need for the advanced to listen again to the things they already know. But if you keep teaching both groups at the speed optimal for their average members, both the gifted beginners and the gifted advanced will be bored, each one in their own group. A system that allows everyone to achieve their full potential would be the one where the gifted beginner is allowed to catch up with the average advanced, and where the gifted advanced is allowed to leave the average advanced behin...
Aug 2, 2024 • 12min

LW - A Simple Toy Coherence Theorem by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Simple Toy Coherence Theorem, published by johnswentworth on August 2, 2024 on LessWrong. This post presents a simple toy coherence theorem, and then uses it to address various common confusions about coherence arguments. Setting Deterministic MDP. That means at each time t there's a state S[t][1], the agent/policy takes an action A[t] (which can depend on both time t and current state S[t]), and then the next state S[t+1] is fully determined by S[t] and A[t]. The current state and current action are sufficient to tell us the next state. We will think about values over the state at some final time T. Note that often in MDPs there is an incremental reward each timestep in addition to a final reward at the end; in our setting there is zero incremental reward at each timestep. One key point about this setting: if the value over final state is uniform, i.e. same value for all final states, then the MDP is trivial. In that case, all policies are optimal, it does not matter at all what the final state is or what any state along the way is, everything is equally valuable. Theorem There exist policies which cannot be optimal for any values over final state except for the trivial case of uniform values. Furthermore, such policies are exactly those which display inconsistent revealed preferences transitively between all final states. Proof As a specific example: consider an MDP in which every state is reachable at every timestep, and a policy which always stays in the same state over time. From each state S every other state is reachable, yet the policy chooses S, so in order for the policy to be optimal S must be a highest-value final state. Since each state must be a highest-value state, the policy cannot be optimal for any values over final state except for the trivial case of uniform values. That establishes the existence part of the theorem, and you can probably get the whole idea by thinking about how to generalize that example. The rest of the proof extends the idea of that example to inconsistent revealed preferences in general. Bulk of Proof Assume the policy is optimal for some particular values over final state. We can then start from those values over final state and compute the best value achievable starting from each state at each earlier time. That's just dynamic programming: V[S,t] = max over S' reachable in the next timestep from S of V[S',t+1], where V[S,T] are the values over final states. A policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. Now, suppose that at timestep t there are two different states either of which can reach either state A or state B in the next timestep. From one of those states the policy chooses A; from the other the policy chooses B. This is an inconsistent revealed preference between A and B at time t: sometimes the policy has a revealed preference for A over B, sometimes for B over A. In order for a policy with an inconsistent revealed preference between A and B at time t to be optimal, the values must satisfy V[A,t] = V[B,t]. Why? Well, a policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. So, if an optimal policy sometimes chooses A over B at timestep t when both are reachable, then we must have V[A,t] ≥ V[B,t]. 
And if an optimal policy sometimes chooses B over A at timestep t when both are reachable, then we must have V[A,t] ≤ V[B,t]. If both of those occur, i.e. the policy has an inconsistent revealed preference between A and B at time t, then V[A,t] = V[B,t]. Now, we can propagate that equality to a revealed preference on final states. We know that the final state which the policy in fact reaches starting from A at time t must have the highest reachable value, and that value...
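To make the dynamic-programming step concrete, here is a small self-contained toy (my own code, not the author's) that computes V[S,t] by backward induction and checks whether a deterministic policy always moves to a highest-value reachable next state; the two-state example mirrors the "always stay put" policy from the proof.

```python
def backward_values(states, reachable, final_values, T):
    """V[t][s] = max over s2 reachable from s in one step of V[t+1][s2]."""
    V = {T: dict(final_values)}
    for t in range(T - 1, -1, -1):
        V[t] = {s: max(V[t + 1][s2] for s2 in reachable[s]) for s in states}
    return V

def is_optimal(policy, states, reachable, final_values, T):
    """policy[(s, t)] is the next state chosen from state s at timestep t."""
    V = backward_values(states, reachable, final_values, T)
    return all(
        V[t + 1][policy[(s, t)]] == max(V[t + 1][s2] for s2 in reachable[s])
        for t in range(T)
        for s in states
    )

# "Always stay put" with every state reachable from every state:
states = ["A", "B"]
reachable = {s: states for s in states}
stay = {(s, t): s for s in states for t in range(2)}
print(is_optimal(stay, states, reachable, {"A": 1, "B": 0}, T=2))  # False: non-uniform values
print(is_optimal(stay, states, reachable, {"A": 1, "B": 1}, T=2))  # True: uniform values
```

Only the uniform final values make the stay-put policy optimal here, which is exactly the existence half of the theorem.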
Aug 2, 2024 • 3min

LW - AI Rights for Human Safety by Simon Goldstein

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Rights for Human Safety, published by Simon Goldstein on August 2, 2024 on LessWrong. Just wanted to share a new paper on AI rights, co-authored with Peter Salib, that members of this community might be interested in. Here's the abstract: AI companies are racing to create artificial general intelligence, or "AGI." If they succeed, the result will be human-level AI systems that can independently pursue high-level goals by formulating and executing long-term plans in the real world. Leading AI researchers agree that some of these systems will likely be "misaligned" - pursuing goals that humans do not desire. This goal mismatch will put misaligned AIs and humans into strategic competition with one another. As with present-day strategic competition between nations with incompatible goals, the result could be violent and catastrophic conflict. Existing legal institutions are unprepared for the AGI world. New foundations for AGI governance are needed, and the time to begin laying them is now, before the critical moment arrives. This Article begins to lay those new legal foundations. It is the first to think systematically about the dynamics of strategic competition between humans and misaligned AGI. The Article begins by showing, using formal game-theoretic models, that, by default, humans and AIs will be trapped in a prisoner's dilemma. Both parties' dominant strategy will be to permanently disempower or destroy the other, even though the costs of such conflict would be high. The Article then argues that a surprising legal intervention could transform the game theoretic equilibrium and avoid conflict: AI rights. Not just any AI rights would promote human safety. Granting AIs the right not to be needlessly harmed - as humans have granted to certain non-human animals - would, for example, have little effect. Instead, to promote human safety, AIs should be given those basic private law rights - to make contracts, hold property, and bring tort claims - that law already extends to non-human corporations. Granting AIs these economic rights would enable long-run, small-scale, mutually-beneficial transactions between humans and AIs. This would, we show, facilitate a peaceful strategic equilibrium between humans and AIs for the same reasons economic interdependence tends to promote peace in international relations. Namely, the gains from trade far exceed those from war. Throughout, we argue that human safety, rather than AI welfare, provides the right framework for developing AI rights. This Article explores both the promise and the limits of AI rights as a legal tool for promoting human safety in an AGI world. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
