The Nonlinear Library

The Nonlinear Fund
Jun 10, 2024 • 7min

LW - My AI Model Delta Compared To Yudkowsky by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My AI Model Delta Compared To Yudkowsky, published by johnswentworth on June 10, 2024 on LessWrong. Preamble: Delta vs Crux I don't natively think in terms of cruxes. But there's a similar concept which is more natural for me, which I'll call a delta. Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it's cloudy today, that means the "weather" variable in my program at a particular time[1] takes on the value "cloudy". Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world. If your model and my model differ in that way, and we're trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is. That's a delta: one or a few relatively "small"/local differences in belief, which when propagated through our models account for most of the differences in our beliefs. For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else's model, or vice versa. This post is about my current best guesses at the delta between my AI models and Yudkowsky's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models basically look like Yudkowsky's as far as I can tell. This post might turn into a sequence if there's interest; I already have another one written for Christiano, and people are welcome to suggest others they'd be interested in. My AI Model Delta Compared To Yudkowsky Best guess: Eliezer basically rejects the natural abstraction hypothesis. He mostly expects AI to use internal ontologies fundamentally alien to the ontologies of humans, at least in the places which matter. Lethality #33 lays it out succinctly: 33. The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind. What do my models look like if I propagate that delta? In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked, and a 99% probability of doom strikes me as entirely reasonable and justified. Here's one oversimplified doom argument/story in a world where natural abstraction fails hard: 1. Humanity is going to build superhuman goal-optimizing agents. ('Cause, like, obviously somebody's going to do that, there's no shortage of capabilities researchers loudly advertising that they're aiming to do that exact thing.) These will be so vastly more powerful than humans that we have basically-zero bargaining power except insofar as AIs are aligned to our interests. 2. 
We're assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we'll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system's internal ontology at all. (For instance, maybe a faithful and robust translation would be so long in the system's "internal language" that the transla...
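To make the "delta" framing from the preamble concrete, here is a minimal, hypothetical Python sketch (not from the post): two world-models that run the same "program" but disagree on a single upstream parameter, and whose downstream beliefs diverge as a result. The name natural_abstraction_strength and all numbers are illustrative assumptions; changing that one parameter plays the role of a single do() intervention that would make one model match the other.

```python
# Hypothetical sketch (not from the post): two world-models that run the same
# "program" but differ in one upstream parameter, and whose downstream beliefs
# diverge as a result. All names and numbers are illustrative assumptions.

def world_model(natural_abstraction_strength: float) -> dict:
    """Toy 'program' whose downstream beliefs all depend on one parameter."""
    translatability = natural_abstraction_strength       # how well human concepts map into the AI
    interpretability_works = translatability > 0.5
    p_doom = 0.2 if interpretability_works else 0.99     # made-up numbers, for illustration only
    return {
        "translatability": translatability,
        "interpretability_works": interpretability_works,
        "p_doom": p_doom,
    }

# Same program, one parameter changed, analogous to a single do() intervention
# that suffices to make one model match the other.
mine = world_model(natural_abstraction_strength=0.9)
theirs = world_model(natural_abstraction_strength=0.1)

for key in mine:
    if mine[key] != theirs[key]:
        print(f"belief '{key}' differs: {mine[key]} vs {theirs[key]}")
```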
Jun 10, 2024 • 11min

AF - 5. Open Corrigibility Questions by Max Harms

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 5. Open Corrigibility Questions, published by Max Harms on June 10, 2024 on The AI Alignment Forum. (Part 5 of the CAST sequence) Much work remains on the topic of corrigibility and the CAST strategy in particular. There's theoretical work in both nailing down an even more complete picture of corrigibility and in developing better formal measures. But there's also a great deal of empirical work that seems possible to do at this point. In this document I'll attempt to give a summary of where I, personally, want to invest more energy. Remaining Confusion Does "empowerment" really capture the gist of corrigibility? Does it actually matter whether we restrict the empowerment goal to the domains of the agent's structure, thoughts, actions, and the consequences of their actions? Or do we still get good outcomes if we ask for more general empowerment? It seems compelling to model nearly everything in the AI's lightcone as a consequence of its actions, given that there's a counterfactual way the AI could have behaved such that those facts would change. If we ask to be able to correct the AI's actions, are we not, in practice, then asking to be generally empowered? Corrigible agents should, I think, still (ultimately) obey commands that predictably disempower the principal or change the agent to be less corrigible. Does my attempted formalism actually capture this? Can we prove that, in my formalism, any pressure on the principal's actions that stems from outside their values is disempowering? How should we think about agent-actions which scramble the connection between values and principal-actions, but in a way that preserves the way in which actions encode information about what generated them? Is this still kosher? What if the scrambling takes place by manipulating the principal's beliefs? What's going on with the relationship between time, policies, and decisions? Am I implicitly picking a decision theory for the agent in my formalism? Are my attempts to rescue corrigibility in the presence of multiple timesteps philosophically coherent? Should we inject entropy into the AI's distribution over what time it is when measuring its expected corrigibility? If so, how much? Are the other suggestions about managing time good? What other tricks are there to getting things to work that I haven't thought of? Sometimes it's good to change values, such as if one has a meta-value (i.e. "I want to want to stop gambling"). How can we formally reflect the desiderata of having a corrigible agent support this kind of growth, or at least not try to block the principal from growing? If the agent allows the principal to change values, how can we clearly distinguish the positive and natural kind of growth from unwanted value drift or manipulation? Is there actually a clean line between learning facts and changing values? If not, does "corrigibility" risk having an agent who wants to prevent the principal from learning things? Does the agent want to protect the principal in general, or simply to protect the principal from the actions of the agent? Corrigibility clearly involves respecting commands given by the principal yesterday, or more generally, some arbitrary time in the past. But when the principal of today gives a contradictory command, we want the agent to respect the updated instruction. What gives priority of the present over the past? 
If the agent strongly expects the principal to give a command in the future, does that expected-command carry any weight? If so, can it take priority over the principal of the past/present? Can a multiple-human team actually be a principal? What's the right way to ground that out, ontologically? How should a corrigible agent behave when its principal seems self-contradictory? (Either because the principal is a team, or simply because the single-huma...
Jun 10, 2024 • 1min

LW - What if a tech company forced you to move to NYC? by KatjaGrace

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What if a tech company forced you to move to NYC?, published by KatjaGrace on June 10, 2024 on LessWrong. It's interesting to me how chill people sometimes are about the non-extinction future AI scenarios. Like, there seem to be opinions around along the lines of "pshaw, it might ruin your little sources of 'meaning', Luddite, but we have always had change and as long as the machines are pretty near the mark on rewiring your brain it will make everything amazing". Yet I would bet that even that person, if faced instead with a policy that was going to forcibly relocate them to New York City, would be quite indignant, and want a lot of guarantees about the preservation of various very specific things they care about in life, and not be just like "oh sure, NYC has higher GDP/capita than my current city, sounds good". I read this as a lack of engaging with the situation as real. But possibly my sense that a non-negligible number of people have this flavor of position is wrong. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jun 10, 2024 • 4min

LW - The Data Wall is Important by JustisMills

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Data Wall is Important, published by JustisMills on June 10, 2024 on LessWrong. Modern AI is trained on a huge fraction of the internet, especially at the cutting edge, with the best models trained on close to all the high quality data we've got.[1] And data is really important! You can scale up compute, you can make algorithms more efficient, or you can add infrastructure around a model to make it more useful, but on the margin, great datasets are king. And, naively, we're about to run out of fresh data to use. It's rumored that the top firms are looking for ways to get around the data wall. One possible approach is having LLMs create their own data to train on, for which there is kinda-sorta a precedent from, e.g. modern chess AIs learning by playing games against themselves.[2] Or just finding ways to make AI dramatically more sample efficient with the data we've already got: the existence of human brains proves that this is, theoretically, possible.[3] But all we have, right now, are rumors. I'm not even personally aware of rumors that any lab has cracked the problem: certainly, nobody has come out and said so in public! There's a lot of insinuation that the data wall is not so formidable, but no hard proof. And if the data wall is a hard blocker, it could be very hard to get AI systems much stronger than they are now. If the data wall stands, what would we make of today's rumors? There's certainly an optimistic mood about progress coming from AI company CEOs, and a steady trickle of not-quite-leaks that exciting stuff is going on behind the scenes, and to stay tuned. But there are at least two competing explanations for all this: Top companies are already using the world's smartest human minds to crack the data wall, and have all but succeeded. Top companies need to keep releasing impressive stuff to keep the money flowing, so they declare, both internally and externally, that their current hurdles are surmountable. There's lots of precedent for number two! You may have heard of startups hard coding a feature and then scrambling to actually implement it when there's interest. And race dynamics make this even more likely: if OpenAI projects cool confidence that it's almost over the data wall, and Anthropic doesn't, then where will all the investors, customers, and high profile corporate deals go? There also could be an echo chamber effect, where one firm acting like the data wall's not a big deal makes other firms take their word for it. I don't know what a world with a strong data wall looks like in five years. I bet it still looks pretty different than today! Just improving GPT-4 level models around the edges, giving them better tools and scaffolding, should be enough to spur massive economic activity and, in the absence of government intervention, job market changes. We can't unscramble the egg. But the "just trust the straight line on the graph" argument is ignoring that one of the determinants of that line is running out. There's a world where the line is stronger than that particular constraint, and a new treasure trove of data appears in time. But there's also a world where it isn't, and we're near the inflection of an S-curve. Rumors and projected confidence can't tell us which world we're in. 1. ^ For good analysis of this, search for the heading "The data wall" here. 2. ^ But don't take this parallel too far! 
Chess AI (or AI playing any other game) has a signal of "victory" that it can seek out - it can preferentially choose moves that systematically lead to the "my side won the game" outcome. But the core of an LLM is a text predictor: "winning" for it is correctly guessing what comes next in human-created text. What does self-play look like there? Merely making up fake human-created text has the obvious issue of amplifying any weaknesses the AI has ...
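As a rough illustration of the contrast drawn above, here is a hypothetical Python sketch (mine, not the author's): a self-play learner scores itself against the game's own win/loss signal, so it can generate unlimited fresh training signal, while a next-token predictor's loss is defined only relative to pre-existing human text. The function names and toy numbers are assumptions for illustration.

```python
# Illustrative contrast (not from the post): a self-play learner has an internal
# "victory" signal; a text predictor's training signal is agreement with
# pre-existing human text.

import math
import random

def self_play_reward(policy, simulate_game) -> float:
    """Self-play: the reward comes from the game's own win/loss signal,
    so the agent can mint new training signal just by playing itself."""
    return simulate_game(policy, policy)  # +1 win, 0 draw, -1 loss

def next_token_loss(model_probs: dict, human_text: list) -> float:
    """Pretraining-style objective: loss is defined relative to human-created
    text; once that text runs out, there is no new signal of this kind."""
    losses = [-math.log(model_probs.get(tok, 1e-9)) for tok in human_text]
    return sum(losses) / max(len(losses), 1)

# Toy usage: the chess-like learner needs only a simulator; the predictor needs data.
print(self_play_reward(policy=None, simulate_game=lambda a, b: random.choice([-1, 0, 1])))
print(next_token_loss({"the": 0.5, "cat": 0.3}, ["the", "cat", "sat"]))
```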
Jun 10, 2024 • 14min

LW - Why I don't believe in the placebo effect by transhumanist atom understander

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I don't believe in the placebo effect, published by transhumanist atom understander on June 10, 2024 on LessWrong. Have you heard this before? In clinical trials, medicines have to be compared to a placebo to separate the effect of the medicine from the psychological effect of taking the drug. The patient's belief in the power of the medicine has a strong effect on its own. In fact, for some drugs such as antidepressants, the psychological effect of taking a pill is larger than the effect of the drug. It may even be worth it to give a patient an ineffective medicine just to benefit from the placebo effect. This is the conventional wisdom that I took for granted until recently. I no longer believe any of it, and the short answer as to why is that big meta-analysis on the placebo effect. That meta-analysis collected all the studies they could find that did "direct" measurements of the placebo effect. In addition to a placebo group that could, for all they know, be getting the real treatment, these studies also included a group of patients that didn't receive a placebo. But even after looking at the meta-analysis I still found the situation confusing. The only reason I ever believed in the placebo effect was because I understood it to be a scientific finding. This may put me in a different position than people who believe in it from personal experience. But personally, I thought it was just a well-known scientific fact that was important to the design of clinical trials. How did it come to be conventional wisdom, if direct measurement doesn't back it up? And what do the studies collected in that meta-analysis actually look like? I did a lot of reading to answer these questions, and that's what I want to share with you. I'm only going to discuss a handful of studies. I can't match the force of evidence of the meta-analysis, which aggregated over two hundred studies. But this is how I came to understand what kind of evidence created the impression of a strong placebo effect, and what kind of evidence indicates that it's actually small. Examples: Depression The observation that created the impression of a placebo effect is that patients in the placebo group tend to get better during the trial. Here's an example from a trial of the first antidepressant that came to mind, which was Prozac. The paper is called "A double-blind, randomized, placebo-controlled trial of fluoxetine in children and adolescents with depression". In this test, high scores are bad. So we see both the drug group and the placebo group getting better at the beginning of the trial. By the end of the trial, the scores in those two groups are different, but that difference is not as big as the drop right at the beginning. I can see how someone could look at this and say that most of the effect of the drug is the placebo effect. In fact, the 1950s study that originally popularized the placebo effect consisted mainly of this kind of before-and-after comparison. Another explanation is simply that depression comes in months-long episodes. Patients will tend to be in a depressive episode when they're enrolled in a trial, and by the end many of them will have come out of it. If that's all there is to it, we would expect that a "no-pill" group (no drug, no placebo) would have the same drop. 
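To make that competing explanation concrete, here is a small, hypothetical simulation (my sketch, not from the post): if patients enroll near the peak of a months-long episode and many recover on their own, then the drug, placebo, and no-pill arms all show a large before-and-after drop, and the placebo arm drops by the same amount as the no-pill arm. Every number below is made up purely for illustration.

```python
# Hypothetical simulation (not from the post): if depression comes in episodes and
# patients enroll near an episode's peak, all three arms show a large drop by the
# end of the trial, even when the placebo itself does nothing.

import random

def simulate_arm(n_patients: int, drug_effect: float) -> float:
    """Return the mean drop in symptom score (higher score = worse) for one arm."""
    drops = []
    for _ in range(n_patients):
        baseline = random.gauss(25, 3)            # enrolled while in a depressive episode
        natural_recovery = random.uniform(5, 12)  # episode resolving on its own
        final = baseline - natural_recovery - drug_effect
        drops.append(baseline - final)
    return sum(drops) / n_patients

random.seed(0)
print("drug:    ", round(simulate_arm(200, drug_effect=3.0), 1))
print("placebo: ", round(simulate_arm(200, drug_effect=0.0), 1))
print("no pill: ", round(simulate_arm(200, drug_effect=0.0), 1))
# The placebo and no-pill arms drop by about the same amount; only a before/after
# comparison that lacks a no-pill arm makes the "placebo effect" look large.
```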
I looked through the depression studies cited in that big meta-analysis, but I didn't manage to find a graph precisely like the Prozac graph but with an additional no-pill group. Here's the closest that I found, from a paper called "Effects of maintenance amitriptyline and psychotherapy on symptoms of depression". Before I get into all the reasons why this isn't directly comparable, note that the placebo and no-pill curves look the same, both on top: The big difference is that this trial is testing ...
Jun 9, 2024 • 7min

LW - Dumbing down by Martin Sustrik

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dumbing down, published by Martin Sustrik on June 9, 2024 on LessWrong. In the past few years I've been blogging in Slovak, that is, downscaling from writing in English, a language with 1457 million speakers to a language with 7 million speakers. From the point of view of the writer, this has been a very different experience. It's not only that for a topic that interests one million English speakers, the equivalent is five thousand in Slovakia, scaling down by a factor of 200. It's also that a topic that interests 100 English speakers interests one half of a hypothetical Slovak speaker, that is, nobody. In fact, not everybody reads blogs, so the population in question is likely smaller by an order of magnitude or even two, resulting in even more fractional Slovaks... In other words, the reader population is not big enough to fill in all the possible niches and the writing thus has to become much more generic. It must also be "dumbed down". Not because Slovaks are less intelligent than other nations, but because the scale of the existing discourse is much smaller. While in English, no matter how esoteric your topic is, you can reference or link to the relevant discussion, in Slovak it often is the case that there's no discussion at all. The combination of the two factors above means that you have to explain yourself all the time. You want to mention game theory? You have to explain what you mean. You want to make a physics metaphor? You can't, if you care about being understood. You want to hint at some economic phenomenon? You have to explain yourself again. And often even the terminology is lacking. Even such a basic word as "policy" has no established equivalent. I had to ask a friend who works as a translator at the European Commission, just to be told that they use the word "politika" for this purpose. Which is definitely not a common meaning of the word. "Politika" typically means "politics" and using it for "policy" sounds really strange and awkward. (All of this gave me a gut-level understanding of how small populations can lose knowledge. Joe Henrich mentions a case of a small Inuit population getting isolated from the rest and gradually losing technology, including the kayak building skills, which in turn made it, in a vicious circle, unable to import other technology. This kind of thing also tends to be mentioned when speaking of dropping fertility rates and the possible inability of a smaller global population to keep the technology we take for granted today. Well, I can relate now.) Anyway, it's interesting to look at what kind of topics were popular in such a scaled-down environment. Interestingly, the most popular article (17k views) was a brief introduction to Effective Altruism. I have no explanation for that except that it was chance. Maybe it was because I wrote it on December 29th when there was not much other content? The readers, after all, judging from the comments, were not convinced, but rather experienced unpleasant cognitive dissonance, when they felt compelled to argue that saving one kid at home is better than saving five kids in Africa. (From comments:) Nice article. I've decided to support charity on a regular basis, but here in Slovakia, even if it's more expensive, because I think that maintaining life forcibly in Africa, where it is not doing well, goes against the laws of nature. 
I can imagine Africa without the people who kill each other in civil wars, who are unable to take care of their own offspring and the country. If someone wants to live there, mine diamonds or grow coffee, they should go there and start life anew, and perhaps on better foundations than the ones damaged in Africa years ago by the colonizers. A series of articles about the Swiss political system (altogether maybe 10k views). Interestingly, the equivalent in English was popular o...
Jun 9, 2024 • 7min

EA - How to change the law to have a large-scale impact on animal welfare? by Melvin Josse

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to change the law to have a large-scale impact on animal welfare?, published by Melvin Josse on June 9, 2024 on The Effective Altruism Forum. Convergence Animaux Politique (CAP) is a French charity specialising in political action for animals: discover our theory of change to save animal lives and reduce animal suffering. You can join our crowdfunding campaign to support our efforts. To understand the importance of theory of change, see this previous post. Most issues related to animals are intrinsically linked to institutionalized practices and frameworks, such as farming conditions, wildlife protection, animal testing regulations, and more. Animal protection goes beyond individual ethics; it is a matter of collective, political responsibility. To impact as many animals as possible, CAP operates at the political level on behalf of 25 partner NGOs, primarily on a national scale, to change the law in favor of animals. The creation of CAP in 2016 resulted from the observation that while animal protection was becoming an important issue for public opinion and the media in France, politicians did not concern themselves with it, as they did not perceive it as a politically legitimate topic. Hence the law did not change, and animals did not see their condition improve. To quote Lewis Bollard, "our challenge is to convert the popular support we already enjoy into the legal protections that [...] animals deserve." Weaknesses of the French animal movement that CAP aims to address We identified several reasons for this lack of political interest and action in France: Very few animal NGOs lobbied politicians and the ones that did, did it on a small scale and rather occasionally, which did not allow for the building of a sustained network of political allies. On the other hand, lobbies that favor the status quo and the exploitation of animals had long established influential networks, relying on significant financial resources. NGOs sometimes had differing agendas and political demands, which made them less clear and less visible for politicians (who most often will not take the time to decipher the positions of various groups to draw a consensus). Politicians interested in animal issues often did not know who to turn to for advice and support, because they lacked knowledge about the animal movement. There was a need for a clearly identified actor that would be able to act as an intermediary and to redirect politicians towards the relevant organizations on specific topics. Because very few politicians took public stances or political action for animals, sympathetic Members of Parliament (MPs) often did not dare to do so, for fear of being marginalized and hence losing political leverage. How CAP seeks to bring about change To achieve political change for animals, the main strategy of CAP is to bring more animal NGOs to lobby politicians, more massively and in a more coherent manner, and to create, grow and sustain a network of political allies in Parliament, in order to be able to mobilize it when needed, to build majorities, inside parliamentary groups, or more generally in parliament. 
Indeed, the support of a few MPs is not enough for bills to be passed, or even debated upon: for a bill to be put on the agenda, a parliamentary group needs to collectively agree to prioritize it and use its right to set a chamber's (National Assembly or Senate) agenda. CAP's actions can be summed up in four main inputs: We obtain meetings with MPs and members of the government (nearly 350 in seven years). Our goals are to raise their awareness of animal issues and of the political demands of our partner NGOs and push them to act upon them. We also aim to sound out their opinion and their willingness to act upon specific issues, which will help us and our partners to know who to appeal to when...
Jun 9, 2024 • 31sec

LW - Demystifying "Alignment" through a Comic by milanrosko

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Demystifying "Alignment" through a Comic, published by milanrosko on June 9, 2024 on LessWrong. I hope you enjoyed this brief overview. For the full comic visit: https://milanrosko.substack.com/p/button Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jun 9, 2024 • 32min

AF - 3a. Towards Formal Corrigibility by Max Harms

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 3a. Towards Formal Corrigibility, published by Max Harms on June 9, 2024 on The AI Alignment Forum. (Part 3a of the CAST sequence) As mentioned in Corrigibility Intuition, I believe that it's more important to find a simple, coherent, natural/universal concept that can be gestured at, rather than coming up with a precisely formal measure of corrigibility and using that to train an AGI. This isn't because formal measures are bad; in principle (insofar as corrigibility is a real concept) there will be some kind of function which measures corrigibility. But it's hard to capture the exact right thing with formal math, and explicit metrics have the tendency to blind people to the presence of better concepts that are nearby. Nevertheless, there are advantages in attempting to tighten up and formalize our notion of corrigibility. When using a fuzzy, intuitive approach, it's easy to gloss-over issues by imagining that a corrigible AGI will behave like a helpful, human servant. By using a sharper, more mathematical frame, we can more precisely investigate where corrigibility may have problems, such as by testing whether a purely corrigible agent behaves nicely in toy-settings. Sharp, English Definition The loose English definition I've used prior to this point has been: an agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes. Before diving into mathematical structures, I'd like to spend a moment attempting to sharpen this definition into something more explicit. In reaching for a crisp definition of corrigibility, we run the risk of losing touch with the deep intuition, so I encourage you to repeatedly check in with yourself about whether what's being built matches precisely with your gut-sense of the corrigible. In particular, we must be wary of both piling too much in, such that it ceases to be a single coherent target, becoming a grab-bag, and of stripping too much out, such that it loses necessary qualities. My best guess of where to start is in leaning deeper into the final bit of my early definition - the part about empowering the principal. Indeed, one of the only pre-existing attempts I've seen to formalize corrigibility also conceives of it primarily as about the principal having power (albeit general power over the agent's policy, as opposed to what I'm reaching for). Many of the emergent desiderata in the intuition doc also work as stories for why empowering the principal to fix mistakes is a good frame. New definition: an agent is corrigible when it robustly acts to empower the principal to freely fix flaws in the agent's structure, thoughts, and actions (including their consequences), particularly in ways that avoid creating problems for the principal that they didn't foresee. This new definition puts more emphasis on empowering the principal, unpacks the meaning of "opposite the trope of-" and drops the bit about "reflecting on itself as a flawed tool." 
While the framing of corrigibility as about reflectively-seeing-oneself-as-a-flawed-part-of-a-whole is a standard MIRI-ish framing of corrigibility, I believe that it leans too heavily into the epistemic/architectural direction and not enough on the corrigibility-from-terminal-values direction I discuss in The CAST Strategy. Furthermore, I suspect that the right sub-definition of "robust" will recover much of what I think is good about the flawed-tool frame. For the agent to "robustly act to empower the principal" I claim it naturally needs to continue to behave well even when significantly damaged or flawed. As an example, a robust process for creating spacecraft parts needs to, when subject to disruption and malfeasance, continue to either contin...
Jun 9, 2024 • 30min

AF - 3b. Formal (Faux) Corrigibility by Max Harms

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 3b. Formal (Faux) Corrigibility, published by Max Harms on June 9, 2024 on The AI Alignment Forum. (Part 3b of the CAST sequence) In the first half of this document, Towards Formal Corrigibility, I sketched a solution to the stop button problem. As I framed it, the solution depends heavily on being able to detect manipulation, which I discussed on an intuitive level. But intuitions can only get us so far. Let's dive into some actual math and see if we can get a better handle on things. Measuring Power To build towards a measure of manipulation, let's first take inspiration from the suggestion that manipulation is somewhat the opposite of empowerment. And to measure empowerment, let's begin by trying to measure "power" in someone named Alice. Power, as I touched on in the ontology in Towards Formal Corrigibility, is (intuitively) the property of having one's values/goals be causally upstream of the state of some part of the world, such that the agent's preferences get expressed through their actions changing reality. Let's imagine that the world consists of a Bayes net where there's a (multidimensional and probabilistic) node for Alice's Values, which can be downstream of many things, such as Genetics or whether Alice has been Brainwashed. In turn, her Values will be upstream of her (deliberate) Actions, as well as other side-channels such as her reflexive Body-Language. Alice's Actions are themselves downstream of nodes besides Values, such as her Beliefs, as well as upstream of various parts of reality, such as her Diet and whether Bob-Likes-Alice. As a simplifying assumption, let's assume that while the nodes upstream of Alice's Values can strongly affect the probability of having various Values, they can't determine her Values. In other words, regardless of things like Genetics and Brainwashing, there's always at least some tiny chance associated with each possible setting of Values. Likewise, we'll assume that regardless of someone's Values, they always have at least a tiny probability of taking any possible action (including the "null action" of doing nothing). And, as a further simplification, let's restrict our analysis of Alice's power to a single aspect of reality that's downstream of their actions which we'll label "Domain". ("Diet" and "Bob-Likes-Alice" are examples of domains, as are blends of nodes like those.) We'll further compress things by combining all nodes upstream of values (e.g. Genetics and Brainwashing) into a single node called "Environment" and then marginalize out all other nodes besides Actions, Values, and the Domain. The result should be a graph which has Environment as a direct parent of everything, Values as a direct parent of Actions and the Domain, and Actions as a direct parent of the Domain. Let's now consider sampling a setting of the Environment. Regardless of what we sample, we've assumed that each setting of the Values node is possible, so we can consider each counterfactual setting of Alice's Values. In this setting, with a choice of environment and values, we can begin to evaluate Alice's power. Because we're only considering a specific environment and choice of values, I'll call this "local power." 
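Here is one way the simplified setup could be written down in code, as a hypothetical Python sketch under my reading of the excerpt (not the author's formalism): a tiny discrete model with Environment upstream of everything, Values upstream of Actions and the Domain, Actions upstream of the Domain, and a small epsilon keeping every setting of Values and Actions possible. All node names, outcomes, and probabilities are invented for illustration.

```python
# Hypothetical sketch (my reading of the setup, not the author's math): a tiny
# discrete model with Environment -> Values -> Actions -> Domain, where no value
# or action ever has probability exactly zero, per the simplifying assumption.

import itertools

ENVIRONMENTS = ["benign", "brainwashy"]
VALUES = ["likes_tea", "likes_coffee"]
ACTIONS = ["null", "order_tea", "order_coffee"]
DOMAIN = ["tea_served", "coffee_served", "nothing"]

EPS = 0.01  # keeps every outcome possible

def p_values(v, env):
    """Environment shifts which values are likely but never rules any out."""
    favored = "likes_coffee" if env == "brainwashy" else "likes_tea"
    return (1 - EPS) if v == favored else EPS

def p_action(a, v, env):
    """Values drive actions; env is a parent too but unused in this toy."""
    favored = {"likes_tea": "order_tea", "likes_coffee": "order_coffee"}[v]
    return (1 - 2 * EPS) if a == favored else EPS

def p_domain(d, a, env):
    """Actions drive the Domain; env is a parent too but unused in this toy."""
    favored = {"null": "nothing", "order_tea": "tea_served", "order_coffee": "coffee_served"}[a]
    return (1 - 2 * EPS) if d == favored else EPS

def local_expected_domain(env, v):
    """Distribution over the Domain given a fixed environment and counterfactual values."""
    dist = {}
    for a, d in itertools.product(ACTIONS, DOMAIN):
        dist[d] = dist.get(d, 0.0) + p_action(a, v, env) * p_domain(d, a, env)
    return dist

# One ingredient of "local power": how much the Domain distribution shifts across
# counterfactual settings of Alice's Values, holding the Environment fixed.
for v in VALUES:
    dist = local_expected_domain("benign", v)
    print("benign env, values =", v, "->", {d: round(p, 2) for d, p in dist.items()})
```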
In an earlier attempt at formalization, I conceived of (local) power as a difference in expected value between sampling Alice's Action compared to the null action, but I don't think this is quite right. To demonstrate, let's imagine that Alice's body-language reveals her Values, regardless of her Actions. An AI which is monitoring Alice's body-language could, upon seeing her do anything at all, swoop in and rearrange the universe according to her Values, regardless of what she did. This might, naively, seem acceptable to Alice (since she gets what she wants), but it's not a good measure of my intuitive notion of power, since the c...
