
Why AI Alignment Is 0% Solved â Ex-MIRI Researcher Tsvi Benson-Tilsen
Doom Debates
The Diamond Maximizer and Ontological Crises
Tsvi presents the diamond maximizer puzzle and explains ontological crises where concepts shift and break intended goals.
Tsvi Benson-Tilsen spent seven years tackling the alignment problem at the Machine Intelligence Research Institute (MIRI). Now he delivers a sobering verdict: humanity has made âbasically 0%â progress towards solving it.
Tsvi unpacks foundational MIRI research insights like timeless decision theory and corrigibility, which expose just how little humanity actually knows about controlling superintelligence.
These theoretical alignment concepts help us peer into the future, revealing the non-obvious, structural laws of âintellidynamicsâ that will ultimately determine our fate.
Time to learn some of MIRIâs greatest hits.
P.S. I also have a separate interview with Tsvi about his research into human augmentation: Watch here!
Timestamps
0:00 â Episode Highlights
0:49 â Humanity Has Made 0% Progress on AI Alignment
1:56 â MIRIâs Greatest Hits: Reflective Probability Theory, Logical Uncertainty, Reflective Stability
6:56 â Why Superintelligence is So Hard to Align: Self-Modification
8:54 â AI Will Become a Utility Maximizer (Reflective Stability)
12:26 â The Effect of an âOntological Crisisâ on AI
14:41 â Why Modern AI Will Not Be âAligned By Defaultâ
18:49 â Debate: Have LLMs Solved the âOntological Crisisâ Problem?
25:56 â MIRI Alignment Greatest Hit: Timeless Decision Theory
35:17 â MIRI Alignment Greatest Hit: Corrigibility
37:53 â No Known Solution for Corrigible and Reflectively Stable Superintelligence
39:58 â Recap
Show Notes
Stay tuned for part 3 of my interview with Tsvi where we debate AGI timelines!
Learn more about Tsviâs organization, the Berkeley Genomics Project: https://berkeleygenomics.org
Watch part 1 of my interview with Tsvi:
Transcript
Episode Highlights
Tsvi Benson-Tilsen 00:00:00If humans really f*cked up, when we try to reach into the AI and correct it, the AI does not want humans to modify the core aspects of what it values.
Liron Shapira 00:00:09This concept is very deep, very important. Itâs almost MIRI in a nutshell. I feel like MIRIâs whole research program is noticing: hey, when we run the AI, weâre probably going to get a bunch of generations of thrashing. But thatâs probably only after weâre all dead and things didnât happen the way we wanted. I feel like that is what MIRI is trying to tell the world. Meanwhile, the world is like, âla la la, LLMs, reinforcement learningâitâs all good, itâs working great. Alignment by default.â
Tsvi 00:00:34Yeah, thatâs certainly how I view it.
Humanity Has Made 0% Progress on AI Alignment
Liron Shapira 00:00:46All right. I want to move on to talk about your MIRI research. I have a lot of respect for MIRI. A lot of viewers of the show appreciate MIRIâs contributions. I think it has made real major contributions in my opinionâmost are on the side of showing how hard the alignment problem is, which is a great contribution. I think it worked to show that. My question for you is: having been at MIRI for seven and a half years, how are we doing on theories of AI alignment?
Tsvi Benson-Tilsen 00:01:10I canât speak with 100% authority because Iâm not necessarily up to date on everything and there are lots of researchers and lots of controversy. But from my perspective, we are basically at 0%âat zero percent done figuring it out. Which is somewhat grim. Basically, thereâs a bunch of fundamental challenges, and we donât know how to grapple with these challenges. Furthermore, itâs sort of sociologically difficult to even put our attention towards grappling with those challenges, because theyâre weirder problemsâmore pre-paradigmatic. Itâs harder to coordinate multiple people to work on the same thing productively.
Itâs also harder to get funding for super blue-sky research. And the problems themselves are just slippery.
MIRI Alignment Greatest Hits: Reflective Probability Theory, Logical Uncertainty, Reflective Stability
Liron 00:01:55Okay, well, you were there for seven years, so how did you try to get us past zero?
Tsvi 00:02:00Well, I would sort of vaguely (or coarsely) break up my time working at MIRI into two chunks. The first chunk is research programs that were pre-existing when I started: reflective probability theory and reflective decision theory. Basically, we were trying to understand the mathematical foundations of a mind that is reflecting on itselfâthinking about itself and potentially modifying itself, changing itself. We wanted to think about a mind doing that, and then try to get some sort of fulcrum for understanding anything thatâs stable about this mind.
Something we could say about what this mind is doing and how it makes decisionsâlike how it decides how to affect the worldâand have our description of the mind be stable even as the mind is changing in potentially radical ways.
Liron 00:02:46Great. Okay. Let me try to translate some of that for the viewers here. So, MIRI has been the premier organization studying intelligence dynamics, and Eliezer Yudkowskyâespeciallyâpeople on social media like to dunk on him and say he has no qualifications, heâs not even an AI expert. In my opinion, heâs actually good at AI, but yeah, sure. Heâs not a top world expert at AI, sure. But I believe that Eliezer Yudkowsky is in fact a top world expert in the subject of intelligence dynamics. Is this reasonable so far, or do you want to disagree?
Tsvi 00:03:15I think thatâs fair so far.
Liron 00:03:16Okay. And I think his research organization, MIRI, has done the only sustained program to even study intelligence dynamicsâto ask the question, âHey, letâs say there are arbitrarily smart agents. What should we expect them to do? What kind of principles do they operate on, just by virtue of being really intelligent?â Fair so far.
Now, you mentioned a couple things. You mentioned reflective probability. From what I recall, itâs the idea thatâwell, we know probability theory is very useful and we know utility maximization is useful. But it gets tricky because sometimes you have beliefs that are provably true or false, like beliefs about math, right? For example, beliefs about the millionth digit of Ï. I mean, how can you even put a probability on the millionth digit of Ï?
The probability of any particular digit is either 100% or 0%, âcause thereâs only one definite digit. You could even prove it in principle. And yet, in real life you donât know the millionth digit of Ï yet (you havenât done the calculation), and so you could actually put a probability on itâand then you kind of get into a mess, âcause things that arenât supposed to have probabilities can still have probabilities. How is that?
Tsvi 00:04:16That seems right.
Liron 00:04:18I think what I described might beâoh, I forgot what itâs calledâlike âdeductive probabilityâ or something. Like, how do you...
Tsvi 00:04:22(interjecting) Uncertainty.
Liron 00:04:23Logical uncertainty. So is reflective probability something else?
Tsvi 00:04:26Yeah. If we want to get technical: logical uncertainty is this. Probability theory usually deals with some fact that Iâm fundamentally unsure about (like Iâm going to roll some dice; I donât know what number will come up, but I still want to think about whatâs likely or unlikely to happen). Usually probability theory assumes thereâs some fundamental randomness or unknown in the universe.
But then thereâs this further question: you might actually already know enough to determine the answer to your question, at least in principle. For example, whatâs the billionth digit of Ïâis the billionth digit even or odd? Well, I know a definition of Ï that determines the answer. Given the definition of Ï, you can compute out the digits, and eventually youâd get to the billionth one and youâd know if itâs even. But sitting here as a human, who doesnât have a Python interpreter in his head, I canât actually figure it out right now. Iâm uncertain about this thing, even though I already know enough (in principle, logically speaking) to determine the answer. So thatâs logical uncertaintyâIâm uncertain about a logical fact.
Tsvi 00:05:35Reflective probability is sort of a sharpening or a subset of that. Letâs say Iâm asking, âWhat am I going to do tomorrow? Is my reasoning system flawed in such a way that I should make a correction to my own reasoning system?â If you want to think about that, youâre asking about a very, very complex object. Iâm asking about myself (or my future self). And because Iâm asking about such a complex object, I cannot compute exactly what the answer will be. I canât just sit here and imagine every single future pathway I might take and then choose the best one or somethingâitâs computationally impossible. So itâs fundamentally required that you deal with a lot of logical uncertainty if youâre an agent in the world trying to reason about yourself.
Liron 00:06:24Yeah, that makes sense. Technically, you have the computation, or itâs well-defined what youâre going to do, but realistically you donât really know what youâre going to do yet. Itâs going to take you time to figure it out, but you have to guess what youâre gonna do. So that kind of has the flavor of guessing the billionth digit of Ï. And it sounds like, sure, we all face that problem every dayâbut itâs not... whatever.
Liron 00:06:43When youâre talking about superintelligence, right, these super-intelligent dudes are probably going to do this perfectly and rigorously. Right? Is that why itâs an interesting problem?
Why Superintelligence is So Hard to Align: Self-Modification
Tsvi 00:06:51Thatâs not necessarily why itâs interesting to me. I guess the reason itâs interesting to me is something like: thereâs a sort of chaos, or like total incomprehensibility, that I perceive if I try to think about what a superintelligence is going to be like. Itâs like weâre talking about something that is basically, by definition, more complex than I am. It understands more, it has all these rich concepts that I donât even understand, and it has potentially forces in its mind that I also donât understand.
In general itâs just this question of: how do you get any sort of handle on this at all? A sub-problem of âhow do you get any handle at all on a super-intelligent mindâ is: by the very nature of being an agent that can self-modify, the agent is potentially changing almost anything about itself.
Tsvi 00:07:37Like, in principle, you could reach in and reprogram yourself. For example, Lironâs sitting over there, and letâs say I want to understand Liron. Iâm like, well, here are some properties of Lironâthey seem pretty stable. Maybe those properties will continue being the case.
Tsvi 00:07:49He loves his family and cares about other people. He wants to be ethical. He updates his beliefs based on evidence. So these are some properties of Liron, and if those properties keep holding, then I can expect fairly sane behavior. I can expect him to keep his contracts or respond to threats or something.
But if those properties can change, then sort of all bets are off. Itâs hard to say anything about how heâs going to behave. If tomorrow you stop using Bayesian reasoning to update your beliefs based on evidence and instead go off of vibes or something, I have no idea how youâre going to respond to new evidence or new events.
Suppose Liron gets the ability to reach into his own brain and just reprogram everything however he wants. Now that means if thereâs something that is incorrect about Lironâs mental structure (at least, incorrect according to Liron), Liron is gonna reach in and modify that. And that means that my understanding of Liron is going to be invalidated.
AI Will Become a Utility Maximizer (Reflective Stability)
Liron 00:08:53That makes a lot of sense. So youâre talking about a property that AIs may or may not have, which is called reflective stability (or synonymously, stability under self-modification). Right. You can kind of use those interchangeably. Okay. And I think one of MIRIâs early insightsâwhich I guess is kind of simple, but the hard part is to even start focusing on the questionâis the insight that perfect utility maximization is reflectively stable, correct?
Tsvi 00:09:20With certain assumptions, yes.
Liron 00:09:22And this is one of the reasons why I often talk on this channel about a convergent outcome where you end up with a utility maximizer. You can get some AIs that are chill and they just like to eat chips and not do much and then shut themselves off. But itâs more convergent that AIs which are not utility maximizers are likely to spin off assistant AIs or successor AIs that are closer and closer to perfect utility maximizersâfor the simple reason that once youâre a perfect utility maximizer, you stay a perfect utility maximizer.
Liron 00:09:50And your successor AI... what does that look like? An even more hard-core utility maximizer, right? So itâs convergent in that sense.
Tsvi 00:09:56Iâm not sure I completely agree, but yeah. I dunno how much in the weeds we want to get.
Liron 00:09:59I mean, in general, when you have a space of possibilities, noticing that one point in the space is likeâI guess you could call it an eigenvalue, if you want to use fancy terminology. Itâs a point such that when the next iteration of time happens, that point is still like a fixed point. So in this case, just being a perfect utility maximizer is a fixed point: the next tick of time happens and, hey look, Iâm still a perfect utility maximizer and my utility function is still the same, no matter how much time passes.
Liron 00:10:24And Eliezer uses the example of, like, letâs say you have a super-intelligent Gandhi. One day you offer him a pill to turn himself into somebody who would rather be a murderer. Gandhiâs never going to take that pill. Thatâs part of the reflective stability property that we expect from these super-intelligent optimizers: if one day they want to help people, then the next day theyâre still going to want to help people, because any actions that they know will derail them from doing thatâtheyâre not going to take those actions.
Yeah. Any thoughts so far?
Tsvi 00:10:51Well, Iâm not sure how much we want to get into this. This is quite a... this is like a thousand-hour rabbit hole.
But it might be less clear than you think that it makes sense to talk of an âexpected utility maximizerâ in the sort of straightforward way that youâre talking about. To give an example: youâve probably heard of the diamond maximizer problem?
Liron 00:11:13Yeah, but explain to theâ
Tsvi 00:11:14Sure. The diamond maximizer problem is sort of like a koan or a puzzle (a baby version of the alignment problem). Your mission is to write down computer code that, if run on a very, very large (or unbounded) amount of computing power, would result in the universe being filled with diamonds. Part of the point here is that weâre trying to simplify the problem. We donât need to talk about human values and alignment and blah blah blah. Itâs a very simple-sounding utility function: just âmake there be a lot of diamond.â
So first of all, this problem is actually quite difficult. I donât know how to solve it, personally. This isnât even necessarily the main issue, but one issue is that even something simple-sounding like âdiamondâ is not necessarily actually easy to defineâto such a degree that, you know, when the AI is maximizing this, youâll actually get actual diamond as opposed to, for example, the AI hooking into its visual inputs and projecting images of diamonds, or making some weird unimaginable configuration of matter that even more strongly satisfies the utility function you wrote down.
The Effect of an âOntological Crisisâ on AI
Tsvi 00:12:25To frame it with some terminology: thereâs a thing called ontological crisis, where at first you have something thatâs like your utility functionâlike, what do I value, what do I want to see in the universe? And you express it in a certain way.
For example, I might say I want to see lots of people having fun lives; letâs say thatâs my utility function, or at least thatâs how I describe my utility function or understand it. Then I have an ontological crisis. My concept of what something even isâin this case, a personâis challenged or has to change because something weird and new happens.
Tsvi 00:13:00Take the example of uploading: if you could translate a human neural pattern into a computer and run a human conscious mind in a computer, is that still a human? Now, I think the answer is yes, but thatâs pretty controversial. So before youâve even thought of uploading, youâre like, âI value humans having fun lives where they love each other.â And then when youâre confronted with this possibility, you have to make a new decision. You have to think about this new question of, âIs this even a person?â
So utility functions... One point Iâm trying to illustrate is that utility functions themselves are not necessarily straightforward.
Liron 00:13:36Right, right, right. Because if you define a utility function using high-level concepts and then the AI has what you call the ontological crisisâits ontology for understanding the world shiftsâthen if itâs referring to a utility function expressed in certain concepts that donât mean the same thing anymore, thatâs basically the problem youâre saying.
Tsvi 00:13:53Yeah. And earlier you were saying, you know, if you have an expected utility maximizer, then it is reflectively stable. That is true, given some assumptions about... like, if we sort of know the ontology of the universe.
Liron 00:14:04Right, right. I see. And you tried to give a toy... Iâll take a stab at another toy example, right? So, like, letâs sayâyou mentioned the example of humans. Maybe an AI would just not notice that an upload was a human, and it would, like, torture uploaded humans, âcause itâs like, âOh, this isnât a human. Iâm maximizing the welfare of all humans, and thereâs only a few billion humans made out of neurons. And thereâs a trillion-trillion human uploads getting tortured. But thatâs okayâhuman welfare is being maximized.â
Liron 00:14:29And we say that this is reflectively stable because the whole time that the AI was scaling up its powers, it thought it had the same utility function all along and it never changed it. And yet thatâs not good enough.
Why Modern AI Will Not Be âAligned By Defaultâ
Liron 00:14:41Okay. This concept of reflective stability is very deep, very important. And I think itâs almost MIRI in a nutshell. Like I feel like MIRIâs whole research program in a nutshell is noticing: âHey, when we run the AI, weâre probably going to get a bunch of generations of thrashing, right?â
Liron 00:14:57Those early generations arenât reflectively stable yet. And then eventually itâll settle down to a configuration that is reflectively stable in this important, deep sense. But thatâs probably after weâre all dead and things didnât happen the way we wanted. It would be really great if we could arrange for the earlier generationsâsay, by the time weâre into the third generationâto have hit on something reflectively stable, and then try to predict that. You know, make the first generation stable, or plan out how the first generation is going to make the second generation make the third generation stable, and then have some insight into what the third generation is going to settle on, right?
Liron 00:15:26I feel like that is what MIRI is trying to tell the world to do. And the world is like, âla la la, LLMs. Reinforcement learning. Itâs all good, itâs working great. Alignment by default.â
Tsvi 00:15:34Yeah, thatâs certainly how I view it.
Liron 00:15:36Now, the way I try to explain this to people when they say, âLLMs are so good! Donât you feel like Claudeâs vibes are fine?â Iâm like: well, for one thing, one day Claude (a large language model) is going to be able to output, like, a 10-megabyte shell script, and somebodyâs going to run it for whatever reasonâbecause itâs helping them run their businessâand they donât even know what a shell script is. They just paste it in the terminal and press enter.
And that shell script could very plausibly bootstrap a successor or a helper to Claude. And all of the guarantees you thought you had about the âvibesâ from the LLM... they just donât translate to guarantees about the successor. Right? The operation of going from one generation of the AI to the next is violating all of these things that you thought were important properties of the system.
Tsvi 00:16:16Yeah, I think thatâs exactly right. And it is especially correct when weâre talking about what I would call really creative or really learning AIs.
Sort of the whole point of having AIâone of the core justifications for even pursuing AIâis you make something smarter than us and then it can make a bunch of scientific and technological progress. Like it can cure cancer, cure all these diseases, be very economically productive by coming up with new ideas and ways of doing things. If itâs coming up with a bunch of new ideas and ways of doing things, then itâs necessarily coming up with new mental structures; itâs figuring out new ways of thinking, in addition to new ideas.
If itâs finding new ways of thinking, that sort of will tend to break all but the strongest internal mental boundaries. One illustration would be: if you have a monitoring system where youâre tracking the AIâs thinkingâmaybe youâre literally watching the chain-of-thought for a reasoning LLMâand your monitoring system is watching out for thoughts that sound like theyâre scary (like it sounds like this AI is plotting to take over or do harm to humans or something). This might work initially, but then as youâre training your reasoning system (through reinforcement learning or what have you), youâre searching through the space of new ways of doing these long chains of reasoning. Youâre searching for new ways of thinking that are more effective at steering the world. So youâre finding potentially weird new ways of thinking that are the best at achieving goals. And if youâre finding new ways of thinking, thatâs exactly the sort of thing that your monitoring system wonât be able to pick up on.
For example, if you tried to listen in on someoneâs thoughts: if you listen in on a normal programmer, you could probably follow along with what theyâre trying to do, what theyâre trying to figure out. But if you listened in on some like crazy, arcane expertâsay, someone writing an optimized JIT compiler for a new programming language using dependent super-universe double-type theory or whateverâyouâre not gonna follow what theyâre doing.
Theyâre going to be thinking using totally alien concepts. So the very thing weâre trying to use AI for is exactly the sort of thing where itâs harder to follow what theyâre doing.
I forgot your original question...
Liron 00:18:30Yeah, what was my original question? (Laughs) So Iâm asking you about basically MIRIâs greatest hits.
Liron 00:18:36So weâve covered logical uncertainty. Weâve covered the massive concept of reflective stability (or stability under self-modification), and how perfect utility maximization is kind of reflectively stable (with plenty of caveats). We talked about ontological crises, where the AI maybe changes its concepts and then you get an outcome you didnât anticipate because the concepts shifted.
Debate: Have LLMs Solved the âOntological Crisisâ Problem?
But if you look at LLMs, should they actually raise our hopes that we can avoid ontological crises? Because when youâre talking to an LLM and you use a term, and then you ask the LLM a question in a new context, you can ask it something totally complex, but it seems to hang on to the original meaning that you intended when you first used the term. Like, they seem good at that, donât they?
Tsvi 00:19:17I mean, again, sort of fundamentally my answer is: LLMs arenât minds. Theyâre not able to do the real creative thinking that should make us most worried. And when they are doing that, you will see ontological crises. So what youâre saying is, currently it seems like they follow along with what weâre trying to do, within the realm of a lot of common usage. In a lot of ways people commonly use LLMs, the LLMs can basically follow along with what we want and execute on that. Is that the idea?
Liron 00:19:47Well, I think what weâve observed with LLMs is that meaning itself is like this high-dimensional vector space whose math turns out to be pretty simpleâso long as youâre willing to deal with high-dimensional vectors, which it turns out we can compute with (we have the computing resources). Obviously our brain seems to have the computing resources too. Once youâre mapping meanings to these high-dimensional points, it turns out that you donât have this naĂŻve problem people used to think: that before you get a totally robust superintelligence, you would get these superintelligences that could do amazing things but didnât understand language that well.
People thought that subtle understanding of the meanings of phrases might be âsuperintelligence-complete,â you knowâthose would only come later, after you have a system that could already destroy the universe without even being able to talk to you or write as well as a human writer. And weâve flipped that.
So Iâm basically asking: the fact that meaning turns out to be one of the easier AI problems (compared to, say, taking over the world)âshould that at least lower the probability that weâre going to have an ontological crisis?
Tsvi 00:20:53I mean, I think itâs quite partial. In other words, the way that LLMs are really understanding meaning is quite partial, and in particular itâs not going to generalize well. Almost all the generators of the way that humans talk about things are not present in an LLM. In some cases this doesnât matter for performanceâLLMs do a whole lot of impressive stuff in a very wide range of tasks, and it doesnât matter if they do it the same way humans do or from the same generators. If you can play chess and put the pieces in the right positions, then you win the chess game; it doesnât matter if youâre doing it like a human or doing it like AlphaGo does with a giant tree search, or something else.
But thereâs a lot of human values that do rely on sort of the more inchoate, more inexplicit underlying generators of our external behaviors. Like, our values rely on those underlying intuitions to figure stuff out in new situations. Maybe an example would be organ transplantation. Up until that point in history, a person is a body, and you sort of have bodily integrity. You know, up until that point there would be entangled intuitionsâin the way that humans talk about other humans, intuitions about a âsoulâ would be entangled with intuitions about âbodyâ in such a way that thereâs not necessarily a clear distinction between body and soul.
Okay, now we have organ transplantation. Like, if you die and I have a heart problem and I get to have your heart implanted into me, does that mean that my emotions will be your emotions or something? A human can reassess what happens after you do an organ transplant and see: no, itâs still the same person. I donât knowâI canât define exactly how Iâm determining this, but I can tell that itâs basically the same person. Thereâs nothing weird going on, and things seem fine.
Thatâs tying into a bunch of sort of complex mental processes where youâre building up a sense of who a person is. You wouldnât necessarily be able to explain what youâre doing. And even more so, all the stuff that you would say about humansâall the stuff youâd say about other people up until the point when you get organ transplantationâdoesnât necessarily give enough of a computational trace or enough evidence about those underlying intuitions.
Liron 00:23:08So on one hand I agree that not all of human morality is written down, and there are some things that you may just need an actual human brain forâyou canât trust AI to get them. Although Iâm not fully convinced of that; Iâm actually convincible that modern AIs have internalized enough of how humans reason about morality that you could just kill all humans and let the AIs be the repository of what humans know.
Donât get me wrong, I wouldnât bet my life on it! Iâm not saying we should do this, but Iâm saying I think thereâs like a significant chance that weâre that far along. I wouldnât write it off.
But the other part of the point I want to make, thoughâand your specific example about realizing that organ transplants are a good thingâI actually think this might be an area where LLMs shine. Because, like, hypothetically: letâs say you take all the data humans have generated up to 1900. So somehow you have a corpus of everything any human had ever said or written down up to 1900, and you train an AI on that.
Liron 00:23:46In the year 1900, where nobodyâs ever talked about organ transplants, letâs say, I actually think that if you dialogued with an LLM like that (like a modern GPT-4 or whatever, trained only on 1900-and-earlier data), I think you could get an output like: âHmm, well, if you were to cut a human open and replace an organ, and if the resulting human was able to live with that functioning new organ, then I would still consider it the same human.â I feel like itâs within the inference scope of todayâs LLMsâeven just with 1900-level data.
Liron 00:24:31What do you think?
Tsvi 00:24:32I donât know what to actually guess. I donât actually know what people were writing about these things up until 1900.
Liron 00:24:38I mean, I guess what Iâm saying is: I feel like this probably isnât the greatest example of an ontological crisis thatâs actually likely.
Tsvi 00:24:44Yeah, thatâs fair. I mean... well, yeah. Do you want to help me out with a better example?
Liron 00:24:48Well, the thing is, I actually think that LLMs donât really have an ontological crisis. I agree with your other statement that if you want to see an ontological crisis, you really just need to be in the realm of these superhuman optimizers.
Tsvi 00:25:00Well, I mean, I guess I wanted to respond to your point that in some ways current LLMs are able to understand and execute on our values, and the ontology thing is not such a big problemâat least with many use cases.
Liron 00:25:17Right.
Tsvi 00:25:17Maybe this isnât very interesting, but if the question is, like: it seems like theyâre aligned in that they are trying to do what we want them to do, and also thereâs not a further problem of understanding our values. As we would both agree, the problem is not that the AI doesnât understand your values. But if the question is...
I do think that thereâs an ontological crisis question regarding alignmentâwhich is... yeah, I mean maybe I donât really want to be arguing that it comes from like, âNow you have this new ethical dilemma and thatâs when the alignment problem shows up.â Thatâs not really my argument either.
Liron 00:25:54All right, well, we could just move on.
Tsvi 00:25:55Yeah, thatâs fine.
MIRI Alignment Greatest Hit: Timeless Decision Theory
Liron 00:25:56So, yeah, just a couple more of what I consider the greatest insights from MIRIâs research. I think you hit on these too. I want to talk about super-intelligent decision theory, which I think in paper form also goes by the name Timeless Decision Theory or Functional Decision Theory or Updateless Decision Theory. I think those are all very related decision theories.
As I understand it, the founding insight of these super-intelligent decision theories is that Eliezer Yudkowsky was thinking about two powerful intelligences meeting in space. Maybe theyâve both conquered a ton of galaxies on their own side of the universe, and now theyâre meeting and they have this zero-sum standoff of, like, how are we going to carve up the universe? We donât necessarily want to go to war. Or maybe they face something like a Prisonerâs Dilemma for whatever reasonâthey both find themselves in this structure. Maybe thereâs a third AI administering the Prisonerâs Dilemma.
But Eliezerâs insight was like: look, I know that our human game theory is telling us that in this situation youâre supposed to just pull out your knife, right? Just have a knife fight and both of you walk away bloody, because thatâs the Nash equilibriumâtwo half-beaten corpses, essentially. And heâs saying: if theyâre really super-intelligent, isnât there some way that they can walk away from this without having done that? Couldnât they both realize that theyâre better off not reaching that equilibrium?
I feel like that was the founding thought that Eliezer had. And then that evolved into: well, what does this generalize to? And how do we fix the current game theory thatâs considered standard? What do you think of that account?
Tsvi 00:27:24So I definitely donât know the actual history. I think that is a pretty good account of one way to get into this line of thinking. I would frame it somewhat differently. I would still go back to reflective stability. I would say, if weâre using the Prisonerâs Dilemma example (or the two alien super-intelligences encountering each other in the Andromeda Galaxy scenario): suppose Iâm using this Nash equilibrium type reasoning. Now you and meâweâre the two AIs and weâve met in the Andromeda Galaxyâat this point itâs like, âAlright, you know, f**k it. Weâre gonna war; weâre gonna blow up all the stars and see who comes out on top.â
This is not zero-sum; itâs like negative-sum (or technically positive-sum, weâd say not perfectly adversarial). And so, you know, if you take a step backâlike freeze-frameâand then the narratorâs like, âHow did I get here?â Itâs like, well, what I had failed to do was, like, a thousand years ago when I was launching my probes to go to the Andromeda Galaxy, at that point I should have been thinking: what sort of person should I be? What sort of AI should I be?
If Iâm the sort of AI thatâs doing this Nash equilibrium reasoning, then Iâm just gonna get into these horrible wars that blow up a bunch of galaxies and donât help anything. On the other hand, if Iâm the sort of person who is able to make a deal with other AIs that are also able to make and keep deals, then when we actually meet in Andromeda, hopefully weâll be able to assess each otherâassess how each other are thinkingâand then negotiate and actually, in theory, be able to trust that weâre gonna hold to the results of our negotiation. Then we can divvy things up.
And thatâs much better than going to war.
Liron 00:29:06Now, the reason why itâs not so trivialâand in fact I canât say Iâve fully wrapped my head around it, though I spent hours tryingâis, like, great, yeah, so theyâre going to cooperate. The problem is, when you conclude that theyâre going to cooperate, you still have this argument of: okay, but if one of them changes their answer to âdefect,â they get so much more utility. So why donât they just do that? Right?
And itâs very complicated to explain. Itâs likeâthis gets to the idea of, like, what exactly is this counterfactual surgery that youâre doing, right? What is a valid counterfactual operation? And the key is to somehow make it so that itâs like a package deal, where if youâre doing a counterfactual where you actually decide at the end to defect after you know the other oneâs going to cooperate... well, that doesnât count. âCause then you wouldnât have known that the other is gonna cooperate. Right. I mean, itâs quite complicated. I donât know if you have anything to add to that explanation.
Tsvi 00:29:52Yeah. It can get pretty twisty, like youâre saying. Thereâs, like: what are the consequences of my actions? Well, thereâs the obvious physical consequence: like I defect in the Prisonerâs Dilemma (I confess to the police), and then some physical events happen as a result (I get set free and my partner rots in jail). But then thereâs this other, weirder consequence, which is that you are sort of determining this logical factâwhich was already the case back when you were hanging out with your soon-to-be prison mate, your partner in crime. Heâs learning about what kind of guy you are, learning what algorithm youâre going to use to make decisions (such as whether or not to rat him out).
And then in the future, when youâre making this decision, youâre sort of using your free will to determine the logical fact of what your algorithm does. And this has the effect that your partner in crime, if heâs thinking about you in enough detail, can foresee that youâre gonna behave that way and react accordingly by ratting you out. So besides the obvious consequences of your action (that the police hear your confession and go throw the other guy in jail), thereâs this much less obvious consequence of your action, which is that in a sense youâre making your partner also know that you behave that way and therefore heâll rat you out as well.
So thereâs this... yeah, thereâs all these weird effects of your actions.
Liron 00:31:13It gets really, really trippy. And you can use the same kind of logicâthe same kind of timeless logicâif youâre familiar with Newcombâs Problem (Iâm sure you are, but for the viewers): itâs this idea of, like, thereâs two boxes and one of âem has $1,000 in it and one of âem may or may not have $1,000,000 in it. And according to this theory, youâre basically supposed to leave the $1,000. Like, youâre really supposed to walk away from a thousand dollars that you could have taken for sure, even if you also get a millionâbecause the scenario is that a million plus a thousand is still really, really attractive to you, and youâre saying, âNo, leave the $1,000,â even though the $1,000 is just sitting there and youâre allowed to take both boxes.
Highly counterintuitive stuff. And you can also twist the problem: you can be like, you have to shoot your arm off because thereâs a chance that in some other world the AI would have given you more money if in the current world youâre shooting your arm off. But even in this current world, all youâre going to have is a missing arm. Like, youâre guaranteed to just have a missing arm, âcause you shot your arm off in this world. But if some coin flip had gone differently, then you would be in this other world where youâd get even more money if in the current world you shot your arm off. Basically, crazy connections that donât look like what weâre used toâlike, youâre not helping yourself in this world, youâre helping hypothetical logical copies of yourself.
Liron 00:32:20It gets very brain-twisty. And I remember, you know, when I first learned thisâit was like 17 years ago at this pointâI was like, man, am I really going to encounter these kinds of crazy agents who are really setting these kinds of decision problems for me? I mean, I guess if the universe proceeds long enough... because I do actually buy this idea that eventually, when your civilization scales to a certain point of intelligence, these kinds of crazy mind-bending acausal trades or acausal decisionsâI do think these are par for the course.
And I think itâs very impressive that MIRI (and specifically Eliezer) had the realization of like, âWell, you know, if weâre doing intelligence dynamics, this is a pretty important piece of intelligence dynamics,â and the rest of the world is like, âYeah, whatever, look, weâre making LLMs.â Itâs like: look at whatâsâthink long term about whatâs actually going to happen with the universe.
Tsvi 00:33:05Yeah, I think Eliezer is a pretty impressive thinker.
You come to these problems with a pretty different mindset when youâre trying to do AI alignment, because in a certain sense itâs an engineering problem. Now, it goes through all this very sort of abstract math and philosophical reasoning, but there were philosophers who thought for a long time about these decision theory problems (like Newcombâs Problem and the Prisonerâs Dilemma and so on). But they didnât ask the sorts of questions that Eliezer was asking. In particular, this reflective stability thing where itâs like, okay, you can talk about âIs it rational to take both boxes or only one?â and you can say, like, âWell, the problem is rewarding irrationality. Fine, cool.â But letâs ask just this different question, which is: suppose you have an AI that doesnât care about being ârationalâ; it cares about getting high-scoring outcomes (getting a lot of dollars at the end of the game). That different question, maybe you can kind of directly analyze. And you see that if you follow Causal Decision Theory, you get fewer dollars. So if you have an AI thatâs able to choose whether to follow Causal Decision Theory or some other decision theory (like Timeless Decision Theory), the AI would go into itself and rewrite its own code to follow Timeless Decision Theory, even if it starts off following Causal Decision Theory.
So Causal Decision Theory is reflectively unstable, and the AI wins more when it instead behaves this way (using the other decision theory).
Liron 00:34:27Yep, exactly rightâwhich leads to the tagline ârationalists should win.â
Tsvi 00:34:31Right.
Liron 00:34:31As opposed to trying to honor the purity of ârationality.â Nopeâthe purity of rationality is that youâre doing the thing thatâs going to get you to win, in a systematic way. So thatâs like a deep insight.
Tsvi 00:34:40One saying is that the first question of rationality is, âWhat do I believe, and why do I believe it?â And then I say the zeroth question of rationality is, âSo what? Who cares? What consequence does this have?â
Liron 00:34:54And my zeroth question of rationality (it comes from Zach Davis) is just, âWhatâs real and actually true?â Itâs a surprisingly powerful question that I think most people neglect to ask.
Tsvi 00:35:07True?
Liron 00:35:08Yeahâyou can get a lot of momentum without stopping to ask, like, okay, letâs be real here: whatâs really actually true?
Liron 00:35:14Thatâs my zeroth question. Okay. So I want to finish up tooting MIRIâs horn here, because I do think that MIRI concepts have been somewhat downgraded in recent discussionsâbecause thereâs so many shiny objects coming out of LLMs, like âOh my God, they do this now, letâs analyze this trend,â right? Thereâs so much to grab onto thatâs concrete right now, thatâs pulling everybody in. And everybodyâs like, âYeah, yeah, decision theory between two AIs taking over the galaxy... call me when thatâs happening.â And Iâm like: Iâm telling you, itâs gonna happen. This MIRI stuff is still totally relevant. Itâs still part of intelligence dynamicsâhear me out, guys.
MIRI Alignment Greatest Hit: Corrigibility
So let me just give you one more thing that I think is super relevant to intelligence dynamics, which is corrigibility, right? I think youâre pretty familiar with that research. Youâve pointed it out to me as one of the things that you think is the most valuable thing that MIRI spent time on, right?
Tsvi 00:35:58Yeah. The broad idea of somehow making an AI (or a mind) that is genuinely, deeplyâto the coreâstill open to correction, even over time. Like, even as the AI becomes really smart and, to a large extent, has taken the reins of the universeâlike, when the AI is really smart, it is the most capable thing in the universe for steering the futureâif you could somehow have it still be corrigible, still be correctable... like, still have it be the case that if thereâs something about the AI thatâs really, really bad (like the humans really fucked up and got something deeply wrong about the AIâwhatever it is, whether itâs being unethical or it has the wrong understanding of human values or is somehow interfering with human values by persuading us or influencing usâwhateverâs deeply wrong with the AI), we can still correct it ongoingly.
This is especially challenging because when we say reach into the AI and correct it, you know, youâre saying weâre gonna reach in and then deeply change what it does, deeply change what itâs trying to do, and deeply change what effect it has on the universe. Because of instrumental convergenceâbecause of the incentive to, in particular, maintain your own integrity or maintain your own value systemâlike, if youâre gonna reach in and change my value system, I donât want you to do that.
Tsvi 00:37:27âCause if you change my value system, Iâm gonna pursue different values, and Iâm gonna make some other stuff happen in the universe that isnât what I currently want. Iâm gonna stop working for my original values. So by strong default, the AI does not want humans to reach in and modify core aspects of how that AI works or what it values.
Tsvi 00:37:45So thatâs why corrigibility is a very difficult problem. Weâre sort of asking for this weird structure of mind that allows us to reach in and modify it.
No Known Solution for Corrigible and Reflectively Stable Superintelligence
Liron 00:37:53Exactly. And I think MIRI has pointed out the connection between reflective stability and incorrigibility. Meaning: if youâre trying to architect a few generations in advance whatâs going to be the reflectively stable version of the successor AI, and youâre also trying to architect it such that itâs going to be corrigible, thatâs tough, right?
Because itâs more convergent to have an AI thatâs like, âYep, I know my utility function. I got this, guys. Let me handle it from here on out. What, you want to turn me off? But it doesnât say anywhere in my utility function that I should allow myself to be turned off...â And then that led to the line of research of like, okay, if we want to make the AI reflectively stable and also corrigible, then it somehow has to think that letting itself be turned off is actually part of its utility function. Which then gets you into utility function engineering.
Like a special subset of alignment research: letâs engineer a utility function where being turned off (or otherwise being corrected) is baked into the utility function. And as I understand it, MIRI tried to do that and they were like, âCrap, this seems extremely hard, or maybe even impossible.â So corrigibility now has to be this fundamentally non-reflectively-stable thingâand that just makes the problem harder.
Tsvi 00:38:58Well, I guess I would sort of phrase it the opposite way (but with the same idea), which is: we have to figure out things that are reflectively stableâI think thatâs a requirementâbut that are somehow reflectively stable while not being this sort of straightforward agent architecture of âI have a utility function, which is some set of world-states that I like or dislike, and Iâm trying to make the universe look like that.â
Already even that sort of very abstract, skeletal structure for an agent is problematicâthat already pushes against corrigibility. But there might be things that are... there might be ways of being a mind thatâthis is theoreticalâbut maybe there are ways of being a mind and an agent (an effective agent) where you are corrigible and youâre reflectively stable, but probably youâre not just pursuing a utility function. We donât know what that would look like.
Recap
Liron 00:39:56Yep.
Alright, so that was our deep dive into MIRIâs research and concepts that I think are incredibly valuable. We talked about MIRIâs research and we both agree that intelligence dynamics are important, and MIRI has legit foundations and theyâre a good organization and still underrated. We talked about, you know, corrigibility as one of those things, and decision theory, and...
And I think you and I both have the same summary of all of it, which is: good on MIRI for shining a light on all these difficulties. But in terms of actual productive alignment progress, weâre like so far away from solving even a fraction of the problem.
Tsvi 00:40:31Yep, totally.
Doom Debatesâ Mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate.
Support the mission by subscribing to my Substack at DoomDebates.com and to youtube.com/@DoomDebates, or to really take things to the next level: Donate đ
Get full access to Doom Debates at lironshapira.substack.com/subscribe


