Doom Debates

Why AI Alignment Is 0% Solved — Ex-MIRI Researcher Tsvi Benson-Tilsen

Oct 31, 2025
Tsvi Benson-Tilsen, a former MIRI researcher, spent seven years grappling with AI alignment challenges. He reveals a stark truth: humanity has made virtually no progress on this complex issue. Tsvi delves into critical concepts like reflective decision theory and corrigibility, illuminating why controlling superintelligence is so daunting. He discusses the implications of self-modifying AIs and the risks of ontological crises, prompting important debates about the limitations of current AI models and the urgent need for effective alignment strategies.
AI Snips
INSIGHT

Alignment Progress Is Essentially Zero

  • Tsvi argues humanity has made essentially 0% progress on AI alignment because the core problems are slippery and pre-paradigmatic.
  • He points to sociological and funding barriers that discourage the deep theoretical work the problem requires.
INSIGHT

Model Minds That Modify Themselves

  • MIRI studied reflective probability and decision theory to model minds that reason about, and modify, themselves.
  • The goal is to find descriptions of a mind that remain accurate even as that mind self-modifies; one formal handle on this is sketched below.
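One formal handle from this line of work (my gloss, not a quote from the episode) is the probabilistic reflection principle of Christiano, Yudkowsky, Herreshoff, and Barasz's "Definability of Truth in Probabilistic Logic": a coherent probability assignment P over sentences can know its own values approximately, even though exact self-knowledge is blocked by Liar-style diagonalization. Roughly:

```latex
% Probabilistic reflection schema (Christiano et al. 2013), stated roughly:
% for every sentence \varphi and rationals a < b,
\forall \varphi,\ \forall a, b \in \mathbb{Q}:\quad
  a < P(\varphi) < b \;\Longrightarrow\; P\bigl(a < P(\ulcorner\varphi\urcorner) < b\bigr) = 1
```

The strict inequalities are what make the schema consistent: demanding that P know its own values exactly reintroduces the diagonal paradox, while this approximate form admits a fixed point.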
INSIGHT

Logical Uncertainty Matters For Self-Reflection

  • Logical uncertainty arises when an agent must assign probabilities to facts it could deduce in principle but lacks the compute to settle.
  • Reasoning reflectively about one's own future decisions forces this kind of uncertainty everywhere; a toy example follows below.
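A toy illustration of the idea (mine, not from the episode; it assumes the mpmath library for digits of pi): the 100th decimal digit of pi is fully settled by deduction, but a bounded reasoner who has not yet run the computation can only assign it a probability, say 1/10 per digit by symmetry.

```python
# Toy logical uncertainty: a deducible fact treated probabilistically
# until the computation is actually run (assumes mpmath is installed).
from mpmath import mp

prior = 1 / 10  # symmetry heuristic: no reason to favor any particular digit

mp.dps = 110                     # work with 110 significant digits of pi
digits = str(mp.pi)              # "3.14159..."
digit_100 = int(digits[2 + 99])  # skip "3.", index the 100th decimal place

posterior = 1.0 if digit_100 == 9 else 0.0
print(f"digit = {digit_100}, prior = {prior}, after computing = {posterior}")
```

A reflective agent is in the same position with respect to its own future choices: predicting what it will decide is a computation it has not yet performed, so it must hold probabilistic beliefs about its own logical facts.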