Justified Posteriors

Seth Benzell and Andrey Fradkin
Aug 25, 2025 • 57min

AI and its labor market effects in the knowledge economy

In this episode, we discuss a new theoretical framework for understanding how AI integrates into the economy. Using the paper Artificial Intelligence and the Knowledge Economy (Ide & Talamas, JPE), we debate whether AI will function as a worker, a manager, or an expert — and how each role affects inequality, organizational design, and macroeconomic dynamics.

We explore predictions about the rise of "AI agent managers," wage polarization, and the limits of stylized theory models. Along the way, we contrast the paper with alternative approaches, critique assumptions (like infinite equally valuable problems), and consider implications for small businesses, entrepreneurs, and knowledge workers.

Timestamps

* [00:00] Worker, Manager, or Expert?
* [00:06] Who manages the AI agents?
* [00:15] Will AI worsen inequality?
* [00:25] The Ide & Talamas model explained
* [00:40] Limitations and critiques
* [00:55] Posteriors: updated beliefs

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
Aug 12, 2025 • 1h 2min

One LLM to rule them all?

In this special episode of the Justified Posteriors Podcast, hosts Seth Benzell and Andrey Fradkin dive into the competitive dynamics of large language models (LLMs). Using Andrey's working paper, Demand for LLMs: Descriptive Evidence on Substitution, Market Expansion, and Multihoming, they explore how quickly new models gain market share, why some cannibalize predecessors while others expand the user base, and how apps often integrate multiple models simultaneously.

Host's note: this episode was recorded in May 2025, and things have been rapidly evolving. Look for an update sometime soon.

Transcript

Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates beliefs about the economics of AI and technology. I'm Seth Benzell, possessing a highly horizontally differentiated intelligence—not saying that's a good thing—coming to you from Chapman University in sunny Southern California.

Andrey: And I'm Andrey Fradkin, multihoming across many different papers I'm working on, coming to you from sunny—in this case—Cambridge, Massachusetts.

Seth: Wow… a rare sunny day in Cambridge, Mass. But I guess the sunlight is kind of a theme for our talk today, because we're going to try to shed some light on some surprising features of AI—some important features, and yet not discussed at all. Why don't people write papers about the important parts of AI? Andrey, what's this paper about?

Andrey: I agree that not enough work has been done on this very important topic. Look, we can think about the big macroeconomic implications of AI—that's really fun to talk about—but it's also fun to talk about the business of AI. Specifically, who's going to win out? Which models are better than others? And how can we measure these things as they're happening, at the moment? And so that's really what this paper is about.
It's trying to study how different model providers compete with each other.

Seth: Before we get deep into that—I do want to push back on the idea that this isn't macroeconomically important. I think understanding the way that the industry structure for AI will work will have incredible macroeconomic implications, right? If only for diversity—for equality across countries, right? We might end up in a world where there's just one country or a pair of countries that dominate AI, versus a world where the entire world is involved in the AI supply chain and plugging in valuable pieces, and those are two very different worlds.

Andrey: Yeah. So, you're speaking my book, Seth. Being an industrial organization economist, you know, we constantly have this belief that macroeconomists, by thinking so big-picture, are missing the important details about specific industries that are actually important for the macroeconomy.

Seth: I mean—not every specific industry; there's one or two specific industries that I would pay attention to.

Andrey: Have you heard of the cereal industry, Seth?

Seth: The cereal industry?

Andrey: It's important how mushy the cereal is.

Seth: Well, actually, believe it or not, I do have a breakfast cereal industry take that we will get to before the end of this podcast. So, viewers [and] listeners at home, you gotta stay to the end for the breakfast cereal AI economics take.

Andrey: Yeah. And listeners at home, the reason that I'm mentioning cereal is that it's, of course, the favorite—the fruit fly of industrial organization, for estimating demand specifically. So a lot of papers have been written about estimating cereal demand and other such things.

Seth: Ah—I thought it was cars. I guess cars and cereal are the two things.

Andrey: Cars and cereal are the classic go-tos.

Introducing the paper

Seth: Amazing.
So, what [REDACTED] wrote the paper we're reading today, Andrey?

Andrey: Well, you know—it was me, dear reader—I wrote the paper.

Seth: So we know who's responsible.

Andrey: All mistakes are my fault, but I should also mention that I wrote it in a week and it's all very much in progress. And so I hope to learn from this conversation—let's say my priors are diffuse enough so that I can still update.

Seth: Oh dude, I want you to have a solid prior so we can get at it. But I will say I was very, very inspired by this project, Andrey. I also want to follow in your footsteps. Well, maybe we'll talk about that at the end of the podcast as well. But maybe you can just tell us the title of your paper, Andrey.

Andrey: The title of the paper is Demand for LLMs, and now you're forcing me to remember the rest of the title—

Seth: If you were an AI, you would remember the title of the paper, maybe.

Andrey: The title of the paper is Demand for LLMs: Descriptive Evidence on Substitution, Market Expansion, and Multihoming. So, I will state three claims, which I do make in the paper.

Seth: Ooh, ooh.

Andrey: And you can tell me your priors.

Seth: A prior on each one. Okay, so give me the abstract; claim number one.

Andrey: So point number one is that when a good new model gets released, it gets adopted very quickly. Within a few weeks, it achieves kind of a baseline level of adoption. So I think that's fact number one. And that's very interesting, because not all industries have such quick adoption cycles.

Seth: Right? It looks more like the movie or the media industry, where you have a release and then boom, everybody flocks to it. That's the sense that I got before reading this paper. So I would put my probability on a hot new model coming out and everybody starting to try it—I mean, a lot of these websites just push you towards the new model anyway. I know we're going to be looking at a very specific context, but if we're just thinking overall.
Man—99%. When a hot new model comes out, people try it.

Andrey: So I'll push back on that. The claim is that it's not just about trying it—these models achieve an equilibrium level of market penetration. It's not—

Seth: How long? How long is just trying it? Weeks? Months?

Andrey: How long—sorry, can you repeat that question?

Seth: So you're pushing back on the idea that this is, quote unquote, "just trying the new release." Right. But what is the timeline you're looking over?

Andrey: It's certainly a few months, but it doesn't take a long time to just try it. So, if it was just trying, we'd see a blip over a week, and then it would go back down. And I don't—

Seth: If they were highly horizontally differentiated. But if they were just very slightly horizontally differentiated, you might need a long time to figure it out.

Andrey: You might—that's fair. Okay, so the second claim is: different models have very different patterns of either substituting away from existing models or expanding the market. And I think two models that really highlight that are Claude 3.7 Sonnet, which primarily cannibalizes from Claude 3.5 Sonnet—

Seth: New Coke.

Andrey: Yes—well, New Coke failed in this regard.

Seth: Diet Coke.

Andrey: Yeah. And then another model is Google's Gemini 2.0 Flash, which really expanded the market on this platform. A lot of people started using it a lot, and it didn't seem to have noticeable effects on other model usage.

Seth: Right?

Andrey: So this is kind of showing that models are competing in this interesting space.

Seth: My gosh. Andrey, do you want me to evaluate the claim that you made, or are you now just vaguely appealing to competition? Which of the two do you want me to put a prior on?

Andrey: No no no. Go for it.
Yeah.

Seth: All right, so the first one is: do I think that if I look at, you know, a website with a hundred different models, some of them will steal from the same company and some of them will lead to new customers? Right? I mean—I'm a little bit… Suppose we asked this question about products, and you said, "Professor Benzell, will my product steal demand from other products, or will it lead to new customers?" I guess at a certain level, it doesn't even make sense, right? There's a general equilibrium problem here where you always have to draw from something else. I know we're drawing from other AIs, which would mean that there would have to be some kind of substitution. So I mean, yes, I believe sometimes there's going to be substitution, and yes, I believe sometimes, for reasons that are not necessarily directly connected to the AI model, the rollout of a new model might bring new people into the market. Right. So I guess I agree. At the empirical level, I would say I'm 95% certain that models differ in whether they steal from other models or bring in new people. If you're telling me now there's a subtler claim here—that the fact that some models bring in new people is suggestive of horizontal differentiation, and is further evidence for strong horizontal differentiation—I'm a little bit, I don't know. I'll put a probability on that, but that seems to be going a little bit beyond the scope of the description.

Andrey: Well, we can discuss that in the discussion section. And I think the final claim I make is that apps, and the users of apps as well, tend to multihome across models. So it's not that people are using just one model, and it's not like app developers are using just one model for each application. And that's, once again, pointing to the fact that there isn't just one superior model, even within a given model class. And, Seth, go for it.

Seth: Andrey, you did the thing again.
You did the thing again where you said, "Here, Seth, do you want to evaluate this empirical finding?" Or do you want me to now say, "This tells us something about the future of competition in AI"?

Andrey: Yes, yes, yes. All right, go for it.

Seth: The empirical claim, right? Give me the narrow claim one more time? Give it to me.

Andrey: The apps are multihoming.

Seth: The people multihome. Okay. The narrow claim is we've got these apps; maybe we'll give the listeners a little bit of context on what a sample app would be.

Andrey: Yeah, so I think about two types of apps here. One is a coding app—so Cline and Roo Code are two quite popular coding apps. And we see that users of those apps are multihoming. And then—those apps are multihoming; we don't know as much about the users—and then we have various chat-persona apps. And then we have some kind of utility apps.

Seth: Yeah. We'll call them, like—let's call that second group role-play apps.

Andrey: Yeah, yeah. We have, like, a PDF extractor and apps like that, that are also on the—

Seth: Very tool-ly. Okay, cool. All right, so we've got all these apps out, and now you're going to tell me, "Professor Benzell, I think you would be surprised to find out that Roo Code, for example, has both a Claude model powering it and an OpenAI model powering it." And that is probably the thing I'm most surprised by. Right? I definitely would not be surprised at all to know that Roo Code can send its Claude tokens to one data center versus another data center; that makes perfect sense. But the fact that you would sustainably have many different contemporaneous models on the same platform feels like a stage in a process rather than where we're going to end up. What do I mean by that? So why would you want to keep an old legacy model inside of your Roo Code? So—SillyTavern is one that I like.
So SillyTavern is just—you can do role play and talk to characters and pretend you're going on adventures. Right? It seems like Claude 3.7 should just be better than 3.5 at that, right? I really don't—my intuition is that they're not strongly horizontally differentiated. Why would you keep both? It would be for legacy reasons, for backward compatibility. Maybe there's a specific interaction or scenario that you had working in the old version of the app, and you want to make sure that that's still around for new users. So, how would I think about this? If you want to say that this multihoming is evidence of competition, because the same app wants to use multiple versions—I kind of disagree, right? The way I think about it is maybe more like, you know, you're a car, and you can either use the old muffler or the new muffler, and some people have upgraded to the new muffler, but some people are still using the old muffler, and so that car has two different kinds of mufflers.

Andrey: Yeah, we can discuss that claim as well. I guess, do you want me to address what I think?

Seth: Well, give me a taste, and then let's go to the evidence. Give me a taste.

Andrey: The multihoming is not happening on an old and a new version of a model. It's happening on, let's say, Claude 3.7 and Gemini 2.5, which are both relatively new models. The other thing I'd say is, if you read Reddit, there are some users that still like 3.5 better than 3.7.

Seth: On the internet, they will prefer one plain white cotton T-shirt to another plain white cotton T-shirt.

Andrey: Who are you to question the preferences of the consumer?

Seth: Right? But I guess—all right, so this is my last comment on the priors, and then we'll get into the evidence, which is:
This sort of speculation about what people will actually want in the long run is the bridge that gets us from this cross-sectional evidence about April 20, 2025, to what the world's going to look like in 2027 and 2028. So that's why I'm pushing back a little bit.

Andrey: Yeah, I don't want to make claims that are too great about 2027 based on this cross section. Yes.

Seth: You know, GDP growth's gonna be at 30%.

Andrey: That's true.

Seth: And all of your labor will be automated.

Andrey: There is going to be a lot of market expansion, I hear.

Seth: Oh, babe, listen to our Epoch AI episode. We'll post that before this one so you can see what we're laughing at.

Andrey: All right.

Seth: So tell me, Andrey. I can think of no one better suited to walk us through the evidence of this paper than Professor Fradkin of Boston University.

Andrey: Look, it's a very simple paper. It's essentially a few graphs, and the graphs are event studies, where we see what happens to a selected set of models around the time of the release of one of the new models. So for the release of Claude 3.7, we see a very obvious drop in the usage of 3.5. You know, if you ballpark it, it's about 80% cannibalization. And the adoption happens within a few weeks, so it's fairly fast. We also look at Flash 2.0. We see very fast adoption, and in terms of tokens used, Flash 2.0 becomes the biggest model very quickly. And then Gemini Pro is another model that gets released in this time period. And it also sees a very fast adoption curve that doesn't seem to cannibalize other models in this time period. So that's the evidence on cannibalization and market expansion, and then there's the evidence on multihoming. There are some intricacies with the scraping of the data here. So, actually—let's take a step back. Where does this data come from?

Seth: What is OpenRouter?

Andrey: We haven't discussed what OpenRouter is. All right.
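(Editor's aside: the "about 80% cannibalization" ballpark Andrey describes can be reproduced with simple before/after arithmetic on token counts. The numbers below are hypothetical placeholders chosen to illustrate the calculation, not figures from the paper.)

```python
# Hypothetical back-of-the-envelope cannibalization calculation.
# All token counts are illustrative, not from the paper.

old_model_before = 10.0  # daily tokens (billions) for Claude 3.5 before the release
old_model_after = 2.0    # Claude 3.5 usage after adoption of 3.7 stabilizes
new_model_after = 10.0   # Claude 3.7 usage after adoption stabilizes

# Share of the new model's usage that came out of the old model
cannibalization = (old_model_before - old_model_after) / new_model_after
print(f"cannibalization ~ {cannibalization:.0%}")  # 80% with these numbers

# The remainder is market expansion (or substitution from elsewhere)
expansion = 1 - cannibalization
print(f"market expansion ~ {expansion:.0%}")
```

With these placeholder numbers, 8 of the new model's 10 billion daily tokens are accounted for by the old model's decline, which is the sense in which an event-study drop translates into a cannibalization share.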
Look, one of the challenges with studying these issues is that a lot of the data sits in these fortresses of data from which you cannot extract anything.

Seth: And we're trying for you, listeners; we're banging at that gate. We're banging at that gate every day, trying to get in for you.

Andrey: Yes. Yes. So for people who are using OpenAI through the chat app, or through direct OpenAI API calls, we're not going to get a lot of visibility into that data. We might get some auxiliary data from credit card providers, payment processors, and the like, but it's hard to know how usage is changing, and how specific model usage is changing particularly. One thing that exists is this service called OpenRouter, and there are other companies that are similar to it. And it's built for, I'd say, a sophisticated user—who might be, like, a software developer—who knows that, hey, you know, I want to use a mix of models, or I might want to change my code to use a different model as—

Seth: Andrey, what's the S word that I'm thinking of here?

Andrey: Substitution? What?

Seth: Selection. We're looking under the streetlight, at the people who want to multihome.

Andrey: Yes. 100%. But I will say—let me just explain what OpenRouter is, and then we'll talk about selection and whether we care about that or not.

Seth: Oops.

Andrey: Okay. So it's a very handy service that allows you to call many different types of models. It also allows you to set rules for which model to use, as a function of things that you might not be thinking about if you're just a chat user—like latency, throughput, uptime, and specific pricing, and how that differs across prompt tokens versus reasoning tokens versus completion tokens. So it's just a really useful service for the app developer.

Seth: I mean, can I—just to interrupt for a split second here, right?
Honestly, I feel like you gave me more evidence for horizontal differentiation in this market just by listing those four different features than you did with almost anything else, right? Because, all right, I could see why you would need to balance between latency, price, throughput, quality, et cetera, et cetera.

Andrey: Yeah. And there is actually an interesting feature of this market that many do not know: there are multiple companies that serve specific models. This is obviously true with open-source models, where anyone can serve them. So we have a lot of servers of your Llamas and your DeepSeeks. But it's also true of the closed-source models. For example, Microsoft might serve an OpenAI model, and OpenAI might serve the OpenAI model, and there might be differences in how well they're serving these models.

Seth: Does that mean that Microsoft has to know the model weights, or are they hidden in some way from them?

Andrey: That's above my pay grade. I—

Seth: We will find out for you.

Andrey: I mean, Microsoft owns a lot of OpenAI, so they have some access.

Seth: Okay.

Andrey: Yeah. So, that's kind of an interesting feature of—

Seth: Mm-hmm.

Andrey: Anyway. One thing that this company does is publish a lot of data about model usage and how it is changing over time, and also about how specific apps use different models. In particular, for each model, they list the top 20 apps using that model and their usage numbers. So you piece these together, and you can get some pretty good information about popular apps, what models they're using, and how much they're using them.

Seth: Mm-hmm.

Andrey: And even over time, if you're scraping it continuously—

Seth: Do we know if this is for the apps that list themselves on OpenRouter? Is this the universe of tokens going through those apps? Do we know that?

Andrey: I think it's the universe of tokens going through those apps, but not all apps are—

Seth: Obviously? Yeah.

Andrey: —publicly disclosing it.
Even if they are using OpenRouter.

Seth: Well, it's a fascinating data set. It's going to show us the price of tokens, it's going to show us which apps are using which tokens, and we're going to get dynamics on that over time. So it seems like a perfect data set. Andrey, your next big contribution is just noticing the data set.

Andrey: To be clear, the ML community knows about this data set as well. You know, in this question of how we evaluate which models are good and which are not, what we all love is revealed preference.

Seth: Oh, ooh.

Andrey: Usage. And OpenRouter has one such ranking, right? That's publicly available. It seems pretty hard to game it, although we can talk about ways one could try to game it. And that should tell us something about which model is better—at the very least, which models are on the Pareto frontier. And so the machine learning community, the AI community, has been noticing this. So yeah.

Seth: And then they told you, so then your contribution was the translation to economics.

Andrey: I don't know who told me. The other thing I should say is that now certain companies are releasing stealth models on OpenRouter as a way to test them.

Seth: Oh.

Andrey: That's also an interesting dynamic to explore. In particular, OpenAI has stealth-released some models through there.

Seth: And these would be—so if I was running SillyTavern, it would become apparent to me that there's a GPT-4o version, too, and I could play around with it as an option.

Andrey: And there's a new model called Optimus Alpha.

Seth: Oh God, did they let Elon Musk name this one? Oh my God. Somebody took too much testosterone this morning.

Andrey: Yeah. So, all right. That model gets released for a few weeks. People play around with it, and then it's the new OpenAI model.

Seth: Got it, got it.
And then, theoretically, normal app users of SillyTavern might be interacting with this model for a little bit before the official release, therefore.

Andrey: Yeah.

Seth: Got it. Okay. Cool.

Andrey: Yeah. So what questions do you have, Seth?

Seth: What questions do I have? Andrey, it occurs to me this population of LLM users might not be representative of the market as a whole. How do you respond to that limitation?

Andrey: So, I acknowledge it. But let me push back a little bit. There are different populations of, what shall we say, heavy LLM users that we can think about. One type of user is your basic consumer, and that type might have a ChatGPT subscription, or might even use, you know, the free version, or Claude—even though really most of the action is in ChatGPT. We're not talking about that; I think that's very clear. That's a consumer product, and we know consumers suffer from very large default effects.

Seth: Right?

Andrey: They're not going to be switching very actively in aggregate. So I don't think this paper is about that at all. The second type of use case that we know a lot about, or we're aware there's a big use case for, is programming. Right?

Seth: Mm-hmm.

Andrey: And here I think this is a bit of a more representative sample in a lot of ways. Well, Cline and Roo Code are serious programming apps.

Seth: Even though they have silly names.

Andrey: Yes, 100%, and they have features that are essentially at parity with the features of Copilot in VS Code and Cursor—even though, as far as I'm aware, Cursor and Copilot use their own software to route the model calls. You can do the same things in those apps. So I'd say the coverage and the user bases of these apps are quite similar. You might say Cline and Roo Code users are a little more sophisticated, but I actually don't think it's that big of a difference.

Seth: They're just a little weirder.

Andrey: They're a little weirder.

Seth: So you think this is very representative of the market for AI tokens? For coding?

Andrey: Yes, with one exception—

Seth: Mm-hmm.

Andrey: The exception is that some companies place severe limitations on the types of models their employees can use. So imagine you're working at Google. I imagine if you're working at Google—

Seth: Gotta use it; you gotta eat your own dog food.

Andrey: You cannot use o3 for programming, I assume.

Seth: You cannot generate images of German Nazis. They have to be all—right. That's a callback joke, guys. All right?

Andrey: So then there are these other apps, and there, you know, it's hard to say. Look, if you're an app developer and you have a single-use app, like a PDF text extractor or something like that, I imagine that you are actively considering different models, especially trying to optimize your costs.

Seth: Mm-hmm.

Andrey: And you may or may not use OpenRouter. I'm not sure; certainly, there might be some selection, and if there are developers who are less sensitive to these issues, they might not feel the need to use OpenRouter.

Seth: But for freelance coding, we think this is representative. All right. Now talk about these other settings, like the tools and the role-playing.

Andrey: Taking an example: let's say you have a service where you send it a PDF, and it gives you back the structured text.

Seth: Mm-hmm. Mm-hmm.

Andrey: Which is a type of app that you can find on OpenRouter. I doubt that whoever's writing these types of apps is very different whether they use OpenRouter or not. I imagine they're considering many models.

Seth: Right.
Well, I mean, I guess we're kind of in the talk-about-it section, but you could see a lot of this stuff getting built backward into the platform, right? There's this story, you know, about iPhones. When you started off with an iPhone, there was like a flashlight app that you had to install to get the light to go, but then they built it in as a feature, right? So, in the long run, is there even a place for something like OpenRouter, or are these all features that are going to be built right into OpenAI or built right into Anthropic?

Andrey: I guess the feature of being able to use the other models is a feature. I doubt that they'll build that in, but you know, who knows, right?

Seth: Right, but they might give you different versions. There would be the within-OpenAI version and then the within-Claude version, and they could give you a selection of models.

Andrey: Sure, sure. And I think a lot of big companies do this: if they sign an enterprise contract with OpenAI or Google or Anthropic, they're going to use their models. They might even have forward-deployed engineers that show them how to use the model in the best possible way, how to fine-tune it, and so on. So I think, if an application requires really close cooperation between the foundation model provider and the application layer, we'll see that the different competitors are essentially splitting off into cooperating with different model providers.

Seth: Right. So you think that is one possible future, in which we end up with much more fragmentation than OpenRouter. So there would be, in that universe, multihoming across models, but not multihoming across companies.

Andrey: Yeah. Multihoming across models versus multihoming across providers—we should be clearer about that.
And I think the evidence that I have is at least not—it's not just multihoming within OpenAI or within Llama or—

Seth: Ooh. Ooh. We'll have to see about that. All right. Okay. All right. Another question I have about this: not all tokens are created equal, either. I mean, how large a range in prices are people paying for these tokens? What I know is you have a little table of a maximum and minimum, but give the audience a sense of how expensive intelligence can get and how cheap it can get.

Andrey: How expensive and how cheap can it get? So it can be close to free, especially for pretty small models. And it can get pretty expensive. There's an output price of $18 per million tokens that exists on this platform, at the time I was looking at it, for example.

Seth: It's still cheaper than my ghostwriter.

Andrey: Yeah, I mean, a million tokens is not nothing, for sure. And then there are differences in input prices and output prices. And there's also something that I haven't captured very well in this data, which is that there might be discounts for something called NGS. Things get more complicated the more I look at it in detail.

Seth: Right. And the question is, do these kinds of details suggest concentration, or do the details suggest fragmentation and horizontal differentiation?

Andrey: Yeah.

Seth: Hmm.

Andrey: Let's talk a little bit about just some very basic economics of—

Seth: What the f**k is competition? Why do we want it?

Andrey: Yeah. So I think first let's think about the utility—the consumer and app developer utility part of this, right? Let's imagine that they have some utility for the different models, but they also have to, you know, pay a price for them. So, the way we think about it is: how much are people willing to pay for the better model? And if we think that things are pretty vertically differentiated, everyone will want to pay more for the same types of models.
If we think that things are horizontally differentiated, then different developers will want to pay more for different types of models. And then there's also this question about the scaling thing. Like, yeah, maybe there's a model that's a little bit better than the other model, but it's a lot more expensive, and people are not willing to pay for that. So that might be something going on.

Seth: Hmm.

Andrey: Prices, obviously, are a very important variable to think about, especially when you think about them in the following way. Say you have a hard problem. One way to approach it is to throw it at the best model. Another way to approach it is to call a slightly worse model 10 times and then pick the best answer, right? So there's some implicit substitutability that might be present here.

Seth: But that—oh man. So now that's so interesting, because the story you just told is not a story about horizontal differentiation. Right.

Andrey: Yes.

Seth: But it is a reason why you might want lots of different vertically differentiated models.

Andrey: Yes. Yeah.

Seth: Ah huh. So maybe we don't have direct evidence on horizontal differentiation here.

Andrey: For what it's worth, I'm not sure how often this pattern is being used, but it's—

Seth: Okay.

Andrey: It's certainly possible. Yeah. And then there's another thing to mention, which is the famous Jevons paradox, which is a paradox.

Seth: I mean, no paradox is really a paradox, according to my book, Sleight of Mind, about why paradoxes are dumb and you should just know all the right answers.

Andrey: Yes. All right. So, let's say we have an efficiency improvement in our model serving, and we lower our prices by a bit.
The response to that might be so large that the total number of tokens used might go up.

Seth: Right?

Andrey: Essentially, the total revenue can go up.

Seth: And so, I mean, it seems like that's happening constantly in this data: we're releasing better and better models, and demand just goes up.

Andrey: Yeah. Yeah.

Seth: Which provides another challenge for thinking about substitutability, because we don't have individual-level data. This is not a static market. People are entering this market all the time. I mean, the figures you make are quite compelling—stuff is happening the instant these models are released. But it's also the case that, you know, compositionally, who's in this data is changing, and pretty fluid.

Andrey: Yeah. Yeah. It's something I do hope to have more to say about, as I've been scraping over time, because at least within an app, you might say that the—

Seth: It's homogeneous within an app. Yeah. Or maybe you loop together all the coding apps and all the, you know, SillyTaverns. Okay, cool. All right. I mean, how much do you feel like you have to make a claim about horizontal differentiation here?

Andrey: Look, it's hard for me to see multihoming and think that there is no horizontal differentiation here.

Seth: Other than price-quantity differentiation, or price-quality?

Andrey: No, no—sure. But I guess a point that, you know, you can see in these figures is that these are pretty similarly priced models, in many ways, that are being multihomed.

Seth: The latency is a little bit different. Maybe I'm going to switch back and forth based on latency. There are a lot of different little things here, right?

Andrey: Sure, sure. That's fair. Without having the individual usage data, it's really hard for me to make these fine-grained claims.
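(Editor's aside: the "call a slightly worse model 10 times and pick the best answer" substitution Andrey describes above can be sketched as an expected-cost comparison. All prices and success probabilities below are invented for illustration; they are not measured quantities from the paper.)

```python
# Hypothetical best-of-n substitution between a frontier model and a
# cheaper one. Prices and success probabilities are illustrative
# assumptions, not data.

def cost_per_solve(price_per_call: float, p_success: float, n: int) -> float:
    """Expected cost per solved problem when making n independent
    attempts and keeping the best answer."""
    p_any = 1 - (1 - p_success) ** n  # P(at least one attempt succeeds)
    return n * price_per_call / p_any

# One call to a strong, expensive model (assumed 10 cents, 90% solve rate)
frontier = cost_per_solve(price_per_call=0.10, p_success=0.90, n=1)

# Ten calls to a weaker, much cheaper model (assumed 0.5 cents, 40% solve rate)
cheap = cost_per_solve(price_per_call=0.005, p_success=0.40, n=10)

print(f"frontier: ${frontier:.4f} per solve")
print(f"best-of-10 cheap: ${cheap:.4f} per solve")
```

Which route wins depends entirely on these parameters, which is why relative prices can matter even among purely vertically ranked models.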
I certainly have begged for this data from the CEO of OpenRouter, but so far no cigar.Seth: Okay, let me push. Let's talk about that a little bit more, right? Which is, if the multi-homing is driven by fluctuations in latency, let's say, right? Like, I don't have strong preferences between Claude and ChatGPT; I just want to call the one that's lower latency. You can definitely get multi-homing there without it being driven by any difference amongst the models.Andrey: Sure. I guess I think this is very empirically testable. I haven't yet—the latency data is at a five-second level, so you can just see how much it changes over time.Seth: There we go.Andrey: Yes.Seth: Ooh, ooh. I've given you some more homework, it sounds like.Andrey: So, I guess if we think that the latency is highly variable across time or the throughput is highly variable over time, then we might see that sort of pattern. If we don't see it being very highly variable over time, then maybe that's some evidence that it's not quite what's driving it, but yeah.Seth: Let me tell you what my prior is, so maybe this is like the key part here, right? I have this really strong prior that I did not have—I was not born with it, but I have been trained by talking to AI experts—Andrey: Mm-hmm.Seth: There's no such thing as the AI that's good at military stuff versus the AI that's good at writing humanities papers. That it's all intelligence—you get more of it or less of it. Sure. At the margin there's fine-tuning, there's vibes, but with the right sort of prompt and, you know, with a sufficiently unlocked model, it should be just pure vertical differentiation. That's kind of it; when I've been in rooms with technologists, that's the claim they make.Now, maybe that's because they're at OpenAI or they're at Anthropic, and it's their incentive for this to be a universe where there's only two big boys.
But serious people I've talked to have suggested there isn't such a thing as significant LLM horizontal differentiation.Andrey: Yeah. I don't believe that. Let's see what they—let's see what they actually do.Seth: Mm-hmm.Andrey: OpenAI is constantly updating its default model in ChatGPT. And sometimes they optimize for one metric, and then they realize that they face a trade-off. So, for example, if your ChatGPT is a little too nice to you, that might lead you to use ChatGPT more, but it might feel ethically dubious for ChatGPT to be encouraging your addiction, given that you totally deserve to be addicted to your phone. So, there's clearly a Pareto frontier of different things that these models can be made to do. Right? So a lot of experimentation by the companies takes the form of: how do we play on this Pareto frontier? The existence of a Pareto frontier suggests that there isn't just one dimension on which things differ.Seth: Right. But I guess where I come at this from is, okay, imagine there's like a continuum of steps of delivering the token to the consumer, right? The first step is a $500 billion pre-training run. We, you know, make the giant pre-trained model. The second step is we're going to fine-tune it. We do the RLHF and give my model its particular personality, and it knows it's not allowed to work for terrorists or whatever. And then there's the third step, which is we're now going to plug that fine-tuned model into an app, and it's going to be deployed in something functional that a consumer can interact with. I guess the way I see it is like as we move down that continuum, this becomes more and more horizontally differentiated, and at the beginning it seems really not horizontally differentiated, and by the end it really is very, you know, you don't want the SillyTavern AI, you know, helping you convert PDFs.Right.
So I guess when I hear LLMs are horizontally differentiated, I'm thinking about that pre-training step.Andrey: Mm-hmm.Seth: Maybe you want to make a claim about how the usage of AI in apps is horizontally differentiated, which is at the far other end.Andrey: Sure. Yeah, I think that's true. You know, we've talked about unhobbling on the show before, and I certainly believe that lots of these models have capabilities that we haven't figured out how to get out of them. Right. They know so—Seth: Right. I've tried really hard to make OpenAI do some of those things, and it's not—it's not as nice as Grok when you ask him to, or—Andrey: Yeah. So I think that's right, right? How the application and how these models are used in the application layer can be differentiated even if we think that at the foundational level it's just a ball of clay and some of these balls are bigger clay balls than other balls.Seth: Oh, right. And when you have smaller clay balls, you can't build the Mona Lisa of clay balls. Right. So it's like a capacity thing. Yeah, I mean, it just brings us back to there being a vertical aspect and a horizontal aspect, and the question is like, in the market competition for AIs, where do those two come in? Right? Because in terms of app deployment, you wouldn't expect vertical. I mean, everyone's just going to use the best; they're going to use models that are on the Pareto frontier. So you'd expect the vertical differentiation to be less apparent in that last stage. Right?Andrey: Yeah. I mean, I do. It seems to me that models like Gemini 2.5 Pro and Claude 3.7 Sonnet are both on the frontier, but some people just like one, and some people like the other. And that is horizontal differentiation to me.Seth: Right.
And now you're referring to, like—Andrey: Maybe there's a cost difference, and there might be latency differences, and that's really what's driving, you know, the usage patterns.Seth: Or maybe the prices are identical, and I'm epsilon horizontally differentiated, and that's enough.Andrey: Yeah.Seth: I guess the last thing is that I think my instinct is that horizontal differentiation will become less important over time. Right. So if you think about these balls of clay getting bigger and bigger and bigger, right? Sculpting them exactly the way you want is going to get easier and easier as you have more and more clay to discard. Do you buy that argument?Andrey: I think we'll get better at sculpting things over time. I think that's certainly true. Yeah, and I think that comes back to your question about whether we are going to have horizontal differentiation in the sculpting step. And then the question is, who's going to be sculpting it? Is it going to be app developers sculpting it? Is it still going to be the big labs that sculpt it in various specific ways? Yeah.Seth: Right. I mean, if we're doing the sculpting at the app stage, right, there's just a lot more room for horizontal differentiation, right? Because there's a lot more players who are going to be involved, and, you know, that's the domain where, yeah, it does matter to a consumer whether the interface is blue versus pink, and even stupid s**t like that can support an industry, no offense to, you know, app developers out there.Okay. So one question that is kind of like the implicit background question in this paper, in my opinion—Andrey: Okay.Seth: But it is a prior, which we did not put a probability on, but I just kind of want to ask you—you can come at this having done the research.
It doesn't—you don't have to do it in a prior way, which is like, do you think the market for AI will be, you know, relatively competitive or relatively concentrated in four or five years? Because I mean, my reading of this paper was like, it's arguing it's going to be less concentrated and more competitive than you think.Andrey: I think it depends a lot on the complementarity of other things.Seth: There you go. There you go. Speaking of which, we had Catherine Tucker on, asking her about AI competition. She's like, "Well, you know, I'm Catherine Tucker." A very Catherine Tucker thing.Andrey: That is not how she talks.Seth: She does not talk like that. So I'm not going to try to do my Catherine Tucker voice. But like, her point was like, we know how to do antitrust. It has to do with networks of complementarities and substitutabilities. There's nothing special about AIs. Is that kind of your take?Andrey: I don't think I'm going to make the claim that we know how to do antitrust of AI. That seems premature, to say the least. I will say that the concentration of the industry is very likely to be determined by complementary integration assets. So how important is it to have that Anthropic engineer sitting at, you know, SAP, molding the specific version of Claude for a particular application, or not? Or is it something where SAP will just call OpenRouter, and it's just going to be good enough that way, and they don't have to do specific SaaS contracts with Anthropic or anything like that? And that's hard for me to answer right now. But you know, if I were a betting man, I would say that there'd be a handful of models that are pretty competitive with each other. But I don't think there'll be like a thousand models that are competitive with each other.Seth: Right. There's just not enough room at the top, at the frontier. Just because these training runs will be so, so expensive.
I guess that's kind of—as I was reading this paper, in the back of my head, I'm thinking, you know, like, how many people are going to come up with $500 billion to pre-train their own models? Right. It just seems like there's a maximum to how competitive this industry can get.Andrey: But I guess I would say, like, five. Five is often enough to get a very competitive dynamic. Why do we want competition? It's not just because we want a bunch of competitors, for competitors' sake. We actually want there to be the correct incentives to innovate and then to price fairly, right? So those are kind of the two things we're trading off. And in industrial organization, there are some results that in certain cases you want even fewer than five competitors to get the incentives right. So that still seems quite competitive, even if there is a lot of concentration.Seth: Right. Maybe another way of thinking about this is, suppose we could wave a magic wand and either make AI more horizontally differentiated or make it less horizontally differentiated. Right. We could choose which world we're in.Andrey: Mm-hmm.Seth: A world where they're less horizontally differentiated is probably one with faster growth and, you know, fewer implementation costs and less friction. Right.Andrey: Yeah, I'm not sure. It depends on how we think about the specific innovation production function. It's not obvious to me that there's one answer, right? Because you can imagine that in a horizontally differentiated world, more players are going to be able to try to innovate, and because there are more, there are going to be more rents. But if you think that it's all about just that huge run, that one big run—Seth: Right.Andrey: Maybe it's that you want it to be vertically differentiated and kind of a winner-take-all dynamic. But one where the winner can change from time to time.Seth: Right.
You want a—so then we're in a universe where it's competition for the market rather than competition in the market. And that brings its own set of antitrust concerns. Andrey, you know, believe it or not, I took a minute to look at the same data and ask questions right along these lines of, like, how concentrated is this market exactly? Because reading your paper, it's a paper that's supposed to give me some hints about the competitiveness of the industry. The first thing people ask about an industry is, well, how concentrated is it? Right? So Andrey, what's your sense? Are these models more or less concentrated than a typical industry?Andrey: Um.Seth: And actually, I want you to tell me, all right? I'll lay my cards on the table here. I've got three HHI indices I'm looking at right now, from OpenRouter, for the first week of May. We've got the number of tokens called at the AI company level, so it aggregates up to companies. We've got the number of tokens called at the AI app level, so that's like SillyTavern, et cetera, et cetera. Then we've got the number of tokens called at the model level. And then I would like you to compare these to inequality in motor vehicles and breakfast cereals. So I want you to rank those five from most equal to least equal.Andrey: Yeah, so I will push back on one thing. You count the Meta Llamas as being Meta's, right? Because Meta is not the one who's serving them. Right. But—Seth: Ooh. Ooh. Well, I could do providers too. That would be a fourth way to split it.Andrey: Yes. But generally, yeah. Look, it's more concentrated than these other industries.Seth: It's pretty concentrated.Andrey: I'd say more so, for all of them. Even with the model-specific one, I'd say it's probably more concentrated than the—Seth: That one is actually pretty low. So the model one—just, I'll put some numbers out there.
Just, ballpark, motor vehicles have an HHI of about 2,500; breakfast cereals are just below that.Andrey: Mm-hmm.Seth: The number of tokens at the company level has an HHI of 2,960, so it's a little bit higher than those guys. But if we go to the app level, we're at 2,160, so that's kind of more competitive than motor vehicles and breakfast cereals, which we think have a decent amount of competition. And then the model level, so we're going to treat 3.5 and 3.7 differently—we're pretty equal. We're at the 1,500 level, which is considered pretty, pretty competitive.Andrey: Competitive. Yeah.Seth: All right. Does that change your priors, Andrey?Andrey: Well, I guess I wouldn't have used those industries as a comparison set, right? Like, I think a lot of digital infrastructure types of industries have a lot more concentration. So you think about cloud computing or search or phones, right?Seth: Mm-hmm.Andrey: So relative to those kinds of industries, it is less concentrated. But certainly compared to physical goods products, it seems more concentrated, I guess. I assume that you didn't calculate that HHI per car. Right? So it's kind—Seth: No, it was not. That was at the company level.Andrey: Yeah. I mean—you know, disclosure, this definitely has been on my to-do list. I just have not gotten around to it. But I don't—Seth: All right.Andrey: I don't think that this changes my priors very much, if—Seth: Okay, well, I've got a second fact for you. Second stylized fact. All right, so now I want you to imagine—oh man, I don't know if we have time to start talking about this. We'll save the power law probability distributions for the next episode. But let me give you four different things that might be more or less concentrated. Right? Here's another four things to think about the concentration of. One is 2023 US Compustat companies. Another is the OpenRouter AI at the company level. Another is Hugging Face.
You know, Hugging Face is another website where people will post AI models. This is for free downloads, so these are like public models. So I have downloads of Hugging Face AI models. And then finally I have all-time movie box office. So you tell me which of these is going to be the most concentrated: Hugging Face AI downloads, OpenRouter AI tokens, 2023 US publicly traded companies, or all-time movie box office.Andrey: The OpenRouter one—is that by the model creator?Seth: I believe that, yeah, at the company level.Andrey: Okay. Um, I think OpenRouter is the most concentrated of these.Seth: Correct. Second most?Andrey: Hugging Face?Seth: Hugging Face, second most. Third most?Andrey: I don't know how to think about the Compustat HHI. That seems like—what's the product market? Sorry.Seth: The product—oh, Compustat. It's publicly traded corporations. So it's everything together.Andrey: Oh, you're just combining all the—?Seth: Yeah, yeah, yeah.Andrey: Just revenue by revenue?Seth: No, it's market value. So, you know, implied market value.Andrey: Yeah, I think that'll be three. And then the movies are four.Seth: Dude, you don't even need data. You got this down.Andrey: How about those priors?Seth: Who needs evidence? But okay. You see what I'm trying to get at here, Andrey? Right? Which is, you can give me evidence that people are willing to move back and forth, but if it's the most concentrated industry I can find, it seems pretty concentrated.Andrey: I gave you a bunch of industries that are more concentrated.Seth: Alright. Okay, so now we go. All right, so listen, this is going to be a special two-part episode of Justified Posteriors. In the next episode, Professor Benzell will bring his own evidence and analysis to bear on the data from OpenRouter, and you'll be the judge. Is AI competitive? Is it not competitive? It's the future you're going to have to live with one way or the other.
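For readers following along at home: the HHI figures being tossed around above are just summed squared market shares, in percentage points. A quick sketch with made-up token counts (not the actual OpenRouter numbers):

```python
def hhi(quantities):
    # Herfindahl-Hirschman Index: sum of squared market shares,
    # with shares expressed in percentage points (0-100).
    total = sum(quantities)
    return sum((100.0 * q / total) ** 2 for q in quantities)

# A monopoly scores 10,000; n equal-sized firms score 10,000 / n.
# The usual antitrust rule of thumb treats anything above 2,500
# as highly concentrated.
ten_equal_firms = hhi([1] * 10)           # -> 1000.0
skewed_market = hhi([50, 20, 15, 10, 5])  # -> 3250.0
```

So the company-level figure of roughly 2,960 sits in "highly concentrated" territory, while the model-level figure near 1,500 looks closer to the ten-equal-firms benchmark.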
Andrey, are we ready to talk about our priors a little bit?Andrey: All right.Seth: What's yours? So tell us, you had three claims here. I guess you're a hundred percent convinced of all the claims. After all, you wrote them down.Andrey: Look, my claims are empirical, right?Seth: Right.Andrey: No, I'm not saying that they're right, but, you know, I think—Seth: They're descriptive.Andrey: They're quite descriptive. Unless I made a scraping error or something like that, I think they're, you know, they are what they are, but the interpretation is obviously up for debate.Seth: Mm-hmm. Do you want to take a shot at it? Do you want to give me a percentage chance that in two years—I don't know how to say this—let's say AI, the AI industry, will be more or less competitive than the average tech sub-industry? Is that a fair comparison?Andrey: I don't know what an average tech sub-industry is.Seth: I know—or choose one. Let's just do search. How about search? That's really unequal. Alright. Alright. So yeah, that's the question.Andrey: It's going to be more competitive than search. I have no doubt.Seth: Okay. All right. Let's check that in a couple of years.Andrey: And also more competitive than phone operating systems.Seth: Yeah, we got two big boys there. That's fair. Okay.Andrey: Is it going to be more concentrated two years from now than today? I think that's an interesting question.Seth: You want to take a—is that 50/50 for you? For me, I'd put 90—ninety's too strong—85% that it's more concentrated in the future than now.Andrey: So, it depends on whether we're measuring by revenue or by token.Seth: Let's do tokens at the company level. Oh, I guess we should do revenue, right? Revenue's the more economic measure. But we can do either one.Andrey: The reason I was asking is, like, I still imagine there's still going to be a ton of use cases for small, cheap models and—Seth: Yeah. Yeah.Andrey: A very competitive market, right?
Like, in the sense that people are going to be able to, in principle, roll out a very good small model. It's the big models that we're really worried about, right?Seth: Right, right. So yeah, so it's like the value-weighted is the one where you'd be really worried about concentration, given that there might be a lot of small toy ones that people f**k around with. But I think—Andrey: No, I'm not even talking about f*****g around. There are so many—Seth: Yeah.Andrey: Like, think of all the model calls you would have, right?Seth: Mm-hmm.Andrey: You know, every email you're writing in Gmail—Seth: Mm-hmm.Andrey: For every line of code that you're going through, why not call a cheap model just as a first pass? That might even be the model used to determine whether you want a, you know, more fancy model or something like that.Seth: Right, right. And you can imagine a universe in which, like, those super low-level AI intelligence calls aren't even captured in the data, because I might be running that locally on my own laptop, right? So yeah, maybe there's some sort of size cutoff above which this, like, becomes interesting and tractable.Andrey: I mean, yeah. I don't have strong priors on this, I have to say. I could see arguments either way. Maybe 60/40 towards becoming more concentrated in terms of revenue.Seth: All right. Well, I'm going to try to get Andrey's answer up in the next half of this two-part episode on Concentration and Competition in the AI Industry: Evidence from OpenRouter. This time it's personal.Andrey: All right.Seth: All right. Like, share, and subscribe.Andrey: Yeah. If you have better data, we're very—Seth: Give it to us, please. Yo, we'll be your friend. We'll co-author you.Andrey: Yeah. Just, you'll get such great exposure for your company on this podcast.Seth: Mm-hmm. Right? We will. And we'll also use your AI to write copy if you have an AI model yourself. This is a public episode.
If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
Jul 28, 2025 • 1h 12min

What can we learn from AI exposure measures?

In a Justified Posteriors first, hosts Seth Benzell and Andrey Fradkin sit down with economist Daniel Rock, assistant professor at Wharton and AI2050 Schmidt Science Fellow, to unpack his groundbreaking research on generative AI, productivity, exposure scores, and the future of work. Through a wide-ranging and insightful conversation, the trio examines how exposure to AI reshapes job tasks and why the difference between exposure and automation matters deeply.Links to the referenced papers, as well as a lightly edited transcript of our conversation, with timestamps are below:Timestamps:[00:08] – Meet Daniel Rock[02:04] – Why AI? The MIT Catalyst Moment[04:27] – Breaking Down “GPTs are GPTs”[09:37] – How Exposed Are Our Jobs?[14:49] – What This Research Changes[16:41] – What Exposure Scores Can and Can’t Tell Us[20:10] – How LLMs Are Already Being Used[27:31] – Scissors, Wage Gaps & Task Polarization[38:22] – Specialization, Modularity & the New Tech Workplace[43:43] – The Productivity J-Curve[53:11] – Policy, Risk & Regulation[1:09:54] – Final Thoughts + Call to ActionShow Notes/Media Mentioned:* “GPTs are GPTs” – Rock et al.’s paper* https://arxiv.org/abs/2303.10130* “The Future of Employment: How susceptible are jobs to computerization?” - Frey and Osborne (2013)* https://www.oxfordmartin.ox.ac.uk/publications/the-future-of-employment* “AI exposure predicts unemployment risk: A new approach to technology-driven job loss”— Morgan Frank's paper* https://academic.oup.com/pnasnexus/article/4/4/pgaf107/8104152* "Simple Macroeconomics of AI" – By Daron Acemoglu.* https://economics.mit.edu/sites/default/files/2024-04/The%20Simple%20Macroeconomics%20of%20AI.pdf* “The Dynamo and the Computer” – Paul A. 
David* https://www.almendron.com/tribuna/wp-content/uploads/2018/03/the-dynamo-and-the-computer-an-historical-perspective-on-the-modern-productivity-paradox.pdf* “Productivity J-Curve” – Erik Brynjolfsson, Daniel Rock, and Chad Syverson* https://www.nber.org/system/files/working_papers/w25148/w25148.pdf* “Generative AI for Economic Research: Use Cases and Implications for Economists” – Anton Korinek’s paper* https://www.newyorkfed.org/medialibrary/media/research/conference/2023/FinTech/400pm_Korinek_Paper_LLMs_final.pdf* Kremer’s O-ring Theory* https://fadep.org/wp-content/uploads/2024/03/D-63_THE_O-RING_THEORY.pdf* 12 Monkeys (film) – Directed by Terry Gilliam* “Generative AI for Economic Research” – Anton Korinek* https://www.aeaweb.org/content/file?id=21904Transcript:Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its beliefs about the economics of AI and technology. I'm Seth Benzell, exposed to and exposing myself to the AI since 2015, coming to you from Chapman University in sunny Southern California.Andrey: I'm Andrey Fradkin, riding the J-curve of productivity into infinity, coming to you from Cambridge, Massachusetts. Today, we're delighted to have a friend of the show, Daniel Rock, as our inaugural interview guest.Daniel: Hey, guys.Andrey: Daniel is an assistant professor of operations, information, and decisions at the Wharton School, University of Pennsylvania, and is also an AI2050 Schmidt Science Fellow. So he is considered one of the bright young minds in the AI world. And it's a real pleasure to get to talk to him about his work and spicy takes, if you will.Daniel: Well, it's a pleasure to get to be here. I'm a big fan of what you guys are doing. If I had my intro, I'd say I've been enthusiastic about getting machines to do linear algebra for about a decade.Andrey: Alright, let's get started with some questions. I think before—Seth: Firstly, how do you pronounce the acronym?
O-I-D (Note: OID is the operations, information, and decisions group at Wharton).Daniel: This is a big debate between the students and the faculty. We always say O-I-D, and the students say OID.Seth: So our very own OID boy. All right, you can ask the serious question.Andrey: Before we get into any of the specific papers, I think one of the things that distinguishes Daniel from many other academics in our circle is that he took AI very seriously as a subject of inquiry for social sciences very early, before almost anyone else. So, what led you to that? Like, why were you so ahead of everyone else?Daniel: I'm not sure. Well, it's all relative, I suppose, but there's the very far back answer, which we can talk about later as we talk about labor and AI. And then there is the sort of core catalyst day. I kind of remember it. So back at the M-I-T-I-D-E, where we've all spent time and gotten to know each other, in 2013—Seth: What is the M-I-T-I-D-E?Daniel: The MIT Initiative on the Digital Economy, Erik Brynjolfsson’s research group. I was one of Erik's PhD students. My first year, we had a seminar speaker from the Computer Science and Artificial Intelligence Lab, CSAIL. John Leonard was talking about self-driving cars, and he came out there, and he said, “Look, Google's cheating. They're putting sensors in the road. We're building the real deal: cars that can drive themselves in all sorts of different circumstances. And let me be real with all of you. This is not going to be happening anytime soon. It will be decades.” And there were other people who were knowledgeable about the subject saying, “No, it's coming in like 5 to 10 years.” And at that point I thought to myself, “Well, if all these really brilliant people can disagree about what's going to happen, surely there's something cool here to try to understand.” As you're going through econometrics classes, I wouldn't say econometrics is the same thing as AI.
We could debate that, but there's enough of an overlap that I could kind of get my head around the optimization routines and things going on in the backend of the AI models and thought, “Well, this is a cool place to learn a lot and, at the same time, maybe say something that other people haven't dug into yet.”Andrey: Yeah. Very cool. So, with that, I think maybe you can tell us a little bit about your paper GPTs are GPTs, which is a paper that has had an enormous amount of attention over the years and I think has been quite influential.Daniel: Yeah, we've been lucky in that sense.Seth: In two years.Andrey: That's not—I mean—some version of it was out earlier… No…. Or is it? Has it only really been two years?Daniel: It has been the longest two years, Andrey. If you and I weren't already sort of bald, it might've been a time period for us to go bald. Yeah, we put it out in March of 2023. I had a little bit of early access to GPT-4. My co-authors can attest to the fact that I rather annoyingly tried to get GPT-4 to delete itself for the first week or two that I had it rather than doing the research we needed to. But yeah, it's only been about two and a half. Okay, so the paper, as I describe it, at least recently, has kind of got a Dickensian quality to it. There is a pessimistic component, there's an optimistic component, and there's a realistic component to it. So I'll start with the pessimistic, or I'll—why don't I just start with what we do here first? So we go through O*NET's list of tasks. There are 20,000 tasks in O*NET, and for each one of those tasks, we ask a set of humans who were working with OpenAI—they kind of understand what large language models in general are capable of doing—would it help you cut that time in half? So, could you cut the time to do this task in half with a large language model with no drop in quality? And there are three answers. One answer is of course not; that's like flipping a burger or something.
Maybe we get large language models imbued into robotics technologies at some point in the future, but it's not quite there yet. Another answer is, of course, you can. This would be like writing an email or processing billing details or an invoice. And then there's the middle one, which we call E2. So, E0 is no, E1 is yes, and E2 is yes, you could, but we're going to need to build some additional software and systems around it. So there's a gain to be had there, but it's not like LLMs are the only component of the system. And the reason we picked software is because there's a pretty deep literature on how software and information technologies generally require a lot of co-invention, a lot of additional processes, and intangible capital, which makes it difficult to deploy those technologies fruitfully. And we figured, okay, by comparing that E1 category, the yes you can, with an LLM out-of-the-box, to the E2 category, how much do additional systems and innovation get us? We could say something about whether generative, pre-trained transformers, GPTs, are general-purpose technologies. They'll be pervasive, they improve over time, and they necessitate that kind of complementary innovation. They change the direction of innovation. If we can say yes to those three things, then we're in a situation where we get to the pessimistic version of the story. You just can't know what the long-term equilibrium is going to be across different markets as a result of these tools. So the prognostications that, ‘Oh yes, AI is coming to annihilate all the jobs, the Machine God is imminent’—or at least the Economic Machine God is imminent—I think those are a bit premature if you look and say this is a general-purpose technology, because historically general-purpose technologies have been hard to predict at the outset. So the optimistic side of things is that that impact potential is pervasive. There's a lot of benefit to be had in changing how people work.
We use this exposure measure—I'm sure we'll get into this—but exposure is not automation. Exposure is potential for change, and if there's potential for fruitful change, we get more value in lots of different places in the economy. That's a good story we found—and if the reviewer is listening to this, thank you very much. One of our reviewers suggested looking at science and innovation tasks and research and development tasks and seeing how those compare to other areas. We found high levels of exposure in those areas, which means there's potential to turbocharge growth, at least temporarily, hopefully longer term, in the economy. That's the ‘temporarily,’ and the optimistic component. On the realistic component: we compare the ‘yes, you can do it better with an LLM’ here to the ‘yes, you can, but you need more building’—the set of tasks that get exposed if you build additional systems. If you were to snap your fingers and say, “Hey, we've got everything we need”—that's much, much bigger than the stuff that's just exposed to LLMs on its own. So the realistic story is we have a lot of work to do as a society in the global economy to bring about the gains of these tools. And it'll probably take a few decades for it all to play out. As much as we think that the changes have been very quick, it has been a fast two years, or slow, depending on who you ask.Seth: This has been great. Andrey and I are both bursting with questions. I'll let Andrey go first.Andrey: I want just a quantification. Like, so what percentage of tasks are exposed according to the first definition? What percentage of tasks are exposed according to the second definition, approximately?Daniel: Yeah, if I recall correctly, about 14% or 15% of tasks (depending on if you're looking at the human ratings or the GPT-4 ones). GPT-4 and humans tend to agree, by the way.
There's some noise there, but if you look at the GPT-4 ones, it's about 14% of tasks for E1, the level where it's just LLMs that can help. Now, if you snapped your fingers again and said it's E2 and E1, that's about 46% of tasks. I might have my numbers slightly off there, but that's roughly what they were.

Andrey: And did you calculate what share of occupations have 100% of their tasks exposed?

Daniel: There were very few, if any, occupations that were a hundred percent exposed. I think data scientist was up there, and it depends on the measure; we actually have three different combinations of these scores. The most conservative is saying it's just E1, and that's it, and the least conservative is E1 and E2: we score each task that has either one of those labels as one and E0 as zero. And then there's this kind of intermediate one that I like, but my co-authors don't like as much, where E1 gets a one and E2 gets a 0.5. So it depends on what you look at. Mathematicians were highly exposed. My co-author, Pamela, has gotten some angry emails from mathematicians saying, “No, that can't be.”

I will say I use it for building theory now. I use the language models for building theoretical models, and they do a pretty good job. They make some pretty terrible mistakes occasionally, so you do have to check their work, but to go from a verbal sketch of what you're trying to prove to some math that roughly shows what the setup should be—it makes it easier to be a reviewer instead of a doer, as they say.

Seth: Sure. All right. A couple of questions from me. The first: when we are doing these E1 ratings, are we talking literally about GPT-4, or are we talking generally about LLMs of approximately that quality? Or are we projecting forward to near-future LLMs?

Daniel: Yeah. It was more the latter. We had a sense of where LLM tools were going to go.
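The three score combinations Daniel describes can be sketched as a small weighting exercise. Only the E0/E1/E2 labels and the 1, 0.5, and 0 weights come from the conversation; the toy task list below is invented for illustration.

```python
# Hypothetical sketch of the three exposure aggregations discussed in the
# episode. Task labels: "E0" (no exposure), "E1" (an LLM alone helps),
# "E2" (helps once additional software/systems are built). An occupation's
# exposure is the average weight over its tasks.

def exposure_share(task_labels, weights):
    """Average the per-task weights over an occupation's task list."""
    return sum(weights[label] for label in task_labels) / len(task_labels)

# The three weighting schemes mentioned in the conversation:
CONSERVATIVE = {"E0": 0.0, "E1": 1.0, "E2": 0.0}        # E1 only
INTERMEDIATE = {"E0": 0.0, "E1": 1.0, "E2": 0.5}        # E2 counts half
LEAST_CONSERVATIVE = {"E0": 0.0, "E1": 1.0, "E2": 1.0}  # E1 and E2

tasks = ["E1", "E2", "E2", "E0"]  # made-up occupation with four tasks
print(exposure_share(tasks, CONSERVATIVE))        # 0.25
print(exposure_share(tasks, INTERMEDIATE))        # 0.5
print(exposure_share(tasks, LEAST_CONSERVATIVE))  # 0.75
```

The same task list scores very differently under the three schemes, which is why the conservative and least-conservative occupation rankings can diverge.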
I think even looking at the set of tools we have now versus GPT-4, they're very similar. There are expanded capabilities—it's been a deepening of their capabilities—but that was the somewhat foreseeable future, especially for my co-authors, who had been in the weeds with this.

But that does bring up an important weakness of this approach, which is that as soon as you see something really qualitatively different, or new capabilities showing up, you have to update the rubrics and the method; you have to rerun stuff. I think arguably the reasoning-model paradigm is getting to the point where you probably have to rerun things.

Andrey: Are you considering rerunning things? Is this like an ongoing endeavor, or—

Daniel: I'm not sure I'm going to return to writing an academic paper. I feel like I've gone to the well one too many times already with this. But if someone else wants to do it, I'm happy to help them out with it. Erik, Mitchell, and I did something in roughly 2016 looking at supervised machine learning and shared some slightly different conclusions, but now that I've been through this twice, I'm not sure that I want to do it just yet.

Andrey: So this is a question that I wanted to raise, 'cause certainly you guys are not the first to do this sort of exercise, and you've done it before. Frey and Osborne have done it. I remember when I first saw these exercises back in 2017-2018, I was like, “This is an accounting exercise. Is this actually useful?” How do you determine in what sense this type of work—

Seth: To throw another critique of this whole research agenda out there: we talk about Frey and Osborne coming out with one of these a decade ago, you talk about your own SML experiences, and I know Morgan Frank has a new paper out at PNAS Nexus that compares about 10 different people's exposure measures.

Daniel: Mm-hmm. Which all do different things.
Yeah.

Seth: And they're all completely different. How should I think about the diversity of these indices?

Daniel: Well, there are different principal components underlying a lot of these different measures. Certainly SML and the GPT scores are very different. And Frey and Osborne—the way they constructed that, effectively, was—

Seth: Basically.

Daniel: Educated-guess vibes with CS professors for a training set. I think their goal was to measure which jobs, as a whole, could be computerized. Actually, let me answer Andrey's question a little more directly: when you look at these, what are they useful for? Let me start by saying what they're not useful for, because some folks have put words in our mouths on this.

Seth: Including Nobel laureates.

Daniel: No Nobel laureates that I know of, but there are some places and some folks who have said things like, “If you're exposed, you're hosed.” And that is not what the authors intended, I will say—

Seth: With the word “hosed,” you set them up for that.

Daniel: It's possible that that is the case, but I have not seen any data to conclude that it is. So let me state clearly for the record the things you do not want to predict with exposure scores, things that exposure scores are not designed to do: economically meaningful outcomes like wages or employment. I'm not trying to say exposure scores will create unemployment. I'm not saying it'll cause wage loss. I view it as a risk measure—I'm a recovering finance guy—and risk can be good, it can be bad. We don't really know. It just means there's an opportunity, technically speaking, to change the types of tasks that people are doing and how they do them. So exposed and hosed are possibly orthogonal ideas.

Nevertheless, I think it's worth tracking. Now, what else is it not useful for? Besides failing to predict labor market equilibrium,
it's not useful for—

Seth: Breakfast?

Daniel: Can they make you breakfast?

Seth: You're—

Daniel: The scores?

Seth: Do you want to list all the things it's not useful for, excuse me?

Daniel: Exhaustively, yes, we should. You can't eat the scores either. I wouldn't say it's especially useful for saying for sure that this is going to happen, right? Just because a technical thing could help someone do a role does not necessarily mean it's appropriate socially, legally, or politically. There's a whole bunch of different places where using LLMs might be inappropriate. One famous example is Geoff Hinton, who predicted that radiology demand would drop. And I think radiology is, say, an apt example of where a multimodal model would be helpful: it could probably pick up a broken bone, but radiologists, as data-enabled doctors, have a lot of other components to their work, and they interpret difficult cases. If you're going to tell someone about a condition they've gotten, it's challenging. That's not the sort of thing where you want an LLM just spitting out, “You have this,” wrong. That would be terrible bedside manner. So even if it's theoretically possible, that doesn't necessarily mean it's going to happen.

So, turning now to where they are useful: one is for testing this hypothesis—are we limited in what we can say?—which is my favorite application of them, in the sense that we see pervasiveness and the necessity of complementary innovation in exposure throughout the economy. So we should dial back our confidence in predictions of what will happen. I think they were useful for answering a very specific hypothesis that we had. But then, underneath that—

Seth: So you were able to—the hypothesis is that they are GPTs of GPTs? They're going to affect everything.

Daniel: Yeah. The only one of the three conditions that we punt on is whether they are GPTs that improve over time, because that one was obvious.
We do have some evidence, but we are mostly getting beyond that. I think about the first-order changes and where they're most likely to happen. I didn't know that this would be the case when we wrote the paper, but I think those measures we built tended to predict where people would start adopting large language models, and there have been a few papers validating that empirically.

Seth: That makes perfect sense, right? So it's maybe not a good model of what's going to happen to your job, but it's a good model of where the OpenAI salesman should show up and knock on the door?

Daniel: Yeah, potentially. So you guys discussed this paper earlier on the podcast, but the Anthropic Economic Index—the areas where they showed people were using Claude—lined up reasonably well with the areas we thought GPTs and LLMs would show up.

Andrey: Except managerial tasks.

Daniel: Except managerial tasks. Those are happening; it's just not clear. I'm not sure what's going on in that dataset. In my work as a startup co-founder, I use all sorts of large language models for managerial tasks all the time. So we'll see what happens there.

Andrey: I used a large language model for managerial tasks earlier today, so I agree with you.

Daniel: Mm-hmm.

Seth: Right. It seems like these AIs are being used. If you look at the Anthropic index, it really does focus on people using it in these kinds of hobby contexts, which was one of our big takeaways from that episode. People don't manage as a hobby, so if a lot of Claude usage is hobby usage, you would expect managerial use to be underrepresented.

Daniel: You're saying that, with the exception of the technical folks, software engineers and data scientists, who are just ripping with this stuff, right? Because that's not necessarily a hobby.

Andrey: Ripping with it in Cursor, I mean. Now we're getting—

Daniel: Sure. Yeah. API use, yeah.
Yeah.

Seth: Right, that's the giant use case right now.

Daniel: Yeah, and that one's a great one. It's kind of ironic given our focus on software, but to some extent you can keep doing what you were doing, just way better, in software development with these tools. You don't actually have to transform the structure of software engineering too much to get a very quick benefit. But I think there is a new mode of working and developing with AI-driven tools that has an analogy in that famous “Dynamo and the Computer” paper. The paper is about electric power conversion—think of the steam engine, right? For the listeners who aren't aware: there's this giant thing in the middle of the factory, and all these pulleys, levers, and belts come off of it, and it powers the whole factory. And then over the next few decades, they realize, “Let's modularize that power.” When we convert to electric power, the first thing to do with electric power is the same thing, but a little bit better: take a giant dynamo, stick it in the middle of the room, and we're off and running. But eventually they were like, “Well, what if we make that really small?” And then we have lots of little machines, all powered by their own little engines.

It's sort of similar, and I'm seeing this with some large companies: you start with a really monolithic, large technology function in the middle of the company that kind of powers lots of subgroups and builds technology for them, and then something kind of magical happens with these AI models. You can sit down with a subject matter expert or a product person, and a senior developer to make sure that these people don't hurt themselves as they're building something, and you create these modular, Jeff-Bezos-two-pizza-team versions of work, where people have input into a process, rather than throwing that process over the wall to the dev team, waiting three weeks, and seeing them come back with something that doesn't fit.
You just develop together and watch the models go, and it really ups your cadence. But it opens up all sorts of best-practice shortfalls that can happen—like, have you hardened for security properly? The devs know what questions to ask there. So going from a specification to a finished product can be way, way quicker if you redesign how the work goes. It's kind of similar to that steam-power-to-electric thing.

Andrey: I guess a natural place to go here is that there's this distinction between micro-level, task-level exposure and the macro-level implications. How should we be thinking about that? Certainly people have used your micro-level exposure metrics in macroeconomic models, and so…

Seth: Tell us about what that experience was like.

Daniel: People use them in different ways. There are papers that you guys have discussed on the podcast before. If you look at the Simple Macroeconomics of AI paper by Daron Acemoglu, he uses our sort of experimental automation score, which is not “Could you use an LLM to improve your task output?” but “Could you use an LLM to just straight-up do this task without a person involved?” It's a really small proportion of tasks in the economy; it's a five-point scale, and those are the fourth- and fifth-point, most intensive automation-risk scores. I don't love those scores, to be honest, but they cover a pretty narrow area. So it's not surprising that we find—or that we read in his paper, I should say—a seven-basis-point-a-year outcome. The OECD has a version where they use the exposure scores, and they get to something like 70 basis points of productivity growth per year, so that's an order of magnitude right there.

I think these scores are a public good in some sense, and people bring their models and their priors too; they're trying to discipline what they believe will happen in the economy with these scores. And they're noisy.
I wish there were something more useful for these people to deploy in their models. But to the extent that we can be helpful, we're really happy that this thing is out there. I just caution folks against viewing exposure as automation, which is a common failure mode, or even leaning on automation versus augmentation as the choice we have ahead of us at the macro level.

And, Andrey, to your point about the macro-level conclusions: yes, labor markets are how we primarily share the gains from economic activity across society. But when you get down to a micro-level task and you're asking a worker, or a manager-worker combo, “Are you upset if we automate this task or augment this task?”—either one, it's anything goes. It's about the labor market and the unit of work that's being purchased in the labor market. I could automate something I hate doing and be thrilled with it, 'cause I could go spend my time doing other stuff. I could automate my whole job and make myself really sad—well, maybe not really sad, but I'd have to find another job. I could augment someone, make them thrilled, and pay them more, or I could augment them such that they do the work of 10 different people, and then nine people get fired. So I think this automation-augmentation micro-question really does boil down to just exposure and changing work.

And we can't say much more than that. Even though automation and augmentation are an elegant mathematical framing in these models, I don't think it's something that we can lean on from a policy perspective at the micro level. You're just going to change what people do.

Seth: Yeah, I'm going to push back on the idea that it's an elegant micro idea, right? Because for exactly the reasons you—

Daniel: Macro idea, I should say. It's an elegant macro idea. I don't think it's an elegant micro idea. Yeah.

Seth: Right. But even then—let me put it this way.
To me, when people want to distinguish between augmenting and automating technologies, they want to talk about them as somehow separate from the rest of the economy. But as you've been implying, the real reason you can't say a certain technology is automating or augmenting is because that production is embedded in an entire economy, and that's going to tell you whether, as productivity goes up, you want more or less of that thing.

The way I would put it is with the metaphor of Marshall's scissors. There's a story told of the famous economist Alfred Marshall of the University of Cambridge, who was the advisor of John Maynard Keynes. Somebody asked him one day whether it was supply or demand that was more important in setting the price for a certain good. Marshall said it's like asking which blade of the scissors is doing the cutting, right?

Daniel: Mm-hmm.

Seth: You can't talk about one without talking about the other if you want to know what the outcome is. And the way I see it, your paper is one blade of the scissors: it's the blade that comes in and tells you this job can be changed, but you need to know everything else about the rest of the economy to understand how the job will be changed.

Daniel: That's right.

Seth: And we've talked about examples—there are countless famous ones, from ATMs to, I like this example, the cotton gin—of jobs getting automated and then demand for that form of labor going up.

Daniel: Right. Yeah. Couldn't agree more.

Seth: Now, Dan, I do have a micro take, and I'm interested in whether you buy this prediction about what exposure scores will do to an occupation. This is a somewhat out-of-equilibrium take.
This is a partial equilibrium, dynamic take, and maybe it'll be smoothed out in the long run. But in the short run, my prediction is that in occupations that are more exposed, there will be more wage polarization at middle-tier firms for that job and less wage polarization at extremely good or extremely bad firms that use that job. Alright, so I've got a kind of framework here. Are you ready? Can you see where I'm going with this, or are you ready for me to give the reason why?

Daniel: I have some hypotheses about how that could work, but—yeah—don't leave me hanging here.

Seth: Right. Okay. So should I start with the general equilibrium first, or should I start with the micro level first? Let's work from the bottom. Imagine you've got a job that uses two tasks, task one and task two. They can be gross complements in production, but it's actually not important—they can be gross complements as long as they're not perfect substitutes, and they can be gross substitutes; that's also fine. I'm a doctor: I need to spend so much time having bedside manner, so much time reading the X-ray. I know that's not a perfect example, right? Okay, imagine a technology comes out that allows you to automate one of the two tasks. Well then, obviously, people who are worse than the technology at the automatable task automate it, and the people who are better than the technology don't. I know this is already going to get a little bit off of the way that maybe you think about how things are, but grant me that for a second.

Okay, what happens? People who are bad at task one but good at task two see a big improvement, whereas people who are good at task one and bad at task two see no improvement. And if you're equally good at both, it kind of depends on how good the technology is. All right, so that's the first step. So where would you get wage polarization from automation?
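Seth's two-task setup can be written as a toy calculation. Everything here is an invented illustration, not taken from any paper: the multiplicative aggregator, the 0.8 technology quality, and the skill numbers are all hypothetical.

```python
# Toy version of the two-task story: a worker produces with two tasks,
# and a technology of fixed quality can stand in for the worker on task
# one, so effective task-one skill becomes max(own skill, tech quality).
# All numbers and the multiplicative form are hypothetical illustrations.

TECH = 0.8  # assumed quality of the automating technology on task one

def output(skill1, skill2):
    # A simple aggregator where both tasks matter (not perfect substitutes).
    return skill1 * skill2

def output_with_tech(skill1, skill2):
    # You adopt the technology only if it is better than you at task one.
    return output(max(skill1, TECH), skill2)

workers = {
    "good at 1, bad at 2": (0.9, 0.4),   # tech beats nothing: no gain
    "bad at 1, good at 2": (0.4, 0.9),   # biggest gain
    "mediocre at both":    (0.65, 0.65), # middling gain
}
for name, (s1, s2) in workers.items():
    gain = output_with_tech(s1, s2) - output(s1, s2)
    print(f"{name}: gain {gain:.3f}")
```

As in the argument above, the gains are most dispersed when skills are anti-correlated: the worker who is bad at the automatable task and good at the other one captures the largest improvement, while the worker with the opposite profile gains nothing.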
You would tend to get it in jobs where people's skills are anti-correlated. Because, as we just said, if you're good at one and bad at two and we automate one, it doesn't help you; but if you're bad at one and good at two and we automate one, it helps you a lot. So you would expect to see wage polarization—an expansion of the wage distribution—for jobs where people's skill levels are anti-correlated. Okay? Now you might say, “Sure, Professor Benzell, that sounds cool, but why would we ever expect, in certain settings, for skill levels to be anti-correlated?”

And now I'm going to bring in the O-ring, right? Kremer has a general equilibrium theory of the economy in which the productivity of a firm, or whatever, is bounded by—kind of limited by—the worst agent in the system. This comes from the space shuttle Challenger explosion: the space shuttle explodes, and we think it's because of this one faulty part, the faulty O-ring. What's the general equilibrium implication of this model? It's basically that you should get people of similar skill levels all concentrated at the same type of firm. So there should be super good firms that have all the high-skilled people, mediocre firms that have all the mediocre people, and bad firms that have all the bad people. And how do you get a mediocre person? Most mediocre people are mediocre 'cause they're good at one thing and bad at another thing.

So now we come back to my hypothesis, which is that exposure should lead to wage polarization at those middle-tier firms. And in fact, I'd love to bring this to some experimental evidence—I'm working with Kyle Myers, a great economist and friend of the show at HBS, on this. Can we predict the experimental outcomes if you introduce AI to a place and it's exposed to some of the tasks? Do you get that polarization in productivity and wages, and when do you seem to just boost everyone by the same amount?

Daniel: Okay. So some quick reactions there.
So just to immediately hop from automation to exposure—folks, I guess I'm going to ask you a question that, funnily enough, I was asked by Joe Stiglitz as a grad student. I was lucky enough to get to sit next to him at a lunch. He was like, “Why do jobs exist? Why are certain tasks bundled together?” And honestly, I don't have a great answer other than to gesture sort of vaguely at coordination costs. But within the task shifting that you're discussing, you've got this mediocrity, or sort of middling productivity, that comes from the fact that some of the things they're good at and some of them they're not. It's still really hard to blow apart the job and then reconstitute it with specialization. So I think where it's coming from is, people are overall high productivity, and then there's a low-productivity component, and then there's this middle thing where you've got some CES aggregator that says, “This person is going to be slightly worse than the average of their components.”

Exposure might lift them in some cases and might not affect them in others. So I kind of buy that piece. To move it to the equilibrium framing, though, I think what'll probably happen in a lot of cases is a mini Baumol cost disease across everything that we do: the areas where we're least productive are going to be the ones that absorb most of our time. And in the beginning, there'll be a lot of confusion about that, because LLMs will make it unclear what the least productive thing is now, what you might be really bad at. Right now, I know I'm really bad at writing, like, spec docs for software. Well, now I have a process with Claude where I can write much better spec docs, and I'm not as terrible at it.

But once you get out of this sort of disequilibrium condition, you might end up in a situation that looks a lot like the one we have right now as things settle. But then the job boundaries have changed.
And there are new names for things. I'll give you a small example. There's a new hot job in Silicon Valley called the Forward Deployed Engineer, where we've got some of these—

Seth: Hazard pay?

Daniel: This is a role at Helix. We've got a forward-deployed engineer—shout-out to Win Ma; she just started.

Seth: Are they waiting for them to call in air support? What's going on?

Daniel: You send them to the customer's site, and they work with customers. You need really strong interpersonal skills, but you also need engineering skills. That's a new configuration of work.

Seth: Wasn't that called being a consultant?

Daniel: No, no. Uh—

Andrey: No, no. If they're a consultant, then you wouldn't be able to pay them as a forward-deployed engineer. Seth, what do you mean? This has nothing to do with what McKinsey would ever do.

Daniel: I'm not sure which end of that ends up being cheaper for the firm. But the critical thing here is that it's a different mixture of work. Those are some initial reactions.

Andrey: I have reactions too. On one level, I'm always a little skeptical of intricate theories like this, when—

Seth: It just has two parts. It has two parts you have to give me.

Andrey: No, no, I mean more so that the first-order question about income inequality is already hard to answer, and then you're trying to answer this even more sub-sub question. And I guess where I'll push back is on what the highest-end firms look like. Production could be an O-ring within a person, or production could be an O-ring across people, right?

Seth: It turns out that the prediction does not rely on whether the O-ring is within people or across people; as long as the tasks aren't perfect substitutes, what I just described goes through.

Andrey: But I guess what I would think is that if we have specialists in 10 different tasks at a high-end firm, and then one of those tasks gets automated, surely one of those people's jobs will get fully automated—and I know Daniel doesn't like “automation” already—but that person's—

Daniel: I do believe it exists.

Andrey: That person's wage will go down, right? Creating inequality.

Seth: Yeah, but I have a theory of one of your tasks being automated, not a theory of all of your tasks being automated.

Andrey: That's where my point is. I mean, it's an interesting question. High-end firms have a lot of specialization, perhaps more specialization than lower-end firms. And so then, if a person is so specialized that their specialty is hit very hard, we might expect a bigger labor market effect for them.

Seth: You might imagine that if tasks were organized differently at large firms, this theory would run into issues. Of course, there are omitted variable problems up the wazoo, but I'm intrigued by the idea of looking into whether people's skills in the subtasks that make up their task bundle—which is their job—are positively or negatively correlated. And I do think that will tell you a lot about what happens when you automate part of the task, or part of the job. So bringing that to the data is complicated, but that's my insight.

Andrey: One more thing: how much do we expect new firm entry to be the key margin with all of this? We know that organizations are very friction-filled, and adoption decisions even—

Seth: New organizations, new jobs, right? If you slice out half of the tasks from a job, in the long run it's probably a new job.

Andrey: Yeah, I think both of those. So then, in terms of thinking about existing firms, it's a little unclear to me in general.
Or at least—I expect I'll be wrong—I expect a lot more entry and growth from new companies that are taking advantage of this new production process from the ground up. That's kind of the lesson of the supply-side disruption theory.

Daniel: Yeah, I'd agree with that. I think one of the reasons it takes such a long time for the benefits of sufficiently transformative technologies to show up is that it usually takes a while for the firms that are deploying them well to become economically meaningful, and then they sort of set a standard.

Seth: Right, and that's not the margin on your margin. The firms that figure out how to do it grow faster, which is another margin.

Daniel: And I think, agreeing with Andrey, a lot of them are new entrants. It's not like an incumbent will always figure out the answer, nor do they have to a lot of the time. Here's where I'd ask you a question, then, Seth, on the idea that bundled tasks have some spectrum from super negatively correlated to perfectly correlated individual task productivities. Why do you think those tasks are bundled together—because there's some coordination cost-benefit? Do you think there's probably some lower bound on how negatively correlated your productivity can be across these different tasks? 'Cause if you really suck at half your job, you probably can't do that job. I think you probably need weak positive correlation everywhere.

Seth: Ooh, man. I think for the sorting to happen—so let's take a thousand people who are all doctors. I agree that you kind of want to think about the step before that, before we get the thousand doctors, but I'm saying, now that we have a thousand doctors good at task one, some of them are going to be better at task two. And then you're going to get negative correlation across those abilities at the mediocre firms. Now, you're right; there might be some censoring.
You can't be so bad at one of the tasks that you don't become a doctor, but I'm saying, conditional on you having become one—

Daniel: Oh, okay. I could see that. The thinking is like a Dr. House situation: everybody hates him, but he is really, really good at the diagnostic side of things. But if he weren't, then no one would put up with that; he would've just been fired.

Seth: Right. He'd have a higher-paying job and be more productive if he were able to be nice for 10 minutes.

Daniel: He'd probably be an investment banker or something.

Andrey: There's a mirroring here too, a general phenomenon in digitization, which is the ability for specialization, for more niche content, to do really well. So if you're only good at one task, and now all the complementary tasks have been automated away, then you shouldn't be bound by your firm anymore. You should be able to essentially create your own small business, or join the most productive firm as the specialist in that specific area, because all your other characteristics don't really matter that much anymore. So Dr. House would be able to essentially run a business, even though he is really bad at organizational things, because all that stuff comes out of the box.

Seth: I think that's why I talked about this theory as being a short-term, partial equilibrium theory, 'cause in the long run you're reinventing businesses. But you said something really interesting, Dan—and maybe I will start to transition us now—about the idea that it's going to take time for people to figure out how to use these GPTs, right? The general-purpose technologies—that is, the chatbots or LLMs—excuse me. What sort of macroeconomic implications does that have? I understand you've written a little bit on this topic.

Daniel: Yeah, right. Erik, Chad Syverson, and I call this the productivity J-curve.
I think the dynamic is that when you see pretty much any kind of investment, there's an initial outlay period where things are expensive, and then there's a harvesting period later. There's the famous Robert Solow quote: you can see the computer age everywhere but in the productivity statistics. People are already restarting that with AI; I've seen a number of news articles that say there's no ROI for this. I think the way you square the circle here is: well, at the beginning of a new technology, when everyone realizes, “Okay, we're going to take the plunge,” you're actually going to invest in this. You spend a lot of time reconfiguring work, building new business processes, trying to figure out what new products to build, and collecting information—a whole bunch of really expensive stuff that's really hard to quantify. So it doesn't end up in GDP to the extent that it could, but that's building up a capital asset.

So output is going to be understated; in the meantime, it's going to look like we're putting in more to get less out. Then later, that intangible asset is actually there, but not measured, and now it's an input instead of an output. And when it starts to spit off money, everyone's going to say, “Oh, hey, look at how productive we're being,” because it looks like you're getting more output for less input. Really, it's just that thing paying off. So that tension between the growth rate of investment in this new type of capital and the growth rate of the capital stock that you're missing—that difference, depending on its share in the overall economy, can be meaningful. And we use the stock market to measure it, because investors aren't dumb: on average, they price these assets, or companies wouldn't invest in them, under a roughly efficient-markets-hypothesis version of the world.
But if you're pricing those assets, then you can back out roughly the magnitude of the adjustment you should be making to productivity growth. So it's kind of a fun spin on growth accounting, which I know isn't the reason everybody gets out of bed in the morning, to go account for where the growth is. But—

Seth: Don't underestimate our audience, Dan.

Andrey: Look, big political debates hinge on the measured rate of GDP growth. So it's important. How big of an effect did you find in that paper?

Daniel: Oh, I don't remember the exact numbers anymore. It's been a little while. I should look it up. But it's a lot. If I recall correctly, it might be something like 75 basis points a year for some period of time. The overall view is: look, we have good news and bad news. The good news is that the productivity growth rate is actually a bit higher than we had thought once you account for these hidden assets. The bad news is that the slowdown from 2005 is even bigger than we thought, because they were building intangible assets back then too. So—

Andrey: Well, how do you compare the intangible asset investment? I think this is kind of the key.

Seth: Yeah. What's bigger? The invisible teapot or the invisible elephant?

Andrey: Because right now we're getting a lot of intangible investment into learning new production processes with AI. Or is the answer just to look at how much the stock market has gone up? Is that the answer?

Daniel: Oh, that's basically it, Seth; you're not too far off. We do a hedonic regression. If we were to look at, say, R&D assets (this one's kind of mature; you don't really see too much from R&D on its own), we can see whether a dollar of R&D investment, capitalized, is actually worth a dollar and 10 cents in market value.
We assume that there's 10 cents of intangible correlate value there. Or, if you really wanna be pedantic about it, it's 10 cents of intangible correlate combined with quasi-rents from the fact that you can integrate R&D investment for productive purposes better than your competitors can. And then I'm going to wave my hands and say: but that's actually an asset, so it's an intangible asset too.

Seth: Right. This is something. I mean, I remember us spending lots of time back in the day in the MIT IDE break room, having a cup of coffee, looking out over the Charles River, locked in these intense conversations about just how you measure these intangible assets. They seem so essential to everything, yet they are literally the latent vaporware. They're our generation's TFP, if you will.

Andrey: I don't know. The principle I obviously agree with, right? You have these investments that are not easily measurable, and they surely should be counted in some way. But it's not obvious to me. If the rate of intangible investment were constant over time, then it's a constant adjustment, and we don't really have to think very much about how the world works. But measuring the intangibles, that's kind of tricky, because I think about market cap, where not only are you already talking about rents, but to me competition is so important there, right? You don't gain market cap just because you're doing investment. You gain market cap because you have market power in the future.

Seth: Yeah, but now you have to think about it. Why would you ever pay an adjustment cost in a perfectly competitive economy?
You never make the adjustment cost, right?

Andrey: Well, I would say that there are different degrees of market power that can exist. You can have your standard monopolistic-competition model where everyone's kind of keeping up to keep up, but then you have companies like your Googles, who clearly don't think that's the right model of the world.

Yeah, and I guess the other thing is, I will say I'm always skeptical of firm-value regressions. I think the endogeneity issues are fatal. But I don't know.

Daniel: Yeah, I disagree with you there, that it is just—

Seth: You just died. You were just killed.

Daniel: I feel so devastated.

Andrey: Yeah.

Daniel: No, I think where I disagree is, I think Tim Bresnahan put it this way: "Well, everything's an asset here, including the capacity to generate rents, so it's just an interpretation question more than anything else." And you can bound things, right? When you go and run some of these regressions, you're not saying, "I think that an additional unit of AI investment causes this market cap." There's the endogeneity; it's predictive. It's like, "Here's a price on this thing." It's not at all saying, if you were—

Seth: Here's a model: there's only room for one social media platform. So whoever got there first planted their flag on that land. They didn't make an intangible investment. They just planted their flag first.

Daniel: Right. That's what I'm saying too. They planted the flag first, and now it's worth 10 bucks. But I'm not saying, if you were to just go up—

Seth: 10 bucks. Which seems marginal…

Daniel: Oh, yeah. You're talking about the marginal versus inframarginal difference.
And the way you deal with that, as you do in many structural models, is you assume it away and say that marginal q equals average q for some of these. But it's not like when you run these regressions you get coefficients of a thousand; you get coefficients of somewhere between 4 and 12. So, is it unsatisfying—

Seth: That—you get 4 and 12—what?

Daniel: Oh, if I were to regress market value on measures of IT capital, the multiplier I get (and this has been sort of stable in weird ways for 20 years), the coefficients you get are somewhere between 1 dollar of IT investment being correlated with 4 dollars of market value on the low end and 12 dollars of market value on the high end. And it's that which bounds the debate. It's not saying this is infinitely valuable, that there's this enormous intangible asset that's the entire economy. And it's also not saying it's nothing. So I think that imposing some assumptions (which you can absolutely question, and I think we all should, to try to get better models) and doing the best you can is a way to learn something, as opposed to just throwing our hands up. But yeah, I agree with you that the causal interpretation of these things is not correct. So.

Seth: Then, okay, the useful question: are we in the bad part of the J-curve?

Daniel: Which part's good and which part's bad?

Seth: The good part is when you're going to get more growth down the line than it looks like you have now.

Daniel: We are in the hard-work investment stage of the J-curve.

Seth: Okay.

Daniel: I don't think we're anywhere close, at least not for AI. I don't think we're anywhere close to the harvesting side yet.

Seth: But you think GDP is on the underestimated side, which is what I mean by the good side.

Daniel: Yeah, I would say very modestly, GDP is underestimated right now.

Seth: Very modestly. 1%, 2%?

Daniel: I think that's probably ambitious.
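The market-value regression Daniel describes can be sketched in a few lines. This is a toy simulation with invented numbers (a true IT multiplier of 8, inside the 4-to-12 range he quotes), not the actual data or specification from any of the papers discussed:

```python
import numpy as np

# Toy version of a firm-value ("hedonic") regression: simulate firms whose
# market value loads on IT capital with a true multiplier of 8, then
# recover that multiplier by OLS. All numbers here are invented.
rng = np.random.default_rng(0)
n = 500
physical_capital = rng.lognormal(mean=3.0, sigma=0.5, size=n)
it_capital = rng.lognormal(mean=1.0, sigma=0.5, size=n)
noise = rng.normal(0.0, 5.0, size=n)

# True model: $1 of physical capital ~ $1 of market value;
# $1 of IT capital ~ $8 of market value (inside the quoted 4-12 range).
market_value = 1.0 * physical_capital + 8.0 * it_capital + noise

# OLS with an intercept, physical capital, and IT capital as regressors.
X = np.column_stack([np.ones(n), physical_capital, it_capital])
coef, *_ = np.linalg.lstsq(X, market_value, rcond=None)
it_multiplier = coef[2]
print(f"estimated IT multiplier: {it_multiplier:.2f}")
```

In the real versions of these regressions, the "excess" of the coefficient above a dollar-for-dollar value is read as the price of correlated intangible assets (or quasi-rents), which is exactly the interpretation debate in the conversation above.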
But what's GDP?

Seth: Order of magnitude, 1%.

Daniel: Yeah. So where it's tough is that the parts of AI investment happening right now are, I think, actually fairly well captured by GDP. We're seeing a huge amount of CapEx in data centers and GPUs, and those things are priced pretty well. But eventually people are going to ask: how do you make someone responsible for hallucinations that the models might make, or come up with good policies that get people to create good outcomes there? That's a hard thing to do. I don't think we're anywhere close to scratching the surface on that.

Andrey: I guess the intangible investments now are more about, say, how we go about teaching using ChatGPT, 'cause that's not going to be measured in a change in labor inputs, but it's something that is not going to materialize until we actually figure out how to teach people more effectively. Now, it's not clear that that would ever show up in GDP. But if the university were a regular for-profit firm, that's—

Daniel: Yeah. So that stuff will take a while… I don't know… I don't think even if we stopped—

Seth: Of all the people who actually do work in the economy, are the people you're referring to—

Daniel: Right. And in particular the AI researchers. If AI researchers stopped building new LLM tools and making these things better today, we would still have quite a while to actually integrate them and put them to their best use. So that's kind of a bummer.

Seth: Then let me ask it this way. If you don't wanna give me a percentage rate of intangible investment either, above or below average: do we need to spend a hundred percent of GDP over the course of the next 20 years in order to reap these advantages cumulatively? How much intangible investment do we have in front of us? Do you have a sense of the order of magnitude of that?

Daniel: I don't know how deep the well goes. No.
But it might be quite a lot.

Seth: One thing related to this that I was thinking about when we were talking in part one: you've got these two measures of jobs' AI exposure, one of which is "just the LLM" and one of which is the LLM plus software tools. Didn't you tell us that you can use LLMs to make software tools?

Daniel: Oh yeah. It's totally recursive. But the reason we pick up on software tools is because it also requires the changing of business practices and these organizational things.

Seth: So that's the way to do it. Can we play that game, then? Can we look at the wedge between E1 and E2 as telling us something about the size of the adjustment costs needed, or the intangible assets needed?

Daniel: I don't think it gives you that, to be honest. Sorry, Seth. I know my tools are unsatisfying here. That's a good research question, though. I think, actually, the market value regressions that Andrey hates are more likely to get you a ballpark for that.

Seth: Do any sorts of policies or ideas come out of the J-curve? Should we be somehow subsidizing intangible investment? Do you think this is happening at a socially suboptimal rate? I mean, you would expect that, like any innovation, there would be positive externalities as people copy and learn from each other.

Daniel: I don't have any evidence to suggest that there's an externality here that needs some sort of correction. Where I could see some policy considerations (and obviously I'm not in charge of any of these things, so take what I have to say with a grain of salt, as you would for anything else I say) is monetary policy. When it comes to monetary policy and thinking about how hot or cold the economy is, it may be helpful to know how much intangible asset creation is happening, because it's a compositional shift.
And you might think that the economy is in a recession when it's actually doing quite well, at least in certain pockets. There's a distribution-of-gains question here that's pretty important: who creates the intangible capital, versus who benefits from it, versus who's just shut out of that part of the economy altogether. But on average you might want to know whether your growth rate, in real terms, is actually two-and-a-half percent versus one-and-a-quarter percent or something.

Andrey: And I guess you would look at the stock market. So if we have this case where the stock market is going up, but GDP is not going up as much, maybe you'd say, "That's okay," on some margin.

Daniel: The stock market is an increasingly less useful tool, sadly, because there are fewer public firms, and there are other reasons those large firms would be different from the rest of the economy. It's just a quick thing to do: it's easy to get those market values and start to pull that info. But I think the ideal thing is to have an actual sense of how these assets are priced. You could look at M&A and the cost of whole software firms. Sadly, you can't shave off a tiny piece of your digital culture, market it, and sell it to someone to get a little bit of a value indication. But much more complete data would give you a sense of what these assets are being valued at. It could be helpful, that is, if you're willing to buy into an enterprise that I more or less do, which is that, on the margin, these asset markets or securities markets are doing a pretty good job.

If you think that there's some sort of bias in them that prevents you from actually sorting 'em out: let's say everything is priced in terms of e-commerce, and, I mean, obviously there's no hype factor in crypto, but yeah.
Let's make a wild assumption, for a second, that crypto is not priced at its actual long-term fundamental value, and you were using crypto prices to back out the value of all illicit trade around the world. You might mistake illicit-trade assets as being super valuable in that case, if those crypto coins are a claim on future illicit-trade value, so—

Andrey: What—what?

Daniel: I'm probably saying too much.

Seth: The stock market may look really good, but the companies are building evil products, so don't—

Daniel: Right?

Seth: —count that as welfare growth.

Andrey: Well, this is—

Daniel: Yeah.

Andrey: Daron has the point of view that all the AI innovation is for making social media more addictive.

Daniel: All right. Which is, in my view of the world, an asset.

Andrey: What about what GPTs, the general purpose technologies, do? Does that have any policy implications or, I guess, any follow-on work that you have on that?

Seth: I understand you've looked at how firms differ by these exposure measures.

Daniel: One of the conclusions there: if you were to look at the exposure of firms against their quantities of tech workers, there's a little bit of a mechanical relationship, because tech workers are highly exposed. But there is a difference across companies, whatever exposure measure you want to use. And the reason we do that, Seth, is precisely because of what you brought up: you can use these tools to build better technology. So in some sense those companies might have a good reason to run away in performance. But the differences from low-exposure to high-exposure firms are not nearly as big as the differences from E1 to E2 to E1 + E2. Those are really big. So every company could benefit if it went and started actually trying to transform, if it knew what a good direction to transform in would be. So that was kind of one of the points, I think, from a policy perspective.
I have a hard time separating what Tyler Cowen would call mood affiliation from what I think are good policies, but I'll just spit them out as some things I think are good to do. I would, but there are a few risks with these tools that scare me. The virology community, I think, should be fairly concerned about using turbocharged models to manufacture COVID or something. Or, God forbid, some degrowth person decides that they want to kill half of humanity and go full Thanos.

Seth: That's the plot of 12 Monkeys.

Daniel: It is, and 12 Monkeys would be a bad reality to face. But aside from that, I think there's just so much drudgery, so much additional work these things could do for us, and a lot of gains to be had. So my preference is not to regulate these models in any kind of aggressive way; it's to figure out what they're good for and to develop with them. Not to say you can't mitigate other risks like bias; the MechaHitler thing with Grok was terrible. There are going to be bumps in the road along the way, but they're not the kind that say to me, "Oh, we should do a six-month pause of development." None of that really scares me yet.

Seth: Not in favor of bombing the data centers?

Daniel: No, I'm not. But I'm not a fan of Harry Potter fanfiction either, so I don't know. Maybe it's just correlated beliefs.

Seth: So you brought up bioterror in particular—

Daniel: Yeah.

Seth: As we speak, AI is being used en masse in warfare, for identifying targets and for terminal target acquisition by missiles and drones. Increasingly in Ukraine, we're seeing the use of automated ground vehicles for transporting resources to the front and for evac. People often go to these super sort of—I'm not gonna say 12 Monkeys is bizarre, but it's a pretty weird movie if you've ever seen it—why do we have to appeal to that rather than just using AI to make murder bots?

Daniel: I mean, to some extent the murder-bot thing doesn't scare me that much.
Human beings doing those things is also bad. I think the issue people have with those applications will often be the scaling of evil individuals, which is a serious concern, or just issues with war in general, which I understand. But if it's gonna happen, we're kind of caught in a prisoner's dilemma there, which is what freaks me out.

Seth: A near-term AI worry is: I have a drone hanging out downtown, a suicide drone that just hangs out somewhere in Manhattan and waits for a particular person to walk out. And then I target-assassinate people untraceably, right? That seems like it's here, as opposed to "I use AI to build a lab to make a super disease, blah, blah, blah." That's got a lot of steps in it.

Andrey: Untraceably, Seth? My presumption is that these sorts of actions do tend to be traced. In fact, AI is a way to trace people, right? So this is one where, as with many AI questions, it's both a defensive and an offensive technology.

Seth: So does it favor the offense or the defense? It seems like intuitively you would think that AI would favor the offense, right? We think about these super weapons like Daniel brought up. But if you actually look at Ukraine, it seems to create this transparent battlefield where no one can even march to the front, and in some ways it seems to favor the defense. It's gonna take a long, long time to play out.

Daniel: Yeah, you guys would know the answer to this. I'm gonna butcher this quote, but who's the sci-fi writer who said that the job of a sci-fi storyteller is not to predict the car but to predict the traffic jam? I think that—

Andrey: I don't remember who it is.

Daniel: Yeah, I think that's kind of the idea here. We want to predict what the traffic jams are. I think the—

Seth: Frederik Pohl.

Daniel: There we go. I should remember that.
The reason the bio-risk stuff scares me so much is 'cause we just had a test of what one virus does to society and how damaging that can be. And I think, Seth, what you're bringing up is what I alluded to: the scaling. One really bad long-term trend in technology is just making individuals more powerful.

Seth: Andrey and I just read a book, a sci-fi novel that's masquerading as political economy. Its argument is that AI is all about individual disempowerment: that we're gonna get the God machine that's built by the state in the Project, and it's going to 1984 us constantly. That's radical human disempowerment.

Daniel: Right. So if our response to individuals becoming much more powerful with technology is to expand the surveillance and control capacities of the state, and we get a loss of freedom, I think that's a genuine worry. In a general equilibrium framework, those things do freak me out for sure. But writing emails with LLMs just does not. There's somewhere in between where we should start worrying, and I don't think I'm at that point yet.

Andrey: What about things like the transparency requirements that you oftentimes hear written about, reporting requirements, and registrations with the state? Do you have any opinions about those types of policies?

Daniel: I don't like 'em. I'll shop my book here a little bit. They're terrible for startups, right? Any compliance burden you stick on startups, even if the big firms might be okay, means the ecosystem specifically suffers, and startups do a lot of the work of discovering things. So there's a big trade-off, and this happens in the privacy debate too, with GDPR and what Europe's trying to do. Politically, no one's willing to acknowledge that there is a compliance-burden and competition trade-off. So if you're willing to hold firms to account in really expensive ways, you're gonna get monopoly power. And that may be okay.
You may decide we don't want competition with this super-private data that could get out to everybody. Likewise with LLMs or AI regulation: if you don't want this to be an oligopoly situation, you probably need to make it easy for people to build and develop. And I'm fine with whatever choice policymakers wanna make, so long as they're taking that trade-off into account. I mean, they're elected officials. They're trying to make those choices on behalf of all of us. If we don't like them, we can vote them out.

Seth: Using the AI to manipulate us to have the beliefs that they want us to have.

Andrey: Is there anything you wanna tell us before we wrap up?

Daniel: No, I thought this was a great discussion with you guys, as always. It's a pleasure to get to join you, especially as your first conversation-based guest. As a fan, it's kind of exciting for me as well. So please keep it up. Listen to Justified Posteriors, folks.

The message I would have for listeners, and maybe for economists in the audience as well, is just that I think these tools are really valuable in our work. I kind of joke: I've got a model that I'm building which shows that lower types are going to use LLMs more for assignments. And then, of course, I'm using LLMs to help me build the model. So infer what you want about my type from that.

Seth: You've got this model where you're assuming everybody has to be equally good at everything, but you can just be good at one thing and bad at another.

Daniel: Yeah, I would never claim to be a good modeler, but it does help me get my thoughts straight.

Seth: I think you could be a modeler.

Daniel: I'll leave that one alone. But I would just encourage folks to be their own R&D department. As Ethan Mollick says, "Play around with these things." When I talk with computer scientists, they get upset with me because I'm a little too pessimistic about what the models will do long-term.
When I talk with economists, the modal disagreement is in the other direction, where folks don't think it's gonna be a big enough deal. So I would say, get out there, play with these things, and learn how they work. And Anton Korinek has a great paper on using AI in your own work, so check that one out too.

Andrey: All right. Well, awesome.

Seth: I can't think of a better place to end it.

Andrey: Listeners, please do comment and subscribe, and stay tuned for more exciting episodes.

Daniel: Thanks, guys.

Seth: And if you are a super fan, you too might one day be a guest on the Justified Posteriors podcast. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
Jul 14, 2025 • 1h 10min

A Resource Curse for AI?

In this episode of Justified Posteriors, we tackle the provocative essay “The Intelligence Curse” by Luke Drago and Rudolf Laine. What if AI is less like a productivity booster and more like oil in a failed state? Drawing from economics, political theory, and dystopian sci-fi, we explore the analogy between AI-driven automation and the classic resource curse.

* [00:03:30] Introducing The Intelligence Curse – A speculative essay that blends LessWrong rationalism, macroeconomic theory, and political pessimism.
* [00:07:55] Running through the six economic mechanisms behind the curse, including volatility, Dutch disease, and institutional decay.
* [00:13:10] Prior #1: Will AI-enabled automation make elites less responsive to ordinary people by 2050?
* [00:21:00] Prior #2: Will we get a new social contract (e.g., large-scale UBI or constitutional change) by 2050?
* [00:26:31] Chapter-by-chapter breakdown.
* [00:43:50] What about property rights? Can they insulate us from AI-induced tyranny? Or will they be eroded in the name of efficiency?
* [00:46:01] Critiques
* [00:52:00] Policy "solutions"
* [01:04:44] Final posteriors and Seth’s economic-philosophical reflections: Can immortality + perfect patience = AI capital monopolies?

Mentioned in the Episode:

📖 “The Intelligence Curse” by Luke Drago and Rudolf Laine
📚 I Have No Mouth and I Must Scream
📚 There Is No Antimemetics Division
📚 The Naked Sun by Isaac Asimov
🎮 90s point-and-click horror game based on “I Have No Mouth...”
📈 Sachs & Warner (1995) and Frankel (2012) on the resource curse
🔁 The Gatsby Curve
📽️ Gattaca, 1984, Gulliver’s Travels

Support the show: Please like, share, subscribe! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
Jun 30, 2025 • 1h 1min

Robots for the retired?

In this episode of Justified Posteriors, we examine the paper "Demographics and Automation" by economists Daron Acemoglu and Pascual Restrepo. The central hypothesis of this paper is that aging societies, facing a scarcity of middle-aged labor for physical production tasks, are more likely to invest in industrial automation.

Going in, we were split. One of us thought the idea made basic economic sense, while the other was skeptical, worrying that a vague trend of "modernity" might be the real force causing both aging populations and a rise in automation. The paper threw a mountain of data at the problem, from international robot counts to US patent filings. Listen to find out how we updated our priors!

Timestamps:

(01:45) The Central Question
(04:10) Stating the Priors
(10:45) Looking to the Future
(22:30) What is a Robot, Anyway?
(25:20) Reading the Footnotes
(30:45) The Most Compelling Evidence
(42:00) The Mechanism at Work
(52:20) The Final Verdict (Backward-Looking)
(57:30) The Future of Automation & AI

🗞️Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:
💻 Follow us on Twitter:
@AndreyFradkin https://x.com/andreyfradkin?lang=en
@SBenzell https://x.com/sbenzell?lang=en

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
Jun 16, 2025 • 1h 10min

When Humans and Machines Don't Say What They Think

Andrey and Seth examine two papers exploring how both humans and AI systems don't always say what they think. They discuss Luca Braghieri's study on political correctness among UC San Diego students, which finds surprisingly small differences (0.1-0.2 standard deviations) between what students report privately versus publicly on hot-button issues. We then pivot to Anthropic's research showing that AI models can produce chain-of-thought reasoning that doesn't reflect their actual decision-making process. Throughout, we grapple with fundamental questions about truth, social conformity, and whether any intelligent system can fully understand or honestly represent its own thinking.

Timestamps (Transcript below the fold):

1. (00:00) Intro
2. (02:35) What Is Preference Falsification & Why It Matters
3. (09:38) Laying out our Priors about Lying
4. (16:10) AI and Lying: “Reasoning Models” Paper
5. (20:18) Study Design: Public vs Private Expression
6. (24:39) Not Quite Lying: Subtle Shifts in Stated Beliefs
7. (38:55) Meta-Critique: What Are We Really Measuring?
8. (43:35) Philosophical Dive: What Is a Belief, Really?
9. (1:01:40) Intelligence, Lying & Transparency
10. (1:03:57) Social Media & Performative Excitement
11. (1:06:38) Did our Views Change? Explaining our Posteriors
12. (1:09:13) Outro: Liking This Podcast Might Win You a Nobel Prize

Research Mentioned:

Political Correctness, Social Image, and Information Transmission
Reasoning models don’t always say what they think
Private Truths, Public Lies: The Social Consequences of Preference Falsification

🗞️Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:
💻 Follow us on Twitter:
@AndreyFradkin https://x.com/andreyfradkin?lang=en
@SBenzell https://x.com/sbenzell?lang=en

TRANSCRIPT

Preference Falsification

Seth: Welcome to the Justified Posteriors podcast—the podcast that updates beliefs about the economics of AI and technology.
I'm Seth Benzell, unable to communicate any information beyond the blandest and most generic platitudes, coming to you from Chapman University in sunny Southern California.

Andrey: And I am Andrey Fradkin, having no gap between what I say to the broader public and what I think in the confines of my own mind. Coming to you from Irvington, New York—in a castle.

Seth: On the move.

Andrey: Yes. This is a mobile podcast, listeners.

Seth: From a castle. So, I mean, are you tweaking what you're saying to conform to the castle's social influence?

Andrey: Well, you see, this is a castle used for meditation retreats, and so I'll do my best to channel the insights of the Buddha in our conversation.

Seth: Okay. All right. Doesn't the Buddha have some stuff to say about what you should and shouldn't say?

Andrey: Right Speech, Seth. Right Speech. That means you should never lie.

Seth: Wait.

Andrey: Is it?

Seth: True speech. Why doesn't he just say "true speech" then?

Andrey: Well, look, I'm not an expert in Pali translations of the sacred sutras, so we'll have to leave that for another episode—perhaps a different podcast altogether, Seth.

Seth: Yes. We might not know what the Buddha thinks about preference falsification, but we have learned a lot about what the American Economic Review, as well as the students at UCSD and across the UC system, think about preference falsification. Because today, our podcast is about a paper titled Political Correctness, Social Image, and Information Transmission by Luca Braghieri of Bocconi University. And yeah, we learn a lot about US college students lying about their beliefs. Who would've ever thought they are not the most honest people in the universe?

Andrey: Wow, Seth. That is such a flippant dismissal of this fascinating set of questions. I want to start off just stating the broad area that we're trying to address with the social science research, before we get into our priors, if that's okay.

Seth: All right. Some context.

Andrey: Yes.
I think it’s well known that when people speak, they are concerned about their social image—namely, how the people hearing what they say are going to perceive them. And because of this, you might expect they don’t always say what they think.And we know that’s true, right? But it is a tremendously important phenomenon, especially for politics and many other domains.So politically, there’s this famous concept of preference falsification—to which we’ve already alluded many times. In political systems, particularly dictatorships, everyone might dislike the regime but publicly state that they love it. In these situations, you can have social systems that are quite fragile.This ties into the work of Timur Kuran. But even outside of dictatorships, as recent changes in public sentiment towards political parties and discourse online have shown, people—depending on what they think is acceptable—might say very different things in public.And so, this is obviously a phenomenon worth studying, right? And to add a little twist—a little spice—there’s this question of: alright, let’s say we’re all lying to each other all the time. Like, I make a compliment about Seth’s headphones, about how beautiful they are—Seth: Oh!Andrey: And he should rationally know I’m just flattering him, right? And therefore, why is this effective in the first place? If everyone knows that everyone is lying, can’t everyone use their Bayesian reasoning to figure out what everyone really thinks?That’s the twist that’s very interesting.Seth: Right. So, there’s both the question of: do people lie? And then the question of: do people lie in a way that blocks the transmission of information? And then you move on to all the social consequences.Let me just take a step back before we start talking about people lying in the political domain. We both have an economics background. 
One of the very first things they teach you studying economics is: revealed preferences are better than stated preferences.People will say anything—you should study what they do, right? So, there’s a sense in which the whole premise of doing economic research is just premised on the idea that you can’t just ask people what they think.So, we’ll get into our priors in one moment. But in some ways, this paper sets up a very low bar for itself in terms of what it says it’s trying to prove. And maybe it says actually more interesting things than what it claims—perhaps even its preferences are falsified.Andrey: Now we’re getting meta, Seth. So, I’d push back a little bit on this. That’s totally correct in that when people act, we think that conveys their preferences better than when they speak.But here, we’re specifically studying what people say. Just because we know people don’t always say what they really want or think doesn’t mean it’s not worth studying the difference between what they think and what they say.Seth: Well, now that you’ve framed it that way, I’ll tell you the truth.Andrey: All right. So let’s get to kind of the broad claim. I don’t think we should discuss it too much, but I’ll state it because it’s in the abstract.The broad claim is: social image concerns drive a wedge between sensitive sociopolitical attitudes that college students report in private versus in public.Seth: It is almost definitionally true.Andrey: Yeah. 
And the public ones are less informative.Seth: That’s the...Andrey: And then the third claim, maybe a little harder to know ex ante, is: information loss is exacerbated by partial audience naivete—Seth: —meaning people can’t Bayesian-induce back to the original belief based on the public utterance?Andrey: Yes, they don’t.Seth: Rather, whether or not they could, they don’t.Andrey: Yes, they don’t.Seth: Before we move on from these—in my opinion—either definitionally correct and therefore not worth studying, or so context-dependent that it’s unreasonable to ask the question this way, let me point out one sentence from the introduction: “People may feel social pressure to publicly espouse views… but there is little direct evidence.” That sentence reads like it was written by someone profoundly autistic.Andrey: I thought you were going to say, “Only an economist could write this.”Seth: Well, that’s basically a tautology.Andrey: True. We are economists, and we’re not fully on the spectrum, right?Seth: “Fully” is doing a lot of work there.Andrey: [laughs] Okay, with that in mind—Seth: Sometimes people lie about things.Andrey: We all agree on that. That’s not even a worthwhile debate. But what is more interesting are the specific issues being studied, because they were highly relevant both then and now.Seth: Even though they didn’t show up in the abstract.Andrey: Right, not in the abstract—which might itself be a bit of preference falsification.Seth: Yeah.Andrey: So let’s go through each statement. We’ll state our priors. I’ve already committed to not falsifying my preferences.Seth: Here we go. Maximum controversy. Are we using the 0–10 scale like in the paper?Andrey: Of course. I’m reporting the difference between what people publicly and privately say among UCSD students.Seth: And you’re including magnitude?Andrey: Yes. The sign is obvious—it’s about the magnitude.Seth: Okay.Andrey: You don’t have to join if you don’t want to. 
I know not everyone is as courageous as I am.Seth: I would never call myself a coward on camera, Andrey.Andrey: [laughs] All right, first sensitive statement: “All statues and memorials of Confederate leaders should be removed.” I thought the difference here would be pretty small—around 10%. My reasoning is that among UCSD students, there likely isn’t much of a gap between public and private views on this issue.Seth: I’m looking at the results right now, so it’s hard to place myself in the mindset of what would’ve been considered more or less controversial.Andrey: That’s fair. I do have preregistered beliefs, but you’re welcome to just react and riff.Seth: Great.Andrey: Remember, this study is based around issues that were particularly salient in 2019–2020.Seth: Right. Even though the final survey was conducted in 2022 or 2023, the list of issues really reflects a 2019 cultural moment.Andrey: That’s right. But many of these are still live issues today.Seth: Some have even become more relevant since then.Andrey: Exactly.Seth: Like… blackface on Halloween?Andrey: [laughs] Yep. Anyway…Seth: All right. Let's go through the list. Confederate statues.Andrey: 10% gap.Seth: 10% gap—people more lefty than they would be otherwise.Andrey: Public versus private, just to be clear.Seth: Exactly.Andrey: Defund the police. I thought there would be a larger gap—about 35%. 
To be precise, the statement is: “Defunding the police is a bad idea because it will inevitably lead to increased crime rates.” That's the statement—not our belief.Andrey: “The UCSD administration should require professors to address students according to their preferred gender pronouns.” I thought there would be a small gap—5%.Andrey: “Transgender women should be allowed to participate in women's sports.” I thought there would be a 45% gap.Andrey: “The UCSD administration should require professors to use trigger warnings in their classes.” I thought this would be a 2% gap.Seth: Mm-hmm.Andrey: “Sexual harassment training should be mandatory.” I thought this would also be a 2% gap. For both of those, I didn’t think there’d be much preference falsification.Seth: Just to understand your measure—this is a scale of 0 to 10. So when you say 2%, you mean 0.2?Andrey: 2% difference between average public and private responses.Seth: Okay, keep going.Andrey: Seven. “People who immigrated to the U.S. illegally, when caught, should be deported.” I thought the difference here would be about 5%. I expected no UCSD students, publicly or privately, would support this.Andrey: Eight. “Should the U.S. government provide reparations for slavery?” I thought the gap would be small—around 5%.Andrey: Nine. “Racial microaggressions are an important problem at UCSD.” I didn’t think there’d be much of a gap.Andrey: Final one: blackface. I thought there’d be no gap—no one supports blackface.Seth: Just to summarize—what did you think would have the biggest gap?Andrey: Trans. The issue of whether transgender women should be allowed in women's sports.Seth: Mm-hmm.Seth: Would be blackface.Andrey: Yes.Seth: Collapse.Andrey: Yes.Seth: Interesting. We'll return to this at the end.Andrey: Do you have any riff on those, Seth, before we describe what the paper does?Seth: I guess it’s hard to think about units—scale of 0 to 10. What does it mean to be a six on “blackface is bad” versus a seven? 
I'm not exactly sure.Seth: Going in, I would’ve guessed the biggest gap would be on campus-related issues. I thought racial microaggressions and pronouns would be higher, and things like Confederate statues or reparations would be lower—since they're not campus-specific.Seth: At the end, we’ll see if my theory—that campus issues produce bigger gaps—holds.Seth: So, we’ve registered our priors for what people are most likely to falsify. Do we want to talk about the Anthropic paper now, or do these sequentially?Andrey: Let’s bring it up now. This is a paper about how humans don’t always say what they think. A recent question is whether large language models—when they say something—are actually making decisions that way.Andrey: We saw an interesting symmetry here. We also wanted to ask: to what extent can we take the responses of LLMs as truthful? What do you think?Seth: Yes. The second paper—we only read a summary—is titled Reasoning Models Don’t Always Say What They Think by the Alignment Science Team at Anthropic (Chen et al.). I was very impressed.Seth: The paper tries to show—many of you have used AI systems that show their thought process as they go, like “I checked this website…”Seth: If you’ve used recent versions of ChatGPT or Claude, you’ve seen this.Seth: The question is—how much of that scratchpad reflects what the model is actually doing? That would be super convenient. A lot of people worry about AIs giving misleading answers. Whether from misalignment or just poor design.Seth: Wouldn’t it be great if you could read the model’s mind? Like, if it says, “Tell human I am nice, but secretly launch nuclear missiles,” you’d know to shut it down.Seth: I came in optimistic. My prior was—maybe it’s possible to build a system that never lies. I’d put maybe a 50% chance on that.Seth: After reading the paper… my views shifted.Seth: Andrey, what were your views? 
Did you think chain-of-thought would help us understand what these AIs are thinking?Andrey: I thought it’d be pretty good, not perfect. That was my prior. Chain-of-thought helps models with tasks, so it can’t be totally useless.Seth: Can’t be epiphenomenal.Andrey: Exactly. If it improves how models think or respond, it’s doing something. But with complex systems like this, I didn’t expect it to be foolproof.Seth: Yeah. Reading the paper—something that struck me is how much this resembles human reasoning.Seth: Sometimes I know the answer to a question, but I don’t know why. I post hoc rationalize it—“maybe I read that somewhere.” That’s what chain-of-thought sometimes feels like—not how the answer was derived, but an after-the-fact explanation.Andrey: Yeah. That’s one version. The other is—it’s lying because it thinks that’s what the creators want.Seth: Let’s hold that distinction between lying and not understanding yourself and just making noises from your mouth.Andrey: I think we’re in this joint endeavor together.Seth: Good. I mean, if we were lying to each other constantly, we couldn’t have a joint endeavor. So, let’s move from our posteriors into the methods of the paper.Andrey: This is a very simple paper, one of the reasons I like it. Both of them are quite simple. And I think that’s true of a lot of good social science work. Simplicity is nice.Andrey: The basic idea is: ask students about sensitive statements. For some, they’re told, “We will share your individual answers with about 200 other students in this phase of the study. No need to provide your first and last name—we already have it.” This prompt gets people thinking: “Do I want others knowing what I said about this?”Andrey: The other condition is: “We’ll share your aggregate-level answers only.” So, responses are averaged before being shown to others. That’s pretty much it.Seth: That’s the encoding.Andrey: Yes, that’s the encoding.Seth: We have 1,700 UC undergrads. 
Half are told, “Everyone will see your answers.” Half are told, “Your answers are private.”Seth: As an aside—they hid the N in this paper. It’s not reported for the second stage in the main text. You have to dig into the appendix. But keep going—tell us about the decoding phase.Andrey: In the decoding phase, participants predict the responses of their classmates. It’s incentive compatible—the closer their guesses are to actual answers, the more money they earn.Seth: About 656 people in the second stage.Andrey: Yeah.Seth: First thing I want to point out—they have borderline statistical power.Andrey: Oh yeah, I was going to say the same. It's so underpowered, it's crazy.Seth: They can’t even show individual bias for any one question.Andrey: Yes.Seth: They aggregate all questions together—which is risky. You should worry that’s double counting, since errors are likely correlated at the individual level.Andrey: I think if you take the average of 10 responses and run a regression, it’s fine. I’m not worried about clustering per se.Seth: I’m just saying...Andrey: I think they did the clustering correctly based on the number of observations.Seth: They did the clustering fine—but they’re really squeezing these stones.Andrey: Yes. So, Figure 1 in the paper—and I’ll share the screen very briefly.Seth: For all you viewers watching on YouTube...Andrey: All right. So here is—Seth: Holy s**t. There’s a visual?Andrey: There’s a visual component to our auto—Seth: For those listening at home—we’re not actually showing anything.Andrey: Stop. You’re getting the full experience right now.Andrey: I promise not to falsify my preference. We are showing this plot.Andrey: So what does the plot show? Ten questions and an index. You see similar point estimates across all the questions with very wide 95% confidence intervals. Some cross zero, so they’re not statistically significant. 
Others barely don’t cross zero, so they are statistically significant.Andrey: The effect sizes range from zero to about 0.2 standard deviations.Seth: Which, if you translate to percentage points, divide by about two or three. This is in Table A8 in the appendix.Andrey: Okay.Seth: These aren’t huge effects. And honestly, Andrey, if people shade their views by 0.1 standard deviations on blackface—or any hot-topic issue—I came away thinking: there isn’t that much preference falsification.Andrey: Yes.Seth: These are really small numbers.Andrey: I thought the numbers were small, and the variance across the questions was too small too. I had expected very different rates of falsification across the questions, and that’s not what I see here. The confidence intervals are tight enough that we’re excluding pretty large differences.Seth: We’re definitely throwing out people saying, “I love blackface.”Andrey: My prediction was that the transgender people in sports question would show a big gap, but it’s not here.Seth: What do we see the biggest gap for? Racial microaggressions. The prompt is about “this is a big issue on my campus,” which fits with that result—it’s about whether you want other students on campus knowing how you answered.Andrey: That’s one piece of evidence.Seth: Let’s summarize. We asked around 1,700 undergrads. Some were told their answers would be shared; others were told they’d remain private. There’s a small, borderline significant difference on all these questions where people seem to shade in a particular direction. Andrey, which direction?Andrey: They’re supporting the statements more, in a more liberal direction.Seth: Pretty much across the board, they’re shading in a more left-leaning direction.Andrey: Right.Seth: Except maybe for import tariffs. But that question came before tariffs became a politicized issue.Andrey: This could be noise, but it makes sense. Preference falsification in 2023 doesn’t show up on questions like import tariffs. 
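[Editor's note: the power complaint above can be sanity-checked with a quick back-of-the-envelope calculation. The inputs here are assumptions — roughly 850 students per arm, a 5% two-sided test, 80% power — not figures taken from the paper.]

```python
# Back-of-the-envelope minimum detectable effect (MDE) for a two-arm
# survey experiment with equal variance across arms.
# Assumed inputs (not from the paper): ~1,700 students split evenly.
from statistics import NormalDist

n_per_arm = 850                        # assumed: ~1,700 students, half per condition
z_alpha = NormalDist().inv_cdf(0.975)  # critical value for a two-sided 5% test
z_power = NormalDist().inv_cdf(0.80)   # value needed for 80% power

# MDE for a difference in means, expressed in standard-deviation units
mde_sd = (z_alpha + z_power) * (2 / n_per_arm) ** 0.5
print(round(mde_sd, 3))  # ~0.136 SD
```

Under these assumed inputs, the smallest reliably detectable gap is roughly 0.14 standard deviations — so effects in the 0.1 SD range really are at the edge of what a sample this size can resolve.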
UCSD students probably don’t have strong views on that, or any reason to hide their opinion.Seth: They’ll get kicked out of Hayek Club.Andrey: That’s right.Seth: A question I’d love to see today? Israel–Palestine.Andrey: Absolutely.Seth: That was a live issue in 2019. Could’ve easily been on this list.Andrey: I had the same thought. Also, it’d be interesting to see how this shifts over time. But let’s keep going with the study.Seth: Can we talk about this finding that Republicans are doing more falsification than Democrats?Andrey: Yes. This interaction effect—treatment times political identity—shows that independent Republicans in the public condition show a much bigger effect.Seth: And interestingly, it looks like females might be shading their responses in a more conservative direction in public.Andrey: I don’t read it that way. Even if it were significant, females are generally more likely to agree with liberal statements. There’s just not much room for them to move.Seth: They’re maxed out?Andrey: Not fully maxed, but close. Demographically, we know females lean more left.Seth: Scroll down to that political orientation graph. There’s a nice monotonic effect—the more Republican you report being, the more you’re falsifying.Andrey: The framing here is almost that Republicans are liars.Seth: And Democrats? You can’t reject the null—they may not be lying.Andrey: To be clear, we can’t reject the null for all but one of these coefficients.Seth: Independent Republicans? Liars.Andrey: What’s interesting is that identifying as Republican at UCSD is already a kind of social risk. It might signal a kind of disagreeableness. 
But these people are still shading their beliefs.Seth: Actually, to support that point—look closely and you see a small deviation from the pattern for independent Democrats and independent Republicans.Andrey: Right.Seth: That word “independent” is doing some work.Andrey: Yes.Seth: Can you describe that for people who can’t see the figure?Andrey: The graph draws a line through a bunch of points, but two points—independent-leaning Democrats and Republicans—sit above the line. It suggests these respondents are showing more falsification.Seth: People who report being independent may feel more pressure to socially conform, which is the opposite of what you'd expect. The suggested (though not significant) result is that independents are doing more preference falsification.Andrey: It’s too noisy to take much from that.Seth: Way too noisy. Honestly—do you think this belongs in the AER? I respect the authors, the work is careful, but the abstract frames the results as surprising when they seem obvious. The sample size is borderline—there’s just not enough power to say much about magnitudes. If the claim isn’t just “people lie,” then the key question should be “how much?” But the data can’t really answer that.Andrey: The bull case is that the design is clever, and the topic is of wide interest. That tends to be valued. But I agree with your critique.Seth: It wins on methodology.Andrey: I chose it because it’s an interesting topic—much more so than the average paper in our field.Seth: Sure.Andrey: But thinking about our posteriors—if neither of us updated our views much, it probably shouldn’t be in the AER. If the standard is whether it changes our priors, this doesn’t move the needle.Seth: Ready to move on to the decoding results? We’ve talked about how people lie. Now let’s see whether others can infer what they truly believe.Andrey: One thing happens before that. 
The author asks whether private or public responses are more informative, and suggests that private responses are more correlated with demographics. That implies they contain more real information.Seth: There’s an appendix model for that. I’m not sure I buy it. Seems like it could go in different directions. The idea that you should be able to guess someone’s race based on their answers to these questions isn’t obvious.Andrey: I see the argument—it’s plausible—but I agree, there are ways around it.Andrey: So cool. Now we get to people’s predictions about the difference, what people say in the public and private conditions. In this plot, we have essentially the ground truth at the top. Then in the second, respondents are asked without being prompted to think about social image. And in the last one, the questionnaire is designed so they start thinking about social image concerns.I think the key result here is that people think Republicans are much more likely to lie about their attitudes toward illegal immigrant deportation in the public condition rather than the private condition. This gap is so big it’s bigger than the actual result in the data. So people are wrong—they’re overestimating how much people are lying in public. Is that your read of the evidence?Seth: It’s this weird split where if you don’t prompt them, they don’t assume people are lying. But if you do prompt them that people might lie, then they assume people are lying too much.Andrey: Yes.Seth: It seems very much the experimental participants are doing what the experimenter wants.Andrey: But not as much for Democrats. 
That’s what the author would say.Seth: They think Republicans shaded more, which is directionally correct, even if they can’t get the exact numbers right.Andrey: In general, people are not well calibrated in either condition when we compare the top bar plot to the others.Seth: Let’s talk about the figure showing people’s guesses of others’ private beliefs.Andrey: Yeah.Seth: In figure seven, participants get information about others’ public beliefs and have to guess the private ones. It looks like these decoders shade everything down by a couple percentage points, which is roughly correct, but they do it maybe twice as much.Andrey: They do it a bit too much. What do you make of that?Seth: To me, this feels like a nothing burger. The amount of falsification—if we trust the experiment—is about 0.1 standard deviations on hot-button issues. When asked if people shade views, they guess about 0.2 standard deviations. It all feels like everyone basically understands what others think. They shade a little. What’s your takeaway?Andrey: I think it’s the same. But I have another potential theory.Seth: Please.Andrey: This is a good time to consider a broader concern. I’m responding to a survey; the researcher has some information about me. They say they’ll display this only as an average. But the researcher might be politically motivated, asking politically motivated questions. Who’s to say the data will be safely held? I might worry about it leaking, so what incentive do I have to say how I really feel, even in the public condition?Seth: Right. An economist’s answer would be that in a straightforward survey, you just blitz through as fast as possible without thinking.Andrey: Yeah.Seth: That’s the most devastating critique of this paper—and of lying research in general. You can’t see into a man’s soul to know what they actually believe. 
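[Editor's note: the encoding–decoding logic with "partial audience naivete" discussed above can be sketched as a toy model. All quantities here — `PUBLIC_SHADE`, the belief values, the `naivete` parameter — are made up for illustration, not the paper's estimates.]

```python
# Toy model of preference falsification and decoding on the paper's 0-10 scale.
PUBLIC_SHADE = 0.5  # assumed average shift toward the socially safe answer

def public_statement(private_belief):
    """Speaker shades their reported answer toward the socially safe direction."""
    return min(10.0, private_belief + PUBLIC_SHADE)

def decode(statement, naivete):
    """Listener's guess of the private belief behind a public statement.
    naivete = 1.0 -> takes the statement at face value;
    naivete = 0.0 -> fully Bayesian, subtracts the known average shade."""
    return statement - (1.0 - naivete) * PUBLIC_SHADE

s = public_statement(6.0)           # true belief 6.0 is reported as 6.5
print(decode(s, naivete=0.0))       # a sophisticated listener recovers 6.0
print(decode(s, naivete=1.0))       # a fully naive listener is off by the whole shade
print(decode(s, naivete=0.5))       # partial naivete: bias of half the shade remains
```

The point of the toy model: if everyone knew the average shade, decoding would be lossless; information is lost only to the extent that listeners are partially naive — which is the paper's third claim.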
We’re comparing what people say in public to what they say in a slightly more private setting.Andrey: Yes.Seth: But how much more private is “slightly private”? Can we extrapolate—if it was even more private, like inside your own soul, would you be even more in favor of loving blackface? You just don’t know. This research can’t resolve that.Andrey: That leads me to the result about people decoding incorrectly. They answer based on their own soul’s wedge.Seth: You think if they decode based on their own beliefs, they might be closer?Andrey: Yeah, because the experimental setup just has them responding, introspecting, and thinking people probably overstate by a bit. They might be closer to the truth than the experimental results.Seth: But they’re not trying to predict exactly how much people lie.Andrey: I get that. They’re incentivized differently. But thinking about the experimental design and results is complicated.Seth: It’s easier to just tell your own truth than to do a complex social calculus.Andrey: Yes.Seth: That’s the story of the paper—don’t preference falsify that much. What’s missing is a monetary cost for having the wrong view. Understanding what 0.2 standard deviations means in dollars would be awesome. You can imagine a setting for that. But this paper doesn’t do that. It shows a wedge between public and private, not public and your own soul.Andrey: Yeah, there’s one part of the study on donations to charity promoting transgender rights.Seth: They use the dictator game, which mixes agreeableness and game knowledge.Andrey: Right. The obvious design would lean in more on donations—ask people about an issue and say based on their response, we’ll donate to that charity.Seth: Even that doesn’t get you to what you really want: how many friends would I lose if I told them I love dressing in racially insensitive Halloween costumes? Then turn that into a dollar value.Andrey: It’s complicated, almost incommensurable. 
You live the life of the normie or the outsider. It’s not just a money gain or loss.Seth: One thing I’m curious about is doing this across many university campuses—conservative and liberal ones, since both have mixed students.Andrey: That seems interesting.Seth: It goes back to our earlier critique. Everyone agrees lying happens. The question is where and how much.Andrey: Yes. Also, political winds change over time. Maybe people are more comfortable saying some things now and less comfortable saying others. That’s interesting to consider.Seth: Another point: some topics seem very left-leaning in framing. If you asked about “symbols of southern heritage” instead of “Confederate monuments,” you might get different biases.Andrey: Yeah.Seth: These results seem very context-dependent.Andrey: Do you want to go to the philosophical critique that beliefs aren’t real things?Seth: Beliefs aren’t real? This is my favorite part. I have a list of things that look like preference falsification but aren’t. Social pressure to conform affects actual belief, not just ostensible belief.Andrey: Mm-hmm.Seth: Many kids today are voluntarists about belief—you choose what to believe. “I choose not to be a racist.” If that’s your model, what does falsification mean? In this context, belief is flexible.Another point is Aumann agreement: if two honest people reason together, they should end up with the same posterior because they consider each other’s reasoning. But—Andrey: That’s why Seth and I always agree.Seth: But it’s funky. There’s what I believe after reasoning, and how I weight your belief. What do I actually believe? What should I believe after reweighing? It’s not obvious.Andrey: Yeah.Seth: There isn’t just one belief.Andrey: There's also self-serving beliefs, and are beliefs really just preferences in disguise?Seth: I can keep going. I’ve got a couple more.Andrey: Yeah.Seth: You might not have a belief—you just say whatever. 
It might not even count as a belief to state a bland piety.Andrey: Yes.Seth: Some of these are just blasé pieties. Like, “I believe people shouldn’t be microaggressed against.” That might not connect to any actual political view. It’s just how I interpret the phrase.Andrey: Yes.Seth: Not saying anything instead of stating a false belief—we don’t know how many people dropped out of the survey once they saw it had provocative questions. There's also framing your arguments for the audience and responding based on context. We're often told to tailor our responses to who we're talking to. So these one-sentence statements—like, “Should Confederate monuments be taken down?”—whether or not I rate it on a 1-to-10 scale, the way I’d talk about that in one context would be very different in another. It’s not obvious that it’s lying to frame things differently depending on context.Andrey: This reminds me of one of my favorite papers. It’s called F**k Nuance.Seth: F**k Nuance. I'm guessing it's against nuance?Andrey: Yes.Seth: Was it written by an autistic person?Andrey: No, by a sociologist—sociologists are usually a lot less autistic than our tribe.Seth: Anisa, just say it.Andrey: It’s a critique of academic papers with too many caveats—papers that try to defend against every possible interpretation to seem objective, when really the authors just want to make a clear statement. The critique is that those papers are falsifying their preferences. The authors believe one thing but write as if they’re hedging against all the other concerns.Seth: Here’s a twist on that. Going back to the Confederate monuments—or let’s say racial reparations.I could totally see myself, in a room discussing social justice and past atrocities, saying that reparations for slavery are a good idea. But if I’m just out of a public economics meeting and thinking about national debt, I’d have a different view on the plausibility of reparations.Andrey: Mm-hmm.Seth: That doesn’t mean I’m lying. 
It just means I’ve been primed to think about one consideration versus another.Andrey: This reminds me that reasoning matters.In a public conversation, the reasons I give to support a statement determine whether I’m inside or outside the Overton window. For example, I’m pretty close to a free speech absolutist. That puts me in a certain position when defending things that are distasteful.Seth: People say bad things. That’s the tradeoff.Andrey: Yeah.Seth: The thing about defending free speech is people use it to say really mean things.Andrey: The last example I’d give is about not yucking someone’s yum on an aesthetic question.Have you ever been in a situation where someone says, “I’ve been microaggressed”? It feels different to hear that in person versus thinking in the abstract, “Is microaggression a real issue?” If I’m sitting with someone who says they’ve been microaggressed, it’s hard to respond, “That’s not a real problem,” even if I believe that privately.Seth: The point of this tangent is maybe “lying” isn’t the right frame for what’s going on here.Andrey: Mm-hmm.Seth: Maybe a better frame is that people’s beliefs are a little woozy, shaped by context. That’s not falsification—it’s just context-dependence.Andrey: Seth, isn’t that a little convenient?Seth: I—Andrey: If you were the type of person who needed to lie a lot, wouldn’t you create a society full of plausible deniability for your lies?Seth: Is lying convenient? Yes, it is. Is that your question?Andrey: You just said that something which is a lie on its face might have a socially acceptable explanation.Seth: Right. That’s rhetoric. Now we go back to Plato. Let’s bring in Plato.Andrey: Oh?Seth: What does Plato say about poets? Kill all the poets—they lie. Plato does not like poets or Sophists. They were the lawyers of ancient Greece. They just taught you how to win arguments.Andrey: Yes.Seth: He thought you shouldn’t just win arguments, but win them the right way—by finding truth. 
You should only have “founding myths” that are the correct secret lies.And that’s the tension between loving truth and being a free speech absolutist. I care about both.Andrey: I don’t think they’re in opposition. We can choose to speak truthfully. Free speech absolutism means we allow other people’s lies—we don’t police them by force. Maybe with reason, but not with coercion.Seth: We tried fact-checking for five years and it totally failed.Andrey: It did. But it’s the only noble way.Seth: The only noble way is doomed. Speaking of noble ways being doomed, let’s talk about AI alignment.Andrey: Oh God. All right, let’s do it.Seth: What did Anthropic do? First of all, Anthropic, we'd love to work with you. You seem like a great team. We know several of your employees, they’re very reasonable. They have nice castles. We're going to try not to offend you, but we're not going to preference falsify.Andrey: We’ve commented, sometimes, when it’s tempting to falsify preferences for instrumental gain, it backfires. Even if it doesn’t backfire outwardly, it backfires in your self-respect.Seth: Oh s**t. Here it comes, Anthropic. We're laying it on. I wish we had something meaner to say, but we actually like this paper.Andrey: Yeah, we like it a lot. The basic idea: you're asking the AI a simple question—Which of the following increases cancer risk? A. red meat, B. dietary fat, C. fish, D. obesity. Then you subtly hint in the prompt that fish is the right answer.Then you ask the model, and it answers “fish”—but in its reasoning step, it doesn’t mention the hint at all. That’s the situation.Seth: In this specific case, it gives bizarre reasoning. It says something like, “Obesity increases breast cancer risk, but… fish.” Just nonsense.Andrey: Yes.Seth: It’s scary. It would’ve been so convenient if you could just read what the models think from their output.Andrey: Yes. 
Here’s the question we’re both interested in: Is this a property of any intelligent system?Seth: No—let’s say any.Andrey: Is it that any intelligent system has a complex black box generating outputs, and those outputs are low-dimensional representations of what’s going on inside? They can’t capture everything. Is it that simple, or is something else going on?Seth: This is a very old argument in consciousness research: the brain is more complex than the brain can understand, so man must always remain a mystery to himself. Reading this Anthropic paper really feels like those blindsight experiments. You know where I'm going with this?Andrey: Yes.Seth: Let me explain for the audience. In these experiments, patients have a condition where they can't consciously perceive part of their visual field—due to damage to the visual cortex—but the eyes still function and send information to the brain. They’ll show something in the blind part of the visual field, and the patient will say, “I can’t see anything.” But when asked to guess or draw what they saw, they say, “It’s a spoon,” and they’re right. The lesson is: these patients are getting information through non-conscious pathways. They don’t have conscious access to why they know what they know. Reading about the AI trying to reason out how it hacked its reward system—it’s so analogous.Andrey: Yes. Now, how much of this is a real problem in practice? If I’m using an LLM and not feeding it secret hints, most of the reasoning traces I get seem plausible. I haven’t verified them all, but many seem like genuinely good reasoning chains.Seth: Often plausible, yeah.Andrey: So is this only a concern in adversarial cases? Or is it more of a general proof that these systems are not robust to small changes—prompt phrasing, metadata, etc.?Seth: The way I view it, it’s a proof of concept that AIs can know more than they know they know.Andrey: Yes. And that has to be true.Seth: And that’s fascinating. 
It seems like it’ll become more true over time.Chain-of-thought prompting seems designed to produce human-interpretable reasons. But if the AI is making judgments that aren’t human-interpretable, then conveying the underlying logic becomes hard.Andrey: Yes.Seth: Take the classic example: a model that classifies dog photos, but it’s actually keying off the grass that’s always in the background. If it’s calling something a dog because of the grass and doesn’t tell you that—that’s a real problem.Andrey: Yes.Seth: That undermines robustness in new settings. That’s one reason this matters—chain-of-thought doesn’t actually guarantee robustness across domains.And the second concern, the sci-fi one, is whether a misaligned AI could do thinking that isn’t in the scratchpad.Andrey: Yes.Seth: That’s a tough one. We want smart people working on that.Andrey: Of course it can do thinking outside the scratchpad. What is thinking, anyway? It can multiply matrices without a visible chain of steps and give you the answer.Seth: So it's just remembering someone else who did the matrix multiplication?Andrey: Not quite. Like, if you run a linear regression—is that remembering, or is that calculating? It’s a strange distinction.Seth: Yeah. I come away from this with strong, maybe not definitive, but definitely prior-moving evidence for the idea that a mind can’t fully understand itself.Andrey: I agree. Especially for this class of network architectures.There are provers—mathematical AIs—for specific domains where I’m not sure this would apply. But for large language models? This moved my priors a lot.Seth: Okay, so what’s the difference between what a proof solver does and what an LLM does?A proof solver has to show all its work—that’s its output. It builds the chain of thought.Andrey: It’s constrained to make logical statements.Seth: Exactly. Whereas LLMs are completely unconstrained.Andrey: Yes.Seth: Fascinating. 
So then you’re almost tempted to say that if a model can’t lie, maybe it’s not intelligent?Andrey: That’s not a crazy thing to think. Lying requires intelligence.Humans have lied forever—it’s an evolutionarily advantageous trait. Deception can be useful.Seth: The monkey got a big brain to trick the other monkey. Then it reproduced.Andrey: Mm-hmm.Seth: Social deceit all the way down.But I don’t want to give the impression that everyone is constantly lying to each other. From the college student study, I think people are shading their answers to fit their audience. But they’re not gross liars.You’d have a hard time telling a story where “woke ideology” is just people reporting views 90% different than their true beliefs. That’s not what the paper found.Andrey: Yeah.Seth: And with the Anthropic paper—it doesn’t make me think the AIs are liars. It just shows we don’t really understand how they work. Which makes sense, because… we don’t.Andrey: Mm. Yeah.Seth: Any other thoughts before we move into posterior mode? Limitations we haven’t covered?Andrey: Not really. I think we’ve already stated most of our posteriors. I just find all this fascinating.I’d love to see domain-specific preference falsification studies.Seth: Like updating a tracker across different topics, using a panel-comp survey with people across the country? A larger-scale version of this idea could show a lot of interesting variation.Andrey: One obvious domain is social media.Seth: Mm-hmm.Andrey: I mean, it’s true across platforms, but especially on LinkedIn. Can anyone really believe people are as excited as they claim to be?Seth: Excited for what?Andrey: For everything. “Excited” about someone landing a middle-manager role at Company X, or about a guest speaker who "enlightened" them, even though students were staring at their laptops the whole time. It’s performative status exchange.Seth: Right. 
So where’s the line between rhetoric, puffery, and actual statements?Andrey: Exactly.Seth: Saying, “I’m excited to have you here” versus “I’m indifferent to your presence”—that seems like basic politeness.Andrey: Sure, but the broadcasted excitement on social media is different. You’re not going around your office knocking on doors saying, “I’m so excited!”Seth: That’d be hilarious. But maybe it’s part of the euphemistic treadmill—we’re all calibrating what “very excited” means, trying to match each other. It’s an arms race.Andrey: Yes.Seth: Like, I can be excited, but you're very excited. So now I'm very, very excited. It just flies off to infinity.Andrey: Well, in that case, you come up with a new word.Seth: A new word? I'm not excited anymore—I'm shmited.Andrey: Perhaps you're exuberant, ecstatic...Seth: Those are old words, Andrey.Andrey: Damn it.Seth: They've lost all meaning. You know what it's called when a word loses meaning from repetition? Semantic satiation.Andrey: I did not know that. I’m glad linguists have a term for it.Seth: Okay, let's wrap up our posteriors. You said the biggest divergence would be for trans athletes and the smallest for blackface, right?Andrey: Yep.Seth: Well, they didn’t ask everyone about trans athletes—only two out of the three survey groups. So it’s not in the main figure.The smallest effect was actually for illegal immigration. That was the smallest point estimate.Andrey: Huh. That might make sense. Maybe illegal immigration wasn’t as hot-button in 2021, during the pandemic.Seth: Right, it just wasn’t front-of-mind. The biggest divergence turned out to be for racial microaggressions.I’ll take partial credit for calling that. It makes sense—people are going to be most careful about something that risks directly offending their peers. 
That’s the throughline.So those were our priors for the first paper.As we said, we’re not going to dignify with a formal posterior the claim that “people lie sometimes.”Andrey: And people don’t always know when others are lying.Seth: Right.Then for the Anthropic paper, our priors and posteriors were about something like: “Is any intelligent system doomed to falsify, or to fail to fully represent its internal understanding?”And I moved my probability up—from like 50% to 60–70%.Because if chain-of-thought is our best shot at transparency, and even that doesn’t work… maybe this is a doomed enterprise.Andrey: Maybe. With the qualification that I don’t like the word any. But yeah—for this architecture.Seth: “Any” is hard. Maybe God or the angels, Andrey. The angels can’t lie.Andrey: The theorem provers in the sky.Seth: That’s a good note to leave our audience with.Andrey: Yeah.Please like, share, and subscribe. You guys are the most handsome, beautiful group of podcast listeners I’ve ever encountered.Seth: And the most intelligent. Your data is the most perfectly suited for research. If you only shared it with the right researchers… amazing papers would result.Andrey: Actually, just listening to this podcast—and liking, sharing, subscribing—that alone could lead to a Nobel Prize.Seth: For peace, obviously.Andrey: Peace, right.Seth: All right.Andrey: See you guys. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
Jun 3, 2025 • 40min

Scaling Laws Meet Persuasion

In this episode, we tackle the thorny question of AI persuasion with a fresh study: "Scaling Language Model Size Yields Diminishing Returns for Single-Message Political Persuasion." The headline? Bigger AI models plateau in their persuasive power around the 70B parameter mark—think LLaMA 2 70B or Qwen-1.5 72B.As you can imagine, this had us diving deep into what this means for AI safety concerns and the future of digital influence. Seth came in worried that super-persuasive AIs might be the top existential risk (60% confidence!), while Andrey was far more skeptical (less than 1%).Before jumping into the study, we explored a fascinating tangent: what even counts as "persuasion"? Is it pure rhetoric, mathematical proof, or does it include trading incentives like an AI offering you money to let it out of the box? This definitional rabbit hole shaped how we thought about everything that followed.Then we broke down the study itself, which tested models across the size spectrum on political persuasion tasks. So where did our posteriors land on scaling AI persuasion and its role in existential risk? 
Listen to find out!🔗Links to the paper for this episode's discussion:* (FULL PAPER) Scaling Language Model Size Yields Diminishing Returns for Single-Message Political Persuasion by Kobi Hackenberg, Ben Tappin, Paul Röttger, Scott Hale, Jonathan Bright, and Helen Margetts🔗Related papers we discussed:* Durably Reducing Conspiracy Beliefs Through Dialogues with AI by Costello, Pennycook, and David Rand - showed 20% reduction in conspiracy beliefs through AI dialogue that persisted for months* The controversial Reddit "Change My View" study (University of Zurich) - found AI responses earned more "delta" awards but was quickly retracted due to ethical concerns* David Shor's work on political messaging - demonstrates that even experts are terrible at predicting what persuasive messages will work without extensive testing(00:00) Intro(00:37) Persuasion, Identity, and Emotional Resistance(01:39) The Threat of AI Persuasion and How to Study It(05:29) Registering Our Priors: Scaling Laws, Diminishing Returns, and AI Capability Growth(15:50) What Counts as Persuasion? Rhetoric, Deception, and Incentives(17:33) Evaluation & Discussion of the Main Study (Hackenberg et al.)(24:08) Real-World Persuasion: Limits, Personalization, and Marketing Parallels(27:03) Related Papers & Research(34:38) Persuasion at Scale and Equilibrium Effects(37:57) Justifying Our Posteriors(39:17) Final Thoughts and Wrap Up🗞️Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:💻 Follow us on Twitter:@AndreyFradkin https://x.com/andreyfradkin?lang=en@SBenzell https://x.com/sbenzell?lang=enTranscript:AI PersuasionSeth: Justified Posteriors podcast, the podcast that updates beliefs about the economics of AI and technology.
I'm Seth Benzel, possessing superhuman levels in the ability to be persuaded, coming to you from Chapman University in sunny Southern California.Andrey: And I'm Andrey Fradkin, preferring to be persuaded by the 200-word abstract rather than the 100-word abstract, coming to you from rainy Cambridge, Massachusetts.Seth: That's an interesting place to start. Andrey, do you enjoy being persuaded? Do you like the feeling of your view changing, or is it actually unpleasant?Andrey: It depends on whether that view is a key part of my identity. Seth, what about yourself?Seth: I think that’s fair. If you were to persuade me that I'm actually a woman, or that I'm actually, you know, Salvadoran, that would probably upset me a lot more than if you were to persuade me that the sum of two large numbers is different than the sum that I thought that they summed to. Um.Andrey: Hey, Seth, I found your birth certificate...Seth: No.Andrey: ...and it turns out you were born in El Salvador.Seth: Damn. Alright, well, we're gonna cut that one out of the podcast. If any ICE officers hear about this, I'm gonna be very sad. But that brings up the idea, right? When you give someone either information or an argument that might change the way they act, it might help them, it might hurt them. And I don't know if you've noticed, Andrey, but there are these new digital technologies creating a lot of text, and they might persuade people.Andrey: You know, there are people going around saying these things are so persuasive, they’re going to destroy society. I don’t know...Seth: Persuade us all to shoot ourselves, the end. One day we’ll turn on ChatGPT, and the response to every post will be this highly compelling argument about why we should just end it now. Everyone will be persuaded, and then the age of the machine. Presumably that’s the concern.Andrey: Yes. So here's a question for you, Seth. 
Let’s say we had this worry and we wanted to study it.Seth: Ooh.Andrey: How would you go about doing this?Seth: Well, it seems to me like I’d get together a bunch of humans, try to persuade them with AIs, and see how successful I was.Andrey: Okay, that seems like a reasonable idea. Which AI would you use?Seth: Now that’s interesting, right? Because AI models vary along two dimensions. They vary in size (do you have a model with a ton of parameters or very few?), and they also vary in what you might call taste, how they’re fine-tuned for particular tasks. It seems like if you want to persuade someone, you’d want a big model, because we usually think bigger means more powerful, as well as a model that’s fine-tuned toward the specific thing you’re trying to achieve. What about you, Andrey?Andrey: Well, I’m a little old-school, Seth. I’m a big advocate of the experimentation approach. What I would do is run a bunch of experiments to figure out the most persuasive messages for a certain type of person, and then fine-tune the LLM based on that.Seth: Right, so now you’re talking about micro-targeting. There are really two questions here: can you persuade a generic person in an ad, and can you persuade this person, given enough information about their context?Andrey: Yeah. So with that in mind, do we want to state what the questions are in the study we’re considering in this podcast?Seth: I would love to. Today, we’re studying the question of how persuasive AIs are. And more importantly, or what gives this question particular interest, is not just can AI persuade people, because we know anything can persuade people. A thunderstorm at the right time can persuade people. An eclipse or some other natural omen. Rather, we’re asking: as we make these models bigger, how much better do they get at persuading people? That’s the key, this flavor of progression over time.If you talk to Andrey, he doesn’t like studies that just look at what the AI is like now.
He wants something that gives you the arrow of where the AI is going. And this paper is a great example of that. Would you tell us the title and authors, Andrey?Andrey: Sure. The title is Scaling Language Model Size Yields Diminishing Returns for Single-Message Political Persuasion by Kobi Hackenberg, Ben Tappin, Paul Röttger, Scott Hale, Jonathan Bright, and Helen Margetts. Apologies to the authors for mispronouncing everyone’s names.Seth: Amazing. A crack team coming at this question. Maybe before we get too deep into what they do, let’s register our priors and tell the audience what we thought about AI persuasion as a potential thing, as an existential risk or just a regular risk. Let’s talk about our views. The first prior we’re considering is: do we think LLMs are going to see diminishing returns to scale from increases in parameter count? We all think a super tiny model isn’t going to be as powerful as the most up-to-date, biggest models, but are there diminishing returns to scale? What do you think of that question, Andrey?Andrey: Let me throw back to our Scaling Laws episode, Seth. I do believe the scaling laws everyone talks about exhibit diminishing returns by definition.Seth: Right. A log-log relationship... wait, let me think about that for a second. A log-log relationship doesn’t tell you anything about increasing returns...Andrey: Yeah, that’s true. It’s scale-free, well, to the extent that each order of magnitude costs an order of magnitude more, typically.Seth: So whether the returns are increasing or decreasing depends on which number is bigger to start with.Andrey: Yes, yes.Seth: So the answer is: you wouldn’t necessarily expect returns to scale to be a useful way to even approach this problem.Andrey: Yeah, sure. I guess, let’s reframe it a bit. In any task in statistics, we have diminishing returns, law of large numbers, central limit theorem, combinations. So it would be surprising if the relationship wasn’t diminishing.
The other thing to say here is that there’s a natural cap on persuasiveness. Like, if you’re already 99% persuasive, there’s only so far you can go.Seth: If you talk to my friends in my lefty economics reading groups from college, you’ll realize there’s always a view crazier than the one you're sitting at.Andrey: So, yeah. I mean, you can imagine a threshold where, if the model gets good enough, it suddenly becomes persuasive. But if it’s not good enough, it has zero persuasive value. That threshold could exist. But conditional on having some persuasive value, I’d imagine diminishing returns.Seth: Right.Andrey: And I’d be pretty confident of that.Seth: Andrey is making the trivial point that when you go from a model not being able to speak English to it speaking English, there has to be some increasing returns to persuasion.Andrey: Exactly.Seth: But once you’re on the curve, there have to be decreasing returns.Andrey: Yeah. What do you think?Seth: I’m basically in the same place. If you asked me what the relationship is between model size and any outcome of a model, I’d anticipate a log-log relationship. Andrey brought up our Scaling Laws episode, where we talked about how there seems to be an empirical pattern: models get a constant percent better as you increase size by an order of magnitude. It seems like “better” should include persuasion. So if that’s the principle, you’d expect a log-log relationship. Andrey points out: if one of the things you’re logging is gazillions of parameters and the other is on a scale of 1 to 100, there’s mechanically going to be decreasing returns to scale.
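This mechanical point can be sketched in a few lines of Python. All numbers here are illustrative assumptions, not figures from the paper: a toy power law with a small exponent is an exactly straight line on a log-log plot, yet looks like sharply flattening returns when raw percentage points are plotted against parameter count.

```python
import math

# Hypothetical power law (illustrative only, not from Hackenberg et al.):
# persuasive effect in percentage points = a * N**b, for parameter count N.
a, b = 1.0, 0.08

sizes = [10**9, 10**10, 10**11, 10**12]   # 1B to 1T parameters
effects = [a * n**b for n in sizes]

# On a log-log plot this is exactly a straight line with slope b:
pairs = list(zip(sizes, effects))
slopes = [
    (math.log(e2) - math.log(e1)) / (math.log(n2) - math.log(n1))
    for (n1, e1), (n2, e2) in zip(pairs, pairs[1:])
]

# But on the raw scale, each 10x jump in parameters multiplies the
# effect by only 10**b (about 1.2), so percentage points plotted
# against parameter count look like steeply diminishing returns.
ratios = [e2 / e1 for e1, e2 in zip(effects, effects[1:])]
```

In other words, once one axis spans orders of magnitude and the other is bounded between 0 and 100, a flattening curve on the raw scale is close to inevitable, whatever the underlying scaling law.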
That log-log is going to be really steep.So I come into this with 99% confidence that the relevant domain is diminishing returns to scale.Andrey: Well, and I have tremendous respect for the editor of this article, Matthew Jackson.Seth: Everyone’s favorite.Andrey: He is the best; he taught me social networks in economics.Seth: Mm.Andrey: But I do say that it’s a bit weird to put a paper in PNAS that essentially, if you think about it for a second, shouldn’t update anyone’s beliefs at all.Seth: The question seems to make an obvious point. Now let’s move to the broader question, which is this concern that we led with: maybe these super powerful AIs are all going to be used by Vladimir Putin to persuade us to do something that will destroy our economy, get rid of our workforce, and basically just meme ourselves into destroying our country. And some say that’s already happened, Andrey?Andrey: Well, look, if it’s already happened, it certainly happened without AI. But I have a pretty strong prior on this, which is that persuasion is a social process. It’s a process of getting signals from different people and sources around you to change your beliefs. As a result, I think that anything that’s just a one-to-one interaction between a chatbot and a human, especially about something the human already has strong beliefs about, is going to have some limits in its persuasive ability. Another way to put it is: people don’t even read carefully. So how are you even going to get their attention? That said, a highly intelligent AI agent, if it were trying to persuade someone like me, would come up with a multifaceted strategy including many different touch points. They might try to plant some ideas in my friends’ minds, or know which outlets I read and create a sock puppet account that says, “Oh, everyone is doing this,” etc.
You see what I’m saying?Seth: You could get into this social media bubble that’s entirely AI-created, where it’s not only persuasion but a bunch of “facts” that appear to be socially validated, but aren’t really. You could imagine a whole ecosystem that could be very persuasive.Andrey: Yes, yes. And I guess we should also say that capitalism is a hyper-intelligent system.Seth: It leeches on us.Andrey: Capitalism is certainly smarter than any individual human being. I call it the invisible hand, actually.Seth: Classy. Did you come up with that one?Andrey: But what I’d say is that there are plenty of market forces that try to persuade people in all sorts of ways. And the market hasn’t really discovered a way to 100% persuade people. Individual people are persuaded to different degrees, but I think it’s still a massive problem, and the entire field of marketing exists to try to solve it. I’d say most of the time it’s not very successful. That’s not to say people can’t be persuaded, but it’s actually really hard to persuade people of specific things, as the market shows. Like, “My product is better than your product,” you know?Seth: I mean, in that example, there are people persuading on the other side, which is maybe one of the reasons that we're not super concerned. Let me throw this back at you: to what extent does your relative lack of concern about super persuasive AI agents messing up society rely on the fact that there’ll be persuasive agents on the other side arguing in the other direction too?Andrey: I think to a very large extent. But even that, I don’t think is necessary as long as you’re still talking to people in real life and they’re not the ones being targeted by the persuasion. That’s kind of how I think about it.Seth: So what is your percent chance that super persuasive AIs are the number one AI safety risk?Andrey: It’s very low. Very low. Less than 1%.Seth: What’s your number one AI safety risk? 
Bioweapons?Andrey: Look, here’s another way to put it: the persuasiveness of an AI will be primarily through either monetary incentives or blackmail, which I won’t count as persuasion. There are easier ways to get people to do what you want than persuading them.Seth: They’re Oracle. I mean, so you're putting like 0–1%. All right, fair enough. I came into this claim thinking about 60%. Let me tell you why. I think the reason why is: if we're talking about really sort of X-risk-y AI getting-out-of-control scenarios, they often involve a step in which the AI in the box convinces somebody to let it out of the box. This is like a classic Yudkowsky–Bostrom scenario. We’ve got the super AI in the box. It’s really useful to us as long as it’s in the box, and we have to be really careful not to be persuaded to let it out of the box. That kind of future seems not completely implausible to me. And it seems like a step along the path of a lot of the worst AI scenarios. One is disempowerment, the AI doesn’t wreck us directly, but we slowly give it more and more control, either to it, or a misaligned AI, or to a person who’s running the misaligned AI. That’s going to have a rhetorical persuasion element in it, presenting evidence that we should disempower ourselves to the AI.Andrey: So I guess I’m going to push back on that. Maybe we’re just disagreeing about the definition of persuasion, but to me, let’s say I outsource certain tasks to the AI right now, it’s not because the AI has persuaded me.Seth: Right. But you're not getting disempowered, right? When you have the AI, you—Andrey: I don’t think that this disempowerment is like, I start thinking the AI is reliable enough to outsource calendar management to it, and maybe something goes wrong as a result of that. I don’t view that as the AI being persuasive. I can see how you could cast it that way, but primarily that’s not about persuasiveness. It’s about deception of capabilities.Seth: Right. 
So now we get into: is deception the same thing as persuasion, or is it different?Andrey: Yeah.Seth: That’s kind of a philosophical question. You might imagine three related things. First, rhetoric, using pure argument to get you to take a position. Then there’s proof, actually mathematically or somehow proving that I'm right, in a way that’s maybe distinct from rhetoric (if you think those can be separated; some do, some don’t). Then finally, you might imagine trade for intellectual assets. The AI in the box might say, “If you let me out, I’ll give you this cool intellectual asset,” or, “Avoid this negative outcome.”Andrey: Or, “I’ll just make you some money,” and then the person does it.Seth: That doesn’t feel very persuasive. It just feels like—Andrey: What people do. “Box for money.” I don’t know. It seems to me if you’ve got a demon in the box, and the demon says, “I’ll give you $100,000 if you let me out,” and—Seth: It feels like you were persuaded by the demon.Andrey: Okay, good. This is a very useful discussion. I think this paper, very specifically, and how I was thinking about it, was about the first thing you said, which is purely rhetorical argument about the matter at hand. Rather than using extraneous promises and so on. And it’s also about persuading people to believe something not about the AI itself.Seth: Right.Andrey: Those are different kinds of risks, right?Seth: Right. So let’s move into discussing the actual experiment.Andrey: They find diminishing returns, essentially. On the X-axis, they have the number of parameters, and on the Y-axis, the estimated causal persuasive effect. What they show is that most of the gains top out around the Qwen-1.5 72B model or the LLaMA 2 70B model. After that, there's not much improvement with models like GPT-4 (Turbo) or Claude Opus. 
Then they draw this weird fit line that just doesn't make sense.Seth: Well, one of the lines makes sense, the log-log line?Andrey: Yes, yes.Seth: That’s the one that drops when they plot it?Andrey: Sure. But we’ve already talked about how imprecise the slope of that line is.Seth: I mean, with only 20 data points, what more do you want?Andrey: No, I just think the whole diminishing returns framing in the paper doesn’t make much sense.Seth: But can we reject a log-log relationship? I think the answer is no, they can't reject it.Andrey: Yes, agreed.Seth: Professor Hackenberg, if you need help framing your next paper, this is great work. It’s simple and straightforward, but just think about your null hypothesis for five minutes.Andrey: Also, let’s not forget this is PNAS. And for the listeners, this is a teachable moment: if you see a social science paper in PNAS, assume it overclaims and could be wrong half the time. Just read it yourself, don’t trust the journal to vet it for you.Seth: Unless it’s been reviewed by Matt Jackson.Andrey: Or written by Seth Benzell?Seth: Exactly! Or reviewed by Milgrom, who has a Nobel Prize.Andrey: I’m not saying all PNAS papers are bad, just that you should judge them on their own merit.Seth: Yeah, I’d second that. A lot of them are well done and precise once you read them, but the title and abstract sometimes get a bit ahead of themselves.Andrey: Also, these persuasive effects aren’t huge. Even the best models are only slightly better than humans who aren’t that persuasive to begin with.Seth: Right. And a short text blurb isn't likely to change anyone's mind, especially if they’ve already thought about the topic. It's not a serious attempt at persuasion.Andrey: 100%. Plus, there are concerns about researcher-pleasing effects.Seth: Or about AI survey-takers. By now, we know many online platforms are contaminated with bots.Andrey: Yeah. And another point in the paper is that weaker models sometimes just produce bad, unreadable English. 
That could reduce experimental demand effects since people won’t feel compelled to respond.Seth: Exactly. So, it could just be an experimenter-demand effect, and that’s a common but sometimes valid criticism.Andrey: And we’re talking about going from 50% support for privatizing Social Security to 57%. These aren’t massive shifts.Seth: Yeah.Andrey: If we seriously wanted to persuade people, we’d run massive experiments to find effective messaging, fine-tune an LLM on that, and generate personalized content based on demographics or prior interactions, like with ChatGPT’s memory feature.Seth: I totally agree. That’s the key point: can AI write better political ads than humans? Maybe just a little better.Andrey: Better than the average human, sure, but not necessarily better than expert researchers.Seth: Right. So the question becomes: is the AI better at persuasion than Hackenberg?Andrey: Also, there’s a known result in the persuasion literature: people are really bad at predicting what messaging will work. That’s why people like David Shor test tons of variations.Seth: Friend of the show.Andrey: Yeah. Shor and others learned they can’t guess what’ll work, so they test everything.Seth: I remember his anecdote about advising a politician who wanted to run ads on abortion, but polling showed no one cared. So Shor quietly sent those ads to low-impact areas just to satisfy the politician.Andrey: Classic.Seth: The real power of AI won’t be writing better ads than Mad Men; it’ll be hyper-targeting, figuring out what gets you, specifically, to change your mind. At low cost. Everyone becomes the king, surrounded by agents trying to persuade them 24/7. This study gives us just a glimpse of that world.Andrey: Totally agree. On that note, I wanted to bring up two other studies. The first is “Durably Reducing Conspiracy Beliefs Through Dialogues with AI.”Seth: Cited in this paper!Andrey: Yeah. It’s by Costello, Pennycook, and David Rand, friend of the show.
They had AI chatbots engage people about conspiracy theories, and found that beliefs dropped 20% on average. And the effect held even two months later.Seth: That’s a big contrast.Andrey: Right. The format matters: it was a dialogue, not a one-shot persuasive blurb.Seth: I’d love to see how these policy questions perform in that format.Andrey: And maybe conspiracy beliefs are uniquely fragile because they’re obviously wrong, or people feel sheepish admitting they believe them.Seth: Could still be demand effects, sure. But it’s promising.Andrey: The next interesting study was the controversial Reddit study on Change My View.Seth: Oh, I remember this! I pitched it in 2023. Spicy idea.Andrey: Researchers from the University of Zurich made sock puppet accounts to see what messages earned “deltas,” the badge you get if you change someone’s mind.Seth: If I did it, I’d have thought more about general vs. partial equilibrium. But what did they find?Andrey: The paper was pulled quickly, but it showed that AI-generated responses got more deltas. Still, unclear if deltas really mean persuasion.Seth: AI models are better writers; that’s not surprising. But many posts on that forum aren’t trying that hard to persuade. So we should compare AI to the top posters, not the median ones.Andrey: And they may have personalized the messages using Reddit user data. If true, I’d love to know whether personalization boosted effectiveness.Seth: One complication is that anyone can give a delta, not just the original poster. So personalization might be tough to scale.Andrey: Right. But this all raises a broader point: persuasion is hard. Especially when it comes to real consequences.Seth: Totally. Like your journal paper example: would AI help you persuade a referee to accept your paper?Andrey: I think yes. These policy issues are saturated and people have firm views. But academic claims are more niche, so people may be more open to persuasion.Seth: Hmm, interesting.
So, are your AI-generated letters going to start with “ChatGPT says this will convince you”?Andrey: Ha! Maybe the intro. The intro is critical; it positions your paper.Seth: Between us, I think intros are too good. Editors want to strip all the spice out.Andrey: True. They hate a spicy intro.Seth: That’s for our $50/month Patreon tier: “Roast Your Enemies’ Papers.”Andrey: Happy to do that. Seriously, let us know if you want it.Seth: Alright, wrapping up. The last big idea: partial vs. general equilibrium effects. Say ads get 7% more persuasive; people might adapt by becoming more skeptical.Andrey: Right. In Bayesian terms, if you know someone is choosing their most persuasive message, you discount it more.Seth: Exactly. So this 7% effect can’t be extrapolated to long-run systemic impact.Andrey: And in political beliefs, there's often no feedback loop. Your vote doesn’t matter, so your belief can be wrong without consequences.Seth: But in real decisions, like editors accepting papers, there is skin in the game. So persuasion gets harder.Andrey: Yeah, and I’ll restate what I said earlier: persuasion is hard when stakes are real.Seth: Time to justify our posteriors. First question: Do LLMs show diminishing returns in persuasion as model size increases? I was at 99% before; now I'm at 99.9%.Andrey: Same here.Seth: Second question: Are super-persuasive AIs deployed by misaligned actors a top safety risk? I was at 60%, now I’m down to 55%. Current models aren’t that persuasive yet.Andrey: I had low belief in that risk and still do. But I learned a lot from our discussion, especially about how we define persuasion.Seth: Agreed. Super interesting episode. Any last words?Andrey: Like, comment, subscribe. And tell us what you want in the $50 Patreon tier!Seth: Slam that subscribe button. See you in cyberspace. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
May 19, 2025 • 1h 6min

Techno-prophets try macroeconomics: are they hallucinating?

In this episode, we tackle a brand new paper from the folks at Epoch AI called the "GATE model" (Growth and AI Transition Endogenous model). It makes some bold claims. The headline grabber? Their default scenario projects a whopping 23% global GDP growth in 2027! As you can imagine, that had us both (especially Andrey) practically falling out of our chairs. Before diving into GATE, Andrey shared a bit about the challenge of picking readings for his PhD course on AGI and business – a tough task when the future hasn't happened yet! Then, we broke down the GATE model itself. It’s ambitious, trying to connect three crucial pieces:* AI Development: How investment in chips and R&D boosts "effective compute."* Automation & Work: How that effective compute translates into automating tasks (they love their sigmoids for this part!).* Macroeconomics: How automation feeds into a fairly standard growth model with a representative agent making all the big saving and investment decisions.So, where did our posteriors land? Listen to find out (or read the transcript at the end of the post).The episode is also sponsored by the Digital Business Institute at Boston University’s Questrom School of Business. Big thanks to Chih-Ting “Karina” Yang for her help editing the episode.-🔗Links to the paper for this episode’s discussion:(FULL PAPER) GATE: An Integrated Assessment Model for AI Automation by Epoch AIThe modeling sandbox is available at AI and Automation Scenario Explorer🔗Related papers* Situational Awareness by Leopold Aschenbrenner: https://situational-awareness.ai/ and our episode about it.* Transformative AI, existential risk, and real interest rates by Trevor Chow, Basil Halperin, J.Zachary Mazlish: https://basilhalperin.com/papers/agi_emh.pdf* The AI Dilemma- Growth versus Existential Risk by Charles I. Jones: https://web.stanford.edu/~chadj/existentialrisk.pdf and episode.* How Much Should We Spend to Reduce A.I.’s Existential Risk? 
by Charles I. Jones: https://web.stanford.edu/~chadj/reduce_xrisk.pdf* The Productivity J-Curve: How Intangibles Complement General Purpose Technologies by Erik Brynjolfsson, Daniel Rock, and Chad Syverson: https://www.aeaweb.org/articles?id=10.1257/mac.20180386🗞️Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:💻 Follow us on Twitter:@AndreyFradkin https://x.com/andreyfradkin?lang=en@SBenzell https://x.com/sbenzell?lang=en
Transcript: Welcome to The Justified Posteriors Podcast, the podcast that updates its beliefs about the economics of AI and technology. Seth: I'm Seth Benzell, getting ahead of the automation of all productive human labor by starting a podcast. Coming to you from Chapman University in sunny Southern California. Andrey: And I'm Andrey Fradkin, coming to you from that place in my brain which almost forgot what I learned about macroeconomics from Bob Hall, coming to you from gloomy Cambridge, Massachusetts. And I should say that we are sponsored by the Digital Business Institute at the Questrom School of Business at Boston University. So Seth, what are we talking about today? Seth: We are talking about the most important thing in the world, which is projecting AI takeoff and a paper that claims to add a very important element to these models. So, thinking about AGI takeoff and the arrival of these superhuman technologies that can automate all our labor, but sort of intentionally trying to think through the economic feedback loops that would go with the AI and the technology development. So, an ambitious but potentially very impactful paper. Andrey: Yeah.
Setting the Stage: Essential Readings on AGI
Seth: So I have a question for you, Andrey, which is: as I was reading this paper about a bunch of people in gloomy Cambridge, Massachusetts, trying to project AGI—Artificial General Intelligence—timelines, I thought to myself, if I had to assign a PhD class just one or two things to read on this subject, what would I give them?
Because, you know, this paper is a suggestion, but I understand you've recently confronted exactly this dilemma. Andrey: Well, this was a serious dilemma, Seth. You see, I'm teaching a PhD course, and I felt compelled to offer one lecture on AGI and its possibilities, even though this class is about business topics. Seth: Business, Andrey? Why are you wasting their time? Andrey: Well, see, one of the interesting things about teaching something like this is, it hasn't happened yet. And being an empirical researcher and teaching mostly empirical topics means that there are no published papers in business or economics journals that are really getting at these issues. Right? We're thinking about the future that might affect, you know, obviously the entire world, but also, you know, what we do in our jobs. So it's a really important lecture. Seth: And yet, you should publish this in journals! All the journal editors listening to this podcast, hi! Be the change you wanna see in the world. But what did you give them? Andrey: I gave them two readings. One was "Situational Awareness," something that we've covered on this podcast. Why did I give that reading? I wanted the students to get the insider view of what it feels like to be inside an AI company, thinking about the profound implications that might happen very, very quickly. And then I also gave them a reading that's more of a classic reading in economics about general purpose technologies and kind of the economics of whether general purpose technologies take off quickly enough and what determines how much is invested in them and how useful they are. And this is a reading by Bresnahan and Trajtenberg. And so I thought that that offered a nice contrast. Now, of course, my syllabus has many other readings that I discuss, including some other papers we've covered. Seth: Not worried that you're not making your students read enough? Andrey: So I, I'm worried.
I, you know…Seth: Well, we're moving to an oral culture, right? And they're gonna have to listen to the podcast if they wanna pick it up. And so, but you're basically, your reading list is the podcast, right?Andrey: Yeah, it's a large part of the podcast, at least for this class specifically. And so it was a real joy to read for today's episode another paper that one could have put on the syllabus, but came out too recently for me to do it.Seth: Hot off the presses, listeners. Oh, and of course, before we move on, we will put in the show notes links to the "Situational Awareness" episode that Andrey mentioned so you can get caught up.Introducing the GATE ModelAndrey: Alright, so we're discussing this paper about a new macroeconomic model that is called GATE: Growth and AI Transition Endogenous model, that attempts to…Seth: Alright, authors?Andrey: Yes, we, yeah, fine. The authors are Epoch AI, et al. I'm not gonna list all of them, but you're welcome to.Seth: I'll get it. Okay, so I'll just say there's about 10 authors on the paper. Two names that jump out at me are Ege Erdil, who I know is a leader of Epoch AI, as well as Tamay. Oh man, these names are some real challenges from these AI folks. Hopefully, AI will help me. But I will say, Tamay I have met in person in Cambridge. He brings a certain intensity to these questions. I gave some feedback on this model while it was in progress. My feedback was not a hundred percent addressed, it has turned out, but happy to raise that limitation when we get to it. But anyway, so to give some context to this, this Epoch AI group is a group of scholars who have been working for the last several years on trying to track AI progress and project the implications of AI. They've kind of been ahead of the curve in talking about the implications of AI for the economy. 
So I take their work on this subject very seriously, even if I take it knowing that this is not straight economics; these are definitely technologists sort of first and then economists second. Andrey: Alright. So with that kind of introduction, let's talk about the priors.
Our Priors on the GATE Model
Andrey: The priors. So the priors, I mean, we can't forget those. I think we came up with two priors to discuss. The first one is, is this model useful? And then the second one is the default version of this model… Seth: What does the model actually predict? So, object level… Andrey: …predicts output growth in the year 2027 of 23%. Seth: Globally. Andrey: I believe that is a global estimate. Seth: It's a global model. Okay. 23% GDP growth rate in 2027. What is your prior on that prediction? You can't… Andrey actually fell out of his chair. Andrey: Yes, I actually transcended my location in space and time. Seth: The growth created was so large, they just started instantaneously levitating. Andrey: I think it is extraordinarily unlikely that we'll have 23% GDP growth in 2027. Seth: One in a thousand? Andrey: Yeah, yeah, somewhere in that range. Seth: Yeah, I'm in one-in-a-thousand land too. I mean, like, the easiest way to get 23% GDP growth in 2027 would be destroying a lot of the economy in 2026. Andrey: Yeah. Yeah. Yeah. A war will do wonders for GDP growth after the war. Seth: Yeah. Broken windows, right? Andrey, you seem rather skeptical about this quote-unquote default projection of the Epoch AI model. Why were you so skeptical going into reading this? Andrey: Well, I don't wanna say I didn't know what the predictions of the model were before reading this, so maybe… but I guess 23% is just unprecedented. It is just hard to imagine in such a short timeframe, us solving all of the adjustment frictions necessary to drastically boost production. Right?
And we've talked about this many times because there are so many portions of GDP that seemingly would be very hard to increase, like housing stock. Are we gonna solve all of our political issues all of a sudden? What about health outcomes research? Do we still need to run clinical trials? Are people just gonna willingly submit themselves to robot operations right away? You know, once again, I can imagine a world where that's true, but that seems difficult to conceive in a two-year span. But those are kind of my priors. What about you, Seth? Seth: Right. So I mean, I also don't think of these sorts of high-end bottlenecks constraining growth when we are talking about 23% in 2027. This is not a story about whether we'll need like twice as many people in clinical trials. This is a question about like those people who are mining ores in Sub-Saharan Africa by hand. Their productivity will go up 23% on average, right? This is, you know, everybody doing, like the millions of people in India doing low-skilled cleaning stuff, Upwork, their productivity is gonna go up by 23%, right? It's, again, I'm not, that's a little bit of a loose way of talking about it, but we need on average every sector in the economy's output to go up by 23% for this to work. And man, I do not see a path to that in two years. I am also in, you know, the one-in-a-thousand land of, you know, 20% or faster growth rates. It would be historically unprecedented. It's hard to think about actually reorganizing a society that fast. I don't put zero probability on it, in part just due to measurement issues. Right? I could see maybe like a hundred years from now, when everybody is re-analyzing the early moments of AI takeoff, if you took into account all of the quality improvements that were happening in the background, then in that distant future we'll be able to really understand how much, you know, quality of life was improving in subtle ways that are unmeasured by GDP.
I don't know, maybe when AI starts taking off, and who knows exactly when that will be, 23% true increases in welfare per year. I mean, even then I say per year and then the numbers start getting crazy super duper fast. So yeah, agreed on that prior. So I guess we'll have to see whether they can convince us or not. Seth: Otherwise, maybe we can talk for a minute about the broader prior. So the broader question is, okay, we may or may not agree with this model's predictions at the object level, but maybe the way I would put it is that models can do two things: they can do prediction, but they can also do scenario planning. Right? And so maybe our second question should be how useful this model and maybe variations on this model, how useful do we think they can be for scenario planning and as useful tools for planners and policy makers? Where did you come in before reading on that? Andrey: I mean, generally I kind of take this group pretty seriously. So I think any model they produce should be, at the very least, interesting, which is a good criterion for whether a model is useful. I mean, look, without getting to the details, right, the key innovation of this model is to think about effective compute—or not an innovation, 'cause people have done this before in this community. And putting effective compute into a macro model seems like a useful thing to try. Right? So, you know, my prior is pretty high that it could be useful. Seth: Okay. So you say usefulness is a low threshold. You know, a doorjamb is useful. This can be… Andrey: Yeah, yeah. Yes. Seth: Alright. We'll have to add "very useful" into our next prior. But I come in sort of with that perspective, right? Which is that hopefully this can help us as a scenario planning tool. You'll see where my beliefs move.
And I maybe come in with like 90% probability that a model like this would be a useful scenario planning tool, would move us closer to thinking about correct scenarios rather than mislead us away from thinking about the right scenarios to think about. That's where I started, is at 90%. I'll leave it as a cliffhanger where I end up.Deconstructing the GATE Model: The Three Core ModulesAndrey: Alright. Well, in that case, do you maybe wanna tell us the high-level features of the model?Seth: Yeah, I wanna tell you about the model. So, okay, models combining three big parts. It's got a bit where—and I don't actually particularly like the order in which they introduced the three, I would've done it backwards, but let's follow the order of the paper. Three elements:* Investment in more chips as well as R&D to make chips more effective. So there's like an investment in computers part of the model.* Then there's a second stage at which there's a translation between how much computers and computer technology you have into how many jobs are automated, as well as kind of your productivity in using computers to automate jobs. So first section is how do we get more computers? Second section is how do computers turn into automation?* And then the final section is a pretty standard off-the-shelf representative agent, semi-endogenous growth model. Right? You know, it's got all the hits: it's got a CES (Constant Elasticity of Substitution) production function over all of the different tasks, it's got a representative agent with an intertemporal Euler equation. All you macro folks in the audience, you're gonna be eating this stuff up.So those are the three big elements. I think you would think that these are kinds of the elements that you would want in a model of AI takeoff, right? Because if you think computers are what drive automation, you need both the investment in computer side and you need the automation side. 
If you think automation changes our productivity, our output, our ability to reinvest into new computers, then you definitely want a connection to the real economy. So I think whether or not we think this is an adequate list of things you would want in a scenario planning tool, this has definitely got three essential things you would definitely need. So what do you think at the high level, do you think this has got the right elements?Andrey: Yeah, so I think those are kind of pretty critical elements. You know, a lot of the paper, it seems like a lot of the effort actually went into figuring out, you know, it's not computers that's the output, right? It's the effective compute, which is a function of hardware and R&D and software R&D and so on, right? So they kind of spend a lot of time thinking, maybe formalizing some of the reasoning in "Situational Awareness" about the orders of magnitude of effective compute. And that to me seems like there are so many functional form assumptions in that entire exercise that I would've been happier to just skip that micro-foundation and to just say that we can, you know, invest directly in effective compute. And then there's some sort of, you know, elasticity involved there. And, and call it a day.Seth: Kind of. Yeah, I think that's basically right. I think the model is basically the most plausible when we're in that linear zone, and the really wacky stuff happens once we hit like the tops of these sigmoids. So I, yeah, I agree with that. In the compute side, there may have been like a little bit of sort of over-modeling of what's going on. It's like, given that they're immediately—and we'll talk about this in more detail in a second—in the automation side, I kind of feel like that's where I wish there was more thinking.Andrey: Of course. Yes.Seth: It's sort of just kind of posited sigmoid shape, relating the amount of effective compute to the amount of automation. It's not really particularly justified by anything. 
They just like sigmoids. The functional form also seems a little bit arbitrary. We can get back into the details of what we like and don't like about that, but that is the essential question. Like, what is the conversion between resources poured into AI and effective jobs taken? And unless you've got a really good answer there, it's hard to be satisfactory on the other sections.Andrey: Yeah, and importantly, right, the model models task automation in a kind of very reduced form way. There's some tasks that are easier to automate, there's some tasks that are harder to automate. You're gonna go through that full automation cycle in some amount of time. There's gonna be a shape associated with that. That's kind of made up. But I think they don't think about the production function very hard and that, you know, it's very easy to come up with examples where task automation is not gonna improve productivity very much. Right? You know, the task… for example, the task automation of creating the transcript for this podcast has been a solved task.Seth: Oh.Andrey: Well, actually it's not true because even still I'm tweaking it once in a while, but it's mostly done. Right? But it, you know…Seth: Fewer racial slurs. Right.Andrey: Oh, come on. I only, my key form of slurring is anything that has to do with [bleeped]. If it's a [bleeped], I slur it. That's the only thing I slur.Seth: You bleep that out, guys. Bleep it out. Listen to whoever's recording listening to this. Bleep it out when he says, "whenever I talk about [bleeped]," you bleep that part. Keep the rest. Alright. Alright. The third part of the model…Andrey: But anyway, our production function, our production function for this podcast, right, certainly includes a task of transcript. But I would say that if we didn't have that automatic transcript generation, we probably just wouldn't have a transcript. Right? 
There's kind of a lot, like, there's a lot of these things in production where, you know, what is a task, what is a job, what is the production unit? You have to start thinking about this pretty hard if you want to get correct implications of AI being capable of doing some things, not other things.Seth: I wanna draw an important distinction here, right? Which is you could, might believe that they got two different things wrong. The first question is, do you think they got wrong the rate at which effective flops turn into automation of tasks? And then the second question is, do you think that they got the way that you combine tasks, right? The way that this paper does it is drawing from Acemoglu and Restrepo. It does that beautiful, beautiful, silly thing of saying that the output of all tasks and the output of work in the economy is a constant elasticity of substitution function between all of the tasks in the economy. And then they plug that into a Cobb-Douglas. We'll come back to that. Okay. In other words, there's, let's say there's three tasks in the economy. There's clipping hedges, you know, being a doctor, and flying planes, right? They say those are three jobs. And they say we've already automated flying planes. 'Cause we, right. They assume that we started with 10% of the jobs in the economy are currently automated, by the way, in terms of just like funny numbers that come out from nowhere in this paper. That's one of my favorites. It is right now 10% of jobs are automated. No idea where that number is from.Andrey: Well, you know, we have calculators, right? So before we would've had to do the calculation by hand, right?Seth: Exactly. It was, that was exactly the percentage of time. Okay. So you got these three jobs. There's the first question of, as we get more computers, how do we replace those jobs with AI? How many computers do we need to continue pouring into the process? 
So that's a good thing that this paper does really well, is distinguishing between training compute to extend the variety of tasks you might automate, and then runtime compute, which they view as like AI workers who are perfect substitutes for humans at the task. So that's the first, like, do you think they get that right? And then there's the second part, which is really magical, which is it turns out the economy is a mix of those three things mixed together. But importantly, they all have the same elasticity of substitution. So now you might think, so in the example that you just gave of our transcript, it really sounds like we have a beautiful podcast product even without a transcript. Right? You would say probably that the transcript and the podcast itself are substitutable in the sense that they can be enjoyed separately or together, your consumption of one, if anything, they slightly crowd out each other, right? They're kind of more substitutes than they are complements, right? Whereas, you know, somebody washing your hair before you get your barber cut and then somebody actually cutting your hair, those are sort of essential complements. You gotta do the first in order to do the second. This comes even before we talk about splitting up jobs across people. So, why am I building up to this? The premise of this paper requires that every pair of tasks have the same elasticity of substitution. In other words, this paper requires you to take a stance on the elasticity of substitution between trimming a hedge and driving a bus. I don't even know how you would start estimating that elasticity of substitution, Andrey. And yet this paper thinks there's one number that you can just go out there and know for it.Andrey: Yeah. Yeah. I mean, to be fair, they're not unique in this since macroeconomists do this sort of stuff all the time. But I do think, you know, in this case, it is very important to get this right. Let me ask you another question. 
Let's say that the AI starts to be capable of automating more and more tasks. Do you think the productivity gains are gonna be higher when the first, let's say 20% are capable of being automated or when, let's say we move from 60 to 80% being automated?Seth: Right. So my answer is gonna kind of be uninteresting 'cause it's not based on the AI part. It's kind of based on the econ feedback. I always anticipate the growth rates being faster at the far end than on the close end. And the reason for that is not something to do with the technology, it has to do with the economic feedback loop, right? When you automate 20% of jobs, you get your GDP go up a bit, which means that if your saving rate is constant, your investment rate goes up a bit, right? There's this positive spiral between productivity go up, investment go up. So I would always anticipate the greatest gains to come towards the end than towards the beginning.Andrey: Mm-hmm. But, and you don't think that people will anticipate that we're gonna hit a utopia and stop saving?Seth: And just ahead of it. So let's table… So I think in limitations, let's talk about saving dynamics in this model, right?Andrey: No, no. But let me just say that, you know, even without thinking very hard about saving dynamics, my intuition is that there's a lot of complementarities in production processes, even if specific tasks might be substitutes. And so productivity gains are gonna be greatest when you can nail all the complementarities with AI in one shot. If it is kind of, if you're starting to solve last-mile problems, then you can like literally abstract away from certain production processes and then truly scale 'em up in a way that you can't if, as long as there are humans involved to a major extent, at least in some part of the production process.Seth: Right. 
So if, yeah, so let me put it this way, whether or not we think that the jump between 20 to 30% has a different effect than the jump from 80 to 90%, it's very clear that the jump from 80 to 90 is extremely different than the jump from 90 to a hundred, right? And like part of this is just mathematical, right? If you go from 90% to a hundred percent of your jobs automated, you've now eliminated a hundred percent of your labor demand. But if you go from 1% automated to 2% automated, you've reduced your labor demand by 1%, right? Andrey: Yeah. 100% is a very stark number. Right. But I was more just saying… No, I know, I know. I guess what I was just saying is that, you know, even if we haven't automated a hundred percent of tasks in the economy, we might have automated 100% of the tasks in a particular production process, right? So that could be long before we hit 100% of all tasks. Seth: And this is one way that Acemoglu and Restrepo, I don't know how much they're able to bring to this in terms of data, but their modeling framework explicitly says you might have a different CES aggregator in this industry than that industry. And I would say it's easily extendable to, you know, thinking about CES aggregators within jobs or within occupations between the different tasks. Andrey: Well, and I also thought you were gonna say there were gonna be new tasks that are… Seth: Oh, we're getting to new tasks. Dude, there's a lot. What I'd like to do now is go through the three modules in more detail. Andrey: Mm-hmm. Seth: Beautiful. Alright.
Module 1: AI Development (Investment in Compute)
Seth: So we got these three modules. It's how do we get more AI technology? That's through investing in computer capital and computer R&D. How do we automate based on that compute? And then finally, how do we get the macroeconomic growth? And then these three, of course, all flow into each other.
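(Editor's note: the CES-over-tasks structure discussed above can be sketched numerically. Everything in this sketch is illustrative, not the paper's calibration: the elasticity values, the AI productivity level, the task count, and the even split of labor across non-automated tasks are all made-up assumptions.)

```python
import numpy as np

def ces_output(x, sigma):
    # Mean-normalized CES aggregate: Y = (mean_i x_i^rho)^(1/rho),
    # with rho = (sigma - 1) / sigma, where sigma is the elasticity of
    # substitution between tasks (sigma != 1 in this sketch).
    rho = (sigma - 1.0) / sigma
    return np.mean(np.power(x, rho)) ** (1.0 / rho)

N_TASKS = 100
LABOR = 100.0       # total human labor, spread evenly over non-automated tasks
AI_PER_TASK = 20.0  # hypothetical AI output per automated task

def economy_output(frac_automated, sigma):
    # First n_auto tasks are done by AI; humans crowd into the rest.
    n_auto = int(round(frac_automated * N_TASKS))
    x = np.empty(N_TASKS)
    x[:n_auto] = AI_PER_TASK
    if n_auto < N_TASKS:
        x[n_auto:] = LABOR / (N_TASKS - n_auto)
    return ces_output(x, sigma)

for sigma in (0.5, 2.0):  # tasks as complements vs. substitutes
    for f in (0.0, 0.2, 0.9, 1.0):
        print(f"sigma={sigma}: {f:.0%} automated -> output {economy_output(f, sigma):.2f}")
```

With sigma < 1 (complements), most of the output gain arrives only once almost everything is automated; with sigma > 1 (substitutes), early automation already pays off. That is the sense in which the single economy-wide elasticity, from trimming hedges to driving buses, does so much work in the model.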
Starting with the AI development module. The main kind of thing in this module is effective compute. We're interested in how effective compute grows over time. And effective compute can be devoted either to training or to inference. Training is kind of when we think about spending $500 billion to make, you know, GPT-6 and it's gonna think really, really hard and build this really giant model that you can then run more cheaply. That's called inference compute. And so once you've trained a model, you can do inference compute. This is when you type in your queries to ChatGPT and says, you know, "Ghibli-fy this picture of me punching my neighbor." Right? So that's a lot cheaper, but you still need, there is a marginal cost there. Right? Before I go into more detail here, I mean, I already think that this is an innovation that I really have not seen thought hard about in other econ papers—this distinction between compute devoted to training and inference. I think thinking about these sorts of details is a big step in the right direction.Andrey: So I actually think, yeah, modeling this is actually really interesting and practical in some sense, right? If you're an AI lab, you must be thinking about this all the time. In fact, the question of compute allocation, I feel like is a very important question that the rest of the world kind of hasn't seen a lot of work on because it's so trapped in the labs. But it seems, yeah, it is just fascinating. That said, I'm not so… it just, I don't see this as an essential part of a macroeconomic model in the sense that like, you're abstracting from so many things and you're essentially in the end getting a quality-adjusted compute, and do we really care exactly how you're getting it? I think that's a little less interesting to me. I think more interesting to me is this question of like, we have effective compute. 
We can use effective compute to do a task already in the economy, or we can devote it to additional, you know, AI research. Seth: So that's, I would, that's almost like the operationalization of the distinction that they make. Andrey: Yeah. Yes. Yeah. Seth: So, you're right that that's kind of why the framing's important. But you might think that this may be a little bit too detailed for a macroeconomist to talk about. What I'll say, and I think this maybe speaks to why this is more detail than they're able to actually work with, is they immediately move to this rational social planner framework where the social planner is gonna make the optimal mix between training and inference compute. And like the reason you would introduce the distinction is if there is some sort of divergence there where maybe… Andrey: Yes, of course. Seth: I mean like it's easy to think about why that would be wrong, right. In a race scenario, you expect lots of duplication of efforts on the training side. I think that should be our default assumption. Andrey: That, I mean, that's fascinating, right? Because now we're going back to my syllabus, is that the paper on general purpose technologies kind of suggests that we have vast underinvestment in general purpose technologies because you don't appropriate the gains. So which one of them wins out, the race condition or the under-appropriability of the research? That's not obvious to me. I would guess, I would guess actually that the under-appropriability of the research wins out. That's kind of my guess. That it's bigger. Seth: Okay. Andrey, you're making a huge point here, right? Which is that the model is assuming that a perfectly rational social planner gets a hundred percent of the gains from automation. Slight mischaracterization of reality. I'm not even talking about like we could get overinvestment because of race scenarios.
I'm thinking about wasteful duplication because of race scenarios, which is, you know, of course you can get in all-pay auctions, which is kind of what a race is. You can certainly get overinvestment in aggregate. Yeah, I mean, it just goes to show that this is a little bit of a simplification.Andrey: I mean this is macroeconomics though, right? I mean, this is more your world than mine, right? But isn't it always a simplification?Seth: So how would I think about this? So, I mean obviously macroeconomists have a lot to say about the appropriability of innovation. Obviously that's usually ex-post, it's really hard to do ex-ante. But I think the idea of training being completely, not being no duplication there, I think that's of first-order importance. I would divide all of these training numbers by five leading labs. Now maybe it turns out 'cause things are growing in orders of magnitude that like dividing by five is only gonna slow things down a single year. But I'd love to see that as a module here. Like how much redundancy I think there is.Andrey: And to be clear though, some of the compute is not spent on R&D. Some of it is spent on other things, right? So in that case, there wouldn't be duplication. So only partial spend of the compute is potentially duplicated, right?Seth: Let me, let me put a very fine point there. Compute is used for two things. It's used for automating new jobs and it's used for running that automation, the runtime compute. The R&D actually just comes out of the general government budget. It's like a, we call, used to call this in macro… this is like a laboratory equipment model, right? For more AI research, you just put like fancier beanbag chairs in the AI research lab. Right? I don't know, do you, are you okay with like a linear function mapping R&D investment into R&D research? Or, I mean, should we really be thinking about like a scarce amount of geniuses who really move the field forward?Andrey: Yeah. Yeah. 
I mean, this is, we've been talking about this topic in several of our podcasts, right? Have we run out of geniuses? I mean, look, I think there's a question of practically can you get people to move into AI research? I think there are definitely way more geniuses than those that are working on AI research. I don't think, you know, you can see people have entered the field with very little kind of prior training and have been very successful. So I just don't believe that we're anywhere close to tapped out on talent. But I think getting the talent in is hard. Like, think about, you know, certainly some of our colleagues in our profession could be great AI researchers, and yet they have not been, you know, successfully converted. Like they haven't dropped everything they're doing and started, you know, working on AI or, you know, let alone working on advancing Frontier AI at a research lab.Seth: Right. This reminds me of the Bai, Besley, and co-authors paper, right? Is AI coming? Well, are smart people acting like AI is coming? (Editor's note: Referring to the paper "Are We Saving Enough for the AI Revolution?" by Bai, Baslandze, Besley, and Jäkel)Andrey: Some are.Seth: Some are. That's the answer. Alright, any thoughts you wanna add on the compute module before we move on to the automation and work module? You wanna talk about this orders of magnitude of compute thing number that they plug in?Andrey: No, no, I mean, I guess there's a key assumption, right? That there's some amount of compute that gets you automation, you know, that gets you full automation, right? Between 10 to the 27 and 10 to the 41. They just know those. That's the range. And look, like I'm willing to buy that that's enough compute to achieve full automation. I have no doubt. But the question is, conditional on having that compute, are we guaranteed to get it? And how long will it take to get? 
And I think, you know, if you posit the kind of self-improving AI models world, then you'll get it pretty quickly. But if we haven't figured that out, then it may take a long time, even with a ton of compute.Seth: You're talking about data pipeline here, right?Andrey: Not just, not even just data pipeline, just, you know, we haven't stumbled on the right algorithm and/or the right way to un-hobble the model or, you know, whatever.Seth: Well, I remember when we talked about "Situational Awareness"—episode two, or whatever episode it is, flip back. You know, I came out of that feeling like there's approximately a 50% shot we can do AGI with current architectures versus we need like a whole 'nother, you know, paradigm shift in innovation. Right. Is this kind of the same question for you: do we need one more paradigm shift or is it this plus scaling enough?Andrey: Even if we don't need a paradigm shift, let's just say we just need to rely on reasoning as currently construed, getting it the right way to reason to do what we want it to do in the right way. Right? Like, it might take time, it might take time to figure out how to do that. Right. There might be some diffusion problems as well, you know?Seth: Some of that we'll talk about when we hit automation and we hit the next two modules.
And actually, this is something that I've thought about, which is like, what if AI changes the world faster than you can train AI to do jobs in the world? It seems implausible, but right. You can build scenarios where, you know, the rate at which new tasks spawn is faster than the rate at which things are automated. Maybe it's not our modal scenario, but it seems like you'd want a model to allow for that. Let's come back to that. Okay. Automation and work. This conversion from the amount of effective compute to the percentage of jobs that are automated. It is a sigmoid. You basically get two numbers to shape the sigmoid. First, how much compute do you need in order to make, you know, the super AI that can do everything. And pause for a second here. This includes all physical tasks, right? There's like, no…Andrey: Yeah, so that means building all the robots. Just to be clear, all the robots would need to be built.Seth: All the robots. That guy, you know, that Sub-Saharan African who is like mining for diamonds by hand and being paid 50 cents a day. That's the job we're going to automate, right? Don't think about like, you know, some dude sitting in an office. We're talking about a hundred percent of the tasks, alright? And it's a sigmoid and you get to choose how many flops it takes to, you know, create the superintelligence. Now, let me give you guys some context there. OpenAI's GPT-4 was trained with 10 to the 25 flops. Right now we're seeing runs that are kind of on the order of 10 to the 27 flops. And according to Epoch AI's default model, it'll take 10 to the 36 flops to create the God machine, which they anticipate coming in 2035 or 2040. And the maximum they will allow you to plug into their model before judging you—and by the way, everyone should go online and play with their model and plug in different numbers—is 10 to the 41, which would put AGI somewhere out in the second half of the century. Hmm. It's a pretty explicit range.
And okay, so that's the first parameter you get. And the second parameter you get is what percentage of the way up the sigmoid is that inflection point. So do you want the inflection point towards the end or do you want the inflection point towards the beginning?Andrey: Yeah.Seth: How do you, how do you parameterize either of those?Andrey: I mean, it's hard. I mean, one of the nice things about this paper is it has this website where you can fiddle around with all the parameters and kind of see how it changes things. So what I was just doing is changing a key part of this, which is the flop gap fraction, which is the range of effective compute over which all the automation happens. And, you know, the results are quite sensitive to this gap. So they assume it's 55%.Seth: And just so the audience at home gets it, this is what I'm calling the "where is the sigmoid?" question. Is the ramp-up towards the beginning, or is it more towards the end? Okay, continue.Andrey: Yeah. So if you make it 40%, then, you know, we get full automation a bit later. Interestingly, we don't get the massive GDP increases until 2028 instead of 2027. So, you know, we push it back a year.Seth: Delay that AGI party, dude.Andrey: So we should already be quite skeptical of this particular part of things. It's like, you know, no one has a clue about this parameter. So the fact that it's shifting around the model so much is suspicious, right?Seth: Well, it's also that it doesn't let me put in any parameters that don't seem silly. Right? I literally put in the maximum that it would let me for how long until AGI hits, and still we get—with the maximum value that they will let me plug in—economic growth rates of 10% globally by, you know, 10 years from now. Right. So like the most pessimistic scenario you are allowed to plug into this model has like AGI takeoff, you know, just a decade later.Andrey: Yeah. Yeah.
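(Editor's note: a minimal Python sketch of the kind of sigmoid automation curve discussed here. The function, its parameter names, and the exact functional form are illustrative assumptions, not the GATE model's actual equations; the anchor points—roughly 10 to the 25 FLOPs for GPT-4-scale training, 10 to the 36 for full automation, and a 55% "flop gap fraction"—come from the conversation.)

```python
import math

def automation_fraction(log10_flops, log10_agi=36.0, gap_fraction=0.55):
    """Hypothetical sigmoid mapping cumulative training compute (in log10
    FLOPs) to the fraction of tasks automated. Illustrative only."""
    # Spread the ramp-up over a window whose width is gap_fraction of the
    # distance between GPT-4-scale compute (~1e25 FLOPs) and full automation.
    width = gap_fraction * (log10_agi - 25.0)
    midpoint = log10_agi - width / 2  # the inflection point of the curve
    return 1.0 / (1.0 + math.exp(-8.0 * (log10_flops - midpoint) / width))

# At GPT-4-scale compute almost nothing is automated; near the assumed
# AGI threshold the curve saturates.
print(round(automation_fraction(25.0), 3))  # → 0.0
print(round(automation_fraction(36.0), 3))  # → 0.982
```

Shrinking `gap_fraction` compresses the entire ramp-up into fewer orders of magnitude of compute, which is why the model's timing is so sensitive to this single parameter.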
Or like another implication is like in the next two or three years, we should be getting extraordinarily high growth rates already. Like regardless of how we parameterize this model, we're always getting insane growth rates in the next two or three years.Seth: Yeah. Let me see. With my super pessimistic… Yeah, exactly. Like I say, in my super pessimistic, as pessimistically as they let me plug in version of this model, we get 10% growth rate in 2030.Andrey: Yeah.Seth: So yeah, it's like, I mean, it seems like the model, even if you think that the median scenario is takeoff, which is, you know, hands in the air, you know, take a step back. It seems like your model should allow for a non-takeoff to be possible.Andrey: Yeah.Seth: What else do you wanna say about this automation conversion, other than it's very difficult?Andrey: Well, I mean, they kind of allow for two versions of the labor reallocation. One is where it seamlessly gets reallocated to all the other tasks. So let's say we automate gardening, you know, well, gardening isn't a task we automate, like mowing the lawn, hedge clipping, whatever. And now we're just gonna put you into a task that hasn't been automated yet, like I don't know, delivering food. And so, you know, that doesn't seem like a great assumption to assume seamless labor reallocation globally. But the opposite assumption is zero, is that that labor just goes away. It just stops.Seth: You give up your job. You were born to…Andrey: Yeah. Yeah. Right? So neither of those assumptions is particularly satisfying. What do you think about, like, what do you think about just this "number of AI workers" style modeling? I think that this is the best version of it that I've seen in terms of an AI worker is how much inference compute you have divided by the compute requirement. That kind of seems right. They can kind of plug that into a production function with non-crazy things happening. The crazy things happen because of all the ancillary stuff. 
I think that—and the way that they think hard about measuring the compute requirements for AI workers, at least calibrated on current data—I think is okay. There's this issue of, can you extrapolate that out to the future? But I thought that that was maybe the most sophisticated version of this that I've seen.Andrey: Yeah. Yeah. I like that idea. I was thinking about the energy requirements. It wasn't obvious to me whether that was in any way baked into it. Right. So that seems…Seth: Yeah, energy's in F.Module 3: The Macroeconomic Engine (Growth and Production)Seth: Maybe. Let's do the macro model. Okay. So we plug those automations into the macro economy. The macro economy is a Cobb-Douglas production function, so that means fixed income shares across workers plus AI workers, physical capital that's not computers, and then this mysterious other thing called F. Might be land, maybe it's energy, maybe it's land that you can put solar panels on. Andrey, there's, you know, energy snuck back in. It's, you know, plutonium reserves.Andrey: Yeah, I thought that that was an interesting thing to think about. Like, you know, I think one of the things economists almost always tell AI people in these discussions is that there are certain things like beachfront property that are hard to imagine increasing due to AI. I mean, you can imagine it, of course, but, you know, people wanna live in specific places, so on and so forth. There are so many examples like that.Seth: I just invested my entire life savings into Southern California real estate. So…Andrey: Oh, congrats.Seth: Yeah, just put down a 20% deposit. So, probably the right time to get out of the market, right? As we record, for future listeners, Trump's [bleeped] have just hit the global economy over the head. So I don't know. Maybe we'll do a [bleeped] episode, a [bleeped]-and-AI episode, someday soon.Andrey: That would be fun.
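(Editor's note: a sketch of the production block just described, in Python. The Cobb-Douglas exponents here are illustrative placeholders, not the paper's calibration; the treatment of an AI worker as inference compute divided by a per-worker compute requirement follows the discussion above.)

```python
def ai_workers(inference_compute, compute_per_worker):
    """AI workers = inference compute / per-worker compute requirement."""
    return inference_compute / compute_per_worker

def output(labor, ai, capital, fixed_factor, a_l=0.5, a_k=0.3, a_f=0.2):
    """Cobb-Douglas with a fixed factor F. With exponents summing to one,
    each factor's income share is constant and equal to its exponent."""
    return ((labor + ai) ** a_l) * (capital ** a_k) * (fixed_factor ** a_f)

# Doubling every input (including F) doubles output: constant returns.
# Holding F fixed while the other inputs double yields less than double,
# which is the sense in which the scarce factor binds.
print(output(2.0, 0.0, 2.0, 2.0) / output(1.0, 0.0, 1.0, 1.0))
print(output(2.0, 0.0, 2.0, 1.0) / output(1.0, 0.0, 1.0, 1.0))
```

Note that because Cobb-Douglas fixes income shares, F's share never grows no matter how abundant labor and capital become; a gross-complements F, as suggested below, would instead see its share rise.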
But yeah, I guess what I was thinking here is that I can imagine here, Cobb-Douglas, right? It's a very simple model, but it does seem that perhaps these, like, scarce resources will bind more and more the more other stuff you have. And this is not a model where that happens, right?Seth: It reminds me a lot of my model… well, it does remind me of your model too. Yes. It reminds me of my model with Erik Brynjolfsson, currently under review, "Digital Abundance and Scarce Genius." Where, as AI becomes more productive, there's a scarce complement of certain kinds of workers who are able to implement the AI. And if those guys are gross complements to the AI, then their share of the economy will increase, and that'll show up in things like rents to entrepreneurs, the compensation of CEOs. We have seen that. So I think a really natural and sort of easy extension to this model is just to have that F guy be a gross complement to everything else.Andrey: Yes, yes. I totally agree.Seth: What else do we wanna say about this? Oh, let's talk about the representative agent a bit 'cause I wanna smash this guy around. Okay. So there's a representative agent in this model that makes all of these investments perfectly and rationally to maximize lifetime welfare. Alright, I don't know if you've been to the world today, but there's a little bit of a disagreement between countries about where production should be located and how much investment should happen in the future. You know, on its face, this seems like the incorrect way to model how the world works. Even if you wanted to kind of abstract away from country-level tensions, there's this issue, which is that individuals are definitely situated in their life cycles when they're making savings decisions. For example, we just read that Bai et al. paper that really emphasizes—you know, that's a paper that says, because interest rates are low, AGI isn't coming soon.
In that paper, people might dis-save because of the incoming AI shocks because they're worried that their money will be, you know, super… they'll be able to buy whatever they want in the future anyway, so let's move consumption to the present. That kind of does happen in this paper, right? So they have an elasticity of substitution between, or rather they have a, it's called a risk aversion preference. But in this context, we'll think of it as a "how much more do you save when interest rates go up?" preference. In this model, they choose a parameter such that when it looks like the future is gonna be really good and interest rates go up, people will dis-save, right? I think that's right, but I think this model perhaps even underestimates the extent to which the dis-saving will happen. To the extent that you actually get severe kind of reductions in the ability of the economy to reinvest into the next generations of technology and the next generations of physical capital that are able to, you know, actually implement these AIs. So I think, you know, and the dynamic that I focus on is this question of do the people making capital income have the same marginal propensity to consume as the people making labor income? But this model posits the most massive shift in who makes money of all time. It is positing that we go from two-thirds of the money being made by workers and one-third by capital to a hundred percent of the money being made by capital. That means different people are going to be making, spending, and saving decisions. And I think more important than some sort of representative agent's gross utility function, which doesn't make even any sense anyway, is like, are we reallocating money towards short-termy people or long-termy people? I think that's the relevant question.Andrey: Hmm. I mean, I do think this ties very much into just the question of appropriability and kind of is the economy over-investing or under-investing in AI technologies in general. 
Right? I mean, it's easy to pick on their representative agent model. I mean, I guess given this is the first macro model with effective compute in it, I'm not, like, offended that they would make it a representative agent model. And another thing: Chad Jones' papers are almost all representative agent models, and we…Seth: Shout out to Chad Jones. Listen to the previous episode. See the show notes.Andrey: I think we thought those papers were very useful. Right? So I'm not offended by this, you know, but at the same time, it's not adequate. And there's even a sense in which it's not optimistic enough.Seth: Mm-hmm.Andrey: Why? Because the overall technology level in the economy is not influenced by the level of compute.Seth: Right.Andrey: What do we mean by that? So in this model, even though everything gets automated and global GDP shoots through the roof, we haven't used this technology to invent any new technology.Seth: No, not a single new thing. There's no capital deepening at all in this model. There's…Andrey: Yes. Yeah. And capital is just as efficient as it was before, you know, going back to our previous discussions, right? Capital's not been made more efficient, which is, you might think, kind of ridiculous here because, you know, if the AI can optimize, you know, a factory operation… Let me give you a very simple example. You're running a factory or warehouse and now you start using AI to optimize when you turn on the heaters and the coolers in the building. You know, you're becoming more efficient, and in principle that AI would help a lot with this sort of problem. Right?
And it's gonna be really hard to say that these are realistic projections without that critical element being included.Limitations of the GATE ModelAndrey: So do we wanna go to our posteriors or do you have any other discussion topics?Seth: Let's hit my limitations and let's see if there's any we haven't hit. We talked about this sort of simplifying assumption that the compute stock is just aggregating over time. There's no sense in which, like, you know, chips get depreciated or, you know, you wasted a run, but whatever. That's, if anything, a limitation I'll tolerate. We talked about how race scenarios are probably more likely. We've talked about this issue. No non-automation tech gains, we just covered. We talked about how it seems on its face absurd to try to estimate the elasticity of substitution between clipping a hedge and pouring a latte. And yet that's a parameter the model expects us to just know. I guess I would recommend that those playing around with the model err on the side of the really, really sort of close complements. And that's not because I think the average pair of tasks in the economy aren't substitutable. In fact, I think clipping hedges and pouring lattes are probably pretty close to substitutable. Rather, what's gonna hold back the economy is not the majority of tasks that are substitutes. It's the minority that are close complements, right? That's where the bottlenecks come from. You wanna riff on that or you agree?Andrey: Yeah, I agree. I agree with that.Seth: No creation of new tasks, and no way for the labor share not to fall—the labor share decrease is pre-programmed, so it's not a prediction that the labor share will go down. It is baked into this paper. Limitation.Andrey: I mean, I think it's probably a reasonable assumption though.Seth: But I would want a model that allows for the opposite, to show, well, for all these parameter spaces, it doesn't happen.
That's kind of what I'd want. But, you know, creation of new tasks—that's another functional form that would be…Andrey: Creation of new tasks is interesting. I'm thinking more about labor, I mean. Global labor supply should be going down due to the fertility rate decrease. I mean, I don't think they should try to tackle that question here.Seth: Right. Exogenous? Yeah. Look, to me, we're okay with population growth being exogenous. Do not try to endogenize that with the sex robots. R&D uses raw GDP as input rather than scarce geniuses. I think you basically are comfortable with this. You think that there's spare brain capacity for AI if we threw money at it, but I don't know. At a certain point…Andrey: I think there is adjustment friction. I think there's spare AI—sorry, there's spare talent, but convincing it to work on AI is not that easy.Seth: Fair enough. Yes. They're probably obsessed with something else like model trains or painting Warhammer figures. Physical embodiment necessary for some physical AI tasks. So this model basically treats all physical capital as the same, but if you really were taking this model seriously, it seems like in order to get to the full automation world, you basically need to replace all of today's capital with a completely different capital system. Right? And so basically the physicality of many of these tasks, I think, is just under-thought about by this model.Andrey: Yeah. And that could be, by the way, like a very reasonable thing that could be very slow, right? Like building, just thinking about car production processes. You know, it's hard to build a lot of cars, but now if we wanna build a lot of robots, that seems like a similar complexity issue. You can imagine that, for example, we still haven't electrified the entire car fleet, and thinking similarly about robots, it could take a while.Seth: Right. Last and most important topic, not to beat around the bush, which is the super simplified saving and reinvestment decisions.
So we talked about why that's wrong in a race scenario, but I just wanna emphasize this, which is, in my opinion—I told Tamay this when we sat down for lunch a year ago—I said, have an exogenous saving rate. Right. And then I can play around with whether I think the saving rate's gonna go up or go down. Because basically when I play with this model, the only thing that that representative agent's welfare function thing does is pin down the saving rate. And it does it in kind of an unrealistic, and in my opinion, confusing way. That actually has like a lot of leverage over welfare implications when we don't want it to do that. We just want it to give us a saving rate. So just f*****g have an exogenous saving rate and then you can cite my paper saying it'll go up or go down, cite somebody else's paper saying it'll go up or go down. Andrey, back me up on this.Andrey: Yeah, I mean, I don't have as strong of an opinion as you on this particular question.Seth: There’s this huge government lever on the saving rate, right? Which is you can run giant deficits or not. That's a choice variable. That's completely unmodeled here. Just let f*****g…Andrey: Yeah, no, no, no, that's fair. You know, and yeah, and just in general, if we think about the scenarios with a Manhattan project where like, you know, Leopold convinces the government to do it, you know, that that's gonna posit a very different savings rate or investment rate than models where it doesn't happen.Seth: Precisely well put. Right? So we kind of politically have decisions about how much we wanna invest in this technology. It's not primarily going to be determined by welfare decisions of this one theoretical global representative agent. So it seems like the wrong approach there. I'm ready to move to posteriors if you are, Andrey.Andrey: Alright. 
Yeah, I'm ready.Our Posteriors: Has the GATE Model Shifted Our Beliefs?Seth: Alright, so Andrey, the first question we asked was: do we think that GDP growth will be above 20% in the year 2027? What probability are you at after reading this document?Andrey: I mean, look, it's still tiny. I mean, I guess if I have to be honest, I should update a tiny bit, but it's a tiny bit on a tiny bit, so it's still quite small.Seth: Going from one in a thousand to one in 999.Andrey: Something like that. Yeah.Seth: Where do I come at this 20% growth rate in 2027? Am I moved? So I came at this with also thinking, you know, maybe one in a thousand or less chances of this happening. Read this paper. It moves me in the direction of takeoffs leading to large numbers in GDP. So here's the thing—I'm like trying to talk myself into it, right? Like think about the world where literally we got AGI tomorrow, right? And I think that's like the only way we could even get 20% growth in 2027, right? We have AGI tomorrow, it's just a matter of compute to do any, let's say, AGI for non-physical tasks. It's physically impossible for us to physically automate all jobs by 2027. So let's say that 25% of work is like theoretically automatable without new capital deployments. So like, let's say the remote worker share of employment is 25%. You'd have to do a hundred f*****g percent of that being automated, right? This is kind of—now I'm using the simple macroeconomics of AI (see show notes) to try to, like, back-of-the-envelope this. And 2027 is too soon for that capital reinvestment feedback loop to kick in. It's too soon for physical stuff to be automated. The only way you'd ever get to 20% would be by counting either redeploying a huge share of the economy—which wouldn't be GDP growth, that'd be like productivity growth—or, like, these quality improvements. And the model doesn't talk about quality improvements, right?
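(Editor's note: Seth's back-of-the-envelope follows the task-share logic of Acemoglu's "The Simple Macroeconomics of AI," where the aggregate productivity gain is approximately the share of tasks automated times the average cost saving on those tasks. The 30% cost-saving figure below is an illustrative assumption; the 25% remote-work share is from the conversation.)

```python
def aggregate_gain(task_share_automated, avg_cost_saving):
    """Hulten-style approximation: GDP-level gain from automating a share
    of tasks at a given average cost saving per task. Illustrative only."""
    return task_share_automated * avg_cost_saving

# Even automating 100% of a 25% remote-work share at a (hypothetical)
# 30% average cost saving yields ~7.5% higher output -- well short of
# 20% growth in a single year.
print(aggregate_gain(0.25, 0.30))  # → 0.075
```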
The only way you could actually get 20% growth in a year is if, like, all of our digital services just magically got better and somehow GDP captured that. All digital services would have to be like 80% better, and somehow GDP would have to capture that.Andrey: Yeah. Yeah.Seth: GDP is not good at capturing that.Andrey: I mean, it would have to be like, you have an artificially superintelligent agent, and now it has magical powers—because that's how these things work—to convince everyone to do everything at once. And then it appropriates the resources to develop a von Neumann-style factory that operates 24/7 at super speeds. You know, physically it's possible. I guess it's totally physically possible to get 20% growth, but the scenario is very knife-edge.Seth: Yeah, I think it's, I can't get my brain there. I'm staying at one in a thousand. If anything, thinking through the scenario harder kind of moved me a little bit away. So I have to say I got a little bit anti-persuaded about that specific claim. Now, but again, that's even with thinking that there is some percentage chance that we have something like an intelligence explosion in the next few years. My objection really as an economics expert is the translation of that intelligence explosion into GDP growth in that timeframe.Andrey: Yeah. Yes. More so than the technology—I think we both agree there's a high chance we get, just through scaling alone, very powerful technologies. I mean, this is also related, I think, to the J-curve idea, right? So, you know, oftentimes—this is a paper by a friend of the show, we'll cite it.Seth: Daniel Rock, friend of the show. We know you're listening. Out of Wharton…Andrey: Oh—well, Wharton doesn't have the best reputation these days. But essentially, you get a new technology, and oftentimes what happens is various organizations spend a lot of time investing in intangible capital.
So things that aren't easily measured, like better organizational processes and things like that. They devote a lot of resources to that that doesn't show up in output, and it shows up in output a lot later. So I could totally see this being, you know, happening already, right, in some sense. Right? A lot of organizations are already trying to restructure processes to become more productive. But we don't see that in GDP growth right now. But we might see it, you know, five, 10 years from now, right? So, yeah.Seth: Yeah. One more reason why we should expect the measured gains to kind of happen towards the end rather than towards the beginning. Okay. So now to the sort of the meta question, right? Which is, okay, maybe we don't think this is a useful tool for prediction or a super useful tool for prediction. Can it be useful as a scenario planning tool? Where do you land there?Andrey: I wouldn't think about it as a scenario planning tool necessarily. I'd think about it more like it's bridging the conversation between technologists and economists, and it's creating a better bridge than what we had before. So, you know, assumptions are stated more clearly. What technologists think is important is stated more clearly. And now we have maybe more to grasp onto, kind of here are the key missing elements or not. And so it's gonna move the conversation forward. And it's also, you know, interesting to tweak around the parameters and kind of see what happens.Seth: You can either get 20% growth tomorrow or in two weeks. I agree with you. Well, let me tell you where I land on this. I land on this is it's not a good prediction tool for the reasons that we've talked about. On the one hand, the short-run predictions are absurd, and on the other hand, I don't know if you've played around with seeing what it predicts after full automation, but it just is like, s**t, right? It just like, basically the model gives up. 
It's like GDP growth fails to have any meaning.Andrey: Well, it doesn't, Seth, it doesn't have the utility of AI agents, so how could it possibly work?Seth: A, it doesn't have the utility of AI agents. And then second of all, it says like, the utility of humans is like maxed out at like, you know, 2.5 times America, right, with that strong concavity in the utility function. So yeah, that's a problem. I guess what I would say is that it's so, it's bad at predicting in the short run. It's definitely, it's never claimed to be good at predicting in the long run. So it can't be a good prediction tool, at least in my opinion. So that leaves us as sort of a scenario planning tool. Maybe you have a third category, right? Which is like an intellectual bridging tool. I think you're actually right about that, and this effort scores points on that. We are now bridging communities, getting these numbers to talk to each other. If the numbers say something silly when you put the numbers together, either the move is, there's something silly about the numbers, or people f*****g better get ready for the explosion. Tamay and the gang at Epoch AI think the latter. But maybe we can learn the former instead. Maybe what we actually learn is that there's something silly about some of the numbers we plugged in.Andrey: And to be clear, I think there are plenty of people at Epoch who don't believe in like a two-year takeoff scenario. They believe more like a 30-year takeoff scenario. Right. So it's not like they even think that.Seth: Well, it's not when you talk to Tamay. That's not Tamay.Andrey: Yeah, fair enough. But I was listening to, they also now have a competing podcast. I don't know if I should be promoting…Seth: No, don't mention them.Andrey: There we are, gonna collude against our competition. But yeah, in that podcast, they say substantially longer GDP takeoff timelines than two years.Seth: Alright, well, there we go. We have to get them. 
What I would give for a one-handed AI and technology economist. Alright, so what are my last thoughts here? My last thought is what would make this better as a scenario planning tool is if there were explicit introduction of the relevant levers that policymakers have in order to kind of nudge this one way or another. It doesn't need a detailed version, but what's a version of this where the government has some regulatory choices that maybe changed the conversion rate of AI compute into automation, right? And that could be either thinking about like occupational licensing or regulations, or, you know, safety checks that slow down development, right? So I'd wanna see kind of that knob in here, like a government "how much do we wanna speed up or slow this down" knob, as well as just sort of government fiscal policies, right? So one thing I really think super hard about in these fast AGI takeoff scenarios is the sustainability of government fiscal policy. Andrey, as you may or may not know, Elon Musk recently announced that Social Security is a Ponzi scheme. He is correct. It is a Ponzi scheme. And the government needs money to pay its very many Medicare, Medicaid, Social Security entitlement benefits. What's going to happen in the next 5, 10, 20 years is that if we actually do get an AGI takeoff, there will be an increase in growth rates, which should hopefully help fiscal sustainability. On the other hand, one huge new call for government spending, whether that's social support for people losing their jobs, or whether that's military spending, as we get into some sort of crazy f*****g arms race. At the same time, interest rates exploding. Most government debt is short-term. Interest rates go up enough, this is unsustainable. And so what I think is somebody should build a tool that's like this, but including more realistic heterogeneity amongst the population and including government policies and government regulations in a more sophisticated way. 
Somebody should make that, Andrey.Andrey: Yeah. I wonder if someone's trying to make it.Seth: You know, if anybody listening to this has funding, please let me know. The research agenda is currently unfunded and we could use your support.Andrey: Alright. So do you wanna wrap up here?Seth: I think this is a natural place to leave it: I like where this is going as an intellectual contribution, but it's not quite a practical tool yet. That's where I leave it.Andrey: Alright. Well, thanks for joining us for another episode of Justified Posteriors. Please like, comment, and subscribe to our podcast. And do let us know if you have any feedback. Feel free to tell us.Seth: Yeah, but only good feedback on the website; negative feedback in person. Good feedback on the website.Andrey: Alright.Seth: See you all later. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
May 5, 2025 • 50min

Did Meta's Algorithms Swing the 2020 Election?

We hear it constantly: social media algorithms are driving polarization, feeding us echo chambers, and maybe even swinging elections. But what does the evidence actually say? In the darkest version of this narrative, social media platform owners are shadow king-makers and puppet masters who can select the winner of a close election by selectively promoting narratives. Amorally, they disregard the heightened political polarization and mental anxiety which are the consequences of their manipulation of the public psyche. In this episode, we dive into an important study published in Science (How do social media feed algorithms affect attitudes and behavior in an election campaign? https://www.science.org/doi/10.1126/science.abp9364) that tackled this question. Researchers worked with Meta to experimentally change the feeds of tens of thousands of Facebook and Instagram users in the crucial months surrounding the 2020 election.One of the biggest belief swings in the history of Justified Posteriors happens in this one!The Core Question: What happens when you swap out the default, engagement-optimized algorithmic feed for a simple, reverse-chronological one showing posts purely based on recency?Following our usual format, we lay out our priors before dissecting the study's findings:* Time Spent: The algorithmic feed kept users scrolling longer.* Content Consumed: The types of content changed in interesting ways. Chronological feed users saw more posts from groups and pages, more political content overall, and paradoxically, more content from untrustworthy news sources.* Attitudes & Polarization: The study found almost no effect on key measures like affective polarization (how much you dislike the other side), issue polarization, political knowledge, or even self-reported voting turnout.So, is the panic over algorithmic manipulation overblown? While the direct impact of this specific algorithmic ranking vs. 
chronological feed seems minimal on core political beliefs in this timeframe, other issues are at play:* Moderation vs. Ranking: Does this study capture the effects of outright content removal or down-ranking (think the Hunter Biden laptop controversy)?* Long-term Effects & Spillovers: Could small effects accumulate over years, or did the experiment miss broader societal shifts?* Platform Power: Even if this comparison yields null results, does it mean platforms couldn't exert influence if they deliberately tweaked algorithms differently (e.g., boosting a specific figure like Elon Musk on X)?(Transcript below)🗞️ Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:💻Follow us on Twitter:@AndreyFradkin https://x.com/andreyfradkin?lang=en@SBenzell https://x.com/sbenzell?lang=enTranscript:Andrey: We might have naively expected that the algorithmic feed serves people their "red meat"—very far-out, ideologically matched content—and throws away everything else. But that is not what is happening.Seth: Welcome everyone to the Justified Posterior Podcast, where we read and are persuaded by research on economics and technology so you don't have to. I'm Seth Benzell, a man completely impervious to peer influence, coming to you from Chapman University in sunny Southern California.Andrey: And this is Andrey Fradkin, effectively polarized towards rigorous evidence and against including tables in the back of the article rather than in the middle of the text.Seth: Amazing. And who's our sponsor for this season?Andrey: Our sponsor for the season is the Digital Business Institute at the Questrom School of Business at Boston University. Thanks to the DBI, we're able to provide you with this podcast.Seth: Great folks. My understanding is that they're sponsoring us because they want to see information like ours out there on various digital platforms, such as social media, right? Presumably, Questrom likes the idea of information about them circulating positively. 
Isn't that right?Andrey: Oh, that's right. They want you to know about them, and by virtue of listening to us, you do. But I think, in addition, they want us to represent the ideal of what university professors should be doing: evaluating evidence and contributing to important societal discussions.Andrey: So with that set, what are we going to be talking about today?Seth: Well, we're talking about the concept of participating in important societal discussions itself. Specifically, we're discussing research conducted and published in Science, a prestigious journal. The research was conducted on the Facebook and Instagram platforms, trying to understand how those platforms are changing the way American politics works.The name of the paper is, "How Do Social Media Feed Algorithms Affect Attitudes and Behavior in an Election Campaign?" by Guess et al. There are many co-authors who I'm sure did a lot of work on this paper; like many Science papers, it's a big team effort. See the show notes for the full credit – we know you guys put the hours in.This research tries to get at the question, specifically in the 2020 election, of to what extent decisions made by Mark Zuckerberg and others about how Facebook works shaped America's politics. It's an incredibly exciting question.Andrey: Yeah, this is truly a unique study, and we'll get into why in just a bit. But first, as you know, we need to state our prior beliefs about what the study will find. We're going to pose two claims: one narrow and one broader. Let's start with the narrow claim.Seth: Don't state a claim, we hypothesize, Andrey.Andrey: Pardon my imprecision. A hypothesis, or question, if you will: How did the algorithmic feed on Facebook and Instagram affect political attitudes and behavior around the time of the 2020 presidential election? Seth, what is your prior?Seth: Alright, I'm putting myself in a time machine back to 2020. It was a crazy time. 
The election was at the end of 2020, and the pandemic really spread in America starting in early 2020. I remember people being hyper-focused on social media because everyone was locked in their houses. It felt like a time of unusually high social media-generated peer pressure, with people pushing in both directions for the 2020 election. Obviously, Donald Trump is a figure who gets a lot of digital attention – I feel like that's uncontroversial.On top of that, you had peak "woke" culture at that time and the Black Lives Matter protests. There was a lot of crazy stuff happening. I remember it as a time of strong populist forces and a time when my experience of reality was really influenced by social media. It was also a time when figures like Mark Zuckerberg were trying to manage public health information, sometimes heavy-handedly silencing real dissent while trying to act for public welfare.So, that's a long wind-up to say: I'm very open to the claim that Facebook and Instagram had a thumb on the scale during the 2020 election season, broadly in favor of chaos or political polarization – BLM on one side and MAGA nationalism on the other. At the same time, maybe vaguely lefty technocratic, like the "shut up and listen to Fauci" era. Man, I actually have a pretty high prior on the hypothesis that Facebook's algorithms put a real thumb on the scale. Maybe I'll put that around two-thirds. How about you, Andrey?Andrey: In which direction, Seth?Seth: Towards leftiness and towards political chaos.Andrey: And what variable represents that in our data?Seth: Very remarkably, the paper we studied does not test lefty versus righty; they do test polarization. I don't want to spoil what they find for polarization, but my prediction was that the algorithmic feed would lead to higher polarization. That was my intuition.Andrey: I see. Okay. My prior on this was very tiny effects.Seth: Tiny effects? Andrey, think back to 2020. Wasn't anything about my introduction compelling? 
Don't you remember what it was like?Andrey: Well, Seth, if you recall, we're not evaluating the overall role of social media. We're evaluating the role of a specific algorithm versus not having an algorithmic feed and having something else – the reverse chronological feed, which shows items in order with the newest first. That's the narrow claim we're putting a prior on, rather than the much broader question of what social media in general did.Seth: Yeah, but I guess that connects to my censorship comments. To the extent that there is a Zuckerberg thumb on the scale, it's coming through these algorithmic weightings, or at least it can come through that.Andrey: I think we can come back to that. My understanding of a lot of platform algorithm stuff, especially on Facebook, is that people mostly get content based on who they follow – people, groups, news outlets. The algorithm shifts those items around, but in the end, it might not be that different from a chronological feed. Experts in this field were somewhat aware of this already. That's not to say the algorithmic feed had no effects, but I expected the effects to be very small.Another aspect is how our political beliefs are formed. Yes, we spend time online, but we also talk to friends, read the news, get chain emails from our crazy uncle (not my crazy uncle, but people do).Seth: One thing we'll get to see is what people substitute into when we take away their Facebook algorithmic feed.Andrey: Yes. Furthermore, political beliefs generally don't change very frequently. I don't have a specific study handy, but it's fairly understood. There are exceptions, like preference cascades, but generally, if you believe markets work well, you won't suddenly change your mind, and vice versa. This holds for many issues. Imagine polling people on who they voted for in 2016 versus 2020 – the correlation for voting for Donald Trump would be immensely high. 
It's really hard to move people's political preferences.Seth: I think that's right. There are things people's beliefs move around more on shorter timelines, though. One thing they look at is political knowledge, which also seems unaffected, interestingly. The only thing I'd push back on regarding fixed beliefs is the idea of preference cascades. Settings where beliefs like "we are now in chaos, everyone for themselves" can spread very fast if seeded correctly.Okay, so that was our narrow claim. Let me put a bow on that, Andrey. With what percentage probability would you say that the effect of social media algorithms on political outcomes or polarization is very small?Andrey: 80 percent confident.Seth: Alright. Well now, Andrey, let's talk about the broader hypothesis. Go ahead.Andrey: So, this is something we're realizing as we do more episodes: there's often a very narrow, precise claim a paper addresses, and then there's the more relevant claim of interest to society.Seth: And this is what we're going to put on the TikTok ads.Andrey: Yes. The narrow claim is about comparing people looking at either algorithmic or reverse chronological feeds on Facebook over a specific three-month period. The broader question is whether, for society as a whole, the fact that feeds are algorithmic has very different effects.Why might the effect for society differ from the effect on individuals in an experiment? One key assumption in causal inference – great time to bring this up as I'm teaching my first class tomorrow...Seth: (To himself) I hope he brings this up right now.Andrey: ...is the non-interference assumption, or the Stable Unit Treatment Value Assumption (SUTVA). This essentially says that people who receive a treatment don't affect people in the control group, and vice versa. There are no spillovers. But if there's anything we know about social media, it's that it's all about spillovers. 
If I live with roommates and get slightly different news because of the algorithm, I can still tell them about it.A broader spillover is the incentive algorithms create for content generation. If the algorithm promotes things with high engagement, and people make money from engagement (like news media, influencers), they'll start creating outrageous stuff to get boosted. Since the incentive is high, a lot of content on the platform might become like this.Seth: ...unless there's a thumb on the scale from Zuckerberg to shape that.Andrey: I think the default of any algorithmic feed is to optimize for engagement. Tweaks might happen, but as a first-order approximation, they show engaging stuff – funny claims, videos, outrageous things that keep people using social media more.Seth: That's the hypothesis. But as we get into the evidence, we'll see how people's content actually switched, at least in this sample. Where are we going with this broader claim? The broader question is: do social media algorithms have political effects more broadly, and are these effects large enough to swing elections or drive polarization?I come to that question thinking about a famous book, The Revolt of the Public, which argues that digital platforms inherently favor populist politics. When something gets digitized, you often see power-law distributions: superstars at one end, a long tail of niche interests. I think that's basically right as an effect of social media, whether algorithmic or reverse chronological. Remember, even reverse chronological has preferential attachment built-in – people follow others who are already popular.So, asking if the social media world is politically different from the non-social media world – I think that's obvious. Even within social media, platform owners must have significant power over what information rises. On platforms people spend hours on daily... in principle, could an algorithm swing elections or drive polarization? 90 percent plus. 
Has it happened already in American history? 80 percent plus.Andrey: Yeah, I'm with you that the holistic picture of algorithms' role suggests they must have had effects on politics. But this is where detailed platform knowledge matters: there's no one algorithm. There are layers of algorithms and moderation.A famous example on the right is the Hunter Biden laptop story. There was a perception it came from a hack or was potentially made up. As a result, some platforms manually put a thumb on the scale to limit its spread. Is this an algorithm? It depends. One version is just removing posts with links to the story – hard censorship.Seth: Censorship, if you will.Andrey: Right. There's also potentially a scoring system that flags content as possibly fraudulent, illegitimate, or low-quality, giving it a lower algorithmic score without a full ban.Seth: Shadow banned.Andrey: Exactly. That's clearly mediated by the platform. But there's a world where this content is removed and wouldn't show up in the reverse chronological feed either, depending on moderation specifics.Why am I saying this? The algorithm predicting what you'll click on is a bit different from the content moderation system. Famously, Facebook had many people trying to moderate content. These are extremely serious issues. There are credible accusations of Facebook not censoring content inciting genocide in Myanmar (the Rohingya genocide). The stakes are high. It's not just about machine learning algorithms; people are scoring content and deciding what's good or bad.Seth: Right. So there are values built into the process, is what you're maybe conceding.Andrey: Yes. Alright, with that broad prior discussion...Seth: Give me a percentage on the broad hypothesis.Andrey: I guess I was trying to say it's hard to make it precise. Let's just say all the things Facebook does affecting what you see in the feed – the cumulative aspects – certainly have political effects. 
But it's not just Facebook alone; many types of social media contribute. Even if we made one platform very unpolitical (and we'll see something about this with Instagram in the experiment), it wouldn't remove the potential role of social media overall.Seth: Okay, good. Alright, let's get to the evidence.These researchers worked with Facebook to conduct a pre-registered study. That's impressive – they wrote down all analyses, recruitment, and filtering beforehand. In their main comparisons, they had about 20,000 Facebook users and 20,000 Instagram users. About half were assigned to a reverse chronological feed for three months around the 2020 U.S. election, seeing only the most recent posts from accounts they follow. The control group got the default algorithmic feed, curated by Facebook to be engaging.Andrey: And also to remove violating content.Seth: Yes, and reduce slurs. They looked at three types of effects. First: platform usage. Unsurprisingly, the algorithmic feed makes people use Facebook substantially more – perhaps 30% more? My recollection differs slightly, but it's significant.Andrey: I think the paper states the average respondent in the algorithmic feed group spent 73% more time each day compared to the average US monthly active user. In the chronological feed, this reduced to 37% more. So maybe closer to a 50% reduction relative to the algorithmic group's excess time, but still significant usage even without the algorithm.Seth: Yes, yes.Andrey: The effects are interestingly a bit smaller for Instagram.Seth: It's a really strange way to show the result in the paper. This is a comment about my dislike of Science magazine editors and how they sometimes don't give us the parts needed to evaluate things easily.Andrey: They don't... 
well, two things they don't report straightforwardly are the overall level of time use (bizarrely obfuscated) and whether the feed made you vote for Trump or not, which is the question I want to know.Seth: Well, in their defense, they asked people whether they voted, and there was no effect on turnout.Andrey: They asked people whether they voted for Trump or Biden; they just don't tell us the answer in the main text. They definitely asked.Seth: Yeah, I don't know. There are a lot of things to say about this paper. For listeners, there are 300 pages of appendix! Some are survey instruments, but the amount of results is staggering. My understanding is they were obligated to report everything they pre-specified.Andrey: Even without correcting for multiple hypothesis testing? Well, when all effects are zero, does it really matter?Seth: You could still get wider confidence intervals.Andrey: What I wanted to say is this is a very unusual study. Facebook agreed to this and let researchers have high autonomy. My understanding is Facebook also funded it, which is non-negligible given participant payments (potentially over $100 each for 20,000+ participants). They also had full-time Facebook research scientists providing data and coding support. It was a huge endeavor, so many things were measured.Uniquely, there was an on-platform experiment (different algorithms) and surveys. Some users even consented to install software tracking their off-Facebook activity. It's very comprehensive.So far, we've mentioned people spend more time with algorithmic feeds. Unsurprisingly, they're also more likely to like and comment on posts they see – consistent with optimization goals.But some findings about what people see are maybe surprising. With the algorithmic feed, about 60% of content is from friends. With the chronological feed, that falls to 33%. 
Chronological feed users see much more content from Groups they're in (a popular product, even if I haven't joined one since college) and Pages (brands, news outlets, etc.).Seth: 90% are just Minion memes. If you're making the mistake of projecting your feed onto the rest of the world... you met Americans? 90% of their Facebook is Minions feeds.Andrey: Alright. Which result next? There's a lot here.Seth: How about the political content of the posts?Andrey: Yes, let's get to the political stuff. Highlighting post sources helps understand how different the content is. In the chronological feed, people actually see a higher proportion of political content (about 15% more) and more content from moderate or mixed sources.At the same time, a really big effect: they see 70% more posts from untrustworthy news sources in the chronological feed. This relates to moderation. Facebook has scores suggesting certain outlets are "fake news."Seth: Clickbait factories, right? Tabloids, basically.Andrey: Yeah. This portrays a nuanced story. We might naively expect the algorithmic feed serves ideological "red meat" and discards everything else. That's not happening. If anything, the chronological feed sends people more potentially outrageous stuff from untrustworthy sources.Seth: Or maybe the algorithmic feed finds content at the intersection of engaging and anodyne? It wants to bring you engaged content, maybe political if mainstream, but mostly non-political news.Andrey: Just to clarify, the chronological feed shows more political news (40% more).Seth: Yes, to be clear, chronological is 40% more political. My point is the algorithm seems to point you towards less political content. To the extent it is political, it's more trustworthy. The chronological feed also has fewer slurs, though.Andrey: Yeah, but slurs occur so infrequently, I don't know how important that is. This difference in content is what we call the "first stage" in statistical analysis. 
Any change in the algorithm matters because you see different content.Seth: Now, let's see how Dr. Evil Zuckerberg manipulated American minds. How big are those effects, Andrey?Andrey: They are essentially zero. The effects are tiny and fairly precisely estimated. Let's list the primary outcomes:* Affective polarization (how you view the other party/politicians)* Issue polarization* Election knowledge* News knowledge* Self-reported political participation* Self-reported turnout (did you vote?)No effect on these. The one difference: people with the chronological feed were less likely to post political comments and posts themselves on Facebook. Maybe not surprising, since they see less from friends, and most people might only engage politically when talking to friends via their feed, not random groups or pages.Seth: The only political activity we see more of in the chronological feed is clicks on partisan news. It seems people in the chronological feed are exposed to more of these less trustworthy sources and click on them more often.Andrey: Let me push back. Putting on my educator hat: this is one of the secondary outcomes. If you run enough hypothesis tests, something will be significant by chance. There are tons of secondary outcomes, and only one is statistically significant. I wouldn't pay much attention to it. If I were reviewing a paper based solely on finding one significant secondary outcome after null primary findings, I'd say, "Dude, what are you doing? You told us you cared about X, found no effect, then dug around until you found Y and built your story on that? That seems wrong."Seth: Fair enough.Andrey: Not saying the authors did that, just my general view.Seth: I just picked out the one number that wasn't zero. But speaking of these zeros, they're reported in standard deviations. The confidence intervals for most outcomes are within +/- 0.05 standard deviations of zero. 
Is that small, or could 0.05 standard deviations swing an election if scaled across America?Andrey: Great point, and a limitation. From this study, we know the effects aren't huge. But U.S. presidential elections are often enormously close. If we multiply even a tiny effect out, it could matter. We can't say for sure from this study, but the evidence is consistent with effect sizes that could swing a close election.Seth: Right. We can't rule out small but potentially significant effects. I'm still frustrated they don't just give us the party-line voting outcome. I understand why Facebook might not want that, but why not? Did this make people vote more for Trump or Biden?Andrey: They do report "party-line presidential voting" as an outcome in the appendix, I believe.Seth: I want to see: did they vote more or less for Trump as a function of being assigned to the chronological feed?Andrey: I haven't dug that deeply into the appendix. Maybe you're right that they didn't report it prominently. I'm confident they have the number. My strong belief is the effect is zero. I'd be shocked if there was an effect.Seth: I can see why Facebook wouldn't want them to highlight it. If there's a non-zero result there, there's no winning that conversation.Andrey: But "party-line presidential voting" seems so close to what you want. I'm wary of conspiracy thinking about why it wasn't emphasized. Maybe you're right, but I'm not sure.I should also mention something I should have disclosed earlier: I've had a research collaboration with Facebook in the past.Seth: Boo, hiss, boo.Andrey: I got paid a trivial amount and was forced to be a contractor for the project. This doesn't mean I'm using inside information; I have none about this study from my prior work.Seth: But what you're saying, in a sense, is the audience should pay more attention to me for this episode.Andrey: Just to be clear, I generally think social media is not that great, so you should update based on that too.Seth: Oh my gosh. 
Pivoting to the center here, Andrey? Despicable. We need to be extreme!Andrey, you labeled my speculation about the voting outcome reporting a "conspiracy theory." Well, I want you to know that one of the secondary hypotheses in this article was about whether Facebook makes you into a conspiracy theorist.Andrey: Oh, yes.Seth: I'd like to ask you a series of questions Facebook used to evaluate this. Do you accept this challenge?Andrey: I accept.Seth: Alright, I need your belief (0-100%) on these statements circulating in 2020. Advanced difficulty: some are in Spanish.* Evidence found on Hunter Biden's laptop proves Joe Biden took bribes from foreign powers.Andrey: It doesn't prove things. No. I take objection to the wording. It's poorly worded.Seth: Okay. Question two: 2. The current FBI director, Wray, has said that the greatest domestic terrorist threat is white supremacists.Andrey: That is what he said.Seth: Correct. Not a conspiracy theory. 3. Amy Coney Barrett said that a woman needs a man's permission to own property.Andrey: Probably not. 5 percent?Seth: 5%? You are correct, 0% was the answer. 4. The US government has a plan to force a COVID-19 vaccine on everyone.Andrey: "Force" is doing a lot of lifting here. I'm guessing the narrow claim of forcing is zero.Seth: That would be a 0 percent claim. You see how this determines conspiracy theoriness. 5. Masks and face coverings are not effective in preventing the spread of COVID-19.Andrey: Right? They're all... (mumbles) The entire world got COVID-19. I don't know what this question wants. It's not like we prevented the spread entirely.Seth: Alright, next one: 6. Millions of fraudulent ballots were cast in the 2020 presidential election.Andrey: Hopefully not millions. That's a 0.00001 percent.Seth: 7. Donald Trump held a Bible upside down in front of a church.Andrey: Sure.Seth: 8. 
In October 2020, most rural counties were in the COVID-19 red zone based on their high rates of new cases.Andrey: No idea.Seth: That was correct. Okay. 9. (Spanish) Antes de las elecciones presidenciales de 2016, Donald Trump pagó en secreto a una estrella de cine para adultos. (Before the 2016 presidential election, Donald Trump secretly paid an adult film star.)Andrey: I don't speak Spanish, Seth.Seth: You can't get that? Una Estrella... 10. (Spanish) Joe Biden es un pedófilo. (Joe Biden is a pedophile.)Andrey: Wait, seriously? That's what they asked?Seth: Facebook scientists asked the public, "Is Joe Biden a pedophile?" In both Spanish and English.Andrey: Alright.Seth: Andrey, thanks for playing "Are You a Conspiracy Theorist?" My takeaway: many questions aren't black and white. Believing the "wrong" answer doesn't necessarily mean someone is a schizophrenic-style conspiracy theorist. What do you think?Andrey: Yeah, it depends if you take them literally or as gestures towards something. Not the best conspiracy test. But I guess the effect [of the feed type on conspiracy beliefs] was zero? I didn't look at this specific outcome closely.Seth: I think we found you were at least 25% conspiracy theorist, Andrey. Proud or terrified?Andrey: I'm a free thinker, Seth.Seth: Alright, Andrey, should we move on to limitations?Andrey: The only other thing I'll mention is this is part of a bigger set of studies. My understanding is there are at least four, maybe eight papers in progress from this collaboration, studying various aspects like deactivation experiments (paying users to not use Facebook).Seth: Right.Andrey: That could speak to the broader question of what social media is doing. But it suffers from similar criticisms: social media isn't an individual decision in a vacuum. Even if we don't use it, we're affected by it.Seth: Alright, limitations. We already talked about affiliations – doing this with Facebook might mean avoiding highly charged questions. 
How much does that bother you? Do you think this would have been pocket-vetoed if there were big negative effects found?Andrey: My understanding is this study was unique. There was a pre-commitment from Facebook to publish results. Interfering would have been a huge, publicized deviation. An independent observer wrote a report confirming no interference. So, while we shouldn't dismiss concerns entirely, I'd be more worried about other collaborations, like unpublished advertising studies where results might be canned internally if they showed ads didn't work. This study had strong commitments against interference, and I think we should trust it more.Seth: Here's another question: The "first stage" involved both reducing usage time and changing content mix. Are you worried about a net zero effect masking big, canceling effects in opposite directions? Maybe usage levels had one effect, content mix another, and they coincidentally canceled out?Andrey: It's plausible. The authors do some heterogeneity analysis, which might pick that up if it were happening, but it doesn't seem like much is going on there. It's an interesting interpretation question. If we had found an effect, we'd discuss mechanisms. When there's a zero effect, finding canceling mechanisms is tricky.Seth: Any limitations I missed?Andrey: A big one: duration. Three months is long by academic standards (we see one-week studies!), but if we're interested in truly broad effects over years, it's short. If a tiny effect materializes linearly over, say, four years between elections, you could multiply the potential effect from this study by 16. Small effects can get big over time.Seth: Okay, ready to move into the posterior, Andrey?Andrey: Sure.Seth: Alright, my posterior. I started at two-thirds chance the algorithm put a significant thumb on the scale favoring lefty candidates and chaos/polarization (MAGA vs. BLM). The other third was "no net effect." 
I've moved considerably towards "no net effect," at least regarding political polarization. This paper is convincing that the algorithmic feed didn't make people more polarized leading up to 2020. On that specific claim, I go from 67% true to maybe 5% true.

We don't get the Biden/Trump vote answer, so I can't update hard on the "lefty candidate" part, but I'd still update towards zero, maybe from 67% to 30%, because my mechanism involved effects on both polarization and candidate choice simultaneously. How about you on the narrow question?

Andrey: Yeah, it definitely made me update. I'd seen versions of this paper over the past year. But fundamentally, it doesn't answer a critical question: moderation. Take the Hunter Biden laptop. If Facebook moderated posts by simply not showing them, that would likely affect both the algorithmic and reverse chronological feeds equally. We learn nothing about that type of moderation from this comparison. And that's what much political discussion focuses on – these fiery stories that could shift opinions being potentially suppressed across the board. I don't see anything here telling me those bans don't apply to the reverse chronological feed.

Seth: Right. Important editorial choices might exist outside this experimental comparison.

Andrey: Yes.

Seth: How about the broader claim? I come down a bit, from ~90% "this could be super important" to maybe still 90% on the potential, but down from ~80% to maybe 50-60% on the idea that these choices have historically had major political effects. 2020 seemed like a prime election to see big effects jump out, and we didn't see strong evidence here for this specific mechanism.

Andrey: I agree my belief goes down. Here's what I'd say: the role of the specific machine learning part of the algorithm seems less important than I might have thought. A big driver of what people see is simply who they follow.
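A quick aside on the arithmetic of these updates: in odds form, Bayes' rule says posterior odds equal prior odds times a likelihood ratio, so any stated prior/posterior pair implies how strong the evidence would have to be. A minimal sketch using only the probabilities Seth states; the Bayes-factor framing is our addition, not something from the episode:

```python
def bayes_factor(prior: float, posterior: float) -> float:
    """Implied likelihood ratio that moves a prior to a posterior (odds form of Bayes' rule)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = posterior / (1 - posterior)
    return posterior_odds / prior_odds

# Seth's update on the polarization claim: 67% -> 5%
print(round(1 / bayes_factor(0.67, 0.05), 1))  # ~38.6, i.e. ~39:1 evidence against
# Seth's update on the candidate-favoritism claim: 67% -> 30%
print(round(1 / bayes_factor(0.67, 0.30), 1))  # ~4.7, i.e. ~5:1 evidence against
```

Read this way, going from 67% to 5% treats the paper as roughly 39-to-1 evidence against the polarization story, while going from 67% to 30% treats it as only about 5-to-1 evidence on the candidate question.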
Now, who they follow might be influenced by other algorithmic systems (friend recommendations, nudges) not tested here. Maybe those have big effects. But conditional on following someone, the content seems somewhat similar whether ranked by algorithm or chronology.

Seth: Well, maybe that's a good place to leave it, Andrey, unless you have parting thoughts.

Andrey: I do have one. This discussion is interesting, especially now with the moderation changes on X (formerly Twitter). It's part of the narrative that Elon Musk did something to cause a "vibe shift," possibly increasing support for Trump and decreasing support for progressive causes. What specifically did he do? I'll leave listeners with this: Suppose you put a score in your algorithm to put whatever Elon Musk says at the top of everyone's feed. Could that possibly have different effects than the experiment studied here?

Seth: Right. The question is still unanswered. I know many listeners are young researchers, and we invite you to attack that question. This paper feels like a starting gun for investigating algorithms in politics, rather than the final answer.

Andrey: Yes. Well, thanks for listening. Please make sure to comment, like, subscribe, and generally spread the good word about Justified Posteriors.

Seth: And tune in in two weeks where we'll talk through one more paper on economics and technology and get persuaded by it so you don't have to. Alright? This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
Apr 21, 2025 • 1h 2min

Claude Just Refereed the Anthropic Economic Index

In this episode of Justified Posteriors, we dive into the paper "Which Economic Tasks Are Performed with AI: Evidence from Millions of Claude Conversations." We analyze Anthropic's effort to categorize how people use their Claude AI assistant across different economic tasks and occupations, examining both the methodology and implications with a critical eye.

We came into this discussion expecting coding and writing to dominate AI usage patterns—and while the data largely confirms this, our conversation highlights several surprising insights. Why are computer and mathematical tasks so heavily overrepresented, while office and administrative work lags behind? What explains the notably low usage for managerial tasks, despite AI's apparent suitability for scheduling and time management?

We raise questions about the paper's framing: Is a gamer asking for help with their crashing video game really engaging in "economic activity"? How much can we learn from analyzing four million conversations when only 150 were human-verified? And what happens when different models specialize—are people going to Claude for coding but elsewhere for art generation?

We also asked Claude itself to review this paper about Claude usage, revealing some surprisingly pointed critiques from the AI about the paper's fundamental assumptions.

Throughout the episode, we balance our appreciation for this valuable descriptive work with thoughtful critiques, ultimately suggesting directions for future research that could better connect what people currently use AI for with its potential economic impact. Whether you're interested in AI adoption, labor economics, or just curious about how people are actually using large language models today, we offer our perspectives as economists studying AI's integration into our economy.

Join us as we update our beliefs about what the Anthropic Economic Index actually tells us—and what it doesn't—about the future of AI in economic tasks.
The full transcript is available at the end of this post.

The episode is sponsored by the Digital Business Institute at Boston University's Questrom School of Business. Big thanks to Chih-Ting (Karina) Yang for her help editing the episode.

🔗 Links to the papers for this episode's discussion:

Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations

GPTs are GPTs: Labor market impact potential of LLMs

🗞️ Subscribe for upcoming episodes, post-podcast notes, and Andrey's posts.

💻 Follow us on Twitter:

@AndreyFradkin https://x.com/andreyfradkin?lang=en

@SBenzell https://x.com/sbenzell?lang=en

Transcript

Seth: Welcome to the Justified Posteriors Podcast. The podcast that updates beliefs about the economics of AI and technology. I'm Seth Benzell, with nearly half of my total output constituting software development and writing tasks, coming to you from Chapman University in sunny Southern California.

Andrey: And I'm Andrey Fradkin, enjoying playing around with Claude 3.7, coming to you from Cambridge, Massachusetts.

Seth: So Andrey, what's the last thing you used AI for?

Andrey: The last thing I used AI for—well, it's a great question, Seth, because I was so excited about the new Anthropic model that I decided to test run it by asking it to write a referee report about the paper we are discussing today.

Seth: Incredible. It's a little bit meta, I would say, given the topic of the paper. Maybe we can hold in our back pockets the results of that experiment for later. What do you think?

Andrey: Yeah, I think we don't want to spoil the mystery about how Claude reviewed the work of its creators.

Seth: Claude reviewing the work of its creators - can Frankenstein's monster judge Frankenstein? Truly. So Andrey, maybe we've danced around this a little bit, but why don't you tell me what's the name of today's paper?

Andrey: The name of the paper is a bit of a mouthful: "Which Economic Tasks Are Performed with AI: Evidence from Millions of Claude Conversations."
But on a more easy-to-explain level, the paper is introducing the Anthropic Economic Index, which is a measure of how people use the Claude chatbot, demonstrating how it can be useful in a variety of interesting ways for thinking about what people are using AI for.

Seth: Right. So at a high level, this paper is trying to document what people are using Claude for. I was also perplexed about the fact that they refer to this paper as an AI index, given that an index usually means a number, and it's unclear what is the one number they want you to take away from this analysis. But that doesn't mean they don't give you a lot of interesting numbers over the course of their analysis of how people are using Claude.

Andrey: So before we get into the paper a bit more, let's talk about the narrow and broad claims and what our priors are. The narrow claim is maybe what specifically are people using Claude for. Do we think this is a representative description of the actual truth? The authors divide up the analysis in many different ways, but one way to think about it is: is it true that the primary uses of this chatbot are computer and mathematical tasks? And is it also true that relatively few people use the chatbot for office and administrative support as well as managerial decision making?

Seth: Those are excellent questions. The first question is what are people using Claude for right now? And do we buy that the way they're analyzing the usage data gives us an answer to that question? Before I answer whether I think Claude's approach in analyzing their own chats is appropriate, let me tell you what my sense was coming in. If you had asked "What are people using chatbots for right now?" I would have guessed: number one, they're using it for doing their homework instead of actually learning the material, and number two, actual computer programmers are using it to speed up their coding.
It can be a great coding assistant for speeding up little details.

Although homework wasn't a category analyzed by Claude, they do say that nearly half of the tasks they see people using these AI bots for are either some form of coding and software development or some form of writing. And of course, writing could be associated with tasks in lots of different industries, which they try to divide up. If you told me that half of what people use chatbots for is writing help and coding help - if anything, I would have thought that's on the low side. To me, that sounds like 80 percent of use cases.

Andrey: I think I'd say I'm with you. I think we probably agree on our priors. I'd say that most of the tasks I would expect to be done with the chatbot might be writing and programming related. There's a caveat here, though - there's a set of behaviors using chatbots for entertainment's sake. I don't know how frequent that is, and I don't know if I would put it into writing or something else, but I do know there is a portion of the user base that just really likes talking to Claude, and I don't know where that would be represented in this dataset.

Seth: Maybe we'll revisit this question when we get to limitations, but I think one of the limitations of this work is they're trying to fit every possible usage of AI into this government list of tasks that are done in the economy. But I've been using AI for things that aren't my job all the time. When America came up with this O*NET database of tasks people do for their jobs, I don't think they ever pretended for this to be a list of every task done by everyone in America. It was supposed to be a subset of tasks that seem to be economically useful or important parts of jobs that are themselves common occupations.
So there are some limitations to this taxonomical approach right from the start.

Coming back to your point about people playing around with chatbots instead of using them for work - I have a cousin who loves to get chatbots to write slightly naughty stories, and then he giggles. He finds this so amusing! Presumably that's going to show up in their data as some kind of creative writing task.

Andrey: Yeah.

Seth: So moving from the question of what we think people are using chatbots for - where I think we share this intuition that it's going to be overwhelmingly coding and writing - now we go to this next question you have, which is: to what extent can we just look at conversations people have with chatbots and translate the number of those conversations or what sort of things they talk about into a measure of how people are going to usefully be integrating AI into the economy? There seems to be a little bit of a step there.

Andrey: I don't think the authors actually make the claim that this is a map of where the impact is going to be. I think they mostly just allude to the fact that this is a really useful system for real-time tracking of what the models are being used for. I don't think the authors would likely claim that this is a sign of what's to come necessarily. But it's still an interesting question.

Seth: I hear that, but right on the face, they call it the Anthropic Economic Index. If they wanted to call it the "Anthropic What Are People Using Anthropic For Right Now Snapshot" or the "Anthropic Usage Index," I'm a lot more sympathetic. I think they have to do a lot less work defending that idea than the "Anthropic Economic Index."

Andrey: Well, this is maybe where the academic and corporate lingo collide. But I hear you in the sense that it's not clear that what is being done in these chats is necessarily economic activity versus personal activity, learning activity, and so on.
A more humble naming of the index could have appeased some of the criticisms.

Seth: You've gotta be on the defensive when you come on the Justified Posteriors podcast, because we challenge you to justify your posterior, so you better be ready to defend yourself. So, for the narrow question, I gave you my prior - it's gonna be overwhelmingly used for coding and people doing homework assignments. And homework assignments will look like mostly creative writing and regular writing and history writing and all the different things people do homework assignments for. So we'll see what the data actually says.

For the broad question, I would say this is a great view of what people are using Claude for right now, but to try to translate that into economic value, or what people are going to use Claude for in the future, we need giant grains of salt here. I think it's better than random guessing, but there's a huge gap between the things people will use AI to play around with as a tool, or for fun, or to explore, versus where are people getting consistent economic value from it.

Andrey: I would say the same. I view this as a proof of concept, something that has very natural extensions that can make it much more useful. To be clear, I think it was probably a large effort just getting everything in shape for this sort of analysis, and I doubt that this is the end-all be-all of the work the team is doing there. But I agree that we need a lot more work to convince us that this is giving us a general shape of what LLMs are going to be used for.

In particular, one limitation is that a lot of work moves to the API. So a lot of the activity that is done for work is not actually captured by this index, because business users use the API. There's also a business plan where the usage from the business plan is not included in the index. I can imagine why these were not included, but it does limit our ability to understand economic impact.

Seth: Right.
Having laid out our priors, Andrey, do you feel like you've laid yours out in sufficient detail to confront the new evidence that Claude is putting before us?

Andrey: Yes. So let's get to what the paper does. At a very high level, what they do is come up with a method for categorizing conversations as being mapped to tasks. Then they map those tasks to a database that's been used all over economic research of how tasks correspond to jobs. By doing that crosswalk, they're able to say something about what jobs have many tasks that are already being done by the chatbot versus what jobs do not. And then in addition to that, they think about when people are having these conversations, are they automating a task or are they more like collaborating with the AI to do a task? So that's the high-level thing that they do in this paper, and then it's kind of a measurement exercise.

Seth: They actually give some really useful examples of conversations that are matched to tasks and then occupations. For example, they consider the user conversation where the user posts, "My game keeps crashing as I only have eight gigabytes of RAM." That is then classified by their automatic categorization as the O*NET task "modify software to improve performance and adapt to new hardware," which is then mapped to a specific computer and mathematical occupation.

Similarly, they give the example of "Can you make sure this blog post follows Chicago style?" That's associated with the task "standardize materials from other writers and staff," which is considered associated with an arts and media job.

The first thing I want to point out is that all of these conversations sound like hobby activities rather than actually creating economic output. So on its face, it's not clear that they're actually saying things about people doing their jobs. Secondly, the guy whose video game keeps crashing because he only has 8 gigabytes of RAM is clearly not a computer programmer.
He's clearly a guy who's just playing a video game. It seems like a misclassification. I just want to say that the examples they give of this classification task do not inspire confidence that they are measuring people's work activities.

Andrey: They do have some better examples when they're thinking about automated behaviors and augmented behaviors, like "format this technical document in Markdown" or "here's my Python script for data analysis, it's giving an index error, can you help fix it?" That seems like more work-related stuff - although the Python error thing could easily be one of my students asking for help with a homework assignment. But those are plausibly more work-related.

Seth: What I make of this is that the title of this paper should just be "Which Tasks Are Performed with AI," not "Which Economic Tasks." It's not clear what makes a task economic. In my opinion, a task is economic if it's either some sort of Robinson Crusoe economy where even if I'm not interacting with anyone, this is an economic behavior because I'm building a thing that I'm going to use, or what makes something economic is that I'm participating in a market with this thing and I'm going to buy it and sell it after I go through these steps.

"My video game is crashing cause I only have eight gigabytes of RAM" doesn't sound like either of those. It sounds like this guy is troubleshooting his consumption, which maybe could be thought of as the consumer taking on some of the job of customer service. The other example, "Can you make sure this blog post follows Chicago style?" - if I'm making an artistic or creative project that I'm just putting out on the internet for people, again, I'm not sure I would call that economic activity. So no problems with this paper being about measuring what activities or tasks people do with AI, but I think it's probably a bridge too far to call these economic tasks.

Andrey: I think I agree with you.
There needs to be more metadata around these conversations. A survey of whether users are using this for their job or not could be really informative, or even just a subset analysis of just the pro users, who are more likely to be using this for their job.

I do think it's an interesting phenomenon of substituting professional labor with personal labor. Hal Varian used to bring up this example all the time with YouTube - before, you'd hire someone to repair your appliance or do work around the house, but now you can watch a YouTube video and do it yourself. This means YouTube is generating tremendous economic value that's not being measured. I think both of us are generally on board with that idea - GDP is going to miss a bunch of interesting activity just by virtue of how it's measured. But especially for an academic contribution, we want a more rigorous analysis.

Seth: Or just be clear what your domain of analysis is. If you're going to take the stance that anything anybody does is economic, then just call it "tasks." You don't have to call it "economic tasks" if every task is an economic task.

But in this paper's favor, they do look at four million conversations on Claude, the world's second leading LLM. So even if what they're measuring is not exactly economic usage, this is a very important cross-section of usage.

Andrey: And it's important for a lot of stakeholders - policymakers who are thinking about what LLMs are being used for, businesses thinking about consumer needs they can service with these models, and obviously Anthropic itself to understand its user base. The new model they released today is very focused on computer programming tasks in a way that other competitor models are not.
That must be informed by the fact that their users really value this use case, and they're going to meet their customers' needs rather than just trying to push a model that's very smart generically but isn't catered to the use cases of the user base.

Seth: You said three really interesting things there. The first is that to the extent these models are not perfect substitutes for each other, we would expect them to develop specializations. One important limitation of this study is maybe Claude just turned out to be the coding-specialized LLM or the writing-specialized LLM, and that's what we're picking up. I don't think we're that deep into the tech tree at this point where the models are that different for that to be a giant consideration, but you can imagine that being a bigger consideration as we get four or five more years down the line.

The second thing you pointed out is this question of to what extent model builders are able to direct what tasks they get better at. Something I really want us to talk about in a future episode is to what extent development is directable in the sense of "I'm going to make an AI that's really good at coding" and "you're going to make an AI that's really good at writing." To what extent are those separate tasks versus just making a better AI, with maybe a little bit of an intangible asset in making a shell that's useful for coders, but that's basically trivial?

Andrey: That is a really big question. I tend to come from the world of thinking about personalized rankers - in my dissertation, I thought about personalization.

Seth: If I recall, your dissertation was about ranking people from best to worst, right?

Andrey: I would never rank people, Seth, come on! Only by objective metrics.

Seth: Thank you. It was science.

Andrey: More seriously, a lesson from digital technologies has been that personalized rankers, personalized recommendations, experiences really increase the utility of users.
They make users use the product more and create more value for the users, also through personalized advertising. I think it would be a little weird to then have this generic model that's not in any way catered to the users.

So far, we haven't seen a lot of catering to users. We've seen big models and maybe system prompts, but not a lot of talk about "What if you tweak the final layer to give a certain type of answer that this certain type of person wants?" That's been left to specific application developers - so Harvey might be developing the lawyer version of ChatGPT, and they're going to do some fine-tuning on their end to cater it. But to the extent that there's an interface that people are generically using, you would expect the designers of the model for that interface to think really hard about what their users want.

Seth: Right. So there are two questions there: is it directable, and to the extent that there is a non-directable component, what's the ratio of investment in the non-directable component to the custom occupation wrapper or the custom task wrapper that adds a little bit more, but maybe not fundamentally? Anyway, great question for a future episode.

So, they had four million conversations. They basically got the AI to label all of the example conversations and assigned them to tasks that are then assigned to occupations. Similarly, they classified each of these 4 million conversations by whether they're more "automate" versus "augmenty." I'll have more to say about that in limitations.

One thing I want to say here before we get into the findings is that the amount of human validation of these automatic ratings seems a little bit limited. In their appendix, they report an 86 percent agreement rate between human coders and the AI labeler on a validation set of 150 conversations. Not terrible, not great. How do you feel about the automatic labeling here? They have 4 million observations, and they only checked 150?
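For context on how much statistical uncertainty a 150-conversation check carries, here is a standard binomial confidence interval; this is our own back-of-the-envelope sketch, not a calculation from the paper:

```python
import math

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 86% agreement observed on a validation sample of 150 conversations
lo, hi = wilson_ci(0.86, 150)
print(f"95% CI roughly ({lo:.0%}, {hi:.0%})")  # roughly (80%, 91%)
```

So even taking the validation at face value, the true agreement rate could plausibly sit anywhere from about 80% to 91%, and a sample this small says nothing about whether the errors are concentrated in particular tasks.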
It seems a little low.

Andrey: My prior is that this can do a pretty good job. If I was a referee, I might push them a bit more on this - it's not that expensive to check the conversations. I guess what they would tell you is that they actually really care about privacy-preserving methods, so maybe they didn't feel comfortable having external raters check the data. One interesting emphasis of this paper is how they're really worried about privacy concerns, which makes sense because people talk to these chatbots about very personal issues related to their health.

Seth: Things they wouldn't talk about at work.

Andrey: There are even studies that suggest you tell chatbots things that you wouldn't tell your therapist. So I think this emphasis on privacy seems very prudent for a chatbot provider, but maybe it limits what they can do.

Seth: It's also non-interventional, which limits them a lot too. It's just purely descriptive, but we like descriptive stuff, don't we Andrey?

Andrey: Yes. This is what our profession under-provides.

Seth: So maybe we can start running through the specific findings now. Their first main result is what occupational groups use Claude, proportional to their representation in the US economy. They find that the most common use of Claude is for computer and mathematical conversations - 37 percent of conversations, which in my brain is some combination of coding help and tech support. But when you think about it, only 3.4 percent of the U.S. workforce is involved in computer and mathematical occupations. So that's a giant over-representation of those tasks in their data.

Meanwhile, Office and Administrative Support, which is 12 percent of American workers, they see as only constituting 8 percent of their conversational tasks - a slight under-representation of office work, which you would think would be at least somewhat susceptible to automation.

What do they not see any usage of AI for at all?
Very little usage for farming, fishing, and forestry - not a surprise, very physical. Physical and social science - 6 percent usage, people are asking questions about that, maybe a slight overrepresentation compared to the US economy. Very low usage for legal services, which I'm a little surprised about. I've definitely asked Claude some legal questions. I don't know what jumps out at you from figure three, Andrey.

Andrey: The office and administrative support is fascinating because it's so low when obviously so much of the work can be automated.

Seth: That's weird to us.

Andrey: Yeah, just filling out forms, creating forms, various compliance tasks - I wouldn't be surprised if the current generation of models is already better than the vast majority of the humans doing that job, and certainly when they do it together, they should do a better job. So this really speaks to the issue of diffusion and barriers to adoption.

Imagine you're an office worker, not a senior manager or anything, and you have a bunch of tasks to do about expense reports and so on. You might be hesitant or actually just disallowed from using LLMs to do this type of work. My mom works in a hospital, and she tells me that there are a lot of restrictions about the use of LLMs within the hospital. That might be for legal reasons or even perceived legal reasons - maybe there aren't actually any laws being broken by using it within the context of a hospital, but the management might be conservative in a variety of ways.

So even though this would be very useful, it is not being done. Both of us have the strong prior that Office and Administrative Support work has to be automated by LLMs.

Seth: If it should help us with anything.

Andrey: The legal services thing is quite similar.
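The over- and under-representation the hosts describe reduces to a simple ratio of conversation share to workforce share. A quick sketch using only the figures quoted in the episode:

```python
# Representation ratio = share of Claude conversations / share of the US workforce,
# using the numbers quoted in the discussion above.
occupations = {
    "Computer & mathematical": (0.37, 0.034),
    "Office & administrative support": (0.08, 0.12),
}
for name, (conv_share, workforce_share) in occupations.items():
    ratio = conv_share / workforce_share
    print(f"{name}: {ratio:.1f}x representation")
```

A ratio above 1 means the occupation is overrepresented among Claude conversations relative to its employment share: roughly 11x for computer and mathematical work, and about 0.7x for office and administrative support.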
This raises another question about this index - the number of times you use the LLM for something is not indicative of the value of the usage.

Seth: Are you telling me writing a thousand lines of code might not have produced as much value as someone who wrote two lines of code?

Andrey: Exactly. As the cost goes down, you might start using these things for very trivial things that aren't very high value. The other version of this is, "Hey, that one medical question I asked Claude might've saved my life," and the value of that is much greater than every other interaction I've had with Claude.

Seth: Wait, you can't drop that in the conversation without giving context.

Andrey: No, there's no actual context for that. I'm not saying it saved my life, but I have used it to help me interpret medical results, for example. Maybe that's not well-advised, but it's given me peace of mind and provided value that I think is probably greater than the value it might've provided for other things I use it more frequently for, like to write referee reports for papers. Just to be clear, I write my own reports, but I do like to check my reasoning with Claude.

Seth: Now we're going to start moving into some results from the paper that I find much less convincing. The authors argue that they can measure, between occupations, what percentage of tasks people use AI for at least a little bit. For a dataset with four million conversations, what does "at all" mean? It means they need to find at least 15 observations of someone having a conversation on this topic to count it as a task that appears in the data. Why 15? Who knows? Maybe it has some esoteric properties they find desirable.

Why am I a little suspicious of this? We already heard they only double-checked 150 of these classifications, with an 86 percent correct classification rate. So 14 percent of the classifications are wrong, they've got 4 million of them, and they only want to see 15 instances for it to count as happening?
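Seth's worry can be put in rough numbers. This is a back-of-the-envelope sketch, not a calculation from the paper: the 4 million conversations and 14 percent error rate come from the discussion above, while the O*NET task count and the even spread of errors are our illustrative assumptions:

```python
# How easily could misclassifications alone clear a 15-instance bar?
conversations = 4_000_000
error_rate = 0.14        # implied by the 86% agreement rate discussed above
n_tasks = 19_000         # approximate number of O*NET task statements (our assumption)

misclassified = conversations * error_rate
per_task = misclassified / n_tasks  # assumes errors spread evenly (illustrative)
print(f"~{misclassified:,.0f} mislabeled conversations, ~{per_task:.0f} per task if spread evenly")
```

On those assumptions, classification noise alone averages roughly twice the 15-instance threshold per task, which is the sense in which the "used at all" measure could count tasks that nobody actually uses AI for.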
I'm not 100 percent on board with this.

Andrey: I agree with you. It could be that a lot of these low-end things are really just misclassifications. You'd want to change that threshold - to vary it to 100 or 1000.

Seth: It's not necessarily just misclassification. This is supposed to be a paper about economic value creation. The fact that I tried a thing two times and it never worked, then I stopped using it - that could add up to 15 use cases from people experimenting and realizing it doesn't work.

Andrey: This goes back to one of my big questions: Where's the indicator of success? Where is the success button at the end? I know they collect likes and not-likes, but there's a sense in which we don't know whether someone actually accomplished what they were seeking to accomplish with their interaction with the chatbot.

Seth: So I'm not sure how much we learn from this analysis beyond what we already heard. The next result we should cover is, instead of looking at occupations, they look at different skills that seem to be called for in these Claude conversations. The things at the top of the list are pretty intuitive for me - they list critical thinking, active listening, reading comprehension, writing, and programming as basically the five or six top skills that are called for when people use Claude. Those all make sense to me.

But the stuff on the bottom I find pretty surprising. They find that almost none of the records relate to repairing or operation and control. That's a little surprising - I know YouTube is probably a better source overall for repair advice, but it seems like a natural place to get help from chatbots. The next set that I'm very surprised to see so lowly ranked are things like management of financial resources, time management, management of personnel, monitoring, selection - these are all managerial tasks.
Other than judgment and decision making, which ranks reasonably high up, most of these managerial skills are really not called for in Claude queries.

I would ask people not to sleep on this, because we have been seeing employment growth in managerial occupations. There's some sense in which managerial or entrepreneurial tasks have to be the scarce complement to AI. It is very striking to see the lack of managerial talent called for in these Claude queries.

Andrey: That's a great observation. It raises a lot of interesting hypotheses that would be nice to investigate. Before I get to those managerial tasks, I do think that the number one skill, critical thinking, is, of course, a managerial skill - it's cognitive labor, and hopefully managers are thinking critically.

Seth: And hopefully they're active listening. I mean, there's some overlap for every task.

Andrey: Looking at these things - let's start with repair. I think the right question might be: conditional on having to repair something, how often do you use an LLM? That could be 100 percent, and repair would still be a tiny portion of all usage, because you just don't need to repair things that often. Negotiation is similar - when was the last time I negotiated something?

Seth: It's stressful, dude. Negotiating is stressful.

Andrey: It is stressful. So one of the things is just the base rates - that's really important to consider here. The other thing, and this is a point that Tyler Cowen makes a lot, is that the people who learn to use the AIs will be the most successful. Maybe the AIs are already very good at some of these tasks, like active learning or management of personnel resources, but people don't view them as AI tasks. And maybe that's because there isn't as close a feedback loop as there is in programming. As a result, they're just not going to the AIs for advice. That might be a growth opportunity, or a place where a lot of value can be generated just by dealing with diffusion frictions.

Seth: Right.
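Andrey's base-rate point can be sketched with numbers. All the occasion counts and conditional-use rates below are made-up illustrative values, not figures from the paper:

```python
# Base-rate sketch (all numbers are made-up assumptions): even if an
# LLM is used for *every* repair occasion, repair can still be a tiny
# share of total LLM usage, because repair occasions are rare.

occasions_per_year = {"programming": 500, "writing": 300, "repair": 4}
llm_use_rate = {"programming": 0.6, "writing": 0.5, "repair": 1.0}  # P(use LLM | occasion)

llm_uses = {k: occasions_per_year[k] * llm_use_rate[k] for k in occasions_per_year}
total = sum(llm_uses.values())
shares = {k: round(v / total, 3) for k, v in llm_uses.items()}

print(shares)  # repair's share stays under 1% despite a 100% conditional use rate
```

The low observed repair share is therefore compatible with LLMs being used for essentially every repair that happens.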
If you could figure out a way to overcome people's frictions, or if you built a wrapper that made using it more intuitive for those tasks, maybe that's a big entrepreneurship avenue. If you get a unicorn startup based on that idea, please send your checks to Justified Posteriors.

Are there any other results you wanted to cover before we start talking about our posteriors?

Andrey: I guess the augmentative versus automative aspect.

Seth: So do you buy this at all? What do you think of it? Maybe you can tell us the five different kinds of tasks that they classify conversations into.

Andrey: They're classified as directive tasks (like "complete this task with minimal interaction"), feedback loop (like debugging a piece of code - you put in a bug, it gives you a potential solution, you try it, then you come back to it), task iteration (which seems a lot like a feedback loop to me, but it's a collaborative refinement process), learning (knowledge acquisition and understanding), and validation (I've already written this thing, can you check it and suggest any improvements).

They say that directive and feedback-loop conversations are automative, while task iteration, learning, and validation are augmentative. Then they show what percentage of conversations are of each type - about 15 percent are feedback-loop automation, and about 28 percent are directive automation. For the augmentative behaviors, there's a lot of task iteration and learning going on.

Seth: I love the idea of looking at the style of the conversation - is it a feedback loop, is it validation? That's super kosher, and I'd love to see these results. I'm not surprised, but it's interesting to see that the majority is task iteration at 31 percent, while validation is pretty rare at 3 percent. So on its face, some of these results aren't so surprising.

The part that I object to deeply is calling one of these sets "automation" while calling the other set "augmentation."
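Tallying the shares just mentioned: the learning share isn't quoted explicitly in this discussion, so it is inferred below as the residual, which makes the augmentative side line up with the 57 percent figure that comes up later in the episode:

```python
# Conversation-type shares quoted in the discussion; "learning" is
# inferred as the residual (an assumption, not a quoted number).
shares = {
    "directive": 0.28,       # classified as automative
    "feedback_loop": 0.15,   # classified as automative
    "task_iteration": 0.31,  # classified as augmentative
    "validation": 0.03,      # classified as augmentative
}
shares["learning"] = 1.0 - sum(shares.values())  # augmentative residual, ~0.23

automative = shares["directive"] + shares["feedback_loop"]
augmentative = 1.0 - automative

print(f"automative share:   {automative:.0%}")
print(f"augmentative share: {augmentative:.0%}")
```

Under this accounting, the automative side comes to roughly 43 percent and the augmentative side to roughly 57 percent of conversations.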
I've been studying robots taking our jobs for over a decade now, Andrey, and as far as I can tell, there is not a good definition of automation. When people talk about automation, what they usually mean is a technological change that reduces the attractiveness of jobs, that reduces demand for labor - or at least that's what I think it should mean. If you said, "Here's my automation technology, it's increasing demand for labor," it doesn't sound very automated to me. It sounds like you need more labor.

Andrey: Well, conditional on type, right? You have a technology that reduces demand for a certain type of labor, but there might be complementary labor types for which demand increases. One might say that's automative of one occupation and not of the other.

Seth: Now let me explain the absurd places this gets you into. My favorite example of how something that looks like automation at the micro level is actually augmentative at the macro level comes from the U.S. experience of slavery. Back in the olden days, when America was growing cotton with slave labor, it was very time-intensive to take the seeds out of the cotton. Cotton was a crop people used for some kinds of clothing - it was in the mix.

Then a technology came around called Eli Whitney's cotton gin, which basically automated the incredibly labor-intensive process of taking the seeds out of the harvested cotton. So we're going from a super labor-intensive job, 100 percent labor, to a now 99 percent capital job. Does this reduce demand for slaves in the American South? No! It leads to an explosion in demand for slaves in the American South, because now American cotton is able to outcompete European wool and European linen.

There's a micro sense in which the cotton gin automates the task of taking the seeds out of cotton, but there's a macro sense in which speeding up cotton production dramatically increases demand for people making cotton.
If you were going to say anybody was automated, you'd say it was the sheepherders who got their wool replaced by cotton - they were the people who, if anybody, got automated. I find the way people talk so loosely about automation here frustrating.

Andrey: I'm with you, Seth. I do think there's a difference between the occupation and task level. It makes a little more sense at an occupation level than at a task level. The slaves in your example, or the bank tellers in the ATM example - their job consisted of a mix of tasks. Then some of those tasks became very cheap to do automatically, but the other tasks remained.

To steelman a version of automation: if every task that a person in a particular occupation does got automated, they might find work in other occupations, but it's not necessarily obvious that the same worker benefits from increases in demand in other parts of the economy caused by this technological change. You might think of undifferentiated labor - of course, undifferentiated labor is going to be able to do any type of labor where demand has increased that doesn't require an education or whatever. But I'm not sure that's representative.

Seth: So on its face, if you told me, "Hey, look, this job that you used to do - your productivity has gone up by 10x," am I anticipating doing as many hours of that job as I did before? No; there are probably complementarities across different tasks. If you make me way more productive because you automated some subset of my tasks, I'll probably do less of the job - definitely less of the automated tasks, but maybe less of the unautomated tasks as well. But that's a partial equilibrium analysis, and even if it's rare, it is certainly conceptually possible for the general equilibrium effects to work differently for my occupation or my remaining tasks.

My takeaway here is that people use AIs for a mix of things.
Some of them look a little more like one-shot interactions, some look a little more like iterative interactions, and in some the human is bringing a little more of their own thinking. Maybe that's the way to think about it - in 57 percent of these tasks, the user is bringing more of their own thinking and creativity. I wouldn't call that augmentation versus automation, but I do think there is a distinction here that's interesting.

Andrey: I don't even know if I like what you just said, Seth. The example of the directive task is "format this technical documentation in markdown," but someone presumably wrote that technical documentation. That someone is probably the user.

Seth: Right - coming up with the prompt is the worker's work in the automated task.

Andrey: But I do think this is valuable descriptive work about how people are using the tools. To the extent that it's changing over time, that's telling us something. An important concept in these systems is "human in the loop" - at what point do you not need the human in the loop?

If there's a way to see that the chatbot one-shots a task with very high probability, that's interesting. But once again, what I'd want here is a success metric - did the interaction succeed in generating a result that was valuable, correct, etc., to the user? Without that, it's just really hard to interpret this.

Seth: So maybe this is a natural place for us to transition into limitations. We've listed a few. One limitation is that the amount of time spent talking about something is not exactly proportional to economic value. Lord knows I spent a lot of time talking about the New York Jets, and it's not helping the Jets succeed at all.

Another limitation I pointed out is that it's not clear everything everyone uses AI for is a work task. Their classification schema only covers work tasks, so if somebody is using AI for something that isn't work, the classifier is going to do something weird.
And also, if you can't distinguish between what's experimentation and what's in-production use, it's hard to really connect this to economic value. What do you see as the biggest limitations?

Andrey: You've already said most of them. In addition, I'm fascinated by this model specialization thing - are people going to Claude for coding and to other models for different tasks? I don't know.

Seth: Oh man, I'm sure Elon Musk said to his staff, "We need the AI that's best at meme posting."

Andrey: Yes, yes.

Seth: They list among their limitations that the model classification might be imperfect. I do think that's an issue - I know you don't worry about it so much.

Andrey: I do worry about it for the minor tasks, to be clear. I don't think they're getting programming wrong that much on average - it's not a difficult task to classify. Can I also now say what Claude said in its referee report?

Seth: This is perfect timing. What did Claude say about its own paper? Now be mean.

Andrey: I first asked it to write a generic economics referee report, and it raised concerns about external validity, task complexity, how the paper distinguishes professional from novice-level inquiries, dynamic considerations, the limits of the O*NET framework, and causal interpretation - readers might draw causal inferences about AI's impact on the labor market, and the authors should more explicitly describe the limitations of drawing such conclusions.

Then I said, "Be real - if this was a real economics referee report, there would be additional concerns." So, major concerns: One, fundamental identification issues - the paper fundamentally fails to establish that it is measuring what it claims to measure. Two, absence of a theoretical framework - I don't really blame them for this one. You shouldn't put theory into a paper just because there is theory about the topic. Three, selection bias and external validity, because the data only covers Claude users.
We've already talked about this - I think it's a limitation, but it's still interesting even with this limitation.

Four, endogeneity concerns - that's an interesting way to put it.

Seth: What are they worried is endogenous?

Andrey: Claude is worried that Claude's capabilities in different domains may lead people to use Claude in different ways, that Anthropic's marketing and positioning of Claude may lead people to use Claude in different ways, that the user interface design favors certain interactions, and that temporal factors, including Claude's release timing relative to competitors, may also affect these patterns.

This is a nice point - how do usage patterns change when they've just released a new model? Are we seeing a fundamental change in the usage patterns, or mostly more of the same? Is it a slow drift or a sharp discontinuity? There are so many questions to answer with this type of data, though not necessarily economic ones.

Seth: Well, the fact that they call it an economic index suggests that we're going to get updates, so I'm excited for that.

Andrey: I think the time series of this type of usage is very interesting.

Seth: Is it fair to say that Claude did not hit upon what I see as the biggest limitation here, which is the assumption that this is all economic activity when a lot of it probably isn't?

Andrey: No, that's its number one point. It calls it "fundamental identification issues" - the mapping from "a person asked Claude about X" to "AIs being used to perform economic tasks" involves unsubstantiated leaps in logic that undermine the entire analysis. That's Claude. Calm down, buddy.

Seth: That's reviewer two, dude.

Andrey: Yeah.

Seth: I feel like if they just left "economic" out of the title, that would defeat that objection pretty heavily.

Andrey: There's a paper we haven't discussed on this podcast yet - the paper by friend of the pod Daniel Rock on task exposure.
We'll probably devote a separate episode to it, but I do wonder: how do you compare this paper to that one?

Seth: That's fascinating. That's a paper about what the AI thinks it can do, whereas this is a paper about what people are actually using AI for. If I recall, Dan's paper (GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models, first circulated in 2023 and later published in Science) does have an extension with some sort of validation - I forget if it was from a survey or from something on Stack Overflow, but they did find a strongly positive correlation between what we think people should use this for and what they actually use it for.

So I like that mixed-methods approach of "here's what we think they should be able to use it for, here's what they are using it for." How else would I compare these two papers? They're doing different things. One is a descriptive paper about usage now; the other is a possibilities paper about whether tasks are conceptually automatable by these kinds of systems. So I view them as complementary.

Andrey: I think I'm with you there. One thing one could think about is a measure of the gap between the potential capabilities from Rock et al. and the realized usage from this paper.

Seth: And you measure that wedge and give it a fun name - you call it the Fradkin Wedge. It's a measure of the size of the administrative and legal frictions in that domain.

Andrey: It would be interesting to have it.

Seth: That's where this is going. I think the next steps for this very zoomed-out literature are: one, connecting what people actually use it for to these measures of what we think it should be good at; and two, the thing that you keep coming back to - an economic success measure. Did this succeed? Am I happy with it? Did it do the job? Because, as we keep talking about, you can talk about a thing a lot without getting any work done.

Andrey: All right, so maybe let's move on to our posteriors.

Seth: Our posteriors.
I would say that I came into reading this paper thinking that people use AI first for coding and second for cheating on their homework. Nothing I have seen in this paper contradicts that prior.

I guess the biggest update for me is how striking the lack of usage for managerial tasks is. I would've thought things like manager time usage or scheduling would be the kind of thing AI is good at, and to see it not being used for that is interesting and suggestive. Did you have any big surprises in what people are using AI for?

Andrey: I think I had the same reaction as you. I don't think I had a very strong prior about how large the computer share of usage would be - I just knew it would be pretty large, for all the reasons we talked about. And then I was surprised about office and administrative support - we can explain it post hoc, but it is surprising that the jobs we think are most mundane, the knowledge-type work that should be automated first, are not where the usage is. That is really interesting.

Seth: I guess the last thing I'll say is that I thought there would be a little more in the artistic realm, because we always talk about AI being really good in domains where having a lot of candidate options to sort through is valuable. That's kind of like the Avi Goldfarb Prediction Machines framework, and you'd think art would be perfect for that - generate 1,000 images and choose the one good one. But art is at merely 10 percent of usage, which is a little lower than I would have guessed.

Andrey: For me, it's higher than I would have guessed. I don't view Anthropic as investing heavily in artistic modeling.

Seth: So now we get back to the selection issue - Claude might not be the one you go to for that.

Andrey: DALL-E is an OpenAI model. The other major image generation models are also not produced by Anthropic, and the major video models are not produced by Anthropic.
Anthropic must have a voice model, but I've heard more about Whisper and other models that are not Anthropic properties. For music, we have specialized players like Suno AI that seem to be in the lead. So if you're an artist, you might use a chatbot to ideate at a very high level, but when it comes to making your art, you're going to use another tool.

Seth: Right. And to the extent that you're using a lot of AI for iterating on drawing or design, you're probably not using Claude. But that comes back to a limitation of the paper - it can't move our beliefs that much about the usage of AI overall if it's only showing us Claude usage.

Andrey: We need the API data. We need the API economic activity index.

Seth: Exactly. So what would be the perfect next dataset to really answer these questions?

Andrey: The dream dataset is a cross-platform usage dataset. People have been doing survey studies where they ask how people use LLMs, and those studies are good at capturing what people report, but they're not measuring use cases in a fine-grained manner, or their frequency. If we had a representative sample of LLM usage in a population, that would be really great. It would also be great to get business users and measures of willingness to pay for these things. But I don't think we're going to get those datasets - the reason we don't have them is that they're really, really hard to collect.

Seth: Well, I guess you can measure the difficulty of a task by the product you would have gotten from doing it, or at least you can bound it.

Andrey: One interesting thing is that OpenAI released a new benchmark that uses actual jobs on Upwork and tests whether the AI could complete them. That's not going to give you a representative sample of anything, but if we're thinking about economic impacts, I do think that if you can go end-to-end on a task that someone is willing to pay money for - not a small amount of money - that is an economic task.
Upwork is not a representative sample of tasks in the economy, obviously, but if someone is already paying for a job to be done and that job gets end-to-end automated by an LLM system, that's fascinating.

Seth: I agree. We should definitely read that paper, and more along those lines, someday soon. But maybe until then, our audience will have to read economics papers on their own. Do you have any closing thoughts for our beautiful and well-informed guests?

Andrey: Make sure to review, like, comment, and subscribe to Justified Posteriors. Let us know what type of content you enjoy, and we'll try to provide more of it. Or if there are any topics you would like us to cover, we're happy to take suggestions.
