

Claude Just Refereed the Anthropic Economic Index
In this episode of Justified Posteriors, we dive into the paper "Which Economic Tasks Are Performed with AI? Evidence from Millions of Claude Conversations." We analyze Anthropic's effort to categorize how people use their Claude AI assistant across different economic tasks and occupations, examining both the methodology and implications with a critical eye.
We came into this discussion expecting coding and writing to dominate AI usage patterns, and while the data largely confirms this, our conversation highlights several surprising insights. Why are computer and mathematical tasks so heavily overrepresented, while office administrative work lags behind? What explains the notably low usage for managerial tasks, despite AI's apparent suitability for scheduling and time management?
We raise questions about the paper's framing: Is a gamer asking for help with their crashing video game really engaging in "economic activity"? How much can we learn from analyzing four million conversations when only 150 were human-verified? And what happens when different models specialize—are people going to Claude for coding but elsewhere for art generation?
We also asked Claude itself to review this paper about Claude usage, revealing some surprisingly pointed critiques from the AI about the paper's fundamental assumptions.
Throughout the episode, we balance our appreciation for this valuable descriptive work with thoughtful critiques, ultimately suggesting directions for future research that could better connect what people currently use AI for with its potential economic impact. Whether you're interested in AI adoption, labor economics, or just curious about how people are actually using large language models today, we offer our perspectives as economists studying AI's integration into our economy.
Join us as we update our beliefs about what the Anthropic Economic Index actually tells us—and what it doesn't—about the future of AI in economic tasks. The full transcript is available at the end of this post.
The episode is sponsored by the Digital Business Institute at Boston University’s Questrom School of Business. Big thanks to Chih-Ting (Karina) Yang for her help editing the episode.
🔗 Links to the paper for this episode’s discussion:
Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations
GPTs are GPTs: Labor market impact potential of LLMs
🗞️ Subscribe for upcoming episodes, post-podcast notes, and Andrey’s posts:
💻 Follow us on Twitter:
@AndreyFradkin https://x.com/andreyfradkin?lang=en
@SBenzell https://x.com/sbenzell?lang=en
Transcript
Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates beliefs about the economics of AI and technology. I'm Seth Benzel, with nearly half of my total output constituting software development and writing tasks, coming to you from Chapman University in sunny Southern California.
Andrey: And I'm Andrey Fradkin, enjoying playing around with Claude 3.7, coming to you from Cambridge, Massachusetts.
Seth: So Andrey, what's the last thing you used AI for?
Andrey: The last thing I used AI for, well, it's a great question, Seth, because I was so excited about the new Anthropic model that I decided to test run it by asking it to write a referee report about the paper we are discussing today.
Seth: Incredible. It's a little bit meta, I would say, given the topic of the paper. Maybe we can hold in our back pockets the results of that experiment for later. What do you think?
Andrey: Yeah, I think we don't want to spoil the mystery about how Claude reviewed the work of its creators.
Seth: Claude reviewing the work of its creators - can Frankenstein's monster judge Frankenstein? Truly. So Andrey, maybe we've danced around this a little bit, but why don't you tell me what's the name of today's paper?
Andrey: The name of the paper is a bit of a mouthful: "Which Economic Tasks Are Performed with AI: Evidence from Millions of Claude Conversations." But on a more easy-to-explain level, the paper is introducing the Anthropic Economic Index, which is a measure of how people use the Claude chatbot, demonstrating how it can be useful in a variety of interesting ways for thinking about what people are using AI for.
Seth: Right. So at a high level, this paper is trying to document what people are using Claude for. I was also perplexed about the fact that they refer to this paper as an AI index given that an index usually means a number, and it's unclear what is the one number they want you to take away from this analysis. But that doesn't mean they don't give you a lot of interesting numbers over the course of their analysis of how people are using Claude.
Andrey: So before we get into the paper a bit more, let's talk about the narrow and broad claims and what our priors are. The narrow claim is maybe what specifically are people using Claude for. Do we think this is a representative description of the actual truth? The authors divide up the analysis in many different ways, but one way to think about it is: is it true that the primary uses of this chatbot are computer and mathematical tasks? And is it also true that relatively few people use the chatbot for office and administrative support as well as managerial decision making?
Seth: Those are excellent questions. The first question is what are people using Claude for right now? And do we buy that the way they're analyzing the usage data gives us an answer to that question? Before I answer whether I think Claude's approach in analyzing their own chats is appropriate, let me tell you what my sense was coming in. If you had asked "What are people using chatbots for right now?" I would have guessed: number one, they're using it for doing their homework instead of actually learning the material, and number two, actual computer programmers are using it to speed up their coding. It can be a great coding assistant for speeding up little details.
Although homework wasn't a category analyzed by Claude, they do say that nearly half of the tasks they see people using these AI bots for are either some form of coding and software development or some form of writing. And of course, writing could be associated with tasks in lots of different industries, which they try to divide up. If you told me that half of what people use chatbots for is writing help and coding help - if anything, I would have thought that's on the low side. To me, that sounds like 80 percent of use cases.
Andrey: I think I'd say I'm with you. I think we probably agree on our priors. I'd say that most of the tasks I would expect to be done with the chatbot might be writing and programming related. There's a caveat here, though - there's a set of behaviors using chatbots for entertainment's sake. I don't know how frequent that is, and I don't know if I would put it into writing or something else, but I do know there is a portion of the user base that just really likes talking to Claude, and I don't know where that would be represented in this dataset.
Seth: Maybe we'll revisit this question when we get to limitations, but I think one of the limitations of this work is they're trying to fit every possible usage of AI into this government list of tasks that are done in the economy. But I've been using AI for things that aren't my job all the time. When America came up with this O*NET database of tasks people do for their jobs, I don't think they ever pretended for this to be a list of every task done by everyone in America. It was supposed to be a subset of tasks that seem to be economically useful or important parts of jobs that are themselves common occupations. So there are some limitations to this taxonomical approach right from the start.
Coming back to your point about people playing around with chatbots instead of using them for work - I have a cousin who loves to get chatbots to write slightly naughty stories, and then he giggles. He finds this so amusing! Presumably that's going to show up in their data as some kind of creative writing task.
Andrey: Yeah.
Seth: So moving from the question of what we think people are using chatbots for - where I think we share this intuition that it's going to be overwhelmingly coding and writing - now we go to this next question you have, which is: to what extent can we just look at conversations people have with chatbots and translate the number of those conversations or what sort of things they talk about into a measure of how people are going to usefully be integrating AI into the economy? There seems to be a little bit of a step there.
Andrey: I don't think the authors actually make the claim that this is a map of where the impact is going to be. I think they mostly just allude to the fact that this is a really useful system for real-time tracking of what the models are being used for. I don't think the authors would likely claim that this is a sign of what's to come necessarily. But it's still an interesting question.
Seth: I hear that, but right on the face, they call it the Anthropic Economic Index. If they wanted to call it the "Anthropic What Are People Using Anthropic For Right Now Snapshot" or the "Anthropic Usage Index," I'm a lot more sympathetic. I think they have to do a lot less work defending that idea than the "Anthropic Economic Index."
Andrey: Well, this is maybe where the academic and corporate lingo collide. But I hear you in the sense that it's not clear that what is being done in these chats is necessarily economic activity versus personal activity, learning activity, and so on. A more humble naming of the index could have appeased some of the criticisms.
Seth: You've gotta be on the defensive when you come on the Justified Posteriors podcast, because we challenge you to justify your posterior, so you better be ready to defend yourself. So, for the narrow question, I gave you my prior - it's gonna be overwhelmingly used for coding and people doing homework assignments. And homework assignments will look like mostly creative writing and regular writing and history writing and all the different things people do homework assignments for. So we'll see what the data actually says.
For the broad question, I would say this is a great view of what people are using Claude for right now, but to try to translate that into economic value, or what people are going to use Claude for in the future, we need giant grains of salt here. I think it's better than random guessing, but there's a huge gap between the things people will use AI to play around with as a tool, or for fun, or to explore, versus where are people getting consistent economic value from it.
Andrey: I would say the same. I view this as a proof of concept, something that has very natural extensions that can make it much more useful. To be clear, I think it was probably a large effort just getting everything in shape for this sort of analysis, and I doubt that this is the end-all be-all of the work the team is doing there. But I agree that we need a lot more work to convince us that this is giving us a general shape of what LLMs are going to be used for.
In particular, one limitation is that a lot of work moves to the API. So a lot of the activity that is done for work is not actually captured by this index because business users use the API. There's also a business plan where the usage from the business plan is not included in the index. I can imagine why these were not included, but it does limit our ability to understand economic impact.
Seth: Right. Having laid out our priors, Andrey, do you feel like you've laid yours out in sufficient detail to confront the new evidence that Claude is putting before us?
Andrey: Yes. So let's get to what the paper does. At a very high level, what they do is come up with a method for categorizing conversations as being mapped to tasks. Then they map those tasks to a database that's been used all over economic research of how tasks correspond to jobs. By doing that crosswalk, they're able to say something about what jobs have many tasks that are already being done by the chatbot versus what jobs do not. And then in addition to that, they think about when people are having these conversations, are they automating a task or are they more like collaborating with the AI to do a task? So that's the high-level thing that they do in this paper, and then it's kind of a measurement exercise.
Seth: They actually give some really useful examples of conversations that are matched to tasks and then occupations. For example, they consider the user conversation where the user posts, "My game keeps crashing as I only have eight gigabytes of RAM." That is then classified by their automatic categorization as the O*NET task "modify software to improve performance and adapt to new hardware," which is then mapped to a specific computer and mathematical occupation.
Similarly, they give the example of "Can you make sure this blog post follows Chicago style?" That's associated with the task "standardize materials from other writers and staff," which is considered associated with an arts and media job.
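The two-step mapping described above (conversation to task, then task to occupational group) can be sketched as a pair of lookups. This is only an illustration built from the two examples quoted in the episode: in the paper the first step is done by an automatic classifier, not a hard-coded table, and the names `CONVERSATION_TO_TASK`, `TASK_TO_GROUP`, and `occupation_shares` are hypothetical.

```python
from collections import Counter

# Step 1: conversation -> task label, using the two examples quoted in the episode.
# In the paper this labelling is automatic; here it is a hard-coded lookup
# purely for illustration.
CONVERSATION_TO_TASK = {
    "My game keeps crashing as I only have eight gigabytes of RAM.":
        "modify software to improve performance and adapt to new hardware",
    "Can you make sure this blog post follows Chicago style?":
        "standardize materials from other writers and staff",
}

# Step 2: task -> occupational group (group names as described in the episode).
TASK_TO_GROUP = {
    "modify software to improve performance and adapt to new hardware":
        "computer and mathematical",
    "standardize materials from other writers and staff":
        "arts and media",
}


def occupation_shares(conversations):
    """Return the share of conversations falling in each occupational group."""
    groups = Counter(
        TASK_TO_GROUP[CONVERSATION_TO_TASK[text]] for text in conversations
    )
    total = sum(groups.values())
    return {group: count / total for group, count in groups.items()}


sample = list(CONVERSATION_TO_TASK)  # one conversation per quoted example
shares = occupation_shares(sample)
print(shares)
```

Note that nothing in this pipeline records who is asking or why, which is exactly the gap the hosts go on to discuss: the gamer's question lands in a professional software task regardless of context.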
The first thing I want to point out is that all of these conversations sound like hobby activities rather than actually creating economic output. So on its face, it's not clear that they're actually saying things about people doing their jobs. Secondly, the guy whose video game keeps crashing because he only has 8 gigabytes of RAM is clearly not a computer programmer. He's clearly a guy who's just playing a video game. It seems like a misclassification. I just want to say that the examples they give of this classification task do not inspire confidence that they are measuring people's work activities.
Andrey: They do have some better examples when they're thinking about automated behaviors and augmented behaviors, like "format this technical document in Markdown" or "here's my Python script for data analysis, it's giving an index error, can you help fix it?" That seems like more work-related stuff - although the Python error thing could easily be one of my students asking for help with a homework assignment. But those are plausibly more work-related.
Seth: What I make of this is that the title of this paper should just be "Which Tasks Are Performed with AI," not "Which Economic Tasks." It's not clear what makes a task economic. In my opinion, a task is economic if it's either some sort of Robinson Crusoe economy where even if I'm not interacting with anyone, this is an economic behavior because I'm building a thing that I'm going to use, or what makes something economic is that I'm participating in a market with this thing and I'm going to buy it and sell it after I go through these steps.
"My video game is crashing cause I only have eight gigabytes of RAM" doesn't sound like either of those. It sounds like this guy is troubleshooting his consumption, which maybe could be thought of as the consumer taking on some of the job of customer service. The other example, "Can you make sure this blog post follows Chicago style?" - if I'm making an artistic or creative project that I'm just putting out on the internet for people, again, I'm not sure I would call that economic activity. So no problems with this paper being about measuring what activities or tasks people do with AI, but I think it's probably a bridge too far to call these economic tasks.
Andrey: I think I agree with you. There needs to be more metadata around these conversations. A survey of whether users are using this for their job or not could be really informative, or even just a subset analysis of just the pro users who are more likely to be using this for their job.
I do think it's an interesting phenomenon of substituting professional labor with personal labor. Hal Varian used to bring up this example all the time with YouTube - before, you'd hire someone to repair your appliance or do work around the house, but now you can watch a YouTube video and do it yourself. This means YouTube is generating tremendous economic value that's not being measured. I think both of us are generally on board with that idea - GDP is going to miss a bunch of interesting activity just by virtue of how it's measured. But especially for an academic contribution, we want a more rigorous analysis.
Seth: Or just be clear what your domain of analysis is. If you're going to take the stance that anything anybody does is economic, then just call it "tasks." You don't have to call it "economic tasks" if every task is an economic task.
But in this paper's favor, they do look at four million conversations on Claude, the world's second leading LLM. So even if what they're measuring is not exactly economic usage, this is a very important cross-section of usage.
Andrey: And it's important for a lot of stakeholders - policymakers who are thinking about what LLMs are being used for, businesses thinking about consumer needs they can service with these models, and obviously Anthropic itself to understand its user base. The new model they released today is very focused on computer programming tasks in a way that other competitor models are not. That must be informed by the fact that their users really value this use case, and they're going to meet their customers' needs rather than just trying to push a model that's very smart generically but isn't catered to the use cases of the user base.
Seth: You said three really interesting things there. The first is that to the extent these models are not perfect substitutes for each other, we would expect them to develop specializations. One important limitation of this study is maybe Claude just turned out to be the coding-specialized LLM or the writing-specialized LLM, and that's what we're picking up. I don't think we're that deep into the tech tree at this point where the models are that different for that to be a giant consideration, but you can imagine that being a bigger consideration as we get four or five more years down the line.
The second thing you pointed out is this question of to what extent model builders are able to direct what tasks they get better at. Something I really want us to talk about in a future episode is to what extent development is directable in the sense of "I'm going to make an AI that's really good at coding" and "you're going to make an AI that's really good at writing." To what extent are those separate tasks versus just making a better AI, with maybe a little bit of an intangible asset in making a shell that's useful for coders, but that's basically trivial?
Andrey: That is a really big question. I tend to come from the world of thinking about personalized rankers - in my dissertation, I thought about personalization.
Seth: If I recall, your dissertation was about ranking people from best to worst, right?
Andrey: I would never rank people, Seth, come on! Only by objective metrics.
Seth: Thank you. It was science.
Andrey: More seriously, a lesson from digital technologies has been that personalized rankers, personalized recommendations, experiences really increase the utility of users. They make users use the product more and create more value for the users, also through personalized advertising. I think it would be a little weird to then have this generic model that's not in any way catered to the users.
So far, we haven't seen a lot of catering to users. We've seen big models and maybe system prompts, but not a lot of talk about "What if you tweak the final layer to give a certain type of answer that this certain type of person wants?" That's been left to specific application developers - so Harvey might be developing the lawyer version of ChatGPT, and they're going to do some fine-tuning on their end to cater it. But to the extent that there's an interface that people are generically using, you would expect the designers of the model for that interface to think really hard about what their users want.
Seth: Right. So there are two questions there: is it directable, and to the extent that there is a non-directable component, what's the ratio of investment in the non-directable component to the custom occupation wrapper or the custom task wrapper that adds a little bit more, but maybe not fundamentally? Anyway, great question for a future episode.
So, they had four million conversations. They basically got the AI to label all of the example conversations and assigned them to tasks that are then assigned to occupations. Similarly, they classified each of these 4 million conversations by whether they're more "automate" versus "augmenty." I'll have more to say about that in limitations.
One thing I want to say here before we get into the findings is that the amount of human validation of these automatic ratings seems a little bit limited. In their appendix, they report 86 percent agreement between human coders and the AI labeler on 150 conversations. Not terrible, not great. How do you feel about the automatic labeling here? They have 4 million observations, and they only checked 150? It seems a little low.
Andrey: My prior is that this can do a pretty good job. If I was a referee, I might push them a bit more on this - it's not that expensive to check the conversations. I guess what they would tell you is that they actually really care about privacy-preserving methods, so maybe they didn't feel comfortable having external raters check the data. One interesting emphasis of this paper is how they're really worried about privacy concerns, which makes sense because people talk to these chatbots about very personal issues related to their health.
Seth: Things they wouldn't talk about at work.
Andrey: There are even studies that suggest you tell chatbots things that you wouldn't tell your therapist. So I think this emphasis on privacy seems very prudent for a chatbot provider, but maybe it limits what they can do.
Seth: It's also non-interventional, which limits them a lot too. It's just purely descriptive, but we like descriptive stuff, don't we Andrey?
Andrey: Yes. This is what our profession under-provides.
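To put a number on how much the 150-conversation check mentioned above can tell us, here is a minimal sketch of a Wilson score interval for the 86 percent agreement rate. It assumes the 150 checked conversations behave like an independent random sample, which the episode does not confirm; the function name `wilson_interval` is ours.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    return center - half_width, center + half_width


# 86 percent agreement on 150 checked conversations corresponds to 129 agreements.
low, high = wilson_interval(successes=129, n=150)
print(f"agreement rate plausibly between {low:.3f} and {high:.3f}")
```

Under that assumption the interval runs from roughly 80 to 91 percent, so the headline 86 percent is not pinned down very tightly by a sample of this size.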
Seth: So maybe we can start running through the specific findings now. Their first main result is what occupational groups use Claude, proportional to their representation in the US economy. They find that the most common use of Claude is for computer and mathematical conversations - 37 percent of conversations, which in my brain is some combination of coding help and tech support. But when you think about it, only 3.4 percent of the U.S. workforce is involved in computer and mathematical occupations. So that's a giant over-representation of those tasks in their data.
Meanwhile, Office and Administrative Support, which is 12 percent of American workers, they see as only constituting 8 percent of their conversational tasks - a slight under-representation of office work, which you would think would be at least somewhat susceptible to automation.
What do they not see any usage of AI for at all? Very little usage for farming, fishing, and forestry - not a surprise, very physical. Physical and social science - 6 percent usage, people are asking questions about that, maybe a slight overrepresentation compared to the US economy. Very low usage for legal services, which I'm a little surprised about. I've definitely asked Claude some legal questions. I don't know what jumps out at you from figure three, Andrey.
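The over- and under-representation Seth describes is just the ratio of a group's share of conversations to its share of the workforce. A quick sketch using only the two pairs of figures quoted above (the helper name `representation_ratio` is ours):

```python
# Shares quoted in the episode:
# (share of Claude conversations, share of the US workforce).
quoted_shares = {
    "computer and mathematical": (0.37, 0.034),
    "office and administrative support": (0.08, 0.12),
}


def representation_ratio(usage_share, workforce_share):
    """A ratio above 1 means over-represented in usage relative to employment."""
    return usage_share / workforce_share


ratios = {
    group: representation_ratio(usage, workforce)
    for group, (usage, workforce) in quoted_shares.items()
}
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.1f}x")
```

On these figures, computer and mathematical work shows up at about eleven times its workforce share, while office and administrative support shows up at about two thirds of its share.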
Andrey: The office and administrative support is fascinating because it's so low when obviously so much of the work can be automated.
Seth: That's weird to us.
Andrey: Yeah, just filling out forms, creating forms, various compliance tasks - I wouldn't be surprised if the current generation of models is already better than the vast majority of the humans doing that job, and certainly when they do it together, they should do a better job. So this really speaks to the issue of diffusion and barriers to adoption.
Imagine you're an office worker, not a senior manager or anything, and you have a bunch of tasks to do about expense reports and so on. You might be hesitant or actually just disallowed from using LLMs to do this type of work. My mom works in a hospital, and she tells me that there are a lot of restrictions about the use of LLMs within the hospital. That might be for legal reasons or even perceived legal reasons - maybe there aren't actually any laws being broken by using it within the context of a hospital, but the management might be conservative in a variety of ways.
So even though this would be very useful, it is not being done. Both of us have the strong prior that Office and Administrative Support work has to be automated by LLMs.
Seth: If it should help us with anything.
Andrey: The legal services thing is quite similar. This raises another question about this index - the number of times you use the LLM for something is not indicative of the value of the usage.
Seth: Are you telling me writing a thousand lines of code might not have produced as much value as someone who wrote two lines of code?
Andrey: Exactly. As the cost goes down, you might start using these things for very trivial things that aren't very high value. The other version of this is, "Hey, that one medical question I asked Claude might've saved my life," and the value of that is much greater than every other interaction I've had with Claude.
Seth: Wait, you can't drop that in the conversation without giving context.
Andrey: No, there's no actual context for that. I'm not saying it saved my life, but I have used it to help me interpret medical results, for example. Maybe that's not well-advised, but it's given me peace of mind and provided value that I think is probably greater than the value it might've provided for other things I use it more frequently for, like to write referee reports for papers. Just to be clear, I write my own reports, but I do like to check my reasoning with Claude.
Seth: Now we're going to start moving into some results from the paper that I find much less convincing. The authors argue that they can measure, between occupations, what percentage of tasks do people use AI for at least a little bit. For a dataset with four million conversations, what does "at all" mean? It means they need to find at least 15 observations of someone having a conversation on this topic to count it as a task that appears in the data. Why 15? Who knows? Maybe it has some esoteric properties they find desirable.
Why am I a little suspicious of this? We already heard they only double-checked 150 of these classifications, with an 86 percent correct classification rate. So 14 percent of the classifications are wrong, they've got 4 million of them, and they only want to see 15 instances for it to count as happening? I'm not 100 percent on board with this.
Andrey: I agree with you. It could be that a lot of these low-end things are really just misclassifications. You'd want to change that threshold - to vary it to 100 or 1000.
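A rough way to see the worry about the 15-conversation threshold: treat the 14 percent disagreement rate as a mislabeling rate and ask how many mislabeled conversations would land on each task by chance. The sketch below makes two loud assumptions that are not from the paper: errors are spread uniformly across tasks (real errors probably cluster among similar tasks), and `n_tasks`, the size of the task list, is a hypothetical parameter since the episode does not state it.

```python
def expected_false_hits_per_task(n_conversations, error_rate, n_tasks):
    """Expected mislabeled conversations landing on each task, if errors
    were spread uniformly across tasks (a simplifying assumption)."""
    return n_conversations * error_rate / n_tasks


N_CONVERSATIONS = 4_000_000  # from the episode
ERROR_RATE = 0.14            # 1 - 0.86 agreement, from the episode
THRESHOLD = 15               # minimum conversations for a task to count

# n_tasks is hypothetical; the episode does not give the size of the task list.
for n_tasks in (1_000, 10_000, 20_000, 50_000):
    noise = expected_false_hits_per_task(N_CONVERSATIONS, ERROR_RATE, n_tasks)
    verdict = "above" if noise > THRESHOLD else "at or below"
    print(f"{n_tasks:>6} tasks: ~{noise:.1f} expected false hits per task ({verdict} threshold)")
```

Under these assumptions, noise alone clears the threshold unless the task list is very long, which is why varying the threshold to 100 or 1000, as Andrey suggests, would be a useful robustness check.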
Seth: It's not necessarily just misclassification. This is supposed to be a paper about economic value creation. The fact that I tried a thing two times and it never worked, then I stopped using it - that could add to 15 use cases from people experimenting and realizing it doesn't work.
Andrey: This goes back to one of my big questions: Where's the indicator of success? Where is the success button at the end? I know they collect likes and not-likes, but there's a sense in which we don't know whether someone actually accomplished what they were seeking to accomplish with their interaction with the chatbot.
Seth: So I'm not sure how much we learn from this analysis beyond what we already heard. The next result we should cover is, instead of looking at occupations, they look at different skills that seem to be called for in these Claude conversations. The things at the top of the list are pretty intuitive for me - they list critical thinking, active listening, reading comprehension, writing, and programming as basically the five or six top usage skills that are called for when people use Claude. Those all make sense to me.
But the stuff on the bottom I find pretty surprising. They find that almost none of the records relate to repairing or operation and control. That's a little surprising - I know YouTube is probably a better source overall for repair advice, but it seems like a natural place to get help from chatbots. The next set that I'm very surprised to see so lowly ranked are things like management of financial resources, time management, management of personnel, monitoring, selection - these are all managerial jobs. Other than judgment and decision making, which ranks reasonably high up, most of these managerial tasks are really not called for in Claude.
I would ask people not to sleep on this because we have been seeing employment growth in managerial occupations. There's some sense in which managerial or entrepreneurship tasks have to be the scarce complement to AI. It is very striking to see the lack of managerial talent called for in these Claude queries.
Andrey: That's a great observation. It raises a lot of interesting hypotheses that would be nice to investigate. Before I get to those managerial tasks, I do think that the number one task, critical thinking, is, of course, a managerial task - it's cognitive labor, and hopefully managers are critically thinking.
Seth: And hopefully they're active listening. I mean, there's some overlap for every task.
Andrey: Looking at these things - let's start with repair. I think the right question might be, conditional on having to repair something, how often do you use an LLM? That could be 100 percent, and it would still be a tiny portion of all the usage because you just don't need to repair things that often. Negotiation is similar - when was the last time I negotiated something?
Seth: It's stressful, dude. Negotiating is stressful.
Andrey: It is stressful. So I think one of the things is just the base rates - that's really important to consider here. The other thing, and this is a point that Tyler Cowen makes a lot, is that the people who will learn to use the AIs will be most successful. Maybe the AIs are already very good at some of these tasks, like active learning, management of personnel resources, but people don't view them as AI tasks. And maybe that's because there isn't such a close feedback loop as there is in programming. As a result, they're just not going to the AIs for advice. That might be a growth opportunity or place where a lot of value can be generated, just dealing with diffusion friction.
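The base-rate point above can be made concrete with a toy calculation. All numbers here are hypothetical and chosen only for illustration: a user who consults the chatbot for every single repair can still generate a negligible share of repair conversations if repairs are rare.

```python
def usage_share(conditional_use_rate, task_frequency, other_conversations):
    """Share of all conversations devoted to one task.

    conditional_use_rate: fraction of task occurrences where the person asks the chatbot
    task_frequency: how many times the task comes up in the period
    other_conversations: chatbot conversations about everything else in the period
    """
    task_conversations = conditional_use_rate * task_frequency
    return task_conversations / (task_conversations + other_conversations)


# Hypothetical user: repairs something twice a year and always asks the chatbot,
# while having 500 other conversations over the same year.
repair_share = usage_share(
    conditional_use_rate=1.0, task_frequency=2, other_conversations=500
)
print(f"repair share of conversations: {repair_share:.2%}")
```

So a low share in the usage data is compatible with very high adoption conditional on the task arising, which is why the conditional rate is the more informative quantity.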
Seth: Right. If you could figure out a way to overcome people's frictions, or if you built a wrapper that made using it more intuitive for those tasks, maybe that's a big entrepreneurship avenue. If you get a unicorn startup based on that idea, please send your checks to Justified Posteriors.
Are there any other results you wanted to cover before we start talking about our posteriors?
Andrey: I guess the augmentative versus automative aspect.
Seth: Do you buy this at all? What do you think of this? Maybe you can tell us the five different kinds of tasks that they classify conversations into.
Andrey: They're classified as directive tasks (like "complete this task with minimal interaction"), feedback loop (like debugging a piece of code - you put in a bug, it gives you a potential solution, you try it, then you come back to it), task iteration (which seems a lot like a feedback loop to me, but it's a collaborative refinement process), learning (knowledge acquisition and understanding), and validation (I've already written this thing, can you check it and suggest any improvements).
They say that directive and feedback loops are automative, while task iteration, learning, and validation are augmentative. Then they show what percentage of conversations are of each type - about 15 percent are feedback loop automation, about 28 percent are directive automation. For the augmentative behaviors, there's a lot of task iteration and learning going on.
Seth: I love the idea of looking at the style of the conversation - is it a feedback loop, is it validation? That's super kosher, and I'd love to see these results. I'm not surprised, but it's interesting to see that the largest share is task iteration at 31%, while validation is pretty rare at 3%. So on its face, some of these results aren't so surprising.
The part that I object to deeply is calling one of these sets "automation" while calling the other set "augmentation." I've been studying robots taking our jobs for over a decade now, Andrey, and as far as I can tell, there is not a good definition for automation. When people talk about automation, what they usually mean is a technological change that reduces the attractiveness of jobs, that reduces demand for labor - or at least that's what I think it should mean. If you said, "Here's my automation technology, it's increasing demand for labor," it doesn't sound very automated to me. It sounds like you need more labor.
Andrey: Well, conditional on type, right? So you have a technology that reduces demand for a certain type of labor, but there might be complementary labor types for which demand increases. One might say that's automative of one occupation and not of the other.
Seth: Now let me explain the absurdities that this gets you into. My favorite example of how something that looks automated at the micro level actually is augmentative at the macro level comes from the U.S. experience of slavery. Back in the olden days, when America was growing cotton with slave labor, it was very time-intensive to take the seeds out of the cotton. Cotton was a crop people used for some kinds of clothing - it was in the mix.
Then a technology came around called Eli Whitney's cotton gin, which basically automated the incredibly labor-intensive process of taking the seeds out of the harvested cotton. So we're going from a super labor-intensive job, 100 percent labor, to a now 99 percent capital job. Does this reduce demand for slaves in the American South? No! It leads to an explosion in demand for slaves in the American South because now American cotton is able to outcompete European wool and European linen.
There's a micro sense in which the cotton gin automates the task of taking the seeds out of cotton, but there's a macro sense in which speeding up cotton production dramatically increases demand for people making cotton. If you were going to say anybody was automated, you'd say it was the sheep herders that got their wool replaced with cotton - they were the people who, if anybody, got automated. I find the way that people talk about automation very loosely here frustrating.
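The micro-versus-macro point in the cotton gin example can be put into a few lines of arithmetic. The sketch below is a stylized illustration, not something from the paper or the episode: it assumes labor is the only input (so price falls in proportion to labor per unit) and that demand has a constant price elasticity. The function name and parameters are hypothetical.

```python
def labor_demand_ratio(productivity_gain: float, demand_elasticity: float) -> float:
    """Ratio of labor demand after vs. before a productivity gain.

    Illustrative assumptions (not from the paper or the episode):
    - labor is the only input, so price falls in proportion to labor per unit;
    - demand has constant price elasticity: Q = A * p ** (-demand_elasticity).
    """
    if productivity_gain <= 0:
        raise ValueError("productivity_gain must be positive")
    # Price falls by the gain, so quantity rises by gain ** elasticity.
    output_ratio = productivity_gain ** demand_elasticity
    # Each unit of output now needs 1 / gain as much labor.
    labor_per_unit_ratio = 1.0 / productivity_gain
    return output_ratio * labor_per_unit_ratio


# Inelastic demand: a 10x productivity gain shrinks total labor demand.
inelastic = labor_demand_ratio(10, 0.5)
# Elastic demand: the same gain expands total labor demand.
elastic = labor_demand_ratio(10, 2.0)
print(inelastic < 1, elastic > 1)
```

Under these assumptions the same task-level "automation" lowers labor demand when the elasticity is below one and raises it when the elasticity is above one, which is the sense in which the label depends on the market rather than on the technology.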
Andrey: I'm with you, Seth. I do think there's a difference between occupation and task level. It makes a little more sense at an occupation level rather than a task level. The slaves in your example, or the bank tellers in the ATM example - their job consisted of a mix of tasks. Then some of those tasks became very cheap to do automatically, but the other tasks remained.
To steelman a version of automation: if every task that a person in a particular occupation does got automated, they might find work in other occupations, but it's not necessarily obvious that the same worker benefits from increases in demand in other parts of the economy caused by this technological change. You might think of undifferentiated labor - of course, undifferentiated labor is going to be able to do any type of labor where demand has increased that doesn't require an education or whatever. But I'm not sure that's representative.
Seth: So on its face, if you told me, "Hey, look, this job that you used to do, your productivity has gone up by 10X" - am I anticipating doing as many hours of that job as I did before? No, probably there's complementarities across different tasks. If you make me way more productive because you automated some subset of my tasks, I'll probably do less of the job, definitely do less of the automated tasks, but maybe less of the unautomated tasks as well. But that's a partial equilibrium analysis, and even if it's rare, it is certainly conceptually possible for the general equilibrium effects to work differently for my occupation or my remaining tasks.
My takeaway here is, people use AIs for a mix of things. Some of them look a little bit more like one-shot interactions, some look a little bit more like iterative interactions, some look like the human is bringing a little bit more of their own thinking. Maybe that's the way to think about it - 57 percent of these tasks, the user is bringing more of their own thinking and creativity. I wouldn't call that augmentation versus automation, but I do think there is a distinction here that's interesting.
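As a quick consistency check on the shares quoted in this part of the conversation, here is a minimal sketch. The figures are the rounded ones mentioned by the hosts; treating the 57 percent as the total for the group the paper labels augmentative is an assumption, and the learning share is backed out rather than quoted.

```python
# Shares of conversations as quoted in the discussion (percent, rounded).
shares = {
    "directive": 28,       # labeled automative
    "feedback_loop": 15,   # labeled automative
    "task_iteration": 31,  # labeled augmentative
    "validation": 3,       # labeled augmentative
}

# Assumption: the 57 percent figure is the total for the augmentative group.
augmentative_total = 57

automative_total = shares["directive"] + shares["feedback_loop"]

# Learning is not quoted directly; back it out from the augmentative total.
implied_learning = (
    augmentative_total - shares["task_iteration"] - shares["validation"]
)

print(automative_total)                       # 43
print(implied_learning)                       # 23
print(automative_total + augmentative_total)  # 100
```

On these rounded numbers the two groups sum to 100 percent, and learning comes out at roughly 23 percent of conversations.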
Andrey: I don't even know if I like what you just said, Seth. The example of the directive task is "format this technical documentation in markdown," but someone presumably wrote that technical documentation. That someone is probably the user.
Seth: Right, coming up with the prompt is the worker's work in the automated task.
Andrey: But I do think this is valuable descriptive work about how people are using the tools. To the extent that it's changing over time, that's telling us something. An important concept in these systems is "human in the loop" - at what point do you not need the human in the loop?
If there's a way to see that the chatbot one-shots a task with very high probability, that's interesting. But once again, what I'd want here is a success metric - did the interaction succeed in generating a result that was valuable, correct, etc., to the user? Without that, it's just really hard to interpret this.
Seth: So maybe this is a natural place for us to transition into limitations. We've listed a few. One limitation is that the amount of time spent talking about something is not exactly proportional to economic value. Lord knows I spent a lot of time talking about the New York Jets, and it's not helping the Jets succeed at all.
Another limitation that I pointed out is that it's not clear that everything people use AI for is a work task. That creates two problems. First, their only classification schema is work tasks, so if somebody's using AI for something that isn't work, the classifier is going to do something weird. Second, if you can't distinguish what's experimentation from what's in-production operations, it's hard to really connect this to economic value. What do you see as the biggest limitations?
Andrey: You've already said most of them. I guess in addition to what you've said, I'm fascinated by this model specialization thing - are people going to Claude for coding and going to other models for different tasks? I don't know.
Seth: Oh man, I'm sure Elon Musk said to his staff, "We need the AI that's best at meme posting."
Andrey: Yes, yes.
Seth: They list in terms of their limitations that model classification might be imperfect. I do think that's an issue - I know you don't worry about it so much.
Andrey: I do worry about it for the minor tasks, to be clear. I don't think they're getting programming wrong that much on average - it's not a difficult task to classify. Can I also now say what Claude said in its referee report?
Seth: This is perfect timing. What did Claude say about its own paper? Now be mean.
Andrey: I first asked it to write a generic economics referee report, and it gave concerns about external validity, task complexities, how it distinguishes professional from novice-level inquiries, dynamic considerations, the O*NET framework limits, and causal interpretation - readers might draw causal inferences about AI's impact on the labor market, and the authors should more explicitly describe the limitations of drawing such conclusions.
Then I said, "Be real - if this was a real economics referee report, there would be additional concerns." So major concerns: One, fundamental identification issues - the paper fundamentally fails to establish that it is measuring what it claims to measure. Two, absence of a theoretical framework - I don't really blame them for this one. You shouldn't put theory into a paper just because there is theory about the topic. Three, selection bias and external validity because of just having Claude users. We've already talked about this - I think it's a limitation, but it's still interesting even with this limitation.
Four, endogeneity concerns - that's an interesting way to put it.
Seth: What are they worried is endogenous?
Andrey: Claude is worried that Claude's capabilities in different domains may lead people to use Claude in different ways, that Anthropic's marketing and positioning of Claude may lead people to use Claude in different ways, that the user interface design favors certain interactions, and that temporal factors, including Claude's release timing relative to competitors, may also affect these patterns.
This is a nice point - how do usage patterns change when they've just released a new model? Are we seeing a fundamental change in the usage patterns or mostly more of the same? Is it a slow drift or a sharp discontinuity? There are so many questions to answer with this type of data, but not necessarily economic ones.
Seth: Well, the fact they call it an economic index suggests that we're going to get updates, so I'm excited for that.
Andrey: I think the time series of this type of usage is very interesting.
Seth: Is it fair to say that Claude did not hit upon what I see as the biggest limitation here, which is the assumption that this is all economic activity when a lot of it probably isn't?
Andrey: No, that's its number one point. It calls it "fundamental identification issues" - the mapping from "a person asked Claude about X" to "AI is being used to perform economic tasks" involves unsubstantiated leaps in logic that undermine the entire analysis. That's Claude. Calm down, buddy.
Seth: That's reviewer two, dude.
Andrey: Yeah.
Seth: I feel like if they just left "economic" out of the title, that would defeat that objection pretty heavily.
Andrey: There's a paper we haven't discussed on this podcast yet, which is the paper by friend of the pod, Daniel Rock, on task exposure. We'll probably devote a separate episode to this, but I do wonder, how do you compare this paper to that?
Seth: That's fascinating. That's a paper about what the AI thinks it can do, whereas this is a paper about what people are actually using AI for. If I recall, Dan's paper (GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models, 2023, in Science) does have an extension with some sort of validation - I forget if it was from a survey or from something used on Stack Overflow, but they did find a correlation between what we think people should use this for and what they actually use it for, and it was very positive.
So I like that mixed-methods approach of "here's what we think they should be able to use it for, here's what they are using it for." How else would I compare these two papers? They're kind of doing different things. One's a descriptive paper about usage now. One's sort of a possibilities paper about whether tasks are conceptually automatable by these kinds of systems. So I view them as complementary.
Andrey: I think I'm with you there. One thing one could think about is a measure of the gap between the potential capabilities from Rock et al. and realized usage from this paper.
Seth: And you measure that wedge and you give it a fun name. You call it the Fradkin Wedge. And it's like a measure of the size of the administrative and legal frictions in that domain.
Andrey: It would be interesting to have it.
Seth: That's where this is going. I think the next steps for this sort of very zoomed-out literature are: one, connecting what people actually use it for to these measures of what we think it should be good at; and two, the thing that you keep coming back to - an economic success measure. Did this succeed? Am I happy with it? Did it do the job? Because as we keep talking about, you can talk about a thing a lot without getting any work done.
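The wedge idea floated here can be sketched in code. Everything below is hypothetical: the function name, the occupation groups, and the numbers are placeholders for illustration and are not taken from either paper. It also assumes the exposure and usage measures have already been put on a comparable 0-to-1 scale, which is a nontrivial step in practice.

```python
def usage_wedge(
    exposure: dict[str, float], usage: dict[str, float]
) -> dict[str, float]:
    """Gap between potential exposure and realized usage, per occupation group.

    Both inputs map a group name to a share in [0, 1]. Only groups present
    in both inputs are compared. A larger value means more unused potential.
    """
    common = exposure.keys() & usage.keys()
    return {group: exposure[group] - usage[group] for group in sorted(common)}


# Hypothetical placeholder values, NOT estimates from either paper.
exposure = {"computer_math": 0.60, "management": 0.50, "office_admin": 0.70}
usage = {"computer_math": 0.55, "management": 0.10, "office_admin": 0.20}

wedge = usage_wedge(exposure, usage)
largest_gap = max(wedge, key=wedge.get)
print(largest_gap)
```

With these made-up inputs the largest gap lands on the office and administrative group, which is the kind of ranking one would want to read off a real version of this measure.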
Andrey: All right, so maybe let's move on to our posteriors.
Seth: Our posteriors. I would say that I came into reading this paper thinking that people use AI first for coding and second for cheating on their homework. Nothing I have seen in this paper contradicts that prior.
I guess the biggest update for me would be how striking the lack of usage for managerial tasks is. I would've thought things like manager time usage or scheduling tasks - that's the kind of thing I would have thought AI would be good at. And to see it not being used for that is interesting and suggestive. Did you have any big surprises in what people are using AI for?
Andrey: I think I had the same reaction as you. I don't think I had a very strong prior about how large the computer share of usage would be - I just knew it would be pretty large for all the reasons we talked about. And then I was surprised about office and administrative support - we can explain it post hoc, but it is surprising that the jobs we think are most mundane, the knowledge-type work that should be automated first - that's not where the usage is. That is really interesting.
Seth: I guess the last thing I'll say is maybe I thought there would be a little bit more in the artistic realm because we always talk about AI being really good in domains where having a lot of candidate options that you can sort through is good. That's kind of like the Avi Goldfarb "Prediction Machines" framework, and you'd think art would be perfect for that - generate 1000 images and choose the one good one. But art is merely at 10 percent of usage, which is a little bit lower than I would have guessed.
Andrey: For me, it's higher than I would have guessed. I don't view Anthropic as investing heavily into artistic modeling.
Seth: So now we get back to the selection issue - Claude might not be the one you go to for that.
Andrey: DALL-E is an OpenAI model. The other major image generation models are also not produced by Anthropic, the major video models are not produced by Anthropic. Anthropic must have a voice model, but I've heard more about Whisper and others that are not Anthropic properties. For music, we have specialized players like Suno AI that seem to be in the lead. So if you're an artist, you might use a chatbot to ideate at a very high level, but when it comes to making your art, you're going to use another tool.
Seth: Right. And to the extent that you're using a lot of AI for iterating on drawing or design, you're probably not using Claude. But that comes back to a limitation of the paper - it can't move our beliefs about the usage of AI overall that much if it's only showing us Claude usage.
Andrey: We need the API data. We need the API economic activity index.
Seth: Exactly. So what would be the perfect next dataset we need to really answer these questions?
Andrey: The dream dataset is a cross-platform usage dataset. People have been doing survey studies where they ask how people use LLMs, and those studies are good at what people report, but they're not measuring the use cases in a finely grained manner or the frequency. If we had a dataset of a representative sample of LLM usage in a population, that would be really great. It'd be really great to get business users and measures of willingness to pay for these things. But I don't think we're going to get those datasets - the reason we don't have them is they're really, really hard to collect.
Seth: Well, I guess you can measure the difficulty of the task by the product that you would have gotten from doing the task, or at least you can bound it.
Andrey: One interesting thing is that OpenAI released a new benchmark that uses actual jobs on Upwork and whether the AI could complete them. That's not going to give you a representative sample of anything, but if we're thinking about economic impacts, I do think that if you can go end-to-end on a task that someone is willing to pay money for - not a small amount of money - that is an economic task. Upwork is not a representative sample of tasks in the economy, obviously, but if someone is already paying for the job to be done and that gets end-to-end automated by an LLM system, that's fascinating.
Seth: I agree. We should definitely read that paper and more along those lines someday soon. But maybe until then, our audience will have to read economics papers on their own. Do you have any closing thoughts for our beautiful and well-informed guests?
Andrey: Make sure to review, like, comment, subscribe to Justified Posteriors. Let us know what type of content you enjoy seeing and we'll try to provide more of it. Or if there are any topics that you would like us to cover, we are happy to take suggestions.