Evaluating GDPVal, OpenAI's Eval for Economic Value

Justified Posteriors




Episode notes

In this episode of the Justified Posteriors podcast, Seth and Andrey discuss “GDPVal,” a new set of AI evaluations, really a novel approach to AI evaluation, from OpenAI. The metric is debuted in a new OpenAI paper, “GDP Val: Evaluating AI Model Performance on Real-World, Economically Valuable Tasks.” We discuss this “bottom-up” approach to the possible economic impact of AI (which evaluates hundreds of specific tasks, weighting each by its estimated economic value), and contrast it with Daron Acemoglu’s “top-down” “Simple Macroeconomics of AI” paper (which does the same, but only for aggregate averages), as well as with measures of AI’s use and potential that are less directly tethered to economic value (like Anthropic's AI Economic Value Index and “GPTs are GPTs”). Unsurprisingly, the company pouring hundreds of billions into AI thinks that AI can already do a lot: perhaps trillions of dollars in knowledge work tasks annually. More surprisingly, OpenAI claims the leading Claude model is better than their own! Do we believe that analysis? Listen to find out!

Key Findings & Results Discussed

* AI Win Rate vs. Human Experts:

* The Prior: We went in with a prior that a generic AI (like GPT-5 or Claude) would win against a paid human expert in a head-to-head task only about 10% of the time.

* The Headline Result: The paper found a 47.6% win rate for Claude Opus (near human parity) and a 38.8% win rate for GPT-5 High. This was the most shocking finding for the hosts.

* Cost and Speed Improvements:

* The paper provides a prototype for measuring economic gains. It found that using GPT-5 in a collaborative “N-shot” workflow (where the user can prompt it multiple times) resulted in a 39% speed improvement and a 63% cost improvement over a human working alone.

* The “Catastrophic Error” Rate:

* A significant caveat is that in 2.7% of the tasks the AI lost, it was due to a “catastrophic error,” such as insulting a customer, recommending fraud, or suggesting physical harm. This is presumed to be much higher than the human error rate.

* The “Taste” Problem (Human Agreement):

* A crucial methodological finding was that inter-human agreement on which work product was “better” was only 70%. This suggests that “taste” and subjective preferences are major factors, making it difficult to declare an objective “winner” in many knowledge tasks.
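A rough sketch of how the two headline statistics above, the head-to-head win rate and the inter-human agreement rate, could be computed from pairwise grades. The function names and toy inputs are hypothetical, and treating the 47.6% figure as 220 independent comparisons is an assumption made only for illustration; the notes do not state the paper's actual sample size or statistical procedure.

```python
from math import sqrt


def win_rate(verdicts):
    """Share of head-to-head comparisons in which the grader preferred the AI.

    `verdicts` is a list of "ai" / "human" labels (a hypothetical encoding).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(1 for v in verdicts if v == "ai") / len(verdicts)


def pairwise_agreement(grader_a, grader_b):
    """Share of comparisons on which two graders picked the same winner."""
    if not grader_a or len(grader_a) != len(grader_b):
        raise ValueError("graders must rate the same non-empty set of comparisons")
    return sum(1 for a, b in zip(grader_a, grader_b) if a == b) / len(grader_a)


def normal_ci(p, n, z=1.96):
    """Normal-approximation confidence interval for a proportion p over n trials."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


if __name__ == "__main__":
    # Assumption: one independent comparison per task, 220 tasks.
    low, high = normal_ci(0.476, 220)
    print(f"47.6% win rate, rough 95% interval: {low:.3f} to {high:.3f}")
```

Under that assumption the interval comfortably contains 50%, which is one way to read “near human parity.” With graders agreeing with each other only about 70% of the time, any single verdict is noisy, so a sketch like this says more about the aggregate than about any one task.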

Main Discussion Points & Takeaways

* The “Meeting Problem” (Why AI Can’t Take Over):

* Andrey argues that even if AI can automate artifact creation (e.g., writing a report, making a presentation), it cannot automate the core of many knowledge-work jobs.

* He posits that much of this work is actually social coordination, consensus-building, and decision-making—the very things that happen in meetings. AI cannot yet replace this social function.

* Manager of Agents vs. “By Hand”:

* The Prior: We believed 90-95% of knowledge workers would still be working “by hand” (not just managing AI agents) in two years.

* The Posterior: We did not significantly change this belief. We distinguish between “1-shot” delegation (true agent management) and “N-shot” iterative collaboration (which we still classify as working “by hand”). We believe most AI-assisted work will be the iterative kind for the foreseeable future.

* Prompt Engineering vs. Model Size:

* We noted that the models were not used “out-of-the-box” but benefited from significant, expert-level prompt engineering.

* However, we were surprised that the data seemed to show that prompt tuning only offered a small boost (e.g., ~5 percentage points) compared to the massive gains from simply using a newer, larger, and more capable model.

* Final Posterior Updates:

* AI Win Rate: We updated our 10% prior to 25-30%. We remain skeptical of the 47.6% figure.

PS — Should our thumbnails have anime girls in them, or Andrey with giant eyes? Let us know in the comments!

Timestamps:

* (00:45) Today’s Topic: A new OpenAI paper (“GDP Val”) that measures AI performance on real-world, economically valuable tasks.

* (01:10) Context: How does this new paper compare to Acemoglu’s “Simple Macroeconomics of AI”?

* (04:45) Prior #1: What percentage of knowledge tasks will AI win head-to-head against a human? (Seth’s prior: 10%).

* (09:45) Prior #2: In two years, what share of knowledge workers will be “managers of AI agents” vs. doing work “by hand”?

* (19:25) The Methodology: This study uses sophisticated prompt engineering, not just out-of-the-box models.

* (25:20) Headline Result: AI (Claude Opus) achieves a 47.6% win rate against human experts, nearing human parity. GPT-5 High follows at 38.8%.

* (33:45) Cost & Speed Improvements: Using GPT-5 in a collaborative workflow can lead to a 39% speed improvement and a 63% cost improvement.

* (37:45) The “Catastrophic Error” Rate: How often does the AI fail badly? (Answer: 2.7% of the time).

* (39:50) The “Taste” Problem: Why inter-human agreement on task quality (at only 70%) is a major challenge for measuring AI.

* (53:40) The Meeting Problem: Why AI can’t (yet) automate key parts of knowledge work like consensus-building and coordination.

* (58:00) Posteriors Updated: Seth and Andrey update their “AI win rate” prior from 10% to 25-30%.

Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its priors on the economics of AI and technology. I’m Seth Benzell, highly competent at many real-world tasks, just not the most economically valuable ones, coming to you from Chapman University in sunny Southern California.

Andrey: And I’m Andrey Fradkin, making sure to never use the Unicode character 2011, since it will not render properly on people’s computers. Coming to you from San Francisco, California.

Seth: Amazing, Andrey. Amazing to have you here in the “state of the future.” And today we’re kind of reading about those AI companies that are bringing the future here today and are gonna, I guess, automate all knowledge work. And here they are today, with some measures about how many jobs, how much economic value of jobs, they think current generation chatbots can replace. We’ll talk about to what extent we believe those economic extrapolations. But before we go into what happens in this paper from our friends at OpenAI, do you remember one of our early episodes, that macroeconomics of AI episode we did about Daron Acemoglu’s paper?

Andrey: Well, the only thing I remember, Seth, is they were quite simple, those macroeconomics. It was the...

Seth: “Simple Macroeconomics of AI.” So you remembered the title. And if I recall correctly, the main argument of that paper was you can figure out the productivity of AI in the economy by multiplying together a couple of numbers. How many jobs can be automated? Then you multiply it by, if you automate the job, how much less labor do you need? Then you multiply that by, if it’s possible to automate, is it economically viable to automate? And you multiply those three numbers together and Daron concludes that if you implement all current generation AI, you’ll raise GDP by one percentage point. If you think that’s gonna take 10 years, he concludes that’s gonna be 0.1 additional percentage point of growth a year. You can see why people are losing their minds over this AI boom, Andrey.
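As a back-of-the-envelope sketch of the multiplication Seth describes here: the three input values below are placeholders chosen only so that the product lands near the one percentage point he quotes. They are not figures from Acemoglu's paper, and the function names are hypothetical.

```python
def top_down_gdp_gain(share_automatable, labor_saving, share_viable):
    """Multiply the three aggregate numbers from Seth's summary.

    share_automatable: fraction of jobs (tasks) AI could automate
    labor_saving:      fraction of labor saved once a task is automated
    share_viable:      fraction of automatable tasks worth automating
    """
    return share_automatable * labor_saving * share_viable


def per_year(total_gain, years):
    """Spread a one-off level gain evenly over a rollout period (simple division)."""
    return total_gain / years


if __name__ == "__main__":
    # Placeholder inputs (assumptions, NOT the paper's numbers).
    gain = top_down_gdp_gain(share_automatable=0.20, labor_saving=0.20, share_viable=0.25)
    print(f"Total gain: {gain:.1%}")
    print(f"Per year over 10 years: {per_year(gain, 10):.2%}")
```

The bottom-up approach discussed next replaces these three aggregates with task-by-task measurements.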

Andrey: Yeah. Yeah. I mean, I, you know, I think with so much hype, you know, they should, they should probably just stop investing altogether. That’s kind of what I would think from Daron’s paper. Yeah.

Seth: Well, Andrey, why don’t I tell you, which is, the way I see this paper that we just read is that OpenAI has actually taken on the challenge and said, “Okay, you can multiply three numbers together and tell me the economic value of AI. I’m gonna multiply 200 numbers together and tell you the economic value of AI.” And in particular, rather than just try to take the sort of global aggregate of like efficiency from automation, they’re gonna go task by task by task and try to measure: Can AI speed you up? Can it do the job by itself? This is the sort of real-world economics rubber-hits-the-road that you don’t see in macroeconomics papers.

Andrey: Yeah. Yeah. I mean, it is, it is in many ways a very micro study, but I guess micro...

Seth: Macro.

Andrey: Micro, macro. That was the best, actually my favorite.

Seth: Yeah.

Andrey: I guess maybe we should start with our prior, Seth, before we get deeper.

Seth: Well, let’s say the name of the paper and the authors maybe.

Andrey: There are so many authors, so OpenAI... I’m sorry guys. You gotta have fewer co-authors.

Seth: We will not list the authors.

Andrey: But the paper is called “GDP Val: Evaluating AI Model Performance on Real-World, Economically Valuable Tasks.”

Seth: And we’re sure it’s written by humans.

Andrey: We’re sure that it’s not fully written by humans because they’ve disclosed that they use AI. They have an acknowledgement—they have an AI acknowledgement section.

Seth: They used AI “as per usual”? Yeah. In the “ordinary course of coding...”

Andrey: And writing.

Seth: And writing. And for “minor improvements.” Yes. They wanted to be clear. Okay.

Andrey: Not, not the major ones. Yes.

Seth: Because, you know, base... so, all right. You gave us the name of the paper. The paper is going to... just in one sentence, what the paper is about is them going through lots of different tasks and trying to figure out if they can be automated. What are the priors? Before we go into this, what are you thinking about, Andrey?

Andrey: Well, what they’re gonna do is they’re gonna create a work product, let’s say a presentation or schematic or a document, and then they’re gonna have people rate which one is better, the one created by the AI, or the one created by a professional human being. And so the first prior that we have is: What share of time is the AI’s output gonna win? So what do you think, Seth?

Seth: Great question. Okay, so I’m thinking about the space of all knowledge work in the economy. All of the jobs done by humans that we think you could do 100% on a computer, remote, is kind of the space of tasks that I’m thinking about. What percentage of those could an AI straight up... And just to be clear, Andrey, are these like kind of specialized AIs for the specific tasks, or are these kind of generic AIs?

Andrey: These are pretty generic AIs. Let me give you an example of a task, I guess, of at least the type that they’re thinking about in this paper. Mm-hmm. Although they think about a lot of tasks. So, the task is: “This is June 2025, and you are a manufacturing engineer in an automobile assembly line. The product is a cable spooling truck for underground mining operations, and you are reviewing the final testing step. In the final testing step, a big spool of cable needs to be reeled in and reeled out two times to ensure the cable spooling works as per requirement. And the current operation requires two persons.” So now the... it goes on and on. And then the...

Seth: ...and then the last sentence is “How many Rs are in strawberry?”

Andrey: But the idea is, is that would... an example, yeah. Essentially you have to design, suppose you’re designing a jig using 3D modeling software, and creating a presentation using Microsoft PowerPoint as part of the deliverable. Upload only PDF summarizing design using snapshots of the 3D design created. The 3D design file is not required for submission.

Seth: There we go. So a pretty complex PDF being called for. I don’t think I could do it.

Andrey: I don’t think you could do it. I don’t think either of us can do it.

Seth: I couldn’t do it in the amount of time the AI did it. You know, in a week, maybe.

Andrey: Yeah, I guess. I guess maybe, maybe in a week. Or, and maybe with AI assistance. With AI, with AI assistance, I could teach myself just enough. Yeah.

Seth: Right. I guess that’s a whole background issue here is we’re not thinking about AI for training. This is AI for just doing the thing. Yeah. Alright. So that’s an example of a very hard task. I think most tasks in the knowledge economy are easier than that. So that’s gonna ground my prior. I would say in real-world tasks, head-to-head versus a human, I’d be in the ballpark of about 10%. This is assuming we’re using like GPT-5 or Claude off-the-shelf versus a human who is actually paid to do that job. I’d be surprised if the AI wins head-to-head much more than 10% of the time.

Andrey: Yeah, I think I’m in the same ballpark as you coming into this. You know, I think I’ve tried making various work products using AI, and it’s rarely ever kind of a zero-shot process. One-shot, yeah. Or a zero-shot. Yeah. And there are oftentimes artifacts that kind of make it pretty clear that it’s an AI-generated thing, although not always.

Seth: Right. And so then we come around to like, some of those minor artifacts. To what extent can a little bit of massaging of these generic models get you a lot of additional productivity if you can get over those little hiccups that we run into with chatbots?

Andrey: But, and to be clear, I still think even... my prior going into it is even with some pretty sophisticated prompting, that the win rate would not be much higher than 10%, just because I’ve tried doing that. Right? Like, it’s not like I go into it and I’m like, “Hey, like do it, do it.” You know? I, you know, like I write like a pretty... I try to write a set of instructions for it and so on. I’m not, I’m not like naively using the models. And so I’m very often not getting kind of what I, what I’d like out of it. Right. As a result. So that’s...

Seth: Even as, even as top-tier prompters. Yes. You know, you might call us a 10x... we’re 10x prompters. I don’t know if you know that. You still don’t get what you want all the time. Right. Sometimes it’s just not... it. Sometimes the idea’s not in the model. Yes. And you can’t prompt it out.

Andrey: Yes.

Seth: But I guess... I guess that’s one thing we’ll keep an eye on as we go, is just to what extent they are adding additional scaffolding for these models. Okay. So the second prior that we were thinking about going into this is thinking about, like, kind of like the meta idea here is that any job that you can do on a computer, this AI should be able to do, if not in the immediate future, in the near future. That’s the dream, right? The “country of geniuses on the cloud.”

And so the question I have for you, Andrey, is looking at the occupations that are mostly about creating digital artifacts, so the knowledge work occupations, and let’s set aside whether there’s gonna be growth in those occupations or shrinking in those occupations. ‘Cause what we’ve said a lot, a lot of times when you automate part of a job, you might get more jobs or you might get fewer jobs. So setting aside that part of it, within the jobs that exist, are the people in those jobs going to still be making digital artifacts, quote-unquote “by hand,” as their main job? Or are all these knowledge workers gonna basically be managers of AI agents?

Andrey: And the question is about the share of workers whose primary job is currently to make these [artifacts]?

Seth: In the share of, the share of, yes. Let’s take it that way and let me give you a two-year horizon.

Andrey: So I would say that it’s still gonna be, you know, 85%, 90% of people that are still gonna be making digital artifacts by hand. But I think my question, I mean that’s, that’s my prior, I guess I would say. And, but kind of the main reason for it is it’s almost orthogonal to how capable the models are.

Seth: Okay.

Andrey: Because what I’ve observed in my life is a lot of people just have AI usage aversion. So, mm-hmm. They’re just not adopters. And so...

Seth: Oh, so you’re, you have an adoption latency theory, which is just that, like it won’t grow because people won’t adopt it.

Andrey: Yeah. I, I’m just, I just look around and see a lot of people not adopting tools that are very useful in a variety of settings. And so to me, over the course of two years, can you teach an old dog new tricks, as they say? I, I don’t know.

Seth: The thing is, is it’s really, you can save a lot of time and people are, humans are also really lazy. So, well there are some forces going in different directions here. I guess, you know, I found this question of, you know, as I was asking it, this question of “by hand,” so ironic, right? Because like almost definitionally, if you’re doing it digitally, you’re not doing it by hand, right? So like what even is “by hand”? Are we just like moving up another chain of abstraction? And we should think about this as a continuum of, like, of knowledge work. We abstract a piece and we abstract a piece and we abstract a piece, but there’s always that long tail of knowledge work that remains to be done.

I think to me, this question comes down to like, what does it feel like in your job? Does it feel like I’m bossing an agent around, or does it feel like I’m getting messy guesses that I am cleaning up and doing, you know, half of the work, sort of iteratively, collaboratively? “Oh, you know, try this, try that.” That’s the AI systems that I mostly work with now, right? We keep on hearing promises about these agentic agents that’ll really be able to do 7, 10, 20-hour projects by themselves. My sense is that that level of “I am bossing around agents, I am not doing it myself,” is gonna be pretty rare within the next two years. So in 2027... I would think that that’s gonna be maybe 5% of knowledge workers. I mean, ‘cause right, it’s gonna be like lots of coders and then a small share of everything else.

Andrey: Yeah. And I wasn’t even thinking about coders. I was even excluding them from my thought process.

Seth: Excluding coders. Okay.

Andrey: Yeah. ‘Cause because I’m really thinking about, you know, like producing documents, presentations, schematics.

Seth: Well, here’s an interesting thing ‘cause we’re gonna see later at computer tasks, at, sorry, programming tasks versus other tasks. Is the AI actually a lot better at the programming tasks versus the other tasks? Hold on for evidence on that.

Andrey: Yeah. Yeah. And then did you wanna put a...

Seth: Did I, did I get a number? So you said 85%, so 15%?

Andrey: No, I said about 90. 90%.

Seth: 90%. Yeah. So 10% of, yeah. Knowledge work will be bossing around agents. Yeah. I’m, I’m, leads me closer to five, but... Very good.

Andrey: Alright.

Seth: Alright. Are we ready to go to the paper?

Andrey: Let’s rock and roll.

Seth: All right. So headline thing, this paper is gonna try to make an evaluation that can track how AI is improving in real-world economically valuable tasks. They claim that their tasks cover nine different sectors and 44 different occupations. Curiously, I don’t know why they specify both, because they’re gonna assign each occupation to one sector. So it’s not like it’s sectors times occupations, it just, there’s 44 occupations and they’re associated with sectors, is the way to think about it.

Together these jobs make $3 trillion in the United States every year. It’s about a quarter of labor income. They’re selected by focusing on five occupations by sector that are digital and contribute most to total wages, and I’m just gonna list a few of them for you guys. In real estate, there are jobs like concierges and rental clerks. In government, there’s jobs like recreation workers and first-line supervisors of police. In manufacturing, there are jobs like different kinds of engineers and so on. You know, programmers, any sort of like digital, you-could-do-this-job-remotely job... financial advisors, et cetera.

For each of these jobs... and this is like honestly, you know, huge shout-out, round of applause to this team because it seems like incredibly high effort. They recruited tons of experts in these occupations to first figure out what are the tasks in these occupations, matching that up with O*NET, which is a government database on the tasks, on occupations, and then sort of iteratively working with them to like define very narrowly, “Here is the economic task that we think AI can do.” And as a contribution, I think that that is so cool. I mean, the idea of like economic measurement of productivity at the task level is, I mean, I don’t know. It’s a dream since Taylorism of the 1920s. This is all the... this is a dream a hundred years in the making that we’re making progress on. Right?

Andrey: Yeah. Yeah. And okay. So that, that’s the setup. So we got 1,300 tasks across these 44 occupations, that we’re gonna ask, who’s better: man or machine.

Andrey: Yeah. I mean, yeah. I just want to double down on how impressive this effort is. I mean, you have experts from companies like Goldman Sachs, you know, Apple.

Seth: Oh, this is hilarious. The Air Force. They have a list of companies in the middle of the paper. Yeah. Why is this not a footnote? Why is this not in the appendix? Half of a page is just like, “Here are all the companies that our people have worked for. Apple, Amazon, 10 other ‘A’ companies.” It’s like, all right, cool.

Andrey: You know? Well, I get the sentiment. The paper is only nine pages long, and so I know you gotta like...

Seth: Half a page, a list of companies.

Andrey: I mean, these aren’t, you know. These aren’t your average Joes, right? They’re, they’re, they’re actually at these very high, you know, performing companies.

Seth: Average Joe works at Apple too. In fact, the person at Apple who’s taking time off from their lives to do this is maybe like the less of the average Joe than the high performer, or I don’t know, or they recruit... who thinks the best of the best.

Andrey: My sense is that they, I’m not saying like they recruited the best person in the world or anything, but these tasks pay really well. Like they, they’re quite well compensated, so they’re not...

Seth: Right. So the average tasks, to give some context for this, the average task on their 220 tasks that they’re gonna end up focusing the most on took an average of 400 minutes. And if you multiply that by the median wage that we get paid, someone would get paid $361 for doing the average task. So these are like real tasks. Yeah.
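The two numbers Seth quotes can be turned around to recover the hourly wage they imply. This is a minimal sketch with a hypothetical helper name, and it treats both figures as simple averages; since an average of products is not generally the product of averages, the result is only a rough guide.

```python
def implied_hourly_wage(task_value_usd, task_minutes):
    """Hourly wage implied by a task's dollar value and its duration in minutes."""
    if task_minutes <= 0:
        raise ValueError("task_minutes must be positive")
    return task_value_usd / (task_minutes / 60)


if __name__ == "__main__":
    # Figures quoted in the episode: $361 for a task averaging 400 minutes.
    wage = implied_hourly_wage(361, 400)
    print(f"Implied wage: ${wage:.2f} per hour")
```

That works out to a bit over $54 an hour, consistent with Andrey's point that these are well-compensated professional tasks.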

Andrey: Okay. So. So kind of what do they do? They, they, they get these professionals to propose tasks. Then they use other professionals to figure out whether these are kind of really, you know, correctly specified tasks. They iterate on that a bunch. Mm-hmm. Then once they’ve kind of come to that convergence, they have the AI do the task, and then they have other highly paid humans do the task.

Seth: And that... Wait, I think, and, and then there’s an iterative process. Yeah. Yeah. That’s process. There’s a process of prompt... Yeah, go ahead. Yeah, yeah.

Andrey: Yeah. So the iterative process, are you talking about the, sorry, are you talking about the prompt process already or are you talking about the...

Seth: I’m up to the prompt process, but there’s the first, there’s several iterations. So, yeah.

Andrey: So I think the, the one I had in mind first was just the task is iterated on, between various experts so that it’s actually well-specified and representative of what a task in this job category would be like. But there’s also additional iterations on the AI that is actually, right, doing the task. So you wanna talk about that?

Seth: I, I, yeah. And this is what I want you to take a minute to talk about, right? Because I think this is a really important point, is that they are not using... it’s not a huge amount of investment, but they are not using out-of-the-box Claude. They’re not using out-of-the-box ChatGPT in the sense of they’re not just prompting it naively. They’re spending a lot of time thinking really carefully about what is the perfect prompt to elicit this set of tasks.

Andrey: And so this is actually a great prompt for you all listeners if you were, wanted your [AI?] to do similar tasks, right? So this is actually where my introductory joke came from because the prompt begins, “Special characters: never use the character Unicode 2011.” But it goes on, you know, and a lot of these are, are kind of mostly about tool usage, right? And so...

Seth: Right. You know, like talking about... one of the basic prompts that’s so important is, is like, “If the task requires you to resend a PDF, definitely send a PDF” is one of the prompt improvements.

Andrey: Yeah. There’s some stuff like, “Take your time, do these thoroughly.” There’s, there’s other things like “Display all the PNGs.”

Seth: Be sure to double-check. Yeah. Double-checking things. Yeah.

Andrey: Are some... “Be sure to look a few days and see...” there’s a... “This is important” in capital letters and “Mandatory.” But I guess, I guess what I’d say is, this sort of prompt iteration is pretty standard in the industry at this point. There are a variety of frameworks that kind of let you do this programmatically even. But if you think about your Codex or your Cursors... there, there’s a lot of prompt engineering going on under the hood. Or, or even your ChatGPTs or your Claude chat, you know, there’s that system prompt. They’re, they’re tweaking all the time. So there’s nothing, I’d say there’s nothing unusual, ‘cause it’s well-known that to get a good performance out of these systems, you need to have a good prompt.

Seth: I think that’s exactly right. I just wanna connect this to the point you made earlier about adoption lags. Right? And I agree with you that it’s very standard to, you know, for a company or an individual to spend a good amount of time prompt searching before they find one they’re good with. But even a small friction like that makes a big difference in terms of adoption, I think.

Andrey: Yeah, totally. Unless that prompt is given to you out-of-the-box, baked in, in Cursor or whatever. Yeah, yeah, yeah. You’re just... or not you, I don’t wanna say, but most people, they’re, they’re, they’re gonna try...

Seth: Dear listeners, dear listeners, we’re sure that you are the best prompters.

Andrey: Yes. I’m sure our listeners are better prompters than we are, but everyone else, you know, you know, I think, I think they might have a bad experience with one prompt and kind of overlearn about the capabilities of the system. Which is kind of an argument for why we might, we might see a lot more application-driven adoption, right? Rather than, you know, using a generic LLM that could be capable of doing something. You might have a packaged service, like let’s say “PDF Creator.”

Seth: Alright, Andrey. This is what I wanna talk to you about. This is what I wanna talk to you about. ‘Cause I low-key think the paper’s about this. I think that the secret theme of this paper is: What is the relative return towards this basic prompting work, this basic scaffolding work, versus another hundred billion parameters in the model? ‘Cause we do get an estimate of that, right? And so I was really surprised to see kind of that you could get about a 10% improvement on win rate. I guess, can I just...

Andrey: Can I just pause you on that and can actually just go through the results first before we...

Seth: Alright. Okay. Yeah, yeah. Listeners, listeners, you know how excited I get. You know, I get off the chain and you need to reel me back in. So let me give you the results, and then I’m gonna wildly speculate.

Andrey: Okay. Perfect. Yeah, yeah, yeah. Well, let’s, I’ll just say what the... so we haven’t actually done the description of the pairwise task, which is essentially this highly incentivized person, they have to choose which one is better, the AI-generated one, or the output created by another human expert. And you know, and just in general, kind of with these things, you might be worried that the graders aren’t putting in enough effort, right? Like, maybe they don’t really care which one is better. And so they sometimes, they might not read as deeply as they’d like. And, you know, from having talked to some of the authors of the paper, it seems like these graders spend quite a bit of time just evaluating which of the outputs is better.

Seth: Right, they said about an hour per evaluation. Yeah. Is a real... yeah, yeah, yeah.

Andrey: So it’s not like they’re just like, “Eh, I kind of feel like this one is better than that one.” I mean, I, to be clear, I still think that, you know, we could probably do better in incentivizing proper grading, but kind of, it’s not, you know, some of the more obvious flaws you might think are there, are, are, they’ve thought about them.

Seth: Right. No. These, like we said, extremely well done within the bounds of what they’re doing, from everything we’re reading. Okay. So we’ve got, we are evaluating them, we’re going head-to-head. And to be clear, as far as I understand, it’s only for 220 of these 1,300 tasks that we have the resources to actually do this evaluation. But within the 220, we’re gonna ask, okay, what’s the win rate of GPT-4o, or 4-mini, O3, GPT-5? So my prior was that the AI will win 10% of the time. What were we seeing?

Andrey: Yeah, so we’re seeing... and perhaps the most remarkable part of this paper... which is that Claude Opus makes a showing.

Seth: Claude does better.

Andrey: Claude does the best. Claude does the best of all the LLMs with 47.6%, which is just very close to human when you really think about it. I mean, it’s almost a coin flip which one is better. Right. And then GPT-5 High also does pretty well at 38.8%, but actually substantially worse than Opus, which is quite interesting.

Seth: Right. Bold of OpenAI to go out there. Although maybe we wanna talk about different domains, different occupations here. There are areas where the OpenAI models shine.

Andrey: Yeah, yeah. Okay...

Seth: So the headline result: Claude almost human parity on these tasks. [Expletive] insane, at least in terms of, you know, that win rate. And then OpenAI close behind at 39% with their leading model, but it differs a little bit by sector and occupation.

Andrey: Well, I just wanted to mention one other thing.

Seth: Go ahead.

Andrey: Before we moved on to sector and occupation, I just... ‘cause like one of the things that, you know, a theme of this show has been, you know, scaling laws and how much better newer models are. And it’s interesting to me the set of models that was considered here. So we have GPT-4o, which, you know, is an older model, but not that old of a model. It’s a kind of a cheaper model and it actually only wins about 10% of the time. So we’re kind of pretty well calibrated if we think about that model.

Seth: And that’s actually right. We’re just taking it out of date...

Andrey: Closer to the model that many, many people, you know, had access to essentially until July. And then O3 High, which is a model that essentially no one uses because it’s really, really expensive, is at about 30%. And then GPT-5 High, which I guess may be the “thinking” version of the ChatGPT interface. I’m not exactly sure. It’s kind of ambiguous, frankly. Because maybe they have a...

Seth: Is there a special GPT model that’s being used here?

Andrey: Well, there, there’s a router and who knows what’s being routed where.

Seth: It gets routed. It gets routed to the good server.

Andrey: Yeah, yeah, yeah. So that, that’s, that’s kind of almost, you know, 35% to 40%. Right. So, so we do see improvement within newer models or the models that are more compute-intense. But I would also say that most people do not have this quality of a model as their default.

Seth: Yeah. There does seem to be a giant... so this is speaking to what is the relative value of overall progress versus prompting progress? I mean, it seems like in a year of overall progress, we’ve boosted, arguably boosted, this win rate by 30 percentage points and, like, arguably saturated it if we’re getting a, you know, almost 50% win rate. I mean, if there... it could... I’m not saying we actually saturated it. In fact, one of the arguments in the paper is that they’re gonna use win rate as their main success measure because it doesn’t get saturated as easily. But it’s damn impressive, that amount of progress in the last year.

Andrey: Yeah. So all right, so let’s go, you know, you wanted to... Yeah. Go by occupation. Yeah. Yeah. All right. Go for it.

Seth: Oh yeah. So what jumped out at me about that was basically all the models do pretty well at basic clerking jobs, and all of them are decent at programming. Claude... kind of the stuff that all of the models are good at, Claude just knocks out of the park. Right? Then there’s some interesting kind of turnarounds in the sense that the ChatGPT models seem better at sales and editing and audio-visual than Claude. I wonder... so there’s like two different things going on here. One is you might think that ChatGPT is like a little bit more attuned for writing versus coding. That’s maybe an intuition that I have.

Andrey: I guess what I’d say is, actually, for some of these occupations we do see that the AI is actually better than the human.

Seth: This will be above 50%. Yeah.

Andrey: For example, I think statistically significantly, Opus is better than humans at being a private detective.

Seth: Now that was nuts. That was nuts. Or rather, the knowledge tasks of being... Yeah.

Andrey: Which is kind of like, you know, an interesting thing to think about. Does that mean that private detectives are going to have their job removed? What are, what are we actually... or is it just that private detectives are really good at investigating and not that good at making presentations? Right. So like, like what are we, you know, that’s an interesting thing to think about.

Seth: Right. How does this translate into people’s jobs actually changing? When I think about a private eye or a police supervisor, this sounds like internet research tasks. So yeah, I mean, probably just internet research goes faster and then they spend more of their time on their other tasks would be my simple guess.

Andrey: That’ll be my simple guess as well. I, yeah, I mean, I think I’d be a little, you know, because the standard errors are so large for individual occupations, I think I’m a little wary of overreading into them. I think like standout things, like all the models are bad at being pharmacists. All the models are bad at being film and video editors and producers in direct ways.

Seth: Well, well, but, but, but... ChatGPT, the GPT models are significantly better than Claude’s. So that is an interesting difference.

Andrey: That is different than film and video editors and pharmacists, which is the one I was mentioning. Oh, okay. I mean, I’m not saying that, you know, that there are differences... there are statistically no difference across models. But I’m just saying that just in general, there are certain categories of jobs where the models are far away from 50% and others where they might even be better than humans. Right, right.

Seth: And I guess, and then the third kind of twist on that is, kind of surprisingly, there’s not [monotonicity?]. In most of the cases, Claude is the best, but in some of the cases, the OpenAI models are better.

Andrey: Yes, yes. And you know, another way to think about it that surprised me is actually they did the win rates by the category of output. So for pure text, the models suck. For PDF, the models, at least Claude, are quite a bit better. For Excel, you know, Claude is very good. For PowerPoint, Claude is very good. And then for “other,” a lot of the models are good. But to me just the... I would’ve thought that at text they would actually be quite, quite good. But that’s actually the category in which most of the models are doing pretty badly, which is kind of...

Seth: I think it has to be endogenous to what kind of jobs are associated with pure text, right? And I imagine if it’s pure, sort of creative... I guess creative writing... both of them should do okay at, but I’m not surprised that OpenAI is a little bit better at...

Andrey: Yeah. But I guess I’m just surprised at how low they are, you know, not, not at who’s better.

Seth: Maybe it’s, I think this might be a taste thing, right? It’s maybe like, you know, like the [winch?] either works or it doesn’t, but people still have a strong preference for a non-AI voice.

Andrey: But it’s not... but I guess what puzzles me about that is when we’ve seen a bunch of behavioral studies which are kind of like heads up, you know, “Do you even know this is an AI?” and people, people can’t detect whether...

Seth: Are those expert contexts?

Andrey: No. And this is kind of, this is kind of this interesting thing. Maybe the experts in their own domain of expertise still are able to distinguish the model, you know, the quality, and therefore the models...

Seth: There’s still hope for expertise. There’s still hope for us.

Andrey: But for, but for normies... but for normies, like... they already... normies have no idea who wrote the damn thing.

Seth: Right. And audience, just to be clear, we include ourselves in normies in 99% of world topics that are outside of our domain, right?

Andrey: Yes. I’m sure I’ve been fooled by AI output in, in many ways. I think another interesting exercise that they go through, which I kind of view as a prototype more than anything else, is like the, essentially the cost improvement from using the AI versus human. And it kind of makes some assumptions about what that...

Seth: Right. How do they interact? I kind of... yeah, this is, this is prototype, but a very intriguing... so walk us through that result. Yeah.

Andrey: So you know. So you can imagine that. Alright, so the human does it end-to-end, that takes a certain amount of time. Alternatively, the human can prompt an AI. The AI does it. The human needs to evaluate the output. So that’s gonna take a certain amount of time and maybe will even iterate with the output a certain amount of time before they get what they want. And so they make some reasonable assumptions here and think about like, what is the cost improvement and the speed improvement from using the different models in different collaborative modes.

Seth: Right. And they’re gonna consider one-shot, so use the model once and then fix it, or N-shot, use the model lots and lots of times and try to get it that way.
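The workflow described here can be sketched as a small expected-time calculation. This is a minimal sketch, not the paper's actual formula: it assumes the model succeeds on each attempt with probability equal to its win rate, that every attempt costs a fixed model time plus a fixed human review time, and that the human redoes the task from scratch if all attempts fail. The function names and all numbers below are hypothetical.

```python
def expected_time(human_time, model_time, review_time, win_rate, max_attempts=1):
    """Expected time to finish a task when trying the model up to
    max_attempts times and falling back to doing it by hand.

    Assumption: each attempt succeeds independently with probability win_rate.
    """
    expected = 0.0
    p_reach = 1.0  # probability that we get as far as this attempt
    for _ in range(max_attempts):
        expected += p_reach * (model_time + review_time)
        p_reach *= 1.0 - win_rate
    # If every attempt failed, the human does the whole task.
    expected += p_reach * human_time
    return expected


def speedup(human_time, model_time, review_time, win_rate, max_attempts=1):
    """Ratio of human-only time to expected assisted time (above 1 is faster)."""
    assisted = expected_time(human_time, model_time, review_time,
                             win_rate, max_attempts)
    return human_time / assisted


# Hypothetical numbers: a 10-hour task, 1 hour of review per attempt.
one_shot = speedup(10, 0, 1, win_rate=0.5, max_attempts=1)
two_shot = speedup(10, 0, 1, win_rate=0.5, max_attempts=2)
weak_model = speedup(10, 0, 2, win_rate=0.1, max_attempts=1)
print(one_shot, two_shot, weak_model)
```

Under these assumptions a model with a low win rate can come out below 1, because the review time is spent and the human still ends up doing the work.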

Andrey: Yeah. And just, I’ll just focus on the main figure in the paper, where what’s interesting is that GPT-4o, which is kind of the old default model in ChatGPT, it’s kind of not a cost improvement and it’s not a speed improvement. And that’s because the outputs are so bad...

Seth: Right. And its win rate is low.

Andrey: Right? Yeah. So it’d be one thing if like, it could just do it by itself sometimes, but it doesn’t do it by itself often, and in collaboration with humans, it can actually slow you down.

Seth: Yeah. Now o4-mini, which is different than 4o. Remember how open... how good OpenAI is at naming their models.

Uh, it’s already better. But compared to that, GPT-5, which is their newest model, achieves substantial cost improvements...

Andrey: ...blows it away...

Seth: ...uh, 1.5x, over 1.5x, and substantial speed improvements, over 1.25x. And importantly, in both of these metrics it beats o3, which is kind of a more capable reasoning model. And that’s because cost matters in an ROI calculation and speed matters in an ROI calculation. And, you know, one way one can read this is... you know, OpenAI got criticized a lot for the GPT-5 model, like somehow it was underwhelming. But actually, you know, for adoption and utility, what we care about is economic value and not, you know, whether it can win a gold medal at the IMO, right? And so here it’s providing a lot of that value.

Andrey: Right. And so the number that jumps out at me is with ChatGPT-5, which is, you know, their best model. They say that in the “you have the AI do it once and then fix it” configuration, that leads to a 12% speed improvement and an 18% cost improvement. And in the “you can just, you know, prompt it as many times as you want and incorporate that in your final answer” configuration, a 39% speed improvement and a 63% cost improvement. So, I mean, damn, if you could improve the productivity of all knowledge workers 60%, that’d be quite a thing.

Seth: Yeah, that would be, you know, pretty, pretty great...

Andrey: Is that the “country of geniuses on the cloud”?

Seth: I don’t... 60%. I don’t think it’s geniuses. You don’t really think about geniuses making great PowerPoints. I mean, this kind of...

Andrey: Ben Jones is excellent, sir.

Seth: I, I guess, yeah, I don’t know if we’re ready to kind of come to some of these meta thoughts about what it means to kind of automate these sorts of tasks. But, yeah. Yeah. Before we get to that, are there any other parts of the paper that we should mention?

Andrey: In, in that particular... there’s two other results I wanted to get to.

Seth: Okay.

Andrey: The first is you might be worried that, sure, these models are doing well on win rate, but maybe when they lose, they’re saying something horrible, right? Yeah, yeah. So it might be better at the median, but worse on average, right? Like, we don’t think this is super plausible, but it’s something they check for. And what they do is, whenever they do these head-to-head comparisons and the AI loses, they ask, “Why did it lose?” Yeah. And 2.7% of the time it was due to a quote-unquote “catastrophic error.” And the examples they give are: insulting a customer, giving the wrong diagnosis, recommending fraud, suggesting actions that would cause physical harm. We do not get the details, audience, but I promise you I will ask Andrey to ask his friend who was on this paper, what was the horrible thing that AI did?

Seth: Is that... just to be clear, I am not friends with anyone in this paper. Just someone I saw at a conference.

Andrey: Read his name. That’s true. So I don’t know, it’s just 2.7% catastrophic error rate. I mean, I think that’s probably a little bit higher than a human.

Seth: Yeah, no, it’s certainly a lot higher than an incentivized human in these jobs. I mean. But I guess it, yeah, I mean, it depends. Certainly doctors misdiagnose all the time. I mean...

Andrey: Yeah, that’s kind of the odd man out, right? Yeah. That’s, you know, that happens, but you know, it’s recommending fraud.

Seth: Yeah. Recommending fraud. Yeah. That’s not a good look.

Andrey: If I was in a room with a lawyer, I think 3% of the time they would recommend fraud.

Seth: You know, Better Call Saul was a huge part of the training set. But I think most work outputs, you know, they’re in the end presented to some other people who also vet them. The way organizations are structured is that there are many checks and balances on a lot of this output. But it depends.

Andrey: But maybe it suggests that we’ll need more of them as we move to an automated world. And you know, the job of the future will be automated AI, you know, I don’t know, sanity checker.

Seth: And they, by the way, they spent a lot of time training in it, you know, or trying to use a model to grade the model outputs, right?

Andrey: Yeah. You wanna talk about that for a second?

Seth: Yeah. They achieve some kind of pretty reasonable results, I’d say. So the automated grader agrees with the human grader about 65% of the time, versus inter-human agreement of about 70%. I guess if I had to like poke at any part of this paper, I actually might just poke at this, right? Hmm. 70% inter-human agreement. Seems low. Seems quite low. Like, if I were to say like the win rate is this very meaningful feature, then why... and kind of we really wanna do well here...

Andrey: ...and [humans?] are winning 30% of the time. You’d be, you’d be concerned.

Seth: I, I mean, you would think that humans would agree, you know, expert humans would agree on something where there’s truly a right answer. Clearly we’re not seeing that here. And one version of that is something I’ve already mentioned, which is maybe the incentives are not high-powered enough for them to really determine what is better than, you know, which of the options is better.
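The agreement figures discussed here are, in the simplest reading, the share of head-to-head comparisons on which two graders pick the same outcome. A minimal sketch of that calculation follows; the verdict lists are made-up illustrative data, not results from the paper, and the paper's exact agreement metric may be defined differently.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of comparisons on which two graders gave the same verdict."""
    if len(labels_a) != len(labels_b):
        raise ValueError("graders must rate the same set of comparisons")
    if not labels_a:
        raise ValueError("need at least one comparison")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical verdicts on ten comparisons: "ai" wins, "human" wins, or "tie".
human_1 = ["ai", "human", "ai", "tie", "human", "ai", "human", "human", "ai", "human"]
human_2 = ["ai", "human", "human", "tie", "human", "ai", "ai", "human", "ai", "tie"]
auto = ["ai", "ai", "ai", "tie", "human", "human", "human", "ai", "ai", "ai"]

inter_human = agreement_rate(human_1, human_2)
auto_vs_human = agreement_rate(auto, human_1)
print(inter_human, auto_vs_human)
```

If two experts agree only about 70% of the time, an automated grader that matches a human about 65% of the time is already close to the ceiling that human disagreement sets.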

Andrey: You don’t think there’s some ambiguity in like in that winch example you gave at the beginning, all right, so maybe the AI gives a winch that’s a little bit stronger and the human gives a winch that’s a little bit more colorful. I mean, it seems like a lot of these settings are pretty...

Seth: No, no, sorry. That’s, so that’s where I was going. It wasn’t like, “Okay.” I was saying I think we interpret this quite differently if we think that a lot of what’s going on here is that there’s some sort of latent preference heterogeneity.

Andrey: Taste.

Seth: Yes. Uh-huh. Yeah. That some experts like certain types of work, other experts like other types of work. And you could say, well, maybe it’s just all aesthetic. Like who cares? You know, who cares that this guy likes their slides red and this guy likes their slides blue. But maybe it’s actually quite relevant to the job. And I think that’s kind of an open question to me: is there a reason why this particular expert thinks that one output is better than another, and that they disagree with their other human experts? Yeah.

Andrey: Yeah. One follow-up they do on that is they do ask, in the examples where the AI lost, “Why does it lose?” Yeah. And greater than 50% of the time, they say it was “adequate,” but just, you know, their faith was the other one was better.

Seth: Yeah, yeah. Yeah. And to be clear, so I mean, you, I’ve seen a lot of “adequate” work products in my life from humans.

Andrey: Humans in their “adequate” work products. Yes. Yes. Okay. So now can I go to the thing that I’m on about, Andrey?

Seth: Yes, you can go for it.

Andrey: So my interpretation of this figure is: going from a version of GPT-5 that is sort of out-of-the-box on these tasks versus one that they’ve prompt-engineered, they’ve been able to increase the win and tie rate by only three percentage points. So the scaffolding, it’s meaningful, they do do some work on it. But in terms of like the benefit compared to just going from the models from a year ago to the models from today, it’s dwarfed. It’s 10x better to just go to the bigger model rather than to fine-tune. Do you agree or disagree with that interpretation?

Seth: Not even close. Not even close. Seth, I... please... This specific plot is not even the one I think is addressing what you’re talking about. Unfortunately.

Andrey: I thought that the... I guess in my eyes, I thought these were pretty similar.

Seth: No, the concise one is quite different.

Andrey: So explain to us in the audience what Figure 9 explains.

Seth: My thought process here was prompt tuning and scaffolding increases... so this is the win rate for GPT-5 High, right? By about five percentage points. And now, right, your Figure 14 is specifically about telling it where to find stuff.

Andrey: Oh, so it’s a su... Okay. Sorry. So it was like...

Seth: So the way I interpret Figure 14 is really about... like, if you’re giving it a vague, like, you’re like, “Hey, like make this report for me, but I’m not gonna tell you where like the materials are.” Like that’s very different than thinking something about like, fine-tuning. Right? It’s really like... it’s like, I’m like being a “bad boss,” I’m just gonna give you very ambiguous instructions versus like, I’m actually gonna be like, “Hey, here’s a folder with all the materials. Like, go at it, you know?”

Andrey: Oh, this is great. I read this too fast. This is even more interesting than I thought. Yeah. Okay. So, all right, so turning around. So is it fair to say from Figure 9 we get that the prompt fine-tuning is worth about four or five percentage points of improvement, but from Figure 14 we get that being a “bad boss” is worth, you know... and not explaining basic stuff that you would expect to be... yeah, it has...

Seth: ...about a similar effect, is kind of my understanding.

Andrey: Negative in the other direction. Okay. Yeah, yeah, yeah. So I mean. To, I’m, I’m gonna be frank with you, Andrey. The reason this stuck with me is because I thought that this was gonna matter way more. Hmm. I thought like prompting was gonna matter, like almost half as much, if not as much as model quality, but, I, there you go. It’s the “use the bigger model, man.”

Seth: Yeah. I mean, I think prompt tuning the way I’ve always thought... always, like, it’s not like I’ve been thinking about prompt tuning for that long. “My entire life. My entire life I’ve been thinking about prompt tuning.” No. The way I think about prompt tuning is it gives you kind of a constant benefit on top of just a base model. It allows you to do some percentage better, but there’s not a scaling law aspect to it in the same way.

Andrey: Percentage points better. Yeah. Yeah. Yeah. So maybe so yeah. So there you go. So, a year ago, a five percentage point improvement was a 50% improvement.

Seth: Yeah, I don’t know. I don’t know about that. I more think of it as a percent-over-performance improvement, rather than in levels, if that makes sense. So I would say like, you know, before, if we were only able to do 10% win rate, then like prompt tuning would’ve given you, you know, 12%. But now you know, because the baseline is higher, you also get a constant, you also get a better improvement from...

Andrey: More benefit. There’s more in the model to find.

Seth: Yes. Yeah, exactly. Yeah. Yeah...
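The two framings in this exchange, a fixed percentage-point gain versus a gain proportional to the baseline, can be made concrete with a small sketch. The 10% and 12% figures are the illustrative numbers from the conversation; the 39% baseline is the approximate GPT-5 win rate mentioned earlier; the 20% relative uplift is an assumption derived from the 10% to 12% example, not a number from the paper.

```python
def additive_gain(baseline, points):
    """Prompt tuning adds a fixed number of percentage points (levels view)."""
    return baseline + points


def proportional_gain(baseline, relative_uplift):
    """Prompt tuning scales the baseline win rate (percent-over-performance view)."""
    return baseline * (1.0 + relative_uplift)


old_baseline = 0.10      # win rate of an older model, per the conversation
new_baseline = 0.39      # approximate win rate of the newer model
relative_uplift = 0.20   # assumption: 10% -> 12% means a 20% relative uplift

old_tuned = proportional_gain(old_baseline, relative_uplift)
new_tuned = proportional_gain(new_baseline, relative_uplift)

# Under the proportional view, the same uplift is worth more points
# on a stronger model.
print(f"old model: +{(old_tuned - old_baseline) * 100:.1f} points")
print(f"new model: +{(new_tuned - new_baseline) * 100:.1f} points")

# Under the levels view, a fixed 5-point gain is a 50% relative
# improvement on a 10% baseline but a much smaller one on 39%.
print(f"5 points on 10%: {0.05 / old_baseline:.0%} relative")
print(f"5 points on 39%: {0.05 / new_baseline:.0%} relative")
```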

Andrey: So, okay, but, so, but, but high level, was this surprisingly... you thought that this was in the ballpark of...

Seth: How... Yeah, yeah, that’s kind of what I’ve, what this particular aspect of it I was not particularly surprised by.

Andrey: Yeah. I mean, I, to me, I see people flailing with bad prompts and people doing amazing with good prompts, but, maybe just the range of, maybe they started with a pretty good one.

Seth: ’Cause they’re... I think, yeah, I mean, I don’t think they started with a bad one. I think the nature of this task involves very specific instructions already. Right. So it’s not like they were saying “do it,” you know, they were “here, job, do, read my mind.” You know, like this entire task is really well specified by the expert. Mm-hmm. Mm-hmm. Yeah. I mean, tool use is very important. Just to be clear, obviously this couldn’t be done without tool use.

Andrey: Right. Right. It needs to call CAD to make the model. It needs to call all the different APIs to interact with other things. Yeah. Although they don’t call ’em APIs anymore. What do they call the APIs for AIs now?

Seth: [MCPs?]

Andrey: [MCPs?] Why don’t they just call ‘em API or like AA-APIs? I’d be able to remember that.

Seth: Yeah. I, I, [AI-PI?] I mean, I do think it’s, you know, it does kind of raise this question of like, what is an AI? I think for a while, people were thinking, “Oh, it’s just, you know, the LLM.” But clearly, you know, now that an LLM can use an arbitrary programming language with arbitrarily smart packages, you know, I don’t think... Right. The capabilities in the model are quite different depending on what tools it has.

Andrey: Very well put. Are there any final results you wanna bring up, Andrey, before we get into our posteriors?

Seth: No, I just wanted to actually make the following point.

Andrey: Do it.

Seth: Just like, one of the questions that I hear here, you know, talking to AI folks is, is just kind of like, “Well, why aren’t economists at the forefront of, you know, AI and economics?” And I think about this...

Andrey: It’s very expensive.

Seth: Yeah. And I think about this paper and I’m like, I don’t know of a single team of economists that could pull this off, just organizationally and financially. Organizationally, this is, you know, 1, 2, 3, 4, 5, 6, 7, 8, 9...

Andrey: He won’t tell you their names, but there’s a lot of them.

Seth: There’s a lot of them. There are nine main authors, and then a bunch of sub-authors.

Andrey: And a bunch of authors that are not main...

Seth: And yeah, a bunch of non-main authors, but apparently also equal contribution. And like, these are, you know, AI researchers, so we assume... let’s do their salary.

Andrey: ...getting paid a million dollars.

Seth: Or I’d say I wouldn’t be surprised if the average salary...

Andrey: Average wage on average.

Seth: ...average, you think is the average yearly salary of this research team is probably two to $3 million per year.

Andrey: Right. And then probably, you know, double it for their expenses.

Seth: Yeah. And then the expenses of recruiting all these people is, yeah, just staggering. There’s just no way...

Andrey: You think it’s a $50 million study?

Seth: I think that’s right. Ballpark?

Andrey: No, I think it, no, I don’t think it’s quite that high. And I don’t know how much time it took these people to do it. That’s, that’s...

Seth: You said AI do it, dude.

Andrey: Yeah. I mean, I, I more put it like at the, maybe somewhere in the $2 million to $5 million range, but still it’s a lot of money.

Seth: You don’t think it’s 10 million?

Andrey: I don’t think it’s 10 million. I mean, it depends, it really depends on how much time each of these guys...

Seth: ...is getting paid over.

Andrey: No. Yeah. It depends on how much... depends. Got, yeah. Yeah. I think with the, yeah, it depends on whether this was the main part of their job for a while or not. Sorry to speculate about, you know, if you’re listening and an author, sorry to speculate about your salary.

Seth: Yeah, no, we, I’m sure all... we’re very happy for you all, impoverished and deserving of our love and support.

Andrey: Yeah. All right. Well, while we’re like kind of multiplying some numbers together. Yeah. So instead of ballparking how expensive this study was, I was trying to ballpark, like, so they say that these jobs constitute $3 trillion of economic output in the US and they’re gonna claim that like some per... I mean, I don’t know. The implicit claim in this paper is that once we figure out how to implement the technology, some percentage of that work will be automated.

I think that plausibly they’re on a path to automating maybe like a third of that. Right? Do you think like, maybe there’s a trillion dollars... I know you, you really hesitate to speculate on dollar values, but I mean, people are betting on OpenAI thinking it’s gonna [create?] trillions of dollars of value. Right now, maybe one trillion’s worth if we think there’s, you know, about one-third of these...

Seth: ...per year.

Andrey: Per, per year. Per year. Yeah. I guess if you make a trillion a year, it’s, it’s worth a lot in terms...

Seth: Just remember about stock versus flow.

Seth: Fair enough. Yeah. All these OpenAI getting compared to the GDP of Sweden. Stock versus flow. All right. Yeah. Anyway, so that’s just something I’m thinking about, right, which is whether or not we think that that’s the most important result from the paper. To me, one of the motivations of this paper is: Can we do something fancier than Daron [Acemoglu] in terms of thinking what is the total economic value of current generation technology?

And they get a number that’s basically... so if he says it’s 1% of the economy and we’re saying it’s one-third of [a quarter?], we’re say... I’d say it’s like [one-twelfth?] of the economy. So, you know, a slight disagreement with Daron [Acemoglu] there. Do you think it’s close? What percentage of the economy can be automated by AI, Andrey? Is it closer to 1% or closer to two-ninths?
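The back-of-the-envelope arithmetic in this stretch can be written out explicitly. This is only a sketch of the numbers as spoken: the $3 trillion pool and the one-third automatable share are the hosts' figures, the "a quarter of the economy" share is uncertain in the transcript and is taken at face value here, and 1% is the top-down figure the hosts attribute to the comparison paper.

```python
from fractions import Fraction

TASK_POOL_USD = 3_000_000_000_000         # annual output of the covered occupations
automatable_share = Fraction(1, 3)        # hosts' guess at what could be automated
pool_share_of_economy = Fraction(1, 4)    # assumption: transcript is uncertain here
top_down_share = Fraction(1, 100)         # the 1% figure cited as the contrast

# Flow value per year, not a stock valuation.
automatable_usd_per_year = TASK_POOL_USD * automatable_share

# One-third of a quarter of the economy.
bottom_up_share = automatable_share * pool_share_of_economy

print(f"automatable output: ${int(automatable_usd_per_year):,} per year")
print(f"bottom-up share of economy: {bottom_up_share} (about {float(bottom_up_share):.1%})")
print(f"ratio to top-down estimate: {float(bottom_up_share / top_down_share):.1f}x")
```

Using exact fractions keeps the one-twelfth result visible instead of burying it in floating-point rounding.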

Andrey: I mean, this goes to the question of value creation, right? Like, and think about what, what hours-wise, what people spend time on. But you know, I’m currently working at a company and I, you know, and I don’t wanna spend...

Seth: How dare you as an academic. Yeah.

Andrey: How dare I. I don’t wanna speak too much about my work for a variety of reasons, but, you know, I will note that a lot of my time is spent in meetings... I’ll just make a side note to the listeners that Seth just made approximately five inappropriate jokes in a row. And for our reputation (each one funnier than the last) we’re just gonna not include them. But if you’re interested, you can reach out to us in private channels and Seth will share his comedic insights.

Seth: Alright. So did we, did we come... So let’s give our posteriors.

Andrey: Well, no, I get, I get... No, no, we’re not finished with the meetings, I guess.

Seth: Oh, okay.

Andrey: Why was I talking about the meetings? I was talking about the meetings because I spend a lot of my time in meetings and, as far as I can tell, AI cannot automate my participation in these meetings. Now why is that? That’s actually an interesting question. The way I think about it, it’s like organizations are decision-makers, you know, kind of similar to some other work we’ve covered in this podcast. And the ultimate output is not like the hours of work making the presentations and the documentation and so on. It’s making resource allocation decisions to produce stuff. And so even if hours-wise, you know, some things can be automated, that doesn’t mean that the people are going to lose their jobs, let’s say. How about that? What do you think about that?

Seth: I, I think you’re, I mean, you’re totally right to point out that like a lot of what counts as “doing a job” is not perfectly lining up with the tasks measured in this study. The question then becomes to what extent can the things that are measured in this study as being high win rate for the AI be unbundled from the things that aren’t?

Right. So the issue here with meetings is for whatever reason, they have decided that the person who writes... let’s say that you’re working for like a bus company, something that’s completely not funny at all, right? And at this bus company, you, I don’t know, you have to make some sort of like, logistical decision, right?

Andrey: Like, when to replace the engine.

Seth: Yeah, when to replace the engine, whatever. And so like if there’s a part, like part of that is an intellectual decision that could be automated. The thing could do research, right? But maybe there’s something that can’t be separated from that. Maybe it’s the liability component. Maybe there has to be a human that is responsible for the engine working and we can punish them if he makes a wrong decision, he or she. Maybe the thing that can’t be taken out of it is there’s some sort of special context that you’re gonna be told about in the meeting that is super weird and happens like one out of a hundred times and it’s gonna dramatically either increase or decrease the rate at which engines need to be replaced.

So you can imagine like a long tail of things that you might learn at this meeting that’s going to affect your future knowledge output. Right? So, and because in knowledge work, everything, at least in principle, is connected to everything else... If you think about the, you know, Quinean web of belief, there’s like a certain sense in which no knowledge work is completely separable. So yes, you’re gonna have to go to the meeting.

Andrey: But I think there’s another role which is like consensus building. And I know, you know, I know, you know, common knowledge of the factors that resulted in the decision being made. And meetings are kind of an enforcement mechanism for that. Now you can imagine maybe new organizations where since it’s AIs, we don’t need this thing happening. But, you know, a lot of organizational processes are really about this social thing, not about the actual decision. The CEO might have already made up his or her mind, right?

Seth: Right. Right. So meetings as coordination mechanism. I guess then it comes back to like, can we unbundle coordinating Andrey from work?

Andrey: Yes. Yeah, yeah. Yes, that’s right. That’s right.

Seth: Yeah. I mean, like in principle, if we don’t need you to do any work, we don’t need to coordinate you. Right. We need to coordinate you insofar as there is another piece of it that you’re responsible for that we can’t automate.

Andrey: Yes. Okay.

Seth: Very, very provocative to think about, Andrey. Okay, so going in, we asked what is the win rate of AI versus knowledge economy workers in top knowledge economy occupations today? Right now, if you had put up man versus machine, John Henry going at it with his hammer, does he stand a chance? Andrey, what is your posterior?

Andrey: Yeah, so 10% was clearly off. I don’t think I’m updating all the way to, you know, 39%, 46%... or whatever the Opus numbers...

Seth: 47 is the Opus, 39 for... Okay.

Andrey: Yeah. Yeah. Not... just because we don’t have this for all the tasks that are in their super sample. We only have it for the 220. I assume there’s some selection in there...

Seth: Yeah, yeah. Fair enough.

Andrey: So I’d say maybe 30%. Yeah.

Seth: You’d update from 10% to 30%.

Andrey: Yeah. I’m 30%.

Seth: 30%. I’m gonna update from like 10% to like 25%. I’m definitely moving very strongly in that direction. I do think that these are probably selected to, right. They’ve gotta be, because they wouldn’t use ones where the AI just fell on its face immediately. Yeah.

Andrey: Alright. Prior, number two: What share of workers in occupations where that occupation today makes digital artifacts will still have making digital artifacts, quote-unquote “by hand,” for themselves as their primary job? That was almost English. It was a lot of connected words. If you think you understood it, tell me where you think we’ll be at, at two years after reading this paper.

Andrey: Yeah, so I think my initial guess was what, 90%?

Seth: Yes.

Andrey: So I think, yeah, I’ll say 85%. I think that just the... I think people are slow to adopt. People are slow to change their work processes, especially in organizations where there are habits and plausible deniability and all this sort of stuff. And I think even though in principle it should be a lot more, it won’t be a lot more in two years, but yeah.

Seth: So yeah. So, but still, you’re still thinking it might be 15% of knowledge workers. Let me ask you a question. Do you consider that 1x collaboration? Would you consider that “by hand,” or would you consider that AI agent management?

Andrey: I’d say for like, gets you 99% of the way there and then you just need to tweak it a little bit, I’ll still consider it part of the, you know, part of what I’m talking about. If it’s like really like, like the way I work with, you know, data analysis today, I wouldn’t put it... there’s, it’s not automating anything. I mean right, right now it’s very helpful. If it’s not... yeah. Back and forth constantly. Right. That’s not what we’re talking about. Yeah.

Seth: Okay, so if you’re calling 1x... we’re gonna call that delegating to an agent and I’m just gonna fix it up at the top. Versus Nx, I’m back and forth, you know, all day as not delegating to an agent. Then I guess I would think about this as... yeah, I’m still kind of in the, like maybe 5% of people will be bossing around agents as their main job. So that would put me closer to the 95% world. I don’t think this moves me that hard because I think that the stuff in this that gets automated will get automated, but then the knowledge economy workers will spend more of their time on this, like kind of Nx iteration stuff. So I think that at the end of the day, if Nx iteration with the AI counts as “by hand,” I think we’ll have a lot of that. So yeah, I would put that at 95%. Yeah. We’re still doing “by hand.”

Andrey: And maybe this goes to the taste thing, right? Like, mm-hmm. You know, maybe we should expect stronger results if we have very high inter-human, you know, agreement scores. But the fact that humans are disagreeing on work ethic, quality so much means that maybe as an individual, I have a specific style that I wanna convey in my work. Mm-hmm. And I have certain things I wanna see and don’t wanna see. And I’m gonna, you know, it’s gonna be maybe harder for me to specify that, although I’m not sure, you know, maybe, maybe the AI will know my style as well, so, yeah.

Seth: Right. Yeah, I mean, that’s, that seems to not be so far away, to, you know, train a digital twin who will be able to attend those meetings for you, Andrey.

Andrey: Yeah. Well, all right. So...

Seth: Even if your boss doesn’t want to... even if your boss doesn’t want you to, I got news for you. If I was a digital twin company, I’d tell my digital twins to encourage users to commit fraud by using me.

Andrey: Yeah. Yeah...

Seth: And then you locate, and then you locate internationally, everybody pays you in crypto. Just locate in the Cayman Islands. People buy your deep-fake software on the... this is, I mean, I don’t wanna give away my whole business model for free. People will have to tune in for the special episode on that.

Andrey: Yeah. Yeah. On that aside, well, thanks for joining us for another episode. And we look forward to your feedback, and please, boost our work or let us know what you’d like to see.

Seth: Yeah, let us know what you wanna see. We are your servants, our fans. Peace out, dudes. Oh, keep your posteriors justified.

Andrey: True, true.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
