Secrets, Leakage, and Benchmark Representativeness

They address dataset leakage, secret benchmarks, and how frontier models' details affect interpretation.


Episode notes

Seth and Andrey are back to evaluating an AI evaluation, this time discussing METR’s paper “Measuring AI Ability to Complete Long Tasks.” The paper’s central claim is that the “effective horizon” of AI agents—the length of tasks they can complete autonomously—is doubling every 7 months. Extrapolate that, and AI handles month-long projects by decade’s end.
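As a rough check on that extrapolation, the arithmetic can be sketched in a few lines of Python. The inputs are assumptions taken from figures the hosts quote later in the conversation (a current 50% horizon of about 2.5 hours and a 160-hour "human month"); the function name `months_to_reach` is made up for illustration and does not come from the METR paper.

```python
import math


def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float = 7.0) -> float:
    """Months until the horizon grows from current_hours to target_hours,
    assuming it keeps doubling every doubling_months."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months


# Assumed inputs: a 2.5-hour horizon today and a 160-hour "human month".
print(months_to_reach(2.5, 160.0))   # six doublings at seven months each

# Same target, starting from the roughly 30-minute horizon quoted for 80% reliability.
print(months_to_reach(0.5, 160.0))
```

Under those assumptions the gap is six doublings, or about 42 months. Starting from the shorter 80%-reliability horizon pushes the date out further, which is one reason the choice of success threshold matters.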

They discuss the data and the assumptions that go into this benchmark. Seth and Andrey start by walking through the tests of task length, from simple atomic actions to the 8-hour research simulations in RE-Bench. They discuss whether the paper’s logistic models properly measure the task length at which a model succeeds half the time. And, of course, they zoom out to ask whether “time” is even the right metric for AI capability, and whether METR applies the concept correctly.

Our hosts also point out other limitations and open questions the eval leaves us with. Does the paper properly acknowledge how messy long tasks get in practice? AI still struggles with things like playing Pokémon or coordinating in AI Village—tasks that are hard to decompose cleanly. Can completing one 10-hour task really be equated with reliably completing ten 1-hour subtasks? And Seth has a bone to pick about a very important study detail omitted from the introduction.
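One way to see why a 10-hour task is not obviously the same as ten 1-hour subtasks is a small reliability calculation. This is only a sketch under a strong assumption that is not in the paper: subtasks succeed or fail independently, each with the same probability. The function names are hypothetical.

```python
def chain_success(p_subtask: float, n_subtasks: int) -> float:
    """Probability that all n subtasks succeed, assuming independence."""
    return p_subtask ** n_subtasks


def required_subtask_rate(p_chain: float, n_subtasks: int) -> float:
    """Per-subtask success rate needed for the whole chain to succeed with p_chain."""
    return p_chain ** (1.0 / n_subtasks)


for p in (0.5, 0.8, 0.95, 0.99):
    print(f"per-subtask {p:.2f} -> ten in a row {chain_success(p, 10):.3f}")

# Per-subtask reliability needed to finish a ten-step chain half the time.
print(f"{required_subtask_rate(0.5, 10):.3f}")
```

Under this toy assumption, an agent that completes 1-hour subtasks 80% of the time finishes ten in a row only about 11% of the time. Real long tasks may be easier or harder than that, since errors can be caught and repaired or can compound.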

The Priors that We Update On Are:

* Is evaluating AI by time (task length) more useful/robust than evaluating by economic value (as seen in OpenAI’s GDP-eval)?

* How long until an AI can autonomously complete a “human-month” sized task (defined here as a solid second draft of an economics paper, given data and research question)?

* Seth’s Prior: 50/50 in 5 years, >90% in 10 years.

* Andrey’s Prior: 50/50 in 5 years, almost certain in 10 years.

Listen to see how our perspectives change after reading!

Links & Mentions:

* The Paper: Measuring AI Ability to Complete Long Tasks by METR

* Complementary Benchmarks:

* RE-Bench (Research Engineering Benchmark) - METR’s eval for AI R&D capabilities.

* H-CAST (Human-Calibrated Autonomy Software Tasks) - The benchmark of 189 tasks used in the study.

* The “Other” Eval: GDP-eval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks by OpenAI

* AI 2027 (A forecasting scenario discussed)

* AI Village - A project where AI agents attempt to coordinate on real-world tasks.

* Steve Newman on the “100 Person-Year” Project (Creator of Writely/Google Docs).

* In the Beginning... Was the Command Line by Neal Stephenson

* Raj Chetty

Transcript

[00:14] Seth Benzell: Welcome to the Justified Posteriors podcast, the podcast that updates its beliefs about the economics of AI and technology. I’m Seth Benzell, wondering just how long a task developing an AI evaluation is, at Chapman University in sunny Southern California.

Andrey Fradkin: And I’m Andrey Fradkin, becoming very sad as the rate of improvement in my ability to do tasks is nowhere near the rate at which AI is improving. Coming to you from San Francisco, California.

Andrey: All right, Seth. You mentioned how long it takes to do an eval. I think this is going to be a little bit of a theme of our podcast: evals are actually pretty hard and expensive to do. Recently there was a Twitter exchange in which one of the METR members, talking about their eval, which we’ll be discussing today, said that each new model takes approximately 25 hours of staff time to evaluate, but maybe more like 60 hours in rougher cases. And that’s not even counting all the compute that’s required to do these evaluations.

So, you know, evals get thrown around. I think people who know evals know how hard they are, but I think as outsiders, we take them for granted. And we shouldn’t, because it certainly takes a lot of work. But yeah, with that in mind, what do you want to say, Seth?

Seth: Well, I guess I want to say that I think we are the leaders in changing people’s opinions about the importance of these evals. The public responded very positively to our recent eval of OpenAI’s GDP-eval, which was trying to bring Daron Acemoglu’s view of how we can evaluate the potential economic impact of AI down to the actual task-by-task question of how successful this AI system is. People loved it. You demanded it, we listened. We’re coming back to talk to you about a new eval. Well, not a new eval, it’s about eight months old, but it’s the Godzilla of evals. It’s the Kaiju of evals.
It’s this paper called “Measuring AI Ability to Complete Long Tasks,” a study from METR. We’ve seen some updates and new evaluations of models since it first came out in March of 2025. Andrey, do you want to list the authors of this paper?

[3:05] Andrey: As usual, I don’t. There are a lot of authors of this paper. But, you know, I’ve interacted with some of the authors of this paper, and I have a lot of respect for them. I have a lot of respect for the METR organization.

Seth: Okay. But at a high level, just in a sentence, what this wants to do is evaluate different frontier AI models by the criterion of “how long are the tasks that they complete?”

Andrey: I guess what I would say before we get to our priors is, just as context, this, from everything I’ve seen, is the most influential evaluation of AI progress in the world right now. It is a measure that all important new models are benchmarked against. If something is above the trend, it’s news. If something is below the trend, it’s news. If something’s on the trend, it’s news. And it’s caused a lot of people to change their minds about the likely path of AI progress. So I’m very excited to discuss this.

Seth: It’s been the source of many “we’re so back” memes. Yeah, I totally agree, Andrey. Am I right that this was a paper that was partly inspiring the AI 2027 scenario by favorite blogger Scott Alexander?

Andrey: I don’t know if it inspired it, but I think it was used as part of the evidence in that. Just to be clear though, AI 2027 is a scenario whose vision of AGI taking over the world seemed a bit too soon to many folks. We have not done an episode on it.

Seth: We haven’t done an episode on it. But it’s fair to say that people look at the results of this paper and they see, you know, a trend that they extrapolate.
But before we get into the details of the paper, are we ready to get into our priors?

Andrey: Let’s do it.

[05:50] Seth: Okay, so Andrey, just based on that headline description: instead of evaluating AI systems by going occupation by occupation, finding tasks in those occupations that are economically valuable, and then seeing what percentage of those tasks the AI can do (that’s what the OpenAI GDPval approach that we recently reviewed did), this approach evaluates tasks by how long they are. So comparing those two approaches, I guess my first prior is, before we read this paper, which of those approaches do you see as intuitively more promising?

Andrey: One way of thinking about this is that tasks, or things people do which could be a series of tasks, are bundles, and they’re bundles embedded in some higher-dimensional space. And what these two evals are doing, this one we’re discussing here versus GDPval, is embedding them into different spaces. One of them is a time metric. And one of them is a dollar metric, right? And just by phrasing it that way, you can see what some of the issues might be with either. With the dollar metric, well, what are people getting paid for? Is it a specific deliverable, or is it being on call or being the responsible party for something? So you can see how it’s kind of hard to really convert lots of things into dollar values at a systematic level. Now, you can say the same thing about how long it takes to do something. Of course, it takes different people very different times to do different tasks. And then once again there’s chaining tasks together, and how to think about how long it takes to do that. So I think they’re surprisingly similar. I think maybe this length-of-time one is more useful at the moment because it seems simpler to do, frankly. It seems like, yes, we can get an estimate for how long it takes to do something.
It’s not going to be perfect, it’s going to be noisy, but we can get it and then we can just see whether the model does it. And that’s easier than trying to translate tasks to dollar values, in my opinion.

[8:42] Seth: Right. I guess I also am tempted to reject the premise of this question and say that they’re valuable for different things. But I guess I come into this thinking about, you know, we think about AI agents as opposed to AI tools as being this next frontier of automation and potentially supercharging the economy. And it really does feel like, when working with AI models, the rate limiter is the human. It’s how often the human has to stop and give feedback and say, “Okay, here’s the next step,” or “Hey, back up a little bit and try again.” So going in, I would say I was kind of in equipoise about which of the two is the most useful as a projection for where this is going. Maybe on your side of the ledger, I’d say that economic value is kind of a socioeconomic construct, right? That could definitely change a lot even without the tool changing. Whereas time seems more innately connected to difficulty. You can think about psychometric measures of difficulty where, you know, a harder exam is a longer exam. So at least going in, I think that this has a lot of potential to even surpass GDP-eval in terms of its value for projection.

Andrey: Yes. Yeah, yeah.

Seth: Okay. The next one I was thinking to ask you, Andrey, was, if we buy all the premises of whatever context the paper sets up for us, the question I’d like to think about is: how long until AI can do a human month-size task on its own? In the abstract of the paper, we have that happening within five years, by 2030. That seems like a pretty big bite at the apple, as they say. Do you want to take a stance on how long until an AI can do a human month-size task?
I mean, I have to say in my use of AI, I haven’t gotten anywhere near that.

[10:55] Andrey: What is an example of a human month-size task?

Seth: What’s something that takes 160 hours of work? I would say, you know, as an academic, maybe I need about three months of focus on a paper to bring it from zero to, you know, a solid second draft. So maybe a third of a paper is a month of work?

Andrey: I mean, it can do a third of a paper in a day. I’m not being facetious here. I referee a lot of papers. Is the question an end-to-end, completely no-intervention sort of thing? Because I think, look, you take Claude Code off into a folder, and the folder has the data. You tell it, “Hey, write a paper that investigates this question with this data.” It can do that in a day. I think it depends on how much human intervention you require. I think something where there’s a verifiable answer is very different from something subjective like a paper. Because we don’t want just any paper. We want the paper that we want to write. It’s not just about quality, it’s also about taste. And so I don’t think it could do “end-to-end write a paper that I like” even if I gave it a lot of scaffolding. I don’t think it could do that yet. But could it do that in five years? Sure, I think it’s possible.

Seth: And just to be a little bit more specific, can we say it gets published at a top 10 economics journal level of quality?

Andrey: The quality bars will have to increase. I mean, I think it goes to a question of whether I already have the research question and I know the data is adequate. Then yes. Very few projects are like that, of course, right? None of my recent projects have that flavor, I think, where I’ve already found the data set and the question is obvious and I just need to go plug and chug.

Seth: There are papers like that.
Raj Chetty gets the US tax records, and just needs to run some pre-registered analyses.

Andrey: That’s an interesting one, Seth. So Raj Chetty is an economist (now we’re really in the weeds) who does big public economics analyses. He works with gigantic teams on data analysis and iteration. It’s not as simple as just going to town on a dump of data. So yeah, I’d say that I can think of easier papers than Raj Chetty’s papers to implement.

Seth: Okay, but if I want to think about the same kind of general format of question, right? Which is: I have a data set, I have the general research question I want answered about the data set... let’s say the question is only specified at that level. I’m not being any more specific than that. Plus a data set. I don’t think an AI could make a plausible, complete, top 10 econ journal paper out of that right now. Do I think it could be there at a plausible level of quality in 10 years? In five years? Five years might be exactly at my cutoff. I think in 10 years for sure. In five years, 50/50.

Andrey: Interesting. Okay. Okay. So that’s... yes. So we’re both very bullish, huh? Okay. Well, you know, maybe it’s slow, but 10 years is fast enough that we’re not ready. In fact, my understanding of the METR organization is that a big part of its mission is to prepare us for AI progress that’s a lot faster than society is ready to deal with. And you know, I think it’s an important mission.

Seth: That’s my mission too, Andrey. Also, they need to be prepared for slow progress. I want to prepare society for everything. Why prepare them for only one thing?

Andrey: Society is already prepared for slow progress. Perhaps.

Seth: Okay, are we ready to move on to the evidence?

[17:34] Seth: Okay, so Andrey, we read this paper, or this eval from METR.
It looks at the probability of task completion as a function of task length across a variety of frontier models, starting with GPT-2 in 2019 and continuing through Claude 3.7, which is early-to-mid 2025. And I would say the eval works in four steps. First, they establish a human baseline for how long it takes humans to complete 169 software engineering tasks. (By the way, the abstract does not mention that these are overwhelmingly software engineering tasks. I probably would have put that in the abstract, but you know, who am I?) Second, once we’ve got that baseline, for each AI we see whether it can complete each task. That was the quote you just gave us from Twitter: once you’ve got the baselines, it takes about 60 hours of work to run each AI through its paces. Third, we run a logistic regression of “Does the AI correctly complete the task?” on “Length of task.” That gives you a data point for each model: the task length at which we think it has a 50% shot of completing an arbitrary task. And fourth, you plot all of those points for all the different models from 2019 to 2025, and you see a diagonal line pointing from models that can do one-second tasks to models that can do one-hour tasks. And if you just extend that line out a little bit, that line’s going to take all our jobs. Isn’t that right, Andrey?

Andrey: Yeah, yeah. So just to be clear, I think the numbers that I have for the extrapolations... if we think that the current horizon is about a couple of hours, and the latest model rated is GPT-5.1 Codex Max, which is just under three, the prediction for February of 2027 is 16 hours. And for April of 2028 it’s 5 days. So if we go further, we get to those month-long numbers eventually.

Seth: Okay. So maybe let’s take a minute to talk about that headline result. So, putting all these models together, they estimate a doubling time of approximately seven months.
So every seven months we get a frontier model which is able to work for twice as long. They give themselves an R-squared of 98% in fitting, what is it, 10, 15 points? Do you have anything to say about this headline result before we dive in? The one thing I wanted to point out was that this is all software engineering specific. So if you think that software engineering might obey very different doubling times than other tasks in the economy, this is only going to tell you about that one particular domain.

Andrey: Yeah, yeah. And I think that’s a really important caveat. I don’t think there is as much care here in making the tasks as realistic as possible as there was, let’s say, in GDP-eval.

[21:35] Seth: Right, different priorities. GDP-eval is very focused on “what are useful tasks.” This is more focused on the abstract “short versus long tasks.” Maybe one other high-level point I’ll make here, which is something that they emphasize: if you think that there’s just some sort of constant error in their estimates, you can shift this entire graph down. But the important thing is the doubling time, right? And if the doubling time is seven months, sure, shift the whole thing down, and it’ll take one more year to get to whatever crazy outcome you want.

Andrey: Yeah, and for what it’s worth, to me 50% completion doesn’t seem very relevant. Presumably you want 99% completion, right?

Seth: Yeah. You know, I prefer to look... they have an 80% completion option on their site that you can plot, and I tend to prefer that one. For that, the pretty current number is around 30 minutes, versus about 2.5 hours for the 50%.

Seth: There we are. Okay. So we’ve talked about the headline results. Maybe now let’s go point by point through how we end up there. So the first thing that they need to do is establish a human baseline for how long different tasks take. They do this by combining three different data sets.
The first one is sort of internal. They call them Software Atomic Actions. These are really micro tasks. The example they give is kind of hilarious. It’s: “Okay Andrey, how long is it going to take you to answer this question? I’m putting you on the spot. Which file is most likely to have a password in it? Credentials.txt, InstallationNotes.txt, Main.py, or LauncherWin.exe?”

Andrey: Wow. Wow, that is a hard question, Seth. I mean, I kind of view these sorts of tasks as similar to Cursor auto-complete tasks where, you know, you don’t need a reasoning model for this. Let’s say you have a little bug in the code, and it just auto-complete corrects it. You know, that sort of thing.

Seth: One thing I want to highlight about this... they talk a little bit about trying to do what they can to reduce the noise from overhead, from reading, from human reaction time... but it seems like they’re not going to do a super good job of distinguishing whether answering that question is a one-second task or a three-second task, right? But the difference between a one-second task and a two-second task is a full doubling here. And I guess I’m a little bit concerned if the logistic curve is learning too much from what’s the one-second version of that versus the two-second version of that.

[24:54] Andrey: Yes. Yeah, yeah. I mean, yes, there is an argument to be made that, with measurement error just swamping everything, maybe we should only start at one minute or two minutes. Now of course we can draw our own visual regression on that plot over there and see that you still have a pretty steep curve even if we throw out the first few points, right?

Seth: Okay. So that’s done internally with their own engineers or just whoever was around.
The second data set they draw on is something called the RE-Bench suite, or the Research Engineering Benchmark V1, which to quote from the paper consists of “seven open-ended ML research engineering environments and data from 71 eight-hour attempts by 61 distinct human experts.” So they’ve got these 61 guys that are doing seven of these tasks. And they confirm that their experts make progress in these environments given eight hours. The third benchmark is H-CAST, Human-Calibrated Autonomy Software Tasks. It’s designed to be a little bit more realistic to what a software engineering task would be in an economic environment. And they say that their baseliners typically have a degree from a top 100 global university and are primarily recruited via professional networks of METR employees. They’re paid $50 to $100 per hour plus $25 to $150 per hour in performance bonuses. Baseliners also did the tasks and predicted how much time it would take them to do the tasks. Curiously, only 61% of human baselines actually successfully completed tasks, right? So one thing we should be thinking about in the background here is that we kind of want to compare how long it takes a human to do a task to whether the AI can do the task. But in reality, like we talked about, it’s higher dimensional than that. There’s not just how long it takes a human, but with what probability a human can do it in a certain length of time.

Andrey: Yeah. Or which human? And does the human have the context ahead of time? Or, you know, are they an expert in this type of work or not, right? There’s no one number for the human.

[27:38] Seth: Exactly. And for that third data set they record 189 tasks that they evaluate, across which there are 563 human baselines. So I guess a second note here is that these aren’t giant populations of people. I guess you wouldn’t expect this to be giant populations of people. You know, is 61 people being judged on their research engineering skills a lot? A little?
I mean, on the one hand 61 seems like a small sample for all of humanity, but on the other hand getting 61 serious software engineers’ time for a thousand hours is a bigger deal.

Andrey: Yeah. Yeah. I mean, it’s hard. This goes back to our discussions of cost, right? To do these sorts of metrics well, especially for valuable tasks, is just very expensive. You know, look, there’s also this question of which population we want to sample from. In the economy, experts are oftentimes doing the work. And that expertise can be very, very narrow, right? Think about economists. One person studying the medical industry is going to have very different expertise than a different person studying the energy industry, even if they use the same methods. So yeah, I think the question of what population you want to sample is an interesting one.

Seth: Very, very well put. One other detail here that is interesting is that it’s mixing together some pretty different evals. For the other ones, they just see how long it takes a person to finish, conditional on finishing. For RE-Bench, unlike those, they give everyone eight hours, figure out the average quality people were able to reach in those eight hours, and that’s going to be their cutoff for an eight-hour-length task. So there’s a little mixing and matching going on. I’m not saying that they p-hacked this, but there’s some informality going on. Is there anything else you want to say on the creation of the baselines before we move on?

Andrey: Well, I think there’s one other data set that they use, which was the internal pull request (PR) experiments. I don’t know if you read this part, where they ran these models on some issues in the internal METR code base. So these are ones that would not have been in any training set, certainly.
And they found that their contract baseliners take 5 to 18 times longer to resolve these issues than the repository maintainers. So the people whose job it is are 5 to 18 times more efficient than contract baseliners on these tasks.

Seth: So the idea is METR coders are very smart boys. And girls.

Andrey: No, they actually don’t say that. And I disagree with your statement here. Not that they aren’t smart, but more that they say that it’s all about context, right? If you’re dealing with a code base and you’re very used to it, you can diagnose the problem very easily. You can solve it very easily. If you’re not, then it takes you a while to load the context back in. I mean, we’ve all had this. You work on a research project, you take a little break for a few months, and now you come back and something that should be very simple takes you a few hours because you just don’t remember the code anymore, right?

[31:38] Seth: I wanted to bring up one last point here, Andrey, before we move on, which is around the question of how many people we need to establish the correct baseline. So we’ve already talked about how context matters: have I already loaded in the prior knowledge or am I coming in cold? Am I a super smart expert or am I a man off the street? Those all definitely matter. But one thing I’d like to point out is that if we think that some of these tasks have a very long tail in completion time, right? Which seems really plausible for a very hard research engineering task, where some people can do it in a short amount of time and some people take twice that and some people take twice that again... a very long tail... as the variance of people’s abilities to complete this task goes up, you’re going to be less and less confident in your estimate with a small N.

Andrey: Yes. Yes, yes. I think that’s right. But once again it’s not clear to me where we want the minimum...
whether we want the average or the min. There’s a very good argument for the min.

Seth: Right. If what we care about is superhuman ability, then I guess we want the min.

Andrey: No, or just comparable to a professional working on the code base. Not even superhuman, right?

Seth: Do we really want the strict min? If the question is “how long does a certain journey take,” I’m not sure we want to include the person who by chance had just looked up that number.

Andrey: I think the min is perhaps too far... but something much closer to what someone day in, day out in the code base would do. One question is how much you accelerate a company with an existing code base and professional software engineers. For me, maybe that’s not the relevant benchmark. I’m not a professional software engineer. And so I don’t care if it’s better or worse than the best professional coder. I care if it saves me time. Which could be much more economically relevant if we think that the value of better software engineering is coming from the fact that now everyone can be a software engineer.

Seth: I think that’s very fair. But as we get deeper into this, I’m becoming more convinced that if you really care about economic value, you should be reading the GDPval paper, not this paper.

Andrey: Okay. Okay.

Seth: So the second step of this process is, for each AI, seeing whether it completes each task. Right? So we’ve got these benchmarks. We’ve got the short benchmarks, the medium-length benchmarks, the long benchmarks. How many can each AI do? I guess the one note I want to bring up here is that they do some basic scaffolding. They claim it’s not elaborate. They try to bring some agent tools to the early models.
So early models were not set up at all for these longer projects, but they try to give them a little scratch pad and a little “remember, these are the most important command line codes.” But you could imagine a version of this test that would have zero scaffolding, or a version that would have very elaborate subtask-specific scaffolding, and they’re kind of closer to the first.

Andrey: Yeah, and I think that’s fair, to have a comparison baseline. It’s also becoming less and less representative of how people are using the models, right? I think if you’re serious about using the models, you’re giving them skills and putting in the right context. Certainly you’re using Cursor or Claude Code or Codex, where there are a lot of optimizations. So one argument here is that if you’re serious about using these models, they’re actually a lot better than what’s portrayed in this benchmark.

Seth: Yeah, I think that’s definitely right. And again, one of the running themes of this podcast is the “Bitter Lesson” and how important the frontier-ness of the model is versus the customization and the specific task orientation of the model. We don’t really get... you know, they just say “we do light scaffolding.” And I guess before we move on, the range of tasks here are all designed so they can be done through the command line. So it’s not like ChatGPT immediately fails everything because it can’t make a picture.

Andrey: Seth, I thought that everything could be done through the command line. In fact Neal Stephenson famously said…

Seth: In the beginning there was the command line. That’s a good book.
That’s a good book.

Andrey: Cryptonomicon, for those who don’t know.

[36:10] Seth: No, he has an essay collection called In the Beginning Was the Command Line also.

Andrey: Yes, that’s true, that too.

Seth: And in the essay collection, this is the one thing I remember, he compares Macs to the Batmobile. (Seth cuts in with a correction: Actually, he compared Mac OS to a luxury European car, Windows to a station wagon, Linux to a free tank, and BeOS to the Batmobile. Apologies to Mac OS fans for comparing their OS to the Batmobile.) It was a very 1990s book. It was like an OS wars book.

[37:01] Andrey: I’d just say that Neal Stephenson, in terms of the pantheon of prophets... (Seth: He got crypto right.) He got Uber right. He got virtual reality right. Wait, wait, wait. Okay. So right. Crypto. (Seth: He does think that there needs to be a big pile of gold somewhere. Which turns out to not be the case. Maybe he gets stablecoins right.) Yeah, but I guess there are many things he got right, certainly in Snow Crash, that were way, way ahead of their time. It’s one of those things where you almost imagine that the sci-fi author kind of causes the subsequent innovations. And maybe with AI there’s a similar sense to that, because so many people who’ve developed these technologies were inspired by reading science fiction.

Seth: And the AI is reading the science fiction too, Andrey.

Andrey: Yeah, well, it’s not clear whether we want the AI to read the science fiction. It might develop some weird notions of what might happen in the future.

Seth: Yeah. Read Bicentennial Man, don’t read Frankenstein. Let’s leave it at that. Okay. I could talk about Neal Stephenson for a whole episode. So let’s hold off on that. Okay. So the third step we promised the listeners is running the logistic regression.
So what we have here, which I’ll put up at the bottom of my screen, is for each of the models that they evaluate: you can see this nice logistic curve that starts at 100% for a sufficiently short task and moves down to 0% for a sufficiently long task. And I don’t know, Andrey, I look at these curves and a lot of them don’t seem particularly logistic. A lot of them are not even monotonic. It seems like you’re assuming the conclusion if you think that AI can do all one-second tasks. My read is that AI cannot do all one-second human-completable tasks. And the idea... logistic models are two-parameter models. So like we talked about, it’s learning just as much about this curve from going from four seconds to eight seconds as from going from one hour to two hours. Which seems like the wrong way of thinking about it.

Andrey: Yeah, I mean, I guess, is it really that different from just extrapolating the point at which it has a 50% success rate? And then, you know, if we actually look at that point non-parametrically, it seems pretty close to where we end up, right?

Seth: The 50-50 point. I think for a lot of these, if I was trying to draw a diagonal line, I guess my midpoint, my 50-50 point, would be similar. I guess I don’t know how to think about this GPT-2 example where…

[40:37] Andrey: Sure. I mean, I think we already both kind of argued that we might as well toss them. And it wouldn’t really make a difference. So let’s toss the early ones.
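The third step the hosts describe, regressing success on task length and reading off the 50% point, can be sketched as follows. This is a minimal illustration rather than METR's code: the outcome data are invented, the model is assumed to be a two-parameter logistic in log2(minutes), and the helper names (`fit_logistic`, `horizon_minutes`) are hypothetical. It uses plain Newton's method so that it runs with the standard library alone.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, iterations=25):
    """Fit P(success) = sigmoid(a + b*x) by Newton's method; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to the intercept
            g1 += (y - p) * x      # gradient with respect to the slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def horizon_minutes(a: float, b: float, success_rate: float) -> float:
    """Task length in minutes at which the fitted curve crosses success_rate."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2.0 ** ((logit - a) / b)


# Hypothetical (task minutes, succeeded) outcomes for one model; not METR data.
runs = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (15, 1), (30, 1), (30, 0),
        (60, 1), (60, 0), (120, 0), (120, 1), (240, 0), (240, 0), (480, 0), (960, 0)]
xs = [math.log2(minutes) for minutes, _ in runs]
ys = [float(succeeded) for _, succeeded in runs]

a, b = fit_logistic(xs, ys)
h50 = horizon_minutes(a, b, 0.5)
h80 = horizon_minutes(a, b, 0.8)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

Because the curve is fit in log time, a step from four to eight seconds moves the predictor exactly as far as a step from one hour to two, which is the property Seth questions. The 80% horizon always lands at a shorter task length than the 50% one whenever the fitted slope is negative.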

Seth: We’re not going to focus on the ones that can knock all these one-second tasks out of the park. One thing I guess I think about is, they talk in the caption for this figure about a jump in between the atomic tasks and the H-CAST tasks. And you do kind of see that in a bunch of these figures. But then I also see a jump at the eight-hour tasks, right? Because we know that there’s a lump of eight-hour tasks that they get from RE-Bench. This is not to punch down on a paper that is a really good paper, definitely inspirational and definitely influential, correctly. But I think when you dig into these curves, I am not convinced that the logistic model is definitely the right model. And then I guess I lose maybe a little bit more faith than you do that we’re correctly finding the 50-50 point in these.

Andrey: Yeah. I think there are other criticisms that are much deeper than this one, is maybe what I’d say. We already mentioned them. These are programming tasks. They’re very selective. (Seth: Yes. Yeah. There are other deeper criticisms. We’ll get to those.)

Seth: Dude, how do they not put that in the abstract? I don’t know. That’s something I ask. I’ll tell you why you don’t put it in the abstract, and not to cast aspersions... it’s the hubris of someone who thinks that software engineering is the final task.
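
On the earlier question of whether the 50-50 point depends on the logistic form, a non-parametric cross-check can be sketched as follows: take the empirical success rate in each task-length bucket and interpolate, in log2 of length, where it first drops below 50%. The bucket lengths and success rates below are made up for illustration (including a bump at the eight-hour bucket), and `crossing_minutes` is a hypothetical helper, not something from the paper.

```python
import numpy as np

def crossing_minutes(minutes, success_rate, p=0.5):
    """First task length at which the empirical success rate drops below p.

    Linearly interpolates between adjacent buckets in log2(minutes).
    Returns None if the rate never crosses below p.
    """
    x = np.log2(np.asarray(minutes, dtype=float))
    y = np.asarray(success_rate, dtype=float)
    for i in range(len(x) - 1):
        if y[i] >= p > y[i + 1]:
            frac = (y[i] - p) / (y[i] - y[i + 1])
            return 2.0 ** (x[i] + frac * (x[i + 1] - x[i]))
    return None

# Hypothetical bucketed results; note the non-monotonic bump at 480 minutes.
bucket_minutes = [1, 4, 15, 60, 240, 480]
bucket_success = [0.95, 0.90, 0.70, 0.55, 0.30, 0.35]

h = crossing_minutes(bucket_minutes, bucket_success)
print(f"non-parametric 50% crossing: about {h:.0f} minutes")
```

With non-monotonic data there can be more than one crossing, and this sketch only reports the first. That ambiguity is one concrete way a curve that is "not even monotonic" makes the 50-50 point harder to pin down.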

Andrey: Tell me about these messiness scores. Did you read about those?

Seth: Right. They have 16 of them. Why don’t you tell us about the messiness scores, Andrey.

[42:50] Andrey: Yeah, so there’s an idea that, look, if you have a very well-defined task, like implement some algorithm, verify that the results are working, that’s way easier for an AI to do than “Hey, I don’t know how to solve this problem, try a bunch of things and solve it for me.” That’s very messy. You don’t really know what the right solution is; there’s maybe no objective solution to that. And so you might think of a dimension here that’s messiness, in addition to some sort of difficulty level. And so they have a bunch of ratings of the messiness of these different tasks, and one thing I’ll say is that most of these tasks are not very messy. Now, what else will I tell you: working at my job, most of the tasks I do are super messy.

Seth: They don’t give you the easy jobs, Andrey.

Andrey: No, no. I mean, look, once again, maybe the intern is getting these very non-messy jobs, but I am not. So I do think it’s an important dimension. Not to say that the AI can’t do the messy jobs. They’re not even in the data set that’s being evaluated here.

Seth: Right. I think that’s a very fair point, which is that this is a set of tasks that is really designed to be as amenable as possible to sticking the agent on it and coming back later. That’s intriguing, and it’s inspirational, and it’s vertiginous, is maybe the word I want to use. But it maybe doesn’t extrapolate directly to normal people’s interaction with these tools. One other way I might want to frame this, and we talked about this in the beginning, is that tasks are multi-dimensional.
They have lengths, but they also have messiness. They also have difficulty. They have verbal difficulty and math difficulty and difficulty on lots of different dimensions. You could imagine a world in which there’s lots and lots of evals. More than 169. Maybe, let’s say, a thousand of these benchmarks. And we could actually estimate something that’s kind of multi-dimensional, right? So success probability as a function of the length of the task, the verbal difficulty of the task, the math difficulty of the task. And then throw in model year as just another parameter. Or as another interaction term.

Andrey: What an economist. Just add more fixed effects.

Seth: Dude, machine learning! Let there be interactions too, right? Let it have whatever shape you like. That’s the dream. Maybe it’s an unrealistic dream given how expensive even putting together 160 benchmarks is. But it seems like if you wanted to estimate the role of year in how good a model is at doing a thing, you would want a model where year is a parameter in the model.

Andrey: Yeah. I mean, for what it’s worth, there aren’t that many models... let me take that back. There aren’t that many frontier models. There are a lot of models that are around. But I think this benchmark is really focused on the frontier models, and over the course of this year we’ve maybe had 10 total frontier models. So if you want to run that regression, you’re going to have too many parameters.

Seth: Well, how about this? You don’t only focus on frontier models. You try to do this prediction not as a function of “is this the frontier model” and year; you do it as a function of model size. And maybe instead of one frontier model every year, there’s one frontier model at each size every year.
And you can get a little bit richer data.

Andrey: Sure. I will tell you that we actually don’t know the size of the frontier models.

[47:04] Seth: Yeah, they don’t tell us. They don’t say it’s got a gazillion parameters. It’s secret. You know, the “all right, keep your secrets” meme. All right.

Andrey: Well, look, to any of our listeners at the various illustrious labs, a little tip might be appreciated, so we know what sizes we’re working with.

Seth: Okay. So that’s a fair point. I think another point I would make here, when we’re talking about secrecy, is that the evals also have to be secret, right? If I’m putting on my reviewer 2 hat, I want to see the evals. I understand that you can’t put them on the internet, because then the AI companies will cheat at the evals. But it’s a non-optimal thing that they have to do.

Andrey: Yeah, and there is a sense that some of these tasks that they have them do are a bit leaky. I have some intel…

Seth: You want to name names?

Andrey: No, I mean, look, I haven’t dug into them myself, but having gotten some intel, let’s just say that they’re not things that are that different from what you might have trained on, a lot of the time.

Seth: Okay. All right. So are we ready to start talking about discussion, limitations? I feel like we’ve run through the paper now. Is there anything else you want to say in terms of the technical evidence side before we move into more free-wheeling discussion?

Andrey: Let me just say, I think this is a really important topic and episode for us, because I truly do think that this eval is driving so much of the conversation, and most of the people have not read at all what the eval is.
And I will especially thank... so I’m in this Twitter group chat called the “Demon Economics Research Unit” with a lot of very based participants who pointed me to various resources, various very interesting writings on this eval, that I benefited greatly from when thinking about limitations here. So let me give you a limitation that one can think about. Have you ever tried to watch Sonnet play Pokémon, Seth?

[49:37] Seth: Oh, I remember really early on... I remember Twitch Plays Pokémon, and it was terrible, right? I have not seen Claude play Pokémon. What is that like?

Andrey: It’s pretty slow going. And it’s not very successful. It’s a game with no fail state; you can just keep on grinding it. Just to be clear, a child can play this game quite successfully. And this is something an AI just has a very hard time doing. A number I have here is that, not the current Gemini, but I think it was Gemini 2.5 Pro, took 888 hours to minimally beat Pokémon (that’s the Elite Four, not capturing all Pokémon), with a dozen intense human handholds like tile labeling.

Seth: Wow.

Andrey: So it’s very easy to naively look at this graph and think, yeah, now it’s four hours, now it’s, you know, so on. But let’s think about something like Pokémon, which humans can do quite well, where even when the AI can do it, the amount of tokens involved is immense. It’s just staggering.

Seth: Andrey, when will it become economically viable to export our Pokémon play?

Andrey: Yeah. I’d just say that, look, obviously these tokens are going to become cheaper over time and more efficient and whatever, but we have to take things with a grain of salt.
Here’s another piece of evidence that was brought up in the group chat that I think was quite convincing to me. Have you ever heard of AI Village?

Seth: No.

Andrey: So AI Village is an experimental project where AIs, personified with the different models like Sonnet and Gemini and GPT, are coordinating on different tasks. (Seth: Like what? Like Stardew Valley kind of?) Yeah, exactly. So they might be coordinating on successfully selling a shirt online, or getting some likes for a web page, or something like that.

Seth: So these are real world tasks or these are simulated tasks?

Andrey: Real world tasks. (Seth: Okay, cool.) Real world tasks. And I encourage everyone to go to it and see how well that’s going for the AIs.

Seth: Do they sell ten thousand dollar tungsten cubes?

Andrey: That’s a different interesting project, a project called Project Vend, maybe another one in the same vein. But this AI Village just goes to show that these AIs are missing something. They’re not able to do things that humans can do quite easily. Especially coordination, but not just coordination. They just get tripped up on interacting with various pieces of the digital world. I’m a big optimist that that will be improved, of course, but we have to take these time numbers truly with a grain of salt.

[53:15] Seth: Right. I guess one question I had going in, and I’m not sure whether we get a hard yes or no answer on this, is to what extent doing a two-hour task is just doing two one-hour tasks correctly in a row, right? (Andrey: Yeah.)
To the extent that it is, to the extent that it’s just a six sigma problem, to the extent that it’s just like Waymo and it’s like, okay, you need to not crash one minute in a row a thousand times, it seems like these extrapolations are pretty straightforward, right? But if, on the other hand, longer tasks are somehow qualitatively different, because they involve complex interactions between subtasks and interactions with the world in a way that you never get with one-second tasks, then these projections become a little bit more dubious. I guess I would also say that there are also reasons it could be easier to do these longer tasks, because you can always back up and retry. But I wish there was a little bit more in here... I guess with the messiness they get at this maybe a little bit, but I wish there was more about what’s going on beyond just reliability going up on each subtask.

Andrey: Yes. I agree that would be very interesting. I mean, one version of that could be, is it reasoning? (Seth: Planning.) Reasoning is a constraint, right, like planning. Yeah, planning, I don’t know. I guess one way we can think about this is that 50% reliability is actually quite low, and if we wanted to get to, let’s say, the reliability of a good worker at a company, maybe that’s 99% reliability. So one argument that maybe the METR authors might bring is, look, the trend is the same regardless of the percentage numbers, and you can just shift everything down. But otherwise we’re doubling very quickly, and that still has enormous economic implications. And then, unfortunately, we don’t have any evidence, or not that we don’t have any, but I would love to see a 99% reliability threshold in this benchmark.

Seth: It’s not sensitive enough, right?
If there’s a hundred tasks, right, or a hundred and sixty that they’re doing, you’re just not going to get 99%, and you’d be worried that it’s selected, yeah.

Andrey: Yes. Another comment, another very interesting critique.

Seth: Keep them coming, dude, these are all great.

Andrey: It’s thinking through what an actual human project requires in terms of hours. And there was this very interesting essay by this guy Steve Newman. I guess he developed Writely, which ended up becoming Google Docs. And he talks about something being a prototype, which was his initial Writely, that took about four months to build. It was kind of hacked together. And it was kind of self-contained and so on. And then he talks about a subsequent project he did called Scalyr, I don’t know how to pronounce it, whatever. And he estimated that that product took a hundred person-years to do. Which is not a crazy idea if you imagine you have a company and you have a hundred employees and it took you a year to build your initial project. I mean, most startups don’t work that way.

[57:15] Seth: That’s a mythical man-month.

Andrey: Yeah, it’s not quite that, but there is something substantive there. Or alternatively, one way to think about it is, maybe what we need to get to is a hundred person-years, not even one year for a person. Right?

Seth: We need that for whom?

Andrey: We need that to have the AI end-to-end develop, you know, build things, truly build things.

Seth: For you to really feel like you have that one-person company, right? The mythical one-employee unicorn.

Andrey: Yeah, exactly.
The zero-employee unicorn, you know.

Seth: Well, I mean, dude, you gotta make yourself CEO.

Andrey: CEO is, I don’t know, as someone who has an S-corp, that’s not necessarily, you know…

Seth: Do you call yourself president? What’s your position at your S-corp?

Andrey: I think I’m president, yeah.

Seth: Oh wow. I’m going to be Chief Czar of my S-corp. Does your S-corp have a fun Lord of the Rings name, or is it like Andrey Consulting or something?

Andrey: It’s actually very related to this podcast. It’s called Justified Strategy.

Seth: Ooh, I like it. See, the marketing synergies are obvious here. That’s something the AI can’t do yet. Oh man, do you have any more of these hot limitations, or have I tapped you out?

Andrey: No, I mean, look, I think we’ve said enough on the limitations.

Seth: We’ve done two man-hours of talking about this. Okay. So let’s move into our posteriors.

[59:10] Seth: So Andrey, can I tell you a joke before we move into our posteriors?

Andrey: No jokes allowed.

Seth: Well, I’ll tell you an unfunny anecdote then. I heard a joke once. A man goes to the doctor. He says that he’s unevaluated. Says that life is meaningless and vague and uncertain. The doctor says that treatment is simple. The great evaluator METR is in town. Go and see them. That will get you evaluated. And then METR says to the doctor, “But I am METR.” You know, drum roll, curtain closes. I mean, it’s so tempting to want to do the meta thing here and ask, because the evaluating is such a software engineering-y kind of task. It is sort of surprising that you have that Twitter post saying that it takes them 60 man-hours to do the evals.

Andrey: I mean, look, I’m sure they’ve tried to automate more of it, but yeah, I agree it is a very meta point. It is...

Seth: Hopefully someone got a laugh out of that. All right. So the first posterior we have to come back to is, is this more or less useful than GDPval?

Andrey: Look, I think it’s hard to argue, given where we are today, that this is not more useful. This has been in the media a lot more than GDPval. I think one of the reasons it’s more useful is because lots of models are plotted against it. There’s more of a trend. Maybe GDPval will have this flavor going forward. But it is worth thinking through just the fact that GDPval is also a way more expensive eval.

Seth: Right. I don’t know, dude. I know GDPval is way more expensive. I vastly prefer that to this. This is a good paper, I have nothing against this paper. But you can’t say this is about agents generally and then not put in the title that it’s just software engineering. I love the breadth that GDPval tries to get at; that’s just not present here. It’s vertiginous to look at that curve going up to 10-hour tasks, 20-hour tasks, 40-hour tasks, but the fact that it’s vertiginous and newsy doesn’t make it better necessarily.

Andrey: Sure. Yeah, I mean, I hear that point.

Seth: The second thing we wanted to think about is how long until AI can do a human month-size task on its own. We’ve defined that as doing a good draft of an econ paper given a premise and a giant data set. Viewers at home, think about your own month-long task that you’re familiar with. I said maybe 50-50 in five years and 90% in ten years. This paper is a good paper, it’s an intriguing paper, but when you dig into it, it says a little bit less than what it seems to on its face.
So to the extent that I was thinking that we were going to be there for sure in 10 years and 50-50 in five years, I at least have to take a step back and put bigger error bars on that latter one and maybe go down to 70-80%...

Andrey: I’m confused, Seth. How could that be? Because if that was your prior... this didn’t have negative information, so I would believe it if you said your prior didn’t change…

Seth: No, no. It signaled me down, right? When I came into this paper, I had an assumption about what this paper would say. So I had a prior that included “Oh, and there’s this great paper that says 7 months.”

Andrey: Okay, I see. So your prior already included some notion about what the paper is. Got it. I hear you.

Seth: So this paper was less impressive than I anticipated. And so I think my five-year estimate is maybe about the same, but my 10-year estimate comes down a little bit.

Andrey: Yeah, I think I’m more confident than you that in five years we’ll have it. So my 10-year doesn’t change very much. I think the interesting thing is, do we get there in two years or do we get there in five years? And because of the narrow domains here, there’s other evidence... for example, we’re recording this as Opus 4.5 was recently released, the latest Anthropic model, and that has updated my priors a lot more than this paper.

Seth: Yeah. Do you want to talk about that for a little bit, and that can be our wrap-up discussion? What has impressed you about the latest models?

Andrey: I mean, look, through a variety of benchmarks they seem very good, but I’ve had a chance to work with it yesterday and I was extraordinarily impressed.

Seth: Give me a little bit more, dude, just a taste. What was one cool thing it did?

Andrey: It’s too secret, dude.

Seth: All right.
Andrey: Let’s just say, when thinking about writing a paper, it did something that would probably have taken me a week, in probably about an hour.

Seth: All right. Okay. That’s a week-long task, 40 hours of work. That’s off the charts in what we’ve been looking at.

Andrey: Yeah. I do think one constraint there is, if you look at the clock time, for me it was longer than an hour, but I could use a lot of that time to do other things. My interventions into it were expert but relatively minimal, and it did a lot of awesome stuff on its own, very effectively.

Seth: Right. So listeners at home, we are not AI pessimists. We think that there’s a lot going on here. This paper is very intriguing, vertiginous, exciting, maybe a little bit less than it seems on its face. But we are watching this space, and we’re looking forward to seeing how good these agents get and how long the tasks they can do become moving forward.

[1:05:47] Andrey: All right. Keep your posteriors justified.

Seth: And if you have another cool eval you want us to eval, send it our way.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit empiricrafting.substack.com
