Speaker 2
They've kind of gotten to a point where I wouldn't say there's abundant compute, but they've had enough compute to get to the models where they're at. That's not a constraint necessarily. And they've kind of exhausted as much data as they possibly can, all the frontier labs. So the next thing will be breakthroughs on the algorithms and then advancing the ball on the data side. Is that fair?

Speaker 1
Yeah, I think so. If you look at the pillars: on compute, we're obviously continuing to scale up the training clusters, so that direction is pretty clear. On the algorithms, there has to be a lot of innovation; frankly, that's where a lot of the labs are really working hard on the pure research. And then on data, you kind of alluded to it: we've run out of all the easily accessible and easily available data out there.
Speaker 2
Yeah, Common Crawl is all done. Everybody's got the same data; everyone's had the same access to it.
Speaker 1
Yeah, exactly. A lot of people talk about this as the data wall: we're hitting the point where we've leveraged all the publicly available data. So one of the hallmarks of this next phase is going to be data production. What is the method each of these labs will use to generate the data necessary to get to the next levels of intelligence, and how do we get towards data abundance? I think this is going to require several fields of advanced work and study. The first is really pushing on the complexity of the data, moving towards frontier data. For a lot of the capabilities we want to build into the models, the biggest blocker is actually a lack of data. For example, agents have been the buzzword for the past two years, and basically no agent really works. Well, it turns out there's just no agent data on the internet. There's no pool of really valuable agent data sitting around anywhere. So we have to figure out how to produce really high-quality agent data.

Speaker 2
Give an example. What would you have to produce?

Speaker 1
So we have some work coming out on this soon which demonstrates that right now, all the frontier models are bad at composing tools. If they have to use one tool and then another, say they have to look something up, then write a little Python script, then chart something, they're really, really bad at utilizing multiple tools in a row. And that's something that's very natural for humans to do.
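As an illustration of the lookup-then-script-then-chart sequence described here, a minimal sketch of what one "composed tool use" training record could look like. The trace format, tool names, and field names are all hypothetical assumptions for illustration, not any lab's actual schema.

```python
# Hypothetical sketch: a single composed-tool-use record for the
# lookup -> Python -> chart sequence mentioned above. Every tool
# name and field name here is an illustrative assumption.
trajectory = {
    "task": "Look up three numbers, compute their total, and chart them.",
    "steps": [
        # Step 1: a retrieval tool; its output becomes step 2's input.
        {"tool": "search",
         "args": {"query": "quarterly revenue 2023 Q1-Q3"},
         "output": {"Q1": 1.2, "Q2": 1.5, "Q3": 1.9}},
        # Step 2: a small Python computation over the retrieved data.
        {"tool": "python",
         "args": {"code": "sum(revenue.values())"},
         "output": 4.6},
        # Step 3: a charting tool consuming the earlier outputs.
        {"tool": "chart",
         "args": {"kind": "bar", "series": {"Q1": 1.2, "Q2": 1.5, "Q3": 1.9}},
         "output": "revenue.png"},
    ],
}

# The composition is the point: each step consumes the previous
# step's output, which is exactly where current models struggle.
for prev, nxt in zip(trajectory["steps"], trajectory["steps"][1:]):
    assert prev["output"] is not None  # a broken chain would surface here
```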
Speaker 2
Yeah, but it's not captured anywhere, right? Is that the point? You can't actually go take a capture of somebody going from one window to another, into a different application, and feed it to the model so it learns, right?

Speaker 1
Exactly. When humans solve complex problems, we naturally use a bunch of tools, we think about things, we reason through what needs to happen next, we hit errors and failures, and then we go back and reconsider. For a lot of these reasoning chains, these agentic chains, the data just doesn't exist today. So that's an example of something that needs to be produced. But taking a big step back, what needs to happen on data? First is increasing data complexity, so moving towards frontier data. The second is data abundance, increasing data production.

Speaker 2
So, capturing more of what humans actually do in their field of work.
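The reasoning chains described above, including the errors and the backtracking, could be recorded as a flat event log. This is a hypothetical format assuming a simple event-type vocabulary; it does not reflect any lab's actual data schema.

```python
# Hypothetical sketch: an agentic reasoning chain that records a
# failure and the recovery from it. Event types, tool names, and
# queries are illustrative assumptions.
chain = [
    {"type": "thought",     "text": "I need last quarter's revenue; query the DB."},
    {"type": "tool_call",   "tool": "sql", "input": "SELECT total FROM revenue_q"},
    {"type": "error",       "text": "no such table: revenue_q"},
    {"type": "thought",     "text": "Wrong table name; list the schema first."},
    {"type": "tool_call",   "tool": "sql", "input": "SELECT name FROM sqlite_master"},
    {"type": "observation", "text": "tables: quarterly_revenue"},
    {"type": "tool_call",   "tool": "sql", "input": "SELECT total FROM quarterly_revenue"},
    {"type": "answer",      "text": "4.6"},
]

# The training value is the whole chain, the error and the
# reconsideration included, not just the final answer.
recoveries = sum(1 for e in chain if e["type"] == "error")
print(f"chain of {len(chain)} events, {recoveries} recovered error(s)")
```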
Speaker 1
Yeah. Both capturing more of what humans do and investing in things like synthetic data and hybrid data: using synthetic data but keeping humans in the loop so you can generate much more high-quality data. Just as with chips, where we talk a lot about chip foundries and how to ensure we have enough means of producing chips, the same is true for data. We need, effectively, data foundries: the ability to generate huge amounts of data to fuel the training of these models. And then the last leg of the stool, which is often underrated, is measurement of the models. For a while the industry was just, oh, we add a bunch more data and see how good the model is, and then we add a bunch more data and see how good the model is. But we're going to have to get pretty scientific about exactly what the model is not capable of today, and therefore what exact kinds of data need to be added to improve the model's performance.
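The closing point, measuring exactly what a model can't do and ordering data production against those gaps, can be sketched as a simple loop. The function name, the capability names, and the pass-rate numbers are all hypothetical stand-ins for illustration.

```python
# Hypothetical sketch of the measure-then-produce loop described
# above: find capabilities where the model scores poorly, and treat
# those as targets for new data production. All names and numbers
# are illustrative assumptions.

def data_targets(eval_results, threshold=0.8):
    """Capabilities below the pass-rate threshold become data orders."""
    return sorted(cap for cap, score in eval_results.items() if score < threshold)

# Simulated per-capability pass rates for a hypothetical model.
eval_results = {
    "coding": 0.85,
    "tool_composition": 0.35,   # the weakness discussed earlier
    "long_context": 0.60,
}

targets = data_targets(eval_results)
print(targets)  # capabilities that need targeted data production
```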