Speaker 2
now we're really in for a ride. So when I had Zuck on the podcast, he was claiming not a plateau per se, but that AI progress would be bottlenecked specifically by this constraint on energy, that gigawatt data centers are going to need another Three Gorges Dam or something. I know there are companies, according to public reports, who are planning things on the scale of a gigawatt data center. But a 10-gigawatt data center, who's going to be able to build that? And a 100-gigawatt data center, that's like a state's worth of power. Are you going to pump that into one physical data center? How is this going to be possible? What is Zuck missing?
Speaker 1
I mean, you know, I don't know. Six months ago, 10 gigawatts was the talk of the town. I feel like now people have moved on; 10 gigawatts is happening. I mean, there's The Information report on OpenAI and Microsoft planning a hundred-billion-dollar cluster. So, you know, if you do that...
Speaker 2
Is that a gigawatt?
Speaker 1
I mean, I don't know. But if you try to map out how expensive the 10-gigawatt cluster would be, that's maybe a couple hundred billion dollars, so it's on that scale. And they're planning it, they're working on it. So it's not just my crazy take. AMD, I think, forecasted a $400 billion AI accelerator market by 2027, and AI accelerators are only part of the expenditures. So I think something like a trillion dollars of total AI investment by 2027 is very much on track. The trillion-dollar cluster is going to take a bit more acceleration. But we saw how much ChatGPT unleashed, right? Every generation, the models are going to be kind of crazy, and it's going to shift the Overton window. And then, obviously, the revenue comes in. These are forward-looking investments, and the question is, do they pay off? We estimated the GPT-4 cluster at around $500 million. By the way, a common mistake people make is to say $100 million for GPT-4, but that's just the rental price. They're like, ah, you rent the cluster for three months. But if you're building the biggest cluster, you've got to build the whole cluster, pay for the whole cluster. You can't just rent it for three months. But really, once you're trying to get into the hundreds of billions, eventually you've got to get to something like $100 billion a year of revenue. I think this is where it gets really interesting for the big tech companies, because their revenues are on the order of hundreds of billions. So $10 billion is fine, and it'll pay off the 2024-size training cluster, but where it'll really be gangbusters for big tech is $100 billion a year. So the question is how feasible $100 billion a year of AI revenue is. It's a lot more than right now, but if you believe in the trajectory of the AI systems, as I do, and which we'll probably talk about, it's not that crazy. There are, I think, 300 million-ish Microsoft Office subscribers, and they have Copilot now. I don't know what they're selling it for, but suppose you sold some sort of AI add-on for a hundred bucks a month, and a third of Office subscribers subscribed to that. That'd be a hundred billion a year right there. You know, a hundred dollars a month is a lot.
Speaker 2
It's a lot, yeah.
Speaker 1
It's a lot, it's a lot, for a third of Office subscribers. But for the average knowledge worker, it's like a few hours of productivity a month, and you'd have to be expecting pretty lame AI progress for it not to be worth a few hours of productivity a month.
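A quick back-of-the-envelope check of that figure, using the speaker's rough numbers; the subscriber count, price, and adoption rate are the illustrative assumptions from the conversation, not actual Copilot pricing:

```python
# Back-of-the-envelope sketch of the "$100B/year from an Office AI add-on" figure.
# All inputs are the speaker's rough, assumed numbers, not real pricing data.
office_subscribers = 300_000_000   # "300 million-ish" Office subscribers
adoption_rate = 1 / 3              # a third of them buy the add-on
price_per_month = 100              # $100/month AI add-on

annual_revenue = office_subscribers * adoption_rate * price_per_month * 12
print(f"~${annual_revenue / 1e9:.0f}B per year")   # ~$120B, i.e. on the order of $100B/year
```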
Speaker 2
Okay, sure, so let's assume all this. What happens in the next few years? What can the AI that's trained on the one-gigawatt data center do? What can the one on the 10-gigawatt data center do? Just map out the next few years of AI progress for me.
Speaker 1
Yeah, I think the 10-gigawatt-ish range is my best guess for when you get the true AGI. And again, I actually think compute is overrated, and we're going to talk about that, but we'll talk about compute right now. So I think in '25, '26 we're going to get models that are basically smarter than most college graduates. A lot of the economic usefulness really depends on unhobbling: the models are smart, but they're limited, right? There's this chatbot, and then there are things like being able to use a computer, being able to do agentic, long-horizon tasks. And then I think by '27, '28, if you extrapolate the trends, and we'll talk about that more later and I talk about it in the series, we hit models that are basically as smart as the smartest experts. And I think that unhobbling trajectory points to something that looks much more like an agent than a chatbot, almost like a drop-in remote worker. I think this is the question on the economic returns. A lot of the intermediate AI systems could be really useful, but it actually just takes a lot of schlep to integrate them. Like GPT-4, or whatever, 4.5: there's probably a lot you could do with them in a business use case, but you really have to change your workflows to make them useful. It's a very Tyler Cowen-esque take; it just takes a long time to diffuse. We're in SF, so we miss that, or whatever. But I think in some sense, the way a lot of these systems want to be integrated is you get this sort of sonic boom: the intermediate systems could have done it, but it would have taken schlep, and before you do the schlep to integrate them, you get much more powerful systems that are unhobbled. So they're this agent, this drop-in remote worker, and you're interacting with them like a coworker. You can do Zoom calls with them, you're Slacking them, you're like, ah, can you do this project, and then they go off and go away for a week, write a first draft, get feedback on it, run tests on their code, and then they come back and you tell them a few more things. And that'll be much easier to integrate. So it might be that you actually need a bit of overkill to make the transition easy and to really harvest the gains.
Speaker 2
What do you mean by the overkill? Overkill on the model capabilities?
Speaker 1
Yeah, yeah. So basically, the intermediate models could do it, but it would take a lot of schlep. And then the drop-in remote worker kind of AGI that can automate cognitive tasks is the thing that actually ends up getting used. The intermediate models would have made the software engineer more productive, but will the software engineer adopt them? And then with the '27 model, well, you just don't need the software engineer. You can literally interact with it like a software engineer, and it'll do the work of a software engineer.
Speaker 2
So the last episode I did was with John Schulman, and I was asking about basically this. One of the questions I asked is: we have these models that have been coming out in the last year, and none of them seem to have significantly surpassed GPT-4, and certainly not in the agentic way where you interact with them as a coworker. They'll brag that they got a few extra points on MMLU or something. And even with GPT-4, it's cool that it can talk like Scarlett Johansson or something, but... and honestly, I was going to use that. Oh, I guess not anymore. Okay, but the whole coworker thing. This is going to be a run-on question, but you can address it in any order. It makes sense to me why they'd be good at answering questions: they have a bunch of data about how to complete Wikipedia text or whatever. But where is the equivalent training data that enables it to understand what's going on in the Zoom call, how that connects with what they were talking about in the Slack, what the cohesive project is that it's going after based on all this context it has? Where is that training data coming from?
Speaker 1
So I think a really key question for AI progress in the next few years is how hard it is to unlock the test-time compute overhang. Right now, GPT-4 answers a question and it can do a few hundred tokens of chain of thought, and that's already a huge improvement. That was a big unhobbling: before, it would answer a math question by just shotgunning it, and if you tried to answer a math question by saying the first thing that came to mind, you wouldn't be very good either. So GPT-4 thinks for a few hundred tokens. And if I think at something like a hundred tokens a minute...
Speaker 2
You think at much more than a hundred tokens a minute.
Speaker 1
I don't know. But if I thought at like 100 tokens a minute, then what GPT-4 does is maybe equivalent to me thinking for three minutes or whatever. Now suppose GPT-4 could think for millions of tokens. That's plus four orders of magnitude of test-time compute on a single problem. It can't do that right now. It kind of gets stuck: it writes some code, and even if it can do a little bit of iterative debugging, eventually it just gets stuck on something. It can't correct its errors and so on. So in a sense there's this big overhang. In other areas of ML, there's this great paper on AlphaGo where you can trade off train-time and test-time compute. And at a hundred tokens a minute, millions of tokens is a few months of working time, and there's a lot more you can do in a few months of working time than in a few minutes right now. So the question is how hard it is to unlock that. The short-timelines AI world is the one where it's not that hard. And the reason it might not be that hard is that there are only really a few extra kinds of tokens you need to learn. You need to learn the error-correction tokens, the tokens where you're like, ah, I think I made a mistake, let me think about that again. You need to learn the planning tokens: I'm going to start by making a plan, here's my plan of attack, now I'm going to write a draft, now I'm going to critique my draft and think about it. These aren't things models can do right now, but the question is, how hard is that? And in some sense there are two paths to agents. When Sholto was on your podcast, he talked about scaling leading to more nines of reliability, and that's one path.
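To make the arithmetic concrete, here is a minimal sketch using the speaker's rough figures; the token rate and working-month length are assumptions for illustration, not measured quantities:

```python
# Back-of-the-envelope version of the "test-time compute overhang" arithmetic.
# All numbers are the speaker's rough, assumed figures.
tokens_per_minute = 100          # assumed human "thinking" rate
current_cot_tokens = 300         # "a few hundred tokens" of chain of thought today

minutes_now = current_cot_tokens / tokens_per_minute
print(f"today: ~{minutes_now:.0f} minutes of human-equivalent thinking")   # ~3 minutes

overhang_tokens = current_cot_tokens * 10**4     # +4 orders of magnitude: millions of tokens
minutes_per_working_month = 60 * 40 * 4          # roughly a month at 40 hours/week
months = overhang_tokens / tokens_per_minute / minutes_per_working_month
print(f"with +4 OOMs: ~{months:.1f} working months on a single problem")   # a few months
```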
I think the other path is an unhobbling path, where it needs to learn this System 2 process, and if it can learn that System 2 process, it can use millions of tokens, think through them, and stay coherent. Here's an analogy: when you drive, most of the time you're on autopilot, you're just driving and doing fine. But sometimes you hit a weird construction zone or a weird intersection, and sometimes I'm like, to my girlfriend in the passenger seat, ah, be quiet for a moment, I need to figure out what's going on. That's when you go from autopilot to System 2 jumping in, and you're thinking about how to do it. Scaling is improving that System 1 autopilot, and that's the brute-force way to get to agents: you just keep improving that system. But if you can get that System 2 working, then I think you could quite quickly jump to this more agentified regime where the test-time compute overhang is unlocked.
Speaker 2
What's the reason to think that this is an easy win, in the sense that, oh, there's some loss function that easily enables you to train the system to do System 2 thinking? There are not a lot of animals that have System 2 thinking; it took a long time for evolution to give us System 2 thinking. The pre-training I get: you have trillions of tokens of internet text, you match that, and you get all these pre-training capabilities. What's the reason to think that this is an easy unhobbling?
Speaker 1
Yeah, so, okay, a bunch of things. First of all, pre-training is magical. It gave us this huge advantage for models of general intelligence, because you just predict the next token. But, and this is a common misconception, what predicting the next token does is let the model learn these incredibly rich representations. These representation-learning properties are the magic of deep learning: instead of learning just some statistical artifacts, the model learns these models of the world. That's also why they can generalize, because they learned the right representations. So you pre-train these models and you have this raw bundle of capabilities that's really useful, this almost unformed raw mass. And the unhobbling we've done from GPT-2 to GPT-4 was that you took this raw mass and RLHF'd it into a really good chatbot. That was a huge win: in the original InstructGPT paper, I think, the RLHF'd model versus the non-RLHF'd model was like a 100x model-size win on human preference ratings, and it started to be able to do simple chain of thought and so on. But you still have this advantage of all these raw capabilities, and I think there's still a huge amount you're not doing with them. And by the way, I think this pre-training advantage is also the difference with robotics. People used to say robotics was a hardware problem, and I think the hardware stuff is getting solved, but the thing missing right now is this huge advantage of being able to bootstrap yourself with pre-training. You don't have all this unsupervised learning you can do; you have to start right away with RL, self-play, and so on. All right, so now the question is why some of this unhobbling, RL, and so on might work. And again, there's this advantage of bootstrapping. Your Twitter bio says you're being pre-trained, right? But you're actually not being pre-trained anymore. You were pre-trained in grade school and high school. At some point you transitioned to being able to learn by yourself. You weren't able to do that in elementary school; middle school, probably high school, is maybe when it started, and you needed some guidance; by college, if you're smart, you can teach yourself. And models are just starting to enter that regime. So it's probably a little bit more scaling, and then you've got to figure out what goes on top. And it won't be trivial. A lot of deep learning seems very obvious in retrospect; there's some obvious cluster of ideas, something that seems a little dumb but just works, but there are a lot of details you have to get right. So I'm not saying we're going to get this right away. I think it's going to take a while to really figure out the details.
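For reference, a minimal sketch of the pre-training objective being described: the only training signal is next-token prediction, and the rich representations are a byproduct of optimizing it at scale. This is an illustrative toy (a character-level model in PyTorch), not any lab's actual setup:

```python
# Toy illustration of the pre-training objective: predict the next token, nothing else.
import torch
import torch.nn as nn

vocab_size, embed_dim = 256, 64
text = "predicting the next token is the whole objective"
tokens = torch.tensor([ord(c) for c in text])          # crude character-level "tokenizer"

# A deliberately tiny stand-in for a real transformer: embed, then project to vocab logits.
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))

inputs, targets = tokens[:-1], tokens[1:]              # shift by one: predict token t+1 from token t
logits = model(inputs)
loss = nn.functional.cross_entropy(logits, targets)    # the entire pre-training loss
loss.backward()                                        # a real run loops this over trillions of tokens
```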
Speaker 2
A while for you is, like, half a year or something?
Speaker 1
I don't know. I think... not next month. Six months.
Speaker 3
Between six months and three years, you know.
Speaker 1
But, you know, I think it's possible. And I think this is also very related to the issue of the data wall. One intuition on learning by yourself: in pre-training, the words are just flying by. It's like the teacher is lecturing to you, the words are flying by, and the models are just getting a little bit from each one. But that's not what you do when you learn by yourself. When you learn by yourself, say you're reading a dense math textbook, you're not just skimming through it once; you wouldn't learn that much from it. Maybe some wordcels just skim through, reread and reread the math textbook, and memorize it, which is like just repeating the data. What you actually do is read a page, think about it, have some internal monologue going, have a conversation with a study buddy, try a practice problem, fail a bunch of times, and at some point it clicks and you're like, this made sense. Then you read a few more pages. We've bootstrapped our way to models just starting to be able to do that: read it, think about it, try problems. And the question is whether all this self-play, synthetic data, RL is what makes that work. It's basically translating in-context learning into the weights. Right now, in-context learning is super sample-efficient; in the Gemini paper, the model just learns a language in context. Pre-training, by contrast, is not at all sample-efficient. What humans do is in-context learning: you read a book, you think about it, eventually it clicks, but then you somehow distill that back into the weights. And in some sense, that's what RL is trying to do. RL is super finicky, but when RL works, it's kind of magical, because it's the best possible data for the model. It's like when you try a practice problem, fail, and at some point figure it out in a way that makes sense to you. That's the best possible data for you, because it's the way you would have solved the problem. That's what RL is, rather than just reading how somebody else solved the problem in a way that doesn't initially click.
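As a very loose illustration of the "try a problem, keep what worked, distill it back into the weights" idea described here: a minimal self-training loop in the spirit of rejection sampling on the model's own traces. This is a hypothetical sketch, not anyone's actual training pipeline; `sample_chain_of_thought` and `finetune` are placeholder stand-ins for a real model's sampling and fine-tuning steps:

```python
# Hedged sketch: try problems with a chain of thought, keep only the attempts that reach a
# verifiably correct answer, and "distill" those traces back into the model. The model and
# training calls below are toy placeholders, not a real LLM API.
import random
from dataclasses import dataclass

@dataclass
class Attempt:
    problem: str
    chain_of_thought: str
    answer: str

def sample_chain_of_thought(model_state: dict, problem: str) -> Attempt:
    """Hypothetical stand-in: sample a reasoning trace and a final answer."""
    guess = str(random.randint(0, 30))                      # placeholder "reasoning"
    return Attempt(problem, f"I think the answer is {guess}.", guess)

def finetune(model_state: dict, data: list[Attempt]) -> dict:
    """Hypothetical stand-in: distill successful traces back into the weights."""
    model_state["seen_examples"] = model_state.get("seen_examples", 0) + len(data)
    return model_state

problems = [("7 + 5", "12"), ("9 * 3", "27")]               # toy tasks with checkable answers
model_state: dict = {}

for _ in range(3):                                          # a few rounds of try -> filter -> distill
    successes = []
    for problem, correct in problems:
        for _ in range(16):                                 # many attempts per problem ("practice")
            attempt = sample_chain_of_thought(model_state, problem)
            if attempt.answer == correct:                   # keep only traces that actually worked
                successes.append(attempt)
                break
    model_state = finetune(model_state, successes)

print(model_state)
```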