A Conversation with Val Bercovici about Disaggregated Prefill / Decode

Fabricated Knowledge

What is pre-fill and the KV cache explained

Val and Doug define pre-fill, tokenization, KV matrices, and why KV caches consume gigabytes of working memory.

Transcript

This transcript is lightly edited for readability.

Doug: Welcome to Fabricated Knowledge. This is a podcast edition today, and I have the special honor of having Val Bercovici from Weka. And today, we wanted to specifically highlight an important trend I think everyone should be paying attention to: disaggregated PD. I also wanted to give Val the opportunity to tell the story of Weka and AI-enabled storage today anyway. So take it away, Val.

Val Bercovici: Thanks, Doug. So, Val Bercovici, or Valentin, for those who sometimes see my name online, a long-time infrastructure guy focused on storage for most of my career. I ended up being the CTO at NetApp about 10 years ago, following a cloud acquisition they made called SolidFire.

At that time, I was also actively involved with Google in open-sourcing their Kubernetes project. That was cool; I helped co-create the Cloud Native Computing Foundation (CNCF) under the Linux Foundation. I enjoyed category creation, and I was relatively early in the cloud, early in machine learning, and not so much in generative AI, but I'm catching up on generative AI right now. And yes, I'm at Weka right now, focusing as Chief AI Officer on our AI product strategy and marketing and education. Because, as we're going to talk about, Doug, there are a lot of very new and different things happening in infrastructure when you look at AI workloads. It bears almost no resemblance to traditional CPU-based data centers, and that's what we're going to dive into right now.

Doug: Yeah, so maybe to start, I do think that the important thing is context, right? Things are changing quite rapidly all the time. Let's walk through how inference in a single GPU node was done and what has changed recently to shake up the entire ecosystem with this disaggregated pre-fill / decode. I think a listener needs to understand what pre-fill decode is on the traditional node, and then we can discuss what's changing by splitting the two.

Val Bercovici: Sure, and let's maybe introduce some overall financial context here, and SemiAnalysis is leading this. But estimates are now that about four out of every five dollars of AI is gonna go to inference. So inference is not a trivial thing here; it's a very, very material thing in the future. One of the reasons why inference matters so much is that LLMs, in general, and large reasoning models, in particular, are concerned with this aspect.

LRMs fundamentally use inference-like techniques as part of their test-time computation, which is scaling now in the inference dimension far more than in the legacy pre-training dimension. So, all of these kinds of workloads matter a lot in terms of how infrastructure is allocated and spent, as well as the revenues collected from it. So, specifically on Disagg Prefill: it's a relatively new phenomenon, mostly a 2025 thing.

Some research papers on it were obviously written last year and well before. And it's a way to scale the entire process of inference. And this is specifically, to be technical, not for image or video models, classically known as diffusion generative AI models, but for the more popular text-based, if you will, and video-based as well, transformer models.

So this is a crucial qualifier. But having said that, transformer models are really where a lot of the innovation is happening and new features are happening right now, and a lot of the focus is on engineering and efficiency.

So, Disagg Pre-Fill / Decode, to open up the topic, one of the simpler ways to start is to consider the following: if you're not necessarily a layperson, but you're not particularly AI savvy, and you've just been in tech for a while, think of zip files. Please consider the process of archiving data, then compressing and decompressing it. So a lot of AI discussions focus on encoding.

That is optimizing the training of models and getting that loss function down and getting this fit right between underfit and overfit. That's basically what it takes to compress the internet, as we like to say in the industry semantically, right? That's fundamentally what a lot of AI models are, compressing some fraction, or if you're OpenAI, most of the world's knowledge online into a multi-gigabyte file that gets loaded into these GPUs.

So the encoding part is the compressing part, and inference is all about the decoding. It's all about how you decompress that model and how you do that by serving the model hundreds and thousands of times across hundreds and thousands of GPUs and how you decompress that model for every user. So every time you and I sit down at a ChatGPT session, or more often now, how an agent consumes these AI models on your behalf if you're running Cursor or other popular AI apps, the Manus AI agent, for example, how that decoding is happening.

And that decoding now is really very different from the unzipping or unarchiving of old days, fundamentally because, if we were to use a zip example, that was a CPU and was traditionally a single core. So it was a single-threaded or single-instance process focused on compression. Therefore, it's a very serial process for compressing the data, and a similarly serial process for decompressing the data.

Lots of those bottlenecks are with regard to just a CPU serially waiting on memory and storage. With GPUs, again, maybe the main takeaway for everyone here is that nothing happens in serial on GPUs. The whole point of GPUs, the reason we spend an obscene amount of money on GPUs in AI data centers today, is that everything wonderful happens in parallel. The only reason Gen AI in particular exists now is that it's a very parallel process run in these GPU kernels.

And so the process of encoding the data into a model and, particularly as we're talking about today, the process of decoding the data from these models as we serve them at inference time is a very, very parallel process. And given the way these models work, asking the GPUs to do different things in parallel turns out to be incredibly inefficient.

So, let me give you a tangible example of that, and we'll dive into the details of why the reality of the market is the way it is today. When you go to OpenAI's pricing page, Google's pricing page, Anthropic, Together AI, and other popular model providers, you'll find that they have three classes of pricing.

One is the price per million input tokens. One is the price per million cached input tokens. And one is the price per million output tokens. This caching class of pricing, which goes by multiple names, including context caching and prefix caching, was introduced about six to nine months ago because people were discovering that it was very inefficient to reprocess the same data repeatedly.

If you can process cached data, it is much more efficient. And we've seen OpenAI now join the pricing wars, along with Google and DeepSeek, setting the stage for a race to the bottom. Some great reports that SemiAnalysis wrote, you know, going back even two years ago. There's kind of an industry standard of about a 75% discount to the user or to the API caller, whether it's a developer or an agent, if you use a cached input pricing option. But there's a giant caveat.

That is, all of these providers, as they're inferring, will only give you about five to 15 minutes. They can't even specify beyond that: five to 15 minutes of prefix or context caching before you lose the context, you lose the cache. And so why is that?
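To put rough numbers on the cached-input discount and the short cache lifetime described above, here is a minimal sketch. The price, the 75% discount, and the 5-minute window are illustrative assumptions, not any provider's actual rates, and `session_cost` is a hypothetical helper that treats the whole prompt as cacheable and restarts the window on every request.

```python
def session_cost(turn_times_s, prompt_tokens, price_per_m_input,
                 cached_discount=0.75, ttl_s=300):
    """Input-token cost of a session that resends the same prompt on every turn.

    turn_times_s: request timestamps in seconds, in ascending order.
    A turn is billed at the cached rate only if the previous turn happened
    no more than ttl_s seconds earlier (assumed cache lifetime).
    """
    full_price = price_per_m_input / 1_000_000           # dollars per token
    cached_price = full_price * (1 - cached_discount)    # discounted rate
    total = 0.0
    last_seen = None
    for t in turn_times_s:
        hit = last_seen is not None and (t - last_seen) <= ttl_s
        total += prompt_tokens * (cached_price if hit else full_price)
        last_seen = t
    return total


# Three turns over a 100,000-token document at an assumed $2 per million tokens.
quick_turns = session_cost([0, 60, 120], 100_000, 2.0)     # turns one minute apart
slow_turns = session_cost([0, 1000, 2000], 100_000, 2.0)   # cache expires each time
print(f"quick: ${quick_turns:.2f}, slow: ${slow_turns:.2f}")
```

Under these assumptions the session that keeps the cache warm pays full price once and the discounted rate afterwards, while the slow session pays full price on every turn.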

Let's dive a little bit deeper into how this decoding and inferencing work. The way it works is that you build a working memory. So as you're inferencing a model, as you're decoding the model, you actually can't jump into the decoding part, right? That's the valuable part. That's where those reasoning tokens pop up.

Doug: I'll slightly interject here. I want to make sure that we have this, just because, Val, you're going pretty quickly. It's a hard space. But specifically, let's discuss prefill and decode, right? Because we're talking about it, we're throwing these words around. On the prefill side, I think it's helpful to understand that that's compute-bound, and the decode side is memory-bound. And so traditionally, this was all done on one node, on one GPU; you know, this is essentially a virtualization problem. What we're talking about now is the shift to the understanding that the ratio of pre-fill, which I believe is the input side, to decode, the output tokens, is exploding, so that you're talking 10 to 1, 20 to 1 of these tokens that are moving. And so what people are finding out is that decoding is becoming a big issue in inference scaling. And so, maybe that context is helpful, because I feel like we're diving in very quickly, and I wanted to make sure we...

Val Bercovici: I was jumping ahead to illustrate the fact that most people skip the prefill, right? They skip this fundamental critical path item in how inferencing works overall because we're used to paying for input and output. So what is this prefill thing? Prefill is literally how we prepare to do the decode. Decode in transformer models can't happen without prefill. So, what is prefill?

It's a bit challenging to fathom if you're not in the machine learning or AI space, but it's utilizing a significant portion of the data that we have at inference time, the data we want to process. As we're doing inference, we, you know, upload a complex legal PDF, or upload terms of service for the stuff we sign up for every day, or upload some complex documentation, and start to ask questions about it, right? Specifically. So we're not asking about general world knowledge. We're not asking about mathematical or scientific knowledge. I want to know if I can get my insurance carrier actually to cover this particular claim. And it's a very long and complicated document. So that's a very good example of what it takes to do inference.

So, prefill is taking your question about that document as well as the document itself. And after it's tokenized, after we convert the words, charts, and graphs into these AI tokens that we all use, we have to put context around these tokens. And we have to vectorize these tokens. And again, hard to fathom, but we add anywhere from 10,000 to 20,000 dimensions to each of these tokens. Because every word will have a different meaning based on its context, right? The C word here is everything: context. So, prefill is all about adding all these dimensions of context and creating these query, key, and value (KV) matrices, which take kilobytes of these tokens and expand them by six orders of magnitude, way past megabytes, to gigabytes. We're approaching gigabytes of working memory for this key-value cache, also known as a KV cache, which has to be stored in memory.

It can't be stored because we're accessing it all the time. The only way that...
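As a back-of-the-envelope check on the jump from kilobytes of text to gigabytes of KV cache, here is a minimal sketch of the usual sizing arithmetic for a transformer with standard multi-head or grouped-query attention. The layer count, head count, head dimension, and precision below are hypothetical values chosen for illustration, not the configuration of any model named in this conversation.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Bytes needed to hold keys and values for num_tokens of context.

    For each token, every layer stores one key vector and one value vector
    per KV head, hence the leading factor of 2.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
size = kv_cache_bytes(num_tokens=100_000, num_layers=80,
                      num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB of KV cache for a 100,000-token context")
```

With these assumed dimensions each token costs 327,680 bytes of cache, so a long document of about 100,000 tokens needs roughly 33 GB for a single session.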

Doug: We should also explain KV cache, right? Because I know we often use these phrases, I think it might be helpful to discuss them. KV being key-value cache, right?

Val Bercovici: Yeah, these are multi-dimensional structures. So, the keys and the values themselves are multi-dimensional matrices, and they each consume, again, gigabytes of data as they're transformed from words, charts, and graphs into tokens, and ultimately from tokens into the keys and values. And it is how the algorithms of inference work. These algorithms are referred to as attention algorithms. One of the most popular ones for NVIDIA processors is FlashAttention.

And all of this ultimately involves matrix multiplication. And so that's why we've gone to these very, I think, you know, fortunately timed graphics processors, which always did the matrix multiplications to render pixels on screen. We had these arithmetic logic units, ALUs, in these graphics processors, and we had thousands upon thousands of them. And NVIDIA wisely recognized that machine learning and Gen AI use very large matrix multiplications. Now they are adding more and more of these tensor cores, not just regular GPU cores, to their GPUs. These tensor cores specialize in performing large matrix multiplications at a more AI-optimized precision.

Ironically, and I'm not sure if we want to go down this rabbit hole, but AI doesn't require as much precision as traditional graphics or high-performance computing. So, lower precision, same results, basically, same general level of accuracy, but way faster processing. So the flash attention algorithms essentially are what makes AI work on GPUs.

Doug: So let's specifically get back to pre-fill and decode, and maybe I want to do some paraphrasing here. So, essentially, when you run an inference workload on a GPU, what happens is you load all the key values into the KV cache in memory. This memory is often HBM. The reason why HBM is in such high demand is that you want this to be as large as you can, essentially. Then pre-fill the weights. And that's the compute-bound part. But then after that, you're just running the inference over and over and over. You submit a query, and then the model generates the results. So that's the decode portion. And that is much more memory-bandwidth limited. I wanted to confirm that's the high-level summary. Is that correct?

Val Bercovici: Exactly. And this is very much akin to buying our laptops today, right? We always want to buy as much DRAM, or memory, on our laptops as possible. The same rules always apply to servers for making databases run faster. It's really critical to this process of inference with GPUs. And there are three key tiers of memory we want to discuss here, Doug. There's the SRAM, which is really where the algorithms do the actual work with these tensor cores. These precious tensor cores on the NVIDIA GPUs do the matrix multiplication.

Then there's the high-bandwidth memory, HBM, which is essentially the working memory, where data is stored and fed into SRAM to facilitate the actual matrix multiplications. However, the challenge remains that we can never afford as much memory as we would like on our laptops or servers. It's the same challenge we see on GPUs.

On the GPU package itself, there are all the GPU cores and the tensor cores, and we package this high-bandwidth memory alongside them; the real estate is just very finite. We can only get, you know, 100, maybe 200 gigabytes now per GPU of high-bandwidth memory. And because the working memory of these models has to also contend with the weights of these models, by the time you load the model weights for DeepSeek, by the time you begin to create this key-value cache as working memory for the very first user, you're almost out of that high-bandwidth memory, right?
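The squeeze between model weights and KV cache can be sketched as simple budget arithmetic. The HBM capacity, parameter count, precision, and per-session cache size below are hypothetical round numbers for illustration; they are not measurements of DeepSeek or of any specific GPU.

```python
def sessions_that_fit(hbm_bytes, num_params, bytes_per_param,
                      kv_bytes_per_session):
    """How many sessions' KV caches fit in HBM after the weights are loaded."""
    weight_bytes = num_params * bytes_per_param
    free_bytes = hbm_bytes - weight_bytes
    if free_bytes <= 0:
        return 0  # the weights alone do not fit in HBM
    return free_bytes // kv_bytes_per_session


# Hypothetical: 192 GB of HBM, a 70-billion-parameter model at 16-bit precision
# (140 GB of weights), and about 33 GB of KV cache per long-context session.
n = sessions_that_fit(hbm_bytes=192 * 10**9,
                      num_params=70 * 10**9,
                      bytes_per_param=2,
                      kv_bytes_per_session=32_768_000_000)
print(f"{n} long-context session(s) fit alongside the weights")
```

In this assumed configuration only a single long-context session fits next to the weights, which is the situation the conversation describes: the first user nearly exhausts the high-bandwidth memory.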

Memory is just so precious right now, and these models and the working memory are so large, that there's this third tier of memory, which is the DRAM, regular dynamic RAM, you know, DDR5-class memory right now. And that's a memory that's shared on the motherboard across all the GPUs, often eight GPUs on a single motherboard. That's the shared memory. And that typically when you...

Doug: Also, and then for the next generation, because you know the listeners are often semiconductor investors, so they are clued in, right? In the next generation, the Grace portion of Grace Blackwell, the CPU controller, is often connected to banks and banks of DRAM, so that DRAM essentially is the third tier of memory. And so that's where the third tier kind of rolls in.

Val Bercovici: Grace Blackwell, yeah. That is certainly the case for GB200s and even GHs, the Grace Hoppers, which have become a very standard computing unit in the world of AI training and AI inference.

And the software, which we haven't mentioned, the software, which is key here in terms of how inference happens, is now very aware of that. So, the software consists of these inference servers. Historically, it would be NVIDIA's Triton TRT-LLM. For the past year or so, we've seen overwhelming market momentum and interest in the open-source vLLM inference server, as well as its thriving community. And we also see SGLang and other inference servers.

We combine models and these inference servers into large kernels that we load into the GPUs and let them run. So the software is now aware of these three tiers of memory. And there are these things called KV cache managers, key-value cache managers, which manage the three tiers of working memory on behalf of these inference servers so that when you run out of high-bandwidth memory, which is always, you don't have to evict from the KV cache.

You don't have to say after five to 15 minutes, I'm just out of cache. I have to go back and re-prefill all that data, give you that 30-second delay, consume hundreds and hundreds of thousands of watts, and start the KV cache process all over again.

Everyone's trying to minimize and avoid KV cache eviction as much as possible. We frequently observe KV transfers between GPUs and GPU servers to move memory around until it's exhausted, and then we have to re-prefill. One of the fundamental aspects here is that we want to minimize the time spent pre-filling by keeping everything in cache across an entire cluster of hundreds or thousands of GPU servers.
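The idea of a KV cache manager that demotes sessions across memory tiers rather than evicting them outright can be sketched as a toy least-recently-used hierarchy. This is an illustration only: `TieredKVCache`, its tier names, and its capacities are hypothetical, and it is not how vLLM, SGLang, or any other inference server mentioned here is implemented.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy KV cache manager: sessions demote from fast to slow tiers, LRU first."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_sessions), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.prefills = 0  # how many times the expensive prefill had to run

    def _insert(self, level, session_id, kv):
        if level == len(self.tiers):
            return  # fell off the slowest tier: evicted, next access re-prefills
        _, cap, store = self.tiers[level]
        store[session_id] = kv
        store.move_to_end(session_id)  # mark as most recently used
        if len(store) > cap:
            victim, victim_kv = store.popitem(last=False)  # least recently used
            self._insert(level + 1, victim, victim_kv)     # demote one tier down

    def get(self, session_id, prefill):
        """Return (kv, source); source is a tier name or 'prefill' on a miss."""
        for _, (name, _, store) in enumerate(self.tiers):
            if session_id in store:
                kv = store.pop(session_id)
                self._insert(0, session_id, kv)  # promote to the fastest tier
                return kv, name
        self.prefills += 1
        kv = prefill(session_id)  # cache miss: rebuild the KV cache from scratch
        self._insert(0, session_id, kv)
        return kv, "prefill"


cache = TieredKVCache([("HBM", 1), ("DRAM", 1), ("NVMe", 1)])
for sid in ["a", "b", "c", "a", "d", "b"]:
    _, source = cache.get(sid, prefill=lambda s: f"kv-{s}")
    print(sid, "served from", source)
print("prefills:", cache.prefills)
```

In this toy run, session "a" is recovered from the slowest tier instead of being re-prefilled, while session "b" eventually falls off the end of the hierarchy and has to be prefilled again. Adding capacity to the lower tiers is what reduces the number of repeated prefills.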

Doug: So this is, I think, okay, let me try to summarize this back, because this is dense stuff. We're talking about how inference is done at a core level, right? We're discussing all the aspects that go into loading the KV cache into memory, including having banks of it, trying not to get evicted, and the software having an understanding and being able to KV manage so it can dynamically address where the memory weights are being stored or held. And that's kind of, I think, the state of the art till today.

However, the reason I have Val on is not just to explain how we infer at all on a GPU. It's the next step. The next step that I think is starting to become increasingly apparent. For instance, the foundational change is something called disaggregated pre-fill and decode.

So, we've been discussing pre-fill as the loading of the KV cache onto the GPU, and then decode as the serving of the model, essentially running a query and then activating all the model weights, which then tells you the result. But importantly, there has been a significant shift, where we're starting to disaggregate it, meaning that it no longer has to be done on a single GPU.

Because we have this KV manager, we can pretty much handle routing, and we can create pooled access to resources to achieve better utilization of a GPU cluster. And that's disaggregated pre-fill decode at the highest level, but I wanted to give the mic over to Val to explain, in a little more depth, what exactly it is and why it's such a big deal. Can you talk about the amortization of the KV cache, like how many users can share it, and the difference between doing pre-fill on one GPU versus decode on many? Just kind of walk through the whole thing and what it will hopefully be able to enable.

Val Bercovici: Absolutely. So it was DeepSeek that, at least publicly, introduced that new class of pricing, cached input pricing, last year. They wrote a paper about it and disclosed how they do it, a few months ago, during their infamous six days of open-source disclosures. But it takes the concept, and we can go back to the early days of cloud and Snowflake. One of the reasons Snowflake became so popular is that they said, you know what, you don't have to have the same Amazon instance, cloud instance, to do both your data processing and your data storage.

You could decouple, which was a term they would use back then, your storage from your processing, scale those differently, pay for those differently, consume those differently, as you have different ratios of, you know, storage work and processing analytics work. The same thing is now happening in the AI inference space, with disaggregated, rather than decoupled, pre-fill and decode.

And so what that means now is the process of pre-filling, preparing the data to decode, is very GPU-intensive. That's where all those tensor cores kick in and operate full-time. And that is where it makes sense to have your best accelerator, your highest-performance GPU, focus on pre-filling your data and creating the KV working cache for five to 15 minutes or so, so that you can then begin to decode it. And that still is a serial process. You really can't decode until you create the KV cache, and creating the KV cache is very compute-bound.

Now, once you've created that KV cache, the process of, again, getting to your reasoning and finally outputting tokens that you care about is decoding. It's very memory-intensive because of the way these autoregressive transformer models work, which is a very Bayesian approach. All of the next-token prediction is best performed by looking at all the prior tokens just up until the point in time you're creating that next token. So, you have to look back into your memory, into the context, over and over again in parallel millions of times, so that you can make that high-quality next-token prediction.

This is very memory-intensive, the decode. And you're stressing very different parts of a GPU server at that point, right? At decode, you're stressing those three tiers, particularly, you know, the high bandwidth memory and the DRAM portion of KV cache, because you're trying to pull during decode as much memory into the GPU, into the tensor core as fast as possible.
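The compute-bound versus memory-bound split can be illustrated with a simple roofline-style estimate. This sketch assumes a dense model, uses the common approximation of about 2 floating-point operations per parameter per token, and ignores KV cache reads entirely (which would make decode look even more memory-bound). The peak throughput and bandwidth figures are hypothetical round numbers, not the specification of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read in one forward pass of a dense model.

    With ~2 FLOPs per parameter per token and each weight read once per pass,
    the parameter count cancels out of the ratio.
    """
    return 2 * tokens_per_pass / bytes_per_param


def classify(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Compare the workload's intensity with the hardware's ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak and 3e12 bytes/s of HBM bandwidth.
PEAK_FLOPS = 1e15
HBM_BANDWIDTH = 3e12

# Prefill processes the whole prompt in one pass; decode emits one token per pass.
print("prefill, 4096-token prompt:", classify(4096, PEAK_FLOPS, HBM_BANDWIDTH))
print("decode, 1 token per step:  ", classify(1, PEAK_FLOPS, HBM_BANDWIDTH))
```

Under these assumptions a prefill pass does thousands of operations for every byte of weights it reads, while a single-sequence decode step does about one, which is why the two phases stress different parts of the server.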

Any delay there again just keeps that costly asset idle. Therefore, decoding again is a significantly different workload profile in the server compared to pre-fill. And what's making sense right now is dedicating, at scale, banks and banks of GPUs just for pre-fill, that first part of the disaggregation, and then banks and banks of GPUs just for decode. You can optimize the GPUs for pre-fill around processing, and you should be optimizing the GPUs for decode differently, because those are memory-centric and memory-focused. And what that means is you can bring new life into prior generations of accelerators by mixing Blackwells, for example, for pre-fill and Hoppers for decode, and each is performing optimally, each is certainly giving you a certain depreciated value, you know, and current rate of return, and you're not creating any artificial delays. You're keeping everything humming as efficiently as possible by optimizing for what each class of processor and its memory ecosystem is best suited for.

Doug: Yeah, so I want to reiterate this because I think it's really important to understand why this is such a big deal. Nvidia talked about it at GTC. Dynamo was essentially the implementation of some of the work that DeepSeek had already done. This is going to be the key story for inference serving for the rest of this year. I think, and we've got to think about the bigger picture here, it's really important because of what this will enable. We're talking about the inference and the token throughput per single node; this will hopefully be able to add a lot more scaling out, meaning that, you know, the same GPU bottleneck we had will probably ease because of the increase. I don't know, and this is something that I think a lot of people are working on in terms of quantifying. I think we are at SemiAnalysis as well, right?

However, understanding what this unlocks is a really big deal. And essentially, I mean, it's kind of interesting. I don't think it will happen, but it would be effective within inference, right? Within this test-time compute scaling, we're almost unlocking two different types of compute, and that's being done just for inference. If you roll back a year or two ago, everyone was talking about accelerators focused on inference versus training, right? This is just within inference, two different types of computation being done here, and the point is that the heterogeneous output is going to be a lot bigger than what one single node can do. I'm very excited because I think it's going to increase token throughput massively, and I think it's going to improve memory utilization massively.

Val Bercovici: Exactly.

Doug: And then obviously that brings the cost of tokens down. Are there any other things that I should be aware of that are logical takeaways from that?

Val Bercovici: So, you know, one of the first takeaways here is that disaggregated prefill, Disagg Prefill, is introducing the notion of assembly lines to AI factories for the first time, because it's kind of hard to imagine a factory without an assembly line today.

But back in the olden days, when factories were first established, they had clusters of workers coming in and out of a particular work area, such as a car assembly line. The different specialists would go and work on the car, then leave, but the car never moved, and it was very inefficient to have to move the workers in and out. That's exactly how inference happens today before Disagg Prefill.

Assembly lines, as the name implies, essentially mean that we keep the data flowing and keep them flowing through whatever specialty is necessary for processing. So prefill is a very different set of work than decode. And that's why moving data using an assembly line process, the way DeepSeek innovated, at least publicly, is very, very important right now. So that's one of the first takeaways: we're finally entering the era of assembly lines for AI factories on the inference side. Next, of course, is just the nature of the workloads themselves.

Specifically, 2025 and a subsequent year, as an excellent proxy for this, have seen workloads shift from being individual chat session-based to agent-based, which means much more context.

We're talking about processing large volumes of documents, extensive code bases, extensive transcripts, or large videos themselves. And more importantly, we're not just stopping at the first question and answer. We're asking a lot of repeated questions and receiving the same answers. It's called a multi-turn prompt.

This notion of high-context, multi-turn communication is becoming the dominant workload of 2025. And that really presses, you know, or stresses inference far, far more and creates this need to scale inference and decouple the two different natures of the workloads, the pre-fill and the decode, to get to the maximum efficiencies, tokens per second, tokens per watt, and ultimately be able to support more users simultaneously, larger batch sizes on the same GPU asset investment, the same memory deployment, and certainly just the same power budget that every AI data center operates under.

Doug: So, I'm going to transition this to Weka and your potential solution. You know, we're working on some of the verification for all this stuff, but talking about how we scale the decode out to become a lot bigger, because in this process, we're talking about the assembly line and being able to split one portion of inference into one part and another portion of inference into the other.

The thing that's specific to these agentic, really large multi-turn processes is that the KV cache just explodes because, you know, the entire context of your Claude Code or your Cursor session hits the context window every single time, so disaggregating the decode can hopefully, I believe, enable much, much larger context windows.

Can you help me discuss what is currently being done, which, to my understanding, involves DRAM caching, essentially giant pools of DRAM cache? For example, if you have many users at OpenAI, what Weka is trying to do is address and scale out the decode context window.

Val Bercovici: Absolutely. So, you know, let's take one quick step back, one short step back. In an ideal world, since we're essentially creating this working memory in prefill before we can actually use it for the outputs on the decode side, we don't want to have to recreate it repeatedly. A best-case scenario is that we create working memory for every model instance session, and then we just decode forever.

So DeepSeek created a simple formula for this, xPyD, and the open-source community, vLLM in particular, has adopted it as they discuss scaling inference. So, xPyD simply stands for X, how many times you have to pre-fill the data in a working session. So, if your working session extends for more than five or 15 minutes, and you have multiple simultaneous users on the same cluster, that five to 15 minutes gets compressed because you have to support more users on the same hardware and memory. So, you typically have to do many, many pre-fills, and that's where a lot of the waste, slowness, and inefficiency come in. You want to have a certain ratio between X pre-fills and Y decodes. However, the ideal scenario, of course, is 1P: one pre-fill.
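One way to reason about an xPyD ratio is to size each pool to the token load it has to carry. The sketch below assumes x and y count dedicated prefill and decode workers, which is one reading of the notation, and every rate and throughput figure is a hypothetical round number. `plan_xpyd` is an illustrative helper, not an API from vLLM or any other project.

```python
import math


def plan_xpyd(requests_per_s, prompt_tokens, output_tokens,
              prefill_tokens_per_s, decode_tokens_per_s):
    """Return (x, y): prefill and decode workers needed to keep up with load.

    Each pool is sized independently: prefill must absorb the prompt tokens
    arriving per second, decode must emit the output tokens per second.
    """
    prefill_load = requests_per_s * prompt_tokens   # prompt tokens per second
    decode_load = requests_per_s * output_tokens    # output tokens per second
    x = math.ceil(prefill_load / prefill_tokens_per_s)
    y = math.ceil(decode_load / decode_tokens_per_s)
    return x, y


# Hypothetical workload with a 20:1 input-to-output ratio: 10 requests per
# second, 8,000 prompt tokens and 400 output tokens each. Assumed per-worker
# throughput: 40,000 tokens/s for prefill and 1,000 tokens/s for decode.
x, y = plan_xpyd(10, 8_000, 400, 40_000, 1_000)
print(f"{x}P{y}D")
```

Because the two pools are sized separately, a change in the input-to-output ratio or in per-worker throughput moves x and y independently, which is the flexibility that a single combined pool does not offer. If evicted sessions have to be prefilled again, the effective prompt load rises and so does x.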

You pre-fill the data, and then you decode forever. How do you get there? Well, how you get there is by having more or less infinite memory. And we've heard this term from Google, for example, infinite context windows. So, Google has been able to approximate this by using both their TPU, or Tensor Processing Unit, architectures, as well as a ring attention algorithm particular to Google and Gemini.

And they've been able to take banks and banks of DRAM, network DRAM associated with their TPUs, and give you these million-token, up to 10-million-token context windows. They were first to market with that because they were able to optimize their infrastructure for that. However, the economics of doing so are very stressful, even for companies like Google. And not everyone is Google. Until recently, no one else had million-token context windows. We're just on the cusp of seeing that go mainstream with Facebook, with Llama 4, with MiniMax, which was just released the other week, and various models that will soon support million-token context windows, which everybody wants, by the way.

So, the pressure is on. How do we take these three tiers of memory: SRAM, which is super expensive and super finite; high-bandwidth memory, which, as your readers know all too well, is also relatively expensive and completely finite; and DRAM, which is also theoretically expensive and finite? Well, when you look at the math, DRAM and its cousin, NVMe, non-volatile memory express, on flash devices, are now within striking distance of each other. You know, DRAM in isolation is nanosecond-class latency, NVMe in isolation is microsecond-class latency, but you brought up the Grace chip earlier on, and even, you know, on the Hopper class, whether it's GH or Blackwell, actually doing the transfer between DRAM and HBM is on the order of microseconds, whereas doing the transfer from HBM to SRAM is still on the order of nanoseconds on these GPU servers.

So because now we're looking at a microsecond-level transfer between HBM and DRAM, optimally managing these NVMe NAND flash devices as well is the key to making this all work. And I always like to joke there's no S in NVMe, right? It's non-volatile memory express. So that's one of the things Weka was optimized for doing from day one: taking full advantage of all of these new converging standards, like NVMe, and of the fact, which a lot of people still haven't internalized, that the high-performance compute networks in AI factories that GPUs are attached to are faster than the motherboards themselves.

This is very counterintuitive to most people. There are more PCIe lanes accessible on the network than on any individual GPU server motherboard. And so that means that if you can aggregate all that amazing network bandwidth to the GPUs across your high-performance compute networks, attach NVMe devices on one end, and have each GPU consume them across that high-performance compute network on the other end, you end up with more memory to the GPU, more DRAM particularly, from the network than from the motherboard. And that's what Weka has released with augmented memory technology.

We've been publishing our own benchmarks on this topic in our blogs since February. We recently had our first cloud partner, OCI (Oracle Cloud), publicly publish their own benchmarks and validate these results, showing that you can extend the DRAM class of memory from the motherboard down to the compute network on these GPU clusters. And by leveraging that compute network (we'll dive into what that means in contrast to the storage network), you're able to have effectively limitless DRAM, which now means limitless KV cache, and ultimately, infinite context windows for everybody else outside of Google.
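To make the link between context length and KV cache size concrete, here is a rough back-of-the-envelope sketch. It uses the standard estimate of one key vector and one value vector per layer, per KV head, per token. The model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example chosen for illustration, not a figure from this conversation.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache bytes for one sequence.

    Each token stores a key vector and a value vector (the factor of 2)
    for every layer and every KV head.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


# Hypothetical model shape, assumed for illustration only.
shape = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

for n_tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(n_tokens, **shape) / 2**30
    print(f"{n_tokens:>9,} tokens -> {gib:8.1f} GiB of KV cache")
```

Under these assumed dimensions a single million-token sequence needs hundreds of gibibytes of KV cache, which is why the capacity of the tier that holds it matters so much.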

Doug: So yeah, that's the engineering that will make the infinite context window. And to be clear, context windows are exploding already. The context windows of leading-edge models, for example, are often larger than the stated numbers available in the API, which is necessary for some fine-tuning, as well as for some of the system prompts that are incorporated into the models. I think, yeah, that's kind of the Weka solution here.

I would actually like to use the rest of this time to transition to a brief history of storage, as Val has been around the block, and it would be a real shame for me not to discuss the other companies.

One that everyone is probably aware of is Pure Storage. They're publicly traded. They're, you know, the kings of network-attached storage. I would love to discuss this transition, moving away for a moment from disaggregated pre-fill / decode, which, to be clear, is a significant change. This is how we're going to scale much bigger.

Now I want to talk about how the network-attached storage space itself has transitioned from, you know, the history of the past to now, because I think that would be a very interesting arc to go through with my listeners.

Val Bercovici: Absolutely. So let's go all the way back, even though storage predates network attached storage, but clearly network attached storage is a big deal. NetApp was born in the network-attached storage era.

They kind of optimized storage not for the storage media, which back then was only what we call spinning rust, now hard drives, which have platters that spin and heads that actually go back and forth across the platters to extract the data. NetApp was born in that era, optimized for that environment, and, you know, became the brand that it is today.

Pure, which you brought up earlier on, was born of the flash era, of the NAND flash era. And so Pure realized, you know, one fundamental thing, which was, hey, all of a sudden this NAND flash media that we're having terabytes and terabytes of in our storage arrays is worth more than this CPU-based controller that controls all of it and presents, you know, these protocols, these SAN and NAS and other protocols, to the servers and other end users.

So just by realizing that, you know, there was an advantage to engineering the NAND flash and the NAND flash shelves, and then creating this really brilliant program, the Evergreen program, just to let people upgrade controllers because they were the lower-priced item compared to the media, Pure, you know, became the brand that it is today in the SAN space and in the flash storage space. They were born in this era, understood the supply chain, understood the ultimate customer value that is fundamentally different from spinning media, the older hard drive generation of technology, and emerged to be the leader they are today.

Doug: So, I know Pure Storage is the leader in, let's call it, legacy compute. I'm sorry to call it legacy compute. I know the CPU guys probably hate it, but let's be real, it's legacy computing, right? Let's discuss what that looks like in the future. Because I think there have been a few others, DDN, VAST, Weka, these newer challengers who are entering this space specifically with a focus on the opportunity from the newer compute world, which is mostly a GPU-driven architecture. So let's talk about that. You know, you can talk about your competitors, whatever you want to do. I'd just like a market ecosystem overview, obviously with the caveat that you work for Weka.

Val Bercovici: Yeah, you know, let's take a simple visual that I think a lot of us have seen. We've all encountered the concept of exponential data growth. And we've probably seen some of the crazy charts. I'm a visual guy, so looking at the hockey stick charts for exponential data growth, that's no longer a future state; that's a present state.

We're in the steep part of this hockey stick curve, and exponential data growth means just that. There's just an enormous amount of data being generated synthetically now, as well as organically by the world and by all of our systems. The vast majority, 80-90% of it, is unstructured data.

So the legacy market, the legacy compute and storage market that we talk about now jokingly, was not optimized for unstructured, explosive, exponential data growth. It was optimized for transactions, Oracle databases, DB2 databases, SQL servers worldwide, structured data, and latency-sensitive data. And that was paired with more analytical systems, which provided more horsepower, more CPUs, and more memory, all brought to bear in the cloud. This was the dominant workload of the cloud, creating these analytical systems to extract meaning and value from all this transactional data.

Enter GPUs. And again, GPUs just are fundamentally different than CPUs. The stress and workload on GPUs bear no resemblance to what a database transaction looks like. To process all of this massive, exponentially growing data in parallel, we need fundamentally different storage systems and they're not storage arrays at all.

They have to scale out much, much more horizontally. And let me know if we're gonna have a new member in the group. That's all good. To address exponential data growth, you need all your storage systems to work in parallel. They have to be scaled out from the start. They have to be optimized for unstructured data.

And unstructured data doesn't mean just one workload, one storage workload. It means large file reads and small file reads. It means both random access and sequential access. It means focusing on billions and billions of directories and trillions and trillions of files in those directories. It means object interfaces, S3 protocol interfaces to millions and trillions of buckets. It means so many different things in parallel compared to what we've seen in the past for CPU-based storage.

So the fundamental systems are just built differently. Weka, in particular, doesn't build storage arrays, never really has. Many people like to think that we build bigger storage arrays. That's not true. We've had this software-defined containerized system from day one. Again, we're born not just of the cloud era, but of the AI era, just as NetApp was born of the NAS era and Pure of the flash era. We're AI-native; we're very GPU-native systems.

So we call this a mesh now, just a mesh of containers that fundamentally take advantage of one reality, which is that the network in GPU computing and AI factories is faster than a motherboard. Weka is the first storage system and the first storage cluster designed for this reality. No one else is. Everyone else optimizes around either individual storage controllers or, like most of our competitors, clusters of storage controllers that front-end the actual high-performance NAND flash and storage media.

Weka doesn't work that way, right? One of the most radical examples of how Weka differs is something we call converged mode, where we deliver software-defined memory. That's an oxymoron for most people, but then again, software-defined networking was very heretical in the earliest days.

Software-defined storage became somewhat less heretical but still novel when it emerged. Software-defined memory today is very heretical, but people will realize soon that when you're buying these banks and banks of GPU servers, they come, of course, with GPUs, and they come, of course, with all three classes of memory we talked about. But they also come with NVMe drives, often eight NVMe drives per server.

Since GPUs, particularly at inference time as well as training, often work in clusters themselves, when you have eight GPU servers, each with eight drives, or, let's pick nine GPU servers to talk about NVL72s in particular, with eight GPUs per server and often about eight drives per server, you get about 72 NVMe devices per NVL72.
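The arithmetic here can be written out as a small sketch. The server, GPU, and drive counts are the ones given in the conversation; the per-drive capacity is a hypothetical value assumed only to show how the pooled NVMe capacity would scale.

```python
# Counts as described in the conversation for an NVL72-style rack.
servers_per_rack = 9
gpus_per_server = 8
drives_per_server = 8

# Hypothetical per-drive capacity, assumed for illustration only.
drive_capacity_gb = 7_680

gpus = servers_per_rack * gpus_per_server
drives = servers_per_rack * drives_per_server
raw_capacity_gb = drives * drive_capacity_gb

print(f"GPUs per rack:         {gpus}")
print(f"NVMe drives per rack:  {drives}")
print(f"Raw NVMe capacity:     {raw_capacity_gb / 1_000_000:.2f} PB")
```

With that assumed drive size, the pooled NVMe capacity of one rack lands at a fraction of a petabyte.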

Installing Weka software on that instantly creates software-defined memory because we take those 72 NVMe devices and convert them into a DRAM class of memory. And these inference servers we spoke of, vLLM, TRT-LLM, SGLang, now understand how to recognize the thousand times more density, the terabytes per device of memory, to complement the very limited number of terabytes of DRAM: really, fractions of petabytes of NVMe capacity alongside the terabytes of DRAM capacity and the gigabytes of HBM capacity. And all of a sudden, we have, in theory, a mechanism that gets you to this Nirvana ideal state of pre-filling that working memory that the KV cache wants after you've loaded the model weights, never having to pre-fill again, never having to evict cache, and decoding forever. And we can essentially fast-forward AI inference to just decoding.
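The idea of never evicting the cache can be illustrated with a toy sketch. This is not Weka's implementation or any inference server's API; `TieredKVCache` and its methods are hypothetical names. It models a small fast tier (standing in for HBM/DRAM) that spills least-recently-used entries to a large slow tier (standing in for NVMe-backed memory) instead of discarding them, and counts how often a session's KV cache has to be recomputed by pre-fill.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: a small fast tier that spills to a large slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for HBM/DRAM, kept in LRU order
        self.slow = {}             # stands in for NVMe-backed memory; unbounded in this toy
        self.prefills = 0          # number of times the KV cache had to be recomputed

    def _put_fast(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value  # spill instead of discarding

    def get(self, key, prefill):
        """Return the cached value for key, running prefill(key) only on a true miss."""
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            value = self.slow.pop(key)      # promote from the slow tier, no recompute
        else:
            self.prefills += 1
            value = prefill(key)
        self._put_fast(key, value)
        return value


cache = TieredKVCache(fast_capacity=2)
for session in ["a", "b", "c", "a", "b", "c"]:
    cache.get(session, prefill=lambda k: f"kv({k})")
print("prefills needed:", cache.prefills)
```

In this toy run, three sessions rotate through a fast tier that only holds two, yet each session is pre-filled only once because evicted entries are promoted back from the slow tier rather than recomputed.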

Everybody wins, particularly for these Cursor-like, agentic workloads where it's always high context, it's always multi-turn. If we never pause and slow down to re-prefill every 10 or 15 minutes, but we hit the throttle, you know, we put the pedal to the metal, so to speak, nonstop during inference, we get optimal inference. Then what I love is that the token economics, the unit economics of this, all of a sudden make a lot of sense, because we can create more tokens per second in aggregate, generate larger batch sizes, which means supporting more users, and every one of those users gets the lowest-latency time to first and last token. So it's a win-win all around when you can unlock that last final bottleneck of AI inference.

Doug: Yeah, so I think that's a really good ending on what Weka does, as well as, you know, kind of wrapping it all together with disaggregated PD. This is the example. Yeah, I'm pretty excited for what's to happen there. You know, there could be some, well, let's see, I'm not sure if everyone wins, right?

Well, I guess the reality is that if we purchase more GPUs, right or no, if we do more tokens, we create more things. They'll still purchase more GPUs, like me, something that I'm not familiar with, but it doesn't concern me. However, that does sound like GPU utilization is about to reach its limit, right? But that's an aside, right? But I do appreciate, yeah.

Val Bercovici: I would say just to wrap up technically, GPUs were underutilized before. They weren't utilized during the majority of inference, which was the decoding part. Now we truly drive up GPU utilization to its maximum potential.

All of the NVMe, sorry, all of the HBM capacity being manufactured is already allocated and being purchased. All the DRAM capacity, particularly the AI-friendly DRAM, is already being purchased. So this is very bullish for NVMe, the flash SSD ecosystem, particularly the high-performance TLC class of NAND flash. It's very bullish for that because it represents a new life on top of a pretty valuable market opportunity that NVMe has always had.

Doug: To be clear, that's not the official opinion of Fabricated Knowledge or SemiAnalysis, but it's the official opinion of Val. Anyways, I just wanted to wrap it up here. Is there anything else you'd like to say to the listeners, or should we probably leave it at that?

Val Bercovici: It's my opinion that I expect one fundamental change in the AI business, which is again funded by inference. And that is, we're going to see the three classes of pricing, which are input, cached input, and output, collapse very soon to only two classes of pricing, because people will realize that when your cached input tokens no longer have only five to 15 minutes of life, and it could be weeks or months of life or effectively infinite, you don't need to have a distinct tier of pricing.
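A small sketch can show what collapsing three pricing classes into two might look like. All prices below are hypothetical placeholders, not any provider's actual rates, and pricing all input at the cached rate in the two-class case is one possible reading of the point above, not something stated in the conversation.

```python
def request_cost(input_tokens, cached_tokens, output_tokens, price):
    """Cost in dollars of one request; prices are dollars per million tokens."""
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * price["input"]
        + cached_tokens * price["cached_input"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000


# Hypothetical prices, assumed for illustration only.
three_class = {"input": 3.00, "cached_input": 0.30, "output": 15.00}
# Assumed two-class reading: the cache never expires, so all input is priced alike.
two_class = {"input": 0.30, "cached_input": 0.30, "output": 15.00}

# One turn of a high-context agentic session: 100k tokens of context, 500 tokens out.
miss = request_cost(100_000, 0, 500, three_class)       # cache expired, full re-prefill
hit = request_cost(100_000, 100_000, 500, three_class)  # cache still alive
flat = request_cost(100_000, 0, 500, two_class)         # cache lifetime no longer matters

print(f"three classes, cache miss: ${miss:.4f}")
print(f"three classes, cache hit:  ${hit:.4f}")
print(f"two classes:               ${flat:.4f}")
```

Under these assumed prices, the gap between a cache miss and a cache hit is what a separate cached-input tier exists to capture; if the cache effectively never expires, that gap disappears.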

So this is gonna fundamentally shift the business of AI and the unit economics of AI, simplify the pricing, and give more value to users. And it's a question of which provider will lead the charge and define that new low class of pricing, and who will be the followers.

Doug: I'm excited to see how it shakes out. Thanks for the time, Val. Appreciate having you on.

Val Bercovici: Likewise, love the conversation.

That’s it for this week! Thanks for reading!
