Fabricated Knowledge

A Conversation with Val Bercovici about Disaggregated Prefill / Decode

Jul 7, 2025
Val Bercovici, Chief AI Officer at Weka and a veteran in storage infrastructure, delves into the fascinating world of disaggregated prefill and decode for AI workloads. He discusses how inference is now taking center stage in AI spending and the challenges it presents. Bercovici explains how the split between prefill and decode enhances GPU utilization and efficiency, likening it to an assembly-line architecture. He highlights Weka's innovative solutions in software-defined memory, which promise to revolutionize memory management and scalability in AI-heavy environments.
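To make the "assembly line" concrete, here is a minimal toy sketch in Python (illustrative only; the names and data structures are assumptions, not Weka's or any framework's implementation). Prefill ingests the whole prompt once and produces a KV-cache-like state; decode then generates tokens one at a time from that state, so the two phases can run on separate GPU pools with the cache handed off between them.

    # Toy sketch of disaggregated prefill/decode (illustrative only).
    from dataclasses import dataclass, field

    @dataclass
    class KVCache:
        # Stand-in for per-layer key/value tensors; here just the tokens seen so far.
        tokens: list = field(default_factory=list)

    def prefill(prompt_tokens):
        """Compute-heavy phase: ingest the full prompt in parallel and build the cache."""
        return KVCache(tokens=list(prompt_tokens))

    def decode(cache, max_new_tokens):
        """Memory-bound phase: emit one token per step, appending to the cache."""
        generated = []
        for _ in range(max_new_tokens):
            next_token = f"tok{len(cache.tokens)}"  # placeholder for a model forward pass
            cache.tokens.append(next_token)
            generated.append(next_token)
        return generated

    # Disaggregation: prefill and decode can run on different GPU pools, with the
    # KV cache transferred between them like work moving down an assembly line.
    cache = prefill(["The", "quick", "brown", "fox"])  # prefill pool
    print(decode(cache, max_new_tokens=3))             # decode pool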
INSIGHT

Inference Equals Decoding At Scale

  • Inference is fundamentally a decoding problem where models decompress stored knowledge into outputs.
  • Decoding is highly parallel and stresses memory bandwidth rather than serial CPU compute (rough arithmetic in the sketch below).
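A rough back-of-envelope calculation (all numbers assumed, not from the episode) shows why decode is memory-bound: at batch size 1, every generated token must stream all of the model's weights from GPU memory, so memory bandwidth rather than arithmetic throughput caps tokens per second.

    # Back-of-envelope decode throughput bound (assumed numbers).
    params = 70e9               # assumed 70B-parameter model
    bytes_per_param = 2         # FP16 weights
    hbm_bandwidth = 3.35e12     # ~3.35 TB/s, roughly an H100-class GPU (assumed)

    weight_bytes = params * bytes_per_param        # ~140 GB read per generated token
    tokens_per_second = hbm_bandwidth / weight_bytes
    print(round(tokens_per_second, 1))             # ~24 tokens/s upper bound at batch 1

Batching more sequences amortizes that weight traffic across many tokens, which is why decode efficiency hinges on memory and batching rather than raw compute.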
INSIGHT

GPUs Favor Parallel Decoding

  • GPUs excel because they run massively parallel operations rather than serial CPU tasks.
  • That parallelism makes decoding workloads fundamentally different from serial CPU work and far more efficient on GPUs (toy illustration below).
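As a toy illustration of that parallelism (the shapes below are made up): one decode step is roughly a matrix-vector product per sequence, and batching many sequences turns it into a single large matrix-matrix product that a GPU spreads across thousands of cores, where a CPU would grind through it largely serially.

    # Toy batched decode step (made-up sizes); NumPy stands in for a GPU kernel.
    import numpy as np

    hidden, vocab, batch = 1024, 8192, 64
    W = np.random.randn(hidden, vocab).astype(np.float32)   # output projection (assumed)
    h = np.random.randn(batch, hidden).astype(np.float32)   # one hidden state per sequence

    logits = h @ W          # all 64 sequences advance one token in a single parallel op
    print(logits.shape)     # (64, 8192)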
INSIGHT

Why Cached Pricing Appeared

  • Model providers price inputs, cached inputs, and outputs differently because repeated prefill is expensive.
  • Cached input pricing emerged to avoid reprocessing the same context on every request (worked cost example below).
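A simplified, hypothetical cost example (the prices below are assumptions, not any provider's actual rates): a chat application that resends the same long context on every turn pays full prefill price for it each time, unless the provider can bill those tokens at a cheaper cached-input rate because it reuses the stored KV cache.

    # Hypothetical per-token prices (assumed for illustration only).
    price_input  = 3.00 / 1e6    # $ per regular input token
    price_cached = 0.75 / 1e6    # $ per cached input token
    price_output = 15.00 / 1e6   # $ per output token

    context_tokens, new_tokens, output_tokens, turns = 20_000, 200, 500, 50

    # Without caching, the full context is re-prefilled and billed at the regular rate every turn.
    no_cache = turns * ((context_tokens + new_tokens) * price_input
                        + output_tokens * price_output)
    # With caching, the repeated context is billed at the cheaper cached rate.
    with_cache = turns * (context_tokens * price_cached
                          + new_tokens * price_input
                          + output_tokens * price_output)
    print(f"no cache: ${no_cache:.2f}  with cached inputs: ${with_cache:.2f}")

Under these assumed numbers the cached-input version costs roughly a third as much over 50 turns, which is the economic pressure behind cached input pricing.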