AI + a16z

Inferact: Building the Infrastructure That Runs Modern AI

Jan 22, 2026
Simon Mo and Woosuk Kwon, co-founders of Inferact and core maintainers of the vLLM inference engine, dive into the complexities of modern AI infrastructure. They discuss how vLLM grew out of Berkeley research to improve large language model deployment, and the challenges of scheduling and managing diverse model architectures for efficient inference. They also share their vision for a universal inference layer that supports any hardware or model, emphasizing open-source collaboration as a driver of innovation.
INSIGHT

Inference Became The Hard Systems Problem

  • Inference has become as hard as building models, because requests arrive unpredictably and continuously.
  • Modern LLM serving raises new systems problems, such as scheduling and memory management at scale.
ANECDOTE

Side Project Grew Into vLLM

  • Woosuk started optimizing a slow OPT demo service in 2022 as a side project, and it grew into research and open-source work.
  • That curiosity-led effort evolved into the vLLM project and the PagedAttention paper (a sketch of the idea follows below).
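For context, here is a minimal Python sketch of the core PagedAttention idea: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, analogous to virtual-memory paging. All names here (BLOCK_SIZE, BlockAllocator, Sequence) are hypothetical illustrations, not vLLM's actual API.

```python
# Toy illustration of PagedAttention-style KV-cache management: memory is
# committed block by block as tokens arrive, instead of reserving the full
# maximum sequence length up front. Names are hypothetical, not vLLM's API.

BLOCK_SIZE = 16  # tokens stored per KV-cache block


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)


class Sequence:
    """Tracks one request's block table: logical block index -> physical block."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the current one fills up,
        # so memory is committed on demand rather than reserved in advance.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()


# Usage: 20 tokens commit only ceil(20 / 16) = 2 blocks, freed on completion.
alloc = BlockAllocator(num_blocks=8)
seq = Sequence(alloc)
for _ in range(20):
    seq.append_token()
seq.free()
```

The payoff of this scheme is that unused tail capacity is never stranded inside a sequence's reservation, so many more concurrent requests fit in the same GPU memory.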
INSIGHT

LLM Workloads Are Fundamentally Dynamic

  • Autoregressive LLMs are dynamic: input and output lengths vary widely, making static batching and fixed tensor shapes ineffective.
  • Serving LLMs requires treating per-token steps and unpredictable lengths as first-class concerns, as in the continuous-batching sketch below.
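To make the dynamics concrete, here is a minimal sketch of continuous (per-token) batching, one common answer to this problem: the batch is re-formed at every decode step, so finished sequences free their slots immediately and waiting requests are admitted mid-flight. The model_step function and the request dictionaries are hypothetical stand-ins, not vLLM code.

```python
# Minimal sketch of continuous batching: the scheduler revisits batch
# membership at every decode step instead of waiting for a whole batch
# to finish. All names and structures here are illustrative assumptions.
import random
from collections import deque


def model_step(batch: list[dict]) -> list[int]:
    """Stand-in for one forward pass: returns one new token id per sequence."""
    return [random.randrange(32000) for _ in batch]


def serve(requests: deque, max_batch: int, eos: int = 0) -> None:
    active: list[dict] = []
    while requests or active:
        # Admit waiting requests whenever the batch has room.
        while requests and len(active) < max_batch:
            active.append(requests.popleft())

        # One decode step for the whole batch; each sequence advances one token.
        for seq, tok in zip(active, model_step(active)):
            seq["output"].append(tok)

        # Retire sequences that hit EOS or their length budget; their slots
        # are reused on the very next step, not at a batch boundary.
        active = [s for s in active
                  if s["output"][-1] != eos and len(s["output"]) < s["max_tokens"]]


# Example: three requests with very different length budgets share two slots.
reqs = deque({"output": [], "max_tokens": n} for n in (4, 16, 64))
serve(reqs, max_batch=2)
```

Because each sequence's length is unknown until it emits EOS, step-level scheduling like this keeps the batch full, whereas a static batch would idle on the slots of early finishers.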