
AI + a16z Inferact: Building the Infrastructure That Runs Modern AI
Jan 22, 2026

Simon Mo and Woosuk Kwon, co-founders of Inferact and core maintainers of the vLLM inference engine, dive into the complexities of modern AI infrastructure. They discuss how vLLM originated from Berkeley research to improve large language model deployment, and the challenges of scheduling requests and managing memory across diverse model architectures for efficient inference. They also share their vision for a universal inference layer that supports any hardware or model, emphasizing the importance of open-source collaboration for innovation.
AI Snips
Inference Became The Hard Systems Problem
- Inference has become as hard a problem as building the models themselves, because requests arrive unpredictably and continuously.
- Modern LLM serving forces new systems problems, such as scheduling and memory management at scale (sketched in the code below).
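To make the memory-management point concrete, here is a toy sketch of a paged KV cache in the spirit of vLLM's PagedAttention. This is a minimal sketch under assumed names and data structures (PagedKVCache, block_tables, and so on are invented for illustration), not vLLM's actual API:

```python
class PagedKVCache:
    """Toy paged KV-cache allocator; illustrative only, not vLLM's real code."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                  # tokens per physical block
        self.free_blocks = list(range(num_blocks))    # pool of physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> its blocks
        self.num_tokens: dict[int, int] = {}          # seq_id -> tokens cached

    def append_token(self, seq_id: int) -> bool:
        """Reserve cache space for one new token; allocate a fresh block only
        when the sequence's last block fills up. Returning False signals the
        scheduler to preempt or queue the request rather than crash."""
        n = self.num_tokens.get(seq_id, 0)
        if n % self.block_size == 0:                  # last block full, or none yet
            if not self.free_blocks:
                return False
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.num_tokens[seq_id] = n + 1
        return True

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)
```

Because blocks are allocated per token rather than reserved for a worst-case sequence length, many more concurrent requests fit in the same GPU memory, which is the core of the scheduling problem the hosts describe.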
Side Project Grew Into vLLM
- Woosuk started optimizing a slow OPT demo service in 2022 as a side project, and it grew into research and open-source work.
- That curiosity-led effort evolved into the vLLM project and the PagedAttention paper.
LLM Workloads Are Fundamentally Dynamic
- Autoregressive LLMs are dynamic: input and output lengths vary widely, making static batching and fixed tensor shapes ineffective (see the sketch after this list).
- Serving LLMs requires treating per-token steps and unpredictable sequence lengths as first-class concerns.
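A minimal sketch of why per-token scheduling matters, assuming a continuous-batching loop of the kind vLLM popularized; model_step, the request format, and the random stopping rule are all invented for illustration:

```python
import random
from collections import deque

def model_step(batch):
    """Stand-in for one decode step: every running request emits one token;
    returns which requests just finished (e.g. hit EOS or a length limit)."""
    return {req["id"]: random.random() < 0.1 for req in batch}

waiting = deque({"id": i} for i in range(8))   # queued requests
running, MAX_BATCH = [], 4

steps = 0
while waiting or running:
    # Admit work at token granularity, not batch granularity:
    # a slot opens the moment any request finishes.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    done = model_step(running)                 # one forward pass, one token each
    running = [r for r in running if not done[r["id"]]]
    steps += 1

print(f"served 8 requests of unpredictable length in {steps} decode steps")
```

Rebuilding the batch every step is what lets short and long requests share the GPU without a fixed shape: finished sequences leave immediately instead of padding out to the longest request in a static batch.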


