
AI + a16z Inferact: Building the Infrastructure That Runs Modern AI
Jan 22, 2026

Simon Mo and Woosuk Kwon, co-founders of Inferact and core maintainers of the vLLM inference engine, dive into the complexities of modern AI infrastructure. They discuss how vLLM originated from Berkeley research to improve large language model deployment, and the challenges of scheduling requests and managing memory across diverse model architectures for efficient inference. They also share their vision for a universal inference layer that supports any hardware or model, emphasizing the importance of open-source collaboration for innovation.
AI Snips
Inference Became The Hard Systems Problem
- Inference has become as hard a problem as building the models themselves, because requests arrive unpredictably and continuously.
- Modern LLM serving forces new systems problems, such as scheduling and memory management at scale (sketched in the code below).
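To make the memory-management point concrete, here is a toy sketch of a paged KV cache in the spirit of vLLM's PagedAttention. This is a minimal sketch under assumed names and data structures (PagedKVCache, block_tables, and so on are invented for illustration), not vLLM's actual API:

```python
class PagedKVCache:
    """Toy paged KV-cache allocator; illustrative only, not vLLM's real code."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                  # tokens per physical block
        self.free_blocks = list(range(num_blocks))    # pool of physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> its blocks
        self.num_tokens: dict[int, int] = {}          # seq_id -> tokens cached

    def append_token(self, seq_id: int) -> bool:
        """Reserve cache space for one new token; allocate a fresh block only
        when the sequence's last block fills up. Returning False signals the
        scheduler to preempt or queue the request rather than crash."""
        n = self.num_tokens.get(seq_id, 0)
        if n % self.block_size == 0:                  # last block full, or none yet
            if not self.free_blocks:
                return False
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.num_tokens[seq_id] = n + 1
        return True

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)
```

Because blocks are allocated per token rather than reserved for a worst-case sequence length, many more concurrent requests fit in the same GPU memory, which is the core of the scheduling problem the hosts describe.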
Side Project Grew Into vLLM
- Woosuk started optimizing a slow OPT demo service in 2022 as a side project, and it grew into research and open-source work.
- That curiosity-led effort evolved into the vLLM project and the PagedAttention paper.
LLM Workloads Are Fundamentally Dynamic
- Autoregressive LLMs are dynamic: input and output lengths vary widely, making static batching and fixed tensor shapes ineffective (see the sketch after this list).
- Serving LLMs requires treating per-token steps and unpredictable sequence lengths as first-class concerns.
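A minimal sketch of why per-token scheduling matters, assuming a continuous-batching loop of the kind vLLM popularized; model_step, the request format, and the random stopping rule are all invented for illustration:

```python
import random
from collections import deque

def model_step(batch):
    """Stand-in for one decode step: every running request emits one token;
    returns which requests just finished (e.g. hit EOS or a length limit)."""
    return {req["id"]: random.random() < 0.1 for req in batch}

waiting = deque({"id": i} for i in range(8))   # queued requests
running, MAX_BATCH = [], 4

steps = 0
while waiting or running:
    # Admit work at token granularity, not batch granularity:
    # a slot opens the moment any request finishes.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    done = model_step(running)                 # one forward pass, one token each
    running = [r for r in running if not done[r["id"]]]
    steps += 1

print(f"served 8 requests of unpredictable length in {steps} decode steps")
```

Rebuilding the batch every step is what lets short and long requests share the GPU without a fixed shape: finished sequences leave immediately instead of padding out to the longest request in a static batch.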


