

The server-side rendering equivalent for LLM inference workloads
Aug 19, 2025
Tuhin Srivastava, CEO and co-founder of Baseten, shares his insights on advancing AI infrastructure for large-scale neural networks. He discusses the evolving demands on GPUs and the challenges of scaling generative AI in production. The conversation highlights the trade-offs across different workload types, such as retrieval-augmented generation (RAG) and embedding models. Tuhin also emphasizes the cost savings and customization potential of open-source models, and addresses how rapid changes in chip architecture shape the software built on top of them.
AI Snips
Inference Is Now The Main Cost Driver
- Production LLM systems demand low latency, high reliability, and cost-effectiveness, and those demands reshape the infrastructure they require.
- GPUs add complexity because models need specialized runtimes and sustained high utilization to be economical (see the back-of-envelope cost sketch below).
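To make the utilization point concrete, here is a minimal back-of-envelope sketch in Python. The GPU hourly price, peak throughput, and utilization levels are illustrative assumptions, not figures from the episode; what matters is the shape of the result, where the same hardware is roughly an order of magnitude cheaper per token when it is kept busy.

```python
# Back-of-envelope cost per million generated tokens for a single GPU replica.
# All numbers below are illustrative assumptions, not figures from the episode.

GPU_HOURLY_COST_USD = 4.00   # assumed on-demand price for one H100-class GPU
PEAK_THROUGHPUT_TPS = 2500   # assumed tokens/second at full batch saturation

def cost_per_million_tokens(utilization: float) -> float:
    """Cost (USD) per 1M generated tokens at a given average utilization (0-1]."""
    effective_tps = PEAK_THROUGHPUT_TPS * utilization
    tokens_per_hour = effective_tps * 3600
    return GPU_HOURLY_COST_USD / tokens_per_hour * 1_000_000

if __name__ == "__main__":
    for u in (0.1, 0.5, 0.9):
        print(f"utilization {u:.0%}: ${cost_per_million_tokens(u):.2f} per 1M tokens")
```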
GPUs Changed The Inference Game
- Large models need GPUs and heavy tensor computation, unlike older, smaller models that fit comfortably in CPU memory (see the footprint sketch below).
- This shift forces teams to solve new problems around reliability, speed, and cost.
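A quick way to see why these models no longer fit the old CPU-serving pattern is to estimate the weight footprint alone. The 70B parameter count below is an illustrative assumption, not a model discussed in the episode.

```python
# Rough weight-memory footprint for a dense transformer at common precisions.
# The 70B parameter count is an illustrative assumption.

BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

def weights_gb(num_params: float, precision: str) -> float:
    """Approximate GB needed just to hold the weights (no KV cache or activations)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

if __name__ == "__main__":
    params = 70e9  # assumed 70B-parameter model
    for precision in BYTES_PER_PARAM:
        print(f"{precision:>9}: {weights_gb(params, precision):7.1f} GB")
    # Even at fp16 this is ~140 GB of weights alone, beyond a single 80 GB GPU,
    # which is why serving such models means multi-GPU tensor computation rather
    # than the CPU-resident serving that worked for older, smaller models.
```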
Rapidly Evolving GPU Runtimes
- Tuhin describes runtimes like TensorRT-LLM (TRT-LLM) and SGLang being developed and modified rapidly while already in production use.
- He recounts running on runtime builds that had changed within the previous 48 hours, which made production behavior brittle; a startup version check like the sketch below is one way to catch that drift.
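One mitigation for that brittleness is to fail fast when the deployed runtime no longer matches the version the system was validated against. Below is a minimal startup guard in Python; the package names and pinned versions are illustrative assumptions, not Baseten's actual tooling.

```python
# Startup guard that refuses to serve if the inference runtime has drifted from
# the version the deployment was validated against. Package names and pins are
# illustrative assumptions; adjust them to whatever runtime you actually deploy.
from importlib.metadata import PackageNotFoundError, version

PINNED = {
    "sglang": "0.4.6",         # assumed validated version
    "tensorrt_llm": "0.19.0",  # assumed validated version
}

def check_runtime_pins(pins: dict[str, str]) -> None:
    """Raise if any pinned inference runtime is missing or at an unexpected version."""
    for package, expected in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            raise RuntimeError(f"{package} not installed; expected {expected}")
        if installed != expected:
            raise RuntimeError(
                f"{package}=={installed} does not match validated pin {expected}"
            )

if __name__ == "__main__":
    check_runtime_pins(PINNED)
    print("runtime versions match validated pins; safe to start serving")
```

Pinning versions in a lockfile or container image catches the same problem earlier in the pipeline; the runtime check simply guarantees that the process actually serving traffic matches what was tested.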