The Stack Overflow Podcast

The server-side rendering equivalent for LLM inference workloads

Aug 19, 2025
Tuhin Srivastava, CEO and co-founder of Baseten, shares his insights on building AI infrastructure for large-scale neural networks. He discusses the evolving demands on GPUs and the challenges of scaling generative AI. The conversation highlights the trade-offs between different workloads, such as RAG pipelines and embedding models. Tuhin also emphasizes the cost savings and customization potential of open-source models, while addressing the rapid changes in chip architecture that shape software development.
AI Snips
INSIGHT

Inference Is Now The Main Cost Driver

  • Production LLM systems demand low latency, high reliability, and cost-effectiveness, which together reshape the infrastructure needed to serve them.
  • GPUs add complexity because models require specialized runtimes and must run at high utilization to be economical (see the rough cost sketch after this list).
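
To make the utilization point concrete, here is a minimal back-of-the-envelope sketch. The hourly GPU price and peak token throughput below are illustrative assumptions, not figures from the episode.

```python
# Back-of-the-envelope sketch: why GPU utilization drives inference cost.
# The price and throughput numbers are illustrative assumptions only.

GPU_HOURLY_COST_USD = 4.00        # assumed on-demand price for one GPU
PEAK_TOKENS_PER_SECOND = 2_500    # assumed peak decode throughput per GPU


def cost_per_million_tokens(utilization: float) -> float:
    """Cost (USD) to generate 1M tokens at a given average utilization (0-1]."""
    effective_tps = PEAK_TOKENS_PER_SECOND * utilization
    tokens_per_hour = effective_tps * 3600
    return GPU_HOURLY_COST_USD / tokens_per_hour * 1_000_000


for u in (0.1, 0.5, 0.9):
    print(f"utilization {u:.0%}: ${cost_per_million_tokens(u):.2f} per 1M tokens")
```

Under these assumptions, cost per token scales inversely with utilization: keeping the GPU busy 90% of the time instead of 10% cuts the cost per million tokens by roughly 9x, which is why serving stacks work hard to keep batches full.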
INSIGHT

GPUs Changed The Inference Game

  • Large models need GPUs and heavy tensor computation, unlike older, smaller models that fit comfortably in CPU memory (see the memory sketch after this list).
  • This shift forces teams to solve new problems around reliability, speed, and cost.
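
The shift is largely a matter of arithmetic. Below is a rough sketch of weight-memory requirements; the parameter counts and byte widths are illustrative assumptions, not numbers from the episode.

```python
# Rough sketch of why model size forces the move to GPUs: the weights alone
# of a large LLM dwarf what classic small models needed on a CPU host.
# Parameter counts and byte widths are illustrative.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate memory (GB) needed just to hold the model weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


for name, size_b in [("classic ML model, ~0.1B params", 0.1),
                     ("7B-parameter LLM", 7),
                     ("70B-parameter LLM", 70)]:
    print(f"{name}: ~{weight_memory_gb(size_b, 'fp16'):.1f} GB of fp16 weights")
```

A 70B-parameter model needs on the order of 140 GB just for fp16 weights, before any KV cache or activations, so serving it means GPUs (often several) rather than the single CPU box that handled earlier small models.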
ANECDOTE

Rapidly Evolving GPU Runtimes

  • Tuhin describes runtimes like TensorRT-LLM and SGLang being developed and modified rapidly while already in production use.
  • He recounts running runtime builds that had changed within the previous 48 hours, which made production behavior brittle.