The Stack Overflow Podcast

The server-side rendering equivalent for LLM inference workloads

Aug 19, 2025
Tuhin Srivastava, CEO and co-founder of Baseten, shares his insights on building AI infrastructure for large-scale neural networks. He discusses the evolving demands on GPUs and the challenges of scaling generative AI. The conversation highlights the trade-offs between different workloads, such as RAG pipelines and embedding models. Tuhin also emphasizes the cost savings and customization potential of open-source models, while addressing the rapid changes in chip architecture that shape software development.
AI Snips
INSIGHT

Inference Is Now The Main Cost Driver

  • Production LLM systems demand low latency, high reliability, and cost-effectiveness, which together reshape the infrastructure needed to serve them.
  • GPUs add complexity because models require specialized runtimes and must run at high utilization to be economical (see the rough cost sketch after this list).
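
To make the utilization point concrete, here is a minimal back-of-the-envelope sketch. The hourly GPU price and peak token throughput below are illustrative assumptions, not figures from the episode.

```python
# Back-of-the-envelope sketch: why GPU utilization drives inference cost.
# The price and throughput numbers are illustrative assumptions only.

GPU_HOURLY_COST_USD = 4.00        # assumed on-demand price for one GPU
PEAK_TOKENS_PER_SECOND = 2_500    # assumed peak decode throughput per GPU


def cost_per_million_tokens(utilization: float) -> float:
    """Cost (USD) to generate 1M tokens at a given average utilization (0-1]."""
    effective_tps = PEAK_TOKENS_PER_SECOND * utilization
    tokens_per_hour = effective_tps * 3600
    return GPU_HOURLY_COST_USD / tokens_per_hour * 1_000_000


for u in (0.1, 0.5, 0.9):
    print(f"utilization {u:.0%}: ${cost_per_million_tokens(u):.2f} per 1M tokens")
```

Under these assumptions, cost per token scales inversely with utilization: keeping the GPU busy 90% of the time instead of 10% cuts the cost per million tokens by roughly 9x, which is why serving stacks work hard to keep batches full.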
INSIGHT

GPUs Changed The Inference Game

  • Large models need GPUs and heavy tensor computation, unlike older, smaller models that fit comfortably in CPU memory (see the memory sketch after this list).
  • This shift forces teams to solve new problems around reliability, speed, and cost.
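
The shift is largely a matter of arithmetic. Below is a rough sketch of weight-memory requirements; the parameter counts and byte widths are illustrative assumptions, not numbers from the episode.

```python
# Rough sketch of why model size forces the move to GPUs: the weights alone
# of a large LLM dwarf what classic small models needed on a CPU host.
# Parameter counts and byte widths are illustrative.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate memory (GB) needed just to hold the model weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


for name, size_b in [("classic ML model, ~0.1B params", 0.1),
                     ("7B-parameter LLM", 7),
                     ("70B-parameter LLM", 70)]:
    print(f"{name}: ~{weight_memory_gb(size_b, 'fp16'):.1f} GB of fp16 weights")
```

A 70B-parameter model needs on the order of 140 GB just for fp16 weights, before any KV cache or activations, so serving it means GPUs (often several) rather than the single CPU box that handled earlier small models.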
ANECDOTE

Rapidly Evolving GPU Runtimes

  • Tuhin describes runtimes like TensorRT-LLM and SGLang being developed and modified rapidly while already in production use.
  • He recounts running runtime builds that had changed within the previous 48 hours, which made production behavior brittle.