Kubernetes Podcast from Google

LLM-D, with Clayton Coleman and Rob Shaw

Aug 20, 2025
Join Clayton Coleman, a core Kubernetes contributor and OpenShift architect, alongside Rob Shaw, Engineering Director at Red Hat and vLLM contributor. They dive into deploying large language models (LLMs) on Kubernetes, discussing the unique challenges and performance optimizations involved. Expect insights on the future of AI models, the pivotal role of collaborative open-source communities, and innovations like the Inference Gateway that drive efficiency in serving inference workloads. Get ready for an enlightening take on AI in the cloud-native space!
AI Snips
INSIGHT

Models Act Like Shared Computers

  • LLM inference behaves like sharing a dedicated computer rather than a typical microservice.
  • Random load balancing fails; requests vary massively in cost, so routing must be model-aware (see the sketch below).
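
A minimal sketch (not from the episode) of the difference between random and model-aware routing; the replica metrics and scoring weights here are hypothetical, purely to show why request cost has to factor into the decision:

```python
import random
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queued_tokens: int    # tokens already waiting in this replica's batch queue
    kv_cache_used: float  # fraction of KV cache blocks in use (0.0-1.0)

def random_route(replicas: list[Replica]) -> Replica:
    # What a generic L7 load balancer effectively does: ignores request cost entirely.
    return random.choice(replicas)

def model_aware_route(replicas: list[Replica], prompt_tokens: int) -> Replica:
    # Score each replica by queued work plus how badly a large prompt would
    # hurt given its current KV cache pressure; weights are illustrative.
    def score(r: Replica) -> float:
        return r.queued_tokens + prompt_tokens * r.kv_cache_used
    return min(replicas, key=score)

replicas = [Replica("pod-a", 4000, 0.9), Replica("pod-b", 500, 0.3)]
print(model_aware_route(replicas, prompt_tokens=2048).name)  # -> pod-b
```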
INSIGHT

KV Cache Is The Performance Core

  • LLMs are autoregressive: each generated token requires another forward pass through the model.
  • Managing KV caches and continuous batching, as vLLM does, is critical for good performance (toy sketch below).
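
A toy illustration of where the KV cache fits in the decode loop; it is not the vLLM implementation. The projection matrices and shapes are made up, and a real engine keeps the cache in paged GPU memory while batching many requests' decode steps together (continuous batching):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = rng.standard_normal((3, d, d))  # made-up single-head projections

def decode(prompt: np.ndarray, steps: int) -> np.ndarray:
    # Prefill: compute and cache K/V for every prompt token once.
    k_cache = [tok @ Wk for tok in prompt]
    v_cache = [tok @ Wv for tok in prompt]
    x = prompt[-1]
    for _ in range(steps):
        q = x @ Wq
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()        # softmax over all cached keys, none recomputed
        x = attn @ V              # stand-in for the next token's embedding
        k_cache.append(x @ Wk)    # only the new token's K/V get computed
        v_cache.append(x @ Wv)
    return x

print(decode(rng.standard_normal((4, d)), steps=3).shape)  # (8,)
```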
ADVICE

Adopt Well-Lit Deployment Paths

  • Use well-lit paths like intelligent inference scheduling to standardize deployments.
  • Compose Gateway and vLLM integrations rather than building a bespoke proxy for each trick (sketch below).
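
A hedged sketch of the "compose, don't fork" idea: each routing trick becomes a scorer that an endpoint picker combines, instead of a one-off proxy. The scorer names, metrics, and weights below are hypothetical and not the Inference Gateway's actual API:

```python
from typing import Callable

Scorer = Callable[[dict, dict], float]  # (request, endpoint_metrics) -> score

def prefix_cache_scorer(request: dict, metrics: dict) -> float:
    # Prefer endpoints that already hold this session's KV-cache prefix.
    return 1.0 if request.get("session") in metrics.get("cached_sessions", ()) else 0.0

def load_scorer(request: dict, metrics: dict) -> float:
    # Prefer endpoints with shorter queues.
    return 1.0 / (1 + metrics.get("queued_requests", 0))

def pick_endpoint(request: dict, endpoints: dict[str, dict], scorers: list[Scorer]) -> str:
    # The endpoint with the highest combined score wins; adding a new trick
    # means adding a scorer, not forking the proxy.
    return max(endpoints, key=lambda name: sum(s(request, endpoints[name]) for s in scorers))

endpoints = {
    "pod-a": {"queued_requests": 12, "cached_sessions": {"s1"}},
    "pod-b": {"queued_requests": 2, "cached_sessions": set()},
}
print(pick_endpoint({"session": "s1"}, endpoints, [prefix_cache_scorer, load_scorer]))  # -> pod-a
```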