

LLM-D, with Clayton Coleman and Rob Shaw
Aug 20, 2025
Join Clayton Coleman, a Kubernetes core contributor and OpenShift architect, alongside Rob Shaw, Engineering Director at Red Hat and vLLM contributor. They dive into deploying large language models (LLMs) on Kubernetes, covering the unique challenges of LLM inference and the performance optimizations it demands. Expect insights on the future of AI models, the pivotal role of collaborative open-source communities, and innovations like the Inference Gateway that drive efficiency in processing workloads. Get ready for an enlightening take on AI in the cloud-native space!
AI Snips
Models Act Like Shared Computers
- LLM inference behaves more like time-sharing a dedicated computer than like a typical stateless microservice.
- Random load balancing breaks down because requests vary massively in cost, so routing must be model-aware (see the sketch below).
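A minimal sketch of that contrast, assuming a replica exposes queue depth, KV-cache headroom, and prefix-cache overlap as routing signals. The field names and scoring weights are illustrative assumptions, not llm-d's or the Inference Gateway's actual API.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queue_depth: int           # requests waiting on this vLLM instance (assumed metric)
    kv_cache_free_pct: float   # fraction of KV-cache blocks still free (assumed metric)
    cached_prefix_tokens: int  # tokens of this prompt already in its prefix cache (assumed metric)


def random_route(replicas: list[Replica]) -> Replica:
    # Naive approach: treat every request as equal and pick any replica.
    # A 30k-token prompt can cost orders of magnitude more than a short
    # chat turn, so this routinely overloads a single instance.
    return random.choice(replicas)


def model_aware_route(replicas: list[Replica]) -> Replica:
    # Score replicas on signals the model server exposes: shorter queues,
    # free KV-cache space, and prefix-cache hits all help latency.
    # The weights here are arbitrary, for illustration only.
    def score(r: Replica) -> float:
        return -2.0 * r.queue_depth + 1.0 * r.kv_cache_free_pct + 0.01 * r.cached_prefix_tokens

    return max(replicas, key=score)


if __name__ == "__main__":
    fleet = [
        Replica("pod-a", queue_depth=12, kv_cache_free_pct=0.10, cached_prefix_tokens=0),
        Replica("pod-b", queue_depth=2, kv_cache_free_pct=0.65, cached_prefix_tokens=4096),
    ]
    print("random:", random_route(fleet).name)             # could be either
    print("model-aware:", model_aware_route(fleet).name)   # pod-b
```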
KV Cache Is The Performance Core
- LLMs are autoregressive: each generated token requires another forward pass over the model.
- Managing KV caches and continuous batching (as vLLM does) is critical for good performance; the sketch below shows why.
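A toy sketch of autoregressive decoding, assuming a stand-in "model" that only counts work rather than doing real attention math. It shows the prefill-then-decode loop and why skipping recomputation of earlier tokens' keys/values dominates serving cost.

```python
def generate(prompt_tokens: list[int], max_new_tokens: int) -> tuple[list[int], int]:
    """Autoregressive decode loop with a KV cache; returns (tokens, op count)."""
    kv_cache: list[int] = []   # stand-in for per-token key/value tensors
    tokens = list(prompt_tokens)
    ops = 0

    # Prefill: process the whole prompt once, populating the KV cache.
    kv_cache.extend(tokens)
    ops += len(tokens) ** 2    # attention over the prompt is roughly quadratic

    for _ in range(max_new_tokens):
        # Decode: only the newest position runs a forward pass, attending
        # to every cached token instead of recomputing them all.
        ops += len(kv_cache)
        next_token = (tokens[-1] + 1) % 50_000   # fake "model" prediction
        tokens.append(next_token)
        kv_cache.append(next_token)

    return tokens, ops


if __name__ == "__main__":
    _, with_cache = generate(list(range(1000)), max_new_tokens=200)
    # Without a KV cache, every step would re-attend over the full sequence
    # from scratch: quadratic work per token instead of linear.
    without_cache = sum((1000 + i) ** 2 for i in range(1, 201))
    print(f"ops with KV cache:    {with_cache:,}")
    print(f"ops without KV cache: {without_cache:,}")
```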
Adopt Well-Lit Deployment Paths
- Use well-lit paths like intelligent inference scheduling to standardize deployments.
- Compose the Inference Gateway and vLLM integrations rather than building a bespoke proxy for each routing trick (sketched below).
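A rough sketch of the "compose, don't fork" idea: routing features plug into one shared scheduler pipeline instead of each team maintaining its own proxy. The Scorer protocol and class names are hypothetical, not the Inference Gateway's real interfaces.

```python
from typing import Protocol


class Scorer(Protocol):
    """One pluggable routing heuristic; the pipeline sums all scorers."""
    def score(self, replica: dict, request: dict) -> float: ...


class QueueDepthScorer:
    def score(self, replica: dict, request: dict) -> float:
        # Prefer less-loaded replicas.
        return -float(replica.get("queue_depth", 0))


class PrefixCacheScorer:
    def score(self, replica: dict, request: dict) -> float:
        # Prefer the replica that already holds this session's prefix.
        return 10.0 if replica.get("session") == request.get("session") else 0.0


class Scheduler:
    """Composes scorers instead of hard-coding one bespoke routing policy."""
    def __init__(self, scorers: list[Scorer]):
        self.scorers = scorers

    def pick(self, replicas: list[dict], request: dict) -> dict:
        return max(replicas, key=lambda r: sum(s.score(r, request) for s in self.scorers))


if __name__ == "__main__":
    scheduler = Scheduler([QueueDepthScorer(), PrefixCacheScorer()])
    replicas = [
        {"name": "pod-a", "queue_depth": 3, "session": "user-42"},
        {"name": "pod-b", "queue_depth": 1, "session": None},
    ]
    print(scheduler.pick(replicas, {"session": "user-42"})["name"])  # pod-a
```

Adding a new routing trick then means adding a scorer to the shared pipeline, rather than standing up another one-off proxy in front of the model servers.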