

LLM-D, with Clayton Coleman and Rob Shaw
Aug 20, 2025
Join Clayton Coleman, a Kubernetes core contributor and OpenShift architect, alongside Rob Shaw, Engineering Director at Red Hat and vLLM contributor. They dive into deploying large language models (LLMs) on Kubernetes, covering the unique challenges of LLM inference and the performance optimizations it demands. Expect insights on the future of AI models, the pivotal role of collaborative open-source communities, and innovations like the Inference Gateway that drive efficiency in processing workloads. Get ready for an enlightening take on AI in the cloud-native space!
AI Snips
Models Act Like Shared Computers
- LLM inference behaves more like time-sharing a dedicated computer than like a typical stateless microservice.
- Random load balancing breaks down because requests vary massively in cost, so routing must be model-aware (see the sketch below).
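A minimal sketch of that contrast, assuming a replica exposes queue depth, KV-cache headroom, and prefix-cache overlap as routing signals. The field names and scoring weights are illustrative assumptions, not llm-d's or the Inference Gateway's actual API.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queue_depth: int           # requests waiting on this vLLM instance (assumed metric)
    kv_cache_free_pct: float   # fraction of KV-cache blocks still free (assumed metric)
    cached_prefix_tokens: int  # tokens of this prompt already in its prefix cache (assumed metric)


def random_route(replicas: list[Replica]) -> Replica:
    # Naive approach: treat every request as equal and pick any replica.
    # A 30k-token prompt can cost orders of magnitude more than a short
    # chat turn, so this routinely overloads a single instance.
    return random.choice(replicas)


def model_aware_route(replicas: list[Replica]) -> Replica:
    # Score replicas on signals the model server exposes: shorter queues,
    # free KV-cache space, and prefix-cache hits all help latency.
    # The weights here are arbitrary, for illustration only.
    def score(r: Replica) -> float:
        return -2.0 * r.queue_depth + 1.0 * r.kv_cache_free_pct + 0.01 * r.cached_prefix_tokens

    return max(replicas, key=score)


if __name__ == "__main__":
    fleet = [
        Replica("pod-a", queue_depth=12, kv_cache_free_pct=0.10, cached_prefix_tokens=0),
        Replica("pod-b", queue_depth=2, kv_cache_free_pct=0.65, cached_prefix_tokens=4096),
    ]
    print("random:", random_route(fleet).name)             # could be either
    print("model-aware:", model_aware_route(fleet).name)   # pod-b
```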
KV Cache Is The Performance Core
- LLMs are autoregressive: each generated token requires another forward pass over the model.
- Managing KV caches and continuous batching (as vLLM does) is critical for good performance; the sketch below shows why.
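A toy sketch of autoregressive decoding, assuming a stand-in "model" that only counts work rather than doing real attention math. It shows the prefill-then-decode loop and why skipping recomputation of earlier tokens' keys/values dominates serving cost.

```python
def generate(prompt_tokens: list[int], max_new_tokens: int) -> tuple[list[int], int]:
    """Autoregressive decode loop with a KV cache; returns (tokens, op count)."""
    kv_cache: list[int] = []   # stand-in for per-token key/value tensors
    tokens = list(prompt_tokens)
    ops = 0

    # Prefill: process the whole prompt once, populating the KV cache.
    kv_cache.extend(tokens)
    ops += len(tokens) ** 2    # attention over the prompt is roughly quadratic

    for _ in range(max_new_tokens):
        # Decode: only the newest position runs a forward pass, attending
        # to every cached token instead of recomputing them all.
        ops += len(kv_cache)
        next_token = (tokens[-1] + 1) % 50_000   # fake "model" prediction
        tokens.append(next_token)
        kv_cache.append(next_token)

    return tokens, ops


if __name__ == "__main__":
    _, with_cache = generate(list(range(1000)), max_new_tokens=200)
    # Without a KV cache, every step would re-attend over the full sequence
    # from scratch: quadratic work per token instead of linear.
    without_cache = sum((1000 + i) ** 2 for i in range(1, 201))
    print(f"ops with KV cache:    {with_cache:,}")
    print(f"ops without KV cache: {without_cache:,}")
```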
Adopt Well-Lit Deployment Paths
- Use well-lit paths like intelligent inference scheduling to standardize deployments.
- Compose the Inference Gateway and vLLM integrations rather than building a bespoke proxy for each routing trick (sketched below).
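A rough sketch of the "compose, don't fork" idea: routing features plug into one shared scheduler pipeline instead of each team maintaining its own proxy. The Scorer protocol and class names are hypothetical, not the Inference Gateway's real interfaces.

```python
from typing import Protocol


class Scorer(Protocol):
    """One pluggable routing heuristic; the pipeline sums all scorers."""
    def score(self, replica: dict, request: dict) -> float: ...


class QueueDepthScorer:
    def score(self, replica: dict, request: dict) -> float:
        # Prefer less-loaded replicas.
        return -float(replica.get("queue_depth", 0))


class PrefixCacheScorer:
    def score(self, replica: dict, request: dict) -> float:
        # Prefer the replica that already holds this session's prefix.
        return 10.0 if replica.get("session") == request.get("session") else 0.0


class Scheduler:
    """Composes scorers instead of hard-coding one bespoke routing policy."""
    def __init__(self, scorers: list[Scorer]):
        self.scorers = scorers

    def pick(self, replicas: list[dict], request: dict) -> dict:
        return max(replicas, key=lambda r: sum(s.score(r, request) for s in self.scorers))


if __name__ == "__main__":
    scheduler = Scheduler([QueueDepthScorer(), PrefixCacheScorer()])
    replicas = [
        {"name": "pod-a", "queue_depth": 3, "session": "user-42"},
        {"name": "pod-b", "queue_depth": 1, "session": None},
    ]
    print(scheduler.pick(replicas, {"session": "user-42"})["name"])  # pod-a
```

Adding a new routing trick then means adding a scorer to the shared pipeline, rather than standing up another one-off proxy in front of the model servers.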