
Generative AI on Kubernetes

Kubernetes Bytes


Optimizing Inference Engines on Kubernetes with Text Generation Inference (TGI) and More

This chapter covers the advantages of running Hugging Face's Text Generation Inference (TGI) engine on Kubernetes, highlighting its ease of use and memory management capabilities. It compares optimized inference engines such as TGI, vLLM, and TensorRT-LLM for deploying models effectively on Kubernetes clusters. The discussion extends to tools like Ollama and Ray, building infrastructure for retrieval-augmented generation (RAG) pipelines, and using LLMs for contextual question answering, concluding with reflections on the episode and a preview of the guest's upcoming content.
