Explore the integration of Generative AI models on Kubernetes with expert Janakiram MSV. Dive into the challenges of running LLMs on NVIDIA GPUs, lessons learned, and the evolution of managing AI models. Also, discover insights on market trends, acquisitions in cloud native solutions, and optimizing inference engines on Kubernetes for efficient model deployment.
Generative AI focuses on content generation, impacting diverse fields like marketing.
Kubernetes supports gen AI platforms with GPU operators and vector databases.
Hugging Face serves as a model repository; deploying its models requires decoupling the model from the inference code and using shared storage.
Deep dives
Evolution of AI and Generative AI
The discussion traces the evolution of AI from simple machine learning to neural networks, leading up to the current generation of generative AI. Generative AI focuses on generating content rather than just predicting or classifying, and it has become more accessible and impactful, influencing diverse fields like marketing and content creation.
Kubernetes as a Platform for Gen AI
Kubernetes serves as a platform for generative AI, supporting researchers, engineers, and developers in deploying and serving AI models efficiently. Key components like GPU operators, shared storage layers, and vector databases contribute to building a gen AI platform stack. Kubernetes integrates with LLMs to enable autonomous operations and infuse AI into operational tasks.
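As a rough sketch of what this looks like in practice, the snippet below uses the official Kubernetes Python client to request a single NVIDIA GPU for a pod; the nvidia.com/gpu extended resource is what the GPU operator advertises once installed. The pod name and container image are placeholders, not anything discussed in the episode.

```python
# Sketch: scheduling a GPU-backed pod with the Kubernetes Python client.
# Assumes the NVIDIA GPU operator is installed and advertising nvidia.com/gpu.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference"),      # placeholder name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="inference",
                image="my-registry/inference-engine:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}            # request one GPU
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```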
Setting Up Gen AI Environment
Setting up a gen AI environment in a home lab involves configuring powerful GPUs and installing Ubuntu, NVIDIA drivers, CUDA, Docker, and the NVIDIA container toolkit. Kubernetes is then connected to the container runtime, and the NVIDIA GPU operator is used to run TensorFlow models efficiently. GPU-sharing options within containers are limited, so resource allocation requires careful consideration.
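One way to sanity-check such a setup (an assumption on my part, not a step described in the episode) is to confirm from inside a container that the framework actually sees the GPUs:

```python
# Minimal check that TensorFlow can see the GPUs exposed by the
# NVIDIA container toolkit / GPU operator from inside a container.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if not gpus:
    raise SystemExit("No GPU visible - check drivers, CUDA, and the container toolkit")

for gpu in gpus:
    print("Found GPU:", gpu.name)
    # Avoid grabbing all GPU memory up front, which helps when
    # several containers share one physical GPU.
    tf.config.experimental.set_memory_growth(gpu, True)
```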
Hugging Face and Gen AI Models
Hugging Face serves as a repository for generative AI models and datasets, akin to Docker Hub for container images. It houses foundation models like LLMs and offers datasets for model training. Large models pulled from Hugging Face require substantial storage and integration within Kubernetes for efficient data access and model deployment.
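For illustration, pulling a model out of Hugging Face onto a shared volume might look like the snippet below; the repo id and target path are examples, not the ones used in the episode.

```python
# Sketch: download model weights from Hugging Face onto shared storage
# so every node in the cluster can read them.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",         # example model, pick your own
    local_dir="/mnt/shared-models/mistral-7b",   # e.g. a PVC mounted on all nodes
)
print("Model downloaded to", local_path)
```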
Decoupling Model and Inference Code
Decoupling the model from the inference code is crucial to avoid issues when scaling out the inference engine: the inference code is stateless while the model is stateful, so the two need to be separated. Rather than replicating models on each node, a shared storage layer accessible from every node is recommended to keep models synchronized across a GPU cluster.
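A minimal sketch of what that separation can look like, assuming the weights already sit on a volume mounted at /mnt/shared-models: the inference code below holds no state of its own and simply loads whatever model version the shared mount provides.

```python
# Sketch: stateless inference code that loads the (stateful) model
# from a shared storage mount rather than baking it into the container image.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed mount point for the shared storage layer (PVC, NFS, etc.)
MODEL_DIR = os.environ.get("MODEL_DIR", "/mnt/shared-models/mistral-7b")

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, device_map="auto")

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Run a single generation against the locally mounted model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```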
Model Serving and Inference Engines
Model serving plays a pivotal role in exposing and interfacing with models for consumption. In the context of AI, inference engines like TGI (Text Generation Inference) offer OpenAI-compatible APIs, enabling seamless model interaction. Leveraging tools like LangChain adds flexibility by allowing easy model swapping behind consistent API endpoints, optimizing the serving process for AI applications.
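Because these engines expose OpenAI-compatible endpoints, application code can stay the same while the backend model is swapped out. A hedged sketch, assuming the inference service is reachable inside the cluster at the URL shown:

```python
# Sketch: calling a self-hosted, OpenAI-compatible inference endpoint
# (e.g. one running inside the cluster) with the standard openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference.default.svc.cluster.local:8080/v1",  # assumed service URL
    api_key="not-needed",  # many self-hosted engines ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b",  # whatever model the engine is serving
    messages=[{"role": "user", "content": "Summarize why GPUs matter for LLM inference."}],
)
print(response.choices[0].message.content)
```

Swapping the backend, whether directly like this or through LangChain's wrappers, then comes down to changing the base URL and model name.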
In this episode of the Kubernetes Bytes podcast, Ryan and Bhavin sit down with Janakiram MSV - an advisor, analyst, and architect - to talk about how users can run Generative AI models on Kubernetes. The discussion revolves around Jani's home lab and his experimentation with different LLM models and how to get them running on NVIDIA GPUs. Jani has spent the past year becoming a subject matter expert in GenAI, and this discussion highlights the different challenges he faced and the lessons he learned from them.
Check out our website at https://kubernetesbytes.com/