Working Group Serving, with Yuan Tang and Eduardo Arango
Oct 31, 2024
Yuan Tang is a principal software engineer at Red Hat, focusing on OpenShift AI, and is a leader in Kubernetes WG Serving. Eduardo Arango, a software engineer at NVIDIA, specializes in making Kubernetes suitable for high-performance computing. They delve into the challenges of AI model serving, discussing startup times and Kubernetes API limitations. The conversation also covers the orchestration complexities of large language models and highlights solutions such as ModelMesh for optimizing multi-host environments. Both guests urge listeners to engage and collaborate in the Kubernetes working groups to drive community-led advancements.
The Serving working group within Kubernetes aims to enhance model serving for AI and machine learning workloads by addressing scalability challenges.
Efforts to optimize auto-scaling and resource sharing in Kubernetes are critical for deploying large, multi-GPU models efficiently.
Deep dives
Introduction of the Serving Working Group
The formation of the Serving working group within the Kubernetes community emerged from discussions around the specific needs of AI and machine learning workloads. It addresses challenges particular to model serving, especially those tied to scalability and efficiency. KServe, for example, has introduced advanced techniques for handling models, including pulling model weights from OCI images, which improves startup times and enables capabilities such as image prefetching. The working group aims to develop better foundational pieces that cater to the growing complexity of model serving and benefit the broader cloud-native ecosystem.
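For illustration, here is a minimal sketch of what OCI-based model storage looks like in KServe. The service name, model format, and registry path are hypothetical, and the oci:// storage scheme requires KServe's modelcar support to be enabled in the cluster:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-llm            # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      # Pull the model weights from an OCI image rather than object storage,
      # so nodes can prefetch and cache the image to shorten cold starts.
      storageUri: oci://registry.example.com/models/example-llm:v1
```

Packaging weights as image layers lets the container runtime's existing caching and prefetching machinery do the heavy lifting, instead of re-downloading the model on every pod start.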
Mission and Goals of the Working Group
The mission of the Serving working group is to optimize serving workloads on Kubernetes, with a focus on hardware-accelerated AI and machine learning inference. Its goals include improving workload controllers in Kubernetes, addressing auto-scaling effectively, and coordinating with related working groups on efficient resource sharing. Engaging with community feedback, the group gathers insights on various use cases to develop standardized recommendations and solutions. By offering better primitives, it aims to advance serving systems to meet the demands of generative AI and other evolving workloads.
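To make the auto-scaling goal concrete: the built-in CPU and memory signals map poorly onto inference workloads, so a common pattern today is scaling on a workload-level metric instead. A sketch using the standard autoscaling/v2 API, assuming a hypothetical per-pod inference_queue_depth metric exposed through the custom metrics API (e.g., via Prometheus Adapter):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server             # hypothetical inference deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth   # hypothetical per-pod metric
      target:
        type: AverageValue
        averageValue: "4"        # scale out when pods queue more than 4 requests on average
```

Part of the working group's task is deciding which such signals are worth standardizing, so that users do not have to hand-roll this wiring per deployment.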
Challenges and Limitations in Kubernetes
The group is tackling several limitations in Kubernetes, particularly around multi-node, multi-GPU serving for very large models. Kubernetes currently offers no good way to define a workload that spans multiple GPUs across nodes, which complicates deploying large models. Auto-scaling is also difficult, in large part because latency- and utilization-related metrics are hard to measure accurately. As model sizes keep growing, these challenges call for collaborative work on new solutions within the Kubernetes architecture; one emerging answer to the multi-host gap is sketched below.
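One project addressing the multi-host gap is LeaderWorkerSet (github.com/kubernetes-sigs/lws), which models a group of pods — one leader plus several workers — as a single replica. A minimal sketch, with the image and sizes chosen purely for illustration:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-multihost
spec:
  replicas: 2                  # two independent copies of the model
  leaderWorkerTemplate:
    size: 4                    # each copy spans 4 pods: 1 leader + 3 workers
    workerTemplate:
      spec:
        containers:
        - name: inference-worker
          image: registry.example.com/inference-runtime:latest  # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "8"   # 8 GPUs per pod, 32 per model copy
```

The key point is that scaling, rolling updates, and failure handling operate on the whole leader-plus-workers group rather than on individual pods, which matches how a sharded model must actually be treated.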
Work Streams and Their Focus Areas
Several work streams within the Serving working group focus on critical areas: orchestration, multi-host serving, and dynamic resource allocation. The orchestration stream works on high-level abstractions for serving workloads, integrating ideas like blueprint APIs to simplify deploying inference workloads. The auto-scaling stream emphasizes combining hardware-level and software-level metrics to improve scaling decisions. The dynamic resource allocation (DRA) stream identifies the feature requests that matter most for serving, so they can be prioritized within the larger device-management effort; a sketch of the DRA API follows.
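To ground the DRA discussion, here is roughly what requesting an accelerator through dynamic resource allocation looks like. The API is still maturing (resource.k8s.io/v1beta1 as of Kubernetes 1.32, alpha versions before that), and the device class and names below are illustrative:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # hypothetical DeviceClass from a DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: inference
    image: registry.example.com/inference-runtime:latest  # hypothetical image
    resources:
      claims:
      - name: gpu              # consume the claim by name
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

Unlike the fixed counted-resource model (e.g., nvidia.com/gpu: "1"), DRA lets drivers describe devices with structured attributes, which is what makes serving-specific requests such as GPU memory or interconnect topology expressible at all.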
Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects, including Argo, Kubeflow, and Kubernetes WG Serving. Yuan has authored three technical books and is a regular conference speaker, technical advisor, and leader at various organizations.
Eduardo is an environmental engineer who was derailed into software engineering. He has spent more than eight years making containerized environments the de facto solution for high-performance computing (HPC), beginning as a core contributor to Singularity Containers, known today as Apptainer under the Linux Foundation. In 2019 Eduardo moved up the ladder to work on making Kubernetes better for performance-oriented applications. Today he works at NVIDIA on the Core Cloud Native team, enabling specialized accelerators in Kubernetes workloads.
Do you have something cool to share? Some questions? Let us know: