Brandon Jacobs, infrastructure architect at CoreWeave, discusses how CoreWeave uses Kubernetes to build an AI hyperscaler. They cover managing Day 0 and Day 2 operations for AI labs, lessons learned, and best practices for a Kubernetes-based cloud. Topics include leveraging bare metal Kubernetes for GPU workloads, storage options for AI labs, observability, monitoring, handling CVEs, and customer cluster support.
CoreWeave runs Kubernetes on bare metal to build an AI hyperscaler, focusing on efficient cloud platforms.
Observability and monitoring are key priorities for CoreWeave, which shares security responsibilities with customers.
Efficient networking and distributed training support are crucial for CoreWeave's scale challenges.
Deep dives
Kubernetes Bytes Podcast Overview
The Kubernetes Bytes podcast covers cloud-native data management and is hosted by Ryan Wallner and Bhavin Shah from Boston. Episodes cover recent news, interviews with industry experts, and practical experience managing data in cloud-native ecosystems, with a diverse range of guests sharing insights on Kubernetes and cloud platforms.
Ollama CVE Issue and StackState Acquisition
Ollama faced a security issue, tracked as a CVE, that was resolved quickly, reflecting the project's focus on security. SUSE's acquisition of StackState aimed to integrate observability capabilities into Rancher Prime, enhancing end-to-end visibility for users, with plans to open-source the technology to promote broader accessibility.
CoreWeave's AI Hyperscaler Cloud Platform
CoreWeave, an AI hyperscaler cloud, focuses on purpose-built AI applications and infrastructure powered by Kubernetes on bare metal. Its infrastructure architect, Brandon Jacobs, discusses building efficient cloud platforms for AI workloads, highlighting the advantages of Kubernetes in driving performance and scalability.
Scalability Challenges and Networking Considerations
CoreWeave's scale challenges underline the importance of efficient networking and distributed training support. Robust CNI configurations and handling the ripple effects of network disruptions are crucial for maintaining cluster stability at scale.
Observability, Security, and Model Management
CoreWeave prioritizes observability and monitoring at all levels, sharing responsibility with customers for CVEs and security patches. Brandon advocates for integrating observability early and stresses the importance of security measures. While CoreWeave does not currently host a model garden, it focuses on offering hands-on support for customers' model integration and scaling needs.
In this episode of the Kubernetes Bytes podcast, Bhavin sits down with Brandon Jacobs, an infrastructure architect at CoreWeave. They discuss how CoreWeave has adopted Kubernetes to build the AI hyperscaler. The discussion dives into how CoreWeave handles Day 0 and Day 2 operations for AI labs that need access to GPUs. They also talk about lessons learnt and best practices for building a Kubernetes-based cloud.
Check out our website at https://kubernetesbytes.com/
Episode Sponsor: Nethopper. Learn more about KAOPS: nethopper.io. For a supported demo: info@nethopper.io. Try the free version of KAOPS now! https://mynethopper.com/auth