How Oracle Is Meeting the Infrastructure Needs of AI
Jan 23, 2025
Sudha Raghavan, SVP for Developer Platform at Oracle Cloud Infrastructure, discusses the seismic shift in infrastructure needs driven by AI's rapid adoption. She highlights the explosive demand for GPUs and the challenges posed by continuous workloads like large language model training. Raghavan elaborates on Oracle's innovative GPU superclusters and improvements in Kubernetes for better job management and observability. The conversation also touches on the evolution of AI pipelines and the critical trade-offs in AI model selection to meet specific business demands.
The rapid adoption of generative AI has drastically increased demand for GPUs, necessitating innovative infrastructure solutions like Oracle's massive GPU superclusters.
To manage AI workloads effectively, Oracle is enhancing Kubernetes functionality to support stateful operations and improve job scheduling for GPU resources.
Deep dives
The Impact of GPU Demand on Infrastructure
Demand for GPUs has surged with advances in generative AI, creating new infrastructure requirements. Unlike traditional workloads, which have peaks and troughs, GPU training workloads run at peak utilization continuously, drawing substantial power and raising the risk of hardware failure. Oracle Cloud Infrastructure (OCI) is addressing these challenges with a massive GPU supercluster that scales to more than 131,000 GPUs to handle customers' large training jobs. At this scale, nodes must be placed physically close together to minimize latency, and state must be managed efficiently across large batch jobs.
Kubernetes Adaptations for GPU Workloads
The Kubernetes architecture, originally designed for stateless CPU workloads, must adapt to accommodate the complexities of stateful GPU operations. With the increasing scale of GPU utilization, there are significant new demands for features such as robust checkpointing to manage job state during potential node failures. OCI is implementing enhancements in Kubernetes to ensure nodes are physically co-located for optimal performance and to introduce better scheduling capabilities to maximize GPU usage. This evolution reflects the necessity for Kubernetes to integrate observability metrics specific to GPUs to avoid inefficiencies and monitor overall health effectively.
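The checkpointing requirement described above can be illustrated with a minimal sketch: periodically persist training state so a job can resume after a node failure rather than restart from scratch. This is a generic, hedged illustration using only the Python standard library; the file name and training loop are hypothetical stand-ins, not OCI's or Kubernetes' actual implementation.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical path, not an OCI convention


def load_checkpoint():
    """Resume from the last saved step if a prior run was interrupted."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def save_checkpoint(state):
    """Write state to a temp file, then rename, so a failure mid-write
    cannot leave a corrupt checkpoint behind."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)  # atomic rename on POSIX


def train(total_steps=10, save_every=2):
    """Stand-in training loop: resumes from the checkpoint, saves periodically."""
    state = load_checkpoint()
    for step in range(state["step"], total_steps):
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}  # placeholder for real work
        if (step + 1) % save_every == 0:
            save_checkpoint(state)
    return state
```

In a real cluster the checkpoint would land on shared storage and the scheduler would restart the pod, but the save/restore shape is the same.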
MLOps and the Intersection of Data Science and DevOps
The rise of AI and generative models has led to the emergence of MLOps, which facilitates the integration of data science with traditional DevOps practices. While managing AI workloads presents unique challenges, such as orchestrating extensive training data without interrupting ongoing operations, there is a push to create seamless user experiences for developers. Tools like Kubeflow are being utilized to streamline AI pipeline management, enabling developers to focus on inference and minimize operational complexity. As such, building a unified observability portfolio that includes AI operations is essential to ensure consistency across application and AI monitoring.
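Conceptually, the AI pipelines mentioned above chain ordered stages such as data preparation, training, and deployment, each consuming the previous stage's output. The sketch below, a stdlib-only illustration and not Kubeflow's actual API, shows that staged-execution shape; the stage names and context keys are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stages; real pipelines would call out to data stores,
# training clusters, and model registries rather than mutate a dict.
def prepare_data(ctx: Dict) -> Dict:
    ctx["examples"] = ["sample-1", "sample-2"]
    return ctx


def train_model(ctx: Dict) -> Dict:
    ctx["model"] = f"model trained on {len(ctx['examples'])} examples"
    return ctx


def deploy(ctx: Dict) -> Dict:
    ctx["endpoint"] = "inference-endpoint"  # placeholder name
    return ctx


def run_pipeline(stages: List[Callable[[Dict], Dict]]) -> Dict:
    """Run each stage in order, passing shared context downstream --
    the ordering guarantee a tool like Kubeflow provides at cluster scale."""
    ctx: Dict = {}
    for stage in stages:
        ctx = stage(ctx)
    return ctx
```

A pipeline tool adds what this sketch omits: distributed execution, retries, and per-stage observability.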
Generative AI is a data-driven story with significant infrastructure and operational implications, particularly around the rising demand for GPUs, which are better suited than CPUs for AI workloads. In an episode of The New Stack Makers recorded at KubeCon + CloudNativeCon North America, Sudha Raghavan, SVP for Developer Platform at Oracle Cloud Infrastructure, discussed how AI's rapid adoption has reshaped infrastructure needs.
The release of ChatGPT triggered a surge in GPU demand, with organizations requiring GPUs for tasks ranging from testing workloads to training large language models across massive GPU clusters. These workloads run continuously at peak power, posing challenges such as high hardware failure rates and energy consumption.
Oracle is addressing these issues by building GPU superclusters and enhancing Kubernetes functionality. Tools like Oracle’s Node Manager simplify interactions between Kubernetes and GPUs, providing tailored observability while maintaining Kubernetes’ user-friendly experience. Raghavan emphasized the importance of stateful job management and infrastructure innovations to meet the demands of modern AI workloads.
Learn more from The New Stack about how Oracle is addressing GPU demand for AI workloads with its GPU superclusters and enhanced Kubernetes functionality.