MLOps.community

Building Out GPU Clouds // Mohan Atreya // #317

May 23, 2025
Mohan Atreya, Chief Product Officer at Rafay Systems with a rich background at Okta and McAfee, dives into the chaos of GPUs in AI. He discusses the hurdles of GPU scarcity and high prices, as well as dynamic cloud models that adopt tokenized access. The conversation highlights the challenges of crafting GPU cloud infrastructures, power management issues, and how innovative strategies are redefining user experience. Mohan also touches on the shift towards payer-friendly systems that enhance flexibility, paving the way for a more efficient AI landscape.
INSIGHT

GPU Access Challenges

  • GPUs are hard to get and expensive, limiting access for AI/ML experimentation.
  • This scarcity forces companies to either buy hardware or use cloud providers, both with drawbacks.
ANECDOTE

Universities Use Neo Clouds

  • Universities want to launch AI/ML labs but face difficulty setting up GPUs.
  • New GPU clouds offer services like notebooks and Kubeflow as a service to solve this.
INSIGHT

GPU Failures Are Common

  • GPU systems have high failure rates, around 30%, disrupting long training runs.
  • Providers must offer capacity swap and SLA guarantees to maintain user trust during failures.