

Building Out GPU Clouds // Mohan Atreya // #317
May 23, 2025
Mohan Atreya, Chief Product Officer at Rafay Systems with a background at Okta and McAfee, discusses the state of GPUs in AI: scarcity, high prices, and cloud models that adopt tokenized access. The conversation covers the challenges of building GPU cloud infrastructure, power management issues, and how new strategies are changing the user experience. Mohan also touches on the shift toward payer-friendly systems that offer more flexibility and could make the AI landscape more efficient.
AI Snips
GPU Access Challenges
- GPUs are hard to get and expensive, limiting access for AI/ML experimentation.
- This scarcity forces companies to either buy hardware or use cloud providers, both with drawbacks.
Universities Use Neo Clouds
- Universities want to launch AI/ML labs but face difficulty setting up GPUs.
- New GPU clouds offer services like notebooks and Kubeflow as a service to solve this.
GPU Failures Are Common
- GPU systems have high failure rates, around 30%, disrupting long training runs.
- Providers must offer capacity swaps and SLA guarantees to maintain user trust when failures occur.