MLOps.community

The GPU Uptime Battle

Nov 11, 2025
In this discussion, Andy Pernsteiner, Field CTO at VAST Data, digs into the complexities of building robust AI infrastructure. He highlights the gap between prototypes and production systems and stresses the importance of unified data and real-time processing. Andy explains how GPU downtime can escalate costs dramatically and advocates chaos engineering to ensure reliability. He also shares insights on workflow automation, the need for empathy between tech teams, and the advantages of separating logic from data for scalability.
INSIGHT

Expectations Outpaced Production Reality

  • Public expectations for AI sped up because everyone can try generative AI on a phone.
  • That pressures developers to deliver demos quickly, but production problems persist because the underlying data is messy.
INSIGHT

GPU Minutes Multiply Failures

  • Large GPU farms measure losses in GPU-minutes, so the cost of an outage scales with the number of GPUs it idles.
  • A small hiccup multiplied by thousands of GPUs becomes a multi-million-dollar problem.
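
The GPU-minute arithmetic behind this point can be sketched in a few lines of Python. The function names, cluster sizes, and the hourly rate below are hypothetical placeholders for illustration, not figures from the episode.

```python
def lost_gpu_minutes(num_gpus: int, outage_minutes: float) -> float:
    """GPU-minutes lost when `num_gpus` sit idle for `outage_minutes`."""
    return num_gpus * outage_minutes


def outage_cost(num_gpus: int, outage_minutes: float,
                dollars_per_gpu_hour: float) -> float:
    """Dollar cost of the idle time, given an assumed hourly GPU rate."""
    return lost_gpu_minutes(num_gpus, outage_minutes) * dollars_per_gpu_hour / 60.0


if __name__ == "__main__":
    # Hypothetical numbers: the same 5-minute stall on two cluster sizes,
    # priced at an assumed 3 dollars per GPU-hour.
    for gpus in (8, 8_000):
        print(gpus, lost_gpu_minutes(gpus, 5), outage_cost(gpus, 5, 3.0))
```

The stall is identical in both cases; only the GPU count changes, and the loss grows linearly with it. Repeated stalls over the life of a long job add up the same way.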
ADVICE

Plan For Parallelism Early

  • Design for parallelism and sharding from the start so workloads scale across GPUs and storage.
  • Profile and find bottlenecks early, because single-node bottlenecks waste large-scale resources.
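
As a rough illustration of the two bullets above, the sketch below splits a workload into shards and times pipeline stages to spot the slowest one. It is a minimal sketch under assumptions: the helper names (`shard`, `time_stages`, `slowest_stage`) are hypothetical, round-robin sharding is just one simple scheme, and the episode does not prescribe any particular implementation.

```python
import time
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], num_shards: int) -> List[List[T]]:
    """Split items round-robin into num_shards lists, one per worker."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [list(items[i::num_shards]) for i in range(num_shards)]


def time_stages(stages: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each named stage once and record its wall-clock time in seconds."""
    timings: Dict[str, float] = {}
    for name, fn in stages.items():
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings


def slowest_stage(timings: Dict[str, float]) -> str:
    """Return the name of the stage that took the longest."""
    return max(timings, key=lambda name: timings[name])
```

For example, `shard(list(range(10)), 3)` gives each of three workers its own slice of the work, and `slowest_stage(time_stages({...}))` on a single node points at the stage to fix before scaling out.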