GPU Uptime with VAST Data CTO

Nov 11, 2025

In this engaging discussion, Andy Pernsteiner, Field CTO at VAST Data, dives into the complexities of building robust AI infrastructures. He highlights the critical gap between prototypes and production systems, emphasizing the importance of unified data and real-time processing. Andy reveals how GPU downtime can escalate costs dramatically and advocates for chaos engineering to ensure reliability. He also shares insights on workflow automation, the need for empathy between tech teams, and the advantages of separating logic from data for scalability. This conversation is a must-listen for anyone in the AI space!

INSIGHT

Expectations Outpaced Production Reality

Public expectations for AI sped up because everyone can try generative AI on a phone.
That pressure pushes developers to deliver demos quickly, but production problems persist because the underlying data is messy.

INSIGHT

GPU Minutes Multiply Failures

Large GPU farms measure losses in GPU-minutes, so the cost of an outage grows with the number of GPUs it idles.
A small hiccup multiplied by thousands of GPUs becomes a multi-million-dollar problem.
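
The multiplication behind this snip can be sketched in a few lines of Python. The function name, the fleet size, the outage length, and the hourly rate below are all hypothetical values chosen for illustration; none of them come from the episode.

```python
def outage_cost(num_gpus: int, outage_minutes: float, cost_per_gpu_hour: float) -> float:
    """Estimate the cost of GPUs sitting idle during an outage.

    All inputs are assumptions supplied by the caller; the result is in the
    same currency as cost_per_gpu_hour.
    """
    # Lost capacity is counted in GPU-minutes: every idle GPU contributes
    # one GPU-minute per minute of outage.
    gpu_minutes_lost = num_gpus * outage_minutes
    # Convert the hourly rate to a per-minute rate before multiplying.
    return gpu_minutes_lost * cost_per_gpu_hour / 60.0


# Hypothetical example: 10,000 GPUs idle for 30 minutes at 2.00 per GPU-hour.
print(outage_cost(10_000, 30, 2.0))  # prints 10000.0
```

Because the cost is linear in both fleet size and outage length, the same 30-minute hiccup costs ten times as much on a fleet ten times larger.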

ADVICE

Plan For Parallelism Early

Design for parallelism and sharding from the start so workloads scale across GPUs and storage.
Profile and find bottlenecks early, because single-node bottlenecks waste large-scale resources.
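
As a minimal sketch of what designing for sharding can look like, the Python below splits a workload into near-equal shards and hands each shard to its own worker. The helper names (`shard`, `process_shard`, `run_sharded`) and the squared-sum workload are hypothetical stand-ins, not anything described in the episode.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items, num_shards):
    """Split items round-robin into num_shards lists of near-equal size."""
    return [items[i::num_shards] for i in range(num_shards)]


def process_shard(shard_items):
    # Placeholder for real per-worker work, such as preprocessing a batch
    # of files. Here it just sums the squares of the items.
    return sum(x * x for x in shard_items)


def run_sharded(items, num_workers):
    """Process each shard on its own worker and combine the partial results."""
    shards = shard(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(process_shard, shards))
    return sum(partials)


print(run_sharded(list(range(10)), 4))  # prints 285
```

Threads are used here only to keep the sketch short; in CPython they suit I/O-bound work, and CPU-bound work would more likely use processes or a distributed scheduler. The point is the shape: work that is partitioned up front, with no shared state between shards, can be spread across more workers without redesign.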
