
The New Stack Podcast: Keeping GPUs Ticking Like Clockwork
Nov 17, 2025
Suresh Vasudevan, CEO of Clockwork, traces his company's evolution from synchronizing clocks to improving the performance of large GPU clusters. He explains how the FleetIQ platform manages traffic to prevent costly disruptions in AI training. The conversation covers the importance of visibility, fault tolerance, and automated remediation, as well as integration with NVIDIA libraries. Suresh also discusses common GPU failures and the future of AI infrastructure.
Platform To Optimize GPU Communication
- Clockwork optimizes GPU-to-GPU communication across large clusters to boost AI efficiency.
- Their FleetIQ platform provides visibility, fault tolerance, and congestion management for training workloads.
From Clock Sync To Network Telemetry
- Clockwork began as a Stanford spinout working on precise software clock synchronization.
- That accuracy made it possible to measure packet latency, which became the foundation for the company's pivot to network telemetry.
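
Why clock accuracy matters for latency measurement can be shown with a small sketch. This is not Clockwork's implementation. It is only the general arithmetic of one-way delay, with hypothetical names (PacketStamp, one_way_delay_ns) and made-up numbers. A one-way delay is the receiver's timestamp minus the sender's timestamp, so any offset between the two clocks shows up directly as measurement error.

```python
from dataclasses import dataclass


@dataclass
class PacketStamp:
    """Timestamps for one packet (hypothetical structure, for illustration)."""
    send_ns: int  # sender's clock reading at transmit time
    recv_ns: int  # receiver's clock reading at arrival time


def one_way_delay_ns(stamp: PacketStamp, clock_offset_ns: int = 0) -> int:
    """One-way delay, corrected by the known receiver-minus-sender clock offset.

    With clock_offset_ns=0 this is the raw difference of the two clocks,
    which is only meaningful if the clocks are already synchronized.
    """
    return stamp.recv_ns - stamp.send_ns - clock_offset_ns


# Illustrative numbers (assumptions): the packet really takes 5,000 ns,
# and the receiver's clock runs 2,000 ns ahead of the sender's.
true_delay_ns = 5_000
offset_ns = 2_000
send_ns = 1_000_000
stamp = PacketStamp(send_ns=send_ns, recv_ns=send_ns + true_delay_ns + offset_ns)

raw_delay_ns = one_way_delay_ns(stamp)                   # includes the clock error
corrected_delay_ns = one_way_delay_ns(stamp, offset_ns)  # clock error removed
```

In this sketch the raw measurement is off by exactly the clock offset, so the tighter the synchronization, the smaller the latency differences that can be resolved.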
Measure First, Then Control Traffic
- Accurate timing combined with telemetry let Clockwork add dynamic traffic control that actively reroutes flows.
- They integrate with NCCL, TCP and RDMA to both observe and control GPU communications.
