Debugging Distributed Training Bugs

The most subtle bugs to debug, and I personally hate this kind of bug, show up in distributed training. You're calculating something on two separate processes, on two different GPUs, and then you average the results or something, and then you're like, "Ah, what happened?" Basically, the model just can't learn anymore. That's the worst of debugging NaNs.
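To make the failure mode concrete, here is a minimal sketch in plain Python, not anything the speaker showed. It assumes two processes each hold a small list of gradient values and that the results are combined with an element-wise mean, which is roughly what an all-reduce average does. The names `average_across_ranks` and `find_non_finite` and the gradient values are hypothetical, made up for illustration.

```python
import math

def average_across_ranks(per_rank_grads):
    """Element-wise mean over ranks, imitating an all-reduce average."""
    n = len(per_rank_grads)
    return [sum(vals) / n for vals in zip(*per_rank_grads)]

def find_non_finite(per_rank_grads):
    """Return (rank, index) pairs whose value is NaN or infinite."""
    return [
        (rank, i)
        for rank, grads in enumerate(per_rank_grads)
        for i, g in enumerate(grads)
        if not math.isfinite(g)
    ]

# Hypothetical gradients from two processes; rank 1 holds a single NaN.
grads = [
    [0.10, -0.20, 0.30],
    [0.05, float("nan"), 0.25],
]

# Check each rank's values BEFORE averaging, while the source is still visible.
bad = find_non_finite(grads)

# After averaging, the NaN has spread into the shared result and
# nothing in it says which process it came from.
averaged = average_across_ranks(grads)

print("non-finite (rank, index):", bad)
print("averaged gradients:", averaged)
```

The point of the sketch is the ordering: once the values are averaged, every process sees the same poisoned result, so the per-rank check probably has to happen before the reduction to tell you where the NaN started.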
