AI-powered
podcast player
Listen to all your favourite podcasts with AI-powered features
Debugging Distributed Training Bugs?
The most subtle things to debug are like, I personally hate these kind of bugs. When you're doing distributed training and you're calculating something on two separate processes, on two different GPUs, and then you average them or something and then you're like, ah, what happened? So basically like the models just can't learn anymore. That's the worst of debugging NANs.