
Episode 549: William Falcon Optimizing Deep Learning Models
Software Engineering Radio - the podcast for professional software developers
Debugging Distributed Training Bugs?
The most subtle things to debug, and I personally hate these kinds of bugs, come up when you're doing distributed training. You're calculating something on two separate processes, on two different GPUs, and then you average them or something, and then you're like, "Ah, what happened?" So basically the model just can't learn anymore. That's the worst of debugging NaNs.
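As a rough illustration of the failure described above, here is a minimal sketch in plain Python with no real GPUs or distributed framework involved. It assumes the per-process values are gradients, which the speaker does not specify. The gradient values and the helper names `average_gradients` and `find_nan_ranks` are hypothetical, chosen only for this example. It shows one way the problem can play out: a single NaN on one process survives the averaging step, so checking each process's values before averaging is what reveals where it came from.

```python
import math

def average_gradients(per_rank_grads):
    """Element-wise mean across ranks, mimicking an all-reduce average."""
    n = len(per_rank_grads)
    return [sum(vals) / n for vals in zip(*per_rank_grads)]

def find_nan_ranks(per_rank_grads):
    """Return the indices of ranks whose local gradients contain a NaN."""
    return [
        rank
        for rank, grads in enumerate(per_rank_grads)
        if any(math.isnan(g) for g in grads)
    ]

# Hypothetical gradients from two processes, one per GPU (made-up values).
rank0 = [0.10, -0.20, 0.05]
rank1 = [0.12, float("nan"), 0.04]

# Check each rank BEFORE averaging: afterwards the origin of the NaN is lost.
bad_ranks = find_nan_ranks([rank0, rank1])
averaged = average_gradients([rank0, rank1])

print("ranks with NaN:", bad_ranks)
print("averaged gradients:", averaged)
```

Once the averaged value is NaN, any weight updated with it also becomes NaN, which is consistent with the "model just can't learn anymore" symptom. In a real training job the same per-process check would have to run on each worker before the collective averaging call.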