Achieving Full Model Performance With ResNet-50

8,750 nodes on 25 dies coordinating to reduce and then broadcast the batch norm mean and standard deviation values. Local reduction followed by global reduction towards the middle of the tile. Then the reduced value radiating from the middle, accelerated by the hardware's broadcast facility. This operation takes only 5 microseconds on 25 Dojo dies. The same operation takes 150 microseconds on 24 GPUs. And while we talked about an all-reduce operation in the context of a batch norm, it's important to reiterate that the same advantages apply to all other communication primitives. These primitives are essential for large-scale training.
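
The reduce-then-broadcast pattern described above can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions, not Dojo's implementation: the function names (`local_stats`, `reduce_stats`, `all_reduce_batch_norm`), the nested-list layout of dies and nodes, and the two-stage split (within a die, then across dies) are all hypothetical and chosen for illustration. Each node contributes a count, a sum, and a sum of squares, so the global mean and standard deviation can be recovered after the reduction.

```python
import math


def local_stats(values):
    """Partial statistics one node contributes: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def reduce_stats(partials):
    """Reduce step: add partial statistics element-wise."""
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    return (count, total, total_sq)


def all_reduce_batch_norm(dies):
    """Simulate reduce-then-broadcast of batch norm statistics.

    `dies` is a list of dies; each die is a list of nodes; each node is a
    list of activation values (hypothetical layout for illustration).
    Returns one (mean, std) pair per node, identical everywhere.
    """
    # Stage 1 (assumed): reduce the nodes within each die.
    per_die = [reduce_stats([local_stats(node) for node in die]) for die in dies]

    # Stage 2 (assumed): reduce the per-die results into one global value.
    count, total, total_sq = reduce_stats(per_die)
    mean = total / count
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(total_sq / count - mean * mean, 0.0)
    std = math.sqrt(variance)

    # Stage 3: broadcast the reduced value back to every node.
    return [[(mean, std) for _ in die] for die in dies]


if __name__ == "__main__":
    # Two dies with two nodes each, holding uneven shards of a batch.
    dies = [[[1.0, 2.0], [3.0]], [[4.0], [5.0, 6.0]]]
    for die_index, die in enumerate(all_reduce_batch_norm(dies)):
        print(die_index, die)
```

After the call, every node holds the same mean and standard deviation, which is the property batch norm needs when a batch is sharded across many nodes. The simulation says nothing about latency; the 5 and 150 microsecond figures in the talk depend on the hardware's communication fabric.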

