5-minute chapter


120. Liam Fedus and Barrett Zoph - AI scaling with mixture of expert models

Towards Data Science

CHAPTER

Training Instability in Distributed Transformers

Mixture-of-experts models introduce a new communication primitive called all-to-all, which isn't seen in normal, standard dense distributed transformers. So for instance, if an input is coming in at, like, time step t for the model, it might get sent to expert i. But now let's say we're looking at t plus one, so the parameters have been updated via the gradient update, and the same example coming in might now actually get sent to expert j. And expert j might have a very different output. And so you sort of fundamentally have these kind of very discontinuous systems. It seems like this is, again, highly speculative, but could be an
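To make that routing discontinuity concrete, here's a minimal sketch, not from the episode: a top-1 router over two experts, where the toy weights and the `route` helper are invented for illustration. A tiny change to the router weights after a gradient step flips the argmax, so the same token is dispatched to a different expert.

```python
import numpy as np

# Minimal top-1 mixture-of-experts router (illustrative only):
# each token goes to the single expert with the largest router logit.
def route(token, router_weights):
    logits = router_weights @ token          # one logit per expert
    return int(np.argmax(logits))            # index of the selected expert

token = np.array([1.0, 0.0])                 # the same input example

# Router weights at step t: expert 0 barely wins for this token.
w_t = np.array([[0.50, 0.0],                 # logit for expert 0 -> 0.50
                [0.49, 0.0]])                # logit for expert 1 -> 0.49

# A small gradient update nudges the weights, and expert 1 now wins.
w_t1 = w_t + np.array([[-0.02, 0.0],
                       [ 0.02, 0.0]])

print(route(token, w_t))    # 0: sent to expert i at step t
print(route(token, w_t1))   # 1: sent to expert j at step t+1

# The hard argmax makes the model discontinuous in its parameters:
# an arbitrarily small weight change can swap which expert runs.
```

In a real distributed MoE layer, that dispatch step is what the all-to-all primitive carries out: tokens are shuffled across devices to whichever device hosts their selected expert, then shuffled back.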

