The History of Transformers
In the 90s, people were already working on things that we now call mixture of experts, and also on multiplicative interactions. Then there were ideas of neural networks that have a separate module for computation and memory. Attention mechanisms of this kind were popularized around 2015 by a paper from Yoshua Bengio's group at Mila. They are extremely powerful for doing things like language translation in NLP, and that really started the craze around attention. So you combine all of those ideas and you get a transformer, which uses something called self-attention, where the input tokens are used both as queries and keys in an associative memory, very much like a memory network. The advantage of transformers is…
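To make the self-attention idea concrete, here is a minimal NumPy sketch (not from the episode) of single-head scaled dot-product self-attention. The same input tokens produce the queries, keys, and values, so each token does a soft associative-memory lookup over all the others; every name, shape, and the random initialization below is illustrative, not from the source.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token embeddings.

    X:   (n, d) input token embeddings
    W_q, W_k, W_v: (d, d) projection matrices (learned in practice;
                   random here purely for illustration)
    """
    Q = X @ W_q  # queries come from the input tokens
    K = X @ W_k  # keys come from the same tokens
    V = X @ W_v  # values, also from the same tokens
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # weighted sum of values: a soft memory lookup

# Toy usage with random data: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one updated vector per input token
```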