
Artificial Intelligence & Large Language Models: Oxford Lecture — #35
Manifold
Attention Is All You Need
The idea was that this nonlinear property I mentioned, which is that word order matters and matters differently in different languages, means you need an architecture that can take that into account. And so the real innovation in this paper is to introduce these three big matrices, which are 10,000 by 10,000 dimensional matrices: Q, K, and V. So when you're done with your training, they're fixed. But interestingly, more training further refines them. If you come across more data, or you can run longer, you get more refined versions of these matrices.
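For listeners who want to see what those Q, K, and V matrices do mechanically, here is a minimal NumPy sketch of single-head scaled dot-product attention. The dimensions (d_model = 512, d_k = 64) follow the base model in the "Attention Is All You Need" paper rather than the rough 10,000-by-10,000 figure quoted above, and the random weights stand in for the learned projections that training would fix in place.

```python
# Minimal sketch of single-head scaled dot-product attention.
# d_model=512 and d_k=64 follow the paper's base model; the random
# projection weights below are illustrative stand-ins, not trained values.
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)

# The three learned projection matrices; after training they are fixed.
W_Q = rng.normal(scale=d_model**-0.5, size=(d_model, d_k))
W_K = rng.normal(scale=d_model**-0.5, size=(d_model, d_k))
W_V = rng.normal(scale=d_model**-0.5, size=(d_model, d_k))

def attention(X):
    """X: (seq_len, d_model) token embeddings -> (seq_len, d_k) outputs."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Scores say how strongly each token attends to every other token,
    # which is how the model accounts for word order and context.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = rng.normal(size=(10, d_model))   # a sequence of 10 token embeddings
print(attention(X).shape)            # (10, 64)
```

The key point the speaker is making holds here: once W_Q, W_K, and W_V are learned, they are just fixed matrices applied to every input, and further training on more data simply refines their entries.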