The Multi-Headed Attention Layer in the Encoder
White space is padded with however many extra tokens you need to fill out the total length. Then there's the mask, which just means: don't look at anything in the empty part, the stuff you haven't predicted yet. That mask gets rolled back as you predict more. There's this bypass around the add & norm, and then again there's a bypass into the next multi-headed attention. The probabilities at the end cover every token I know about, every token in my vocabulary.
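To make those pieces concrete, here is a minimal sketch in PyTorch, not the speaker's actual code: it pads a short sequence, builds a padding mask and a causal mask (so positions can't look at padding or at tokens that haven't been predicted yet), adds the residual "bypass" into add & norm, and ends with a softmax that gives a probability for every token in the vocabulary. Names like PAD_ID, VOCAB_SIZE, and D_MODEL are illustrative assumptions, and only one attention block is shown rather than the full stack with the second bypass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_ID = 0          # assumed id used for the padded "white space"
VOCAB_SIZE = 1000   # assumed vocabulary size
D_MODEL = 64        # assumed model width

class TinyDecoderBlock(nn.Module):
    """One masked multi-headed attention step with the residual 'bypass'
    into add & norm, then a softmax over the whole vocabulary."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D_MODEL, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(D_MODEL)
        self.to_vocab = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, x, pad_mask):
        seq_len = x.size(1)
        # Causal mask: True = blocked, i.e. don't look at tokens you
        # haven't predicted yet. As decoding advances and seq_len grows,
        # this mask effectively gets "rolled back".
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
        )
        attended, _ = self.attn(
            x, x, x,
            attn_mask=causal,            # block future positions
            key_padding_mask=pad_mask,   # block padded positions
        )
        x = self.norm(x + attended)      # the "bypass": add input back, then norm
        # A probability for every token in the vocabulary, at every position.
        return F.softmax(self.to_vocab(x), dim=-1)

# Usage: a short sequence padded out to a fixed length.
ids = torch.tensor([[5, 9, 3, PAD_ID, PAD_ID]])
pad_mask = ids.eq(PAD_ID)                                   # True where padded
emb = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=PAD_ID)
probs = TinyDecoderBlock()(emb(ids), pad_mask)
print(probs.shape)  # torch.Size([1, 5, 1000])
```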