
Anton Teaches Packy AI | E1
"Age of Miracles"
The Multi-Headed Attention Layer in the Encoder
White space is padded with however many extra tokens you need to fill out the total length. The mask just means: don't look at anything in the empty part, the stuff you haven't predicted yet. And that mask gets rolled back as you predict more. There's this bypass into the add & norm, and then again there's a bypass around the next multi-headed attention. The probabilities at the end are for every token that I know about, every token in my vocabulary.
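The pieces described here can be sketched in a few lines of PyTorch: padding out a short sequence, a mask over the positions that haven't been predicted yet, the residual "bypass" feeding into an add & norm, and probabilities over the whole vocabulary. This is a minimal illustration only; the sizes, the pad_id, and the random stand-in tensors below are assumptions, not values from the episode.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model, pad_id = 1000, 64, 0             # hypothetical sizes / pad token id
tokens = torch.tensor([[5, 42, 7, pad_id, pad_id]])   # short sequence padded out to length 5

# Padding mask: ignore the "empty" padded positions.
pad_mask = tokens.eq(pad_id)                           # True where the sequence is just padding

# Causal mask: a position may not look at positions it hasn't predicted yet.
# As generation proceeds, more positions become visible ("rolled back").
seq_len = tokens.size(1)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Residual "bypass" into add & norm: the sublayer's input is added back onto
# its output, then layer-normalized, before flowing into the next sublayer.
x = torch.randn(1, seq_len, d_model)                   # stand-in embeddings
sublayer_out = torch.randn(1, seq_len, d_model)        # stand-in attention output
x = F.layer_norm(x + sublayer_out, (d_model,))         # add & norm

# Output head: one score per vocabulary token, softmaxed into probabilities.
logits = x @ torch.randn(d_model, vocab_size)
probs = F.softmax(logits, dim=-1)                      # shape (1, seq_len, vocab_size)
```

In a real decoder the padding and causal masks are combined and applied to the attention scores before the softmax, and the residual bypass repeats around every attention and feed-forward sublayer.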