
Artificial Intelligence & Large Language Models: Oxford Lecture — #35

Manifold

CHAPTER

The Structure of Attention Heads

A typical transformer will have about 100 layers, each with on the order of 100 attention heads. Q and K are like a kind of kernel telling you how to modify x_j and x_i before you compute the inner product. So it's not really looking at all of x_j; it's looking at something internal to x_i and something internal to x_j, and this is a way of combining that information when weighting the combination of those two input vectors. The idea is that there is a specific subspace that a particular K matrix cares about, and then there's a particular subspace that a particular Q matrix cares about.
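
A minimal sketch of the idea described above: the Q and K matrices each project a token's representation into a smaller subspace, and the attention score for a pair of tokens is the inner product of those projections. The names and dimensions (d_model, d_head, W_Q, W_K) are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

d_model, d_head = 512, 64  # full residual width vs. the per-head subspace
rng = np.random.default_rng(0)

# Query and key projection matrices: each picks out the subspace that head cares about.
W_Q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

x_i = rng.standard_normal(d_model)  # destination token representation
x_j = rng.standard_normal(d_model)  # source token representation

# Project each vector into the head's subspace, then compare them with an
# inner product to get the (unnormalized, scaled) attention score.
q_i = x_i @ W_Q
k_j = x_j @ W_K
score = (q_i @ k_j) / np.sqrt(d_head)
print(score)
```

Note that the score depends only on the projected vectors q_i and k_j, not on the full contents of x_i and x_j, which is the sense in which each head only "cares about" its own subspace.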
