
Artificial Intelligence & Large Language Models: Oxford Lecture — #35

Manifold

CHAPTER

The Structure of Attention Heads

A typical transformer will have about 100 layers, each with on the order of 100 attention heads. Q and K are like a kind of kernel telling you how to modify x_j and x_i before you compute the inner product. So it's not really looking at all of x_j; it's looking at something internal to x_i and something internal to x_j, and this is a way of combining that information when weighting the combination of those two input vectors. The idea is that there is a specific subspace that a particular K matrix cares about, and then there's a particular subspace that a particular Q matrix cares about.
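
A minimal sketch of the idea described above: the Q and K matrices each project a token's representation into a smaller subspace, and the attention score for a pair of tokens is the inner product of those projections. The names and dimensions (d_model, d_head, W_Q, W_K) are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

d_model, d_head = 512, 64  # full residual width vs. the per-head subspace
rng = np.random.default_rng(0)

# Query and key projection matrices: each picks out the subspace that head cares about.
W_Q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

x_i = rng.standard_normal(d_model)  # destination token representation
x_j = rng.standard_normal(d_model)  # source token representation

# Project each vector into the head's subspace, then compare them with an
# inner product to get the (unnormalized, scaled) attention score.
q_i = x_i @ W_Q
k_j = x_j @ W_K
score = (q_i @ k_j) / np.sqrt(d_head)
print(score)
```

Note that the score depends only on the projected vectors q_i and k_j, not on the full contents of x_i and x_j, which is the sense in which each head only "cares about" its own subspace.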
