The Cost of Loading Up Tokens on the GPU

The next challenge will be the cost of loading all these tokens onto the GPU, especially if you have to do it over the network. I doubt that an 8K window would help that much, simply because the model is not even going to be dealing with 8K tokens in practice on those tasks. It just remains coherent for a lot longer. And when you get to things like programming tasks, where that may be necessary, it's really helpful. That, to me, is my guess as to why C4 does so well, even though by all rights it is a pretty old, bad dataset.
