In training you do have to run all of them, because they all contribute to all these loss values that you can then use to optimize the model. We want a loss from every byte, yes.

So that is really interesting. This is not a strong theory of mine, but increasingly I see a lot of things that seem isomorphic, or nearly so, to other things. You're shrinking the dimensionality of the input in order to have a more efficient global model, and then chunking for more efficient parallelization. Do you think you could do something similar where, if you divide a transformer…
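The "loss from every byte" point can be illustrated with a minimal sketch of standard next-byte prediction. Everything below is an assumed, hypothetical setup (the ByteLM module, its sizes, and the fake batch are illustrative, not from the episode): the model predicts the next byte at every position, so every byte in the sequence contributes one cross-entropy term and hence gradient signal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 256  # one token per possible byte value

class ByteLM(nn.Module):
    """Hypothetical minimal byte-level LM for illustration only."""

    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, byte_ids):  # byte_ids: (batch, seq)
        x = self.embed(byte_ids)
        # Causal mask so each position only attends to earlier bytes.
        mask = nn.Transformer.generate_square_subsequent_mask(byte_ids.size(1))
        h = self.encoder(x, mask=mask)
        return self.head(h)  # (batch, seq, VOCAB) next-byte logits

# One training step on fake data: the target at each position is the
# *next* byte, so every byte contributes a loss term.
batch = torch.randint(0, VOCAB, (2, 128))  # fake byte sequences
model = ByteLM()
logits = model(batch[:, :-1])              # predict positions 1..127
loss = F.cross_entropy(
    logits.reshape(-1, VOCAB),             # fold (batch, seq) together
    batch[:, 1:].reshape(-1),              # next-byte targets
)
loss.backward()                            # gradients from every byte
```

The reshape folds the batch and sequence dimensions together so the cross-entropy averages over every byte position. Chunking bytes into patches, as discussed above, changes what the global model operates on, but this per-byte loss structure can stay the same.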
