
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference


Adaptive sparsity and approaches for efficient attention in language modeling

- Data sparsity is an interesting aspect to consider.
- Adaptive sparsity can be useful for ignoring irrelevant tokens.
- Better results can be achieved by attending to the right information.
- Locality-sensitive hashing is a technique for clustering queries and keys (see the sketch after this list).
- Attention computation can be limited to tokens within the same cluster.
- Hashing and clustering approaches have shown positive results in some cases.
- The scalability of these approaches for language modeling is still uncertain.
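The clustering idea described above is the one popularized by Reformer-style LSH attention. Below is a minimal PyTorch sketch of the bucketing concept, using sign-random-projection hashing rather than Reformer's exact rotation scheme; the function name `lsh_attention`, the parameter `n_hash_bits`, and the dense-mask formulation are illustrative assumptions, not code from the episode.

```python
import torch
import torch.nn.functional as F

def lsh_attention(q, k, v, n_hash_bits=4, seed=0):
    """Illustrative sketch of LSH-bucketed attention (Reformer-style idea).

    q, k, v: tensors of shape (seq_len, d). Queries and keys are hashed
    with random hyperplanes; a query may only attend to keys that land
    in the same bucket. Hypothetical helper, not the episode's code.
    """
    seq_len, d = q.shape
    g = torch.Generator().manual_seed(seed)
    # Random hyperplanes: the sign pattern of the projections of a
    # vector onto these planes gives its integer bucket id.
    planes = torch.randn(d, n_hash_bits, generator=g)

    def bucket(x):
        bits = (x @ planes > 0).long()              # (seq_len, n_hash_bits)
        weights = 2 ** torch.arange(n_hash_bits)    # binary place values
        return (bits * weights).sum(-1)             # bucket id per token

    qb, kb = bucket(q), bucket(k)
    # Boolean mask: query i may attend to key j only if buckets match.
    same = qb.unsqueeze(1) == kb.unsqueeze(0)       # (seq_len, seq_len)
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~same, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    # Guard: a query whose bucket contains no key would get an all -inf
    # row, so its softmax is NaN; zero out those rows instead.
    empty = ~same.any(dim=1)
    attn = torch.where(empty.unsqueeze(1), torch.zeros_like(attn), attn)
    return attn @ v
```

Note that this dense-mask version still materializes an O(n^2) score matrix, so it only illustrates the clustering idea; practical implementations sort tokens by bucket and restrict attention to chunks of each sorted sequence, which is where the efficiency gain actually comes from.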

