
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference

Generally Intelligent

NOTE

Adaptive sparsity and approaches for efficient attention in language modeling

- Data sparsity is an interesting aspect to consider
- Adaptive sparsity can be useful for ignoring irrelevant tokens
- Better results can be achieved by attending to the right information
- Locality sensitive hashing is a technique for clustering queries and keys
- Attention computation can then be limited to pairs within the same cluster (see the sketch below)
- Hashing and clustering approaches have shown positive results in some cases
- The scalability of these approaches for language modeling is still uncertain
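As a rough illustration of the idea mentioned above, here is a minimal sketch of LSH-bucketed attention, in the spirit of Reformer-style hashing: queries and keys are hashed with random hyperplanes, and attention is computed only among positions that fall in the same bucket. Function names such as `lsh_attention`, the hyperplane hashing scheme, and all parameters are illustrative assumptions, not details from the episode.

```python
# Sketch: LSH-bucketed attention (assumed random-hyperplane hashing; not the
# episode's exact method). Attention is restricted to same-bucket positions.
import numpy as np

def lsh_bucket(x, projections):
    """Assign each row of x a bucket id via sign-of-projection hashing."""
    signs = (x @ projections) > 0                      # (seq, n_bits) booleans
    powers = 2 ** np.arange(signs.shape[1])
    return signs.astype(int) @ powers                  # integer bucket id per position

def lsh_attention(q, k, v, n_bits=4, seed=0):
    """Attend only among positions whose queries and keys share a bucket."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((q.shape[1], n_bits))   # shared random hyperplanes
    q_buckets = lsh_bucket(q, proj)
    k_buckets = lsh_bucket(k, proj)

    out = np.zeros_like(v)
    for b in np.unique(q_buckets):
        qi = np.where(q_buckets == b)[0]
        ki = np.where(k_buckets == b)[0]
        if len(ki) == 0:
            continue                                    # no keys in this bucket; outputs stay zero
        scores = q[qi] @ k[ki].T / np.sqrt(q.shape[1])  # scaled dot-product within bucket
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over bucket members only
        out[qi] = weights @ v[ki]
    return out

# Tiny usage example: 16 tokens with 8-dimensional heads.
q = np.random.randn(16, 8); k = np.random.randn(16, 8); v = np.random.randn(16, 8)
print(lsh_attention(q, k, v).shape)                     # (16, 8)
```

The cost saving comes from replacing the full seq x seq score matrix with several small per-bucket matrices; whether this wins in practice depends on how evenly tokens spread across buckets and on hardware-level constant factors.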

