
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference
Generally Intelligent
Adaptive sparsity and approaches for efficient attention in language modeling
- Data sparsity is an interesting aspect to consider.
- Adaptive sparsity can be useful for ignoring irrelevant tokens.
- Better results can be achieved by attending to the right information.
- Locality sensitive hashing is a technique for clustering queries and keys.
- Attention computation can be limited to pairs within the same cluster.
- Hashing and clustering approaches have shown positive results in some cases.
- The scalability of these approaches for language modeling is still uncertain.
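The clustering idea mentioned above can be illustrated with a minimal sketch: hash queries and keys into buckets using random hyperplane signs (the classic locality sensitive hashing scheme for cosine similarity), then compute softmax attention only within matching buckets. This is an illustrative toy, not the episode's or any specific paper's implementation; the function names `lsh_buckets` and `clustered_attention` are hypothetical.

```python
import numpy as np

def lsh_buckets(vectors, n_hyperplanes=4, seed=0):
    """Hash row vectors into integer buckets via random hyperplane signs.

    Vectors on the same side of every hyperplane (i.e. with high cosine
    similarity) tend to land in the same bucket. Hypothetical helper.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[-1], n_hyperplanes))
    signs = (vectors @ planes) > 0                     # (n, n_hyperplanes) bools
    # Read the sign pattern as a binary bucket id in [0, 2**n_hyperplanes).
    return signs @ (1 << np.arange(n_hyperplanes))

def clustered_attention(q, k, v, n_hyperplanes=4):
    """Softmax attention restricted to query/key pairs sharing an LSH bucket.

    Queries whose bucket contains no keys simply produce a zero output in
    this toy version; real systems use multiple hash rounds to reduce misses.
    """
    qb = lsh_buckets(q, n_hyperplanes)
    kb = lsh_buckets(k, n_hyperplanes)
    out = np.zeros_like(v, dtype=float)
    for b in np.unique(qb):
        qi = np.flatnonzero(qb == b)
        ki = np.flatnonzero(kb == b)
        if len(ki) == 0:
            continue
        scores = q[qi] @ k[ki].T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[qi] = weights @ v[ki]
    return out
```

Because each bucket holds only a fraction of the sequence, the score matrices are much smaller than the full `n × n` attention matrix; the uncertainty raised in the summary is whether the buckets reliably capture the pairs that matter at language-modeling scale.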