
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference
Generally Intelligent
Adaptive sparsity and approaches for efficient attention in language modeling
Sparsity in the data is an interesting aspect to consider
Adaptive sparsity can be useful for ignoring irrelevant tokens
Better results can be achieved by attending to the right information
Locality-sensitive hashing is a technique for clustering queries and keys
Attention computation can then be limited to tokens within the same cluster (see the sketch below)
Hashing and clustering approaches have shown positive results in some cases
The scalability of these approaches for language modeling is still uncertain
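The following is a minimal sketch of the LSH-bucketed attention idea discussed above, not code from the episode: queries and keys are hashed with random hyperplane projections, and each query attends only to keys that land in the same bucket. It assumes separate hashing of queries and keys and a single hash round; practical variants (e.g., Reformer) share the query/key projection, sort into fixed-size chunks, and use multiple hash rounds.

```python
import numpy as np

def lsh_bucket_attention(q, k, v, n_hashes=4, seed=0):
    """q, k, v: (seq_len, dim). Each query attends only to keys in its LSH bucket."""
    seq_len, dim = q.shape
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(dim, n_hashes))        # random hyperplanes

    def bucket_ids(x):
        bits = ((x @ planes) > 0).astype(int)        # sign pattern per position
        return bits @ (1 << np.arange(n_hashes))     # pack bits into an integer id

    qb, kb = bucket_ids(q), bucket_ids(k)
    out = np.zeros_like(v)
    for b in np.unique(qb):
        qi = np.where(qb == b)[0]                    # queries in this bucket
        ki = np.where(kb == b)[0]                    # keys in the same bucket
        if len(ki) == 0:
            continue                                 # no keys hashed here; leave zeros
        scores = q[qi] @ k[ki].T / np.sqrt(dim)      # scaled dot-product within bucket
        scores -= scores.max(axis=-1, keepdims=True) # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[qi] = weights @ v[ki]
    return out

# Example: 128 tokens, 64-dim heads.
q, k, v = (np.random.randn(128, 64) for _ in range(3))
print(lsh_bucket_attention(q, k, v).shape)  # (128, 64)
```

Because similar vectors tend to receive the same hash, each query mostly attends to the keys it would have weighted highly anyway, which is the source of the savings; the open question raised in the episode is how well this scales for language modeling.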