
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference
Generally Intelligent
The Performance of Sparse Attention for Language Models
On the static sparsity, that's not part of the existing open release of FlashAttention. It's more of a proof of concept. We show in the paper, hey, we can do this, and here's an implementation. I haven't had as much time to polish that implementation compared to the dense attention implementation. If you're only dropping 5% or whatever, it probably doesn't make a difference at all, but of course you're not getting any speedup either. What does that curve look like? Yeah.
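A minimal sketch of the idea being discussed, assuming a static block-sparse attention pattern: a fixed boolean block mask decides which (query-block, key-block) tiles get computed at all, so the potential speedup is roughly proportional to the fraction of tiles dropped. This is not the FlashAttention kernel or its API; the function and mask names below are hypothetical and purely illustrative.

```python
# Illustrative block-sparse attention with a static block mask (not FlashAttention).
import math
import torch

def block_sparse_attention(q, k, v, block_mask, block_size):
    """q, k, v: (seq_len, head_dim); block_mask: (n_blocks, n_blocks) bool."""
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size
    scale = 1.0 / math.sqrt(head_dim)
    out = torch.zeros_like(q)
    for i in range(n_blocks):
        q_blk = q[i * block_size:(i + 1) * block_size]
        scores, vals = [], []
        for j in range(n_blocks):
            if not block_mask[i, j]:
                continue  # statically dropped tile: no compute at all
            k_blk = k[j * block_size:(j + 1) * block_size]
            v_blk = v[j * block_size:(j + 1) * block_size]
            scores.append(q_blk @ k_blk.T * scale)
            vals.append(v_blk)
        if not scores:
            continue
        s = torch.cat(scores, dim=1)          # scores over the kept key blocks only
        p = torch.softmax(s, dim=1)
        out[i * block_size:(i + 1) * block_size] = p @ torch.cat(vals, dim=0)
    return out

# Toy usage: 512 tokens, 128-token blocks, drop 1 of 16 tiles (~6% sparsity),
# so only ~6% of the tile work is skipped -- barely a measurable speedup.
torch.manual_seed(0)
q, k, v = (torch.randn(512, 64) for _ in range(3))
mask = torch.ones(4, 4, dtype=torch.bool)
mask[3, 0] = False
out = block_sparse_attention(q, k, v, mask, block_size=128)
```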