
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference
Generally Intelligent
The Performance of Sparse Attention for Language Models
On the static sparsity, that's not part of the existing open release of FlashAttention. It's more of a proof of concept. We show in the paper, hey, we can do this, and here's an implementation. I haven't had as much time to polish that implementation compared to the dense attention implementation. If you're only dropping 5% or whatever, it probably doesn't make a difference at all. Of course, you're also not getting any speedup. What does that curve look like? Yeah.
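To make the tradeoff concrete, here is a minimal sketch of block-sparse attention with a static block mask. This is not the FlashAttention kernel discussed in the episode; it is an illustrative PyTorch reference where the function name, block size, and mask layout are assumptions. Skipping a block avoids its compute entirely, so the best-case speedup is roughly 1 / (fraction of blocks kept), which is why dropping only 5% of blocks barely moves the needle.

```python
# Illustrative sketch only: a naive block-sparse attention with a static
# block mask, assuming each query block attends to at least one key block.
import torch

def block_sparse_attention(q, k, v, block_mask, block_size=64):
    """q, k, v: (seq_len, head_dim); block_mask: (n_blocks, n_blocks) bool."""
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    for i in range(n_blocks):
        qi = q[i * block_size:(i + 1) * block_size]   # current query block
        scores, values = [], []
        for j in range(n_blocks):
            if not block_mask[i, j]:
                continue                               # dropped block: no compute at all
            kj = k[j * block_size:(j + 1) * block_size]
            vj = v[j * block_size:(j + 1) * block_size]
            scores.append(qi @ kj.T * scale)
            values.append(vj)
        # softmax over the kept key blocks only, then weighted sum of their values
        s = torch.cat(scores, dim=-1).softmax(dim=-1)
        out[i * block_size:(i + 1) * block_size] = s @ torch.cat(values, dim=0)
    return out
```

With a mask that keeps 95% of the blocks, the inner loop still runs over nearly every block, so wall-clock time is essentially unchanged; meaningful speedups only appear once a large fraction of blocks is dropped.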