
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference

Generally Intelligent

The Performance of Sparse Attention for Language Models

On the static sparsity, that's not part of the existing open-source FlashAttention release. It's more of a proof of concept. We show in the paper: hey, we can do this, here's an implementation. I haven't had as much time to polish that implementation compared to the dense attention implementation. If you're only dropping 5% or whatever, it probably doesn't make a difference at all. Of course, you're not getting any speedup. What does that curve look like? Yeah.
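The point about dropping only 5% can be made concrete with a rough sketch. Below is a minimal, naive block-sparse attention in NumPy; this is not the FlashAttention kernel or its API, and the block size, mask pattern, and shapes are illustrative assumptions. The idea is simply that compute is skipped for masked-out key/value blocks, so the best-case speedup is roughly 1 / (fraction of blocks kept), which is only about 1.05x when 5% of blocks are dropped.

```python
# Minimal sketch of block-sparse attention (illustrative only, not FlashAttention).
# Block size, sequence length, and the random mask are assumptions for the example.
import numpy as np

def block_sparse_attention(q, k, v, block_mask, block_size):
    """Naive block-sparse attention: skip (query-block, key-block) pairs whose
    entry in block_mask is False. q, k, v have shape (seq_len, d)."""
    seq_len, d = q.shape
    n_blocks = seq_len // block_size
    out = np.zeros_like(q)
    for i in range(n_blocks):
        q_blk = q[i * block_size:(i + 1) * block_size]
        # Gather only the key/value blocks this query block attends to.
        kept = [j for j in range(n_blocks) if block_mask[i, j]]
        if not kept:
            continue
        k_blk = np.concatenate([k[j * block_size:(j + 1) * block_size] for j in kept])
        v_blk = np.concatenate([v[j * block_size:(j + 1) * block_size] for j in kept])
        scores = q_blk @ k_blk.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[i * block_size:(i + 1) * block_size] = weights @ v_blk
    return out

rng = np.random.default_rng(0)
seq_len, d, block_size = 256, 64, 64
q, k, v = (rng.standard_normal((seq_len, d)) for _ in range(3))
n_blocks = seq_len // block_size
block_mask = rng.random((n_blocks, n_blocks)) > 0.05   # drop roughly 5% of blocks
out = block_sparse_attention(q, k, v, block_mask, block_size)
kept_frac = block_mask.mean()
print(f"blocks kept: {kept_frac:.0%}, ideal speedup ~{1 / kept_frac:.2f}x")
```

In practice the measured speedup is even smaller than this ideal figure, since kernel overheads and memory traffic don't shrink proportionally, which is why light sparsity "probably doesn't make a difference at all."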
