
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference

Generally Intelligent


The Performance of Sparse Attention for Language Models

On the static sparsity, that's not part of the existing open release of FlashAttention. It is more of a proof of concept. We show in the paper, hey, we can do this. Here's an implementation. I haven't had as much time to polish that implementation compared to the dense attention implementation. If you're only dropping 5% or whatever, it probably doesn't make a difference at all. Of course, you're not getting any speedup. What does that curve look like? Yeah.
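To make the point concrete, here is a minimal PyTorch sketch of static block-sparse attention, not the FlashAttention kernel or its released API; the function name `block_sparse_attention` and the `block_mask` argument are illustrative assumptions. Whole key/value blocks marked False in the mask are skipped, so the score computation scales with the fraction of blocks kept: drop only 5% of blocks and the work barely changes, which is why the speedup is negligible in that regime.

```python
# Illustrative sketch only (assumed names); not FlashAttention's implementation.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_mask, block_size=64):
    """Attention that skips whole key/value blocks per query block.

    q, k, v: (seq_len, head_dim) tensors.
    block_mask: (n_blocks, n_blocks) bool tensor; True means the
    (query block, key block) pair is computed, False means it is skipped.
    """
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    for i in range(n_blocks):
        qi = q[i * block_size:(i + 1) * block_size]      # current query block
        kept = block_mask[i].nonzero(as_tuple=True)[0]   # key blocks kept for this row
        if kept.numel() == 0:
            continue
        # Gather only the kept key/value blocks; the matmul cost below is
        # proportional to how many blocks survive the static mask.
        k_idx = torch.cat([torch.arange(j * block_size, (j + 1) * block_size) for j in kept])
        kj, vj = k[k_idx], v[k_idx]
        attn = F.softmax((qi @ kj.T) * scale, dim=-1)
        out[i * block_size:(i + 1) * block_size] = attn @ vj
    return out

# Example: with a mask that keeps 95% of blocks (drops 5%), the score
# computation shrinks by only ~5%, so the end-to-end speedup is tiny.
seq_len, head_dim, block_size = 512, 64, 64
q, k, v = (torch.randn(seq_len, head_dim) for _ in range(3))
n_blocks = seq_len // block_size
mask = torch.rand(n_blocks, n_blocks) < 0.95
out = block_sparse_attention(q, k, v, mask, block_size)
```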

