Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference

Generally Intelligent

00:00

MLPerf: A Low-Level Implementation of BERT From NVIDIA

Chris Ray: I was blown away at all of the low-level implementation that was going on in MLPerf. He wanted to make it scale to much longer sequence lengths. So he spent two, three months just doing that, with the help of Dan Fu as well. We finally have something that was pretty nice. It's now part of PyTorch 2.0.
