
Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference
Generally Intelligent
MLPerf: A Low-Level Implementation of BERT From NVIDIA
Chris Ré: I was blown away by all of the low-level implementation that was going on in MLPerf. He wanted to make it scale to much longer sequence lengths, so he spent two or three months just doing that, and with the help of Dan Fu as well, we finally had something that was pretty nice. It's now part of PyTorch 2.0.