FlashAttention 2: making Transformers 800% faster w/o approximation - with Tri Dao of Together AI

Latent Space: The AI Engineer Podcast

Optimizing with FlashAttention 2

This chapter covers the release of FlashAttention 2 and its integration with NVIDIA's CUTLASS library, which improves GPU efficiency for matrix operations. It discusses the implications of hardware dependencies, advances in compilers, and the strategies AI hardware companies are pursuing amid rapid technological change.
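To make the efficiency gain concrete, here is an illustrative NumPy sketch (not the actual CUDA kernel) of the core idea behind FlashAttention: processing the key/value matrices in blocks with an online softmax, so the full N×N attention score matrix is never materialized. The function names, block size, and shapes are this sketch's own choices, not anything from the episode.

```python
import numpy as np

def naive_attention(q, k, v):
    # Reference: softmax(Q K^T / sqrt(d)) V, materializing the full score matrix.
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, block=4):
    # FlashAttention-style pass: walk over K/V in blocks, keeping a running
    # row-wise max and softmax denominator so only a block of scores exists
    # in memory at a time.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v, dtype=np.float64)
    m = np.full(n, -np.inf)   # running row max (for numerical stability)
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                 # scores for this block only
        m_new = np.maximum(m, s.max(axis=-1))
        alpha = np.exp(m - m_new)              # rescale earlier accumulators
        p = np.exp(s - m_new[:, None])
        l = l * alpha + p.sum(axis=-1)
        out = out * alpha[:, None] + p @ vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
```

The rescaling by `alpha` is what lets the running sums stay correct as the row maximum is updated block by block; on a GPU this tiling is what keeps the working set in fast on-chip SRAM instead of HBM.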

