Latent Space: The AI Engineer Podcast

FlashAttention 2: making Transformers 800% faster w/o approximation - with Tri Dao of Together AI

Jul 26, 2023
Tri Dao, a recent Stanford PhD graduate and Chief Scientist at Together AI, discusses his groundbreaking work on FlashAttention-2, which speeds up Transformer training and inference without approximation. He explains how FlashAttention improves efficiency by reducing attention's memory footprint from quadratic to linear in sequence length, and why the GPU memory hierarchy is central to performance. The conversation also touches on balancing classical systems techniques with modern AI innovations. Lastly, Tri reflects on the fast-moving landscape of AI research and the rise of open-source contributions in the field.
AI Snips
ANECDOTE

Tri Dao's Major Switch

  • Tri Dao initially intended to major in economics during college.
  • He switched to math after taking math classes during his first week at Stanford.
INSIGHT

FlashAttention's Core Innovation

  • FlashAttention speeds up Transformers by focusing on memory efficiency rather than on approximation.
  • It reorganizes the attention computation to be hardware-friendly (IO-aware), yielding wall-clock speedups without any loss of precision (see the sketch below).
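To make the memory-efficiency point concrete, here is a minimal NumPy sketch of blockwise attention with an online softmax, the idea FlashAttention builds on: the full N x N score matrix is never materialized, so extra memory grows linearly with sequence length. This is an illustrative sketch, not Tri Dao's CUDA kernel; all function and variable names are hypothetical.

```python
# Minimal sketch (not the actual FlashAttention kernel): exact attention
# computed one key/value block at a time with an online softmax, so the
# full N x N score matrix is never stored.
import numpy as np

def blockwise_attention(Q, K, V, block_size=128):
    """Exact softmax attention, processing keys/values block by block."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(N)           # running softmax denominator per query row

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]      # (B, d) block of keys
        Vb = V[start:start + block_size]      # (B, d) block of values
        S = (Q @ Kb.T) * scale                # (N, B) scores for this block only

        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)   # rescale earlier partial sums
        P = np.exp(S - new_max[:, None])         # block softmax numerator

        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Toy usage: random Q, K, V of length 1024, head dimension 64.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
O = blockwise_attention(Q, K, V)
```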
INSIGHT

Exact vs. Approximate Attention

  • FlashAttention computes exact attention; it does not rely on approximation methods such as sparse or low-rank attention.
  • Approximation methods often sacrifice model quality and, perhaps surprisingly, do not always improve wall-clock speed (see the check below).
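A quick numerical illustration of the "exact, not approximate" point: an online (streaming) softmax computed block by block agrees with the ordinary softmax up to floating-point rounding. This is a toy check under assumed toy data, not FlashAttention itself.

```python
# Toy check: an online softmax over blocks matches the standard softmax
# to floating-point precision, i.e. the blockwise trick loses no accuracy.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal(1024)

# Standard softmax over the full vector at once.
full = np.exp(scores - scores.max())
full /= full.sum()

# Online softmax: stream over 8 blocks, keeping a running max and sum.
m, s = -np.inf, 0.0
for block in np.split(scores, 8):
    new_m = max(m, block.max())
    s = s * np.exp(m - new_m) + np.exp(block - new_m).sum()
    m = new_m
online = np.exp(scores - m) / s

print(np.max(np.abs(full - online)))   # on the order of machine epsilon
```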