

Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference
Aug 9, 2023
Chapters
Introduction
00:00 • 5min
The Importance of Hardware in the Operations Field
05:16 • 2min
The Theory of Modern Data Augmentation
07:07 • 2min
The Kernel Theory of Data Augmentation
09:19 • 3min
The Importance of Data Augmentation
12:17 • 2min
The Butterfly Factorization (2019)
13:50 • 4min
The Evolution of Butterfly Matrices in Machine Learning
17:46 • 3min
The Importance of Sparsity in GPUs
20:31 • 2min
The Impact of Sparse Training on Language Models
22:41 • 3min
The Inductive Bias in Language Model Training
25:50 • 2min
The Future of Intuitive Learning
27:37 • 4min
How to Preserve Quality When Zeroing Out a Percentage of the Entries
31:49 • 2min
The Future of Recurrent Models
33:25 • 2min
The Evolution of Attention in Language Models
35:52 • 3min
The Opposition to Attention
38:31 • 2min
The Importance of Larger Models in Research
40:34 • 3min
The Future of Recurrent Networks
44:01 • 2min
How to Scale to Longer Context Lengths
46:01 • 4min
MLPerf: A Low-Level Implementation of BERT From NVIDIA
49:42 • 4min
The Importance of Static Sparsity in Training
53:12 • 3min
The Performance of Sparse Attention for Language Models
55:46 • 2min
The Right Sparsity Schedule for Language Models
57:24 • 2min
The Importance of Adaptive Sparsity
59:14 • 2min
The Future of FlashAttention
01:01:21 • 4min
The Future of Language Models
01:05:38 • 2min
The Future of Inference
01:08:03 • 2min
The Importance of Feedback in Inference
01:09:41 • 3min
The Future of Machine Learning
01:12:37 • 3min
How to Get Better at Engineering
01:15:09 • 3min
The Importance of Optimizing Hardware for Faster Attention
01:18:16 • 2min