Generally Intelligent

Episode 33: Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference

Aug 9, 2023
Chapters
1. Introduction (00:00 • 5min)
2. The Importance of Hardware in the Operations Field (05:16 • 2min)
3. The Theory of Modern Data Augmentation (07:07 • 2min)
4. The Kernel Theory of Data Augmentation (09:19 • 3min)
5. The Importance of Data Augmentation (12:17 • 2min)
6. The Butterfly Factorization, 2019 (13:50 • 4min)
7. The Evolution of Butterfly Matrices in Machine Learning (17:46 • 3min)
8. The Importance of Sparsity in GPUs (20:31 • 2min)
9. The Impact of Sparse Training on Language Models (22:41 • 3min)
10. The Inductive Bias in Language Model Training (25:50 • 2min)
11. The Future of Intuitive Learning (27:37 • 4min)
12. How to Preserve Quality When Zeroing Out a Percentage of the Entries (31:49 • 2min)
13. The Future of Recurrent Models (33:25 • 2min)
14. The Evolution of Attention in Language Models (35:52 • 3min)
15. The Opposition to Attention (38:31 • 2min)
16. The Importance of Larger Models in Research (40:34 • 3min)
17. The Future of Recurrent Networks (44:01 • 2min)
18. How to Scale to Longer Context Lengths (46:01 • 4min)
19. MLPerf: A Low-Level Implementation of BERT From NVIDIA (49:42 • 4min)
20. The Importance of Static Sparsity in Training (53:12 • 3min)
21. The Performance of Sparse Attention for Language Models (55:46 • 2min)
22. The Right Sparsity Schedule for Language Models (57:24 • 2min)
23. The Importance of Adaptive Sparsity (59:14 • 2min)
24. The Future of FlashAttention (01:01:21 • 4min)
25. The Future of Language Models (01:05:38 • 2min)
26. The Future of Inference (01:08:03 • 2min)
27. The Importance of Feedback in Inference (01:09:41 • 3min)
28. The Future of Machine Learning (01:12:37 • 3min)
29. How to Get Better at Engineering (01:15:09 • 3min)
30. The Importance of Optimizing Hardware for Faster Attention (01:18:16 • 2min)