Recurrent neural networks offer potential advantages over attention mechanisms in language modeling for specific applications and context lengths.
Block attention, a hardware-efficient way of computing attention, is faster and more memory-efficient than standard implementations.
Implementing machine learning ideas efficiently requires integration across software frameworks, compilers, and hardware, with a particular focus on making inference faster for long-context applications.
Deep dives
The motivation to explore alternative approaches to attention
The researchers investigated alternatives to attention because it becomes a bottleneck when scaling models to longer sequence lengths. They found that attention approximation methods were both lower in quality and slower in wall-clock time than standard attention, which led them to explore more hardware-efficient approaches instead.
Exploring the use of recurrent models and their benefits
The researchers explored the use of recurrent neural networks as an alternative to attention in language models. By replacing attention layers with recurrent layers, or interleaving recurrent and transformer layers, they observed promising results. While transformers are likely to remain dominant, recurrent models offer potential advantages for specific applications and context lengths.
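To make the interleaving idea concrete, here is a minimal PyTorch sketch (not the researchers' actual architecture) that alternates recurrent blocks with standard transformer layers; the `RecurrentBlock` and `InterleavedStack` names are hypothetical and used only for illustration.

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """Residual block where a GRU sub-layer stands in for self-attention,
    followed by the usual position-wise feed-forward sub-layer."""
    def __init__(self, d_model: int, d_ff: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h, _ = self.rnn(self.norm1(x))      # recurrence in place of attention
        x = x + h
        return x + self.ff(self.norm2(x))

class InterleavedStack(nn.Module):
    """Alternate recurrent blocks with standard transformer layers."""
    def __init__(self, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            RecurrentBlock(d_model) if i % 2 == 0
            else nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for i in range(n_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 128, 256)                # (batch, sequence, d_model)
print(InterleavedStack()(x).shape)          # torch.Size([2, 128, 256])
```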
Block attention and its contribution to speed and memory efficiency
The researchers tackled the challenge of scaling attention to longer sequences by designing block attention: attention over a long sequence is decomposed into attention over shorter blocks, an approach inspired by techniques from machine learning performance benchmarks. With a careful implementation that leverages kernel fusion and softmax decomposition, they achieved significant speedups and memory that scales linearly with sequence length, making block attention faster and more memory-efficient than standard attention implementations.
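The softmax decomposition at the heart of this can be illustrated with a short, unoptimized Python sketch: scores are computed one key/value block at a time while running max and sum statistics are maintained, so the full score matrix is never materialized and extra memory grows linearly with sequence length. The real speedups come from fusing these steps into a single GPU kernel; the function below only demonstrates the arithmetic.

```python
import torch

def blockwise_attention(q, k, v, block_size=128):
    """Attention computed over key/value blocks with a running softmax,
    so the full (N x N) score matrix is never materialized."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(v)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, n, block_size):
        kb = k[start:start + block_size]          # (B, d) block of keys
        vb = v[start:start + block_size]          # (B, d) block of values
        scores = (q @ kb.T) * scale               # (N, B) partial scores
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        # rescale previous accumulators to the new running max (softmax decomposition)
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4))  # True
```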
Impact of block attention on hardware and future research directions
Block attention computes attention faster and with less memory than standard implementations. It has been integrated into PyTorch 2.0 and is widely used in model training. While transformers are likely to remain dominant, block attention offers an alternative to standard attention for specific applications, and it is part of ongoing research into hardware-friendly attention mechanisms.
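For reference, a minimal usage sketch of the fused attention path that PyTorch 2.0 exposes as `torch.nn.functional.scaled_dot_product_attention`, which dispatches to hardware-efficient attention kernels of this kind when the backend supports them:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) layout expected by the fused kernel
q, k, v = (torch.randn(2, 8, 1024, 64) for _ in range(3))

# Dispatches to a fused, memory-efficient attention kernel when available,
# otherwise falls back to the standard math implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```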
Focus on System Efficiency
The episode highlights the importance of system efficiency in machine learning. The speaker emphasizes the need for integration across the entire stack, from software frameworks to compilers and hardware, so that new ideas can be implemented efficiently. He discusses the challenges of modifying architectures and the value of hardware designs that cater specifically to inference. Making inference faster and more efficient, especially for long-context applications, is identified as a key area of interest.
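As one illustrative example of framework/compiler integration (an assumption of ours, not something walked through in the episode), PyTorch 2.0's `torch.compile` optimizes a model for the target hardware without any change to the model definition:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
compiled = torch.compile(model)   # compiler-level optimization, same model code

x = torch.randn(8, 512)
y = compiled(x)                   # first call triggers kernel generation
print(y.shape)                    # torch.Size([8, 512])
```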
Exploring Model Diversity
The episode also explores the potential of different architectures and approaches to language modeling. While the Transformer architecture has been widely successful, the speaker is optimistic about developing alternative architectures that cater to specific needs and applications. He suggests a future in which a strong base model, such as a Transformer, is augmented with additional capabilities through post-training techniques. The goal is model diversity: hooks for customization, personalization, reasoning, and other specialized tasks, allowing a more flexible and powerful approach to language modeling.
Tri Dao is a PhD student at Stanford, co-advised by Stefano Ermon and Chris Re. He’ll be joining Princeton as an assistant professor next year. He works at the intersection of machine learning and systems, currently focused on efficient training and long-range context.
About Generally Intelligent
We started Generally Intelligent because we believe that software with human-level intelligence will have a transformative impact on the world. We’re dedicated to ensuring that that impact is a positive one.
We have enough funding to freely pursue our research goals over the next decade, and our backers include Y Combinator, researchers from OpenAI, Astera Institute, and a number of private individuals who care about effective altruism and scientific research.
Our research is focused on agents for digital environments (ex: browser, desktop, documents), using RL, large language models, and self-supervised learning. We're excited about opportunities to use simulated data, network architecture search, and a good theoretical understanding of deep learning to make progress on these problems. We take a focused, engineering-driven approach to research.