
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

Lex Fridman Podcast

Transformer Memory Cost

  • The attention operator in transformers has a memory cost that grows quadratically with context length.
  • Longer inputs therefore increase memory usage sharply, motivating subquadratic and linear attention variants (see the rough sketch below).
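
A minimal back-of-the-envelope sketch of why this matters (not from the episode; the hidden size and float32 storage are assumed purely for illustration): the attention score matrix Q Kᵀ is seq_len × seq_len, so its memory grows quadratically with context length, while the Q, K, and V activations themselves grow only linearly.

```python
# Rough illustration with assumed numbers: the attention score matrix
# Q @ K^T grows quadratically with context length, while the Q, K, V
# activations grow only linearly.
BYTES_PER_FLOAT = 4   # float32
D_MODEL = 128         # assumed hidden size, for illustration only

for seq_len in (1_024, 8_192, 65_536):
    qkv_bytes = 3 * seq_len * D_MODEL * BYTES_PER_FLOAT   # linear in seq_len
    scores_bytes = seq_len * seq_len * BYTES_PER_FLOAT    # quadratic in seq_len
    print(f"context {seq_len:>6}: Q/K/V ~ {qkv_bytes / 1e6:6.1f} MB, "
          f"scores ~ {scores_bytes / 1e9:6.2f} GB")
```

At a 64k-token context the score matrix alone reaches tens of gigabytes per head under these assumptions, which is why subquadratic and linear attention variants are attractive for long inputs.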