

Ep 74: Chief Scientist of Together.AI Tri Dao On The End of Nvidia's Dominance, Why Inference Costs Fell & The Next 10X in Speed
Sep 10, 2025
Tri Dao, Chief Scientist at Together AI and a professor at Princeton, is a pioneer behind FlashAttention and Mamba. He discusses the roughly 100x drop in inference costs since ChatGPT launched, driven by hardware-software co-design and memory optimization. Dao predicts Nvidia's dominance will wane within 2-3 years as specialized chips emerge. He also discusses how AI models are starting to boost expert-level productivity, the difficulty of generating high-quality training data across domains, and why he expects another 10x cost reduction ahead.
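To make the "memory optimization" point concrete, here is a minimal back-of-envelope sketch in Python with made-up model shapes (32 heads, fp16, hypothetical tile sizes): exact attention naively materializes a score matrix that grows with the square of sequence length, while a tiled kernel in the spirit of FlashAttention keeps only small fixed-size tiles in fast on-chip memory.

```python
# Illustrative only: back-of-envelope memory math for why attention
# needed "memory optimization". All shapes are hypothetical.
BYTES_FP16 = 2  # fp16 = 2 bytes per element

def naive_attn_matrix_bytes(seq_len: int, n_heads: int) -> int:
    # Materializing the full S = Q @ K^T score matrix costs
    # O(seq_len^2) memory per head.
    return seq_len * seq_len * n_heads * BYTES_FP16

def tiled_working_set_bytes(tile: int, head_dim: int) -> int:
    # A tiled kernel keeps only Q, K, V tiles plus a tile of scores
    # in fast memory, independent of total sequence length.
    return (3 * tile * head_dim + tile * tile) * BYTES_FP16

for seq_len in (4_096, 32_768, 131_072):
    full = naive_attn_matrix_bytes(seq_len, n_heads=32)
    print(f"seq={seq_len:>7,}: full score matrix ~ {full / 2**30:8.1f} GiB, "
          f"tiled working set ~ {tiled_working_set_bytes(128, 128) / 2**10:.0f} KiB")
```

At 131k tokens the naive score matrix alone would need about a terabyte, while the tiled working set stays at a fixed 128 KiB; that gap is a large part of why inference costs fell so quickly.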
AI Snips
Architecture Appears Stable But Changes Matter
- Transformer architectures have broadly stabilized at a high level, but many important internal variations continue to change workload characteristics.
- These micro-changes make chip design and optimization harder because performance depends on fine-grained model details; the sketch below illustrates one such detail.
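As one illustration of how a "small" internal variation reshapes the workload, here is a hedged sketch (hypothetical, roughly 70B-scale shapes, not any specific model) of how switching from multi-head attention to grouped-query attention changes the KV-cache footprint a serving chip must hold:

```python
# Hypothetical numbers: how a single architectural knob (number of
# KV heads) shifts the memory footprint a chip must serve.
BYTES_FP16 = 2

def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, batch):
    # Two tensors (K and V) are cached per layer per token.
    b = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * BYTES_FP16
    return b / 2**30

# Same transformer "at a high level", different attention variant:
for name, kv_heads in [("MHA (64 KV heads)", 64),
                       ("GQA ( 8 KV heads)", 8)]:
    print(name, f"-> {kv_cache_gib(80, kv_heads, 128, 8192, 32):.0f} GiB KV cache")
```

An 8x swing in cache size (640 GiB vs. 80 GiB in this toy configuration) can flip a deployment from memory-capacity-bound to compute-bound, which is exactly why hardware tuned for one variant can underperform on the next.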
Inference Will Become Multi-Silicon
- The inference market will diversify as workloads split into low-latency agents, high-throughput batch, and interactive chatbots.
- Specialized chips and stacks will emerge to serve these distinct performance profiles; the toy roofline model below shows why one chip struggles to win every regime.
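A toy roofline calculation makes the split concrete. The chip specs below are invented round numbers (1,000 TFLOP/s fp16, 3 TB/s HBM), not any real accelerator: at batch 1, decoding is bound by streaming the weights from memory (the latency-sensitive agent and chatbot regime), while large batches become compute-bound (the high-throughput batch regime).

```python
# Rough roofline intuition for why one chip can't win every regime.
# Hypothetical accelerator: 1,000 TFLOP/s fp16 compute, 3 TB/s HBM.
PEAK_FLOPS = 1.0e15
PEAK_BW = 3.0e12

def decode_step_s(params_billion: float, batch: int) -> float:
    # Per decode step: ~2 FLOPs per parameter per sequence, but the
    # fp16 weights are read from HBM once per step regardless of
    # batch size (KV-cache traffic ignored for simplicity).
    weight_bytes = params_billion * 1e9 * 2
    flops = 2 * params_billion * 1e9 * batch
    return max(flops / PEAK_FLOPS, weight_bytes / PEAK_BW)

for batch in (1, 64, 512):
    t = decode_step_s(70, batch)
    print(f"batch={batch:>4}: {1/t:6.1f} steps/s, {batch/t:10,.0f} total tokens/s")
```

In this toy model, batch 1 and batch 64 run at the same steps per second (both bandwidth-bound), while batch 512 trades per-step latency for far higher aggregate throughput; silicon optimized for one end of that trade-off looks very different from silicon optimized for the other.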
Place Focused Hardware Bets
- Chip startups must place focused bets on particular workloads (e.g., video, agents, batch inference) rather than building general-purpose chips.
- If you don't specialize, incumbents will out-execute you on general workloads.