

Ep 74: Chief Scientist of Together.AI Tri Dao On The End of Nvidia's Dominance, Why Inference Costs Fell & The Next 10X in Speed
Sep 10, 2025
Tri Dao, Chief Scientist at Together AI and a professor at Princeton, is a pioneer behind FlashAttention and Mamba. He discusses the roughly 100x drop in inference costs since ChatGPT launched, driven by hardware-software co-design and memory optimization. Dao predicts Nvidia's dominance will wane within 2-3 years as specialized chips emerge. He also discusses how AI models are starting to boost expert-level productivity, the difficulty of generating high-quality training data across domains, and why he expects another 10x cost reduction ahead.
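To make the "memory optimization" point concrete, here is a minimal back-of-envelope sketch in Python with made-up model shapes (32 heads, fp16, hypothetical tile sizes): exact attention naively materializes a score matrix that grows with the square of sequence length, while a tiled kernel in the spirit of FlashAttention keeps only small fixed-size tiles in fast on-chip memory.

```python
# Illustrative only: back-of-envelope memory math for why attention
# needed "memory optimization". All shapes are hypothetical.
BYTES_FP16 = 2  # fp16 = 2 bytes per element

def naive_attn_matrix_bytes(seq_len: int, n_heads: int) -> int:
    # Materializing the full S = Q @ K^T score matrix costs
    # O(seq_len^2) memory per head.
    return seq_len * seq_len * n_heads * BYTES_FP16

def tiled_working_set_bytes(tile: int, head_dim: int) -> int:
    # A tiled kernel keeps only Q, K, V tiles plus a tile of scores
    # in fast memory, independent of total sequence length.
    return (3 * tile * head_dim + tile * tile) * BYTES_FP16

for seq_len in (4_096, 32_768, 131_072):
    full = naive_attn_matrix_bytes(seq_len, n_heads=32)
    print(f"seq={seq_len:>7,}: full score matrix ~ {full / 2**30:8.1f} GiB, "
          f"tiled working set ~ {tiled_working_set_bytes(128, 128) / 2**10:.0f} KiB")
```

At 131k tokens the naive score matrix alone would need about a terabyte, while the tiled working set stays at a fixed 128 KiB; that gap is a large part of why inference costs fell so quickly.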
AI Snips
Architecture Appears Stable But Changes Matter
- Transformer architectures have broadly stabilized at a high level, but many important internal variations continue to change workload characteristics.
- These micro-changes make chip design and optimization harder because performance depends on fine-grained model details; the sketch below illustrates one such detail.
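As one illustration of how a "small" internal variation reshapes the workload, here is a hedged sketch (hypothetical, roughly 70B-scale shapes, not any specific model) of how switching from multi-head attention to grouped-query attention changes the KV-cache footprint a serving chip must hold:

```python
# Hypothetical numbers: how a single architectural knob (number of
# KV heads) shifts the memory footprint a chip must serve.
BYTES_FP16 = 2

def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, batch):
    # Two tensors (K and V) are cached per layer per token.
    b = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * BYTES_FP16
    return b / 2**30

# Same transformer "at a high level", different attention variant:
for name, kv_heads in [("MHA (64 KV heads)", 64),
                       ("GQA ( 8 KV heads)", 8)]:
    print(name, f"-> {kv_cache_gib(80, kv_heads, 128, 8192, 32):.0f} GiB KV cache")
```

An 8x swing in cache size (640 GiB vs. 80 GiB in this toy configuration) can flip a deployment from memory-capacity-bound to compute-bound, which is exactly why hardware tuned for one variant can underperform on the next.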
Inference Will Become Multi-Silicon
- The inference market will diversify as workloads split into low-latency agents, high-throughput batch, and interactive chatbots.
- Specialized chips and stacks will emerge to serve these distinct performance profiles; the toy roofline model below shows why one chip struggles to win every regime.
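A toy roofline calculation makes the split concrete. The chip specs below are invented round numbers (1,000 TFLOP/s fp16, 3 TB/s HBM), not any real accelerator: at batch 1, decoding is bound by streaming the weights from memory (the latency-sensitive agent and chatbot regime), while large batches become compute-bound (the high-throughput batch regime).

```python
# Rough roofline intuition for why one chip can't win every regime.
# Hypothetical accelerator: 1,000 TFLOP/s fp16 compute, 3 TB/s HBM.
PEAK_FLOPS = 1.0e15
PEAK_BW = 3.0e12

def decode_step_s(params_billion: float, batch: int) -> float:
    # Per decode step: ~2 FLOPs per parameter per sequence, but the
    # fp16 weights are read from HBM once per step regardless of
    # batch size (KV-cache traffic ignored for simplicity).
    weight_bytes = params_billion * 1e9 * 2
    flops = 2 * params_billion * 1e9 * batch
    return max(flops / PEAK_FLOPS, weight_bytes / PEAK_BW)

for batch in (1, 64, 512):
    t = decode_step_s(70, batch)
    print(f"batch={batch:>4}: {1/t:6.1f} steps/s, {batch/t:10,.0f} total tokens/s")
```

In this toy model, batch 1 and batch 64 run at the same steps per second (both bandwidth-bound), while batch 512 trades per-step latency for far higher aggregate throughput; silicon optimized for one end of that trade-off looks very different from silicon optimized for the other.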
Place Focused Hardware Bets
- Chip startups must place focused bets on particular workloads (e.g., video, agents, batch inference) rather than building general-purpose chips.
- If you don't specialize, incumbents will out-execute you on general workloads.