Latent Space: The AI Engineer Podcast

Efficiency is Coming: 3000x Faster, Cheaper, Better AI Inference from Hardware Improvements, Quantization, and Synthetic Data Distillation

Sep 3, 2024
Nyla Worker, a Senior PM at Nvidia with a background in optimizing AI models at Google and eBay, shares insights on dramatic advances in AI efficiency and inference. The discussion highlights staggering reductions in the cost and time of training and serving models, with examples like the Cerebras platform reaching record inference speeds. The conversation also covers optimizing large language models, the potential of 3D conversational AI, and the future of digital personas in sectors such as healthcare.
ANECDOTE

From Astrophysics to AI

  • Nyla Worker transitioned from astrophysics to AI after finding a 1996 paper using AI for image classification.
  • Manually classifying images of space led her to explore AI for automation.
ANECDOTE

ResNet-50 Optimization at eBay

  • At eBay, Nyla optimized ResNet-50 inference on a V100 GPU for image search, raising throughput from one to four images within the same 7-millisecond latency budget.
  • This optimization was crucial for meeting human-perceived latency requirements.
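A rough sketch of what that gain means in images per second, using the numbers from the episode (the fixed 7 ms latency window and the per-window image counts; the helper function and its names are illustrative, not from the episode):

```python
# Illustrative arithmetic on the eBay ResNet-50 numbers: the human-perceived
# latency budget stays fixed at 7 ms, but the optimized pipeline processes
# 4 images per window instead of 1.

LATENCY_BUDGET_S = 0.007  # 7 ms latency target per inference window

def throughput_per_second(images_per_window: int,
                          window_s: float = LATENCY_BUDGET_S) -> float:
    """Images served per second if each window handles `images_per_window`."""
    return images_per_window / window_s

before = throughput_per_second(1)  # ~142.9 images/s
after = throughput_per_second(4)   # ~571.4 images/s

print(f"before: {before:.1f} img/s, "
      f"after: {after:.1f} img/s, "
      f"speedup: {after / before:.0f}x")
```

The point of the sketch: because the latency budget is held constant, a 4x increase in per-window batch throughput translates directly into a 4x increase in images served per second.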
INSIGHT

Optimizing for Hardware Advancements

  • Hardware advancements significantly impact optimization strategies: a ResNet-50 task that required a V100 in 2018 can now run on a cheaper Jetson device.
  • Forecasting hardware advancements beyond two years is difficult, so optimize for current hardware and near-term improvements.