The Neuron: AI Explained

AI Inference: Why Speed Matters More Than You Think (with SambaNova's Kwasi Ankomah)

Oct 7, 2025
Kwasi Ankomah, Lead AI Architect at SambaNova Systems, dives into the significance of AI inference and its bottlenecks. He explains how SambaNova's Reconfigurable Dataflow Unit (RDU) chip architecture delivers over 700 tokens per second while using 90% less power. The discussion highlights the growing latency challenge posed by AI agents, which multiply token demands. Kwasi also explores multi-model serving as a way to optimize cost and performance, and shares his outlook on the future of open-source models tailored for enterprises.
INSIGHT

What Inference Actually Is

  • Inference is the model predicting the next token and feeding each output back into itself as context to continue generation; see the sketch below.
  • This token-by-token loop is what users experience as a streaming model response.
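
A minimal sketch of that loop, using a toy lookup table in place of a real model (the table, names, and stopping rule are illustrative assumptions, not any vendor's API):

```python
# Toy autoregressive loop: a stand-in "model" (a lookup table) predicts the
# next token, and each output is appended to the context and fed back in.
# The table and names are hypothetical; real LLMs score a whole vocabulary.

NEXT_TOKEN = {"h": "e", "e": "l", "l": "o", "o": "!"}  # assumed toy "model"

def generate(prompt: str, max_new_tokens: int = 8) -> str:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = NEXT_TOKEN.get(tokens[-1])   # predict the next token from context
        if nxt is None:                    # no prediction: stop, like an EOS token
            break
        tokens.append(nxt)                 # stream the output back into the input
    return "".join(tokens)

print(generate("h"))  # -> "hello!", built one token at a time
```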
INSIGHT

Latency Dictates UX And Scale

  • Inference latency directly shapes user experience and scalability, especially for real-time applications like voice.
  • High latency breaks user expectations and prevents a service from scaling to thousands of users affordably; the rough budget below shows why.
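
To make the latency point concrete, here is a back-of-the-envelope budget (token rates, reply length, and time-to-first-token are assumed numbers, not figures from the episode):

```python
# Rough latency budget for a streamed reply: time-to-first-token plus the
# time to stream the remaining tokens. All numbers here are assumptions.

def reply_seconds(reply_tokens: int, tokens_per_sec: float, ttft_sec: float = 0.2) -> float:
    return ttft_sec + reply_tokens / tokens_per_sec

for tps in (700, 100, 50):
    print(f"{tps:>4} tok/s -> {reply_seconds(350, tps):.1f} s for a 350-token reply")

# 700 tok/s finishes in ~0.7 s, comfortable for voice; 50 tok/s takes ~7.2 s,
# long enough to break the rhythm of a real-time conversation.
```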
ADVICE

Optimize Inference Costs Early

  • Track inference cost early and optimize models for production workloads to avoid huge operational bills.
  • Measure millisecond-level improvements: spread across millions of users, they translate into millions in savings, as the sketch below illustrates.
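
A quick worked example of how per-request milliseconds compound at fleet scale (traffic volume, per-call saving, and hardware pricing are all assumed placeholders):

```python
# How a small per-request saving compounds across a large fleet.
# Every constant below is an assumed placeholder, not a quoted figure.

requests_per_day = 200_000_000       # assumed daily inference volume
saved_ms_per_request = 50            # assumed compute saved per request
cost_per_compute_hour = 4.00         # assumed $/hour for serving hardware

saved_hours = requests_per_day * saved_ms_per_request / 1000 / 3600
daily_savings = saved_hours * cost_per_compute_hour

print(f"{saved_hours:,.0f} compute-hours/day saved")
print(f"${daily_savings:,.0f}/day -> ${daily_savings * 365:,.0f}/year")
# ~2,778 compute-hours/day, roughly $11k/day or ~$4M/year at these rates
```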