Software Engineering Radio - the podcast for professional software developers

SE Radio 703: Sahaj Garg on Low Latency AI

Jan 14, 2026

Sahaj Garg, CTO and co-founder of wispr.ai, shares insights on building low-latency AI applications, which are crucial for interactive voice experiences. He explains how latency affects consumer behavior and why it must be measured accurately. Topics include managing the trade-off between speed and accuracy and how scaling affects latency. Sahaj also covers advanced techniques such as quantization and speculative decoding, and emphasizes both the need for latency budgets in engineering decisions and the role of latency as a core product requirement.

INSIGHT

User-Perceived Latency Is What Matters

Latency is the time the user experiences between taking an action and getting a response.
Optimizing for perceived user latency guides design choices in interactive apps.

ADVICE

Break Down Latency With Deep Observability

Instrument each request and break its latency down into component hops to find what drives P99 (tail) latency.
Add observability so you can pinpoint network or processing spikes quickly.
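
A minimal sketch of the per-hop breakdown described above, assuming an in-process pipeline where each hop can be wrapped in a timer. The names `timed_hop`, `hop_samples`, and `slowest_hop_at_p99` are hypothetical and do not come from the episode; a production system would more likely emit these timings to a tracing or metrics backend.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: hop name -> list of durations in milliseconds.
hop_samples = defaultdict(list)

@contextmanager
def timed_hop(name):
    """Record the wall-clock time spent inside the block under the hop name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        hop_samples[name].append((time.perf_counter() - start) * 1000.0)

def percentile(samples, pct):
    """Nearest-rank percentile (pct is an integer 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def slowest_hop_at_p99(samples_by_hop):
    """Return (hop_name, p99_ms) for the hop with the largest P99."""
    p99s = {name: percentile(vals, 99) for name, vals in samples_by_hop.items()}
    worst = max(p99s, key=p99s.get)
    return worst, p99s[worst]
```

Wrapping each stage as `with timed_hop("network"): ...` and then calling `slowest_hop_at_p99(hop_samples)` points at the hop whose tail is worst, which is often a different hop from the one with the worst average.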

ANECDOTE

Cross-Region GPUs Caused a 300–400ms Spike

A cross-region deployment caused a 300–400ms latency spike when successive models ran in different regions.
Fixing region placement and network routing restored expected response times.
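
One way to catch this class of problem before it reaches users is to check a deployment description for consecutive stages placed in different regions. The sketch below is illustrative only: the stage names, region labels, and the `cross_region_hops` helper are assumptions, not details from the episode.

```python
def cross_region_hops(pipeline):
    """Return (upstream, downstream) stage pairs that run in different regions.

    `pipeline` is an ordered list of dicts with hypothetical keys
    "name" and "region", one entry per model or service in the request path.
    """
    return [
        (a["name"], b["name"])
        for a, b in zip(pipeline, pipeline[1:])
        if a["region"] != b["region"]
    ]

# Hypothetical three-stage voice pipeline.
example_pipeline = [
    {"name": "speech_model", "region": "region-a"},
    {"name": "language_model", "region": "region-b"},
    {"name": "formatter", "region": "region-b"},
]
flagged = cross_region_hops(example_pipeline)
```

Here `flagged` contains the single pair `("speech_model", "language_model")`, the hop where a request would cross regions.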