Software Engineering Radio - the podcast for professional software developers

SE Radio 703: Sahaj Garg on Low Latency AI

Jan 14, 2026
Sahaj Garg, CTO and co-founder of wispr.ai, shares insights on building low-latency AI applications, which are crucial for interactive voice experiences. He explains how latency affects consumer behavior and the importance of measuring it accurately. Topics include managing trade-offs between speed and accuracy, as well as scaling impacts on latency. Sahaj also delves into advanced techniques like quantization and speculative decoding, emphasizing the need for latency budgets in engineering decisions and the role of latency as a core product requirement.
INSIGHT

User-Perceived Latency Is What Matters

  • Latency is the user's experienced time between action and response.
  • Optimizing for perceived user latency guides design choices in interactive apps.
ADVICE

Break Down Latency With Deep Observability

  • Instrument and break down latency into component hops to find P99 causes.
  • Add observability so you can pinpoint network or processing spikes quickly.
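
As a rough illustration of the per-hop breakdown described above, here is a minimal Python sketch. It is not from the episode: the `HopTimer` class, the hop names, and the nearest-rank percentile method are all assumptions chosen for brevity, and a production system would more likely rely on a tracing or metrics library.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class HopTimer:
    """Hypothetical helper: collects per-hop latency samples in milliseconds."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def hop(self, name):
        # Time one hop of a request (e.g. network, model inference).
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, (time.perf_counter() - start) * 1000.0)

    def record(self, name, millis):
        self.samples[name].append(millis)

    def percentile(self, name, pct):
        # Nearest-rank percentile over the samples recorded for one hop.
        data = sorted(self.samples[name])
        if not data:
            raise ValueError(f"no samples for hop {name!r}")
        rank = max(1, math.ceil(pct * len(data) / 100))
        return data[rank - 1]

    def worst_hop(self, pct=99):
        # The hop with the highest latency at the given percentile.
        return max(self.samples, key=lambda name: self.percentile(name, pct))
```

In use, each stage of a request would be wrapped in `with timer.hop("network"):` (or fed through `record` from existing logs), and `timer.worst_hop(99)` then names the hop that dominates tail latency, even when its median looks healthy.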
ANECDOTE

Cross-Region GPUs Caused 400ms Spike

  • A cross-region deployment caused a 300–400ms latency spike when successive models ran in different regions.
  • Fixing region placement and network routing restored expected response times.