The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757

Dec 2, 2025
Zain Asgar, co-founder and CEO of Gimlet Labs, is an expert in efficient AI compute orchestration and heterogeneous inference. He discusses the challenges of handling token-heavy agentic workloads and the need for diverse hardware solutions. Zain elaborates on Gimlet's innovative three-layer architecture for workload disaggregation and LLM-driven optimization. He shares insights on the complexities of networking, the trade-offs in precision, and the future of resource scheduling, all while emphasizing the importance of cost-effective AI operations.
INSIGHT

Agentic Workloads Multiply Token Costs

  • Agentic AI consumes far more tokens than traditional LLM use, making current GPU-first economics unsustainable.
  • Zain Asgar argues that efficiency gains across the hardware stack are required to keep agent workloads viable.
ADVICE

Right-Size Model Partitions

  • Right-size model partitions so that the most latency-sensitive parts run on top-tier hardware while the rest runs on cheaper machines.
  • Partition agent dataflow graphs to optimize cost per token without breaking SLAs.
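The partitioning idea above can be sketched as a simple placement rule: latency-sensitive stages of an agent's dataflow graph go to top-tier hardware, everything else to cheaper machines. This is a minimal illustration only; the stage names, tier names, and prices are assumptions, not Gimlet's actual API or scheduling logic.

```python
# Minimal sketch of latency-aware stage placement across hardware tiers.
# All names and costs are hypothetical.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    latency_sensitive: bool  # e.g. user-facing decode vs. background work

@dataclass
class Tier:
    name: str
    cost_per_hour: float  # relative hardware cost

def place(stages, fast: Tier, cheap: Tier) -> dict:
    """Latency-sensitive stages get top-tier hardware; the rest run cheap."""
    return {s.name: (fast if s.latency_sensitive else cheap).name
            for s in stages}

stages = [
    Stage("interactive_decode", latency_sensitive=True),
    Stage("tool_call_batch", latency_sensitive=False),
    Stage("log_summarization", latency_sensitive=False),
]
plan = place(stages, Tier("H100", 4.0), Tier("older_gpu", 1.0))
print(plan)
```

A real partitioner would split at finer granularity (per layer or per operator) and weigh SLA headroom, but the cost lever is the same: only the stages that need expensive hardware should pay for it.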
INSIGHT

Heterogeneity Makes Placement A Hard Optimization

  • Heterogeneous inference amplifies trade-offs across compute throughput, memory bandwidth, and memory capacity, turning placement into a complex optimization problem.
  • Gimlet models cost per token and resource criticality to allocate sub-workloads optimally.
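One way to picture this optimization is as a constrained search: for each sub-workload, choose the hardware option that minimizes total cost per token while end-to-end latency stays within the SLA. The sketch below enumerates placements by brute force; the workload names, hardware options, and numbers are illustrative assumptions, not Gimlet's internals, and a production scheduler would use a far more scalable solver.

```python
# Hypothetical cost-per-token placement under an SLA latency constraint.
# Enumerates all hardware assignments and keeps the cheapest feasible one.

from itertools import product

# (cost_per_token, latency_ms) for each hardware option per sub-workload
options = {
    "prefill": {"H100": (2.0, 50), "older_gpu": (0.8, 140)},
    "decode":  {"H100": (3.0, 80), "older_gpu": (1.2, 260)},
}

def best_placement(options, sla_ms):
    """Return (cost, assignment) minimizing cost with total latency <= SLA."""
    best = None
    names = list(options)
    for choice in product(*(options[n] for n in names)):
        cost = sum(options[n][hw][0] for n, hw in zip(names, choice))
        latency = sum(options[n][hw][1] for n, hw in zip(names, choice))
        if latency <= sla_ms and (best is None or cost < best[0]):
            best = (cost, dict(zip(names, choice)))
    return best

result = best_placement(options, sla_ms=300)
print(result)
```

With these toy numbers the cheapest all-cheap placement violates the SLA, so the search mixes tiers, which is exactly the kind of trade-off that makes heterogeneous placement hard at scale.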