
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
Dec 2, 2025

Zain Asgar, co-founder and CEO of Gimlet Labs, is an expert in efficient AI compute orchestration and heterogeneous inference. He discusses the challenges of handling token-heavy agentic workloads and the need for diverse hardware solutions. Zain elaborates on Gimlet's three-layer architecture for workload disaggregation and LLM-driven optimization. He shares insights on the complexities of networking, the trade-offs in precision, and the future of resource scheduling, all while emphasizing the importance of cost-effective AI operations.
Episode notes
Agentic Workloads Multiply Token Costs
- Agentic AI consumes many more tokens than traditional LLM use, making current GPU-first economics unsustainable.
- Zain Asgar argues efficiency gains across hardware are required to keep agent workloads viable.
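The token multiplication behind this argument can be made concrete with a back-of-the-envelope calculation. The sketch below uses entirely hypothetical numbers (prompt size, output per step, step count are illustrative, not figures from the episode): an agent that re-reads its growing context on every step consumes many times the tokens of a single one-shot call.

```python
# Hypothetical token accounting: agentic loop vs. one-shot LLM call.
# All numbers are illustrative assumptions, not benchmarks from the episode.
context = 2_000   # initial prompt tokens
step_out = 500    # tokens generated per agent step
steps = 10        # tool-call / reasoning iterations

one_shot = context + step_out  # single request: read prompt, emit answer

agent_total = 0
for _ in range(steps):
    agent_total += context + step_out  # each step re-processes the full context
    context += step_out                # and the context grows with its output

print(one_shot, agent_total, agent_total / one_shot)
# The agent processes ~19x the tokens of the one-shot call in this toy setup.
```

The ratio grows roughly quadratically with step count, which is why per-token GPU economics that are acceptable for chat can break down for agents.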
Right-Size Model Partitions
- Right-size model partitions so the most latency-sensitive parts run on top-tier hardware and the rest run on cheaper machines.
- Partition agent dataflow graphs to optimize cost per token without breaking SLAs.
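This snip's idea can be sketched as a greedy placement rule: give each stage of the agent's dataflow a latency budget, then assign it the cheapest hardware tier that still meets that budget. The tiers, stage names, and all prices below are hypothetical illustrations, not Gimlet's actual catalog or algorithm.

```python
from dataclasses import dataclass

# Hypothetical hardware tiers: (name, $ per 1K tokens, latency per token in ms).
# Numbers are illustrative only.
TIERS = [
    ("flagship-gpu", 0.0300, 0.5),
    ("mid-gpu",      0.0100, 2.0),
    ("cpu",          0.0020, 10.0),
]

@dataclass
class Stage:
    name: str
    tokens: int            # expected tokens processed by this stage
    latency_slo_ms: float  # per-token latency budget (the stage's SLA share)

def place(stages):
    """Greedy placement: cheapest tier that meets each stage's latency SLO."""
    plan = {}
    for s in stages:
        feasible = [t for t in TIERS if t[2] <= s.latency_slo_ms]
        name, cost_per_1k, _ = min(feasible, key=lambda t: t[1])
        plan[s.name] = (name, cost_per_1k * s.tokens / 1000)
    return plan

stages = [
    Stage("interactive-decode",   tokens=2_000,  latency_slo_ms=1.0),
    Stage("tool-call-parsing",    tokens=5_000,  latency_slo_ms=5.0),
    Stage("background-summarize", tokens=50_000, latency_slo_ms=60.0),
]
for name, (tier, cost) in place(stages).items():
    print(f"{name}: {tier} (${cost:.4f})")
```

Only the user-facing decode stage lands on flagship GPUs; tolerant stages drop to cheaper tiers, lowering blended cost per token without violating any stage's SLO.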
Heterogeneity Makes Placement A Hard Optimization
- Heterogeneous inference amplifies trade-offs across compute, memory bandwidth, and memory capacity, turning placement into a complex optimization.
- Gimlet models cost-per-token and resource criticality to allocate sub-workloads optimally.
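To make the optimization concrete, here is a toy version of the placement problem: each sub-workload has memory and bandwidth demands, each device has capacities and a cost per token, and the scheduler searches for the cheapest feasible assignment. All device specs, workload names, and numbers are invented for illustration; this exhaustive search stands in for whatever cost model Gimlet actually uses.

```python
from itertools import product

# Illustrative device specs (not real benchmarks):
# $ per 1M tokens, memory capacity (GB), memory bandwidth (GB/s).
DEVICES = {
    "h100-class": {"cost": 30.0, "mem_gb": 80,  "bw_gbs": 3000},
    "l4-class":   {"cost": 8.0,  "mem_gb": 24,  "bw_gbs": 300},
    "cpu-node":   {"cost": 2.0,  "mem_gb": 256, "bw_gbs": 100},
}

# Sub-workloads with resource demands; bandwidth-critical pieces
# (e.g. decode-heavy steps) can only run on high-bandwidth devices.
WORK = {
    "prefill":  {"tokens_m": 1.0, "mem_gb": 40, "bw_gbs": 2000},
    "decode":   {"tokens_m": 0.2, "mem_gb": 40, "bw_gbs": 2500},
    "embedder": {"tokens_m": 5.0, "mem_gb": 4,  "bw_gbs": 50},
}

def total_cost(assignment):
    """Total $ for a placement, or None if a constraint is violated."""
    used = {d: 0 for d in DEVICES}
    cost = 0.0
    for task, dev in assignment.items():
        w, spec = WORK[task], DEVICES[dev]
        if spec["bw_gbs"] < w["bw_gbs"]:
            return None  # bandwidth-critical task on a slow device
        used[dev] += w["mem_gb"]
        cost += spec["cost"] * w["tokens_m"]
    if any(used[d] > DEVICES[d]["mem_gb"] for d in used):
        return None  # memory capacity exceeded
    return cost

# Exhaustive search works for this tiny example; a real scheduler
# needs heuristics or solvers as workloads and fleets grow.
best = min(
    (dict(zip(WORK, devs)) for devs in product(DEVICES, repeat=len(WORK))),
    key=lambda a: c if (c := total_cost(a)) is not None else float("inf"),
)
print(best, total_cost(best))
```

The search pins the bandwidth-critical prefill and decode pieces to the fast device and pushes the bulk embedding traffic to the cheap node, which is the snip's point: criticality decides who gets scarce hardware, and cost per token decides everything else.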

