The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Dataflow Computing for AI Inference with Kunle Olukotun - #751

Oct 14, 2025
Kunle Olukotun, a Stanford professor and chief technologist at SambaNova Systems, dives into reconfigurable dataflow architectures for AI. He explains how this innovative approach enhances AI inference by dynamically matching hardware to model dataflows, leading to reduced bandwidth bottlenecks. Kunle also highlights the benefits of fast model switching and efficient multi-model serving, crucial for low-latency applications. Plus, he explores future possibilities of using AI to create compilers for evolving hardware setups, offering insights into significant performance improvements.
AI Snips
INSIGHT

Compute That Matches The Model Graph

  • Reconfigurable dataflow maps model graphs directly into hardware rather than repeatedly fetching instructions.
  • This removes shared-memory synchronization and aligns execution with ML dataflow graphs for efficiency (see the sketch below).
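To make the contrast concrete, here is a minimal, illustrative Python sketch (not SambaNova's API or hardware model): operators are "configured" once and wired together so tiles stream from one stage straight into the next, instead of an instruction-fetch loop that round-trips every intermediate through shared memory.

```python
# Illustrative sketch only. Each stage is configured with its weights up
# front; data then streams stage-to-stage with no shared-memory round trip,
# mirroring how a dataflow architecture maps the model graph spatially.
import numpy as np

def matmul_stage(weight):
    """A configured pipeline stage: applies a fixed weight to each tile."""
    def stage(tiles):
        for tile in tiles:          # tiles stream in ...
            yield tile @ weight     # ... results stream straight to the next stage
    return stage

def relu_stage(tiles):
    for tile in tiles:
        yield np.maximum(tile, 0.0)

def build_pipeline(stages):
    """Compose stages so each operator feeds the next directly."""
    def run(tiles):
        stream = tiles
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w1, w2 = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
    pipeline = build_pipeline([matmul_stage(w1), relu_stage, matmul_stage(w2)])
    tiles = (rng.standard_normal((8, 64)) for _ in range(16))
    checksum = sum(t.sum() for t in pipeline(tiles))   # drain the stream
    print(f"checksum: {checksum:.3f}")
```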
INSIGHT

Memory Bandwidth Is The Inference Bottleneck

  • Inference is fundamentally limited by memory bandwidth to the parameters and KV cache, not by raw compute.
  • Fusing operators and streaming dataflow dramatically reduce bandwidth demand and raise HBM utilization to near 90% (a worked estimate follows).
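A back-of-the-envelope roofline sketch shows why bandwidth, not compute, sets the ceiling. The model size, KV-cache working set, and HBM bandwidth below are assumed round numbers, not figures from the episode.

```python
# Bandwidth-bound decode estimate (assumed numbers). During single-stream
# decode, each generated token must read all weights plus the KV cache, so
# tokens/s is bounded by HBM_bandwidth / bytes_moved_per_token.
params = 70e9                 # assumed 70B-parameter model
bytes_per_param = 2           # fp16/bf16 weights
kv_cache_bytes = 10e9         # assumed KV-cache working set at long context
hbm_bandwidth = 3.35e12       # assumed ~3.35 TB/s of HBM bandwidth

bytes_per_token = params * bytes_per_param + kv_cache_bytes
peak_tokens_per_s = hbm_bandwidth / bytes_per_token

print(f"bandwidth-bound ceiling: {peak_tokens_per_s:.1f} tokens/s per stream")
# Operator fusion keeps intermediates on-chip, so fewer extra bytes hit HBM
# and the achieved fraction of this ceiling (HBM utilization) rises.
```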
ADVICE

Map Models By Tiling, Fusing, And Sharding

  • Start from PyTorch graphs and implement the operators, then decide per-tensor whether to tile, parallelize, or shard.
  • Use compiler-driven fusion and tensor mapping to make the best use of HBM bandwidth, the critical resource (see the sketch below).
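A minimal sketch of that workflow under stated assumptions: torch.fx really does capture a PyTorch module as an operator graph, but the plan_mapping() heuristic and its shard_threshold are invented here for illustration; a real dataflow compiler would base these decisions on on-chip memory capacity and HBM bandwidth models.

```python
# Hypothetical per-tensor mapping pass over a traced PyTorch graph.
import torch
import torch.fx as fx
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(1024, 4096)
        self.down = nn.Linear(4096, 1024)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def plan_mapping(graph_module, shard_threshold=4_000_000):
    """Walk the traced module and pick tile vs. shard per parameter tensor.
    The threshold is an illustrative stand-in for on-chip capacity limits."""
    plan = {}
    for name, param in graph_module.named_parameters():
        if param.numel() > shard_threshold:
            plan[name] = "shard across units"   # too large to keep resident in one place
        else:
            plan[name] = "tile on-chip"         # keep resident and stream activations through
    return plan

traced = fx.symbolic_trace(TinyMLP())   # start from the PyTorch graph
print(traced.graph)                     # the operator-level dataflow graph
for tensor, decision in plan_mapping(traced).items():
    print(f"{tensor:12s} -> {decision}")
```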