

Dataflow Computing for AI Inference with Kunle Olukotun - #751
Oct 14, 2025
Kunle Olukotun, a Stanford professor and chief technologist at SambaNova Systems, dives into reconfigurable dataflow architectures for AI. He explains how this innovative approach enhances AI inference by dynamically matching hardware to model dataflows, leading to reduced bandwidth bottlenecks. Kunle also highlights the benefits of fast model switching and efficient multi-model serving, crucial for low-latency applications. Plus, he explores future possibilities of using AI to create compilers for evolving hardware setups, offering insights into significant performance improvements.
AI Snips
Compute That Matches The Model Graph
- Reconfigurable dataflow maps model graphs directly into hardware rather than repeatedly fetching instructions.
- This removes shared-memory synchronization and aligns execution with the model's dataflow graph for efficiency (a minimal sketch follows below).
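A minimal sketch of that idea in Python, using hypothetical names (Node, DataflowGraph, place_on_fabric) rather than any SambaNova API: the model is expressed as a graph of operators, and each operator is placed spatially on a compute unit so producers stream results directly to consumers instead of fetching instructions and synchronizing through shared memory.

```python
# Illustrative only: a model expressed as a graph of operator nodes, each node
# "placed" on a compute unit so results stream to consumers rather than being
# written back to shared memory. Names here are assumptions, not real APIs.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str        # operator, e.g. "matmul" or "softmax"
    inputs: list     # names of upstream producers
    unit: int = -1   # compute unit the node is placed on (-1 = unplaced)

@dataclass
class DataflowGraph:
    nodes: dict = field(default_factory=dict)

    def add(self, name, op_inputs):
        self.nodes[name] = Node(name, op_inputs)

    def place_on_fabric(self, num_units):
        # Spatially assign each operator to a unit; producers feed consumers
        # over on-chip links, so there is no instruction fetch per operation.
        for i, node in enumerate(self.nodes.values()):
            node.unit = i % num_units
        return {n.name: n.unit for n in self.nodes.values()}

# One attention-like fragment: q/k scores -> softmax -> weighted sum of v.
g = DataflowGraph()
g.add("qk_matmul", ["q", "k"])
g.add("softmax", ["qk_matmul"])
g.add("av_matmul", ["softmax", "v"])
print(g.place_on_fabric(num_units=4))
```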
Memory Bandwidth Is The Inference Bottleneck
- Inference is fundamentally limited by the memory bandwidth needed to read parameters and the KV cache, not by raw compute.
- Fusing operators and streaming dataflow dramatically reduce the bytes moved per token, raising HBM utilization to near 90% (see the back-of-envelope arithmetic below).
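To see why decode is bandwidth-bound, here is back-of-envelope arithmetic; the model size, KV-cache size, and bandwidth figures are illustrative assumptions, not numbers from the episode. Each generated token must read every parameter plus the KV cache, so throughput is roughly effective HBM bandwidth divided by bytes read per token.

```python
# Bandwidth-bound decoding, back of the envelope (illustrative numbers):
#   tokens/sec ~= utilization * HBM_bandwidth / bytes_read_per_token
def decode_tokens_per_sec(params_bytes, kv_cache_bytes, hbm_bw_bytes_per_s, utilization):
    bytes_per_token = params_bytes + kv_cache_bytes
    return utilization * hbm_bw_bytes_per_s / bytes_per_token

# Hypothetical 70B-parameter model with 8-bit weights, a 4 GB KV cache,
# and ~3 TB/s of HBM bandwidth on the accelerator.
params = 70e9   # 70 GB of weights at one byte each
kv = 4e9        # 4 GB KV cache
bw = 3e12       # 3 TB/s HBM bandwidth

for util in (0.3, 0.9):  # poor vs. near-90% bandwidth utilization
    print(f"utilization {util:.0%}: {decode_tokens_per_sec(params, kv, bw, util):.1f} tok/s")
```

With the same hardware, moving utilization from 30% to 90% triples decode throughput, which is the gain the snip attributes to fusion and streaming dataflow.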
Map Models By Tiling, Fusing, And Sharding
- Start from the PyTorch graph and implement its operators, then decide per tensor whether to tile, parallelize, or shard.
- Use compiler-driven fusion and tensor mapping to make the best use of HBM bandwidth, the critical resource (a toy sketch of the per-tensor decision follows).
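A toy sketch of that per-tensor decision; the thresholds and function name are assumptions for illustration, not the actual compiler policy. A tensor that fits on-chip stays there and can be fused with its neighbors; one that fits across the on-chip memories of several devices is sharded; anything larger is tiled and streamed through.

```python
# Illustrative per-tensor mapping choice, not a real compiler pass.
def map_tensor(tensor_bytes, on_chip_bytes, num_devices):
    if tensor_bytes <= on_chip_bytes:
        # Small enough to keep resident on one chip and fuse with neighbors.
        return "keep on-chip (fuse with neighboring ops)"
    if tensor_bytes <= on_chip_bytes * num_devices:
        # Fits if split across the on-chip memories of all devices.
        return f"shard across {num_devices} devices"
    # Too large to hold anywhere: stream slices through on-chip memory.
    return "tile and stream slices through on-chip memory"

# Example: a 2 GB weight tensor, 512 MB of on-chip memory, 8 devices.
print(map_tensor(2 * 2**30, 512 * 2**20, 8))
```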