

Dataflow Computing for AI Inference with Kunle Olukotun - #751
Oct 14, 2025
Kunle Olukotun, a Stanford professor and chief technologist at SambaNova Systems, dives into reconfigurable dataflow architectures for AI. He explains how this innovative approach enhances AI inference by dynamically matching hardware to model dataflows, leading to reduced bandwidth bottlenecks. Kunle also highlights the benefits of fast model switching and efficient multi-model serving, crucial for low-latency applications. Plus, he explores future possibilities of using AI to create compilers for evolving hardware setups, offering insights into significant performance improvements.
AI Snips
Compute That Matches The Model Graph
- Reconfigurable dataflow maps model graphs directly into hardware rather than repeatedly fetching instructions.
- This removes shared-memory synchronization and aligns execution with the model's dataflow graph for efficiency (a minimal sketch follows below).
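A minimal sketch of that idea in Python, using hypothetical names (Node, DataflowGraph, place_on_fabric) rather than any SambaNova API: the model is expressed as a graph of operators, and each operator is placed spatially on a compute unit so producers stream results directly to consumers instead of fetching instructions and synchronizing through shared memory.

```python
# Illustrative only: a model expressed as a graph of operator nodes, each node
# "placed" on a compute unit so results stream to consumers rather than being
# written back to shared memory. Names here are assumptions, not real APIs.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str        # operator, e.g. "matmul" or "softmax"
    inputs: list     # names of upstream producers
    unit: int = -1   # compute unit the node is placed on (-1 = unplaced)

@dataclass
class DataflowGraph:
    nodes: dict = field(default_factory=dict)

    def add(self, name, op_inputs):
        self.nodes[name] = Node(name, op_inputs)

    def place_on_fabric(self, num_units):
        # Spatially assign each operator to a unit; producers feed consumers
        # over on-chip links, so there is no instruction fetch per operation.
        for i, node in enumerate(self.nodes.values()):
            node.unit = i % num_units
        return {n.name: n.unit for n in self.nodes.values()}

# One attention-like fragment: q/k scores -> softmax -> weighted sum of v.
g = DataflowGraph()
g.add("qk_matmul", ["q", "k"])
g.add("softmax", ["qk_matmul"])
g.add("av_matmul", ["softmax", "v"])
print(g.place_on_fabric(num_units=4))
```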
Memory Bandwidth Is The Inference Bottleneck
- Inference is fundamentally limited by the memory bandwidth needed to read parameters and the KV cache, not by raw compute.
- Fusing operators and streaming dataflow dramatically reduce the bytes moved per token, raising HBM utilization to near 90% (see the back-of-envelope arithmetic below).
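To see why decode is bandwidth-bound, here is back-of-envelope arithmetic; the model size, KV-cache size, and bandwidth figures are illustrative assumptions, not numbers from the episode. Each generated token must read every parameter plus the KV cache, so throughput is roughly effective HBM bandwidth divided by bytes read per token.

```python
# Bandwidth-bound decoding, back of the envelope (illustrative numbers):
#   tokens/sec ~= utilization * HBM_bandwidth / bytes_read_per_token
def decode_tokens_per_sec(params_bytes, kv_cache_bytes, hbm_bw_bytes_per_s, utilization):
    bytes_per_token = params_bytes + kv_cache_bytes
    return utilization * hbm_bw_bytes_per_s / bytes_per_token

# Hypothetical 70B-parameter model with 8-bit weights, a 4 GB KV cache,
# and ~3 TB/s of HBM bandwidth on the accelerator.
params = 70e9   # 70 GB of weights at one byte each
kv = 4e9        # 4 GB KV cache
bw = 3e12       # 3 TB/s HBM bandwidth

for util in (0.3, 0.9):  # poor vs. near-90% bandwidth utilization
    print(f"utilization {util:.0%}: {decode_tokens_per_sec(params, kv, bw, util):.1f} tok/s")
```

With the same hardware, moving utilization from 30% to 90% triples decode throughput, which is the gain the snip attributes to fusion and streaming dataflow.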
Map Models By Tiling, Fusing, And Sharding
- Start from the PyTorch graph and implement its operators, then decide per tensor whether to tile, parallelize, or shard.
- Use compiler-driven fusion and tensor mapping to make the best use of HBM bandwidth, the critical resource (a toy sketch of the per-tensor decision follows).
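A toy sketch of that per-tensor decision; the thresholds and function name are assumptions for illustration, not the actual compiler policy. A tensor that fits on-chip stays there and can be fused with its neighbors; one that fits across the on-chip memories of several devices is sharded; anything larger is tiled and streamed through.

```python
# Illustrative per-tensor mapping choice, not a real compiler pass.
def map_tensor(tensor_bytes, on_chip_bytes, num_devices):
    if tensor_bytes <= on_chip_bytes:
        # Small enough to keep resident on one chip and fuse with neighbors.
        return "keep on-chip (fuse with neighboring ops)"
    if tensor_bytes <= on_chip_bytes * num_devices:
        # Fits if split across the on-chip memories of all devices.
        return f"shard across {num_devices} devices"
    # Too large to hold anywhere: stream slices through on-chip memory.
    return "tile and stream slices through on-chip memory"

# Example: a 2 GB weight tensor, 512 MB of on-chip memory, 8 devices.
print(map_tensor(2 * 2**30, 512 * 2**20, 8))
```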