airhacks.fm podcast with adam bien

From SIMD to CUDA with TornadoVM

Aug 16, 2025
Michalis Papadimitriou, a compiler engineer at TornadoVM, discusses GPU acceleration for LLMs in Java. He covers the evolution from SIMD optimizations to enhanced GPU memory management. Key insights include the hybrid approach that blends CPU and GPU tasks, and the introduction of a persist/consume API to optimize data handling. Michalis highlights the performance trade-offs between TornadoVM and CUDA, along with the increasing role of LLMs in kernel optimization. He also hints at future support for Apple Silicon and new models, showcasing TornadoVM's expanding capabilities.
ANECDOTE

Alfonso's Java Llama Port Sparked GPU Work

  • Alfonso ported llama.cpp to Java using SIMD and achieved about 10 tokens/sec on a quantized model.
  • Michalis and the TornadoVM team used that as a starting point to target GPU acceleration.
INSIGHT

Hybrid Port Revealed IO Bottlenecks

  • The initial TornadoVM port offloaded the matrix multiplications to the GPU and kept the rest of the pipeline on the CPU, a hybrid approach.
  • That first change yielded a ~24% speedup but exposed heavy IO overhead from repeated host-device copies.
ADVICE

Persist And Consume To Cut Transfers

  • Persist data on the device and consume it across task graphs to avoid repeated host-device transfers.
  • Use TornadoVM's persist and consume APIs to express pipelines that keep data on the GPU between iterations.
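A minimal sketch of the pattern described above, chaining two task graphs so the intermediate buffer stays resident on the GPU. This assumes TornadoVM's task-graph API; the exact method names (`persistOnDevice`, `consumeFromDevice`) and signatures are taken on faith from the episode description and may differ in your TornadoVM version. It needs the TornadoVM runtime and a supported accelerator to run.

```java
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class PersistConsumeSketch {

    // Illustrative kernel: y = W * x, with W flattened row-major.
    static void matmul(FloatArray w, FloatArray x, FloatArray y, int n) {
        for (@Parallel int i = 0; i < n; i++) {
            float sum = 0f;
            for (int j = 0; j < n; j++) {
                sum += w.get(i * n + j) * x.get(j);
            }
            y.set(i, sum);
        }
    }

    public static void main(String[] args) {
        int n = 1024;
        FloatArray w = new FloatArray(n * n);
        FloatArray x = new FloatArray(n);
        FloatArray y = new FloatArray(n);

        // Producer graph: upload inputs once, then persist the intermediate
        // result y on the device instead of copying it back to the host.
        TaskGraph producer = new TaskGraph("producer")
            .transferToDevice(DataTransferMode.FIRST_EXECUTION, w, x)
            .task("matmul", PersistConsumeSketch::matmul, w, x, y, n)
            .persistOnDevice(y); // hypothetical per the episode: keep y on GPU

        // Consumer graph: pick up y where it already lives (no re-upload),
        // and only transfer the final output back to the host.
        TaskGraph consumer = new TaskGraph("consumer")
            .consumeFromDevice(producer.getTaskGraphName(), y)
            .task("matmul2", PersistConsumeSketch::matmul, w, y, x, n)
            .transferToHost(DataTransferMode.EVERY_EXECUTION, x);

        try (TornadoExecutionPlan plan =
                 new TornadoExecutionPlan(producer.snapshot(), consumer.snapshot())) {
            plan.execute();
        }
    }
}
```

Without persist/consume, each graph boundary would round-trip `y` through host memory every iteration, which is exactly the IO overhead the hybrid port surfaced.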