
From SIMD to CUDA with TornadoVM (airhacks.fm podcast with Adam Bien)
Aug 16, 2025
Michalis Papadimitriou, a compiler engineer at TornadoVM, discusses GPU acceleration for LLMs in Java. He covers the evolution from SIMD optimizations to enhanced GPU memory management. Key insights include the hybrid approach that blends CPU and GPU tasks, and the introduction of a persist/consume API to optimize data handling. Michalis highlights the performance trade-offs between TornadoVM and CUDA, along with the increasing role of LLMs in kernel optimization. He also hints at future support for Apple Silicon and new models, showcasing TornadoVM's expanding capabilities.
Alfonso's Java Llama Port Sparked GPU Work
- Alfonso ported llama.cpp to Java using SIMD and achieved about 10 tokens/sec on a quantized model.
- Michalis and the TornadoVM team used that as a starting point to target GPU acceleration.
Hybrid Port Revealed IO Bottlenecks
- The initial TornadoVM port offloaded the matrix multiplications to the GPU while keeping the rest of the pipeline on the CPU, a hybrid approach.
- That first change yielded a ~24% speedup but exposed heavy I/O overhead from repeated host-device copies.
Persist And Consume To Cut Transfers
- Persist data on the device and consume it across task graphs to avoid repeated host-device transfers.
- Use TornadoVM's persist and consume APIs to express pipelines that keep data on GPU between iterations.
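The pattern above can be sketched with TornadoVM's task-graph API. This is a minimal illustration, not code from the episode: the kernel, graph names, and array sizes are invented, and the exact signatures of `persistOnDevice`/`consumeFromDevice` are assumptions based on the discussion; the rest follows the standard `TaskGraph`/`TornadoExecutionPlan` API.

```java
// Sketch: keep intermediate tensors on the GPU across task graphs so that
// consecutive graphs (e.g., transformer layers) avoid host-device round-trips.
// persistOnDevice/consumeFromDevice signatures are assumed, not verified.
import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class PersistConsumeSketch {

    // Hypothetical kernel: out = W (n x n) * in (n)
    public static void matVec(FloatArray w, FloatArray in, FloatArray out, int n) {
        for (@Parallel int i = 0; i < n; i++) {
            float sum = 0f;
            for (int j = 0; j < n; j++) {
                sum += w.get(i * n + j) * in.get(j);
            }
            out.set(i, sum);
        }
    }

    public static void main(String[] args) {
        int n = 1024;
        FloatArray w = new FloatArray(n * n);
        FloatArray x = new FloatArray(n);
        FloatArray y = new FloatArray(n);

        // First graph: copy inputs once, then mark the output as persistent
        // so it stays in device memory after the graph finishes.
        TaskGraph layer0 = new TaskGraph("layer0")
            .transferToDevice(DataTransferMode.FIRST_EXECUTION, w, x)
            .task("matvec", PersistConsumeSketch::matVec, w, x, y, n)
            .persistOnDevice(y); // assumed API: keep y resident on the GPU

        // Second graph: consume the device-resident output of layer0 directly,
        // so no host copy happens between the two graphs.
        TaskGraph layer1 = new TaskGraph("layer1")
            .consumeFromDevice("layer0", y) // assumed API: reuse y from layer0
            .task("matvec", PersistConsumeSketch::matVec, w, y, x, n)
            .transferToHost(DataTransferMode.EVERY_EXECUTION, x);

        ImmutableTaskGraph g0 = layer0.snapshot();
        ImmutableTaskGraph g1 = layer1.snapshot();
        new TornadoExecutionPlan(g0, g1).execute();
    }
}
```

The point of the design is that only the first graph pays for host-to-device transfers and only the last graph copies results back; everything in between operates on device-resident data, which is what cut the I/O overhead described above.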
