airhacks.fm podcast with adam bien

From SIMD to CUDA with TornadoVM

Aug 16, 2025
Michalis Papadimitriou, a compiler engineer at TornadoVM, discusses GPU acceleration for LLMs in Java. He covers the evolution from SIMD optimizations to enhanced GPU memory management. Key insights include the hybrid approach that blends CPU and GPU tasks, and the introduction of a persist/consume API to optimize data handling. Michalis highlights the performance trade-offs between TornadoVM and CUDA, along with the increasing role of LLMs in kernel optimization. He also hints at future support for Apple Silicon and new models, showcasing TornadoVM's expanding capabilities.
ANECDOTE

Alfonso's Java Llama Port Sparked GPU Work

  • Alfonso ported llama.cpp to Java using SIMD and achieved about 10 tokens/sec on a quantized model.
  • Michalis and the TornadoVM team used that as a starting point to target GPU acceleration.
INSIGHT

Hybrid Port Revealed IO Bottlenecks

  • The initial TornadoVM port offloaded the matrix multiplications to the GPU and kept the rest of the pipeline on the CPU, a hybrid approach.
  • That first change yielded a ~24% speedup but exposed heavy IO overhead from repeated host-device copies.
ADVICE

Persist And Consume To Cut Transfers

  • Persist data on the device and consume it across task graphs to avoid repeated host-device transfers.
  • Use TornadoVM's persist and consume APIs to express pipelines that keep data on the GPU between iterations.
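A minimal sketch of the pattern described above, chaining two task graphs so the intermediate buffer stays resident on the GPU. This assumes TornadoVM's task-graph API; the exact method names (`persistOnDevice`, `consumeFromDevice`) and signatures are taken on faith from the episode description and may differ in your TornadoVM version. It needs the TornadoVM runtime and a supported accelerator to run.

```java
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class PersistConsumeSketch {

    // Illustrative kernel: y = W * x, with W flattened row-major.
    static void matmul(FloatArray w, FloatArray x, FloatArray y, int n) {
        for (@Parallel int i = 0; i < n; i++) {
            float sum = 0f;
            for (int j = 0; j < n; j++) {
                sum += w.get(i * n + j) * x.get(j);
            }
            y.set(i, sum);
        }
    }

    public static void main(String[] args) {
        int n = 1024;
        FloatArray w = new FloatArray(n * n);
        FloatArray x = new FloatArray(n);
        FloatArray y = new FloatArray(n);

        // Producer graph: upload inputs once, then persist the intermediate
        // result y on the device instead of copying it back to the host.
        TaskGraph producer = new TaskGraph("producer")
            .transferToDevice(DataTransferMode.FIRST_EXECUTION, w, x)
            .task("matmul", PersistConsumeSketch::matmul, w, x, y, n)
            .persistOnDevice(y); // hypothetical per the episode: keep y on GPU

        // Consumer graph: pick up y where it already lives (no re-upload),
        // and only transfer the final output back to the host.
        TaskGraph consumer = new TaskGraph("consumer")
            .consumeFromDevice(producer.getTaskGraphName(), y)
            .task("matmul2", PersistConsumeSketch::matmul, w, y, x, n)
            .transferToHost(DataTransferMode.EVERY_EXECUTION, x);

        try (TornadoExecutionPlan plan =
                 new TornadoExecutionPlan(producer.snapshot(), consumer.snapshot())) {
            plan.execute();
        }
    }
}
```

Without persist/consume, each graph boundary would round-trip `y` through host memory every iteration, which is exactly the IO overhead the hybrid port surfaced.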