

Accelerating LLMs with TornadoVM: From GPU Kernels to Model Inference
May 18, 2025
Juan Fumero, a Software Engineer and contributor to TornadoVM, dives into the world of GPU acceleration and Java. He shares his insights on how TornadoVM enables efficient data parallelization and optimizes large language models like Llama3. The discussion highlights innovative features such as dynamic hardware reconfiguration, tensor types for FP8 and FP16, and the potential for model quantization. Fumero also touches on the integration possibilities with Project Babylon, emphasizing Java's growing role in enterprise applications for LLMs.
Adam Bien's Early GPU Story
- Adam Bien shared his early GPU experiences, including his interest in graphics cards like Elsa Winner and Voodoo.
- He enjoyed building PCs himself to get the best hardware for graphics and resolution, even without a focus on gaming.
TornadoVM's Java Integration
- TornadoVM is a Java parallel programming framework that accelerates data-parallel applications, primarily on GPUs but also on multi-core CPUs and FPGAs.
- It works as a plugin to existing JDKs, using Graal as a library, without modifying the JVM itself.
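To make the programming model concrete, here is a minimal sketch of how a plain Java method is offloaded with TornadoVM's TaskGraph API. The class and method names (`VectorAddExample`, `vectorAdd`) are hypothetical; the `TaskGraph` / `TornadoExecutionPlan` calls follow TornadoVM's documented API, so details may vary between versions.

```java
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class VectorAddExample {

    // Ordinary Java code: TornadoVM JIT-compiles this method (via Graal)
    // to OpenCL, PTX, or SPIR-V instead of interpreting it on the JVM.
    public static void vectorAdd(float[] a, float[] b, float[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        float[] b = new float[1024];
        float[] c = new float[1024];

        // Describe what to copy in, what to run, and what to copy back.
        TaskGraph graph = new TaskGraph("s0")
            .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
            .task("t0", VectorAddExample::vectorAdd, a, b, c)
            .transferToHost(DataTransferMode.EVERY_EXECUTION, c);

        // Snapshot the graph and execute it on the default accelerator.
        TornadoExecutionPlan plan = new TornadoExecutionPlan(graph.snapshot());
        plan.execute();
    }
}
```

Because the task graph is built at run time, the same Java code can be redirected to a different device (GPU, CPU, FPGA) without recompilation, which is what enables the dynamic reconfiguration discussed in the episode.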
TornadoVM Parallel Programming Tips
- Use TornadoVM's @Parallel and @Reduce annotations to hint which loops in Java code are parallelizable.
- For more control, use the lower-level Kernel API to define the work done per GPU thread, with access to thread IDs and local memory.
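The two styles above can be sketched side by side. This is an illustrative example, not code from the episode: the annotations and `KernelContext` follow TornadoVM's documented API, while the method names (`scale`, `sum`, `scaleKernel`) are hypothetical.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.annotations.Reduce;

public class Kernels {

    // Loop-annotation style: @Parallel hints that iterations are
    // independent, so TornadoVM may map them to GPU threads.
    public static void scale(float[] in, float[] out, float alpha) {
        for (@Parallel int i = 0; i < in.length; i++) {
            out[i] = alpha * in[i];
        }
    }

    // @Reduce marks an accumulator, letting TornadoVM generate a
    // parallel reduction instead of a sequential loop.
    public static void sum(float[] in, @Reduce float[] result) {
        for (@Parallel int i = 0; i < in.length; i++) {
            result[0] += in[i];
        }
    }

    // Kernel API style: one invocation per GPU thread, with explicit
    // thread IDs, closer to how CUDA or OpenCL kernels are written.
    public static void scaleKernel(KernelContext context,
                                   float[] in, float[] out, float alpha) {
        int idx = context.globalIdx;  // this thread's global ID
        out[idx] = alpha * in[idx];
    }
}
```

The annotation style keeps the code readable as ordinary Java, while the Kernel API trades that simplicity for fine-grained control over thread indexing and local (on-chip) memory.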