LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML

Latent Space: The AI Engineer Podcast

NOTE

Optimization: Kernel Fusion, Memory Planning, Loop Optimization, and Weight Quantization

- Kernel fusion combines GPU kernels intelligently to reduce memory traffic.
- Memory planning statically allocates and plans memory ahead of time for better performance.
- Loop transformation is important for improving performance, and the ML compilation framework automates this process.
- Weight quantization can significantly reduce memory footprint, but tradeoffs and precision choices need to be considered.
- MLC allows customization of quantization and supports multiple quantization formats.
- Further research is being done on sparsity and on quantization of activations.
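The quantization point above is what makes the episode title possible: a back-of-envelope sketch (a rough illustration, not MLC's exact quantization formats; the 70B parameter count comes from the title) shows why dropping from 16-bit to 4-bit weights turns a datacenter-sized model into one that fits on a laptop or phone-class device.

```python
# Back-of-envelope weight memory footprint for a 70B-parameter model
# at different precisions. Weights only -- ignores KV cache and
# activations, which quantization research (per the note above) is
# still addressing. 1 GiB = 2**30 bytes.

def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N = 70e9  # 70B parameters, as in the episode title

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("int3", 3)]:
    print(f"{label:>5}: {weight_footprint_gb(N, bits):6.1f} GiB")
```

At fp16 the weights alone need roughly 130 GiB, far beyond any consumer GPU; at 4 bits the same weights need about a quarter of that, which is the tradeoff-versus-precision decision the note refers to.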
