The episode explores the challenges of deploying large machine learning models on small devices, emphasizing the importance of real-time applications and data privacy. It discusses model compression techniques, focusing on the development of WhisperKit, an on-device toolkit for transcription and translation applications. The conversation highlights optimizations for performance and compression, and the balance between generic tools and bespoke techniques for efficient deployment.
ML models are growing to many billions of parameters in size, which poses a challenge for running inference on non-dedicated hardware like phones and laptops.
Argmax is a startup focused on developing methods to run large models on commodity hardware. A key observation behind its strategy is that while the largest models keep getting larger, the smallest commercially relevant models are getting smaller. The company was founded in 2023 and has raised funding from General Catalyst and other industry leaders.
Atila Orhon is the founder of Argmax, and he previously worked at Apple and NVIDIA. He joins the show to talk about working in computer vision, building ML tooling at Apple, optimizing ML models, and more.
Sean has been an academic, startup founder, and Googler. He has published work covering a wide range of topics, from information visualization to quantum computing. Currently, Sean is Head of Marketing and Developer Relations at Skyflow and host of Partially Redacted, a podcast about privacy and security engineering. You can connect with Sean on Twitter @seanfalconer.
The post Scaling Large ML Models to Small Devices with Atila Orhon appeared first on Software Engineering Daily.