Speculative Decoding and Efficient LLM Inference with Chris Lott - #717

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Optimizing Language Models for Edge Devices

This chapter explores the distinction between the encoding and decoding phases of large language model (LLM) inference and the efficiency metrics tied to each, such as time to first token and tokens per second. It examines the energy constraints of deploying LLMs on edge devices and discusses optimization techniques such as quantization and pruning. The conversation also covers small language models (SLMs) and how they can be integrated into broader system designs to improve efficiency and performance.
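
As a rough illustration of the metrics mentioned above, the sketch below times a streamed generation loop and reports time to first token and decode throughput. The stream_tokens generator is a hypothetical stand-in for a real streaming LLM API (sleeps simulate prefill and per-token decode latency); only the measurement logic is the point.

```python
import time

def stream_tokens(prompt, n_tokens=32):
    """Hypothetical stand-in for a streaming LLM API: yields one token at a time."""
    time.sleep(0.25)       # simulated prefill (encoding) latency before the first token
    yield "token_0"
    for i in range(1, n_tokens):
        time.sleep(0.02)   # simulated per-token decode latency
        yield f"token_{i}"

start = time.perf_counter()
ttft = None
count = 0
for tok in stream_tokens("Explain speculative decoding."):
    if ttft is None:
        ttft = time.perf_counter() - start   # time to first token
    count += 1
elapsed = time.perf_counter() - start

print(f"time to first token: {ttft:.3f} s")
# Throughput is usually quoted over the decode phase, excluding the first token.
print(f"decode throughput: {(count - 1) / (elapsed - ttft):.1f} tokens/s")
```

Time to first token is dominated by prompt processing (the encoding phase), so tokens per second is typically reported over the remaining decode tokens, as done here.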
