
Speculative Decoding and Efficient LLM Inference with Chris Lott - #717
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
00:00
Optimizing Language Models for Edge Devices
This chapter explores the distinction between the encoding (prefill) and decoding phases of large language model (LLM) inference, along with efficiency metrics such as time to first token (TTFT) and tokens per second. It emphasizes the energy constraints of deploying LLMs on edge devices and discusses optimization techniques such as quantization and pruning. The conversation also highlights the role of small language models (SLMs) and their integration into broader system designs to improve efficiency and performance.
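The two metrics mentioned above are straightforward to measure around any streaming generation API. As a hedged illustration (the toy `generate_tokens` function and its timing constants are hypothetical stand-ins, not anything from the episode), a minimal sketch of measuring TTFT and steady-state decode rate:

```python
import time

def generate_tokens(n_tokens, prefill_s=0.2, per_token_s=0.02):
    """Toy stand-in for an LLM: a one-time prefill cost, then steady decoding."""
    time.sleep(prefill_s)          # prefill ("encoding") of the prompt
    for i in range(n_tokens):
        time.sleep(per_token_s)    # one decode step per generated token
        yield f"tok{i}"

start = time.perf_counter()
first_token_at = None
count = 0
for tok in generate_tokens(10):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    count += 1

end = time.perf_counter()
ttft = first_token_at - start              # time to first token (prefill + 1 step)
tps = (count - 1) / (end - first_token_at) # steady-state decode tokens per second
print(f"TTFT: {ttft:.2f} s, decode rate: {tps:.1f} tokens/s")
```

TTFT is dominated by the prefill pass over the whole prompt, while tokens per second reflects the per-step decode loop, which is why the two are optimized separately.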