Optimizing Language Models for Edge Devices
This chapter explores the distinction between the encoding (prefill) and decoding phases of large language model (LLM) inference, along with the efficiency metrics used to characterize them, such as time to first token (TTFT) and tokens per second. It emphasizes the energy constraints of deploying LLMs on edge devices and discusses optimization techniques such as quantization and pruning. The conversation also highlights the significance of small language models (SLMs) and how integrating them into broader system designs improves efficiency and performance.
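The episode itself is audio only, but a minimal sketch may help make the two metrics concrete: TTFT is the wall-clock delay before the first generated token arrives (dominated by prefill), while tokens per second measures the steady-state decode rate afterwards. The `fake_stream` generator below is a hypothetical stand-in for a real model's streaming API.

```python
import time

def measure_latency(stream):
    """Measure time to first token (TTFT) and decode throughput over a
    stream of generated tokens. `stream` is any iterable that yields
    tokens as they are produced."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # prefill dominates this first interval
        count += 1
    total = time.perf_counter() - start
    # Tokens per second over the decode phase (after the first token).
    tps = (count - 1) / (total - ttft) if count > 1 else 0.0
    return ttft, tps

# Usage with a toy generator that fakes model timing behaviour.
def fake_stream(n=20, prefill_s=0.5, per_token_s=0.05):
    time.sleep(prefill_s)        # simulated prompt processing (prefill)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token_s)  # simulated per-token decode step

ttft, tps = measure_latency(fake_stream())
print(f"TTFT: {ttft:.2f}s, throughput: {tps:.1f} tokens/s")
```

On an edge device, both numbers matter: a long prefill hurts responsiveness, while a low decode rate makes long answers painful regardless of how fast the first token appears.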
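The two optimization techniques named in the summary can likewise be illustrated with a toy sketch, assuming the standard textbook formulations: symmetric per-tensor int8 quantization (shrinking each weight from 4 bytes to 1) and unstructured magnitude pruning (zeroing the smallest weights). Neither function reflects a specific scheme discussed in the episode.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map float weights onto
    [-127, 127] with a single scale factor."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

def prune_by_magnitude(weights: np.ndarray, sparsity: float = 0.5):
    """Unstructured magnitude pruning: zero out the fraction `sparsity`
    of weights with the smallest absolute values."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
error = np.abs(w - dequantize(q, s)).max()
print(f"4 bytes -> 1 byte per weight, max round-trip error: {error:.4f}")
print(prune_by_magnitude(w))  # half the entries are now exactly zero
```

Both techniques trade a small amount of accuracy for the memory and energy savings that make on-device deployment feasible, which is why they recur throughout the conversation.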