
ICLR 2024 — Best Papers & Talks (ImageGen, Vision, Transformers, State Space Models) ft. Durk Kingma, Christian Szegedy, Ilya Sutskever

Latent Space: The AI Engineer Podcast

CHAPTER

Optimizing Memory in Large Language Models

This chapter explores approaches to reducing memory consumption in large language models, focusing on the KV cache used during inference. It introduces FastGen, a KV cache eviction algorithm that improves efficiency by profiling each attention head's behavior and discarding cached tokens that head is unlikely to attend to. The discussion also covers an eviction strategy referred to as C-sharp, which likewise tailors the compression policy to individual attention heads to achieve effective KV cache compression.
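The per-head idea discussed in the chapter can be made concrete with a small sketch. The Python below is a minimal illustration under assumed simplifications, not the actual implementation from either paper: it profiles one head's averaged attention weights, then either keeps only a recent window (for heads whose attention is mostly local) or the highest-scoring "heavy hitter" tokens. All names (`choose_policy`, `evict_kv_cache`) and thresholds are hypothetical.

```python
# Hypothetical sketch of per-head adaptive KV cache eviction.
# Function names, policies, and thresholds are illustrative only.
import numpy as np

def choose_policy(attn_weights, recent_window=8, locality_threshold=0.9):
    """Pick a policy for one head by profiling its attention pattern.

    attn_weights: (seq_len,) average attention the head pays to each cached
    token. If most mass falls on the most recent tokens, a 'local' policy
    suffices; otherwise keep the highest-scoring ('heavy hitter') tokens.
    """
    recent_mass = attn_weights[-recent_window:].sum() / attn_weights.sum()
    return "local" if recent_mass >= locality_threshold else "heavy_hitter"

def evict_kv_cache(keys, values, attn_weights, budget, recent_window=8):
    """Return a compressed (keys, values) pair for one attention head.

    keys, values: (seq_len, head_dim) cached tensors for this head.
    budget: number of token entries to keep after eviction.
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values  # cache already fits; nothing to evict

    policy = choose_policy(attn_weights, recent_window)
    if policy == "local":
        keep = np.arange(seq_len - budget, seq_len)  # recent window only
    else:
        keep = np.argsort(attn_weights)[-budget:]    # top-scoring tokens
        keep.sort()                                  # preserve token order
    return keys[keep], values[keep]

# Toy usage: one head, 32 cached tokens, budget of 8.
rng = np.random.default_rng(0)
keys = rng.normal(size=(32, 64))
values = rng.normal(size=(32, 64))
attn = rng.random(32)
k, v = evict_kv_cache(keys, values, attn, budget=8)
print(k.shape, v.shape)  # (8, 64) (8, 64)
```

In this framing, the compression ratio comes from the per-head budget, while the per-head policy choice is what keeps accuracy from degrading uniformly: heads with concentrated, local attention tolerate far more aggressive eviction than heads that attend broadly.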
