Vanishing Gradients

Episode 40: What Every LLM Developer Needs to Know About GPUs

Dec 24, 2024
In this conversation with Charles Frye, Developer Advocate at Modal, listeners gain insights into the intricate world of GPUs and their critical role in AI and LLM development. Charles explains the importance of VRAM and how memory can become a bottleneck. They tackle practical strategies for optimizing GPU usage, from fine-tuning to training large models. The discussion also highlights a GPU Glossary that simplifies complex concepts for developers, along with insights on quantization and the economic considerations in using modern hardware for efficient AI workflows.
INSIGHT

Memory Bottleneck for LLMs

  • Prioritize fitting the model weights in GPU memory (VRAM) for fast inference; a back-of-the-envelope sketch follows this list.
  • Offloading weights to CPU or disk drastically slows inference, much like paging to virtual memory on CPUs.
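
For a sense of scale, here is a minimal sketch (mine, not from the episode): weight memory is just parameter count times bytes per parameter, which is also why the quantization discussed in the episode shrinks the footprint directly.

```python
def weight_gb(n_params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    return n_params_billions * 1e9 * bytes_per_param / 1e9

# Example: a 7B-parameter model at different precisions.
for label, bpp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"7B @ {label}: {weight_gb(7, bpp):.1f} GB")
# fp16 -> 14.0 GB, int8 -> 7.0 GB, int4 -> 3.5 GB
```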
ADVICE

Estimating VRAM Requirements

  • As a rule of thumb, double the model's weight size and add 30% to estimate VRAM needs (see the sketch after this list).
  • For more accurate estimates, account for sequence length and batch size, especially with long contexts, since the KV cache grows with both.
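
A hedged sketch of that rule of thumb (the multiplier is a heuristic, not a guarantee, and I read "add 30%" as a margin on top of the doubled figure):

```python
def estimate_vram_gb(weight_size_gb: float) -> float:
    """Rule-of-thumb VRAM estimate: 2x the weight size, plus a 30% margin.
    The headroom covers activations, KV cache, and framework buffers;
    long contexts or large batches can need substantially more."""
    return 2 * weight_size_gb * 1.3

# Example: a 7B model in fp16 has ~14 GB of weights.
print(f"{estimate_vram_gb(14.0):.1f} GB")  # -> 36.4 GB estimated
```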
ADVICE

Latency vs. Throughput

  • Prioritize throughput-oriented LLM applications.
  • Latency is hard to improve; focus on tasks where high throughput matters, not millisecond response times.