

Episode 40: What Every LLM Developer Needs to Know About GPUs
Dec 24, 2024
In this conversation with Charles Frye, Developer Advocate at Modal, listeners gain insights into the intricate world of GPUs and their critical role in AI and LLM development. Charles explains the importance of VRAM and how memory can become a bottleneck. They tackle practical strategies for optimizing GPU usage, from fine-tuning to training large models. The discussion also highlights a GPU Glossary that simplifies complex concepts for developers, along with insights on quantization and the economic considerations in using modern hardware for efficient AI workflows.
AI Snips
Memory Bottleneck for LLMs
- Prioritize fitting the model weights in GPU memory for faster inference.
- Offloading weights to CPU or disk drastically slows inference, much like paging to virtual memory on a CPU (see the sketch below).
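To see why offloading hurts so much, here is a rough back-of-the-envelope sketch. Single-token decode is roughly memory-bandwidth bound, so time per token is about the bytes of weights read divided by the available bandwidth; the model size and bandwidth figures below are illustrative assumptions, not numbers from the episode.

```python
# Back-of-the-envelope: decode speed upper bound if every weight
# is streamed once per generated token.

def tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper-bound tokens/sec when decode is memory-bandwidth bound."""
    return bandwidth_bytes_per_s / weight_bytes

WEIGHTS_7B_FP16 = 7e9 * 2    # ~14 GB of weights for an assumed 7B model in fp16

HBM_BANDWIDTH  = 2.0e12      # ~2 TB/s on-GPU memory bandwidth (approximate)
PCIE_BANDWIDTH = 32e9        # ~32 GB/s host-to-GPU over PCIe (approximate)

print(f"weights in VRAM:   ~{tokens_per_second(WEIGHTS_7B_FP16, HBM_BANDWIDTH):.0f} tok/s")
print(f"weights offloaded: ~{tokens_per_second(WEIGHTS_7B_FP16, PCIE_BANDWIDTH):.1f} tok/s")
```

With these assumed numbers, keeping the weights in VRAM is roughly two orders of magnitude faster than streaming them over PCIe every step, which is the intuition behind the snip.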
Estimating VRAM Requirements
- As a rough heuristic, double the size of the model weights and add about 30% to estimate VRAM needs (see the sketch below).
- For more accurate estimates, factor in sequence length and batch size, especially with long contexts.
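A minimal sketch of that "double it and add ~30%" heuristic; the parameter count and bytes-per-parameter below are illustrative assumptions, not values from the episode.

```python
# Heuristic VRAM estimate: weight size, doubled, plus ~30% overhead.

def estimate_vram_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Estimate VRAM in GB from parameter count and precision."""
    weight_gb = num_params * bytes_per_param / 1e9
    return weight_gb * 2 * 1.3

# Example: an assumed 7B-parameter model in fp16 (2 bytes per parameter)
print(f"~{estimate_vram_gb(7e9):.0f} GB VRAM")   # ~36 GB by this heuristic
```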
Latency vs. Throughput
- Favor throughput-oriented LLM applications.
- Latency is hard to improve, so focus on tasks where high throughput matters rather than millisecond response times (see the sketch below).
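An illustrative sketch of why throughput is the easier lever, reusing the assumed bandwidth-bound decode model from above (numbers are assumptions, not from the episode): each decode step streams the full weights once regardless of how many requests are batched, so aggregate throughput scales with batch size while per-step latency barely moves.

```python
# Simplified model: per-step latency ~ time to stream the weights once;
# batching amortizes that cost across requests. Ignores KV-cache and
# activation traffic, which grow with batch size and sequence length.

WEIGHT_BYTES  = 7e9 * 2      # ~14 GB of fp16 weights for an assumed 7B model
HBM_BANDWIDTH = 2.0e12       # ~2 TB/s GPU memory bandwidth (approximate)

step_latency_s = WEIGHT_BYTES / HBM_BANDWIDTH    # ~7 ms per decode step

for batch_size in (1, 8, 64):
    throughput = batch_size / step_latency_s     # tokens/s across all requests
    print(f"batch {batch_size:>2}: ~{step_latency_s * 1e3:.0f} ms/step, "
          f"~{throughput:,.0f} tok/s total")
```

Under these assumptions, a single request is stuck at the same ~7 ms/step no matter what, but serving 64 requests at once multiplies total tokens/sec by roughly 64, which is the sense in which throughput, not latency, is the dimension worth optimizing.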