

Deep Dive into Inference Optimization for LLMs with Philip Kiely
Nov 5, 2024
Join Philip Kiely as he unpacks the intricacies of inference optimization for AI workloads. He discusses the hype around Compound AI and how to choose the right model and inference engine. Learn about optimization techniques like quantization and speculative decoding that maximize GPU efficiency, and explore the role of multi-model AI systems alongside the challenges of model routing, network latency, and performance tooling. Discover practical insights on enhancing inference in large language models while balancing latency, throughput, and cost.
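As a taste of the quantization techniques the episode covers, here is a minimal sketch of symmetric per-tensor int8 weight quantization in Python. The function names and the NumPy-only setup are illustrative assumptions, not Kiely's implementation or any particular engine's API:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor quantization: map floats onto [-127, 127]
    # using one scale derived from the largest-magnitude weight.
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights; per-weight error is at most scale / 2.
    return q.astype(np.float32) * scale

# int8 storage is half the memory of fp16 and a quarter of fp32, which is
# why quantization features so heavily in GPU-efficiency discussions.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize_int8(q, scale)).max())
```

Production inference engines typically use more sophisticated schemes (per-channel scales, weight-only int4, and so on), but the memory-for-precision trade-off sketched here is the core idea behind the GPU-efficiency gains mentioned above.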
Chapters
00:00 Intro (2 min)
01:52 Navigating a Career in AI Inference and Developer Relations (2 min)
03:46 Mastering Inference in AI Infrastructure (19 min)
23:13 Harnessing Compound AI for Enhanced Performance (6 min)
29:20 Optimizing Performance in Multi-Model AI Systems (2 min)
31:10 Optimizing Inference in Large Language Models (7 min)
38:18 Optimizing Inference in Language Models (26 min)