Deep Dive into Inference Optimization for LLMs with Philip Kiely
Nov 5, 2024
Join Philip Kiely as he unpacks the intricacies of inference optimization for AI workloads. He discusses the hype around Compound AI and how to choose the right model and inference engine. Learn about optimization techniques like quantization and speculative decoding that maximize GPU efficiency. Explore the role of multi-model AI systems and the challenges of model routing, network latency, and performance tooling. Discover practical insights on enhancing inference in large language models while balancing latency, throughput, and cost.
01:04:05
Podcast summary created with Snipd AI
Quick takeaways
Selecting the right AI model in the experimentation phase is essential for eliminating uncertainties and defining product capabilities.
Inference optimization involves key techniques such as quantization and speculative decoding to ensure efficient and reliable model performance in production.
Combining multiple models in Compound AI allows for enhanced capabilities and specialization, optimizing applications without sacrificing performance.
Deep dives
Choosing the Right AI Model for Experimentation
Selecting the appropriate model is crucial in the experimentation phase of AI projects. It is generally recommended to start with the largest and most capable model available, unless specific constraints, such as edge inference requirements, dictate otherwise. Starting with a powerful model eliminates several uncertainties up front, allowing a more focused exploration of product capabilities and workflows. Once those foundations are established, it becomes feasible to evaluate alternative models against well-defined criteria.
Importance of Inference Optimization
Inference optimization is at the core of running AI models efficiently. The discussion highlights how inference encompasses several components, from selecting the right model to optimizing the inference engine, where techniques like quantization and speculative decoding prove beneficial. Inference must be both performant and reliable, since the real value comes from running models effectively in production. That requires an understanding of the infrastructure and scaling mechanisms so models can handle traffic variations and provide consistent responses.
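To make those two engine-level techniques concrete, here is a minimal sketch using Hugging Face transformers (4-bit quantization via bitsandbytes, speculative decoding via assisted generation). The model names are illustrative placeholders, and this is a generic illustration of the ideas rather than the specific stack discussed in the episode.

```python
# Requires: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

target_id = "meta-llama/Llama-3.1-8B-Instruct"  # main model (placeholder)
draft_id = "meta-llama/Llama-3.2-1B-Instruct"   # small draft model (placeholder)

tokenizer = AutoTokenizer.from_pretrained(target_id)

# Quantization: load the target model's weights in 4-bit to cut GPU memory
# use substantially, at a small quality cost.
target = AutoModelForCausalLM.from_pretrained(
    target_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Speculative decoding: a small draft model proposes several tokens at a time,
# and the large model verifies them in a single forward pass, which can
# reduce latency per generated token.
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one sentence.", return_tensors="pt"
).to(target.device)
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that assisted generation assumes the draft and target models share a tokenizer, which is why a smaller model from the same family is the usual choice.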
Considerations for Fine-Tuning Models
Fine-tuning AI models should be approached with caution and is often advised against before product-market fit is established. The speaker stresses that substantial investments in training and fine-tuning should not precede clarity on the project's desired outcome. Instead, early exploration should focus on prompting techniques and on verifying that existing models meet the requirements before committing resources to fine-tuning. Once the direction is clearer, fine-tuning can be employed effectively to tailor model performance to specific use cases.
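As a hypothetical example of the prompting-first approach, a task that might seem to call for fine-tuning can often be handled with a few-shot prompt against an off-the-shelf model. The endpoint, model name, and task below are all placeholders.

```python
# pip install openai — uses the OpenAI-compatible chat API; point base_url
# at a self-hosted endpoint if needed.
from openai import OpenAI

client = OpenAI()

# A few in-context examples stand in for what fine-tuning would teach.
FEW_SHOT = [
    {"role": "system", "content": "Classify support tickets as 'billing', 'bug', or 'other'."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The export button crashes the app."},
    {"role": "assistant", "content": "bug"},
]

def classify(ticket: str) -> str:
    """Few-shot classification with a general-purpose model; no training run needed."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable chat model works
        messages=FEW_SHOT + [{"role": "user", "content": ticket}],
    )
    return response.choices[0].message.content.strip()

print(classify("My invoice lists a plan I never ordered."))
```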
The Rise of Compound AI and Its Applications
Compound AI chains multiple models and steps into a single inference pipeline to unlock new capabilities and improve performance. The approach enables specialization: a complex task like mathematical computation can be handled by a small model dedicated to that function, with a larger model verifying the result. By weaving together different models and modalities, businesses can optimize their applications and enhance user experiences without compromising performance. The concept marks an evolution in AI applications, combining the strengths of general models with specialized capabilities.
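A hedged sketch of that solve-then-verify pattern is below: a small model drafts a solution and a larger model only checks it, which is typically cheaper than having the large model solve from scratch. The client setup and model names are placeholders, not a depiction of Baseten's actual pipeline.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()

def solve_then_verify(problem: str) -> str:
    # Step 1: a small, cheap model specializes in drafting the answer.
    draft = client.chat.completions.create(
        model="small-math-model",  # placeholder
        messages=[{"role": "user", "content": f"Solve step by step: {problem}"}],
    ).choices[0].message.content

    # Step 2: a larger model verifies the draft instead of re-deriving it.
    verdict = client.chat.completions.create(
        model="large-verifier-model",  # placeholder
        messages=[{
            "role": "user",
            "content": (
                f"Problem: {problem}\nProposed solution: {draft}\n"
                "Is this solution correct? Answer 'yes' or point out the error."
            ),
        }],
    ).choices[0].message.content

    return draft if verdict.lower().startswith("yes") else f"Needs revision: {verdict}"

print(solve_then_verify("What is 17 * 24?"))
```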
Evaluating Total Cost of Ownership for Models
When choosing between open-source and proprietary models, evaluate the total cost of ownership rather than just the sticker price. Shared inference endpoints tend to be cost-effective at low volume, but dedicated deployments become worthwhile as traffic grows, since reserving capacity effectively buys tokens in bulk. Beyond cost savings, owning the deployment brings advantages such as stronger privacy guarantees and room for customization. The decision to switch should factor in traffic patterns and specific regulatory needs to ensure the move makes financial sense.
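Back-of-the-envelope math makes the break-even point clear. All prices below are made-up assumptions for illustration (not real quotes), and capacity limits of a single GPU are ignored.

```python
# Illustrative shared-endpoint vs. dedicated-deployment comparison.
PER_MILLION_TOKENS = 0.50        # shared endpoint, $ per 1M tokens (assumed)
DEDICATED_GPU_MONTHLY = 2000.00  # dedicated deployment, $ per GPU-month (assumed)

# Below this volume, pay-per-token is cheaper; above it, dedicated wins
# (before even counting privacy and customization benefits).
break_even_tokens_m = DEDICATED_GPU_MONTHLY / PER_MILLION_TOKENS

for monthly_tokens_m in (500, 2_000, 4_000, 10_000):
    shared = monthly_tokens_m * PER_MILLION_TOKENS
    cheaper = "dedicated" if shared > DEDICATED_GPU_MONTHLY else "shared endpoint"
    print(f"{monthly_tokens_m:>6}M tokens/mo: shared=${shared:>8,.2f} "
          f"vs dedicated=${DEDICATED_GPU_MONTHLY:,.2f} -> {cheaper}")

print(f"Break-even: ~{break_even_tokens_m:,.0f}M tokens per month")
```

Under these assumed prices the crossover sits around 4 billion tokens per month; real traffic patterns (peaks, idle time, autoscaling) shift that point in practice.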
Episode notes
Today we have Philip Kiely from Baseten on the show. Baseten is a Series B startup focused on providing infrastructure for AI workloads.
We go deep on inference optimization. We cover choosing a model, the hype around Compound AI, choosing an inference engine, and optimization techniques like quantization and speculative decoding, all the way down to your GPU choice.