Software Huddle

Deep Dive into Inference Optimization for LLMs with Philip Kiely

Nov 5, 2024
Join Philip Kiely as he unpacks the intricacies of inference optimization for AI workloads. He discusses the hype around Compound AI and how to choose the right model and inference engine. Learn about optimization techniques like quantization and speculative decoding that maximize GPU efficiency. Explore the role of multi-model AI systems and the challenges of model routing, network latency, and performance tooling. Discover practical insights on enhancing inference for large language models while balancing latency, throughput, and cost.