Optimizing Inference in Large Language Models
This chapter examines the two phases of inference in large language models (prefill and decode), focusing on efficiency gains from optimization techniques and inference engines such as vLLM and TensorRT-LLM. It also discusses the challenges posed by TensorRT's hardware dependencies and the recent improvements to its usability.
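As a concrete illustration of the kind of engine discussed here, below is a minimal sketch of offline batch inference with vLLM. It assumes vLLM is installed, and the model name is only an example; any model vLLM supports can be substituted.

```python
from vllm import LLM, SamplingParams

# Example model; swap in any model supported by vLLM.
llm = LLM(model="facebook/opt-125m")

# Sampling parameters govern the decode phase of inference.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() runs prefill and decode internally, batching requests
# and managing the KV cache for efficient GPU utilization.
outputs = llm.generate(
    ["Large language model inference proceeds in two phases:"],
    sampling_params,
)

for output in outputs:
    print(output.outputs[0].text)
```

The same two-phase structure underlies TensorRT-LLM and other engines; they differ mainly in how aggressively they optimize each phase and in how tightly they are coupled to specific hardware.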