

Fast Inference with Hassan El Mghari
Apr 8, 2025
Hassan El Mghari, an AI expert from Together AI, dives into the world of inference optimization. He discusses Together AI's rapid growth and its hefty Series B funding. Listeners will learn about customer applications of AI, best practices and common pitfalls in building AI apps, and why speed is critical in inference engines. Hassan also explores model fine-tuning techniques and serverless architectures. This episode is a treasure trove for anyone interested in cutting-edge AI!
AI Snips
Open-Source Model Challenges
- Open-source models demand expertise and familiarity with LLM serving frameworks such as vLLM or TensorRT-LLM.
- Users must verify that the framework supports their model, architecture, and GPU, and then test the deployment; a minimal serving sketch follows this list.
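
As an illustration of that workflow, here is a minimal sketch of serving an open-source model with vLLM. The model name is an assumption, and the weights must fit in your GPU's memory (or be sharded across GPUs via tensor_parallel_size):

```python
from vllm import LLM, SamplingParams

# Model name is illustrative: vLLM must support this architecture,
# and the weights must fit in GPU memory.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What makes LLM inference fast?"], params)
print(outputs[0].outputs[0].text)
```

If this loads and generates sensibly, the same LLM object can batch many prompts in one generate call, which is where vLLM's throughput advantage shows up.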
Importance of Inference Speed
- Prioritize speed for inference in AI applications, as it directly impacts user experience and cost-effectiveness.
- Together AI achieves speed through a custom inference stack, kernel optimization, and speculative decoding; a sketch of speculative decoding follows this list.
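
In speculative decoding, a small draft model cheaply proposes several tokens and the large target model verifies them in a single forward pass, accepting only the tokens it agrees with, so quality is unchanged while decoding gets faster. Below is a minimal sketch using the assisted-generation feature in Hugging Face transformers, which implements this idea; the model pair is an assumption, and draft and target must share a tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model pair: a large target verified against a small draft
# from the same family, so both share a tokenizer and vocabulary.
target_id = "meta-llama/Llama-3.1-70B-Instruct"
draft_id = "meta-llama/Llama-3.2-1B-Instruct"

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.float16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Why does inference speed matter?", return_tensors="pt").to(target.device)

# assistant_model switches generate() into assisted (speculative) decoding:
# the draft proposes tokens, the target verifies them in bulk.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

Production systems like Together's custom stack go further with fused kernels and tuned draft models, but the accept/verify loop above is the core idea.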
Fine-Tuning Challenges and Vision
- Fine-tuning is less common than inference because the barrier to entry is higher: it requires high-quality labeled data.
- Together AI aims to simplify fine-tuning by automating the process and lowering the data requirements; a hosted fine-tuning sketch follows this list.
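
To make that concrete, here is a minimal sketch of launching a hosted fine-tuning job with the Together Python SDK. The file ID, model name, and hyperparameters are placeholders, and the exact method names and parameters should be checked against the current SDK documentation:

```python
from together import Together

client = Together()  # assumes TOGETHER_API_KEY is set in the environment

# "file-abc123" is a placeholder for a JSONL training file already
# uploaded via the Files API; model name and epoch count are illustrative.
job = client.fine_tuning.create(
    training_file="file-abc123",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    n_epochs=3,
)

print(job.id, job.status)  # poll until the job completes, then deploy
```

The appeal of a hosted flow like this is that data preparation, hyperparameter defaults, and GPU provisioning are handled by the platform, which is exactly the barrier-lowering Hassan describes.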