

Fast Inference with Hassan El Mghari
Apr 8, 2025
Hassan El Mghari, an AI expert from Together AI, dives into the world of inference optimization. He discusses Together AI's rapid growth and its hefty Series B funding. Listeners will learn about customer applications of AI, best practices and common pitfalls in building AI apps, and why speed is critical in inference engines. Hassan also explores model fine-tuning techniques and serverless architectures. This episode is a treasure trove for anyone interested in cutting-edge AI!
AI Snips
Open-Source Model Challenges
- Open-source models demand expertise and familiarity with LLM serving frameworks such as vLLM or TensorRT-LLM.
- Users must verify that the framework supports their model, architecture, and GPU, and then test the deployment; a minimal serving sketch follows this list.
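
As an illustration of that workflow, here is a minimal sketch of serving an open-source model with vLLM. The model name is an assumption, and the weights must fit in your GPU's memory (or be sharded across GPUs via tensor_parallel_size):

```python
from vllm import LLM, SamplingParams

# Model name is illustrative: vLLM must support this architecture,
# and the weights must fit in GPU memory.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What makes LLM inference fast?"], params)
print(outputs[0].outputs[0].text)
```

If this loads and generates sensibly, the same LLM object can batch many prompts in one generate call, which is where vLLM's throughput advantage shows up.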
Importance of Inference Speed
- Prioritize speed for inference in AI applications, as it directly impacts user experience and cost-effectiveness.
- Together AI achieves speed through a custom inference stack, kernel optimization, and speculative decoding; a sketch of speculative decoding follows this list.
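
In speculative decoding, a small draft model cheaply proposes several tokens and the large target model verifies them in a single forward pass, accepting only the tokens it agrees with, so quality is unchanged while decoding gets faster. Below is a minimal sketch using the assisted-generation feature in Hugging Face transformers, which implements this idea; the model pair is an assumption, and draft and target must share a tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model pair: a large target verified against a small draft
# from the same family, so both share a tokenizer and vocabulary.
target_id = "meta-llama/Llama-3.1-70B-Instruct"
draft_id = "meta-llama/Llama-3.2-1B-Instruct"

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.float16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Why does inference speed matter?", return_tensors="pt").to(target.device)

# assistant_model switches generate() into assisted (speculative) decoding:
# the draft proposes tokens, the target verifies them in bulk.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

Production systems like Together's custom stack go further with fused kernels and tuned draft models, but the accept/verify loop above is the core idea.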
Fine-Tuning Challenges and Vision
- Fine-tuning is less common than inference because the barrier to entry is higher: it requires high-quality labeled data.
- Together AI aims to simplify fine-tuning by automating the process and lowering the data requirements; a hosted fine-tuning sketch follows this list.
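
To make that concrete, here is a minimal sketch of launching a hosted fine-tuning job with the Together Python SDK. The file ID, model name, and hyperparameters are placeholders, and the exact method names and parameters should be checked against the current SDK documentation:

```python
from together import Together

client = Together()  # assumes TOGETHER_API_KEY is set in the environment

# "file-abc123" is a placeholder for a JSONL training file already
# uploaded via the Files API; model name and epoch count are illustrative.
job = client.fine_tuning.create(
    training_file="file-abc123",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    n_epochs=3,
)

print(job.id, job.status)  # poll until the job completes, then deploy
```

The appeal of a hosted flow like this is that data preparation, hyperparameter defaults, and GPU provisioning are handled by the platform, which is exactly the barrier-lowering Hassan describes.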