

AI Inference: Good, Fast, and Cheap, with Lin Qiao & Dmytro Ivchenko of Fireworks AI
Apr 20, 2024
Lin Qiao and Dmytro Ivchenko, co-founders of Fireworks AI, share insights on advancing AI inference. They discuss strategies for optimizing latency and performance while balancing cost efficiency, and highlight their collaboration with Stability AI and the importance of user-centered products in easing developer challenges. They also explore the LoRA method for fine-tuning AI models, the shift from traditional machine-learning to deep-learning frameworks, and the impact of GPU programming on AI performance. Tune in for a deep dive into the future of AI technology!
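As background for the LoRA discussion, here is a minimal numpy sketch of the low-rank adaptation idea: freeze the pretrained weight matrix and train only a small low-rank update. The dimensions, rank, and scaling below are illustrative assumptions, not values from the episode.

```python
import numpy as np

# LoRA idea: instead of updating the full weight matrix W (d x k),
# learn a low-rank update B @ A with rank r << min(d, k).
d, k, r = 512, 512, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))         # frozen pretrained weights
A = rng.standard_normal((r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init -> update starts at 0
alpha = 16                              # scaling hyperparameter

def lora_forward(x):
    # Forward pass: frozen path plus the scaled low-rank update.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

# Trainable parameter count drops from d*k to r*(d+k):
print(d * k, r * (d + k))  # 262144 vs 8192
```

With the zero-initialized B, the adapted model starts out identical to the frozen base model, which is why serving platforms can hot-swap many LoRA adapters over one shared base.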
AI Snips
Focus on Value-Add Services
- Reselling hardware is a low-margin business, so Fireworks.ai focuses on value-add services.
- They prioritize low latency, high quality, and low total cost of ownership (TCO) for generative AI inference.
Prioritize Performance over Low Cost
- Don't solely focus on low cost; prioritize latency, quality, and TCO.
- Low TCO is often a byproduct of high performance.
Simplifying Generative AI for Developers
- Generative AI application developers face challenges in model selection, optimization, and cost justification.
- Fireworks.ai aims to abstract these complexities, allowing developers to focus on product development.