
#052 Don't Build Models, Build Systems That Build Models
How AI Is Built
Navigating Serverless and AI Model Challenges
This episode addresses common misconceptions about serverless platforms like AWS Lambda for deploying machine learning models, especially larger ones. It underscores the importance of task decomposition and fine-tuning in enhancing model performance, alongside the need for robust infrastructure to manage and optimize those models over time. It also explores the evolving dynamics of AI deployment and the factors that can differentiate organizations in this competitive landscape.
Nicolay here,
Today I have the chance to talk to Charles from Modal, who went from doing a PhD on neural network optimization in the 2010s - when ML engineers could build models with a soldering iron and some sticks - to architecting serverless infrastructure for AI models. Modal is about removing barriers so anyone can spin up a hundred GPUs in seconds.
The critical insight that stuck with me: "Don't build models, build systems that build models." Organizations often make the mistake of celebrating a one-time fine-tuned model that matches GPT-4 performance only to watch it become obsolete when the next foundation model arrives - typically three to six months down the road.
Charles's approach to infrastructure is particularly unconventional. He argues that serverless isn't just about convenience - it fundamentally changes how ambitious you can be with scale. "There's so much that gets in the way of trying to spin up a hundred GPUs or a thousand CPU containers that people just don't think to do something big."
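To make that concrete, here is a minimal sketch of the kind of fan-out Charles is describing, using Modal's Python SDK. The function body, model choice, and inputs are illustrative assumptions on my part, not code from the episode:

```python
import modal

app = modal.App("fan-out-demo")
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.function(gpu="A100", image=image)
def embed(batch: list[str]) -> list[list[float]]:
    # Each call runs in its own GPU container; Modal autoscales
    # containers to match the stream of inputs, then back to zero.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    return model.encode(batch).tolist()

@app.local_entrypoint()
def main():
    # Fan 100 batches out across GPU containers in parallel;
    # .map() handles scheduling and result collection.
    batches = [[f"document {i}-{j}" for j in range(32)] for i in range(100)]
    for vectors in embed.map(batches):
        print(len(vectors))
```

Run with `modal run fan_out_demo.py`. The point is that a hundred GPU containers become a one-liner rather than a capacity-planning project.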
The winning approach involves automated data pipelines with feedback collection, continuous evaluation against new foundation models, A/B testing and canary deployments, and systematic error analysis and retraining.
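To show what "a system that builds models" might look like as code rather than a one-off artifact, here is a self-contained Python sketch of that loop. Every helper below is a hypothetical stand-in for a real pipeline component (data collection, training jobs, eval suites, deployment); none of it is from the episode:

```python
import random
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    score: float = 0.0

def collect_feedback() -> list[str]:
    # Stand-in for the automated data pipeline: production traces,
    # user corrections, thumbs up/down signals.
    return [f"example-{i}" for i in range(100)]

def fine_tune(base: Model, dataset: list[str]) -> Model:
    # Stand-in for an actual fine-tuning job.
    return Model(name=f"{base.name}-ft", score=random.random())

def evaluate(model: Model) -> float:
    # Stand-in for a fixed eval suite run on every candidate.
    return model.score

def retrain_cycle(incumbent: Model, new_bases: list[Model]) -> Model:
    # Rerun this whole cycle whenever a new foundation model lands,
    # instead of treating any single set of weights as the deliverable.
    dataset = collect_feedback()
    for base in new_bases:
        candidate = fine_tune(base, dataset)
        if evaluate(candidate) > evaluate(incumbent):
            # In production: canary deploy and A/B test before promoting.
            incumbent = candidate
    return incumbent

if __name__ == "__main__":
    best = retrain_cycle(Model("v0", 0.5),
                         [Model("next-llama"), Model("next-gpt")])
    print(best.name)
```

The deliverable is `retrain_cycle` itself: when the next foundation model arrives in three to six months, you rerun the system instead of starting over.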
In the podcast, we also cover:
- Why inference, not training, is where the money is made
- How to rethink compute when moving from traditional cloud to serverless
- The economics of automated resource management
- Why task decomposition is the key ML engineering skill
- When to earn the right to fine-tune versus using foundation models
*📶 Connect with Charles:*
- Twitter - https://twitter.com/charlesirl
- Modal Labs - https://modal.com
- Modal Slack Community - https://modal.com/slack
*📶 Connect with Nicolay:*
- LinkedIn - https://linkedin.com/in/nicolay-gerold/
- X / Twitter - https://x.com/nicolaygerold
- Bluesky - https://bsky.app/profile/nicolaygerold.com
- Website - https://nicolaygerold.com/
- My Agency Aisbach - https://aisbach.com/ (for AI implementations / strategy)
*⏱️ Important Moments*
- From CUDA to Serverless: [00:01:38] Charles's journey from PhD neural network optimization to building Modal's serverless infrastructure.
- Rethinking Scale Ambition: [00:01:38] "There's so much that gets in the way of trying to spin up a hundred GPUs that people just don't think to do something big."
- The Economics of Serverless: [00:04:09] How automated resource management changes the cattle vs pets paradigm for GPU workloads.
- Lambda vs Modal Philosophy: [00:04:20] Why Modal was designed for tasks that take bytes and emit megabytes, unlike Lambda's middleware focus.
- Inference Economics Reality: [00:10:16] "Almost nobody gets paid to make models - organizations get paid to make predictions."
- The Open Source Commoditization: [00:14:55] How foundation models are becoming undifferentiated capabilities like databases.
- Task Decomposition as Core Skill: [00:22:00] Why breaking down problems is equivalent to recognizing API boundaries in software engineering.
- Systems That Build Models: [00:33:31] The critical difference between delivering static weights and delivering a repeatable model-production system.
- Earning the Right to Fine-Tune: [00:34:06] The infrastructure prerequisites needed before attempting model customization.
- Multi-Node Training Challenges: [00:52:24] How serverless platforms handle the contradiction of high-performance computing with spiky demand.
*🛠️ Tools & Tech Mentioned*
- Modal - https://modal.com (serverless GPU infrastructure)
- AWS Lambda - https://aws.amazon.com/lambda/ (traditional serverless)
- Kubernetes - https://kubernetes.io/ (container orchestration)
- Temporal - https://temporal.io/ (workflow orchestration)
- Weights & Biases - https://wandb.ai/ (experiment tracking)
- Hugging Face - https://huggingface.co/ (model repository)
- PyTorch Distributed - https://pytorch.org/tutorials/intermediate/ddp_tutorial.html (multi-GPU training)
- Redis - https://redis.io/ (caching and queues)
*📚 Recommended Resources*
- Full Stack Deep Learning - https://fullstackdeeplearning.com/ (deployment best practices)
- Modal Documentation - https://modal.com/docs (getting started guide)
- DeepSeek Paper - https://arxiv.org/abs/2401.02954 (disaggregated inference patterns)
- AI Engineer Summit - https://ai.engineer/ (community events)
- MLOps Community - https://mlops.community/ (best practices)
*💬 Join The Conversation*
Follow How AI Is Built on YouTube - https://youtube.com/@howaiisbuilt, Bluesky - https://bsky.app/profile/howaiisbuilt.fm, or Spotify - https://open.spotify.com/show/3hhSTyHSgKPVC4sw3H0NUc
If you have any suggestions for future guests, feel free to leave them in the comments or write me (Nicolay) directly on LinkedIn - https://linkedin.com/in/nicolay-gerold/, X - https://x.com/nicolaygerold, or Bluesky - https://bsky.app/profile/nicolaygerold.com. Or reach me at nicolay.gerold@gmail.com.
I will be opening a Discord soon to get you guys more involved in the episodes! Stay tuned for that.