AI Engineering Podcast

Right-Sizing AI: Small Language Models for Real-World Production

Sep 20, 2025
In this discussion, Steven Huels, VP of AI Engineering at Red Hat, unpacks the power of small language models (SLMs) for real-world applications. He highlights the operational advantages of SLMs that fit on a single enterprise GPU. The conversation dives into self-hosting models versus relying on APIs, tackles organizational readiness, and discusses innovations in agentic systems. Steven shares real-world examples such as scam detection and emphasizes the importance of customization, automated evaluation, and continuous retraining for efficient AI deployment.
INSIGHT

Practical GPU-Based Model Size Heuristic

  • Define small vs. large models by whether the model fits on a single enterprise GPU, rather than by parameter count.
  • This practical heuristic shifts as hardware and software advance, changing what counts as "small" (a rough sketch of the arithmetic follows below).
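
One rough, illustrative way to apply this heuristic is a back-of-the-envelope memory check. The 80 GB GPU figure, the 20% overhead factor, and the bytes-per-parameter values below are assumptions for the sketch, not figures from the episode:

    def fits_on_single_gpu(params_billions: float,
                           bytes_per_param: float = 2.0,   # fp16/bf16 weights (assumption)
                           gpu_memory_gb: float = 80.0,    # assumed enterprise GPU capacity
                           overhead: float = 1.2) -> bool: # rough allowance for KV cache, runtime
        """Back-of-the-envelope check: do the weights plus overhead fit in GPU memory?"""
        weight_gb = params_billions * bytes_per_param  # 1e9 params * bytes, divided by 1e9 bytes/GB
        return weight_gb * overhead <= gpu_memory_gb

    # A 70B model in fp16 (~140 GB) does not fit, an 8B model (~16 GB) does,
    # and 4-bit quantization (0.5 bytes/param) brings the 70B model back under the line,
    # which is why "small" keeps shifting as hardware and software improve.
    print(fits_on_single_gpu(70))                       # False
    print(fits_on_single_gpu(8))                        # True
    print(fits_on_single_gpu(70, bytes_per_param=0.5))  # True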
ADVICE

Validate With The Best Model First

  • Start experiments with the best available frontier model to validate an idea quickly.
  • If the idea has value, scale down to smaller models to find the right cost-performance trade-off (a sketch of such a comparison follows below).
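
In practice this can look like a shared evaluation harness run first against the frontier model and then against smaller candidates. Everything below (the generate_with helper, the model names, the 90% threshold, and the crude containment metric) is a hypothetical sketch, not code from the episode:

    from typing import Callable, List, Tuple

    def evaluate(generate: Callable[[str], str],
                 eval_set: List[Tuple[str, str]]) -> float:
        """Fraction of prompts whose response contains the expected answer (a crude metric)."""
        hits = sum(1 for prompt, expected in eval_set
                   if expected.lower() in generate(prompt).lower())
        return hits / len(eval_set)

    # Hypothetical usage: generate_with(name) would wrap whatever serving endpoint you use.
    # frontier_score = evaluate(generate_with("frontier-model"), eval_set)
    # if frontier_score >= 0.9:   # the idea works; now look for the cheapest adequate model
    #     slm_score = evaluate(generate_with("8b-slm"), eval_set)
    #     print(f"quality retained by the SLM: {slm_score / frontier_score:.0%}")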
ADVICE

Match Hosting To Operational Maturity

  • Before self-hosting models, evaluate whether your IT organization already operates platforms in production.
  • If not, consider an integrated AI platform that extends existing operational skills and reduces the maintenance burden.