
Generative AI in the Real World: Getting Beyond the Demo with Hamel Husain
Sep 4, 2025
32:08
In this episode, Ben Lorica and Hamel Husain talk about how to take the next steps with artificial intelligence. Developers don’t need to build their own models—but they do need basic data skills. It’s important to look at your data, to discover your model’s weaknesses, and to use that information to develop test suites and evals that show whether your model is behaving well.
Links to Resources
- Hamel's upcoming course on evaluating LLMs.
- Hamel’s O’Reilly publications: “AI Essentials for Tech Executives” and “What We Learned from a Year of Building with LLMs.”
- Hamel's website.
Points of Interest
- 0:39: What inspired you and your coauthors to create a series on practical uses of foundation models? What gaps in existing resources did you aim to address?
- 0:56: We’re publishing “AI Essentials for Tech Executives” now; last year, we published “What We Learned from a Year of Building with LLMs.” Coming from the perspective of a machine learning engineer or data scientist, you don’t need to build or train models; you can use an API. But there are skills and practices from data science that are crucial.
- 2:16: There are core skills around data analysis, error analysis, and basic data literacy that you need to get beyond a demo.
- 2:43: What are some crucial shifts in mindset that you’ve written about on your blog?
- 3:24: The phrase we keep repeating is “look at your data.” What does “look at your data” mean?
- 3:51: There’s a process you should use. Machine learning systems have a lot in common with modern AI systems: How do you test them? Debug them? Improve them? Look at your data. This is where people fail: they do vibe checks, but they don’t really know what to do next.
- 4:56: Looking at your data helps ground everything. Look at actual logs of user interactions. If you don’t have users, generate interactions synthetically. See how your AI is behaving and write detailed notes about failure modes. Do some analysis on those notes: Categorize them. You’ll start to see patterns and your biggest failure modes. This will give you a sense of what to prioritize. (A minimal code sketch of this workflow appears after this list.)
- 6:08: A lot of people are missing that. People aren’t familiar with the rich ecosystem of data tools, so they get stuck. We know that it’s crucial to sample some data and look at it.
- 7:08: It’s also important that you have the domain expert do it with the engineers. On a lot of teams, the domain expert isn’t an engineer.
- 7:44: Another thing is focusing on processes, not tools. Tools aren’t the problem; the problem is that your AI isn’t working, and the tools won’t take care of that for you. There’s a process for how to debug, look at, and measure AI. Those are the main mindset shifts.
- 9:32: Most people aren’t building models (pretraining); they might be doing posttraining on a base model. But there are a lot of experiments that you still have to run. There are knobs you have to turn, and without the ability to do it systematically and measure, you’re just mindlessly turning knobs without learning much.
- 10:29: I’ve held open office hours for people to ask questions about evals. The question people ask most is what to eval. There are many components, and you can’t and shouldn’t test everything. Ground your tests in your actual failure modes and prioritize accordingly.
- 11:30: Another topic is what I call prototype purgatory. A lot of people have great demos. The demos work and might even be deployable, but people struggle with pulling the trigger.
- 12:15: A lot of people don’t know how to evaluate their AI systems if they don’t have any users. One way to help yourself is to generate synthetic data: have an LLM brainstorm different personas and scenarios, then generate realistic user inputs from them. That bootstraps you significantly toward production. (A sketch of this appears after the list as well.)
- 13:57: There’s a new open source tool that does something like this for agents. It’s called IntelAgent. It generates synthetic data that you might not come up with yourself.
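The error-analysis workflow described at 4:56 can be made concrete with a small script: sample logged interactions, have a human annotate each one with a note and a failure-mode category, then count the categories to see what to prioritize. This is a minimal sketch, not anything prescribed in the episode; the file path, field names, and category labels are illustrative assumptions.

```python
import json
import random
from collections import Counter

def load_traces(path: str) -> list[dict]:
    """Load logged LLM interactions from a JSONL file (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def sample_for_review(traces: list[dict], n: int = 50) -> list[dict]:
    """Randomly sample a manageable number of traces so a human can read them all."""
    return random.sample(traces, min(n, len(traces)))

def summarize_annotations(annotated: list[dict]) -> Counter:
    """Tally hand-assigned failure-mode categories to surface the biggest problems."""
    return Counter(a["category"] for a in annotated if a.get("category"))

if __name__ == "__main__":
    # Hypothetical log file; each record might hold an id, the user input, and the output.
    traces = sample_for_review(load_traces("logs/interactions.jsonl"))

    # In practice, a domain expert reads each trace and fills in a free-form note
    # plus a failure-mode category (e.g., "hallucinated fact", "ignored instructions").
    annotated = [{"trace_id": t.get("id"), "note": "", "category": None} for t in traces]

    # The counts show which failure modes dominate and therefore what to prioritize.
    print(summarize_annotations(annotated).most_common())
```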
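The synthetic-data bootstrap described at 12:15 can be sketched the same way: enumerate personas and scenarios, then ask an LLM to write a realistic user message for each combination and feed the results to your system as if they were real logs. This sketch uses the OpenAI Python client; the model name, personas, scenarios, and prompt wording are assumptions for illustration.

```python
from itertools import product
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical personas and scenarios for, say, a customer-support assistant.
PERSONAS = ["frustrated first-time user", "power user", "non-native English speaker"]
SCENARIOS = ["billing dispute", "password reset", "feature request"]

def synthetic_user_input(persona: str, scenario: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to write one realistic user message for a persona/scenario pair."""
    prompt = (
        f"Write one realistic message that a {persona} would send to a "
        f"customer-support assistant about a {scenario}. Return only the message."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    # Crossing personas with scenarios surfaces inputs you might not think of yourself;
    # run these through your AI system and review the outputs like real user logs.
    for persona, scenario in product(PERSONAS, SCENARIOS):
        print(f"[{persona} / {scenario}] {synthetic_user_input(persona, scenario)}")
```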
