Lenny's Podcast: Product | Career | Growth

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Sep 25, 2025
Hamel Husain, an AI product educator and consultant, and Shreya Shankar, a researcher and product expert, share their insights on AI evals. They explain why evals are essential for AI product builders, walk through error analysis techniques, and discuss the balance between code-based evaluations and LLM judges. Listeners get practical tips for implementing evals with minimal time investment, along with common pitfalls to avoid. The duo also highlights the importance of systematic measurement in improving AI product effectiveness.
INSIGHT

Evals Are Product Analytics For LLMs

  • Evals are systematic measurements of an AI application's quality, like data analytics for LLM products.
  • They create metrics and feedback loops so teams can iterate with confidence.
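The metrics-and-feedback-loop idea above can be sketched in code. This is a minimal, illustrative example only, not something shown in the episode: a code-based check runs over logged traces and produces a pass rate a team could track across iterations. All names (`Trace`, `contains_apology`, the sample traces) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """One logged interaction: user prompt and model output."""
    prompt: str
    output: str

def contains_apology(trace: Trace) -> bool:
    """Code-based check: flag outputs that deflect with an apology."""
    return "i'm sorry" in trace.output.lower()

# Hypothetical production traces from an AI assistant.
traces = [
    Trace("When is my lease renewal?", "Your lease renews on June 1."),
    Trace("Can I tour unit 4B?", "I'm sorry, I cannot help with that."),
]

# The eval produces a metric teams can watch as they iterate.
failures = [t for t in traces if contains_apology(t)]
pass_rate = 1 - len(failures) / len(traces)
print(f"pass rate: {pass_rate:.0%}")
```

Checks like this are cheap to run on every change; fuzzier quality questions are where an LLM judge would come in instead.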
ANECDOTE

Real Estate Assistant Example

  • Hamel used NurtureBoss, an AI assistant for property managers, as a real-world example to show traces and system prompts.
  • He walked through actual logs to demonstrate how evals uncover problems in production behavior.
ADVICE

Write Quick Open Notes First

  • Start error analysis by writing quick open notes on individual traces and capture the first upstream error you see.
  • Sample traces rather than labeling everything; this keeps the process manageable and lets you learn rapidly.
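The sampling step above can be sketched as follows. This is an assumed workflow, not code from the episode: draw a small random sample of trace IDs and keep a free-form note per trace, recording only the first upstream error you see. All names are illustrative.

```python
import random

# Hypothetical pool of production trace IDs.
trace_ids = list(range(1000))

# A small random sample is enough to start learning; no need to label all.
random.seed(0)  # fixed seed so the sample is reproducible
sample = random.sample(trace_ids, 20)

# Open notes: trace_id -> quick free-form note on the first upstream error.
notes: dict[int, str] = {}
for tid in sample:
    # In practice you read the trace and type a note, e.g.:
    # notes[tid] = "retrieval pulled the wrong unit's lease; answer built on it"
    notes[tid] = ""

print(len(sample), "traces sampled for open coding")
```

Patterns in these open notes are what later get turned into the code-based checks and LLM-judge evals discussed elsewhere in the episode.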