
Building eval systems that improve your AI product
Lenny's Reads
Intro
This episode emphasizes the critical role of robust evaluation systems in AI products, tying them to the continuous improvement users expect. It covers best practices for error analysis and for building reliable metrics, helping engineers and product managers avoid the pitfalls of vanity metrics.
If you’re a premium subscriber, add the private feed to your podcast app at https://add.lennysreads.com
In this episode, we dive into the fast-emerging discipline of AI evaluation with Hamel Husain and Shreya Shankar, creators of AI Evals for Engineers & PMs, the highest-grossing course on Maven.
After training 2,000+ PMs and engineers across 500+ companies, Hamel and Shreya reveal the complete playbook for building evaluations that actually improve your AI product: moving beyond vanity dashboards to a system that drives continuous improvement.
In this episode, you’ll learn:
• Why most AI eval dashboards fail to deliver real product improvements
• How to use error analysis to uncover your product’s most critical failure modes
• The role of a “principal domain expert” in setting a consistent quality bar
• Techniques for transforming messy error notes into a clean taxonomy of failures
• When to use code-based checks vs. LLM-as-a-judge evaluators (see the sketch after this list)
• How to build trust in your evals with human-labeled ground-truth datasets
• Why binary pass/fail labels outperform Likert scales in practice
• Evaluation strategies for complex systems: multi-turn conversations, RAG pipelines, and agentic workflows
• How CI safety nets and production monitoring work together to create a flywheel of continuous product improvement
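To make the evaluator distinction above concrete, here is a minimal sketch of the two styles discussed in the episode. It is illustrative only: the email-leak rule, the judge prompt, the function names, and the model choice are all hypothetical, and it assumes the OpenAI Python client (any LLM client would do). Note that the judge returns a binary pass/fail rather than a Likert score, per the practice the episode recommends.

    import re
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Code-based check: deterministic and cheap; suited to objective failure modes.
    def check_no_email_leak(output: str) -> bool:
        """Fail if the response leaks an email address (hypothetical rule)."""
        return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None

    # LLM-as-a-judge: for subjective criteria a regex cannot capture.
    # Binary PASS/FAIL output, not a 1-5 scale.
    JUDGE_PROMPT = """You are grading an AI assistant's reply.
    Criterion: the reply directly answers the user's question.
    Question: {question}
    Reply: {reply}
    Answer with exactly one word: PASS or FAIL."""

    def llm_judge(question: str, reply: str) -> bool:
        """Return True if the judge model grades the reply as PASS."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, reply=reply),
            }],
        )
        return resp.choices[0].message.content.strip().upper() == "PASS"

In practice, you would validate a judge like this against human-labeled ground-truth examples before trusting its pass rates, the alignment step covered in the episode and in the arXiv paper linked under References.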
References:
• Read the newsletter: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
• AI Evals for Engineers & PMs: https://maven.com/parlance-labs/evals
• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/
• Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://arxiv.org/abs/2404.12272
• Aman Khan: https://www.linkedin.com/in/amanberkeley/
• Anthropic: https://www.anthropic.com/
• Arize Phoenix: https://phoenix.arize.com/
• Braintrust: https://www.braintrust.dev/
• Beyond vibe checks: A PM’s complete guide to evals: https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete
• Frequently Asked Questions (And Answers) About AI Evals: https://hamel.dev/blog/posts/evals-faq/
• Hamel Husain: https://www.linkedin.com/in/hamelhusain/
• LangSmith: https://smith.langchain.com/
• Not Dead Yet: On RAG: https://hamel.dev/notes/llm/rag/not_dead.html
• OpenAI: https://openai.com/
• Shreya Shankar: https://www.linkedin.com/in/shrshnk/
Listen:
• YouTube: https://www.youtube.com/@lennysreads
• Apple: https://podcasts.apple.com/us/podcast/lennys-reads/id1810314693
• Spotify: https://open.spotify.com/show/0IIunA06qMtrcQLfypTooj
• Substack: https://lennysreads.com/
• Newsletter: https://www.lennysnewsletter.com/subscribe
Follow Lenny:
• Twitter/X: https://twitter.com/lennysan
• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/
• Podcast: https://www.youtube.com/@lennyspodcast
About
Welcome to Lenny's Reads, where every week you’ll find a fresh audio version of my newsletter about building product, driving growth, and accelerating your career, read to you by the soothing voice of Lennybot.
To hear more, visit www.lennysnewsletter.com