

AI Evaluations Crash Course in 50 Minutes (2025) | Hamel Husain
Sep 28, 2025
Hamel Husain, an expert in AI evaluation methods, shares his vast experience training PMs and engineers at top tech firms. He breaks down how to effectively analyze real production traces, emphasizing the power of binary pass/fail ratings over complex scoring systems. Hamel explains common pitfalls in evaluation metrics and introduces practical tools for continuous monitoring. Listeners gain insights into building simple annotation tools and the importance of grounding evaluations in real problems to drive meaningful improvements.
Analyze 100 Traces First
- Look at raw traces and write simple notes about failures before anything else.
- Do this ~100 times to surface common, actionable problems quickly.
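A minimal note-taking loop for this step can be sketched in plain Python. Everything here is illustrative: the file names `traces.jsonl` and `notes.csv`, the `id`/`output` fields, and the `annotate_traces` helper are assumptions, not tooling from the episode.

```python
import csv
import json

def annotate_traces(trace_path, notes_path, limit=100, note_fn=input):
    """Show each raw trace and record a free-form failure note.

    Hypothetical sketch: assumes one JSON object per line with
    'id' and 'output' fields. note_fn defaults to interactive input
    but can be swapped out (e.g. for testing).
    """
    with open(trace_path) as f, open(notes_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["trace_id", "note"])
        for i, line in enumerate(f):
            if i >= limit:
                break
            trace = json.loads(line)
            print(trace.get("output", ""))  # inspect the raw trace
            note = note_fn("Failure note (blank = pass): ")
            writer.writerow([trace.get("id", i), note])
```

The point of the CSV output is that it feeds directly into the spreadsheet workflow described next.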
Use Spreadsheets + LLMs To Classify Issues
- Export your manual notes to a spreadsheet and use an LLM to propose categories.
- Use pivot tables to count categories and prioritize fixes based on frequency.
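The pivot-table step is just a frequency count per category. A sketch of the same tally in plain Python, using hypothetical category labels (the labels below are illustrative, not from the episode):

```python
from collections import Counter

# Hypothetical LLM-proposed categories attached to each manual note.
labeled_notes = [
    ("trace-01", "hallucinated citation"),
    ("trace-02", "wrong date format"),
    ("trace-03", "hallucinated citation"),
    ("trace-04", "ignored user constraint"),
    ("trace-05", "wrong date format"),
    ("trace-06", "hallucinated citation"),
]

# Pivot-table-style count: failures per category, most frequent first,
# so the top of the list is what you fix first.
counts = Counter(category for _, category in labeled_notes)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

`most_common()` orders categories by frequency, which is exactly the prioritization signal a pivot table gives you in a spreadsheet.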
Prefer Code Tests For Deterministic Bugs
- Some problems are best solved with code-based tests, not LLM judges.
- Use reference-based checks for deterministic issues (e.g., date handling).
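For the date-handling example, a reference-based check might look like the following sketch, assuming the model is asked to emit an ISO-format date (the `check_date` helper and the format choice are assumptions for illustration):

```python
from datetime import date

def check_date(model_output: str, expected: date) -> bool:
    """Reference-based check: parse the model's ISO date and compare exactly.

    Deterministic pass/fail — no LLM judge needed.
    """
    try:
        return date.fromisoformat(model_output.strip()) == expected
    except ValueError:
        return False  # unparseable output is a hard fail

assert check_date("2025-09-28", date(2025, 9, 28))      # exact match passes
assert not check_date("09/28/2025", date(2025, 9, 28))  # wrong format fails
```

Because the expected answer is known in advance, this kind of check is cheap, binary, and repeatable, which is why it beats an LLM judge for deterministic issues.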