Behind the Craft

AI Evaluations Crash Course in 50 Minutes (2025) | Hamel Husain

Sep 28, 2025
Hamel Husain, an expert in AI evaluation methods, shares his vast experience training PMs and engineers at top tech firms. He breaks down how to effectively analyze real production traces, emphasizing the power of binary pass/fail ratings over complex scoring systems. Hamel explains common pitfalls in evaluation metrics and introduces practical tools for continuous monitoring. Listeners gain insights into building simple annotation tools and the importance of grounding evaluations in real problems to drive meaningful improvements.
ADVICE

Analyze 100 Traces First

  • Look at raw traces and write simple notes about failures before anything else.
  • Do this ~100 times to surface common, actionable problems quickly.
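A minimal sketch of how such notes could be recorded, assuming a binary pass/fail rating plus one free-text note per trace. The `TraceNote` class, its field names, and the sample notes are hypothetical and not taken from the episode:

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass
class TraceNote:
    """One manual review of one trace (hypothetical structure)."""
    trace_id: str
    passed: bool  # binary pass/fail rather than a 1-5 score
    note: str     # short free-text description of what went wrong


def notes_to_csv(notes):
    """Serialize notes to CSV text so they can be opened in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["trace_id", "passed", "note"])
    writer.writeheader()
    for n in notes:
        writer.writerow(asdict(n))
    return buf.getvalue()


# Made-up example notes for illustration only.
notes = [
    TraceNote("t1", False, "Assistant booked the wrong date"),
    TraceNote("t2", True, ""),
]
csv_text = notes_to_csv(notes)
print(csv_text)
```

Keeping the rating to a single boolean keeps each review fast, which matters when the goal is to get through about 100 traces.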
ADVICE

Use Spreadsheets + LLMs To Classify Issues

  • Export your manual notes to a spreadsheet and use an LLM to propose categories.
  • Use pivot tables to count categories and prioritize fixes based on frequency.
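The counting step can also be done in a few lines of code instead of a pivot table. The sketch below assumes each note has already been assigned a category (by hand or from an LLM's suggestions); the category names and rows are invented for illustration:

```python
from collections import Counter

# Hypothetical annotated rows: (trace_id, note, category).
annotations = [
    ("t1", "Booked the wrong date", "date handling"),
    ("t2", "Ignored the requested time zone", "date handling"),
    ("t3", "Reply cut off mid-sentence", "truncated output"),
    ("t4", "Used last year's date", "date handling"),
    ("t5", "Reply cut off mid-sentence", "truncated output"),
    ("t6", "Did not hand off to a human", "missed handoff"),
]


def count_categories(rows):
    """Return (category, count) pairs, most frequent first."""
    counts = Counter(category for _, _, category in rows)
    return counts.most_common()


ranked = count_categories(annotations)
for category, n in ranked:
    print(f"{category}: {n}")
```

The most frequent category is a reasonable first candidate for a fix, which is the same prioritization a pivot table would give.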
INSIGHT

Prefer Code Tests For Deterministic Bugs

  • Some problems are best solved with code-based tests, not LLM judges.
  • Use reference-based checks for deterministic issues (e.g., date handling).
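As a rough illustration of a reference-based check, the sketch below compares a date produced by the system against a known expected value. It assumes the system emits dates as ISO 8601 strings; the `date_matches` helper and the test cases are hypothetical:

```python
from datetime import date


def date_matches(output: str, expected: date) -> bool:
    """Pass only if the output parses as an ISO date equal to the reference."""
    try:
        return date.fromisoformat(output.strip()) == expected
    except ValueError:
        # Unparseable output counts as a failure, not an error.
        return False


# Hypothetical cases: (system output, reference date).
cases = [
    ("2025-10-03", date(2025, 10, 3)),
    ("2025-10-04", date(2025, 10, 3)),
    ("next Friday", date(2025, 10, 3)),
]

results = [date_matches(output, expected) for output, expected in cases]
print(results)
```

Because the answer is either right or wrong, a plain assertion like this is cheaper and more repeatable than asking an LLM judge to grade the same output.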