What you’ll learn in this episode:
- What “evals” actually mean in the AI/ML world
- Why evals are more than just quality assurance
- The difference between golden datasets, synthetic data, and real-world traces
- How to identify error modes and turn them into evals
- When to use code-based evals vs. LLM-as-judge evals (see the sketch after this list)
- How discovery practices inform every step of AI product evaluation
- Why evals require continuous maintenance (and what “criteria drift” means for your product)
- The relationship between evals, guardrails, and ongoing human oversight
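
To make the code-based vs. LLM-as-judge distinction concrete, here is a minimal, hypothetical sketch. The required JSON fields, the judge prompt, and the `call_llm` parameter are all illustrative placeholders, not the approach described in the episode; `call_llm` stands in for whatever client wrapper your stack already uses.

```python
import json

# --- Code-based eval: a deterministic check, cheap and unambiguous ---
def eval_is_valid_json(output: str) -> bool:
    """Pass if the model's output parses as JSON with the fields we require."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Hypothetical required fields for this example
    return "summary" in parsed and "sentiment" in parsed

# --- LLM-as-judge eval: a grading model scores qualities code can't check ---
JUDGE_PROMPT = """You are grading a customer-support reply.
Criteria: accurate, polite, and does not promise refunds we can't give.
Reply to grade:
{output}

Answer with exactly one word: PASS or FAIL."""

def eval_llm_judge(output: str, call_llm) -> bool:
    """`call_llm` is a placeholder for your own LLM client function:
    it takes a prompt string and returns the model's text response."""
    verdict = call_llm(JUDGE_PROMPT.format(output=output))
    return verdict.strip().upper().startswith("PASS")
```

The rough rule of thumb the split illustrates: if a property can be checked deterministically (format, required fields, banned strings), a code-based eval is faster and more reliable; subjective qualities like tone or accuracy are where an LLM judge, graded against your own criteria, comes in.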
Resources & Links:
Mentioned in the episode:
Coming soon from Teresa:
- Weekly Monday posts sharing lessons learned while building AI products
- A new podcast interviewing cross-functional teams about real-world AI product development stories