Behind the Craft

Complete Beginner's Course on AI Evaluations: Step by Step (2025) | Aman Khan

Aug 24, 2025
Aman Khan, Head of Product at Arize AI, specializes in AI evaluations and large language models. He shares insights on creating AI evaluations, including four essential types every product manager should know. The episode features a live demo of building evals for a customer support agent, covering the importance of a golden dataset and aligning AI judges with human judgment. Aman emphasizes effective prompt crafting and the iterative process critical for improving AI performance in real-world applications, particularly customer service scenarios.
INSIGHT

Evals Are Non-Negotiable

  • LLMs hallucinate, so evals are essential to avoid product and brand harm.
  • Companies selling models also recommend running evals before deployment.
INSIGHT

Four Core Eval Types

  • There are four core eval types: code-based, human, LLM-as-judge, and user metrics.
  • Each serves a different role, from deterministic checks to measuring business impact (see the sketch after this list).
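
As a rough illustration of how the first three types differ in practice, here is a minimal Python sketch for grading a customer support reply. The function names, the regex check, the judge prompt, and the `call_llm` helper are illustrative assumptions, not code from the episode; user metrics, the fourth type, are usually tracked in product analytics rather than in eval code.

```python
# Sketch of three eval types applied to one agent response.
# All names and prompts here are hypothetical illustrations.

import json
import re


def code_based_eval(response: str) -> bool:
    """Deterministic check: the reply must state a refund window in days."""
    return bool(re.search(r"\b\d+\s*(business\s+)?days\b", response, re.IGNORECASE))


def human_eval(response: str) -> int:
    """Human review: a rater scores the reply 1-5 (here via stdin; in practice a labeling tool)."""
    return int(input(f"Rate this reply 1-5:\n{response}\n> "))


def llm_as_judge_eval(response: str, policy: str, call_llm) -> bool:
    """LLM-as-judge: a second model grades the reply against the policy.

    `call_llm` is a hypothetical helper that sends a prompt to a model and
    returns its text output.
    """
    judge_prompt = (
        "You are grading a customer support reply.\n"
        f"Policy:\n{policy}\n\nReply:\n{response}\n\n"
        'Answer with JSON only: {"policy_compliant": true or false, "reason": "..."}'
    )
    verdict = json.loads(call_llm(judge_prompt))
    return bool(verdict["policy_compliant"])
```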
ADVICE

Start With A Structured Prompt

  • Start your agent with a clear prompt and include input variables like user question, product info, and policy.
  • Use tools (e.g., Anthropic Workbench) to generate and iterate on prompt templates quickly (a template sketch follows below).
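
Below is a minimal sketch of what such a structured prompt with explicit input variables might look like. The template wording, variable names, and `build_prompt` helper are illustrative assumptions, not the exact prompt from the live demo.

```python
# Hypothetical structured prompt template for a customer support agent,
# with the user question, product info, and policy passed in as variables.

SUPPORT_AGENT_PROMPT = """\
You are a customer support agent for an online store.

Follow the store policy exactly. If the policy does not cover the question,
say so and offer to escalate to a human agent.

<policy>
{policy}
</policy>

<product_info>
{product_info}
</product_info>

<user_question>
{user_question}
</user_question>

Respond concisely and cite the relevant policy clause when you apply it.
"""


def build_prompt(user_question: str, product_info: str, policy: str) -> str:
    """Fill the template so every eval case reuses the same structure."""
    return SUPPORT_AGENT_PROMPT.format(
        user_question=user_question,
        product_info=product_info,
        policy=policy,
    )
```

Keeping the variables in a fixed template like this makes it easy to swap in rows from a golden dataset and re-run the same evals after each prompt iteration.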