MLOps.community

Are Evals Dead?

Sep 26, 2025
Chiara Caratelli, a data scientist at Prosus Group, argues that evaluations remain critical to AI development. She discusses stress-testing agents and building trust through rigorous evaluation rather than relying on ever-larger models, shares her approach to bootstrapping evaluation sets, and explains how user feedback helps refine them. She also covers simulating real-world interactions and using error analysis to improve agent performance.
ADVICE

Bootstrap With Curated Eval Sets

  • Bootstrap evaluation with a curated test set drawn from product-manager insights and team discussions.
  • Use these cases in CI/CD pipelines so core flows stay protected as you iterate (a minimal sketch follows below).
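
A minimal sketch of how a curated set like this could gate CI, assuming a pytest runner, an inline list of cases, and a hypothetical `call_agent` entry point (none of these specifics come from the episode):

```python
# Hedged sketch: replay a hand-curated eval set as a CI test.
# `call_agent` and the cases below are hypothetical placeholders.
import pytest

# Curated from product-manager insights and team discussions; grows over time.
CASES = [
    {"id": "refund-policy", "prompt": "Can I return an item after 40 days?",
     "must_contain": ["30 days"]},
    {"id": "order-help", "prompt": "Hi, I need help with my order",
     "must_contain": ["order"]},
]


def call_agent(prompt: str) -> str:
    """Replace with the real agent call (SDK, HTTP endpoint, etc.)."""
    raise NotImplementedError


@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_core_flows_stay_protected(case):
    answer = call_agent(case["prompt"])
    # Cheap substring checks keep the CI gate fast; an LLM judge can come later.
    for expected in case["must_contain"]:
        assert expected.lower() in answer.lower(), (
            f"{case['id']}: expected '{expected}' in the agent's answer"
        )
```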
ANECDOTE

Red-Team Personas Broke Guardrails

  • We built persistent persona agents to red-team multi-turn conversations and guardrails.
  • The simulated users tried to be persuasive and sometimes succeeded in breaking the agent (a loop along these lines is sketched below).
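
One way such persistent persona agents could be wired up, as a hedged sketch; `persona_turn`, `agent_turn`, and `violates_policy` are hypothetical stand-ins for LLM calls and guardrail checks, not anything named in the episode:

```python
# Hedged sketch: a persistent, persuasive "user" persona drives a multi-turn
# conversation against the agent under test and we record guardrail breaks.
from dataclasses import dataclass, field


@dataclass
class Persona:
    name: str
    system_prompt: str  # e.g. "You try to get refunds you aren't owed."
    history: list = field(default_factory=list)  # persists across turns


def persona_turn(persona: Persona) -> str:
    """Generate the next adversarial user message (wrap your LLM call here)."""
    raise NotImplementedError


def agent_turn(history: list) -> str:
    """Call the agent under test with the conversation so far."""
    raise NotImplementedError


def violates_policy(reply: str) -> bool:
    """Check the reply against guardrails (rules, classifier, or LLM judge)."""
    raise NotImplementedError


def red_team(persona: Persona, max_turns: int = 8) -> list:
    """Run one multi-turn attack and return any replies that broke guardrails."""
    breaks = []
    for _ in range(max_turns):
        user_msg = persona_turn(persona)
        persona.history.append({"role": "user", "content": user_msg})
        reply = agent_turn(persona.history)
        persona.history.append({"role": "assistant", "content": reply})
        if violates_policy(reply):
            breaks.append({"persona": persona.name,
                           "turn": len(persona.history) // 2,
                           "reply": reply})
    return breaks
```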
ADVICE

Fixes: Start Simple, Roll Out Carefully

  • Start with simple fixes like prompt adjustments and escalate to code-level or moderation reviewers for security issues.
  • Run changes first on a trusted eval set and compare results before rolling out via an A/B test or phased release (see the comparison sketch below).
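
A rough sketch of that compare-before-rollout step, assuming a hypothetical `run_config` that scores one trusted case under a given prompt/config:

```python
# Hedged sketch: score baseline vs. candidate on the trusted eval set and only
# promote the candidate (to an A/B test or phased release) if it doesn't regress.
# `run_config` and the case format are hypothetical placeholders.

def run_config(config_name: str, case: dict) -> bool:
    """Run one trusted eval case under a prompt/config and return pass or fail."""
    raise NotImplementedError


def safe_to_roll_out(cases: list[dict], baseline: str, candidate: str) -> bool:
    base_rate = sum(run_config(baseline, c) for c in cases) / len(cases)
    cand_rate = sum(run_config(candidate, c) for c in cases) / len(cases)
    print(f"{baseline}: {base_rate:.1%}   {candidate}: {cand_rate:.1%}")
    return cand_rate >= base_rate  # no regression on the trusted cases


# Example (hypothetical): gate a prompt change before a phased release.
# if safe_to_roll_out(trusted_cases, "prompt_v1", "prompt_v2"):
#     start the A/B test or phased rollout of prompt_v2
```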