MLOps.community

Are Evals Dead?

Sep 26, 2025
Chiara Caratelli, a data scientist at Prosus Group, argues that evaluations remain critical to AI development. She discusses stress-testing agents and building trust through rigorous evaluation rather than relying on ever-larger models, shares her approach to bootstrapping evaluation sets, and explains how user feedback helps refine them. She also covers simulating real-world interactions and using error analysis to improve agent performance.
ADVICE

Bootstrap With Curated Eval Sets

  • Bootstrap evaluation with a curated test set drawn from product-manager insights and team discussions.
  • Use these cases in CI/CD pipelines so core flows stay protected as you iterate (a minimal sketch follows below).
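
A minimal sketch of how a curated set like this could gate CI, assuming a pytest runner, an inline list of cases, and a hypothetical `call_agent` entry point (none of these specifics come from the episode):

```python
# Hedged sketch: replay a hand-curated eval set as a CI test.
# `call_agent` and the cases below are hypothetical placeholders.
import pytest

# Curated from product-manager insights and team discussions; grows over time.
CASES = [
    {"id": "refund-policy", "prompt": "Can I return an item after 40 days?",
     "must_contain": ["30 days"]},
    {"id": "order-help", "prompt": "Hi, I need help with my order",
     "must_contain": ["order"]},
]


def call_agent(prompt: str) -> str:
    """Replace with the real agent call (SDK, HTTP endpoint, etc.)."""
    raise NotImplementedError


@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_core_flows_stay_protected(case):
    answer = call_agent(case["prompt"])
    # Cheap substring checks keep the CI gate fast; an LLM judge can come later.
    for expected in case["must_contain"]:
        assert expected.lower() in answer.lower(), (
            f"{case['id']}: expected '{expected}' in the agent's answer"
        )
```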
ANECDOTE

Red-Team Personas Broke Guardrails

  • We built persistent persona agents to red-team multi-turn conversations and guardrails.
  • The simulated users tried to be persuasive and sometimes succeeded in breaking the agent (a loop along these lines is sketched below).
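
One way such persistent persona agents could be wired up, as a hedged sketch; `persona_turn`, `agent_turn`, and `violates_policy` are hypothetical stand-ins for LLM calls and guardrail checks, not anything named in the episode:

```python
# Hedged sketch: a persistent, persuasive "user" persona drives a multi-turn
# conversation against the agent under test and we record guardrail breaks.
from dataclasses import dataclass, field


@dataclass
class Persona:
    name: str
    system_prompt: str  # e.g. "You try to get refunds you aren't owed."
    history: list = field(default_factory=list)  # persists across turns


def persona_turn(persona: Persona) -> str:
    """Generate the next adversarial user message (wrap your LLM call here)."""
    raise NotImplementedError


def agent_turn(history: list) -> str:
    """Call the agent under test with the conversation so far."""
    raise NotImplementedError


def violates_policy(reply: str) -> bool:
    """Check the reply against guardrails (rules, classifier, or LLM judge)."""
    raise NotImplementedError


def red_team(persona: Persona, max_turns: int = 8) -> list:
    """Run one multi-turn attack and return any replies that broke guardrails."""
    breaks = []
    for _ in range(max_turns):
        user_msg = persona_turn(persona)
        persona.history.append({"role": "user", "content": user_msg})
        reply = agent_turn(persona.history)
        persona.history.append({"role": "assistant", "content": reply})
        if violates_policy(reply):
            breaks.append({"persona": persona.name,
                           "turn": len(persona.history) // 2,
                           "reply": reply})
    return breaks
```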
ADVICE

Fixes: Start Simple, Roll Out Carefully

  • Start with simple fixes like prompt adjustments and escalate to code-level or moderation reviewers for security issues.
  • Run changes first on a trusted eval set and compare results before rolling out via an A/B test or phased release (see the comparison sketch below).
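
A rough sketch of that compare-before-rollout step, assuming a hypothetical `run_config` that scores one trusted case under a given prompt/config:

```python
# Hedged sketch: score baseline vs. candidate on the trusted eval set and only
# promote the candidate (to an A/B test or phased release) if it doesn't regress.
# `run_config` and the case format are hypothetical placeholders.

def run_config(config_name: str, case: dict) -> bool:
    """Run one trusted eval case under a prompt/config and return pass or fail."""
    raise NotImplementedError


def safe_to_roll_out(cases: list[dict], baseline: str, candidate: str) -> bool:
    base_rate = sum(run_config(baseline, c) for c in cases) / len(cases)
    cand_rate = sum(run_config(candidate, c) for c in cases) / len(cases)
    print(f"{baseline}: {base_rate:.1%}   {candidate}: {cand_rate:.1%}")
    return cand_rate >= base_rate  # no regression on the trusted cases


# Example (hypothetical): gate a prompt change before a phased release.
# if safe_to_roll_out(trusted_cases, "prompt_v1", "prompt_v2"):
#     start the A/B test or phased rollout of prompt_v2
```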