Software Engineering Daily

The Challenge of AI Model Evaluations with Ankur Goyal

Jun 10, 2025
Ankur Goyal, CEO and Founder of Braintrust Data and former AI team lead at Figma, dives into the unique challenges of evaluating AI models. He highlights the shift from traditional software evaluations to the complexities posed by large language models. The conversation explores the need for high-quality data, collaboration for effective evaluation frameworks, and integration within development workflows. Ankur also discusses optimizing user experience through AI evaluations and securing data in cloud architectures, offering valuable insights for navigating generative AI development.
ANECDOTE

Building Braintrust from Experience

  • Ankur Goyal shared how his experience at Impira and Figma led him to build internal eval tooling twice.
  • Solving the same problem repeatedly inspired him to create Braintrust, after confirming that other companies faced similar challenges.
INSIGHT

Non-Deterministic Evaluations Challenge

  • AI model evaluations are difficult because of their inherent non-determinism, unlike traditional software testing.
  • Visualizing and interpreting variance is key, since running a single deterministic test doesn't suffice for LLMs.
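The variance point above can be sketched in a few lines of Python. This is a minimal illustration, not Braintrust's actual tooling: `run_task` is a hypothetical stand-in for a non-deterministic LLM call, and the scorer is a simple substring check. The idea is that a single pass/fail run is meaningless for a stochastic system, so the eval runs the task many times and reports the score distribution.

```python
import random
import statistics

def run_task(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; real outputs vary run to run.
    return random.choice(["Paris", "Paris.", "The capital is Paris", "Lyon"])

def score(output: str, expected: str) -> float:
    # Simple containment scorer: 1.0 if the expected answer appears in the output.
    return 1.0 if expected.lower() in output.lower() else 0.0

def eval_with_variance(prompt: str, expected: str, trials: int = 20) -> dict:
    # Run the non-deterministic task many times and summarize the score spread.
    scores = [score(run_task(prompt), expected) for _ in range(trials)]
    return {"mean": statistics.mean(scores), "stdev": statistics.pstdev(scores)}

result = eval_with_variance("What is the capital of France?", "Paris")
```

Reporting a mean and spread per test case, rather than a single boolean, is what makes the variance visible and interpretable.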
ADVICE

Three-Part Eval Creation Framework

  • Create evals by combining three parts: data (inputs and expected outputs), a task function that generates an output, and scoring functions that rate output quality.
  • This simple modular approach makes evals easy to build and customize.
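The three-part structure described above can be sketched as a small harness. This is an illustrative sketch of the data/task/scorer decomposition, not Braintrust's SDK: the names `run_eval`, `task`, and `exact_match` are invented here, and the task is a lookup table standing in for a model.

```python
from typing import Callable

# Part 1: data — inputs paired with expected outputs.
data = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Part 2: task — a function that generates an output for each input
# (a lookup table here, standing in for a model or pipeline).
def task(inp: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(inp, "")

# Part 3: scorers — functions that rate output quality.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(data: list[dict], task: Callable, scorers: list[Callable]) -> list[dict]:
    results = []
    for case in data:
        out = task(case["input"])
        results.append({s.__name__: s(out, case["expected"]) for s in scorers})
    return results

results = run_eval(data, task, [exact_match])
# → [{'exact_match': 1.0}, {'exact_match': 1.0}]
```

Because the three parts are independent, each is easy to swap: change the dataset without touching the task, or add an LLM-as-judge scorer alongside `exact_match`.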