Software Engineering Daily

The Challenge of AI Model Evaluations with Ankur Goyal

Jun 10, 2025
Ankur Goyal, CEO and Founder of Braintrust Data and former AI team lead at Figma, dives into the unique challenges of evaluating AI models. He highlights the shift from traditional software evaluations to the complexities posed by large language models. The conversation explores the need for high-quality data, collaboration for effective evaluation frameworks, and integration within development workflows. Ankur also discusses optimizing user experience through AI evaluations and securing data in cloud architectures, offering valuable insights for navigating generative AI development.
ANECDOTE

Building Braintrust from Experience

  • Ankur Goyal shared how his experience at Impira and Figma led him to build internal eval tooling twice.
  • Solving the same problem repeatedly inspired him to create Braintrust, after confirming that other companies faced similar challenges.
INSIGHT

Non-Deterministic Evaluations Challenge

  • AI model evaluations are difficult because of their inherent non-determinism, unlike traditional software testing.
  • Visualizing and interpreting variance is key, since running a single deterministic test doesn't suffice for LLMs.
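The variance point above can be sketched in a few lines of Python. This is a minimal illustration, not Braintrust's actual tooling: `run_task` is a hypothetical stand-in for a non-deterministic LLM call, and the scorer is a simple substring check. The idea is that a single pass/fail run is meaningless for a stochastic system, so the eval runs the task many times and reports the score distribution.

```python
import random
import statistics

def run_task(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; real outputs vary run to run.
    return random.choice(["Paris", "Paris.", "The capital is Paris", "Lyon"])

def score(output: str, expected: str) -> float:
    # Simple containment scorer: 1.0 if the expected answer appears in the output.
    return 1.0 if expected.lower() in output.lower() else 0.0

def eval_with_variance(prompt: str, expected: str, trials: int = 20) -> dict:
    # Run the non-deterministic task many times and summarize the score spread.
    scores = [score(run_task(prompt), expected) for _ in range(trials)]
    return {"mean": statistics.mean(scores), "stdev": statistics.pstdev(scores)}

result = eval_with_variance("What is the capital of France?", "Paris")
```

Reporting a mean and spread per test case, rather than a single boolean, is what makes the variance visible and interpretable.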
ADVICE

Three-Part Eval Creation Framework

  • Create evals by combining three parts: data (inputs and expected outputs), a task function that generates an output, and scoring functions that rate output quality.
  • This simple modular approach makes evals easy to build and customize.
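The three-part structure described above can be sketched as a small harness. This is an illustrative sketch of the data/task/scorer decomposition, not Braintrust's SDK: the names `run_eval`, `task`, and `exact_match` are invented here, and the task is a lookup table standing in for a model.

```python
from typing import Callable

# Part 1: data — inputs paired with expected outputs.
data = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Part 2: task — a function that generates an output for each input
# (a lookup table here, standing in for a model or pipeline).
def task(inp: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(inp, "")

# Part 3: scorers — functions that rate output quality.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(data: list[dict], task: Callable, scorers: list[Callable]) -> list[dict]:
    results = []
    for case in data:
        out = task(case["input"])
        results.append({s.__name__: s(out, case["expected"]) for s in scorers})
    return results

results = run_eval(data, task, [exact_match])
# → [{'exact_match': 1.0}, {'exact_match': 1.0}]
```

Because the three parts are independent, each is easy to swap: change the dataset without touching the task, or add an LLM-as-judge scorer alongside `exact_match`.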