

Episode 53: Human-Seeded Evals & Self-Tuning Agents: Samuel Colvin on Shipping Reliable LLMs
Jul 8, 2025
Samuel Colvin, the mastermind behind Pydantic and founder of Logfire, discusses the often-overlooked challenges of AI reliability. He argues that durability, not flashy demos, is what matters, and that tight feedback loops yield outsized insight into how a system is actually performing. Colvin introduces ideas like prompt self-repair systems and drift alarms, which catch behavioural shifts before they become problems, and advocates for business-driven metrics that keep features aligned with real goals, making AI not just functional but dependable in real-world applications.
AI Snips
85% to 100% Reliability Is Hard
- Moving from a demo-level 85% solution to 100% reliability demands many times more engineering effort.
- Real improvements require systematic evaluation to truly understand performance and identify failure points; see the sketch below.
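The kind of systematic evaluation mentioned above can start very small. The following is a minimal, hypothetical Python sketch of a fixed-case eval loop; `call_agent`, the example cases, and the substring check are illustrative stand-ins, not anything specified in the episode.

```python
# Hypothetical sketch of a tiny eval loop: run a fixed set of cases
# through the agent and track the pass rate across prompt/model changes.
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected_substring: str  # deliberately simplistic pass/fail criterion


def call_agent(prompt: str) -> str:
    """Placeholder for the real LLM/agent call being shipped."""
    return "refund issued"  # stub output so the sketch runs end to end


CASES = [
    Case("Customer asks for a refund on order 1234", "refund"),
    Case("Customer asks for store opening hours", "hours"),
]


def run_evals(cases: list[Case]) -> float:
    passed = 0
    for case in cases:
        output = call_agent(case.prompt)
        if case.expected_substring in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case.prompt!r} -> {output!r}")  # the failure points to inspect
    return passed / len(cases)


if __name__ == "__main__":
    rate = run_evals(CASES)
    print(f"pass rate: {rate:.0%}")  # watch this number move as you iterate
```

Even a crude pass rate like this makes the 85%-to-100% gap measurable instead of anecdotal.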
Humans Must Inspect Outputs
- Always review a sample of prompts and outputs yourself before relying on agents to do it automatically; see the sampling sketch after this list.
- Human curiosity about how the system behaves is essential for diagnosing and fixing issues.
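One low-effort way to build that habit is to pull a random sample from whatever logs you already collect. The Python sketch below is a hypothetical illustration; the `records` structure, sample size, and print-based review are assumptions, not the workflow described in the episode.

```python
# Hypothetical sketch: sample logged prompt/output pairs for manual review.
import random

# In practice these would be loaded from your logging/observability store.
records = [
    {"prompt": "Summarise ticket #42", "output": "User cannot log in after password reset."},
    {"prompt": "Summarise ticket #43", "output": "Refund requested for duplicate charge."},
]

SAMPLE_SIZE = 20

for record in random.sample(records, min(SAMPLE_SIZE, len(records))):
    print("PROMPT:", record["prompt"])
    print("OUTPUT:", record["output"])
    print("-" * 60)  # read these yourself before trusting an automated judge
```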
LLM Eval Challenges and Reality
- Evaluations for LLMs feel high-overhead, and the returns can seem low, especially when pushing beyond 85% performance.
- Waiting for better models is often not viable due to urgent production timelines.