

Episode 53: Human-Seeded Evals & Self-Tuning Agents: Samuel Colvin on Shipping Reliable LLMs
Jul 8, 2025
Samuel Colvin, the mastermind behind Pydantic and founder of Logfire, discusses the often-overlooked challenges of AI reliability. He argues that durability, not flashy demos, is what matters, and that tight feedback loops yield outsized insight into how a system is actually performing. Colvin introduces ideas like prompt self-repair systems and drift alarms, which catch behavioural shifts before they become problems, and advocates for business-driven metrics that keep features aligned with real goals, making AI not just functional but dependable in real-world applications.
AI Snips
85% to 100% Reliability Is Hard
- Moving from a demo-level 85% solution to 100% reliability demands many times more engineering effort.
- Real improvements require systematic evaluation to truly understand performance and identify failure points; see the sketch below.
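The kind of systematic evaluation mentioned above can start very small. The following is a minimal, hypothetical Python sketch of a fixed-case eval loop; `call_agent`, the example cases, and the substring check are illustrative stand-ins, not anything specified in the episode.

```python
# Hypothetical sketch of a tiny eval loop: run a fixed set of cases
# through the agent and track the pass rate across prompt/model changes.
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected_substring: str  # deliberately simplistic pass/fail criterion


def call_agent(prompt: str) -> str:
    """Placeholder for the real LLM/agent call being shipped."""
    return "refund issued"  # stub output so the sketch runs end to end


CASES = [
    Case("Customer asks for a refund on order 1234", "refund"),
    Case("Customer asks for store opening hours", "hours"),
]


def run_evals(cases: list[Case]) -> float:
    passed = 0
    for case in cases:
        output = call_agent(case.prompt)
        if case.expected_substring in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case.prompt!r} -> {output!r}")  # the failure points to inspect
    return passed / len(cases)


if __name__ == "__main__":
    rate = run_evals(CASES)
    print(f"pass rate: {rate:.0%}")  # watch this number move as you iterate
```

Even a crude pass rate like this makes the 85%-to-100% gap measurable instead of anecdotal.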
Humans Must Inspect Outputs
- Always review a sample of prompts and outputs yourself before relying on agents to do it automatically; see the sampling sketch after this list.
- Human curiosity about how the system behaves is essential for diagnosing and fixing issues.
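One low-effort way to build that habit is to pull a random sample from whatever logs you already collect. The Python sketch below is a hypothetical illustration; the `records` structure, sample size, and print-based review are assumptions, not the workflow described in the episode.

```python
# Hypothetical sketch: sample logged prompt/output pairs for manual review.
import random

# In practice these would be loaded from your logging/observability store.
records = [
    {"prompt": "Summarise ticket #42", "output": "User cannot log in after password reset."},
    {"prompt": "Summarise ticket #43", "output": "Refund requested for duplicate charge."},
]

SAMPLE_SIZE = 20

for record in random.sample(records, min(SAMPLE_SIZE, len(records))):
    print("PROMPT:", record["prompt"])
    print("OUTPUT:", record["output"])
    print("-" * 60)  # read these yourself before trusting an automated judge
```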
LLM Eval Challenges and Reality
- Evaluations for LLMs feel high-overhead, and the returns can seem low, especially when pushing beyond 85% performance.
- Waiting for better models is often not viable due to urgent production timelines.