Evals and LLM-as-Judge for Regression Tests

Danny recommends golden datasets, LLM graders, and CI checks that measure prompt regressions and fail the build when quality drops.

Episode notes

The “prompt-and-pray” era is over — and that’s a good thing.

In this episode, we break down why AI “magic” collapses under real production traffic (edge cases, hallucinations, messy inputs, and even infrastructure-level failures)… and what replaces it: actual AI engineering.

Danny frames the shift with four architectural pillars that make LLM features shippable and reliable:

- State orchestration (stop treating models like employees — they’re stateless CPUs)

- Constraint generation (JSON forcing, schema-driven outputs, type-safe sampling)

- Infrastructure reliability (retries, backoff, fallbacks — because inference can and will fail)

- Regression testing & evals (measure prompts like code, break builds when quality drops)
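The fourth pillar can be sketched as a small eval gate. This is a minimal illustration, not the method described in the episode: the names GoldenCase, pass_rate, and exit_code are hypothetical, the golden cases are made up, and stub_generate and stub_judge are offline stand-ins for a real model call and a real LLM grader.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One entry in a golden dataset: an input and its known-good answer."""
    prompt: str
    reference: str


# Hypothetical golden dataset; a real one would be curated from production traffic.
GOLDEN = [
    GoldenCase("What is the refund window?", "30 days"),
    GoldenCase("Which plan includes SSO?", "Enterprise"),
    GoldenCase("What is the support email?", "help@example.com"),
]


def stub_generate(prompt: str) -> str:
    """Stand-in for the model under test (assumption: no real API is called)."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO ships with the Enterprise plan.",
        "What is the support email?": "Please use the contact form.",  # regression
    }
    return canned[prompt]


def stub_judge(prompt: str, reference: str, candidate: str) -> bool:
    """Stand-in for an LLM grader: passes if the reference appears in the answer.

    A real LLM-as-judge would send all three strings to a grading model and
    parse its verdict; substring matching only keeps this sketch offline.
    """
    return reference.lower() in candidate.lower()


def pass_rate(
    cases: Sequence[GoldenCase],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Run every golden case through the model and return the graded pass rate."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1 for case in cases if judge(case.prompt, case.reference, generate(case.prompt))
    )
    return passed / len(cases)


def exit_code(rate: float, threshold: float) -> int:
    """Map a pass rate to a process exit code: 0 keeps CI green, 1 breaks the build."""
    return 0 if rate >= threshold else 1
```

In CI, a script would compute pass_rate over the golden dataset and hand exit_code(rate, threshold) to sys.exit, so a prompt change that drops quality below the threshold fails the build the same way a failing unit test would. The threshold is a team decision, and it usually sits below 1.0 because model output is not deterministic.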

Site: https://www.programmingpodcast.com/

Stay in Touch:

📧 Have questions for the show? Or are you a business that wants to talk business?

Email us at dannyandleonspodcast@gmail.com!

Danny Thompson

https://x.com/DThompsonDev

/ dthompsondev

www.DThompsonDev.com

Leon Noel

https://x.com/leonnoel

/ leonnoel

https://100devs.org/

We also hit the reality of agent “throughput” vs human review bottlenecks (Phoenix Project vibes), why monolithic agents are a trap, and a listener question about networking + credibility after pitching an MVP that isn’t fully shipped yet.

If you’re building AI features for real users — not demos — this is the blueprint.

00:00 — The “prompt-and-pray” era is over

02:49 — AI hype fades: guardrails + reality

06:34 — Deterministic software vs probabilistic models

07:29 — The 4 pillars of AI engineering (overview)

11:37 — Pillar 1: state orchestration (FSM, stateless models)

20:26 — Pillar 2: constraint generation (JSON, schemas, type safety)

28:28 — Pillar 3: infra reliability (retries, fallbacks, failures)

32:21 — Pillar 4: evals + regression testing (LLM-as-judge)

43:40 — Listener question: networking, MVP pressure, and credibility
