This episode is sponsored by AGNTCY. Unlock agents at scale with an open Internet of Agents.

Visit https://agntcy.org/ and add your support.

How do the world's most powerful AI models get trained and trusted at scale, and what does that really take from data to deployment?

In this episode, Appen CEO Ryan Kolln joins Eye on AI to unpack how rigorous human evaluation, culturally aware data, and model-based judges come together to raise real-world performance.

Host Craig Smith speaks with Ryan about building evaluation systems that go beyond static benchmarks to measure usefulness, safety, and reliability in production. They explore how human raters and AI evaluators work in tandem, why localization matters across regions and domains, and how quality controls keep feedback signals trustworthy for training and post-training.

Ryan explains how evaluation feeds reinforcement strategies, where rubric-driven human judgments inform reward models, and how enterprises can stand up secure workflows for sensitive use cases. He also discusses emerging needs around sovereign models, domain-specific testing, and the shift from general chat to agentic workflows that operate inside real business systems.

Learn how leading teams design human-in-the-loop evaluation, when to route judgments from models back to expert reviewers, how to capture cultural nuance without losing universal guardrails, and how to build an evaluation stack that scales from early prototypes to production AI.

Stay Updated:
Craig Smith on X: https://x.com/craigss
Eye on A.I. on X: https://x.com/EyeOn_AI

#298 Ryan Kolln: How Appen Trains the World's Most Powerful AI Models

Eye On A.I.

Why benchmarks alone don't measure real-world model quality
