
Eye On A.I. #298 Ryan Kolln: How Appen Trains the World's Most Powerful AI Models
Nov 6, 2025

Ryan Kolln, CEO of Appen, discusses the critical role of human evaluation in training AI models. He explains why traditional benchmarks fall short, emphasizing the need for user-centered measures. Kolln highlights how curated human evaluators provide richer insights than random feedback, ensuring AI's cultural relevance through localized data. He also covers the evolution from supervised learning to large language model evaluation, and the synergy between AI evaluators and human annotators in enhancing quality control and model performance.
AI Snips
Benchmarks Are Limited Predictors
- Benchmarks give a snapshot but often fail to predict real user experience for LLMs.
- Narrow benchmarks miss edge cases and cultural nuance essential for global performance.
Human Rubrics Drive Real Feedback
- Human evaluators judge responses on accuracy, format, grammar, harm, and bias using subjective rubrics.
- Their written feedback often doubles as training data to retrain and improve models (see the sketch below).
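The rubric-and-feedback loop described in this snip can be pictured with a small sketch. The Python below is purely illustrative, not Appen's actual tooling: the rubric dimensions, the 1–5 scale, the record fields, and the to_training_example helper are all assumptions made for the example, showing one way a scored judgment plus a written critique could be packaged for reuse as training data.

```python
from dataclasses import dataclass
import json

# Hypothetical rubric dimensions; the episode names accuracy, format,
# grammar, harm, and bias as the kinds of criteria evaluators score.
RUBRIC_DIMENSIONS = ("accuracy", "format", "grammar", "harm", "bias")

@dataclass
class EvaluationRecord:
    """One human judgment of a model response against a subjective rubric."""
    prompt: str
    response: str
    scores: dict            # dimension -> score on an assumed 1-5 scale
    written_feedback: str   # free-text critique; can double as training data

    def to_training_example(self) -> dict:
        # One simple way the written critique could feed back into retraining:
        # pair the original prompt/response with the evaluator's feedback.
        return {
            "prompt": self.prompt,
            "response": self.response,
            "critique": self.written_feedback,
            "scores": self.scores,
        }

# Example usage with made-up content.
record = EvaluationRecord(
    prompt="How do I soothe a teething toddler?",
    response="Try a chilled teething ring and extra cuddles...",
    scores={dim: 4 for dim in RUBRIC_DIMENSIONS},
    written_feedback="Accurate and safe, but the key advice is buried at the end.",
)
print(json.dumps(record.to_training_example(), indent=2))
```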
Parenting Advice Shows Cultural Gaps
- Appen recruited parenting experts globally and found that parenting advice differs widely across cultures.
- That work ensures models provide locally relevant guidance rather than a Western average.
