Thoughtworks Technology Podcast

AI testing, benchmarks and evals

Jan 23, 2025
Join Shayan Mohanty, Head of AI Research at Thoughtworks, and John Singleton, Program Manager at the AI Lab, as they dive into the complexities of generative AI. They discuss the vital role of evals, benchmarks, and guardrails in ensuring AI reliability. The duo outlines the differences between testing and evaluations, highlighting their significance for businesses. Additionally, they explore mechanistic interpretability and the need for robust frameworks to enhance trust in AI applications. This conversation is essential for anyone navigating the evolving AI landscape.
AI Snips
INSIGHT

Understanding Benchmarks and Evals

  • Benchmarks compare models but don't guarantee performance in real-world applications.
  • Evals measure properties of a model or application, offering a deeper understanding than simple pass/fail tests; a minimal sketch of the distinction follows below.
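
To make the distinction concrete, here is a minimal Python sketch (not from the episode; model_answer is a hypothetical stand-in for any LLM call): a test asserts a single pass/fail condition, while an eval scores a property across many cases and reports a metric you can track over time.

    # Minimal sketch contrasting a pass/fail test with a property-scoring eval.
    # model_answer() is a hypothetical placeholder for an LLM call.

    def model_answer(question: str) -> str:
        return "Paris is the capital of France."  # placeholder response

    # Test: a binary assertion -- the run either passes or fails.
    def test_capital_of_france() -> None:
        assert "Paris" in model_answer("What is the capital of France?")

    # Eval: scores a property (here, a crude keyword-coverage score) across
    # many cases and reports an aggregate metric you can track over time.
    def eval_keyword_coverage(cases: list[tuple[str, set[str]]]) -> float:
        scores = []
        for question, expected_terms in cases:
            answer = model_answer(question).lower()
            hits = sum(term in answer for term in expected_terms)
            scores.append(hits / len(expected_terms))
        return sum(scores) / len(scores)

    if __name__ == "__main__":
        test_capital_of_france()
        print(eval_keyword_coverage([
            ("What is the capital of France?", {"paris", "france"}),
        ]))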
ADVICE

Focus on Business Value and Risk

  • Define ROI and success metrics before implementing generative AI solutions.
  • Approach generative AI applications as you would any other software application, focusing on business value and risk assessment.
INSIGHT

Common Pitfalls in AI Evaluation

  • Misinterpreting metrics like perplexity is a common mistake, much as accuracy is often misread in traditional machine learning; a short sketch of how perplexity is computed follows below.
  • Consider where trust starts and ends in your AI system, especially when using LLMs to evaluate other LLMs.
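
For context on the perplexity point, a brief sketch (not from the episode): perplexity is the exponential of the average per-token negative log-likelihood, so a lower value only means the model finds the text less surprising under its own tokenization, and values are not directly comparable across models with different vocabularies.

    import math

    # Sketch: perplexity = exp(mean negative log-likelihood per token).
    # A lower value means the model is less "surprised" by the text under its
    # own tokenizer; it is not a direct measure of application quality.
    def perplexity(token_log_probs: list[float]) -> float:
        avg_nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(avg_nll)

    # Hypothetical per-token natural-log probabilities for one sentence.
    print(perplexity([-0.2, -1.5, -0.7, -3.0]))  # ~3.86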