MLOps.community

All About Evaluating LLM Applications // Shahul Es // #179

Oct 6, 2023
Shahul Es, creator of the Ragas Project and an evaluation expert, discusses open-source model evaluation, including debugging, troubleshooting, and the challenges of benchmarks. The conversation highlights the importance of custom data distributions and fine-tuning for better model performance, explores the difficulties of evaluating LLM applications, and makes the case for reliable leaderboards. It also touches on the security aspects of language models and the significance of data preparation and filtering, contrasts fine-tuning with retrieval-augmented generation, and closes with resources for evaluating LLM applications.
AI Snips
INSIGHT

Evaluation Defined

  • Evaluation means measuring and quantifying a system's performance to enable improvements.
  • Iterations and measurements help determine whether changes are positive or negative (a minimal sketch follows below).
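
A minimal sketch of that measure-change-re-measure loop, assuming a toy exact-match metric and stubbed application functions. None of these names come from the episode; a real LLM evaluation would use graded metrics (such as those the Ragas Project provides) rather than exact match.

```python
# Hypothetical sketch: score two iterations of an LLM application on the same
# eval set and keep the change only if the aggregate metric improves.
from statistics import mean
from typing import Callable

def exact_match(predicted: str, reference: str) -> float:
    """Toy metric; real LLM evals would use graded metrics such as
    faithfulness or answer relevancy instead of exact match."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0

def evaluate(system: Callable[[str], str], eval_set: list[tuple[str, str]]) -> float:
    """Run every eval question through the system and average the metric."""
    return mean(exact_match(system(q), ref) for q, ref in eval_set)

def baseline(question: str) -> str:
    # Iteration 1 of the (stubbed) application.
    return "Paris" if "france" in question.lower() else "unknown"

def candidate(question: str) -> str:
    # Iteration 2 of the (stubbed) application.
    q = question.lower()
    if "france" in q:
        return "Paris"
    if "japan" in q:
        return "Tokyo"
    return "unknown"

eval_set = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]

before, after = evaluate(baseline, eval_set), evaluate(candidate, eval_set)
print(f"baseline={before:.2f}, candidate={after:.2f} -> "
      f"{'keep the change' if after > before else 'revert the change'}")
```

The point is the loop, not the metric: each change to the application gets re-scored against the same fixed eval set, so the comparison tells you whether the iteration helped.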
INSIGHT

Leaderboard Gaming

  • Open-source LLM leaderboards can be unreliable because models are over-optimized for the benchmarks they measure.
  • This "gaming" makes models less useful for real-world applications.
ANECDOTE

Kaggle's Approach

  • Kaggle uses public and private test sets to avoid overfitting in competitions.
  • This approach could be adopted by open-source LLM evaluators (see the sketch below).
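
A hedged sketch of how a public/private test split could work for an LLM leaderboard. The split ratio, fixed seed, and function names are illustrative assumptions, not how Kaggle or any existing leaderboard actually implements it.

```python
# Illustrative only: hold back most of the test set as a hidden "private" split
# so submissions tuned against the visible "public" split can't quietly
# overfit the final ranking.
import random
from statistics import mean
from typing import Callable

def split_public_private(test_set: list, public_fraction: float = 0.3, seed: int = 42):
    """Shuffle once with a fixed seed, then cut into (public, private) splits."""
    rng = random.Random(seed)
    shuffled = test_set[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]

def leaderboard_scores(score_one: Callable[[dict], float], test_set: list):
    public, private = split_public_private(test_set)
    # Submitters see only the public score while iterating...
    public_score = mean(score_one(example) for example in public)
    # ...the private score comes from held-back data and is revealed only at
    # the end, so it decides the final ranking.
    private_score = mean(score_one(example) for example in private)
    return public_score, private_score

# Toy usage: each example already carries a per-example score.
toy_test_set = [{"id": i, "score": (i % 5) / 4} for i in range(20)]
public_score, private_score = leaderboard_scores(lambda ex: ex["score"], toy_test_set)
print(f"public={public_score:.2f}  private={private_score:.2f}")
```

A large gap between public and private scores is the signal that a model was tuned to the visible split, which is exactly the over-optimization the snips above describe.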