

All About Evaluating LLM Applications // Shahul Es // #179
Oct 6, 2023
Shahul Es, creator of the Ragas Project and an evaluation expert, discusses open-source model evaluation, including debugging, troubleshooting, and the challenges of benchmarks. The conversation covers the importance of custom data distributions and fine-tuning for better model performance, the difficulties of evaluating LLM applications, and the need for reliable leaderboards. It also touches on the security aspects of language models, the significance of data preparation and filtering, the trade-offs between fine-tuning and retrieval-augmented generation, and resources for evaluating LLM applications.
AI Snips
Evaluation Defined
- Evaluation means measuring and quantifying a system's performance to enable improvements.
- Iterating and re-measuring shows whether each change made the system better or worse (a minimal sketch follows this list).
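
To make the snip concrete, here is a minimal sketch of that measure-change-remeasure loop. It is not from the episode: `baseline_app`, `candidate_app`, and the exact-match metric are hypothetical stand-ins for whatever LLM pipeline and metric (for example, a Ragas metric) you actually use.

```python
# A minimal sketch (not from the episode) of "measure, change, re-measure".
# baseline_app, candidate_app, and the exact-match metric are hypothetical
# stand-ins for a real LLM pipeline and a real evaluation metric.
from typing import Callable, Dict, List

# A tiny fixed eval set: question -> expected answer.
EVAL_SET: List[Dict[str, str]] = [
    {"question": "What does RAG stand for?", "expected": "retrieval augmented generation"},
    {"question": "Who created the Ragas project?", "expected": "shahul es"},
]

def baseline_app(question: str) -> str:
    """Stand-in for the current version of the LLM application."""
    return "retrieval augmented generation" if "RAG" in question else "unknown"

def candidate_app(question: str) -> str:
    """Stand-in for the changed version being considered."""
    return "retrieval augmented generation" if "RAG" in question else "shahul es"

def exact_match(app: Callable[[str], str]) -> float:
    """Fraction of eval questions answered exactly as expected (case-insensitive)."""
    hits = sum(app(row["question"]).strip().lower() == row["expected"] for row in EVAL_SET)
    return hits / len(EVAL_SET)

if __name__ == "__main__":
    before = exact_match(baseline_app)
    after = exact_match(candidate_app)
    print(f"baseline={before:.2f}  candidate={after:.2f}")
    # A higher score on the same eval set is evidence the change was positive.
    print("keep the change" if after > before else "revert the change")
```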
Leaderboard Gaming
- Open-source LLM leaderboards can be unreliable because models are over-optimized for the benchmark test sets.
- Models that game the benchmarks this way end up less useful for real-world applications.
Kaggle's Approach
- Kaggle uses public and private test sets to avoid overfitting in competitions.
- This approach could be adopted by open-source LLM evaluators (see the sketch below).
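
A minimal sketch of that public/private idea, under assumptions not taken from the episode: the two submissions and their per-item correctness are made up, and `accuracy` is just correctness restricted to one split. The point is only that a model tuned to the public split collapses on the hidden private split used for the final ranking.

```python
# A minimal sketch (not from the episode) of a Kaggle-style public/private split.
# The two "submissions" and their per-item correctness are made up for illustration.
import random
from typing import Dict, List, Set

random.seed(0)

# 100 hidden test items, shuffled and split once by the organizer.
items: List[int] = list(range(100))
random.shuffle(items)
PUBLIC: Set[int] = set(items[:50])   # scores shown on the live leaderboard
PRIVATE: Set[int] = set(items[50:])  # held back until the evaluation closes

# Hypothetical per-item correctness (True = answered correctly).
SUBMISSIONS: Dict[str, Dict[int, bool]] = {
    "tuned_to_public_leaderboard": {i: (i in PUBLIC) for i in range(100)},
    "generalizes_reasonably":      {i: (i % 10 != 0) for i in range(100)},
}

def accuracy(correct: Dict[int, bool], split: Set[int]) -> float:
    """Accuracy restricted to one split of the test set."""
    return sum(correct[i] for i in split) / len(split)

if __name__ == "__main__":
    for name, preds in SUBMISSIONS.items():
        print(f"{name}: public={accuracy(preds, PUBLIC):.2f} "
              f"private={accuracy(preds, PRIVATE):.2f}")
    # The submission that only gamed the public split collapses on the private
    # split, which is the one the final ranking is based on.
```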