Evaluating Retrieval Systems and LLM Performance Metrics

This chapter explores the creation of effective evaluation datasets for assessing retrieval systems and the quality of answers generated by large language models. It emphasizes key metrics like context precision, recall, and the concept of faithfulness in LLM responses, highlighting the importance of human-generated answers for performance comparison.

Episode notes

RAG isn't a magic fix for search problems. While it works well at first, most teams find it's not good enough for production out of the box. The key is to make it better step by step, using good testing and smart data creation.

Today, we are talking to Saahil Ognawala from Jina AI to start to understand RAG.

To build a good RAG system, you need three things: ways to test it, methods to create training data, and plans to make it better over time. Testing starts with a set of example searches that users might make. These should include common searches that happen often, moderately common searches, and rare searches that only happen now and then. This mix helps you measure whether changes make your system better or worse.
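The testing idea above can be sketched as a small evaluation harness. This is a minimal illustration, not the setup discussed in the episode: the `bucket` labels ("head", "medium", "tail"), the dictionary layout of each example, and the `search_fn` callable are all assumptions made for the example.

```python
from collections import defaultdict


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def evaluate_by_bucket(eval_set, search_fn, k=5):
    """Average recall@k and precision@k per query-frequency bucket.

    eval_set: list of dicts with hypothetical keys "query", "bucket", "relevant".
    search_fn: any callable mapping a query string to a ranked list of doc ids.
    """
    totals = defaultdict(lambda: {"recall": 0.0, "precision": 0.0, "n": 0})
    for example in eval_set:
        retrieved = search_fn(example["query"])
        bucket = totals[example["bucket"]]
        bucket["recall"] += recall_at_k(retrieved, example["relevant"], k)
        bucket["precision"] += precision_at_k(retrieved, example["relevant"], k)
        bucket["n"] += 1
    return {
        name: {"recall": t["recall"] / t["n"], "precision": t["precision"] / t["n"]}
        for name, t in totals.items()
    }
```

Running the same evaluation set before and after a change, and comparing the per-bucket numbers, shows whether an improvement on common searches came at the cost of rare ones.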

Creating synthetic data helps make the system stronger, especially in spotting wrong answers that look right. Think of someone searching for a "gluten-free chocolate cake." A "sugar-free chocolate cake" might look like a good answer because it shares many words, but it's wrong.

These tricky examples help the system learn the difference between similar but different things.
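To make the cake example concrete, here is a rough sketch of picking hard negatives from a candidate pool. It swaps in plain word overlap (Jaccard similarity) as the measure of "looks right"; the episode talks about generating such examples synthetically, so treat this as a simplified stand-in. The function names are hypothetical.

```python
def tokens(text):
    """Lowercase word set; hyphens are split so 'gluten-free' shares 'free'."""
    return set(text.lower().replace("-", " ").split())


def jaccard(a, b):
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def mine_hard_negatives(query, positives, candidates, n=1):
    """Return the n non-relevant candidates that most resemble the query."""
    positive_set = set(positives)
    negatives = [c for c in candidates if c not in positive_set]
    # The most similar wrong answers are the most useful ones to train on.
    negatives.sort(key=lambda c: jaccard(query, c), reverse=True)
    return negatives[:n]


hard = mine_hard_negatives(
    query="gluten-free chocolate cake",
    positives=["gluten-free chocolate cake recipe"],
    candidates=[
        "banana bread recipe",
        "sugar-free chocolate cake recipe",
        "gluten-free chocolate cake recipe",
    ],
)
print(hard)
```

Each query, its correct answer, and a mined hard negative can then form one training triplet for an embedding model.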

When creating synthetic data, you need rules. The best way is to show the AI a few real examples and give it a list of topics to work with. Most teams find that using half real data and half synthetic data works best. This gives you enough variety while keeping things real.
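A small sketch of both rules, under stated assumptions: the prompt wording, the function names, and the `synthetic_share` parameter are invented for illustration, and the prompt would be sent to whichever LLM you use (no model call is made here).

```python
import random


def build_few_shot_prompt(real_examples, topic, n_shots=3):
    """Build a prompt that shows a few real queries and names one topic."""
    shots = "\n".join(f"- {query}" for query in real_examples[:n_shots])
    return (
        "Here are real user queries:\n"
        f"{shots}\n"
        f"Write one new query in the same style about the topic: {topic}\n"
    )


def mix_datasets(real, synthetic, synthetic_share=0.5, seed=0):
    """Combine all real examples with enough synthetic ones to hit the share.

    With synthetic_share=0.5 the result is half real, half synthetic,
    as long as enough synthetic examples are available.
    """
    if not 0 <= synthetic_share < 1:
        raise ValueError("synthetic_share must be in [0, 1)")
    rng = random.Random(seed)
    wanted = round(len(real) * synthetic_share / (1 - synthetic_share))
    n_synthetic = min(len(synthetic), wanted)
    mixed = [(example, "real") for example in real]
    mixed += [(example, "synthetic") for example in rng.sample(synthetic, n_synthetic)]
    rng.shuffle(mixed)
    return mixed
```

Keeping the source label next to each example makes it easy to check later whether the synthetic half is helping or hurting.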

Getting user feedback is hard with RAG. In normal search, you can see if users click on results. But with RAG, the system creates an answer from many pieces. A good answer might come from both good and bad pieces, making it hard to know which parts helped. This means you need smart ways to track which pieces of information actually helped make good answers.
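One simple way to start tracking this is to log which retrieved pieces went into each answer and spread the user's feedback over them. This is only a sketch with hypothetical names (`FeedbackLog`, `helpful_rate`); it credits every piece equally, which is noisy for a single answer but tends to separate good pieces from bad ones as feedback accumulates.

```python
from collections import defaultdict


class FeedbackLog:
    """Tracks which retrieved chunks were used in answers users liked."""

    def __init__(self):
        self.answers = {}  # answer_id -> list of chunk ids used
        self.stats = defaultdict(lambda: {"up": 0, "down": 0})

    def record_answer(self, answer_id, chunk_ids):
        """Remember which chunks were passed to the LLM for this answer."""
        self.answers[answer_id] = list(chunk_ids)

    def record_feedback(self, answer_id, helpful):
        """Credit (or blame) every chunk that contributed to the answer."""
        key = "up" if helpful else "down"
        for chunk_id in self.answers[answer_id]:
            self.stats[chunk_id][key] += 1

    def helpful_rate(self, chunk_id):
        """Share of positive feedback for a chunk, or None if never rated."""
        counts = self.stats.get(chunk_id)
        if counts is None:
            return None
        total = counts["up"] + counts["down"]
        return counts["up"] / total if total else None
```

Chunks with a consistently low rate are candidates for review, and pairs of a query with a low-rated chunk can feed back into the evaluation set.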

One key rule: don't make things harder than they need to be. If simple keyword search (for example, with the BM25 ranking function) works well enough, adding fancy AI search might not be worth the extra work.
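For a sense of how little a keyword baseline needs, here is a compact sketch of Okapi BM25 scoring in plain Python. The whitespace tokenizer and the parameter defaults (`k1=1.5`, `b=0.75`) are common choices assumed for the example, not values taken from the episode; a production system would normally use a search engine or an established library instead.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 ranker over a small, non-empty list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        if not docs:
            raise ValueError("docs must not be empty")
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for t in self.doc_tokens for term in set(t))
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - freq + 0.5) / (freq + 0.5))
            for term, freq in df.items()
        }

    def score(self, query, index):
        """BM25 score of the document at `index` for the query."""
        total = 0.0
        for term in tokenize(query):
            freq = self.tf[index].get(term, 0)
            if freq == 0 or self.avgdl == 0:
                continue
            length_norm = 1 - self.b + self.b * self.doc_len[index] / self.avgdl
            total += self.idf[term] * freq * (self.k1 + 1) / (freq + self.k1 * length_norm)
        return total

    def rank(self, query, top_k=3):
        """Indices of the top_k documents, best first."""
        order = sorted(
            range(len(self.doc_tokens)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return order[:top_k]
```

Measuring a baseline like this on your evaluation queries first gives a reference point: an embedding-based retriever is only worth the extra work if it clearly beats these numbers.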

Success with RAG comes from good testing, careful data creation, and steady improvements based on real use. It's not about using the newest AI models. It's about building good systems and processes that work reliably.

"It isn’t a magic wand you can place on your catalog and expect results you didn’t get before."

“Most of our users are enterprise users who have seen the most success in their RAG systems are the ones that very early implemented a continuous feedback mechanism.”

“If you can't tell in real time usage whether an answer is a bad answer or a right answer because the LLM just makes it look like the right answer then you only have your retrieval dataset to blame”

Saahil Ognawala:

LinkedIn
Jina AI

Nicolay Gerold:

LinkedIn
X (Twitter)

00:00 Introduction to Retrieval Augmented Generation (RAG)
00:29 Interview with Saahil Ognawala
00:52 Synthetic Data in Language Generation
01:14 Understanding the E5 Mistral Instructor Embeddings Paper
03:15 Challenges and Evolution in Synthetic Data
05:03 User Intent and Retrieval Systems
11:26 Evaluating RAG Systems
14:46 Setting Up Evaluation Frameworks
20:37 Fine-Tuning and Embedding Models
22:25 Negative and Positive Examples in Retrieval
26:10 Synthetic Data for Hard Negatives
29:20 Case Study: Marine Biology Project
29:54 Addressing Errors in Marine Biology Queries
31:28 Ensuring Query Relevance with Human Intervention
31:47 Few Shot Prompting vs Zero Shot Prompting
35:09 Balancing Synthetic and Real World Data
37:17 Improving RAG Systems with User Feedback
39:15 Future Directions for Jina and Synthetic Data
40:44 Building and Evaluating Embedding Models
41:24 Getting Started with Jina and Open Source Tools
51:25 The Importance of Hard Negatives in Embedding Models
