Practical AI

Towards high-quality (maybe synthetic) datasets

Oct 9, 2024
David Berenstein, a developer advocate engineer at Hugging Face, and Ben Burtenshaw, a machine learning engineer at Argilla, discuss data quality in AI. They explain how collaboration between domain experts and data scientists improves model performance. The conversation covers strategies for generating synthetic datasets, using AI for labeling, and maintaining privacy. They also share insights on effective feedback loops and on integrating multimodal data to refine AI training.
ADVICE

Modeling the Problem

  • Start by defining the problem in simple terms, writing down expected questions for the AI system.
  • Then, manually associate small sets of documents with those questions and test if a model can answer them.
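
As a rough illustration of the two steps above, the sketch below writes down a few expected questions, manually pairs each with a document, and checks whether a system surfaces the paired document. Everything in it is an assumption made for illustration, not something from the episode: the sample documents and questions, the names `Example`, `retrieve`, and `hit_rate`, and the keyword-overlap ranking, which is only a stand-in for whatever model or retriever is actually under test.

```python
import re
from dataclasses import dataclass


@dataclass
class Example:
    """One expected question plus the documents a person paired with it."""
    question: str
    doc_ids: set


def tokenize(text):
    # Lowercase word and number tokens; punctuation is dropped.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    # Stand-in retriever (assumption): rank documents by how many tokens
    # they share with the question, breaking ties by document id.
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: (-len(q & tokenize(documents[d])), d))
    return ranked[:k]


def hit_rate(examples, documents, k=1):
    # Fraction of questions whose top-k results include a manually paired document.
    hits = sum(
        1 for ex in examples
        if ex.doc_ids & set(retrieve(ex.question, documents, k))
    )
    return hits / len(examples)


# Hypothetical sample data, invented for this sketch.
documents = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
    "warranty": "The warranty covers hardware defects for two years.",
}
examples = [
    Example("How long does shipping take?", {"shipping"}),
    Example("When are refunds issued?", {"refunds"}),
    Example("What does the warranty cover?", {"warranty"}),
]

print(f"hit rate: {hit_rate(examples, documents):.2f}")
```

Swapping `retrieve` for a call to a real model keeps the rest of the check unchanged, so the same small question set can be reused as documents and questions are added.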
ADVICE

Practical Steps for RAG

  • Begin with a small set of questions and documents, testing feasibility against a simple baseline such as ChatGPT.
  • Gradually scale up documents and questions, iterating on the problem while using Argilla to gather feedback.
INSIGHT

Combining AI Techniques

  • Many AI workflows combine traditional data science models with newer generative AI models.
  • Smaller models are often more practical due to cost, privacy, and ease of fine-tuning.