Generating Training Data with Large Language Models w/ Special Guest Marzieh Fadaee

Dec 13, 2022

Marzieh Fadaee, NLP Research Lead at Zeta Alpha, discusses using large language models such as GPT-3 to generate domain-specific training data. The conversation covers two papers, 'InPars', which she co-authored, and 'Promptagator', and the methods they propose for data augmentation with minimal human intervention. Fadaee explores the challenges of applying LLMs to information retrieval, the intricacies of prompt engineering, and the potential pitfalls of synthetic data, and points to directions for future research on neural retrieval systems.

Chapters

Intro

00:00 • 5min

Enhancing Machine Learning with Synthetic Data

04:55 • 9min

Optimizing Data Generation for Re-Ranking Models

13:46 • 29min

Exploring Query Intent and Synthetic Data Generation

42:57 • 12min

Evaluating Retrieval Methodologies

55:02 • 5min

Exploring Query Intent and Task Specialization in Information Retrieval

01:00:15 • 2min

Enhancing Retrieval through Consistency Filtering

01:02:30 • 14min