AI Fundamentals: Datasets 101

Latent Space: The AI Engineer Podcast

The GPT Assistant Training Pipeline: Understanding the Different Data Sets

The speaker found Andrej Karpathy's talk at the Microsoft Build conference insightful and authoritative.
Karpathy outlined the GPT assistant training pipeline, which uses four types of data sets: raw internet data, demonstrations data, comparisons data, and prompts data.
The raw internet data is collected from Common Crawl, Wikipedia, and books.
The demonstrations data set consists of prompt and response pairs that show ideal assistant responses.
The comparisons data set pairs two candidate outputs and records which one is better.
The prompts data set is used for reinforcement learning.
The sizes differ widely: the raw internet data is low quality but very large (on the order of trillions of words), while there are tens of thousands of demonstrations, hundreds of thousands of comparisons, and tens of thousands of prompts.
These data sets are important for building ChatGPT-style models.
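To make the four data set types more concrete, here is a minimal Python sketch of what one record of each might look like. The class names, field names, and the `chosen_and_rejected` helper are hypothetical, chosen only for illustration; the summary does not describe any actual schema or file format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RawDocument:
    """One piece of raw internet text (e.g. from Common Crawl, Wikipedia, or books)."""
    text: str
    source: str  # hypothetical label for where the text came from


@dataclass
class Demonstration:
    """A prompt paired with an ideal assistant response."""
    prompt: str
    response: str


@dataclass
class Comparison:
    """Two candidate outputs for one prompt, plus a label saying which is better."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # assumed encoding: "a" or "b"


@dataclass
class Prompt:
    """A bare prompt with no reference answer, as used for reinforcement learning."""
    prompt: str


def chosen_and_rejected(comparison: Comparison) -> Tuple[str, str]:
    """Return (better_response, worse_response) for one comparison record."""
    if comparison.preferred == "a":
        return comparison.response_a, comparison.response_b
    if comparison.preferred == "b":
        return comparison.response_b, comparison.response_a
    raise ValueError("preferred must be 'a' or 'b'")


example = Comparison(
    prompt="Explain what a data set is.",
    response_a="A data set is a collection of related records.",
    response_b="Data.",
    preferred="a",
)
better, worse = chosen_and_rejected(example)
```

Note how the record shapes differ: only demonstrations carry a reference answer, comparisons carry a relative judgment between two outputs rather than a single correct one, and prompts carry no answer at all.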

Play episode from 27:32