4min snip

AI Fundamentals: Datasets 101

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

NOTE

Scaling Data and the Importance of Datasets in Deep Learning

Understanding the limitations of models trained on the internet is important, as they may not know everything.
The size of the internet is estimated to be around 5 billion gigabytes, while most datasets discussed today are in the hundreds-of-gigabytes range.
New data is being created every day, and new modalities are being used to extract data.
Datasets and benchmarks have been decoupled, allowing for the creation of custom training objectives.
Self-supervised learning, enabled by masking, allows for scaling training on unlimited amounts of data.
Deep learning is crucial for achieving optimal results in training objectives.
Having a large dataset does not necessarily mean having a reasonable benchmark.
Training on a dataset tends to lead to better performance on benchmarks.
Understanding the size of datasets and how to adapt models to suit them is essential.
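
To make the masking point concrete, here is a minimal sketch of how unlabeled text can be turned into a training pair without human annotation. It is an illustration, not the method discussed in the episode: the `[MASK]` placeholder, the function name `make_masked_example`, the 30% mask rate, and the whitespace tokenization are all assumptions chosen for brevity.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for illustration


def make_masked_example(tokens, mask_rate=0.15, seed=0):
    """Turn unlabeled tokens into an (input, targets) training pair.

    The targets come from the text itself, so no human labels are needed.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Hide at least one token so every example carries a training signal.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]  # the original token is what a model would predict
        masked[i] = MASK
    return masked, targets


# Whitespace splitting stands in for a real tokenizer (an assumption).
tokens = "new data is being created every day".split()
masked, targets = make_masked_example(tokens, mask_rate=0.3)
print(masked)
print(targets)
```

Because the hidden tokens serve as their own labels, any amount of raw text can be converted into training examples this way, which is the sense in which masking lets training scale with the data available.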
