Understanding Data Contamination and Benchmarking in AI

This chapter explores the critical implications of data contamination in AI model evaluation, highlighting discrepancies in model performance over time. It calls for greater transparency in data releases and underscores the challenges posed by dataset imbalances and tokenization across languages.

Episode notes

In April, we released our first AI Fundamentals episode: Benchmarks 101. We covered the history of benchmarks, why they exist, how they are structured, and how they influence the development of artificial intelligence.

Today we are (finally!) releasing Datasets 101! We’re really enjoying doing this series despite the work it takes - please let us know what else you want us to cover!

Stop me if you’ve heard this before: “GPT3 was trained on the entire Internet”.

Blatantly, demonstrably untrue: the GPT3 dataset is a little over 600GB, drawn primarily from Wikipedia, book corpora, WebText and 2016-2019 CommonCrawl. The MacBook Air I am typing this on has more free disk space than that. In contrast, the “entire internet” is estimated to be 64 zettabytes, or 64 trillion GB. So it’s more accurate to say that GPT3 was trained on roughly 0.000000001% of the Internet.
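For anyone who wants to check that back-of-the-envelope arithmetic, here is a minimal sketch. It only uses the two rough figures quoted above (600GB and 64 zettabytes) and assumes decimal units, so 1 zettabyte is 10^12 GB; the variable names are ours, purely for illustration.

```python
# Rough figures quoted in the text; both are estimates, not measurements.
GPT3_DATASET_GB = 600          # "a little over 600GB"
INTERNET_ZETTABYTES = 64       # estimated size of the "entire internet"
GB_PER_ZETTABYTE = 10**12      # assumes decimal units: 1 ZB = 10^21 bytes, 1 GB = 10^9 bytes

# Convert the internet estimate to GB so both numbers share a unit.
internet_gb = INTERNET_ZETTABYTES * GB_PER_ZETTABYTE

# Share of the internet covered by the dataset, as a fraction and as a percentage.
fraction = GPT3_DATASET_GB / internet_gb
percent = fraction * 100

print(f"Internet size: {internet_gb:.1e} GB")
print(f"Share of the Internet: {percent:.1e}%")
```

Under these assumptions the share comes out on the order of one billionth of a percent. Swapping in a different dataset size or internet estimate changes the exact figure but not the conclusion.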

Why spend $5m on GPU time training on $50 worth of data?

Simple: garbage in, garbage out. No matter how good your algorithms are, or how much money and compute you have, your model quality is strongly determined by the data you train it on. Research scientists think we just don’t need, or don’t have, that much high-quality data. We spend an enormous amount of effort throwing out data to keep the quality high, and recently Web 2.0-era UGC platforms like StackOverflow, Reddit, and Twitter clamped down on their APIs as they realized the goldmines they sit on.

Data is the new new oil. Time for a primer!

Show Notes

* Our 2 months’ worth of podcast prep notes!

* The Token Crisis paper

* Ilya Sutskever on datasets

* OpenAI Tokenizer

* Kaplan Scaling Laws Lecture

* Chinchilla Paper

* Sasha Rush’s Tweet

* Karpathy’s Build Conference Presentation

* LIMA Paper

* Phi-1 by Microsoft