
TED Tech: When AI Cannibalizes Its Data | Short Wave
Oct 31, 2025

Ilia Shumailov, a computer scientist specializing in large language models, joins the show to explore what happens when AI trains on its own output. He discusses the rise of machine-generated internet content and the risks it poses for language models: training on synthetic data can introduce errors that are then systematically amplified. They liken the phenomenon to a game of telephone, where accuracy deteriorates with each iteration. He also reassures listeners that, with better data practices, model collapse can be avoided, keeping AI development on a positive track.
AI Snips
AI Trains On Its Own Output
- Large language models learn from huge amounts of human-written text and mimic those patterns.
- As more internet content is machine-generated, models risk internalizing synthetic patterns instead of human realities.
Lunch Prompted A Research Question
- Ilia and his brother discussed over lunch how machine-generated internet content might affect future models.
- That conversation prompted research showing synthetic data can cause degradation when recycled into training.
Model Collapse Is Theoretical And Real
- Theoretical work shows iteratively training on synthetic outputs can make models degrade over time.
- Repeatedly consuming their own approximations drives models toward lower-quality, collapsing behavior (see the toy simulation sketched below).
