
TED Tech: When AI Cannibalizes Its Data | Short Wave
Oct 31, 2025

Ilia Shumailov, a computer scientist specializing in large language models, joins the show to explore what happens when AI trains on its own output. He discusses the rise of machine-generated internet content and the risks it poses for language models: training on synthetic data can introduce errors that are then systematically amplified. They liken the phenomenon to a game of telephone, where accuracy deteriorates with each iteration. He also reassures listeners that, with better data practices, model collapse can be avoided, keeping AI development on a positive track.
AI Snips
AI Trains On Its Own Output
- Large language models learn from huge amounts of human-written text and mimic those patterns.
- As more internet content is machine-generated, models risk internalizing synthetic patterns instead of human realities.
Lunch Prompted A Research Question
- Ilia and his brother discussed over lunch how machine-generated internet content might affect future models.
- That conversation prompted research showing synthetic data can cause degradation when recycled into training.
Model Collapse Is Theoretical And Real
- Theoretical work shows iteratively training on synthetic outputs can make models degrade over time.
- Repeatedly consuming their own approximations drives models toward lower-quality, collapsing behavior (see the toy simulation sketched below).
