When AI Cannibalizes Its Data | Short Wave

Oct 31, 2025
Ilia Shumailov, a computer scientist specializing in large language models, joins the show to explore a growing AI concern: the rise of machine-generated internet content and the dangers it poses for language models. He explains how training on synthetic data can introduce and then systematically amplify errors, likening the process to a game of telephone in which accuracy deteriorates with each iteration. He also reassures listeners that, with better data practices, model collapse can be addressed, keeping AI development on a positive track.
INSIGHT

AI Trains On Its Own Output

  • Large language models learn from huge amounts of human-written text and mimic those patterns.
  • As more internet content is machine-generated, models risk internalizing synthetic patterns instead of human realities.
ANECDOTE

Lunch Prompted A Research Question

  • Ilia and his brother discussed over lunch how machine-generated internet content might affect future models.
  • That conversation prompted research showing synthetic data can cause degradation when recycled into training.
INSIGHT

Model Collapse Is Theoretical And Real

  • Theoretical work shows that iteratively training models on their own synthetic outputs makes them degrade over generations.
  • Repeatedly consuming their own approximations drives models toward lower-quality output, a failure mode known as model collapse (see the toy sketch below).
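
The recursive dynamic described in this snip is easy to reproduce with a toy estimator. The sketch below is an illustrative assumption, not the setup from Shumailov's research: "training" is reduced to fitting a Gaussian to the previous generation's samples, and the parameters n_samples and n_generations are arbitrary.

```python
# Toy illustration of model-collapse dynamics (not the paper's actual setup):
# each "generation" fits a Gaussian to samples drawn from the previous
# generation's fitted Gaussian. Finite-sample estimation error compounds,
# so the estimated parameters drift and the distribution's tails are lost.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 1000      # training-set size per generation (assumed)
n_generations = 30    # rounds of training on the model's own output (assumed)

# Generation 0 trains on "human data": a standard normal distribution.
data = rng.normal(0.0, 1.0, n_samples)

for gen in range(n_generations):
    # "Train" a model: here, just estimate the mean and std from the data.
    mu_hat = data.mean()
    sigma_hat = data.std(ddof=1)
    # The next generation trains only on this model's synthetic output.
    data = rng.normal(mu_hat, sigma_hat, n_samples)
    if gen % 5 == 0 or gen == n_generations - 1:
        print(f"gen {gen:2d}: mean={mu_hat:+.3f}  std={sigma_hat:.3f}")
```

Run over enough generations, the fitted parameters wander away from the originals and the spread tends to shrink, mirroring the telephone-game analogy from the episode: each round copies an approximation of an approximation. The mitigation the episode points to is keeping genuine human-written data in the training mix rather than recycling model output alone.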