Latent Space: The AI Engineer Podcast

2024 in Synthetic Data and Smol Models [LS Live @ NeurIPS]

Dec 24, 2024
Loubna Ben Allal, an AI researcher at Hugging Face, dives into the dynamic world of synthetic data and small language models. She discusses how 2024 saw a remarkable surge in synthetic data applications, with notable contributions like Apple's Rephrasing the Web and Hugging Face's Cosmopedia. Loubna emphasizes the transformative impact of synthetic data on model performance and diversity. The conversation also touches on the evolution of small models, highlighting their efficiency, improved privacy, and specialized applications for on-device use.
INSIGHT

Synthetic Data's Rise

  • Synthetic data now permeates the entire LLM pipeline, from pre-training to post-training and evaluation.
  • This allows for complete control over data and training, a paradigm shift from just a few years ago.
ANECDOTE

Fully Synthetic LLM Training

  • A 1B LLM can be trained entirely on 150B tokens of synthetic data like Cosmopedia.
  • The resulting model is evaluated using synthetic benchmarks and LLM judges, demonstrating a fully synthetic LLM pipeline.
INSIGHT

Why Synthetic Data is Popular

  • The rise of synthetic data is fueled by stronger LLMs, cheaper generation, and better frameworks.
  • Tools like vLLM, TGI, and TensorRT facilitate easy and efficient synthetic data generation.