
The Information Bottleneck EP22: Data Curation for LLMs with Cody Blakeney (Datology AI)
Cody Blakeney from Datology AI joins us to talk about data curation: the unglamorous but critical work of figuring out what to actually train models on.
Cody's path from writing CUDA kernels to spending his days staring at weird internet text says a lot about where the leverage now sits: data quality can account for half or more of a model's final performance, on par with major architectural breakthroughs.
We get into the differences between pre-training, mid-training, and post-training data. Mid-training in particular has become a key technique for squeezing value out of rare, high-quality datasets. Cody's team stumbled onto it while solving a practical problem: how do you figure out if a 5-billion-token dataset is actually useful when you can't afford hundreds of experimental runs?
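To make that concrete, here is a minimal sketch of the annealing-style evaluation this describes, not Datology's actual pipeline: resume a late pre-training checkpoint, briefly continue training on a baseline mix versus the same mix with the candidate data spliced in, and compare benchmark deltas. `run_anneal`, the checkpoint name, and the mixes are all hypothetical stand-ins.

```python
# Hypothetical stand-in for a real training + eval job: continue pre-training
# `checkpoint` on `mix` for a short token budget, then run the eval suite.
def run_anneal(checkpoint: str, mix: dict[str, float],
               token_budget: int = 50_000_000_000) -> dict[str, float]:
    # Dummy scores so the sketch executes; replace with your training stack.
    return {"mmlu": 0.0, "gsm8k": 0.0}

baseline_mix = {"web": 0.85, "code": 0.10, "math": 0.05}
candidate_mix = {"web": 0.75, "code": 0.10, "math": 0.05, "candidate_set": 0.10}

base = run_anneal("ckpt_step_950k", baseline_mix)
test = run_anneal("ckpt_step_950k", candidate_mix)

# If the candidate set is useful, its anneal should beat the baseline anneal.
for bench in base:
    print(f"{bench}: delta = {test[bench] - base[bench]:+.3f}")
```

One pair of short runs from a shared checkpoint replaces hundreds of full pre-training experiments, which is what makes small datasets measurable at all.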
We also talk about data filtering and some genuinely surprising findings: the documents that make the best training data are often short and dense with information. Those nicely written blog posts with personal anecdotes? Turns out models don't learn as well from them.
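Here is a toy illustration of that filtering intuition, not a production filter: score documents by information density, favoring short texts with a high fraction of content words over long, chatty prose. Real pipelines typically use trained classifiers; the stopword list and thresholds below are made up.

```python
# Toy density filter: short, fact-dense documents score high; long,
# anecdote-heavy prose scores low. Thresholds are illustrative only.
STOPWORDS = {"the", "a", "an", "and", "or", "but", "i", "my", "me", "so",
             "very", "really", "just", "of", "to", "in", "it", "was", "is"}

def density_score(doc: str) -> float:
    words = doc.lower().split()
    if not words:
        return 0.0
    content_fraction = sum(w not in STOPWORDS for w in words) / len(words)
    # Penalize very long documents: dense, reference-style text tends to be short.
    length_penalty = min(1.0, 500 / len(words))
    return content_fraction * length_penalty

docs = [
    "The boiling point of water at sea level is 100 degrees Celsius.",
    "So I was really just thinking about my trip and it was very nice...",
]
kept = [d for d in docs if density_score(d) > 0.6]
print(kept)  # keeps the dense factual line, drops the anecdote
```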
On synthetic data, Cody thinks its use in pre-training is still in its early days: most techniques are variations on a few core ideas, but there's huge potential. He's especially excited about closing the loop between RL and mid-training: when models fail at tasks, use that failure signal to generate targeted training data.
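A speculative sketch of that loop, with all names and helpers hypothetical: harvest the tasks where RL rollouts score poorly, then have a generator synthesize documents that teach exactly those skills for the next mid-training mix.

```python
# Speculative sketch of the RL-failure-to-mid-training loop; `reward` fields,
# the 0.5 threshold, and the generator are all hypothetical.
def collect_failures(rollouts: list[dict]) -> list[dict]:
    """Keep rollouts whose reward fell below a failure threshold."""
    return [r for r in rollouts if r["reward"] < 0.5]

def generate_targeted_data(failure: dict) -> str:
    """Placeholder: in practice a strong model would synthesize documents
    teaching the skill the failed task requires."""
    return f"[synthetic doc targeting task: {failure['task']}]"

rollouts = [
    {"task": "multi-step unit conversion", "reward": 0.1},
    {"task": "simple arithmetic", "reward": 0.9},
]
additions = [generate_targeted_data(f) for f in collect_failures(rollouts)]
print(additions)  # documents to splice into the next mid-training mix
```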
Takeaways:
- Data work is high-leverage but underappreciated.
- Mid-training helps extract signal from small, valuable datasets.
- Good filters favor dense, factual text over polished prose.
- Synthetic data for pre-training works surprisingly well but remains primitive.
- Optimal data mixtures depend on model scale: smaller models need more aggressive distribution shifts.
Timeline:
(00:12) Introduction to Data Curation in LLMs
(05:14) The Importance of Data Quality
(10:15) Pre-training vs Post-training Data
(15:22) Strategies for Effective Data Utilization
(20:15) Benchmarking and Model Evaluation
(28:28) Maximizing Perplexity and Coherence
(30:27) Measuring Quality in Data
(32:56) The Role of Filters in Data Selection
(34:19) Understanding High-Quality Data
(39:15) Mid-Training and Its Importance
(46:51) Future of Data Sources
(48:13) Synthetic Data's Role in Pre-Training
(53:10) Creating Effective Synthetic Data
(57:39) The Debate on Pure Synthetic Data
(01:00:25) Navigating AI Training and Legal Challenges
(01:02:34) The Controversy of AI in the Art Community
(01:05:29) Exploring Synthetic Data and Its Efficiency
(01:11:21) The Future of Domain-Specific vs. General Models
(01:22:06) Bias in Pre-trained Models and Data Selection
(01:28:27) The Potential of Synthetic Data Over Human Data
Music:
- "Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
- "Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
About
The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
