Interconnects

We aren't running out of training data, we are running out of open training data

May 29, 2024
Explores the scarcity of open training data, data licensing deals, scaling language models, and the shift toward synthetic and multimodal data for training. Topics include synthetic data generation at a rate of 1 trillion tokens per day, the high cost of data licensing deals, and the search for better tokens and new frontiers in AI development.