
The Surprising Websites Included in ChatGPT's Training Data
AI Chat: ChatGPT, AI News, Artificial Intelligence, OpenAI, Machine Learning
00:00
Google's C4 Dataset: A Massive Snapshot of Content From 15 Million Different Websites
ChatGPT is based on data from 15 million different websites that were used to construct some of the most high-profile English-language AIs. OpenAI doesn't necessarily disclose what dataset they're using to train the models backing ChatGPT. But we can assume it's some pretty similar things, and you'll see why in a little bit. One of the biggest content sections in this entire dataset is just business in general. It made up about 16% of the entire dataset.


