Here's Proof You Can Train an AI Model Without Slurping Copyrighted Content
Mar 21, 2024
A look at training AI models without copyrighted data: public domain text, the Fairly Trained certification program for ethical AI, and licensing trends among AI projects such as VoiceMod and Frostbite Orckings.
AI models can be trained ethically without copyrighted data by using public domain text.
The Common Corpus dataset provides a large training resource for AI models free from copyright concerns.
Deep dives
AI Training Models Without Copyrighted Data
A group of researchers, supported by the French government, has developed a large AI training data set composed entirely of public domain text. Separately, Fairly Trained, a non-profit organization, has certified a large language model named KL3M, created by the legal tech startup 273 Ventures using a curated training data set of legal, financial, and regulatory documents. Together, these efforts challenge the common practice of training AI models on copyrighted material and demonstrate an alternative path to AI development that respects copyright law and emphasizes ethical data usage.
Public Domain Data Set for AI Development
Researchers have introduced Common Corpus, which they describe as the largest AI data set for language models consisting solely of public domain content. The data set, at roughly 500 billion words, aims to provide a robust training resource free from copyright concerns. Coordinated by the French startup Pleias and supported by various AI groups and the French Ministry of Culture, Common Corpus represents a move toward fair and ethical AI development by offering a diverse, infringement-free training set to researchers and startups across different fields.
OpenAI claimed it's "impossible" to build good AI models without using copyrighted data. An "ethically created" large language model and a giant AI dataset of public domain text suggest otherwise.