The Challenges of Training Open Source Language Models

OpenAI has never released any of their training data, so we had to go out and collect our own. We built this dataset called the Pile, and kind of the rough heuristic for this was: what were the kinds of things we wanted a language model to know? Let's go find data that contains that information and put it in. It's an amalgamation of data from like 20, 23, I want to say, different sources, ranging from stuff like Wikipedia to question-and-answer sites. There is a whole lot of academic text because we are interested in scientific use and applications. Most language models nowadays are trained on a mix of code and regular natural language.


