How EleutherAI Trains and Releases LLMs: Interview with Stella Biderman

Gradient Dissent: Conversations on AI

The Challenges of Training Open Source Language Models

OpenAI has never released any of their training data, so we had to go out and collect our own. We built this dataset called the Pile, and kind of the rough heuristic for this was like, what were the kinds of things we wanted a language model to know? Let's go find data that contains that information and put it in it. It's an amalgamation of data from like 20, 23, I want to say, different sources, ranging from stuff like Wikipedia to question-and-answer sites. There is a whole lot of academic text because we are interested in scientific use in applications. Most language models nowadays are trained on a mix of code and regular natural language.
