Akshita Bhagia discusses the OLMo language model and its fully open approach. The OLMo umbrella includes projects like Dolma and Paloma. The importance of open training datasets and data curation filters is emphasized. The podcast explores dataset contamination, task specificity, and the evolution of training data transparency.
Podcast summary created with Snipd AI
Quick takeaways
OLMo provides transparent pre-training data and tools to foster collaborative research.
Dolma dataset offers three trillion tokens from public data for analyzing model capabilities across domains.
Deep dives
The Motivation Behind the OLMo Project
The OLMo project aims to address the lack of transparency in language model development by providing truly open language models. The initiative was driven by researchers' need for access to the complete details of a model's training data and pre-training setup. By releasing the 1B and 7B versions of the models alongside the full pre-training data, training code, logs, and evaluation tools, OLMo seeks to foster collaborative research and avoid redundant, costly experiments.
The Dolma Dataset and Toolkit
The Dolma dataset, containing around three trillion tokens sourced primarily from accessible public data such as Common Crawl, academic papers, books, and Wikipedia, was unveiled alongside the OLMo models. The dataset allows researchers to explore the relationships between model inputs and outputs, assess model capabilities across various content types, and analyze toxicity levels. In addition, the Dolma toolkit provides curation filters to ensure data quality, along with privacy protections that remove personally identifiable information.
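The PII-removal step described above can be sketched as a simple text filter. This is a hypothetical illustration, not the actual Dolma toolkit API: the regexes, function name, and placeholder tokens are all assumptions made for the example.

```python
import re

# Illustrative PII-masking filter in the spirit of Dolma's curation tools.
# The real toolkit's patterns and replacement tokens may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("|||EMAIL|||", text)
    text = PHONE_RE.sub("|||PHONE|||", text)
    return text

print(mask_pii("Contact jane.doe@example.com or 555-123-4567."))
```

In a real pipeline, a filter like this would run over every document before it enters the training corpus, alongside quality and deduplication filters.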
The Importance of Paloma Benchmark
The Paloma benchmark complements downstream task evaluations by offering fine-grained insights into a model's performance across 600 domains sourced from 18 diverse data repositories. By measuring perplexity on these domains, Paloma serves as a proxy for assessing a model's familiarity with specific content distributions. This benchmark goes beyond traditional evaluation metrics, emphasizing the importance of understanding model behavior in different contexts and domains for researchers and industry practitioners alike.
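The per-domain perplexity measurement Paloma relies on can be sketched in a few lines. The formula (perplexity as the exponential of the negative mean token log-probability) is standard; the domain names and log-probability values below are made-up stand-ins, since real values would come from the model under evaluation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-domain log-probs; lower perplexity suggests the model
# is more familiar with that domain's text distribution.
domains = {
    "wikipedia": [-1.2, -0.8, -1.5, -0.9],
    "legal":     [-2.6, -3.1, -2.4, -2.9],
}
for name, lps in domains.items():
    print(f"{name}: {perplexity(lps):.2f}")
```

Comparing these numbers across many domains is what lets a benchmark like Paloma surface where a model's training data did, or did not, cover a content distribution well.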
Today we’re joined by Akshita Bhagia, a senior research engineer at the Allen Institute for AI. Akshita joins us to discuss OLMo, a new open source language model with 7 billion and 1 billion variants, but with a key difference compared to similar models offered by Meta, Mistral, and others. Namely, the fact that AI2 has also published the dataset and key tools used to train the model. In our chat with Akshita, we dig into the OLMo models and the various projects falling under the OLMo umbrella, including Dolma, an open three-trillion-token corpus for language model pretraining, and Paloma, a benchmark and tooling for evaluating language model performance across a variety of domains.
The complete show notes for this episode can be found at twimlai.com/go/674.