MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML

Latent Space: The AI Engineer Podcast

Technical Challenges in Pre-Processing Language Dataset Assessment

The central question is which mix of datasets to use. Considerations include the choice of data sources, the role of repetition (quality vs. quantity), and what counts as good-quality data in the first place. Common beliefs, such as the idea that adding code or spending more training time on good sources like Wikipedia improves models, lack solid evidence. Different data mixes yield different results, and the C4 dataset performs exceptionally well despite its problematic pre-processing. Evaluating models on generation tasks is hard because it is unclear what to measure, so making reasonable choices based on the evaluations that are available becomes crucial.
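
To make the repetition trade-off concrete, here is a minimal Python sketch of how a data mix translates into the number of passes over each source. It is an illustration only, not the guests' method: the function name, the source names, the token counts, and the weights are all hypothetical.

```python
def epochs_per_source(source_tokens, mix_weights, token_budget):
    """Return how many times each source is repeated during training.

    source_tokens: tokens available in each source (hypothetical sizes)
    mix_weights:   relative sampling weight of each source (any positive scale)
    token_budget:  total number of tokens the training run will consume
    """
    total_weight = sum(mix_weights.values())
    epochs = {}
    for name, size in source_tokens.items():
        # Tokens drawn from this source = its share of the mix times the budget.
        tokens_drawn = mix_weights[name] / total_weight * token_budget
        # Values above 1.0 mean the source is seen more than once.
        epochs[name] = tokens_drawn / size
    return epochs


if __name__ == "__main__":
    # Hypothetical example: a large web crawl and a small curated source.
    sizes = {"web": 1000, "wiki": 100}
    weights = {"web": 0.8, "wiki": 0.2}
    for name, n in epochs_per_source(sizes, weights, token_budget=1000).items():
        print(f"{name}: {n:.2f} epochs")
```

In this toy setup, giving the small curated source 20% of the mix means it is repeated about twice while the large source is not even seen once in full. That is the quality-versus-quantity question the summary refers to.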
