MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML

Latent Space: The AI Engineer Podcast

Technical Challenges in Pre-Processing Language Dataset Assessment

The central question is which mix of datasets to use. Considerations include the choice of data sources, how much repetition is acceptable (quality versus quantity), and what counts as good-quality data. The belief that including code, or spending more training time on good sources like Wikipedia, improves models lacks evidence. Different data mixes yield different results, and the C4 dataset performs exceptionally well despite its problematic pre-processing. Evaluating models on generation tasks is challenging because it is unclear what to measure, so making reasonable choices based on the available evaluations becomes crucial.
