AI Stories

Code Generation & Synthetic Data With Loubna Ben Allal #51

Nov 7, 2024
47:06

Our guest today is Loubna Ben Allal, Machine Learning Engineer at Hugging Face 🤗 .

In our conversation, Loubna first explains how she built two impressive code generation models: StarCoder and StarCoder2. We dig into the importance of data when training large models and what can be done on the data side to improve LLM performance.

We then dive into synthetic data generation and discuss its pros and cons. Loubna explains how she built Cosmopedia, a fully synthetic dataset generated using Mixtral 8x7B.

Loubna also shares career mistakes, advice and her take on the future of developers and code generation. 

If you enjoyed the episode, please leave a 5-star review and subscribe to the AI Stories YouTube channel.

Cosmopedia Dataset: https://huggingface.co/blog/cosmopedia

StarCoder blog post: https://huggingface.co/blog/starcoder

Follow Loubna on LinkedIn: https://www.linkedin.com/in/loubna-ben-allal-238690152/

Follow Neil on LinkedIn: https://www.linkedin.com/in/leiserneil/  

---

(00:00) - Intro

(02:00) - How Loubna Got Into Data & AI

(03:57) - Internship at Hugging Face

(06:21) - Building A Code Generation Model: StarCoder

(12:14) - Data Filtering Techniques for LLMs

(18:44) - Training StarCoder

(21:35) - Will GenAI Replace Developers? 

(25:44) - Synthetic Data Generation & Building Cosmopedia

(35:44) - Evaluating a 1B Params Model Trained on Synthetic Data

(43:43) - Challenges Faced & Career Advice

