
Interviewing Louis Castricato of Synth Labs and Eleuther AI on RLHF, Gemini Drama, DPO, founding Carper AI, preference data, reward models, and everything in between

Interconnects


Navigating Reinforcement Learning: From PPO to DPO

This chapter explores the evolution of Reinforcement Learning from Human Feedback (RLHF), focusing on the shift from Proximal Policy Optimization (PPO) to Direct Preference Optimization (DPO). It covers the intricacies of preference data, the methodological differences between the two approaches, and the effect of small test sets on measured model performance. The discussion also touches on ethical concerns, benchmarking complexities, and safety measures in AI training and deployment.
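For context on the PPO-to-DPO shift named above: DPO (Rafailov et al., 2023) optimizes the policy directly on preference pairs, where y_w is the preferred and y_l the rejected completion for prompt x, replacing PPO's separately trained reward model with an implicit reward given by the log-probability ratio against a frozen reference policy. A sketch of the standard published loss, with β controlling the strength of the KL-style regularization toward the reference model and σ the logistic function:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$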

