
RLHF 201 - with Nathan Lambert of AI2 and Interconnects


NOTE

Rejection sampling puts best-of-n sampling in a feedback loop: sample several completions per prompt, use a reward model to rank them, keep the best few answers, and then apply instruction tuning on that dataset. Llama 2 started its RLHF process with rejection sampling to get a signal out of the preference data, which fed a reward model used for ranking. This method is easier to implement than PPO and works with the ordinary auto-regressive loss, making it practical for RL at scale. Offline RL is also a relevant approach for RLHF: the model doesn't have to generate new data, but instead looks at existing data and backpropagates through the reward model.
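The loop described above can be sketched in a few lines. This is a minimal illustration, not Llama 2's actual pipeline: `generate` and `reward` are hypothetical stand-ins for sampling from the policy and scoring with a learned reward model.

```python
# Minimal sketch of rejection sampling for RLHF.
# `generate` and `reward` are toy stand-ins (assumptions, not a real
# policy or reward model) so the loop structure is runnable end to end.

def generate(prompt, n):
    # Stand-in for sampling n completions from the current policy.
    return [f"{prompt} -> completion {i}" for i in range(n)]

def reward(prompt, completion):
    # Stand-in for a learned reward model; deterministic toy score.
    return sum(ord(ch) for ch in completion)

def rejection_sample(prompts, n=8, keep=2):
    """Best-of-n in a loop: sample n completions per prompt, rank them
    with the reward model, and keep the top `keep` pairs as a dataset
    for instruction tuning with the ordinary auto-regressive loss."""
    dataset = []
    for p in prompts:
        candidates = generate(p, n)
        ranked = sorted(candidates, key=lambda c: reward(p, c), reverse=True)
        dataset.extend((p, c) for c in ranked[:keep])
    return dataset

pairs = rejection_sample(["Explain RLHF briefly."], n=4, keep=2)
print(len(pairs))  # 2 kept (prompt, completion) pairs for the one prompt
```

The resulting `(prompt, completion)` pairs would then be fed to a standard supervised fine-tuning step, which is what makes this approach simpler to implement than PPO.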

Play episode from 55:28
