Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

AI Safety Fundamentals: Alignment

Training Reward Models for Assistant Model Optimization

This chapter explores training a reward model on human-assistant dialogues to predict preferences between completions, then using reinforcement learning to optimize the assistant model against it. It also covers fine-tuning strong models on weak labels and studying how well they generalize across tasks, with positive results.
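As a concrete illustration (not code from the episode or the underlying paper), here is a minimal PyTorch sketch of the pairwise preference loss commonly used to train such a reward model: given a preferred and a rejected completion for the same dialogue, the model is trained to score the preferred one higher. The RewardModel class, hidden size, and pooled-representation inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a pooled dialogue representation to a scalar reward.
    The hidden size is an illustrative assumption."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, pooled_repr: torch.Tensor) -> torch.Tensor:
        # pooled_repr: (batch, hidden_size) -> reward: (batch,)
        return self.scorer(pooled_repr).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): small when the preferred
    # completion is ranked above the rejected one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Usage with random stand-in features for a batch of preference pairs.
model = RewardModel()
chosen_repr = torch.randn(8, 768)    # representations of preferred completions
rejected_repr = torch.randn(8, 768)  # representations of rejected completions
loss = preference_loss(model(chosen_repr), model(rejected_repr))
loss.backward()
```

The trained reward model can then serve as the optimization target for reinforcement learning on the assistant model, as described in the chapter.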
