

Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Mar 26, 2024
Guest Collin Burns discusses weak-to-strong generalization in AI alignment: fine-tuning a strong pretrained model on labels produced by a weaker model and measuring how much of the strong model's capability is recovered. Techniques like an auxiliary confidence loss markedly improve weak-to-strong generalization, suggesting it may be feasible to align superhuman models using only human-level supervision.
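As a rough illustration of the auxiliary confidence loss mentioned above, here is a minimal PyTorch sketch, assuming a standard classification setup. The function name `aux_confidence_loss` and the fixed `alpha` are illustrative assumptions (in practice the mixing weight is typically warmed up over training), not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def aux_confidence_loss(strong_logits, weak_labels, alpha=0.5):
    """Mix cross-entropy against weak labels with a term that
    reinforces the strong model's own (hardened) predictions.

    strong_logits: (batch, num_classes) logits from the strong student.
    weak_labels:   (batch,) hard labels produced by the weak supervisor.
    alpha:         mixing weight (illustratively fixed here).
    """
    # Standard loss against the (possibly incorrect) weak labels.
    weak_ce = F.cross_entropy(strong_logits, weak_labels)

    # Harden the strong model's own predictions and treat them as
    # targets, encouraging confident, self-consistent predictions
    # rather than pure imitation of the weak supervisor.
    hardened = strong_logits.argmax(dim=-1)
    self_ce = F.cross_entropy(strong_logits, hardened)

    return (1.0 - alpha) * weak_ce + alpha * self_ce

# Example: 4-way classification with a batch of 8.
logits = torch.randn(8, 4, requires_grad=True)
weak = torch.randint(0, 4, (8,))
loss = aux_confidence_loss(logits, weak, alpha=0.3)
loss.backward()
```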
Chapters
Introduction
00:00 • 1min
Exploring Weak-to-Strong Generalization in AI Alignment
01:29 • 15min
Training Reward Models for Assistant Model Optimization
16:15 • 2min
Enhancing Weak-to-Strong Generalization through Bootstrapping
17:58 • 5min
Enhancing Generalization with Auxiliary Confidence Loss in NLP Tasks
23:00 • 3min
Understanding Weak-to-Strong Generalization in Superhuman Models
25:31 • 10min