AI Safety Fundamentals: Alignment

Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Mar 26, 2024
Guest Collin Burns discusses weak-to-strong generalization in AI alignment: fine-tuning strong models on labels generated by weaker models to test whether the strong model can recover capabilities beyond those of its weak supervisor. Techniques such as an auxiliary confidence loss show promise in improving weak-to-strong generalization, suggesting progress toward aligning superhuman models with human supervision.
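
To make the approach concrete, here is a minimal PyTorch sketch of the auxiliary confidence loss idea discussed in the episode: the strong student is trained on a mixture of the weak supervisor's soft labels and its own hardened predictions. The function name, the argmax hardening, and the fixed mixing weight `alpha` are illustrative simplifications; the paper uses a thresholded hardening and ramps the weight up over training.

```python
import torch
import torch.nn.functional as F

def aux_confidence_loss(strong_logits, weak_probs, alpha=0.75):
    """Sketch of an auxiliary-confidence objective for weak-to-strong training.

    strong_logits: (batch, n_classes) logits from the strong student model.
    weak_probs:    (batch, n_classes) soft labels from the weak supervisor.
    alpha:         weight on the student's own hardened predictions
                   (illustrative fixed value; the paper warms it up over training).
    """
    log_probs = F.log_softmax(strong_logits, dim=-1)

    # Cross-entropy against the weak supervisor's soft labels.
    ce_weak = -(weak_probs * log_probs).sum(dim=-1)

    # Cross-entropy against the student's own hardened (argmax) predictions,
    # which lets a confident student override weak labels it "believes" are wrong.
    hardened = strong_logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(strong_logits, hardened, reduction="none")

    # Mixture of the two terms, averaged over the batch.
    return ((1 - alpha) * ce_weak + alpha * ce_self).mean()
```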