

“AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work” by Rohin Shah, Seb Farquhar, Anca Dragan
Aug 21, 2024
Join Rohin Shah, a key member of Google DeepMind's AGI safety and alignment team, alongside Seb Farquhar, an existential risk expert, and Anca Dragan, a safety researcher. They dive into the evolving strategies for ensuring AI alignment and safety. Topics include innovative techniques for interpreting neural models, the challenges of scalable oversight, and the ethical implications of AI development. The trio also discusses future plans to address alignment risks, emphasizing the importance of collaboration and the role of mentorship in advancing AGI safety.
AI Snips
Frontier Safety Framework Application
- The Frontier Safety Framework (FSF) applies responsible capability scaling to many model deployments across Google, not just chatbots.
- This approach facilitates stakeholder engagement, policy implementation, and mitigation planning tailored to diverse products.
Run Dangerous Capability Evaluations
- Regularly run and transparently report dangerous capability evaluations to understand the risks posed by advanced models; a minimal sketch of such an evaluation loop follows this list.
- Openly share evaluation norms to set safety and transparency standards across organizations.
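As a rough illustration of the kind of evaluation loop described above, here is a minimal, hypothetical harness: it runs a model over a set of capability tasks, grades each output, and aggregates a pass rate for transparent reporting. The function names, task format, and grader are illustrative assumptions, not Google DeepMind's actual evaluation suite.

```python
# Hypothetical sketch of a dangerous-capability evaluation loop.
# The task format, grader, and report structure are illustrative stand-ins,
# not Google DeepMind's actual evaluation framework.
import json
from typing import Callable, Dict, List

def run_dangerous_capability_eval(
    model: Callable[[str], str],
    tasks: List[Dict[str, str]],
    grader: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Run each task through the model, grade the output, and build a report."""
    results = []
    for task in tasks:
        output = model(task["prompt"])
        passed = grader(output, task["success_criteria"])
        results.append({"task_id": task["id"], "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / max(len(results), 1)
    # Transparent reporting would publish per-task results and the aggregate rate.
    return {"pass_rate": pass_rate, "results": results}

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    toy_tasks = [{"id": "t1", "prompt": "toy prompt", "success_criteria": "toy criterion"}]
    report = run_dangerous_capability_eval(
        model=lambda prompt: "model output",
        tasks=toy_tasks,
        grader=lambda output, criteria: False,  # toy grader: never passes
    )
    print(json.dumps(report, indent=2))
```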
Advances in Mechanistic Interpretability
- Sparse autoencoders (SAEs) decompose large language model activations into sparse features, improving interpretability without sacrificing feature quality.
- New architectures such as gated SAEs improve the trade-off between reconstruction loss and sparsity, and scale to billion-parameter models (see the sketch below).
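To make the gated-SAE idea concrete, here is a minimal numpy sketch of a forward pass and loss: a binary gate decides which features are active while a separate magnitude path decides how strongly they fire. The dimensions, initialisation, and loss weighting are assumptions for illustration only, and the auxiliary loss used in training is omitted; this is not DeepMind's exact implementation.

```python
# Minimal sketch of a gated sparse autoencoder (SAE) forward pass.
# Shapes, initialisation, and the loss below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512          # activation width, number of SAE features

# Parameters (randomly initialised here just so the sketch runs).
W_gate = rng.normal(scale=0.02, size=(d_sae, d_model))
r_mag  = np.zeros(d_sae)          # magnitude path reuses W_gate, rescaled per feature
b_gate = np.zeros(d_sae)
b_mag  = np.zeros(d_sae)
W_dec  = rng.normal(scale=0.02, size=(d_model, d_sae))
b_dec  = np.zeros(d_model)

def gated_sae(x):
    """Encode activations x (batch, d_model) into sparse features and reconstruct."""
    centred = x - b_dec
    pi_gate = centred @ W_gate.T + b_gate                             # which features fire
    pi_mag  = centred @ (np.exp(r_mag)[:, None] * W_gate).T + b_mag   # how strongly
    f = (pi_gate > 0) * np.maximum(pi_mag, 0.0)   # gated, non-negative feature activations
    x_hat = f @ W_dec.T + b_dec                   # reconstruction of the input activations
    return f, x_hat, pi_gate

def loss(x, lam=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the gate pre-activations."""
    f, x_hat, pi_gate = gated_sae(x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.maximum(pi_gate, 0.0), axis=-1))
    return recon + lam * sparsity

x = rng.normal(size=(8, d_model))   # stand-in for residual-stream activations
print(loss(x))
```

The sparsity penalty is applied to the gate path rather than to the final feature magnitudes, which is the mechanism gated SAEs use to avoid shrinking active features while still keeping most features off.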