

“AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work” by Rohin Shah, Seb Farquhar, Anca Dragan
Aug 21, 2024
Join Rohin Shah, a key member of Google DeepMind's AGI safety and alignment team, alongside Seb Farquhar, an existential risk expert, and Anca Dragan, a safety researcher. They dive into the evolving strategies for ensuring AI alignment and safety. Topics include innovative techniques for interpreting neural models, the challenges of scalable oversight, and the ethical implications of AI development. The trio also discusses future plans to address alignment risks, emphasizing the importance of collaboration and the role of mentorship in advancing AGI safety.
AI Snips
Frontier Safety Framework Application
- The Frontier Safety Framework (FSF) applies responsible capability scaling to many model deployments across Google, not just chatbots.
- This approach facilitates stakeholder engagement, policy implementation, and mitigation planning tailored to diverse products.
Run Dangerous Capability Evaluations
- Regularly run and transparently report dangerous capability evaluations to understand the risks posed by advanced models; a minimal sketch of such an evaluation loop follows this list.
- Openly share evaluation norms to set safety and transparency standards across organizations.
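As a rough illustration of the kind of evaluation loop described above, here is a minimal, hypothetical harness: it runs a model over a set of capability tasks, grades each output, and aggregates a pass rate for transparent reporting. The function names, task format, and grader are illustrative assumptions, not Google DeepMind's actual evaluation suite.

```python
# Hypothetical sketch of a dangerous-capability evaluation loop.
# The task format, grader, and report structure are illustrative stand-ins,
# not Google DeepMind's actual evaluation framework.
import json
from typing import Callable, Dict, List

def run_dangerous_capability_eval(
    model: Callable[[str], str],
    tasks: List[Dict[str, str]],
    grader: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Run each task through the model, grade the output, and build a report."""
    results = []
    for task in tasks:
        output = model(task["prompt"])
        passed = grader(output, task["success_criteria"])
        results.append({"task_id": task["id"], "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / max(len(results), 1)
    # Transparent reporting would publish per-task results and the aggregate rate.
    return {"pass_rate": pass_rate, "results": results}

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    toy_tasks = [{"id": "t1", "prompt": "toy prompt", "success_criteria": "toy criterion"}]
    report = run_dangerous_capability_eval(
        model=lambda prompt: "model output",
        tasks=toy_tasks,
        grader=lambda output, criteria: False,  # toy grader: never passes
    )
    print(json.dumps(report, indent=2))
```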
Advances in Mechanistic Interpretability
- Sparse autoencoders (SAEs) decompose large language model activations into sparse features, improving interpretability without sacrificing feature quality.
- New architectures such as gated SAEs improve the trade-off between reconstruction loss and sparsity, and scale to billion-parameter models (see the sketch below).
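To make the gated-SAE idea concrete, here is a minimal numpy sketch of a forward pass and loss: a binary gate decides which features are active while a separate magnitude path decides how strongly they fire. The dimensions, initialisation, and loss weighting are assumptions for illustration only, and the auxiliary loss used in training is omitted; this is not DeepMind's exact implementation.

```python
# Minimal sketch of a gated sparse autoencoder (SAE) forward pass.
# Shapes, initialisation, and the loss below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512          # activation width, number of SAE features

# Parameters (randomly initialised here just so the sketch runs).
W_gate = rng.normal(scale=0.02, size=(d_sae, d_model))
r_mag  = np.zeros(d_sae)          # magnitude path reuses W_gate, rescaled per feature
b_gate = np.zeros(d_sae)
b_mag  = np.zeros(d_sae)
W_dec  = rng.normal(scale=0.02, size=(d_model, d_sae))
b_dec  = np.zeros(d_model)

def gated_sae(x):
    """Encode activations x (batch, d_model) into sparse features and reconstruct."""
    centred = x - b_dec
    pi_gate = centred @ W_gate.T + b_gate                             # which features fire
    pi_mag  = centred @ (np.exp(r_mag)[:, None] * W_gate).T + b_mag   # how strongly
    f = (pi_gate > 0) * np.maximum(pi_mag, 0.0)   # gated, non-negative feature activations
    x_hat = f @ W_dec.T + b_dec                   # reconstruction of the input activations
    return f, x_hat, pi_gate

def loss(x, lam=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the gate pre-activations."""
    f, x_hat, pi_gate = gated_sae(x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.maximum(pi_gate, 0.0), axis=-1))
    return recon + lam * sparsity

x = rng.normal(size=(8, d_model))   # stand-in for residual-stream activations
print(loss(x))
```

The sparsity penalty is applied to the gate path rather than to the final feature magnitudes, which is the mechanism gated SAEs use to avoid shrinking active features while still keeping most features off.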