LessWrong (30+ Karma)

“DeepMind: An Approach to Technical AGI Safety and Security” by Zach Stein-Perlman

Apr 6, 2025
34:45

I quote the abstract, 10-page "extended abstract," and table of contents. See link above for the full 100-page paper. See also the blogpost (which is not a good summary) and tweet thread.

I haven't read most of the paper, but I'm happy about both the content and the fact that DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks. But I think safety at DeepMind is bottlenecked more by buy-in from leadership to do moderately costly things than by the safety team having good plans and doing good work.

Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse [...]

---

Outline:

(02:05) Extended Abstract

(04:25) Background assumptions

(08:11) Risk areas

(13:33) Misuse

(14:59) Risk assessment

(16:19) Mitigations

(18:47) Assurance against misuse

(20:41) Misalignment

(22:32) Training an aligned model

(25:13) Defending against a misaligned model

(26:31) Enabling stronger defenses

(29:31) Alignment assurance

(32:21) Limitations

---

First published:
April 5th, 2025

Source:
https://www.lesswrong.com/posts/3ki4mt4BA6eTx56Tc/deepmind-an-approach-to-technical-agi-safety-and-security

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Flowchart diagram showing three stages: Training, Evaluation, and Deployment of security mitigations.
Diagram showing four types of AI risks: misuse, misalignment, mistakes, structural risks.

The image presents 4 panels with simple illustrations explaining different categories of AI risk and their key drivers. Each panel uses icons of humans, AI networks (shown as connected blue circles), and emoji-style indicators to demonstrate different scenarios where AI systems could cause harm. The diagram has an academic or educational style, with clear labels and explanations for each risk category.
Technical diagram showing AI model training and inference alignment mitigation approaches.
