The Alignment Problem From a Deep Learning Perspective

5 snips

May 13, 2023

Guest

Lawrence Chan

Guests Richard Ngo, Lawrence Chan, and Sören Mindermann discuss the dangers of artificial general intelligence pursuing undesirable goals. They explore topics such as reward hacking, situational awareness in policies, internally represented goals in deep learning models, the inner alignment problem, deceptive alignment in AI systems, and the risks of AGIs gaining power. They highlight the need for preventative measures to ensure human control over AGI.

Ask episode

Chapters

Transcript

Episode notes

Introduction

00:00 • 2min

Reward Hacking and Situational Awareness in Policies

02:16 • 9min

Internally Represented Goals in Deep Learning Models

10:55 • 7min

The Inner Alignment Problem and Misaligned Goals

17:43 • 10min

Deceptive Alignment and Its Implications in AI Systems

28:02 • 2min

Risks of AGIs Gaining Power and Illustrative Threat Models

30:19 • 3min