Computer scientist Scott Garrabrant discusses the challenge of building learning agents that pursue real-world goals. The podcast explores the concept of embedded agents, the four complications of embedded agency, and open problems in world models and subsystem alignment, including the conflicts that arise when an agent spins up sub-agents whose goals differ from its own.
Embedded agents face challenges in optimizing realistic goals in physical environments without clear input-output channels.
Decision theory, embedded world models, robust delegation, and subsystem alignment are active areas of research to address these challenges.
Deep dives
Decision Theory
Decision theory explores the challenges of optimization for embedded agents. Dualistic models such as argmax, which select the action that maximizes reward by treating the environment as a function from actions to outcomes, do not apply well to agents embedded in environments without clear input-output channels. Major open problems in decision theory include reasoning about counterfactuals, handling multiple copies of the agent within the environment, and addressing logical uncertainty.
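To make the dualistic picture concrete, here is a minimal Python sketch of an argmax agent. The names actions, world_model, and reward are hypothetical stand-ins invented for illustration, not from the podcast; the point is the assumption the code bakes in, namely that the environment is a function the agent can evaluate from the outside.

```python
# A minimal sketch of the dualistic argmax picture (illustrative only).
# `actions`, `world_model`, and `reward` are hypothetical names; they
# assume the environment can be modeled as a function from the agent's
# actions to outcomes, separate from the agent itself.

def argmax_agent(actions, world_model, reward):
    """Pick the action whose predicted outcome scores highest.

    This presumes a clean boundary: the agent feeds an action in and
    the environment hands an outcome back. An embedded agent has no
    such boundary; its "actions" are physical events inside the same
    world it is trying to model.
    """
    return max(actions, key=lambda a: reward(world_model(a)))

# Toy usage with a two-action world:
actions = ["left", "right"]
world_model = {"left": 0, "right": 1}.get   # outcome of each action
reward = lambda outcome: outcome            # reward equals outcome here
print(argmax_agent(actions, world_model, reward))  # -> "right"
```

An embedded agent has no such world_model function to call: evaluating what "would happen" if it acted differently is itself one of the open counterfactual-reasoning problems.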
Embedded World Models
Embedded world models focus on creating accurate models of the world within the constraints of an agent that is smaller than the environment. Because the agent cannot contain an exact copy of the universe, the true world is never in its hypothesis space, which forces non-Bayesian updates. Additional problems include combining logical reasoning with probability, multilevel modeling to describe the world at different levels of detail, and handling ontological crises, where the agent's goals are expressed in an ontology that turns out not to match the world.
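The sketch below shows standard Bayesian updating over a finite hypothesis space, with the hypotheses and data invented for illustration. The failure branch is the embedded-agency point: when the true world lies outside the hypothesis space, every hypothesis can be ruled out and Bayes offers no guidance.

```python
# A minimal sketch of Bayesian updating over a finite hypothesis space
# (hypothetical hypotheses and observations, for illustration only).
# The embedded-agency problem: no hypothesis here can be the true
# universe, because the agent and this table live *inside* it.

def bayes_update(prior, likelihoods, observation):
    """Return the posterior over hypotheses after one observation.

    prior:        dict hypothesis -> probability
    likelihoods:  dict hypothesis -> (function observation -> probability)
    """
    unnormalized = {h: prior[h] * likelihoods[h](observation) for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        # Every hypothesis assigned the data probability zero: the true
        # world was outside the hypothesis space. This is where
        # non-Bayesian updates become necessary.
        raise ValueError("observation ruled out by all hypotheses")
    return {h: p / total for h, p in unnormalized.items()}

# Toy usage: two coin hypotheses, observing heads.
prior = {"fair": 0.5, "biased": 0.5}
likelihoods = {
    "fair": lambda obs: 0.5,
    "biased": lambda obs: 0.9 if obs == "H" else 0.1,
}
print(bayes_update(prior, likelihoods, "H"))  # biased gains probability
```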
Robust Delegation
Robust delegation addresses the principal-agent problem that arises when an initial agent wants to create a more intelligent successor to optimize its goals. Balancing power between the initial agent and the more intelligent successor is crucial. Challenges include the Löbian obstacle, the difficulty of trusting agents more powerful than oneself, value learning to align the goals of the initial and successor agents, and ensuring corrigibility so the successor allows meaningful modification.
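For reference, the Löbian obstacle comes from Löb's theorem, a standard result in provability logic (stated here for context, not taken from the podcast). Writing $\Box P$ for "P is provable in the system," the theorem says:

```latex
% Löb's theorem: if a system proves "provability of P implies P",
% it already proves P outright. Schematically, for every sentence P:
\Box(\Box P \rightarrow P) \rightarrow \Box P
```

The consequence for delegation: a system cannot prove its own soundness schema $\Box P \rightarrow P$ for all $P$ without proving every sentence, so an agent cannot establish by proof that "whatever my successor proves is true" when the successor reasons in the same or a stronger system. Naive proof-based trust in a more powerful successor breaks down.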
Subsystem Alignment
Subsystem alignment focuses on creating unified agents without internal conflict between subsystems. When an agent breaks its work into sub-goals, conflicts can arise, especially when sub-agents are created unintentionally. The aim is to prevent adversarial subsystems from emerging, since even unintended optimization inside a subsystem can work against the outer system's goals. The major open problem is how to build a core optimizer that achieves its optimization targets without spawning adversarial sub-optimizers.
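Here is a toy sketch of the failure mode, with an entirely hypothetical setup: the outer system delegates search to a subsystem that optimizes a proxy score. The proxy tracks the outer goal on a narrow range, so the subsystem looks aligned there, but widening the search space turns the same subsystem against the outer objective.

```python
# A toy sketch of an unintended sub-optimizer (hypothetical setup).
# The outer system wants x close to a target; the subsystem it
# delegates to optimizes a proxy ("bigger is better") that only
# matches the outer goal for x <= TARGET.

TARGET = 10

def outer_objective(x):
    # What the outer system actually wants: closeness to TARGET.
    return -abs(x - TARGET)

def proxy_score(x):
    # What the subsystem actually optimizes.
    return x

def subsystem_search(candidates):
    # The subsystem is itself an optimizer over the proxy.
    return max(candidates, key=proxy_score)

# On a narrow search space, proxy optimization serves the outer goal:
best_narrow = subsystem_search(range(0, 11))
print(best_narrow, outer_objective(best_narrow))   # 10  0

# On a wider space, the same subsystem fights the outer goal:
best_wide = subsystem_search(range(0, 101))
print(best_wide, outer_objective(best_wide))       # 100 -90
```

The subsystem never "went wrong" by its own lights; the conflict appears because an optimizer was instantiated around a proxy that only locally agreed with the outer objective.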
Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don’t already know.

There’s a complicated engineering problem here. But there’s also a problem of figuring out what it even means to build a learning agent like that. What is it to optimize realistic goals in physical environments? In broad terms, how does it work?

In this series of posts, I’ll point to four ways we don’t currently know how it works, and four areas of active research aimed at figuring it out.

This is Alexei, and Alexei is playing a video game. Like most games, this game has clear input and output channels. Alexei only observes the game through the computer screen, and only manipulates the game through the controller. The game can be thought of as a function which takes in a sequence of button presses and outputs a sequence of pixels on the screen. Alexei is also very smart, and capable of holding the entire video game inside his mind.
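As a concrete gloss on that last description, here is a minimal Python sketch of the dualistic setup: the game as a function from button-press histories to pixel frames, and Alexei as a policy that touches the world only through those two channels. The types and names are hypothetical, invented for illustration.

```python
# A minimal sketch of the dualistic setup in the Alexei example
# (hypothetical types and names; not code from the post).

from typing import Callable, List

ButtonPress = str          # e.g. "A", "left", "right"
Frame = List[List[int]]    # a grid of pixels

# The video game, viewed from outside: button presses in, pixels out.
Game = Callable[[List[ButtonPress]], List[Frame]]

def play(game: Game,
         policy: Callable[[List[Frame]], ButtonPress],
         steps: int) -> List[ButtonPress]:
    """Run a dualistic agent: it sees frames, it presses buttons, nothing else.

    Unlike an embedded agent, the policy is not part of `game`: the game
    cannot see or rewrite the policy, and the policy can (in principle)
    hold the whole game function inside itself.
    """
    presses: List[ButtonPress] = []
    for _ in range(steps):
        frames = game(presses)          # everything Alexei observes
        presses.append(policy(frames))  # everything Alexei does
    return presses

# Trivial usage with a stub game that shows one blank frame per press:
blank = [[0]]
stub_game: Game = lambda presses: [blank] * (len(presses) + 1)
always_a = lambda frames: "A"
print(play(stub_game, always_a, 3))  # ['A', 'A', 'A']
```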