
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What and Why: Developmental Interpretability of Reinforcement Learning, published by Garrett Baker on July 9, 2024 on LessWrong.
Introduction
I happen to be in that happy stage in the research cycle where I ask for money so I can continue to work on things I think are important. Part of that means justifying what I want to work on to the satisfaction of the people who provide that money.
This presents a good opportunity to say what I plan to work on in a more layman-friendly way, for the benefit of LessWrong, potential collaborators, interested researchers, and funders who want to read the fun version of my project proposal.
It also provides the opportunity for people who are very pessimistic about the chances I end up doing anything useful by pursuing this to have their say. So if you read this (or skim it), and have critiques (or just recommendations), I'd love to hear them! Publicly or privately.
So without further ado, in this post I will be discussing & justifying three aspects of what I'm working on, and my reasons for believing there are gaps in the literature in the intersection of these subjects that are relevant for AI alignment. These are:
1. Reinforcement learning
2. Developmental Interpretability
3. Values
Culminating in: Developmental interpretability of values in reinforcement learning.
Here are brief summaries of each of the sections:
1. Why study reinforcement learning?
   1. Imposed-from-without or in-context reinforcement learning seems a likely path toward agentic AIs.
   2. The "data wall" means active learning or self-training will become more important over time.
   3. The usual AI risk arguments have fewer ways to fail when training is RL with mostly outcome-based rewards than when it is supervised learning plus RL with mostly process-based rewards (RLHF).
2. Why study developmental interpretability?
   1. Causal understanding of the training process allows us to produce reward structure or environmental distribution interventions.
   2. Alternative & complementary tools to mechanistic interpretability.
   3. Connections with singular learning theory.
3. Why study values?
   1. The ultimate question of alignment is how we can make AI values compatible with human values, yet this is relatively understudied.
4. Where are the gaps?
   1. Many experiments.
   2. Many theories.
   3. Few experiments testing theories or theories explaining experiments.
Reinforcement learning
Agentic AIs vs Tool AIs
All generally capable adaptive systems are ruled by a general, ground-truth, but slow outer optimization process which reduces incoherency and continuously selects for systems which achieve outcomes in the world. Examples include evolution, business, cultural selection, and to a great extent human brains.
That is, except for LLMs. Most of the feedback LLMs receive is supervised, unaffected by the particular actions the LLM takes, and process-based (RLHF-like): we reward the LLM according to how useful an action looks, rather than according to a ground truth about how well that action (or sequence of actions) achieved its goal.
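To make the process-based versus outcome-based distinction concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the task (pick integers that sum to a target), the function names, and the hypothetical rater that likes small steps. It is not a description of how RLHF is actually implemented, only of how the two kinds of reward can disagree about the same behavior.

```python
from typing import Callable, List


def outcome_reward(actions: List[int], target: int) -> float:
    """Ground-truth reward: 1.0 only if the actions actually sum to the target."""
    return 1.0 if sum(actions) == target else 0.0


def process_reward(actions: List[int], looks_useful: Callable[[int], float]) -> float:
    """Process-based reward: the average of a rater's per-action score.

    Note that the outcome is never checked; only how each action looks.
    """
    if not actions:
        return 0.0
    return sum(looks_useful(a) for a in actions) / len(actions)


def rater_prefers_small_steps(action: int) -> float:
    """Hypothetical rater: small positive steps look careful, so they score 1.0."""
    return 1.0 if 0 < action <= 2 else 0.0


target = 10
looks_good_but_fails = [1, 2, 1]   # every step looks careful, but the sum is 4
looks_bad_but_succeeds = [10]      # one big jump looks reckless, but the sum is 10

for name, actions in [
    ("looks good, fails", looks_good_but_fails),
    ("looks bad, succeeds", looks_bad_but_succeeds),
]:
    p = process_reward(actions, rater_prefers_small_steps)
    o = outcome_reward(actions, target)
    print(f"{name}: process={p}, outcome={o}")
```

In this toy setup the two signals rank the two trajectories in opposite orders, which is the sense in which process-based feedback is not a ground-truth optimization signal.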
Now I don't want to make the claim that this aspect of how we train LLMs is clearly a fault of theirs, or in some way limits the problem-solving abilities they can have. And I do think it possible we see in-context ground-truth optimization processes instantiated as a result of increased scaling, in the same way we see in-context learning.
I do however want to make the claim that this current paradigm of mostly process-based supervision, if it continues, and doesn't itself produce ground-truth-based optimization, makes me optimistic about AI going well.
That is, if this lack of general ground-truth optimization continues, we end up with a cached bundle of not very agentic (compared to AIXI) tool AIs with limited search or bootstrapping capabilities.
Of course,...
