
The Nonlinear Library

LW - We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap by johnswentworth

Sep 19, 2024
07:41
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap, published by johnswentworth on September 19, 2024 on LessWrong.
Background: "Learning" vs "Learning About"
Adaptive systems, reinforcement "learners", etc., "learn" in the sense that their behavior adapts to their environment.
Bayesian reasoners, human scientists, etc., "learn" in the sense that they have some symbolic representation of the environment, and they update those symbols over time to (hopefully) better match the environment (i.e. make the map better match the territory).
These two kinds of "learning" are not synonymous[1]. Adaptive systems "learn" things, but they don't necessarily "learn about" things; they don't necessarily have an internal map of the external territory.
(Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors "learned" by an adaptive system are not necessarily "about" any particular external thing, and don't necessarily represent any particular external thing[2].
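To make the distinction concrete, here is a minimal sketch (my own illustration, not from the post) contrasting the two senses of "learning": a toy reward-adapted learner whose only internal state is a running score per action, versus a Bayesian learner whose internal state is an explicit belief about an external quantity (a coin's bias). The class names and structure are assumptions made for illustration.

```python
class AdaptiveLearner:
    """'Learns' in the adaptation sense: behavior drifts toward rewarded
    actions, but there is no representation of the environment itself."""

    def __init__(self, actions):
        self.scores = {a: 0.0 for a in actions}

    def act(self):
        # Greedily pick the highest-scoring action (exploration omitted for brevity).
        return max(self.scores, key=self.scores.get)

    def update(self, action, reward, lr=0.1):
        # The only internal state is a score per action -- not a map of anything external.
        self.scores[action] += lr * (reward - self.scores[action])


class BayesianLearner:
    """'Learns about' the environment: maintains an explicit belief (a map)
    over a hidden external quantity and updates it with Bayes' rule."""

    def __init__(self):
        self.heads, self.tails = 1, 1  # Beta(1, 1) prior over a coin's bias

    def update(self, observation):
        # The internal state is *about* something external: the coin's bias.
        if observation == "heads":
            self.heads += 1
        else:
            self.tails += 1

    def belief(self):
        # Posterior mean estimate of P(heads).
        return self.heads / (self.heads + self.tails)
```

The first system's scores are not "about" any particular external thing; the second system's belief is a map that can match or fail to match the territory.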
We Humans Learn About Our Values
"I thought I wanted X, but then I tried it and it was pretty meh."
"For a long time I pursued Y, but now I think that was more a social script than my own values."
"As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it's a bit embarrassing in hindsight."
The ubiquity of these sorts of sentiments is the simplest evidence that we do not typically know our own values[3]. Rather, people often (but not always) have some explicit best guess at their own values, and that guess updates over time - i.e. we can learn about our own values.
Note the wording here: we're not just saying that human values are "learned" in the more general sense of reinforcement learning. We're saying that we humans have some internal representation of our own values, a "map" of our values, and we update that map in response to evidence. Look again at the examples at the beginning of this section:
"I thought I wanted X, but then I tried it and it was pretty meh."
"For a long time I pursued Y, but now I think that was more a social script than my own values."
"As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it's a bit embarrassing in hindsight."
Notice that the wording of each example involves beliefs about values. They're not just saying "I used to feel urge X, but now I feel urge Y". They're saying "I thought I wanted X" - a belief about a value! Or "now I think that was more a social script than my own values" - again, a belief about my own values, and how those values relate to my (previous) behavior. Or "I endorsed the view that Z is the highest objective" - an explicit endorsement of a belief about values.
That's how we normally, instinctively reason about our own values. And sure, we could reword everything to avoid talking about our beliefs about values - "learning" is more general than "learning about" - but the fact that it makes sense to us to talk about our beliefs about values is strong evidence that something in our heads in fact works like beliefs about values, not just reinforcement-style "learning".
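As a toy illustration of what "beliefs about values" updating on evidence might look like (a hedged sketch of my own, not a model the post commits to), treat "I value X" as a latent variable and update a belief about it with Bayes' rule when trying X turns out to be meh. All the probabilities below are made-up numbers.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(I value X | evidence) given P(I value X) and the two likelihoods."""
    joint_true = prior * likelihood_if_true
    joint_false = (1 - prior) * likelihood_if_false
    return joint_true / (joint_true + joint_false)

# "I thought I wanted X": a fairly confident prior belief about my own values.
p_value_x = 0.8

# "...but then I tried it and it was pretty meh."
# Feeling meh is unlikely if I genuinely value X, likelier if I don't.
p_value_x = bayes_update(p_value_x, likelihood_if_true=0.2, likelihood_if_false=0.7)

print(f"P(I actually value X) after trying it: {p_value_x:.2f}")  # ~0.53
```

The point is purely structural: what changes here is a belief about values, not just a behavioral urge reinforced up or down.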
Two Puzzles
Puzzle 1: Learning About Our Own Values vs The Is-Ought Gap
Very roughly speaking, an agent could aim to pursue any values regardless of what the world outside it looks like; "how the external world is" does not tell us "how the external world should be". So when we "learn about" values, where does the evidence about values come from? How do we cross the is-ought gap?
Puzzle 2: The Role of Reward/Reinforcement
It does seem like humans have some kind of physiological "reward", in a hand-wavy reinforcement-learning-esque sense, which seems to at l...
