The Constraints of Shard Theory on Utility Optimization
From the shard theory point of view, I've got all these shards, which are somehow simple but can combine in really complicated ways, right? A naive person might have said, "Oh hey, here's what humans are doing: they're optimizing the amount of cash they have in a utility-optimal way," and shard theory rules out that kind of story. Even if you could derive a utility function, it would be extremely complicated, especially if defined over all possible future configurations of the universe or trajectories of the universe's evolution.
What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com
Topics we discuss, and timestamps:
- 0:00:42 - Why understand human value formation?
- 0:19:59 - Why not design methods to align to arbitrary values?
- 0:27:22 - Postulates about human brains
- 0:36:20 - Sufficiency of the postulates
- 0:44:55 - Reinforcement learning as conditional sampling
- 0:48:05 - Compatibility with genetically-influenced behaviour
- 1:03:06 - Why deep learning is basically what the brain does
- 1:25:17 - Shard theory
- 1:38:49 - Shard theory vs expected utility optimizers
- 1:54:45 - What shard theory says about human values
- 2:05:47 - Does shard theory mean we're doomed?
- 2:18:54 - Will nice behaviour generalize?
- 2:33:48 - Does alignment generalize farther than capabilities?
- 2:42:03 - Are we at the end of machine learning history?
- 2:53:09 - Shard theory predictions
- 2:59:47 - The shard theory research community
- 3:13:45 - Why do shard theorists not work on replicating human childhoods?
- 3:25:53 - Following shardy research
The transcript: axrp.net/episode/2023/06/15/episode-22-shard-theory-quintin-pope.html
Shard theorist links:
- Quintin's LessWrong profile: lesswrong.com/users/quintin-pope
- Alex Turner's LessWrong profile: lesswrong.com/users/turntrout
- Shard theory Discord: discord.gg/AqYkK7wqAG
- EleutherAI Discord: discord.gg/eleutherai
Research we discuss:
- The Shard Theory Sequence: lesswrong.com/s/nyEFg3AuJpdAozmoX
- Pretraining Language Models with Human Preferences: arxiv.org/abs/2302.08582
- Inner alignment in salt-starved rats: lesswrong.com/posts/wcNEXDHowiWkRxDNv/inner-alignment-in-salt-starved-rats
- Intro to Brain-like AGI Safety Sequence: lesswrong.com/s/HzcM2dkCq7fwXBej8
- Brains and transformers:
  - The neural architecture of language: Integrative modeling converges on predictive processing: pnas.org/doi/10.1073/pnas.2105646118
  - Brains and algorithms partially converge in natural language processing: nature.com/articles/s42003-022-03036-1
  - Evidence of a predictive coding hierarchy in the human brain listening to speech: nature.com/articles/s41562-022-01516-2
- Singular learning theory explainer: Neural networks generalize because of this one weird trick: lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick
- Singular learning theory links: metauni.org/slt/
- Implicit Regularization via Neural Feature Alignment, aka circles in the parameter-function map: arxiv.org/abs/2008.00938
- The shard theory of human values: lesswrong.com/s/nyEFg3AuJpdAozmoX/p/iCfdcxiyr2Kj8m8mT
- Predicting inductive biases of pre-trained networks: openreview.net/forum?id=mNtmhaDkAr
- Understanding and controlling a maze-solving policy network, aka the cheese vector: lesswrong.com/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network
- Quintin's Research agenda: Supervising AIs improving AIs: lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
- Steering GPT-2-XL by adding an activation vector: lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Links for the addendum on mesa-optimization skepticism:
- Quintin's response to Yudkowsky arguing against AIs being steerable by gradient descent: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_argues_against_AIs_being_steerable_by_gradient_descent_
- Quintin on why evolution is not like AI training: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Edit__Why_evolution_is_not_like_AI_training
- Evolution provides no evidence for the sharp left turn: lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn
- Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets: arxiv.org/abs/1905.10854