

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jun 16, 2024 • 30min
AF - Degeneracies are sticky for SGD by Guillaume Corlouer
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Degeneracies are sticky for SGD, published by Guillaume Corlouer on June 16, 2024 on The AI Alignment Forum.
Introduction
Singular learning theory (SLT) is a theory of learning dynamics in Bayesian statistical models. It has been argued that SLT could provide insights into the training dynamics of deep neural networks. However, a theory of deep learning inspired by SLT is still lacking. In particular, it seems important to have a better understanding of the relevance of SLT insights to stochastic gradient descent (SGD) - the paradigmatic deep learning optimization algorithm.
We explore how the degeneracies[1] of toy, low dimensional loss landscapes affect the dynamics of stochastic gradient descent (SGD).[2] We also investigate the hypothesis that the set of parameters selected by SGD after a large number of gradient steps on a degenerate landscape is distributed like the Bayesian posterior at low temperature (i.e., in the large sample limit). We do so by running SGD on 1D and 2D loss landscapes with minima of varying degrees of degeneracy.
While researchers experienced with SLT are aware of differences between SGD and Bayesian inference, we want to understand the influence of degeneracies on SGD with more precision and have specific examples where SGD dynamics and Bayesian inference can differ.
Main takeaways
Degeneracies influence SGD dynamics in two ways: (1) convergence to a critical point is slower the more degenerate the critical point is; (2) on a (partially) degenerate manifold, SGD preferentially escapes along non-degenerate directions. If all directions are degenerate, then we empirically observe that SGD is "stuck" (a toy simulation sketch follows these takeaways).
To explain our observations, we show that, for our models, the SGD noise covariance is proportional to the Hessian in the neighborhood of a critical point of the loss.
Thus the SGD noise covariance goes to zero faster along more degenerate directions, to leading order in the neighborhood of a critical point.
Qualitatively, we observe that the concentration of the end-of-training distribution of parameters sampled from a set of SGD trajectories sometimes differs from the Bayesian posterior predicted by SLT, because of:
Hyperparameters such as the learning rate
The number of orthogonal degenerate directions
The degree of degeneracy in the neighborhood of a minimum
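As a rough, self-contained illustration of these takeaways (a minimal sketch with an arbitrary toy model, not the experimental setup from the post), one can run minibatch SGD on a two-parameter model whose theoretical loss near the origin behaves like $\frac{1}{2}(w_1^2 + w_2^4)$, so that $w_1$ is a non-degenerate direction and $w_2$ a degenerate one:

```python
# Minimal sketch (toy model, illustrative hyperparameters): minibatch SGD on
# f(w, x) = w1*x1 + (w2^2)*x2 with zero targets, whose theoretical loss near
# w* = (0, 0) is ~ (1/2)(w1^2 + w2^4): w1 is non-degenerate, w2 is degenerate.
import numpy as np

rng = np.random.default_rng(0)

def model(w, x):
    return w[0] * x[:, 0] + (w[1] ** 2) * x[:, 1]

def batch_grad(w, x):
    # Gradient of the minibatch loss (1/2n) * sum_i f(w, x_i)^2 (targets are 0).
    r = model(w, x)
    return np.array([np.mean(r * x[:, 0]),
                     np.mean(r * 2 * w[1] * x[:, 1])])

w = np.array([1.0, 1.0])
lr, batch_size = 0.05, 16
for step in range(1, 20001):
    x = rng.normal(size=(batch_size, 2))
    w -= lr * batch_grad(w, x)
    if step % 5000 == 0:
        print(f"step {step:6d}  |w1| = {abs(w[0]):.2e}  |w2| = {abs(w[1]):.2e}")
# Typically |w1| collapses quickly while |w2| (the degenerate direction) decays
# much more slowly, i.e. SGD looks "stuck" along the degenerate direction.
```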
Terminology and notation
We advise the reader to skip this section and come back to it if notation or terminology is confusing.
Consider a sequence of n input-output pairs $(x_i, y_i)_{1 \le i \le n}$. We can think of $x_i$ as input data to a deep learning model (e.g., a picture, or a token) and $y_i$ as an output the model is trying to learn (e.g., whether the picture represents a cat or a dog, or what the next token is). A deep learning model may be represented as a function $y = f(w, x)$, where $w \in \Omega$ is a point in a parameter space $\Omega$.
The one-sample loss function, denoted $l_i(w) := \frac{1}{2}(y_i - f(w, x_i))^2$ ($1 \le i \le n$), is a measure of how good the model parametrized by $w$ is at predicting the output $y_i$ on input $x_i$. The empirical loss over $n$ samples is denoted $l_n(w) := \frac{1}{n}\sum_{i=1}^{n} l_i(w)$. Writing $q(x, y)$ for the probability density function of input-output pairs, the theoretical loss (or the potential) is $l(w) = \mathbb{E}_q[l_n(w)]$.[4] The loss landscape is the manifold associated with the theoretical loss function $w \mapsto l(w)$.
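To make these definitions concrete, here is a minimal sketch (the model and data distribution are made-up illustrations, not taken from the post) of the one-sample loss, the empirical loss, and a Monte Carlo estimate of the theoretical loss:

```python
# Sketch of the loss definitions above for a toy 1D model (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def f(w, x):
    # A toy parametrized model y = f(w, x).
    return w[0] * x + w[1] * x ** 2

def one_sample_loss(w, x_i, y_i):
    # l_i(w) := (1/2) * (y_i - f(w, x_i))^2
    return 0.5 * (y_i - f(w, x_i)) ** 2

def empirical_loss(w, xs, ys):
    # l_n(w) := (1/n) * sum_{i=1}^{n} l_i(w)
    return np.mean(one_sample_loss(w, xs, ys))

# Monte Carlo estimate of the theoretical loss l(w) = E_q[l_n(w)], where q is
# chosen here (arbitrarily) as "x ~ N(0, 1), y = x with no noise".
xs = rng.normal(size=10_000)
ys = xs
print("estimated theoretical loss:", empirical_loss(np.array([0.9, 0.1]), xs, ys))
```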
A point $w^*$ is a critical point if the gradient of the theoretical loss is 0 at $w^*$, i.e. $\nabla l(w^*) = 0$. A critical point $w^*$ is degenerate if the Hessian of the loss $H(w^*) := \nabla^2 l(w^*)$ has at least one 0 eigenvalue at $w^*$. An eigenvector $u$ of $H$ with zero eigenvalue is a degenerate direction.
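These definitions are easy to check numerically. Below is a small sketch (with an illustrative toy potential, not one from the post) that estimates the Hessian at a critical point by finite differences and reads off the degenerate direction from its zero eigenvalue:

```python
# Sketch: detect degenerate directions of a toy potential at a critical point.
import numpy as np

def loss(w):
    # Toy theoretical loss with critical point at (0, 0): quadratic in w1
    # (non-degenerate direction), quartic in w2 (degenerate direction).
    return w[0] ** 2 + w[1] ** 4

def hessian(fn, w, eps=1e-4):
    # Central finite-difference estimate of the Hessian of fn at w.
    d = len(w)
    H = np.zeros((d, d))
    I = np.eye(d) * eps
    for i in range(d):
        for j in range(d):
            H[i, j] = (fn(w + I[i] + I[j]) - fn(w + I[i] - I[j])
                       - fn(w - I[i] + I[j]) + fn(w - I[i] - I[j])) / (4 * eps ** 2)
    return H

eigvals, eigvecs = np.linalg.eigh(hessian(loss, np.zeros(2)))
print("Hessian eigenvalues at the critical point:", eigvals)
# The (near-)zero eigenvalue's eigenvector, here the w2 axis, is a degenerate direction.
```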
The local learning coefficient $\lambda(w^*)$ measures the greatest amount of degeneracy of a model around a critical point $w^*$. For the purpose of this work, if locally $l(w = (w_1, w_2)) \propto (w_1 - w_1^*)^{2k_1}(w_2 - w_2^*)^{2k_2}$ then the local learning coefficient is given by $\lambda(w^*) = \min\left(\frac{1}{k_1}, \frac{1}{k_2}\right)$. We say that a critical point...
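As a worked instance of the local form quoted above (the exponents are an illustrative choice, not an example from the post): take $k_1 = 1$ and $k_2 = 2$, i.e. $l(w) \propto (w_1 - w_1^*)^{2}(w_2 - w_2^*)^{4}$ near the minimum. The quoted formula then gives

```latex
\lambda(w^*) = \min\left(\frac{1}{k_1}, \frac{1}{k_2}\right)
             = \min\left(1, \frac{1}{2}\right) = \frac{1}{2}
```

so the more degenerate $w_2$-direction is the one that determines the local learning coefficient.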

Jun 16, 2024 • 26min
EA - Kaya Guides Pilot Results by RachelAbbott
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Kaya Guides Pilot Results, published by RachelAbbott on June 16, 2024 on The Effective Altruism Forum.
Summary
Who We Are: Kaya Guides runs a self-help course on WhatsApp to reduce depression at scale in low and middle-income countries. We help young adults with moderate to severe depression. Kaya currently operates in India. We are the world's first nonprofit implementer of Step-by-Step, the World Health Organization's digital guided self-help program, which was proven effective in two RCTs.
Pilot: We ran a pilot with 103 participants in India to assess the feasibility of implementing our program on WhatsApp with our target demographic and to generate early indicators of its effectiveness.
Results: 72% of program completers experienced depression reduction of 50% or greater. 36% were depression-free. 92% moved down at least a classification in severity (i.e. they shifted from severe to moderately severe, moderately severe to moderate, etc). The average reduction in score was 10 points on the 27-point PHQ-9 depression questionnaire.
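For readers unfamiliar with the PHQ-9, here is a small sketch of the standard severity bands referred to above (the cut-offs are the conventional published thresholds; the example participant is hypothetical, not pilot data):

```python
# Standard PHQ-9 severity bands (conventional cut-offs; total scores run 0-27).
def phq9_severity(score):
    if score <= 4:
        return "minimal"
    if score <= 9:
        return "mild"
    if score <= 14:
        return "moderate"
    if score <= 19:
        return "moderately severe"
    return "severe"

# Hypothetical participant: a 10-point drop (the pilot's average reduction)
# can move someone down two severity classifications.
before, after = 22, 12
print(phq9_severity(before), "->", phq9_severity(after))  # severe -> moderate
```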
Context: To offer a few points of comparison, two studies of therapy-driven programs found that 46% and 57.5% of participants experienced reductions of 50% or more, compared to our result of 72%. For the original version of Step-by-Step, it was 37.1%. There was an average PHQ-9 reduction of 6 points compared to our result of 10 points.
Effect Size: Our effect size is estimated at 0.54, compared to an effect size of 0.48 for the original version of Step-by-Step. This is likely to be an upper bound.
Cost-Effectiveness: We estimate that the pilot was 7x as cost-effective as direct cash transfers at increasing subjective well-being. This accounts for the initial effect, not the duration of effects. The cost per participant was $96.27. We project that next year we will be 20x as cost-effective as direct cash transfers. These numbers don't reflect our full impact, as we may be saving lives. Four participants said overtly that the program reduced their suicidal thinking.
Impacts: Participants reported that the program had profound impacts on their lives, ranging from improved well-being to regaining control over their lives and advancing in their education and careers.
Recruitment: We were highly successful at recruiting our target population. 97% of people who completed the baseline depression questionnaire scored as having depression. 82% scored moderate to severe. Many of our participants came from lower-income backgrounds even though we did not explicitly seek out this group. Participants held professions such as domestic worker, construction and factory worker, and small shopkeeper. 17% overtly brought up financial issues during their guide calls.
Retention: 27% of participants completed all the program content, compared to 32% in the WHO's most recent RCT. In the context of digitally-delivered mental health interventions, which are notorious for having extremely low engagement, this is a strong result. Guide call retention was higher: 36% of participants attended at least four guide calls.
Participant Feedback: 96% of program completers said they were likely or very likely to recommend the program. Participant feedback on guide calls was overwhelmingly positive and their commentary gave the sense that guide calls directly drive participant engagement. Negative feedback focused on wanting more interaction with guides. Feedback on the videos was mixed. For the chatbot, it was neutral.
Feedback on the exercises was generally positive, although there were signs of a lack of engagement with some of them. The stress reduction exercises were heavily favored.
Support Kaya Guides: To support us, donate here or contact Rachel Abbott at rachel@kayaguides.com. Per our most recent assessment, it costs us $96 to provide mental healthcare t...

Jun 16, 2024 • 37sec
EA - Mr Beast is now officially an EA! by Dave Cortright
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mr Beast is now officially an EA!, published by Dave Cortright on June 16, 2024 on The Effective Altruism Forum.
The latest video from YouTube's biggest creator has him teaming up with GiveDirectly to give $300k in cash to everyone in a remote Ugandan village.
This is going to get tens of millions of views. Please watch and like to boost the signal!
Giving $300,000 to Rural Villagers
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jun 16, 2024 • 16min
AF - Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller by Henry Cai
Henry Cai, author of a paper on self-controlling LLM behaviors, discusses using suffix gradients to modify model behaviors effectively. Topics range from exploring dinosaur noises, resisting petting a cat, and reasoning exercises to improving self-control by compressing suffix gradients into a prefix controller for LLMs, emphasizing representation engineering and gradient control.

Jun 16, 2024 • 32min
EA - On the Dwarkesh/Chollet Podcast, and the cruxes of scaling to AGI by JWS
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the Dwarkesh/Chollet Podcast, and the cruxes of scaling to AGI, published by JWS on June 16, 2024 on The Effective Altruism Forum.
Overview
Recently Dwarkesh Patel released an interview with François Chollet (hereafter Dwarkesh and François). I thought this was one of Dwarkesh's best recent podcasts, and one of the best discussions that the AI Community has had recently. Instead of subtweeting those with opposing opinions or vagueposting, we actually got two people with disagreements on the key issue of scaling and AGI having a good faith and productive discussion.[1]
I want to explicitly give Dwarkesh a shout-out for having such a productive discussion (even if I disagree with him on the object level) and having someone on who challenges his beliefs and preconceptions. Often when I think of different AI factions getting angry at each other, and the quality of AI risk discourse plummeting, I'm reminded of Scott's phrase "I reject the argument that Purely Logical Debate has been tried and found wanting. Like GK Chesterton, I think it has been found difficult and left untried." More of this kind of thing please, everyone involved.
I took notes as I listened to the podcast, and went through it again to make sure I got the key claims right.
I grouped them into similar themes, as Dwarkesh and François often went down a rabbit-hole to pursue an interesting point or crux and later returned to the main topic.[2] I hope this can help readers navigate to their points of interest, or make the discussion clearer, though I'd definitely recommend listening/watching for yourself! (It is long though, so feel free to jump around the doc rather than slog through it one go!)
Full disclosure, I am sceptical of a lot of the case for short AGI timelines these days, and thus also sceptical of claims that x-risk from AI is an overwhelmingly important thing to be doing in the entire history of humanity.
This of course comes across in my summarisation and takeaways, but I think acknowledging that openly is better than leaving it to be inferred, and I hope this post can be another addition in helping improve the state of AI discussion both in and outside of EA/AI-Safety circles. It is also important to state explicitly here that I might very well be wrong! Please take my perspective as just that, one perspective among many, and do not defer to me (or to anyone really).
Come to your own conclusions on these issues.[3]
The Podcast
All timestamps are for the YouTube video, not the podcast recording. I've tried to cover the podcast by the main things as they appeared chronologically, and then tracking them through the transcript. I include links to some external resources, passing thoughts in footnotes, and more full thoughts in block-quotes.
Introducing the ARC Challenge
The podcast starts with an introduction of the ARC Challenge itself, and Dwarkesh is happy that François has put out a line in the sand as an LLM sceptic instead of moving the goalposts [0:02:27]. François notes that LLMs struggle on ARC, in part because its challenges are novel and meant not to be found on the internet; instead, the approaches that perform better are based on 'Discrete Program Search' [0:02:04].
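For readers unfamiliar with the term, here is a very rough sketch of what a 'Discrete Program Search' approach can look like (a toy example on made-up 2x2 grids with a tiny set of primitives; this is not ARC's actual task format nor any lab's method): enumerate short compositions of primitive grid operations and keep one that is consistent with all demonstration pairs.

```python
# Toy "discrete program search": enumerate compositions of primitive grid ops
# and return the first program consistent with every demonstration pair.
# The primitives, grids, and rule below are made up for illustration.
from itertools import product
import numpy as np

PRIMITIVES = {
    "rot90": lambda g: np.rot90(g),
    "flip_h": lambda g: np.fliplr(g),
    "flip_v": lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def run(program, grid):
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(pairs, max_depth=3):
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(np.array_equal(run(program, x), y) for x, y in pairs):
                return program
    return None

# Two demonstration pairs whose hidden rule is "rotate 90 degrees".
pairs = [
    (np.array([[1, 2], [3, 4]]), np.rot90(np.array([[1, 2], [3, 4]]))),
    (np.array([[0, 5], [6, 0]]), np.rot90(np.array([[0, 5], [6, 0]]))),
]
print(search(pairs))  # ('rot90',)
```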
He later notes that ARC puzzles are not complex and require very little knowledge to solve [0:25:45].
Dwarkesh agrees that the problems are simple and thinks it's an "intriguing fact" that ARC problems are simple for humans, but LLMs are bad at them, and he hasn't been convinced by the explanations he's got from LLM proponents/scaling maximalists about why that is [0:11:57]. Towards the end François mentions in passing that big labs tried ARC but didn't share their results because they're bad [1:08:28].[4]
One of ARC's main selling points is that humans are clearly meant to do well at this, even children, [0...

Jun 16, 2024 • 15min
LW - CIV: a story by Richard Ngo
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: CIV: a story, published by Richard Ngo on June 16, 2024 on LessWrong.
The room was cozy despite its size, with wood-lined walls reflecting the dim lighting. At one end, a stone fireplace housed a roaring fire; in the middle stood a huge oak table. The woman seated at the head of it rapped her gavel. "I hereby call to order the first meeting of the Parliamentary Subcommittee on Intergalactic Colonization. We'll start with brief opening statements, for which each representative will be allocated one minute, including - "
"Oh, enough with the pomp, Victoria. It's just the four of us." The representative for the Liberal Democrats waved his hand around the nearly-empty room.
Victoria sniffed. "It's important, Stuart. This is a decision that will have astronomical implications. And it's recorded, besides, so we should do things by the book. Carla, you're up first."
The woman at the end of the table stood with a smile. "Thank you, Victoria. I'm speaking on behalf of the Labour party, and I want to start by reminding you all of our place in history. We stand here in a world that has been shaped by centuries of colonialism. Now we're considering another wave of colonization, this one far vaster in scale. We need to - "
"Is this just a linguistic argument?" the fourth person at the table drawled. "We can call it something different if that would make you feel better. Say, universe settlement."
"Like the settlements in Palestine?"
"Oh, come on, Carla."
"No, Milton, this is a crucial point. We're talking about the biggest power grab the world has ever seen. You think Leopold II was bad when he was in charge of the Congo? Imagine what people will do if you give each of them total power over a whole solar system! Even libertarians like you have to admit it would be a catastrophe. If there's any possibility that we export oppression from earth across the entire universe, we should burn the rockets and stay home instead."
"Okay, thank you Carla," Victoria cut in. "That's time. Stuart, you're up next."
Stuart stood. "Speaking on behalf of the Liberal Democrats, I have to admit this is a tricky one. The only feasible way to send humans out to other galaxies is as uploaded minds, but many of our usual principles break for them. I want civilization to be democratic, but what does 'one person one vote' even mean when people can copy and paste themselves? I want human rights for all, but what do human rights even mean when you can just engineer minds who don't want those rights?"
"So as much as I hate the idea of segregating civilization, I think it's necessary. Biological humans should get as much territory as we will ever use. But realistically, given the lightspeed constraint, we're never going to actually want to leave the Milky Way. Then the rest of the Virgo Supercluster should be reserved for human uploads. Beyond that, anything else we can reach we should fill with as much happiness and flourishing as possible, no matter how alien it seems to us.
"After all, as our esteemed predecessor John Stuart Mill once said…" He frowned, and paused for a second. "...as he said, the sole objective of government should be the greatest good for the greatest number." Stuart sat, looking a little disquieted.
"Thank you, Stuart. I'll make my opening statement next." Victoria stood and leaned forward, sweeping her eyes across the others. "I'm here representing the Conservatives. It's tempting to think that we can design a good society with just the right social engineering, just the right nudges. But the one thing we conservatives know for sure is: it won't work. Whatever clever plan you come up with, it won't be stable.
Given the chance, people will push towards novelty and experimentation and self-modification, and the whole species will end up drifting towards something alien and inhuman.
"Hard ru...

Jun 15, 2024 • 28min
EA - Moral Misdirection (full post) by Richard Y Chappell
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Moral Misdirection (full post), published by Richard Y Chappell on June 15, 2024 on The Effective Altruism Forum.
I previously included a link to this as part of my trilogy on anti-philanthropic misdirection, but a commenter asked me to post the full text here for the automated audio conversion. This forum post combines my two substack posts on 'Moral Misdirection' and on 'Anti-Philanthropic Misdirection'. Apologies to anyone who has already read them.
Moral Misdirection
One can lie - or at least misdirect - by telling only truths.
Suppose Don shares news of every violent crime committed by immigrants (while ignoring those committed by native-born citizens, and never sharing evidence of immigrants positively contributing to society). He spreads the false impression that immigrants are dangerous and do more harm than good.
Since this isn't true, and promulgates harmful xenophobic sentiments, I expect most academics in my social circles would judge Don very negatively, as both (i) morally bad, and (ii) intellectually dishonest.
It would not be a convincing defense for Don to say, "But everything I said is literally true!" What matters is that he led his audience to believe much more important falsehoods.[1]
I think broadly similar epistemic vices (not always deliberate) are much more common than is generally appreciated. Identifying them requires judgment calls about which truths are most important. These judgment calls are contestable. But I think they're worth making.
(Others can always let us know if they think our diagnoses are wrong, which could help to refocus debate on the real crux of the disagreement.) People don't generally think enough about moral prioritization, so encouraging more importance-based criticism could provide helpful correctives against common carelessness and misfocus.
Moral misdirection thus strikes me as an important and illuminating concept.[2] In this post, I'll first take an initial stab at clarifying the idea, and then suggest a few examples. (Feel free to add more in the comments!)
Defining Moral Misdirection
Moral misdirection involves leading people morally astray, specifically by manipulating their attention. So explicitly asserting a sincerely believed falsehood doesn't qualify. But misdirection needn't be entirely deliberate, either. Misdirection could be subconscious (perhaps a result of motivated reasoning, or implicit biases), or even entirely inadvertent - merely negligent, say. In fact, deliberately implicating something known to be false won't necessarily count as "misdirection".
Innocent examples include simplification, or pedagogical "lies-to-children". If a simplification helps one's audience to better understand what's important, there's nothing dishonest about that - even if it predictably results in some technically false beliefs.
Taking all that into account, here's my first stab at a conceptual analysis:
Moral misdirection, as it interests me here, is a speech act that functionally operates to distract one's audience from more important moral truths. It thus predictably reduces the importance-weighted accuracy of the audience's moral beliefs.
Explanation: Someone who is sincerely, wholeheartedly in error may have the objective effect of leading their audiences astray, but their assertions don't functionally operate towards that end, merely in virtue of happening to be false.[3] Their good-faith erroneous assertions may rather truly aim to improve the importance-weighted accuracy of their audience's beliefs, and simply fail. Mistakes happen.
At the other extreme, sometimes people deliberately mislead (about important matters) while technically avoiding any explicit assertion of falsehoods. These bad-faith actors maintain a kind of "plausible deniability" - a sheen of superficial intellectual respectability - while deli...

Jun 15, 2024 • 5min
LW - MIRI's June 2024 Newsletter by Harlan
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: MIRI's June 2024 Newsletter, published by Harlan on June 15, 2024 on LessWrong.
MIRI updates
MIRI Communications Manager Gretta Duleba explains MIRI's current communications strategy. We hope to clearly communicate to policymakers and the general public why there's an urgent need to shut down frontier AI development, and make the case for installing an "off-switch". This will not be easy, and there is a lot of work to be done. Some projects we're currently exploring include a new website, a book, and an online reference resource.
Rob Bensinger argues, contra Leopold Aschenbrenner, that the US government should not race to develop artificial superintelligence. "If anyone builds it, everyone dies." Instead, Rob outlines a proposal for the US to spearhead an international alliance to halt progress toward the technology.
At the end of June, the Agent Foundations team, including Scott Garrabrant and others, will be parting ways with MIRI to continue their work as independent researchers. The team was originally set up and "sponsored" by Nate Soares and Eliezer Yudkowsky. However, as AI capabilities have progressed rapidly in recent years, Nate and Eliezer have become increasingly pessimistic about this type of work yielding significant results within the relevant timeframes. Consequently, they have shifted their focus to other priorities.
Senior MIRI leadership explored various alternatives, including reorienting the Agent Foundations team's focus and transitioning them to an independent group under MIRI fiscal sponsorship with restricted funding, similar to AI Impacts. Ultimately, however, we decided that parting ways made the most sense.
The Agent Foundations team has produced some stellar work over the years, and made a true attempt to tackle one of the most crucial challenges humanity faces today. We are deeply grateful for their many years of service and collaboration at MIRI, and we wish them the very best in their future endeavors.
The Technical Governance Team responded to NIST's request for comments on draft documents related to the AI Risk Management Framework. The team also sent comments in response to the "Framework for Mitigating AI Risks" put forward by U.S. Senators Mitt Romney (R-UT), Jack Reed (D-RI), Jerry Moran (R-KS), and Angus King (I-ME).
Brittany Ferrero has joined MIRI's operations team. Previously, she worked on projects such as the Embassy Network and Open Lunar Foundation. We're excited to have her help to execute on our mission.
News and links
AI alignment researcher Paul Christiano was appointed as head of AI safety at the US AI Safety Institute. Last fall, Christiano published some of his thoughts about AI regulation as well as responsible scaling policies.
The Superalignment team at OpenAI has been disbanded following the departure of its co-leaders Ilya Sutskever and Jan Leike. The team was launched last year to try to solve the AI alignment problem in four years. However, Leike says that the team struggled to get the compute it needed and that "safety culture and processes have taken a backseat to shiny products" at OpenAI. This seems extremely concerning from the perspective of evaluating OpenAI's seriousness when it comes to safety and robustness work, particularly given that a similar OpenAI exodus occurred in 2020 in the wake of concerns about OpenAI's commitment to solving the alignment problem.
Vox's Kelsey Piper reports that employees who left OpenAI were subject to an extremely restrictive NDA indefinitely preventing them from criticizing the company (or admitting that they were under an NDA), under threat of losing their vested equity in the company. OpenAI executives have since contacted former employees to say that they will not enforce the NDAs. Rob Bensinger comments on these developments here, strongly criticizing OpenAI for...

Jun 15, 2024 • 16min
LW - Rational Animations' intro to mechanistic interpretability by Writer
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Rational Animations' intro to mechanistic interpretability, published by Writer on June 15, 2024 on LessWrong.
In our new video, we talk about research on interpreting InceptionV1, a convolutional neural network. Researchers have been able to understand the function of neurons and channels inside the network and uncover visual processing algorithms by looking at the weights. The work on InceptionV1 is early but landmark mechanistic interpretability research, and it functions well as an introduction to the field.
We also go into the rationale and goals of the field and mention some more recent research near the end. Our main source material is the circuits thread in the Distill journal and this article on feature visualization. The author of the script is Arthur Frost. I have included the script below, although I recommend watching the video since the script has been written with accompanying moving visuals in mind.
Intro
In 2018, researchers trained an AI to find out if people were at risk of heart conditions based on pictures of their eyes, and somehow the AI also learned to tell people's biological sex with incredibly high accuracy. How? We're not entirely sure.
The crazy thing about Deep Learning is that you can give an AI a set of inputs and outputs, and it will slowly work out for itself what the relationship between them is. We didn't teach AIs how to play chess, Go, and Atari games by showing them human experts - we taught them how to work it out for themselves. And the issue is, now they have worked it out for themselves, and we don't know what it is they worked out.
Current state-of-the-art AIs are huge. Meta's largest LLaMA 2 model uses 70 billion parameters spread across 80 layers, all doing different things. It's deep learning models like these which are being used for everything from hiring decisions to healthcare and criminal justice to which YouTube videos get recommended. Many experts believe that these models might even one day pose existential risks.
So as these automated processes become more widespread and significant, it will really matter that we understand how these models make choices.
The good news is, we've got a bit of experience uncovering the mysteries of the universe. We know that humans are made up of trillions of cells, and by investigating those individual cells we've made huge advances in medicine and genetics. And learning the properties of the atoms which make up objects has allowed us to develop modern material science and high-precision technology like computers.
If you want to understand a complex system with billions of moving parts, sometimes you have to zoom in.
That's exactly what Chris Olah and his team did starting in 2015. They focused on small groups of neurons inside image models, and they were able to find distinct parts responsible for detecting everything from curves and circles to dog heads and cars.
In this video we'll
Briefly explain how (convolutional) neural networks work
Visualise what individual neurons are doing
Look at how neurons - the most basic building blocks of the neural network - combine into 'circuits' to perform tasks
Explore why interpreting networks is so hard
There will also be lots of pictures of dogs, like this one. Let's get going.
We'll start with a brief explanation of how convolutional neural networks are built.
Here's a network that's trained to label images.
An input image comes in on the left, and it flows along through the layers until we get an output on the right - the model's attempt to classify the image into one of the categories. This particular model is called InceptionV1, and the images it's learned to classify are from a massive collection called ImageNet. ImageNet has 1000 different categories of image, like "sandal" and "saxophone" and "sarong" (which, if you don't know, is a k...
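For readers who want to poke at this themselves, here is a minimal sketch of the feature-visualization idea mentioned above: start from noise and use gradient ascent on the input so that one channel of InceptionV1 (GoogLeNet in torchvision) activates strongly. The layer and channel picked here are arbitrary illustrations, not the ones studied in the circuits work, and the sketch assumes a recent torch/torchvision install:

```python
# Minimal feature-visualization sketch for InceptionV1 (torchvision GoogLeNet).
import torch
import torchvision.models as models

model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)

activations = {}
def hook(_module, _inputs, output):
    activations["target"] = output

# Hook one of GoogLeNet's named inception blocks (choice is illustrative).
model.inception4a.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([img], lr=0.05)
channel = 7  # which channel of the block's output to excite (arbitrary)

for step in range(200):
    optimizer.zero_grad()
    model(img)
    loss = -activations["target"][0, channel].mean()  # maximize the activation
    loss.backward()
    optimizer.step()

print("final mean channel activation:", -loss.item())
# Rendering `img` (after clamping/normalizing) gives a rough picture of what
# this channel responds to, in the spirit of the feature-visualization article.
```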

Jun 14, 2024 • 27min
LW - Shard Theory - is it true for humans? by Rishika
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shard Theory - is it true for humans?, published by Rishika on June 14, 2024 on LessWrong.
And is it a good model for value learning in AI?
TLDR
Shard theory proposes a view of value formation where experiences lead to the creation of context-based 'shards' that determine behaviour. Here, we go over psychological and neuroscientific views of learning, and find that while shard theory's emphasis on context bears similarity to types of learning such as conditioning, it does not address top-down influences that may decrease the locality of value-learning in the brain.
What's Shard Theory (and why do we care)?
In 2022, Quintin Pope and Alex Turner posted 'The shard theory of human values', where they described their view of how experiences shape the value we place on things. They give an example of a baby who enjoys drinking juice, and eventually learns that grabbing at the juice pouch, moving around to find the juice pouch, and modelling where the juice pouch might be, are all helpful steps in order to get to its reward.
'Human values', they say, 'are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics…' And since, like humans, AI is often trained with reinforcement learning, the same might apply to AI.
The original post is long (over 7,000 words) and dense, but Lawrence Chan helpfully posted a condensation of the topic in 'Shard Theory in Nine Theses: a Distillation and Critical Appraisal'. In it, he presents nine (as might be expected) main points of shard theory, ending with the last thesis: 'shard theory as a model of human values'. 'I'm personally not super well versed in neuroscience or psychology', he says, 'so I can't personally attest to [its] solidity…I'd be interested in hearing from experts in these fields on this topic.' And that's exactly what we're here to do.
A Crash Course on Human Learning
Types of learning
What is learning? A baby comes into the world and is inundated with sensory information of all kinds. From then on, it must process this information, take whatever's useful, and store it somehow for future use.
There are various places in the brain where this information is stored, and for various purposes. Looking at these various types of storage, or memory, can help us understand what's going on:
3 types of memory
We often group memory types by the length of time we hold on to them - 'working memory' (while you do some task), 'short-term memory' (maybe a few days, unless you revise or are reminded), and 'long-term memory' (effectively forever). Let's take a closer look at long-term memory:
Types of long-term memory
We can broadly split long-term memory into 'declarative' and 'nondeclarative'. Declarative memory is stuff you can talk about (or 'declare'): what the capital of your country is, what you ate for lunch yesterday, what made you read this essay. Nondeclarative covers the rest: a grab-bag of memory types including knowing how to ride a bike, getting habituated to a scent you've been smelling all day, and being motivated to do things you were previously rewarded for (like drinking sweet juice).
For most of this essay, we'll be focusing on the last type: conditioning.
Types of conditioning
Conditioning
Sometime in the 1890s, a physiologist named Ivan Pavlov was researching salivation using dogs. He would feed the dogs with powdered meat, and insert a tube into the cheek of each dog to measure their saliva. As expected, the dogs salivated when the food was in front of them. Unexpectedly, the dogs also salivated when they heard the footsteps of his assistant (who brought them their food).
Fascinated by this, Pavlov started to play a metronome whenever he gave the dogs their food. After a while, sure enough, the dogs would salivate whenever the metronome played, even if ...
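A standard way to formalize the kind of conditioning described here is the Rescorla-Wagner rule (a textbook model, not something from the post): the associative strength of the conditioned stimulus (the metronome) is updated by the prediction error relative to the unconditioned stimulus (the food).

```python
# Rescorla-Wagner sketch of classical conditioning (illustrative parameters).
alpha = 0.2       # learning rate for the conditioned stimulus (metronome)
lambda_us = 1.0   # maximum associative strength supported by the food
V = 0.0           # current associative strength of the metronome

for trial in range(1, 21):
    V += alpha * (lambda_us - V)   # prediction-error update on each pairing
    if trial % 5 == 0:
        print(f"trial {trial:2d}: associative strength V = {V:.3f}")
# V approaches 1.0: the metronome alone comes to predict food, as with
# Pavlov's dogs salivating at the sound.
```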


