
The Nonlinear Library: LessWrong
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Latest episodes

Sep 7, 2024 • 21min
LW - Excerpts from "A Reader's Manifesto" by Arjun Panickssery
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Excerpts from "A Reader's Manifesto", published by Arjun Panickssery on September 7, 2024 on LessWrong.
"A Reader's Manifesto" is a July 2001 Atlantic piece by B.R. Myers that I've returned to many times. He complains about the inaccessible pretension of the highbrow literary fiction of his day. The article is mostly a long list of critiques of various quotes/passages from well-reviewed books by famous authors. It's hard to accuse him of cherry-picking since he only targets passages that reviewers singled out as unusually good.
Some of his complaints are dumb, but the general idea is useful: authors try to be "literary" by (1) avoiding a tightly paced plot that could evoke "genre fiction" and (2) shooting for individual standout sentences that reviewers can praise, using a shotgun approach in which many of the sentences are banal or just don't make sense.
Here are some excerpts of his complaints. Bolding is always mine.
The "Writerly" Style
He complains that critics now dismiss too much good literature as "genre" fiction.
More than half a century ago popular storytellers like Christopher Isherwood and Somerset Maugham were ranked among the finest novelists of their time, and were considered no less literary, in their own way, than Virginia Woolf and James Joyce. Today any accessible, fast-moving story written in unaffected prose is deemed to be "genre fiction" - at best an excellent "read" or a "page turner," but never literature with a capital L.
An author with a track record of blockbusters may find the publication of a new work treated like a pop-culture event, but most "genre" novels are lucky to get an inch in the back pages of The New York Times Book Review.
The dualism of literary versus genre has all but routed the old trinity of highbrow, middlebrow, and lowbrow, which was always invoked tongue-in-cheek anyway. Writers who would once have been called middlebrow are now assigned, depending solely on their degree of verbal affectation, to either the literary or the genre camp.
David Guterson is thus granted Serious Writer status for having buried a murder mystery under sonorous tautologies (Snow Falling on Cedars, 1994), while Stephen King, whose Bag of Bones (1998) is a more intellectual but less pretentious novel, is still considered to be just a very talented genre storyteller.
Further, he complains that fiction is regarded as "literary" the more slow-paced, self-conscious, obscure, and "writerly" its style.
The "literary" writer need not be an intellectual one. Jeering at status-conscious consumers, bandying about words like "ontological" and "nominalism," chanting Red River hokum as if it were from a lost book of the Old Testament: this is what passes for profundity in novels these days. Even the most obvious triteness is acceptable, provided it comes with a postmodern wink.
What is not tolerated is a strong element of action - unless, of course, the idiom is obtrusive enough to keep suspense to a minimum. Conversely, a natural prose style can be pardoned if a novel's pace is slow enough, as was the case with Ha Jin's aptly titled Waiting, which won the National Book Award (1999) and the PEN/Faulkner Award (2000).
If the new dispensation were to revive good "Mandarin" writing - to use the term coined by the British critic Cyril Connolly for the prose of writers like Virginia Woolf and James Joyce - then I would be the last to complain. But what we are getting today is a remarkably crude form of affectation: a prose so repetitive, so elementary in its syntax, and so numbing in its overuse of wordplay that it often demands less concentration than the average "genre" novel.
4 Types of Bad Prose
Then he has five sections complaining about four different types of prose he doesn't like (in addition to the generic "literary" prose): "evocative" prose, "muscular"...

Sep 7, 2024 • 1min
LW - Pay Risk Evaluators in Cash, Not Equity by Adam Scholl
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Pay Risk Evaluators in Cash, Not Equity, published by Adam Scholl on September 7, 2024 on LessWrong.
Personally, I suspect the alignment problem is hard. But even if it turns out to be easy, survival may still require getting at least the absolute basics right; currently, I think we're mostly failing even at that.
Early discussion of AI risk often focused on debating the viability of various elaborate safety schemes humanity might someday devise - designing AI systems to be more like "tools" than "agents," for example, or as purely question-answering oracles locked within some kryptonite-style box. These debates feel a bit quaint now, as AI companies race to release agentic models they barely understand directly onto the internet.
But a far more basic failure, from my perspective, is that at present nearly all AI company staff - including those tasked with deciding whether new models are safe to build and release - are paid substantially in equity, the value of which seems likely to decline if their employers stop building and releasing new models.
As a result, it is currently the case that roughly everyone within these companies charged with sounding the alarm risks personally losing huge sums of money if they do. This extreme conflict of interest could be avoided simply by compensating risk evaluators in cash instead.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Sep 6, 2024 • 7min
LW - Adam Optimizer Causes Privileged Basis in Transformer Language Models by Diego Caples
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Adam Optimizer Causes Privileged Basis in Transformer Language Models, published by Diego Caples on September 6, 2024 on LessWrong.
Diego Caples (diego@activated-ai.com)
Rob Neuhaus (rob@activated-ai.com)
Introduction
In principle, neuron activations in a transformer-based language model's residual stream should all be about the same scale. In practice, however, the dimensions vary unexpectedly widely in scale. Mathematical theories of the transformer architecture do not predict this: they expect rotational equivariance within a model, where no dimension is more important than any other.
Is there something wrong with our reasonably informed intuitions of how transformers work? What explains these outlier channels?
Previously, Anthropic researched the existence of these privileged basis dimensions (dimensions more important / larger than expected) and ruled out several causes. By elimination, they reached the hypothesis that per-channel normalization in the Adam optimizer was the cause of privileged basis. However, they did not prove this was the case.
We conclusively show that Adam causes outlier channels / privileged basis within the transformer residual stream. When the Adam optimizer is replaced with SGD, the trained models do not have a privileged basis.
As a whole, this work improves mechanistic understanding of transformer LM training dynamics and confirms that our mathematical models of transformers are not flawed. Rather, they simply do not take into account the training process.
Our code is open source at the LLM outlier channel exploration GitHub.
Key Results
Training an LM with SGD does not result in a privileged basis, indicating that Adam is the cause of privileged basis in transformer LMs.
Training a 12M parameter model on TinyStories allows us to replicate outlier channel behavior on a small LM, training in less than 15 minutes on an H100.
Background
Recommended Reading
Privileged Bases in the Transformer Residual Stream
Toy Models of Superposition (Privileged Basis Section)
More About Anthropic's Work
We consider Anthropic's research on privileged basis the primary motivator for this work. In Anthropic's Privileged Bases in the Transformer Residual Stream, they demonstrate privileged basis in a 200M parameter LLM and perform some experiments to rule out possible causes, but do not find a definitive cause. They hypothesize that outlier channels are caused by Adam's lack of rotational equivariance, and suggest that training using SGD could isolate Adam as the cause.
Adam vs SGD, and Rotational Equivariance
Consider an experiment where we rotate the parameter space of a neural network, train it, and then invert the rotation. With Stochastic Gradient Descent (SGD), this process yields the same model as if we hadn't rotated at all. However, with the Adam optimizer, we end up with a different model.
This difference can be explained by the presence or absence of a property called rotational equivariance. SGD is rotationally equivariant: optimizer steps are always directly proportional to the gradient of the loss function, regardless of the chosen coordinate system. In contrast, Adam is not rotationally equivariant because it takes steps that are not proportional to the gradient: updates depend on coordinate-wise gradient statistics.
As we later show, this difference is what leads to privileged basis within LMs.
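As a rough illustration of this rotation experiment (a minimal sketch, not the authors' code: the tiny linear model, data, optimizer settings, and step counts below are made-up placeholders), one can train in a rotated coordinate system, rotate the result back, and compare it with training in the original coordinates:

```python
import torch

torch.manual_seed(0)
d = 8
X = torch.randn(64, d)                       # toy regression data
y = torch.randn(64)
Q, _ = torch.linalg.qr(torch.randn(d, d))    # a fixed random rotation of parameter space

def train(opt_name, rotate):
    # Parameterize the weights as w = Q @ v when rotate=True, else w = v.
    v = torch.zeros(d, requires_grad=True)
    opt = (torch.optim.SGD([v], lr=1e-2) if opt_name == "sgd"
           else torch.optim.Adam([v], lr=1e-2))
    for _ in range(200):
        w = Q @ v if rotate else v
        loss = ((X @ w - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Map the trained parameters back to the original coordinates.
    return (Q @ v if rotate else v).detach()

for name in ["sgd", "adam"]:
    gap = (train(name, rotate=False) - train(name, rotate=True)).norm().item()
    print(name, gap)
# Expected: the SGD gap is ~0 (rotation-equivariant, up to float error), while the
# Adam gap is clearly nonzero, because Adam's per-coordinate second-moment
# normalization depends on the chosen basis.
```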
Kurtosis
Motivated by Anthropic, we use excess kurtosis as a metric for measuring basis privilege.
We encourage the reader to read Anthropic's reasoning for why this is a good metric, but here we aim to demonstrate graphically that excess kurtosis is a reasonable choice for measuring basis privilege.
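As a rough illustration of the metric (a sketch under assumed shapes; the array sizes and the outlier channel index are invented for the example, and this is not the authors' code), excess kurtosis can be computed across the channel dimension of residual-stream activations:

```python
import numpy as np
from scipy.stats import kurtosis

def basis_privilege(resid_acts: np.ndarray) -> float:
    """Mean excess kurtosis over channels; resid_acts has shape (n_tokens, d_model).

    Excess kurtosis is near 0 when channels look roughly Gaussian and grows
    large when a few outlier channels dominate (a privileged basis).
    """
    # scipy's kurtosis returns *excess* kurtosis (Fisher definition) by default.
    return kurtosis(resid_acts, axis=-1, fisher=True).mean()

rng = np.random.default_rng(0)
iso = rng.normal(size=(4, 768))      # isotropic activations: no privileged basis
outlier = iso.copy()
outlier[:, 42] *= 30.0               # hypothetical single outlier channel
print(basis_privilege(iso))          # near 0
print(basis_privilege(outlier))      # much larger
```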
We plot the middle-layer residual stream activations for the last token of the string
"Lilly saw a big red apple!"
as an Adam-optimized LM training run progresses....

Sep 5, 2024 • 9min
LW - instruction tuning and autoregressive distribution shift by nostalgebraist
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: instruction tuning and autoregressive distribution shift, published by nostalgebraist on September 5, 2024 on LessWrong.
[Note: this began life as a "Quick Takes" comment, but it got pretty long, so I figured I might as well convert it to a regular post.]
In LM training, every token provides new information about "the world beyond the LM" that can be used/"learned" in-context to better predict future tokens in the same window.
But when text is produced by autoregressive sampling from the same LM, it is not informative in the same way, at least not to the same extent[1]. Thus, sampling inevitably produces a distribution shift.
I think this is one of the reasons why it's (apparently) difficult to get instruction-tuned / HH-tuned models to report their uncertainty and level of competence accurately, rather than being overconfident.
(I doubt this is a novel point, I just haven't seen it spelled out explicitly before, and felt like doing so.)
Imagine that you read the following (as the beginning of some longer document), and you trust that the author is accurately describing themselves:
I'm a Princeton physics professor, with a track record of highly cited and impactful research in the emerging field of Ultra-High-Density Quasiclassical Exotic Pseudoplasmas (UHD-QC-EPPs).
The state of the art in numerical simulation of UHD-QC-EPPs is the so-called Neural Pseudospectral Method (NPsM).
I made up all those buzzwords, but imagine that this is a real field, albeit one you know virtually nothing about. So you've never heard of "NPsM" or any other competing method.
Nonetheless, you can confidently draw some conclusions just from reading this snippet and trusting the author's self-description:
Later in this document, the author will continue to write as though they believe that NPsM is "the gold standard" in this area.
They're not going to suddenly turn around and say something like "wait, whoops, I just checked Wikipedia and it turns out NPsM has been superseded by [some other thing]." They're a leading expert in the field! If that had happened, they'd already know by the time they sat down to write any of this.
Also, apart from this particular writer's beliefs, it's probably actually true that NPsM is the gold standard in this area.
Again, they're an expert in the field -- and this is the sort of claim that would be fairly easy to check even if you're not an expert yourself, just by Googling around and skimming recent papers. It's also not the sort of claim where there's any obvious incentive for deception. It's hard to think of a plausible scenario in which this person writes this sentence, and yet the sentence is false or even controversial.
During training, LLMs are constantly presented with experiences resembling this one.
The LLM is shown texts about topics of which it has incomplete knowledge. It has to predict each token from the preceding ones.
Whatever new information the text conveys about the topic may make it into the LLM's weights, through gradient updates on this example. But even before that happens, the LLM can also use the kind of reasoning shown in the bulleted list above to improve its predictions on the text right now (before any gradient updates).
That is, the LLM can do in-context learning, under the assumption that the text was produced by an entity outside itself -- so that each part of the text (potentially) provides new information about the real world, not yet present in the LLM's weights, that has useful implications for the later parts of the same text.
So, all else being equal, LLMs will learn to apply this kind of reasoning to all text, always, ubiquitously.
But autoregressive sampling produces text that is not informative about "the world outside" in the same way that all the training texts were.
During training, when an LLM sees information it d...

Sep 5, 2024 • 14min
LW - Conflating value alignment and intent alignment is causing confusion by Seth Herd
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conflating value alignment and intent alignment is causing confusion, published by Seth Herd on September 5, 2024 on LessWrong.
Submitted to the Alignment Forum. Contains more technical jargon than usual.
Epistemic status: I think something like this confusion is happening often. I'm not saying these are the only differences in what people mean by "AGI alignment".
Summary:
Value alignment is better but probably harder to achieve than personal intent alignment to the short-term wants of some person(s). Different groups and people tend to primarily address one of these alignment targets when they discuss alignment. Confusion abounds.
One important confusion stems from an assumption that the type of AI defines the alignment target: strong goal-directed AGI must be value aligned or misaligned, while personal intent alignment is only viable for relatively weak AI. I think this assumption is important but false.
While value alignment is categorically better, intent alignment seems easier, safer, and more appealing in the short term, so AGI project leaders are likely to try it.[1]
Overview
Clarifying what people mean by alignment should dispel some illusory disagreement, and clarify alignment theory and predictions of AGI outcomes.
Caption: Venn diagram of three types of alignment targets. Value alignment and Personal intent alignment are both subsets of Evan Hubinger's definition of intent alignment: AGI aligned with human intent in the broadest sense.
Prosaic alignment work usually seems to be addressing a target somewhere in the neighborhood of personal intent alignment (following instructions or doing what this person wants now), while agent foundations and other conceptual alignment work usually seems to be addressing value alignment. Those two clusters have different strengths and weaknesses as alignment targets, so lumping them together produces confusion.
People mean different things when they say alignment. Some are mostly thinking about value alignment (VA): creating sovereign AGI that has values close enough to humans' for our liking. Others are talking about making AGI that is corrigible (in the Christiano or Harms sense)[2] or follows instructions from its designated principal human(s). I'm going to use the term personal intent alignment (PIA) until someone has a better term for that type of alignment target.
Different arguments and intuitions apply to these two alignment goals, so talking about them without differentiation is creating illusory disagreements.
Value alignment is better almost by definition, but personal intent alignment seems to avoid some of the biggest difficulties of value alignment. Max Harms' recent sequence on corrigibility as a singular target (CAST) gives both a nice summary and detailed arguments. Personal intent alignment does not require us to point to or define values, just short-term preferences or instructions.
The principal advantage is that an AGI that follows instructions can be used as a collaborator in improving its alignment over time; you don't need to get it exactly right on the first try. This is more helpful in slower and more continuous takeoffs. This means that PI alignment has a larger basin of attraction than value alignment does.[3]
Most people who think alignment is fairly achievable seem to be thinking of PIA, while critics often respond thinking of value alignment. It would help to be explicit. PIA is probably easier and more likely than full VA for our first stabs at AGI, but there are reasons to wonder if it's adequate for real success. In particular, there are intuitions and arguments that PIA doesn't address the real problem of AGI alignment.
I think PIA does address the real problem, but in a non-obvious and counterintuitive way.
Another unstated divide
There's another important clustering around these two conceptions of alignment. Peop...

Sep 5, 2024 • 20min
LW - The Fragility of Life Hypothesis and the Evolution of Cooperation by KristianRonn
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Fragility of Life Hypothesis and the Evolution of Cooperation, published by KristianRonn on September 5, 2024 on LessWrong.
This is part 2 in a 3-part sequence summarizing my book, The Darwinian Trap (see part 1 here). The book aims to popularize the concept of multipolar traps and establish them as a broader cause area. If you find this series intriguing and want to spread the word and learn more:
1. Share this post with others on X or other social media platforms.
2. Pre-order the book here.
3. Sign up for my mailing list here before September 24 for a 20% chance to win a free hardcover copy of the book (it takes 5 seconds).
4. Contact me at kristian@kristianronn.com if you have any input or ideas.
In Part 1, I introduced the concept of a Darwinian demon - selection pressures that drive agents to harm others for personal gain. I also argued that the game theory of our evolutionary fitness landscape, with its limited resources, often favors defection over cooperation within populations. Yet, when we observe nature, cooperation is ubiquitous: from molecules working together in metabolism, to genes forming genomes, to cells building organisms, and individuals forming societies.
Clearly, cooperation must be evolutionarily adaptive, or we wouldn't see it so extensively in the natural world. I refer to a selection pressure that fosters mutually beneficial cooperation as a "Darwinian angel."
To understand the conditions under which cooperative behavior thrives, we can look at our own body. For an individual cell, the path to survival might seem clear: prioritize self-interest by replicating aggressively, even at the organism's expense. This represents the Darwinian demon - selection pressure favoring individual survival.
However, from the perspective of the whole organism, survival depends on suppressing these self-serving actions. The organism thrives only when its cells cooperate, adhering to a mutually beneficial code. This tension between individual and collective interests forms the core of multi-level selection, where evolutionary pressures act on both individuals and groups.
Interestingly, the collective drive for survival paradoxically requires cells to act altruistically, suppressing their self-interest for the organism's benefit. In this context, Darwinian angels are the forces that make cooperation adaptive, promoting collective well-being over individual defection. These angels are as much a part of evolution as their demonic counterparts, fostering cooperation that benefits the broader environment.
Major Evolutionary Transitions and Cooperation
This struggle, between selection pressures of cooperation and defection, traces back to the dawn of life. In the primordial Earth, a world of darkness, immense pressure, and searing heat, ribonucleic acid (RNA) emerged - a molecule that, like DNA, encodes the genetic instructions essential for life. Without RNA, complex life wouldn't exist. Yet, as soon as RNA formed, it faced a Darwinian challenge known as Spiegelman's Monster.
Shorter RNA strands replicate faster than longer ones, creating a selection pressure favoring minimal RNA molecules with as few as 218 nucleotides - insufficient to encode any useful genetic material. This challenge was likely overcome through molecular collaboration: a lipid membrane provided a sanctuary for more complex RNA, which in turn helped form proteins to stabilize and enhance the membrane.
Throughout evolutionary history, every major transition has occurred because Darwinian angels successfully suppressed Darwinian demons, forming new units of selection and driving significant evolutionary progress. Each evolutionary leap has been a fierce struggle against these demons, with every victory paving the way for the beauty, diversity, and complexity of life we see today.
These triumphs are...

Sep 5, 2024 • 5min
LW - What is SB 1047 *for*? by Raemon
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is SB 1047 *for*?, published by Raemon on September 5, 2024 on LessWrong.
Emmett Shear asked on twitter:
I think SB 1047 has gotten much better from where it started. It no longer appears actively bad. But can someone who is pro-SB 1047 explain the specific chain of causal events where they think this bill becoming law results in an actual safer world? What's the theory?
And I realized that AFAICT no one has concisely written up what the actual story for SB 1047 is supposed to be.
This is my current understanding. Other folk here may have more detailed thoughts or disagreements.
The bill isn't sufficient on its own, but it's not regulation for regulation's sake, because it's specifically a piece of the regulatory machine I'd ultimately want built.
Right now, it mostly solidifies the safety processes that existing orgs have voluntarily committed to. But, we are pretty lucky that they voluntarily committed to them, and we don't have any guarantee that they'll stick with them in the future.
For the bill to succeed, we do need to invent good, third party auditing processes that are not just a bureaucratic sham. This is an important, big scientific problem that isn't solved yet, and it's going to be a big political problem to make sure that the ones that become consensus are good instead of regulatory-captured. But, figuring that out is one of the major goals of the AI safety community right now.
The "Evals Plan" as I understand it comes in two phase:
1. Dangerous Capability Evals. We invent evals that demonstrate a model is capable of dangerous things (including manipulation/scheming/deception-y things, and "invent bioweapons" type things)
As I understand it, this is pretty tractable, although labor intensive and "difficult" in a normal, boring way.
2. Robust Safety Evals. We invent evals that demonstrate that a model capable of scheming is nonetheless safe - either because we've proven what sort of actions it will choose to take (AI Alignment), or because we've proven that we can control it even if it is scheming (AI control). AI control is probably easier at first, although limited.
As I understand it, this is very hard; we're working on it, but it requires new breakthroughs.
The goal with SB 1047, as I understand it, is roughly:
First: Capability Evals trigger
By the time it triggers for the first time, we have a set of evals that are good enough to confirm "okay, this model isn't actually capable of being dangerous" (and the AI developers probably continue unobstructed).
But, when we first hit a model capable of deception, self-propagation or bioweapon development, the eval will trigger "yep, this is dangerous." And then the government will ask "okay, how do you know it's not dangerous?".
And the company will put forth some plan, or internal evaluation procedure, that (probably) sucks. And the Frontier Model Board will say "hey Attorney General, this plan sucks, here's why."
Now, the original version of SB 1047 would include the Attorney General saying "okay yeah your plan doesn't make sense, you don't get to build your model." The newer version of the plan I think basically requires additional political work at this phase.
But, the goal of this phase is to establish "hey, we have dangerous AI, and we don't yet have the ability to reasonably demonstrate we can render it non-dangerous," and to stop development of AI until companies figure out plans that at least make enough sense to government officials.
Second: Advanced Evals are invented, and get woven into law
The way I expect a company to prove their AI is safe, despite having dangerous capabilities, is for third parties to invent a robust version of the second set of evals, and then for new AIs to pass those evals.
This requires scientific and political labor, and the hope is that by the...

Sep 5, 2024 • 8min
LW - Executable philosophy as a failed totalizing meta-worldview by jessicata
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Executable philosophy as a failed totalizing meta-worldview, published by jessicata on September 5, 2024 on LessWrong.
(this is an expanded, edited version of an x.com post)
It is easy to interpret Eliezer Yudkowsky's main goal as creating a friendly AGI. Clearly, he has failed at this goal and has little hope of achieving it. That's not a particularly interesting analysis, however. A priori, creating a machine that makes things ok forever is not a particularly plausible objective. Failure to do so is not particularly informative.
So I'll focus on a different but related project of his: executable philosophy. Quoting Arbital:
Two motivations of "executable philosophy" are as follows:
1. We need a philosophical analysis to be "effective" in Turing's sense: that is, the terms of the analysis must be useful in writing programs. We need ideas that we can compile and run; they must be "executable" like code is executable.
2. We need to produce adequate answers on a time scale of years or decades, not centuries. In the entrepreneurial sense of "good execution", we need a methodology we can execute on in a reasonable timeframe.
There is such a thing as common sense rationality, which says the world is round, you shouldn't play the lottery, etc. Formal notions like Bayesianism, VNM utility theory, and Solomonoff induction formalize something strongly related to this common sense rationality. Yudkowsky believes further study in this tradition can supersede ordinary academic philosophy, which he believes to be conceptually weak and motivated to continue ongoing disputes for more publications.
In the Sequences, Yudkowsky presents these formal ideas as the basis for a totalizing meta-worldview, of epistemic and instrumental rationality, and uses the meta-worldview to argue for his object-level worldview (which includes many-worlds, AGI foom, importance of AI alignment, etc.).
While one can get totalizing (meta-)worldviews from elsewhere (such as interdisciplinary academic studies), Yudkowsky's (meta-)worldview is relatively easy to pick up for analytically strong people (who tend towards STEM), and is effective ("correct" and "winning") relative to its simplicity.
Yudkowsky's source material and his own writing do not form a closed meta-worldview, however. There are open problems as to how to formalize and solve real problems. Many of the more technical sort are described in MIRI's technical agent foundations agenda.
These include questions about how to parse a physically realistic problem as a set of VNM lotteries ("decision theory"), how to use something like Bayesianism to handle uncertainty about mathematics ("logical uncertainty"), how to formalize realistic human values ("value loading"), and so on.
Whether or not the closure of this meta-worldview leads to creation of friendly AGI, it would certainly have practical value. It would allow real world decisions to be made by first formalizing them within a computational framework (related to Yudkowsky's notion of "executable philosophy"), whether or not the computation itself is tractable (with its tractable version being friendly AGI).
The practical strategy of MIRI as a technical research institute is to go meta on these open problems by recruiting analytically strong STEM people (especially mathematicians and computer scientists) to work on them, as part of the agent foundations agenda. I was one of these people. While we made some progress on these problems (such as with the Logical Induction paper), we didn't come close to completing the meta-worldview, let alone building friendly AGI.
With the Agent Foundations team at MIRI eliminated, MIRI's agent foundations agenda is now unambiguously a failed project. I had called MIRI's technical research as likely to fail around 2017, with the increase in internal secrecy, but at thi...

Sep 5, 2024 • 3min
LW - Michael Dickens' Caffeine Tolerance Research by niplav
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Michael Dickens' Caffeine Tolerance Research, published by niplav on September 5, 2024 on LessWrong.
Michael Dickens has read the research and performed two self-experiments on whether consuming caffeine builds up tolerance and, if so, how quickly.
First literature review:
What if instead of taking caffeine every day, you only take it intermittently - say, once every 3 days? How often can most people take caffeine without developing a tolerance?
The scientific literature on this question is sparse. Here's what I found:
1. Experiments on rats found that rats who took caffeine every other day did not develop a tolerance. There are no experiments on humans. There are no experiments that use other intermittent dosing frequencies (such as once every 3 days).
2. Internet forum users report that they can take caffeine on average once every 3 days without developing a tolerance. But there's a lot of variation between individuals.
Second literature review:
If you take caffeine every day, does it stop working? If it keeps working, how much of its effect does it retain?
There are many studies on this question, but most of them have severe methodological limitations. I read all the good studies (on humans) I could find. Here's my interpretation of the literature:
Caffeine almost certainly loses some but not all of its effect when you take it every day.
In expectation, caffeine retains 1/2 of its benefit, but this figure has a wide credence interval.
The studies on cognitive benefits all have some methodological issues so they might not generalize.
There are two studies on exercise benefits with strong methodology, but they have small sample sizes.
First experiment:
I conducted an experiment on myself to see if I would develop a tolerance to caffeine from taking it three days a week. The results suggest that I didn't. Caffeine had just as big an effect at the end of my four-week trial as it did at the beginning.
This outcome is statistically significant (p = 0.016), but the data show a weird pattern: caffeine's effectiveness went up over time instead of staying flat. I don't know how to explain that, which makes me suspicious of the experiment's findings.
Second experiment:
This time I tested if I could have caffeine 4 days a week without getting habituated.
Last time, when I took caffeine 3 days a week, I didn't get habituated but the results were weird. This time, with the more frequent dose, I still didn't get habituated, and the results were weird again! […] But it looks like I didn't get habituated when taking caffeine 4 days a week - or, at least, not to a detectable degree. So I'm going to keep taking caffeine 4 days a week.
When I take caffeine 3 days in a row, do I habituate by the 3rd day?
The evidence suggests that I don't, but the evidence is weak.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Sep 4, 2024 • 6min
LW - What happens if you present 500 people with an argument that AI is risky? by KatjaGrace
Link to original article.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What happens if you present 500 people with an argument that AI is risky?, published by KatjaGrace on September 4, 2024 on LessWrong.
Recently, Nathan Young and I wrote about arguments for AI risk and put them on the AI Impacts wiki. In the process, we ran a casual little survey of the American public regarding how they feel about the arguments, initially (if I recall) just because we were curious whether the arguments we found least compelling would also fail to compel a wide variety of people.
The results were very confusing, so we ended up thinking more about this than initially intended and running four iterations total. This is still a small and scrappy poll to satisfy our own understanding, and doesn't involve careful analysis or error checking. But I'd like to share a few interesting things we found. Perhaps someone else wants to look at our data more carefully, or run more careful surveys about parts of it.
In total we surveyed around 570 people across 4 different polls, with 500 in the main one. The basic structure was:
1. p(doom): "If humanity develops very advanced AI technology, how likely do you think it is that this causes humanity to go extinct or be substantially disempowered?" Responses had to be given in a text box, a slider, or with buttons showing ranges.
2. (Present them with one of eleven arguments, one a 'control')
3. "Do you understand this argument?"
4. "What did you think of this argument?"
5. "How compelling did you find this argument, on a scale of 1-5?"
6. p(doom) again
7. Do you have any further thoughts about this that you'd like to share?
Interesting things:
In the first survey, participants were much more likely to move their probabilities downward than upward, often while saying they found the argument fairly compelling. This is a big part of what initially confused us. We now think this is because each argument had counterarguments listed under it. Evidence in support of this: in the second and fourth rounds we cut the counterarguments and probabilities went overall upward.
When counterarguments were included, three times as many participants moved their probabilities downward as upward (21 vs 7, with 12 unmoved).
In the big round (without counterarguments), arguments pushed people upward slightly more: 20% move upward and 15% move downward overall (and 65% say the same). On average, p(doom) increased by about 1.3% (for non-control arguments, treating button inputs as something like the geometric mean of their ranges).
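(For concreteness, here is a small sketch of the kind of conversion described above, assuming hypothetical button buckets - the exact ranges and responses are not from the post: each button range is mapped to a point estimate via the geometric mean of its endpoints, and the before/after shifts are averaged.)

```python
import math

# Hypothetical button buckets as fractions; the post's exact set isn't reproduced here.
BUTTONS = {"<1%": (0.001, 0.01), "1-5%": (0.01, 0.05), "5-25%": (0.05, 0.25),
           "25-75%": (0.25, 0.75), ">75%": (0.75, 0.99)}

def point_estimate(label: str) -> float:
    lo, hi = BUTTONS[label]
    return math.sqrt(lo * hi)   # geometric mean of the range endpoints

# Made-up example responses: (p(doom) button before the argument, button after).
responses = [("1-5%", "5-25%"), ("5-25%", "5-25%"), ("25-75%", "1-5%")]
shifts = [point_estimate(after) - point_estimate(before) for before, after in responses]
print(sum(shifts) / len(shifts))   # average change in p(doom)
```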
But the input type seemed to make a big difference to how people moved!
It makes sense to me that people move a lot more in both directions with a slider, because it's hard to hit the same number again if you don't remember it. It's surprising to me that they moved with similar frequency with buttons and open response, because the buttons covered relatively chunky ranges (e.g. 5-25%), so only larger shifts would be caught.
Input type also made a big difference to the probabilities people gave to doom before seeing any arguments. People seem to give substantially lower answers when presented with buttons (Nathan proposes this is because there was a <1% and a 1-5% button, which made lower probabilities more salient/"socially acceptable", and I agree).
Overall, P(doom) numbers were fairly high: 24% average, 11% median.
We added a 'control argument'. We presented this as "Here is an argument that advanced AI technology might threaten humanity:" like the others, but it just argued that AI might substantially contribute to music production.
This was the third worst argument in terms of prompting upward probability motion, but the third best in terms of being "compelling". Overall it looked a lot like other arguments, so that's a bit of a blow to the model where e.g. we can communicate somewhat adequately, 'arguments' are more compelling than rando...