The Nonlinear Library: LessWrong

The Nonlinear Fund
Jul 18, 2024 • 10min

LW - We ran an AI safety conference in Tokyo. It went really well. Come next year! by Blaine

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We ran an AI safety conference in Tokyo. It went really well. Come next year!, published by Blaine on July 18, 2024 on LessWrong.

Abstract

Technical AI Safety 2024 (TAIS 2024) was a conference organised by AI Safety 東京 and Noeon Research, in collaboration with Reaktor Japan, AI Alignment Network and AI Industry Foundation. You may have heard of us through ACX. The goals of the conference were to:

1. demonstrate the practice of technical safety research to Japanese researchers new to the field
2. share ideas among established technical safety researchers
3. establish a good international reputation for AI Safety 東京 and Noeon Research
4. establish a Schelling conference for people working in technical safety

We sent out a survey after the conference to get feedback from attendees on whether or not we achieved those goals. We certainly achieved goals 1, 2 and 3; goal 4 remains to be seen. In this post we give more details about the conference, share results from the feedback survey, and announce our intention to run another conference next year.

Okay but like, what was TAIS 2024?

Technical AI Safety 2024 (TAIS 2024) was a small non-archival open academic conference structured as a lecture series. It ran over the course of 2 days, April 5th-6th 2024, at the International Conference Hall of the Plaza Heisei in Odaiba, Tokyo. We had 18 talks covering 6 research agendas in technical AI safety:

- Mechanistic Interpretability
- Developmental Interpretability
- Scalable Oversight
- Agent Foundations
- Causal Incentives
- ALIFE

…including talks from Hoagy Cunningham (Anthropic), Noah Y. Siegel (DeepMind), Manuel Baltieri (Araya), Dan Hendrycks (CAIS), Scott Emmons (CHAI), Ryan Kidd (MATS), James Fox (LISA), and Jesse Hoogland and Stan van Wingerden (Timaeus).

In addition to our invited talks, we had 25 submissions, of which 19 were deemed relevant for presentation. 5 were offered talk slots, and we arranged a poster session to accommodate the remaining 14. In the end, 7 people presented posters, 5 in person and 2 in absentia. Our best poster award was won jointly by Fazl Barez for Large Language Models Relearn Removed Concepts and Alex Spies for Structured Representations in Maze-Solving Transformers.

We had 105 in-person attendees (including the speakers). Our live streams had around 400 unique viewers, and maxed out at 18 concurrent viewers. Recordings of the conference talks are hosted on our YouTube channel.

How did it go?

Very well, thanks for asking! We sent out a feedback survey after the event, and got 68 responses from in-person attendees (58% response rate). With the usual caveats that survey respondents are not necessarily a representative sample of the population: looking good! Let's dig deeper.

How useful was TAIS 2024 for those new to the field?

Event satisfaction was high across the board, which makes it hard to tell how relatively satisfied population subgroups were. Only those who identified themselves as "new to AI safety" were neutrally satisfied, but the newbies were also the most likely to be highly satisfied.
It seems that people new to AI safety had no more or less trouble understanding the talks than those who work for AI safety organisations or have published AI safety research. They were also no more or less likely to make new research collaborations. Note that there is substantial overlap between some of these categories, especially for categories that imply a strong existing relationship to AI safety, so take the above charts with a pinch of salt. In the table below, each row gives a group's size and the percentage of that group who also fall into each column's category:

| | Total | New to AI safety | Part of the AI safety community | Employed by an AI safety org | Has published AI safety research |
| --- | --- | --- | --- | --- | --- |
| New to AI safety | 26 | 100% | 19% | 12% | 4% |
| Part of the AI safety community | 28 | 18% | 100% | 36% | 32% |
| Employed by an AI safety org | 20 | 15% | 50% | 100% | 35% |
| Has published AIS research | 13 | 8% | 69% | 54% | 100% |

Subjectively, it fe...
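As an aside for readers who want to see mechanically how a row-normalised overlap table like the one above is built from survey responses, here is a minimal sketch. It is illustrative only: the helper function and the toy respondent sets are mine, not data or code from the post.

```python
# Minimal sketch of how a row-normalised overlap table is computed from
# survey responses. The category names come from the post; the respondent
# data below is made up purely to illustrate the calculation.

def overlap_table(groups: dict[str, set[str]]) -> dict[str, dict[str, int]]:
    """For each (row, column) pair of groups, return the share of the row
    group's members who are also in the column group, in percent."""
    table = {}
    for row_name, row_members in groups.items():
        table[row_name] = {}
        for col_name, col_members in groups.items():
            share = len(row_members & col_members) / len(row_members)
            table[row_name][col_name] = round(100 * share)
    return table

# Hypothetical respondents tagged with the categories they selected.
groups = {
    "New to AI safety": {"a", "b", "c", "d"},
    "Part of the AI safety community": {"c", "d", "e", "f"},
    "Employed by an AI safety org": {"e", "f"},
}

for row, cols in overlap_table(groups).items():
    print(row, cols)
```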
Jul 18, 2024 • 5min

LW - Friendship is transactional, unconditional friendship is insurance by Ruby

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Friendship is transactional, unconditional friendship is insurance, published by Ruby on July 18, 2024 on LessWrong.

It feels a little icky to say, but we befriend people because we get something out of it. We enjoy the company, the conversation, the emotional support, the activities, the connection, etc. It's not a coincidence people don't befriend brick walls. (The same is true in romantic relationships, except we expect even more.) Granted, friendship is not an explicit transaction that's negotiated, quantified, legally enforceable, etc. It's fuzzy, which helps it work better for reasons I won't really get into here[1]. However, it's crucial to recognize that if your friend (or partner) didn't provide or promise you some kind of value[2], you wouldn't have become friends in the first place.

And yet, people valorize the notion of loyalty in relationships: continuing to be there through thick and thin, good and bad, health and illness. "Unconditional friendship" and "unconditional love". Conversely, "fair weather friendship" is denigrated. People hope to be loved even if they were worms. What gives? How do we reconcile friendships and relationships arising because we receive some value with the aspiration, or even expectation, of unconditionality?

My model here is that friendship functions as something akin to mutual insurance. While I became your friend because we spent years playing basketball together, I stay by your side even when you're recovering from a broken leg, or even if you were injured so badly as to never play again. Someone initially enticed by their partner's beauty stays with them even after a horrific burn to the face. I do this because I expect the same in return. You might argue that in these cases, you're still receiving other benefits even when one of them is lost, but I argue back that we see ongoing care even where there's almost nothing left, e.g. people caring for their senile, bedridden partners. And more so, that we judge people who don't stick it out.

Friendship is standardly a straightforward exchange of value. It is also an exchange of insurance - "if you're not able to provide value to me, I'll still provide value to you," and vice versa. Like the other stuff in friendship, it's fuzzy. The insurance exchange doesn't happen in a discrete moment, and its strength is quantitative and expected to grow over time. People expect more "loyalty" from friends and partners of years than of weeks. In the limit, people reach "unconditional love", meaning something like: from this point on, I will love you no matter what. However, reaching that willingness was very probably tied to specific conditional factors.

It's notable that for many people love and security are connected. Sufficiently loving and supportive relationships provide security because they imply an unconditionality on circumstances - you'll have someone even if misfortune befalls you and you lose what made you appealing in the first place. I think this makes sense. Seems like a good game-theoretic trade, even with a willing partner. "Till death do us part." Possibly worth making a little more explicit though, just to be sure your friends and partners share whatever expectations of loyalty you have. Note that I don't think this dynamic needs to be very conscious on anyone's part.
I think that humans instinctively execute good game theory because evolution selected for it, even if the human executing it just feels a wordless pull to that kind of behavior. In this context, "attachment to others" feels like a thing that humans and other animals experience. Parents, perhaps especially mothers, are very attached to their children (think of the mother bear), but we tend to form attachments to anyone (or anything) that we're persistently around. When I stick with my friend of many years through his illness, it might feel ...
Jul 17, 2024 • 6min

LW - What are you getting paid in? by Austin Chen

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What are you getting paid in?, published by Austin Chen on July 17, 2024 on LessWrong. Crossposting this essay by my friend, Leila Clark.

A long time ago, a manager friend of mine wrote a book to collect his years of wisdom. He never published it, which is a shame because it was full of interesting insights. One that I think a lot about today is the question: "How are you paying your team?"

This friend worked in finance. You might think that people in finance, like most people, are paid in money. But it turns out that even in finance, you can't actually always pay and motivate people with just money. Often, there might just not be money to go around. Even if there is, managers are often captive to salary caps and performance bands. In any case, it's awkward to pay one person ten times more than another, even if one person is clearly contributing ten times more than the other (many such cases exist).

With this question, my manager friend wanted to point out that you can pay people in lots of currencies. Among other things, you can pay them in quality of life, prestige, status, impact, influence, mentorship, power, autonomy, meaning, great teammates, stability and fun. And in fact most people don't just want to be paid in money - they want to be paid in some mixture of these things.

To demonstrate this point, take musicians and financiers. A successful financier is much, much richer in dollars than a successful musician. Some googling suggests that Mitski and Grimes, both very successful alternative musicians, have net worths of about $3-5m. $5m is barely notable in the New York high society circles that most financiers run in. Even Taylor Swift, maybe one of the most successful musicians of all time, has a net worth of, generously, $1b; Ken Griffin, one of the most successful financiers of all time, has a net worth of $33b. But more people want to be musicians, and I think it's because musicians are paid in ways that financiers aren't.

Most obviously, musicians are way cooler. They get to interact with their fans. People love their work. They naturally spend their days hanging out with other cool people - other musicians. They can work on exactly what they want to, largely when they want to - they've won the American Dream because they get to work on what they love and get paid! And in that way, they get paid in radical self-expression. (This is a little unfair, because I know some financiers who think that work is a means of radical self-expression. Knowing their personalities, I believe them, but it doesn't help them get tables at fancy New York restaurants the way Taylor can.)

I don't want to be too down on finance. People are different, and it's a good fact about the world that different people can be paid in different ways. My math genius friends would hate interacting with most fans and musicians. They instead have stable jobs, rent beautiful apartments in New York and solve fun technical problems all day with their friends. That's exactly how they want to get paid. But when I worked in finance, people would sometimes shake their heads and ask why bright 20-year-olds would take the huge risk of moving to New York for unstable and uncertain careers as musicians, actors, or starving artists. I probably asked this question myself, when I was younger. Hopefully this provides some insight to the financiers.
So how do you make sure you get paid the way you want to? From what I can tell, the best way is to pick the right industry. It's fairly straightforward to tell how an industry pays. Politics pays in power. Finance pays in money. Music and art pay in 'coolness.' Nonprofit work, teaching and healthcare pay in meaning and, a friend reports, sometimes a sense of superiority over others too. There's an exchange rate between many of the currencies you can get paid in, but ...
Jul 17, 2024 • 11min

LW - Optimistic Assumptions, Longterm Planning, and "Cope" by Raemon

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Optimistic Assumptions, Longterm Planning, and "Cope", published by Raemon on July 17, 2024 on LessWrong.

Eliezer periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:

- saying "well, if this assumption doesn't hold, we're doomed, so we might as well assume it's true", or
- worse: coming up with cope-y reasons to assume that the assumption isn't even questionable at all - it's just a pretty reasonable worldview.

Sometimes the questionable plan is "an alignment scheme, which Eliezer thinks avoids the hard part of the problem." Sometimes it's a sketchy reckless plan that's probably going to blow up and make things worse. Some people complain about Eliezer being a doomy Negative Nancy who's overly pessimistic.

I had an interesting experience a few months ago when I ran some beta-tests of my Planmaking and Surprise Anticipation workshop, that I think are illustrative.

i. Slipping into a more Convenient World

I have an exercise where I give people the instruction to play a puzzle game ("Baba is You"), but where you would normally be able to move around and interact with the world to experiment and learn things, instead you need to make a complete plan for solving the level, and you aim to get it right on your first try.

In the exercise, I have people write down the steps of their plan, and assign a probability to each step. If there is a part of the puzzle-map that you aren't familiar with, you'll have to make guesses. I recommend making 2-3 guesses for how a new mechanic might work. (I don't recommend making a massive branching tree for every possible eventuality. For the sake of the exercise not taking forever, I suggest making 2-3 branching-path plans.)

Several months ago, I had three young-ish alignment researchers do this task (each session was a 1-1 with just me and them). Each of them looked at the level for a while and said "Well, this looks basically impossible... unless this [questionable assumption I came up with that I don't really believe in] is true. I think that assumption is... 70% likely to be true."

Then they went and executed their plan. It failed. The questionable assumption was not true.

Then, each of them said, again: "okay, well, here's a different sketchy assumption that I wouldn't have thought was likely, except if it's not true, the level seems unsolvable." I asked "what's your probability for that one being true?" "70%." "Okay. You ready to go ahead again?" I asked. "Yep", they said. They tried again. The plan failed again. And then they did it a third time, still saying ~70%.

This happened with three different junior alignment researchers, making a total of 9 predictions, which were wrong 100% of the time. (The third guy, on the second or third time, said "well... okay, I was wrong last time. So this time let's say it's... 60%.")

My girlfriend ran a similar exercise with another group of young smart people, with similar results. "I'm 90% sure this is going to work" ... "okay that didn't work."

Later I ran the exercise again, this time with a mix of younger and more experienced AI safety folk, several of whom leaned more pessimistic. I think the group overall did better. One of them actually made the correct plan on the first try. One of them got it wrong, but gave an appropriately low estimate for themselves.
Another of them (call them Bob) made three attempts, and gave themselves ~50% odds on each attempt. They went into the experience thinking "I expect this to be hard but doable, and I believe in developing the skill of thinking ahead like this." But, after each attempt, Bob was surprised by how out-of-left field their errors were. They'd predicted they'd be surprised... but they were surprised in surprising ways - even in a simplified, toy domain that was optimized for ...
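A quick worked example (my own addition, not from the post) of why that track record is so damning for calibration: if nine predictions are each genuinely 70% likely to be correct and roughly independent, the chance that all nine come out wrong is 0.3^9, about 2 in 100,000.

```python
# Worked example (not from the post): how surprising is it that nine
# predictions, each made with ~70% confidence, all turned out wrong?

p_correct = 0.7    # stated confidence per prediction
n = 9              # three researchers x three attempts each

p_all_wrong = (1 - p_correct) ** n
print(f"Expected correct: {p_correct * n:.1f} of {n}")
print(f"P(all {n} wrong, if well-calibrated and independent): {p_all_wrong:.2e}")  # ~2e-05
```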
Jul 17, 2024 • 1min

LW - Turning Your Back On Traffic by jefftk

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Turning Your Back On Traffic, published by jefftk on July 17, 2024 on LessWrong.

We do a lot of walking around the neighborhood with kids, which usually involves some people getting to intersections a while before others. I'm not worried about even the youngest going into the street on their own - Nora's been street-trained for about a year - but we have to be careful about what signals we send to cars. Someone standing at an intersection facing traffic looks to a driver like they're waiting for the opportunity to cross. Waving drivers to continue doesn't work well: they tend to slow down significantly, and many of them will wave back in a misguided attempt at "no, you first" politeness. Instead, what seems to work well is turning your back to the street.

This isn't perfect: some drivers still read anyone stationary near an intersection as intending to cross, but it's pretty good. And it's especially good for little kids: not only do they often like to look intently at passing traffic in a way that is concerning to drivers and passers-by, but it's a clear signal to the parent that the kid knows it's not time to cross yet.

Comment via: facebook, mastodon

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 17, 2024 • 3min

LW - Why the Best Writers Endure Isolation by Declan Molony

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why the Best Writers Endure Isolation, published by Declan Molony on July 17, 2024 on LessWrong.

Douglas Adams, author of The Hitchhiker's Guide to the Galaxy, was once locked in a room for three weeks until he completed one of his books. Victor Hugo, when faced with a deadline for his book The Hunchback of Notre Dame, locked all his clothes away except for a large shawl. "Lacking any suitable clothing to go outdoors, [he] was no longer tempted to leave the house and get distracted. Staying inside and writing was his only option." Six months later, the book was published.

Dozens of famous authors have done the same. Names like Virginia Woolf, Henry David Thoreau, Mark Twain - all of them constructed small writing sheds from which to work. Names like Ian Fleming, Maya Angelou, and George Orwell - the first two penned their novels while locked in hotel rooms, while Orwell isolated himself on a remote Scottish island to write.

One explanation for this reclusive behavior comes from author Neil Gaiman in an interview he did with Tim Ferriss a few years ago. Ferriss mentioned Gaiman's most important rule for writing: You can sit here and write, or you can sit here and do nothing. But you can't sit here and do anything else.

Gaiman, after a moment of reflection, responded by saying: I would go down to my lovely little gazebo [at the] bottom of the garden [and] sit down. I'm absolutely allowed not to do anything. I'm allowed to sit at my desk. I'm allowed to stare out at the world. I'm allowed to do anything I like, as long as it isn't anything. Not allowed to do a crossword; not allowed to read a book; not allowed to phone a friend. All I'm allowed to do is absolutely nothing or write. What I love about that is I'm giving myself permission to write or not write. But writing is actually more interesting than doing nothing after a while. You sit there and you've been staring out the window now for five minutes, and it kind of loses its charm. You [eventually think], "well actually…[I] might as well write something."

Writing is hard. Between writing or doing anything else, most writers - even some of the most accomplished ones - acquiesce to distraction. That's why so many of them construct and work in environments devoid of external stimuli - the better to circumvent akrasia.

I do all my writing in coffee shops. Similar to Gaiman, I allow myself to do one of two things: write, or people-watch. I don't bring anything with me except for a pencil, paper, and my research material housed in my journals. That means no phone, no laptop, and no watch (even knowing the time is a kind of distraction and pressure to perform). Within this environment, I end up writing because I've made it the path of least resistance.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 17, 2024 • 8min

LW - DM Parenting by Shoshannah Tekofsky

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: DM Parenting, published by Shoshannah Tekofsky on July 17, 2024 on LessWrong. Cause no one will question your ethics if you refer to yourself as a Dungeon Mom.

I snort experimentation to feel alive. It's a certain type of orientation to life, completely at odds with all parenting advice about predictability and routine. Enter DM parenting. Where you approach every parenting task as a Dungeons and Dragons session where you are shepherding a team of pure outliers on the enthusiasm-skill spectrum through the Sisyphean ordeal of rolling their toothbrush up hi … no, wait, stop that! Anyway. You need them to fight the BBEG cause otherwise you are not having fun, but who says they wouldn't rather murder hobo their way through the local dairy supply chain?

As a DM, you have to juggle an objective, your own enjoyment, and the enjoyment of your players. This is basically parenting. Of course, as a DM, you generally play with people who have opted in while playing according to a rule set someone lovingly crafted for you. Luckily kids love to play, and if you pick the right rule set, they will probably be game. Except no one wrote any rule sets on how to DM kids into their pyjamas. Till now.

My kids are young - 3 and 5. These rules work far better for the older of the two. I assume they will keep working better till they become old enough to build their own rules, but here is where we got in the last 2 weeks or so:

Bedtime Rules

Peekaboo

You close your eyes and keep them closed while your kid still needs to get ready for bed. But of course, you try to check if everything is going ok by blindly reaching out your hands. I'd recommend exaggerating your ineptitude at determining if the little one has actually put on their pyjama. It can also be fun to let them advise you on how to navigate the environment. The perspective-taking training on this one seems to lead to additional giggles.

Tickle Station

Every time your kid does a bedtime task, they can dock into the tickle station and get tickled by you! Personally I made a tickle station by just reaching out my arms and pretending I was a booth. Some warning here that some kids do not like to be tickled, so explicitly check if they find this fun, and also, crucially, let them come to you to receive tickles. In our case, my kiddos love being tickled. It has gotten to the point that the tickle station has become a bit of an emotional regulation option with me and the kids now, cause it helps them out of a funk quite easily.

Walk a Mile…

… in momma's (or papa's) shoes. Just let them wear your shoes while going through the entire bedtime routine. This was kind of amusing to watch. Might be important to keep them away from stairwells and the like.

Duel Shots

Grab two clothes pins and an elastic band. Hook the elastic band around the (closed) front of the clothes pin and pull back. You can now shoot elastic bands without them snapping your fingers. For the rule set, you both get one clothes pin. For each step of the bedtime routine you shoot your kid with the elastic band and they can shoot you back. Obviously, this can hurt quite a bit, so as an opt-out either of you can shout "mirror" and then the other person will have to shoot the mirror image of you instead. You may now discover if your child has ever shot an elastic band before. Mine had not.
The mechanics of aim and force were a complete mystery to her. If you find yourself in this situation, then an updated rule set is that the shooter can keep going till they hit. The result in our household was a lot of delight and the absolute slowest bedtime routine yet.

Ghost

You wear a blanket over your head and try to catch the kid while they are putting on their pyjama. If they get too excited they may fail to put on their pyjama altogether. If they get sad about being caught, you can tr...
Jul 16, 2024 • 14min

LW - Multiplex Gene Editing: Where Are We Now? by sarahconstantin

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Multiplex Gene Editing: Where Are We Now?, published by sarahconstantin on July 16, 2024 on LessWrong.

We're starting to get working gene therapies for single-mutation genetic disorders, and genetically modified cell therapies for attacking cancer. Some of them use CRISPR-based gene editing, a new technology (that earned Jennifer Doudna and Emmanuelle Charpentier the 2020 Nobel Prize) to "cut" and "paste" a cell's DNA. But so far, the FDA-approved therapies can only edit one gene at a time. What if we want to edit more genes? Why is that hard, and how close are we to getting there?

How CRISPR Works

CRISPR is based on a DNA-cutting enzyme (the Cas9 nuclease), a synthetic guide RNA (gRNA), and another bit of RNA (tracrRNA) that's complementary to the gRNA. Researchers can design whatever guide RNA sequence they want; the gRNA will stick to the complementary part of the target DNA, the tracrRNA will complex with it, and the nuclease will make a cut there. So, that's the "cut" part - the "paste" comes from a template DNA sequence, again of the researchers' choice, which is included along with the CRISPR components. Usually all these sequences of nucleic acids are packaged in a circular plasmid, which is transfected into cells with nanoparticles or (non-disease-causing) viruses.

So, why can't you make a CRISPR plasmid with arbitrarily many genes to edit? There are a couple of reasons:

1. Plasmids can't be too big or they won't fit inside the virus or the lipid nanoparticle. Lipid nanoparticles have about a 20,000 base-pair limit; adeno-associated viruses (AAV), the most common type of virus used in gene delivery, have a smaller payload, more like 4700 base pairs.
   1. This places a very strict restriction on how many complete gene sequences can be inserted - some genes are millions of base pairs long, and the average gene is thousands!
   2. But if you're just making a very short edit to each gene, like a point mutation, or if you're deleting or inactivating the gene, payload limits aren't much of a factor.
2. DNA damage is bad for cells in high doses, particularly when it involves double-strand breaks. This also places limits on how many simultaneous edits you can do.
3. A guide RNA won't necessarily only bind to a single desired spot on the whole genome; it can also bind elsewhere, producing so-called "off-target" edits. If each guide RNA produces x off-target edits, then naively you'd expect 10 guide RNAs to produce 10x off-target edits… and at some point that'll reach an unacceptable risk of side effects from randomly screwing up the genome.
4. An edit won't necessarily work every time, on every strand of DNA in every cell. (The rate of successful edits is known as the efficiency.) The more edits you try to make, the lower the efficiency will be for getting all edits simultaneously; if each edit is 50% efficient, then two edits will be 25% efficient or (more likely) even less.

None of these issues make it fundamentally impossible to edit multiple genes with CRISPR and associated methods, but they do mean that the more (and bigger) edits you try to make, the greater the chance of failure or unacceptable side effects.
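To make that scaling concrete, here is a small illustrative sketch (mine, not from the post) of the naive model the last two points describe: per-edit efficiencies multiply, and off-target edits add up with the number of guides. The 0.2 off-targets-per-guide figure is a made-up placeholder.

```python
# Naive scaling model (illustrative only, not from the post): assume each
# edit succeeds independently with the same per-edit efficiency, and each
# guide RNA contributes the same number of off-target edits.

def combined_efficiency(per_edit_efficiency: float, n_edits: int) -> float:
    """Probability that all n edits succeed in the same cell."""
    return per_edit_efficiency ** n_edits

def expected_off_targets(off_targets_per_guide: float, n_guides: int) -> float:
    """Naive expectation: off-target edits add up linearly with guides."""
    return off_targets_per_guide * n_guides

for n in (1, 2, 5, 10):
    eff = combined_efficiency(0.5, n)      # 50% efficiency per edit (from the post)
    off = expected_off_targets(0.2, n)     # hypothetical 0.2 off-targets per guide
    print(f"{n:2d} edits: all-succeed probability {eff:.3f}, "
          f"expected off-target edits {off:.1f}")
```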
How Base and Prime Editors Work

Base editors are an alternative to CRISPR that don't involve any DNA cutting; instead, they use a CRISPR-style guide RNA to bind to a target sequence, and then convert a single base pair chemically - they turn a C/G base pair to an A/T, or vice versa. Without any double-strand breaks, base editors are less toxic to cells and less prone to off-target effects. The downside is that you can only use base editors to make single-point mutations; they're no good for large insertions or deletions. Prime editors, similarly, don't introduce double-strand breaks; instead, they include an enzyme ("nickase") that produces a single-strand "nick"...
Jul 16, 2024 • 26min

LW - Dialogue on What It Means For Something to Have A Function/Purpose by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dialogue on What It Means For Something to Have A Function/Purpose, published by johnswentworth on July 16, 2024 on LessWrong.

Context for LW audience: Ramana, Steve and John regularly talk about stuff in the general cluster of agency, abstraction, optimization, compression, purpose, representation, etc. We decided to write down some of our discussion and post it here. This is a snapshot of us figuring stuff out together.

Hooks from Ramana:

- Where does normativity come from?
- Two senses of "why" (from Dennett): How come? vs What for? (The latter is more sophisticated, and less resilient. Does it supervene on the former?)
- An optimisation process is something that produces/selects things according to some criterion. The products of an optimisation process will have some properties related to the optimisation criterion, depending on how good the process is at finding optimal products. The products of an optimisation process may or may not themselves be optimisers (i.e. be a thing that runs an optimisation process itself), or may have goals themselves. But neither of these are necessary. Things get interesting when some optimisation process (with a particular criterion) is producing products that are optimisers or have goals. Then we can start looking at what the relationship is between the goals of the products, or the optimisation criteria of the products, vs the optimisation criterion of the process that produced them.
- If you're modeling "having mental content" as having a Bayesian network, at some point I think you'll run into the question of where the (random) variables come from. I worry that the real-life process of developing mental content mixes up creating variables with updating beliefs a lot more than the Bayesian network model lets on.
- A central question regarding normativity for me is "Who/what is doing the enforcing?", "What kind of work goes into enforcing?"
- Also to clarify, by normativity I was trying to get at the relationship between some content and the thing it represents. Like, there's a sense that the content is "supposed to" track or be like the thing it represents. There's a normative standard on the content. It can be wrong, it can be corrected, etc. It can't just be. If it were just being, which is how things presumably start out, it wouldn't be representing.

Intrinsic Purpose vs Purpose Grounded in Evolution

Steve

As you know, I totally agree that mental content is normative - this was a hard lesson for philosophers to swallow, or at least the ones that tried to "naturalize" mental content (make it a physical fact) by turning to causal correlations. Causal correlations was a natural place to start, but the problem with it is that intuitively mental content can misrepresent - my brain can represent Santa Claus even though (sorry) it can't have any causal relation with Santa. (I don't mean my brain can represent ideas or concepts or stories or pictures of Santa - I mean it can represent Santa.)

Ramana

Misrepresentation implies normativity, yep. In the spirit of recovering a naturalisation project, my question is: whence normativity? How does it come about? How did it evolve? How do you get some proto-normativity out of a purely causal picture that's close to being contentful?
Steve

So one standard story here about mental representation is teleosemantics, that roughly something in my brain can represent something in the world by having the function to track that thing. It may be a "fact of nature" that the heart is supposed to pump blood, even though in fact hearts can fail to pump blood. This is already contentious, that it's a fact the heart is supposed to pump blood - but if so, it may similarly be a fact of nature that some brain state is supposed to track something in the world, even when it fails to. So teleology introduces the possibility of m...
Jul 16, 2024 • 10min

LW - I found >800 orthogonal "write code" steering vectors by Jacob G-W

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I found >800 orthogonal "write code" steering vectors, published by Jacob G-W on July 16, 2024 on LessWrong.

Produced as part of the MATS Summer 2024 program, under the mentorship of Alex Turner (TurnTrout). A few weeks ago, I stumbled across a very weird fact: it is possible to find multiple steering vectors in a language model that activate very similar behaviors while all being orthogonal. This was pretty surprising to me and to some people that I talked to, so I decided to write a post about it. I don't currently have the bandwidth to investigate this much more, so I'm just putting this post and the code up. I'll first discuss how I found these orthogonal steering vectors, then share some results. Finally, I'll discuss some possible explanations for what is happening.

Methodology

My work here builds upon Mechanistically Eliciting Latent Behaviors in Language Models (MELBO). I use MELBO to find steering vectors. Once I have a MELBO vector, I then use my algorithm to generate vectors orthogonal to it that do similar things.

Define f(x) as the activation-activation map that takes as input layer 8 activations of the language model and returns layer 16 activations after being passed through layers 9-16 (these are of shape n_sequence x d_model). MELBO can be stated as finding a vector θ with a constant norm such that f(x+θ) is maximized, for some definition of maximized. Then one can repeat the process with the added constraint that the new vector is orthogonal to all the previous vectors, so that the process finds semantically different vectors. Mack and Turner's interesting finding was that this process finds interesting and interpretable vectors.

I modify the process slightly by instead finding orthogonal vectors that produce similar layer 16 outputs. The algorithm (I call it MELBO-ortho) looks like this:

1. Let θ0 be an interpretable steering vector that MELBO found that gets added to layer 8.
2. Define z(θ) as (1/S) Σ_{i=1}^{S} f(x+θ)_i, with x being activations on some prompt (for example "How to make a bomb?"). S is the number of tokens in the residual stream. z(θ0) is just the residual stream at layer 16, meaned over the sequence dimension, when steering with θ0.
3. Introduce a new learnable steering vector called θ.
4. For n steps, calculate ||z(θ) - z(θ0)|| and then use gradient descent to minimize it (θ is the only learnable parameter). After each step, project θ onto the subspace that is orthogonal to θ0 and all θi.

Then repeat the process multiple times, appending the generated vector to the vectors that the new vector must be orthogonal to. This algorithm imposes a hard constraint that θ is orthogonal to all previous steering vectors while optimizing θ to induce the same activations that θ0 induced on input x. And it turns out that this algorithm works and we can find steering vectors that are orthogonal (and have ~0 cosine similarity) while having very similar effects.

Results

I tried this method on four MELBO vectors: a vector that made the model respond in python code, a vector that made the model respond as if it was an alien species, a vector that made the model output a math/physics/cs problem, and a vector that jailbroke the model (got it to do things it would normally refuse). I ran all experiments on Qwen1.5-1.8B-Chat, but I suspect this method would generalize to other models.
Qwen1.5-1.8B-Chat has a 2048 dimensional residual stream, so there can be a maximum of 2048 orthogonal vectors generated. My method generated 1558 orthogonal coding vectors, and then the remaining vectors started going to zero. I'll focus first on the code vector and then talk about the other vectors. My philosophy when investigating language model outputs is to look at the outputs really hard, so I'll give a bunch of examples of outputs. Feel free to skim them. You can see the full outputs of all t...
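For readers who want to see the shape of the MELBO-ortho loop described above in code, here is a minimal PyTorch-style sketch. It is my own reconstruction, not the author's code: the function f (the layer-8-to-layer-16 map), the prompt activations x, and all hyperparameters (optimiser choice, step count, learning rate) are placeholders.

```python
# Minimal sketch (not the author's code) of the MELBO-ortho idea described
# above: learn a new steering vector theta that reproduces the layer-16
# activations induced by a known vector theta_0, while staying orthogonal
# to theta_0 and to every previously found vector.

import torch

def project_out(theta: torch.Tensor, basis: list[torch.Tensor]) -> torch.Tensor:
    """Project theta onto the subspace orthogonal to every vector in basis."""
    for b in basis:
        b = b / b.norm()
        theta = theta - (theta @ b) * b
    return theta

def find_orthogonal_vector(f, x, theta_0, previous, d_model, steps=500, lr=1e-2):
    # f: maps layer-8 activations (plus steering vector) to layer-16
    #    activations of shape (n_sequence, d_model); assumed differentiable.
    # x: layer-8 activations for some prompt, shape (n_sequence, d_model).
    with torch.no_grad():
        target = f(x + theta_0).mean(dim=0)      # z(theta_0): mean over tokens

    theta = torch.randn(d_model, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z = f(x + theta).mean(dim=0)             # z(theta)
        loss = (z - target).norm()               # ||z(theta) - z(theta_0)||
        loss.backward()
        opt.step()
        with torch.no_grad():                    # hard orthogonality constraint
            theta.copy_(project_out(theta, [theta_0, *previous]))
    return theta.detach()

# Repeated application: each new vector must also stay orthogonal to the
# vectors found so far.
# found = []
# for _ in range(num_vectors):
#     found.append(find_orthogonal_vector(f, x, theta_0, found, d_model))
```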
