The Nonlinear Library: LessWrong

The Nonlinear Fund
Jul 18, 2024 • 10min

LW - We ran an AI safety conference in Tokyo. It went really well. Come next year! by Blaine

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We ran an AI safety conference in Tokyo. It went really well. Come next year!, published by Blaine on July 18, 2024 on LessWrong.

Abstract

Technical AI Safety 2024 (TAIS 2024) was a conference organised by AI Safety 東京 and Noeon Research, in collaboration with Reaktor Japan, AI Alignment Network and AI Industry Foundation. You may have heard of us through ACX. The goals of the conference were to:

1. demonstrate the practice of technical safety research to Japanese researchers new to the field
2. share ideas among established technical safety researchers
3. establish a good international reputation for AI Safety 東京 and Noeon Research
4. establish a Schelling conference for people working in technical safety

We sent out a survey after the conference to get feedback from attendees on whether or not we achieved those goals. We certainly achieved goals 1, 2 and 3; goal 4 remains to be seen. In this post we give more details about the conference, share results from the feedback survey, and announce our intention to run another conference next year.

Okay but like, what was TAIS 2024?

Technical AI Safety 2024 (TAIS 2024) was a small non-archival open academic conference structured as a lecture series. It ran over the course of 2 days, April 5th-6th 2024, at the International Conference Hall of the Plaza Heisei in Odaiba, Tokyo. We had 18 talks covering 6 research agendas in technical AI safety:

- Mechanistic Interpretability
- Developmental Interpretability
- Scalable Oversight
- Agent Foundations
- Causal Incentives
- ALIFE

…including talks from Hoagy Cunningham (Anthropic), Noah Y. Siegel (DeepMind), Manuel Baltieri (Araya), Dan Hendrycks (CAIS), Scott Emmons (CHAI), Ryan Kidd (MATS), James Fox (LISA), and Jesse Hoogland and Stan van Wingerden (Timaeus).

In addition to our invited talks, we had 25 submissions, of which 19 were deemed relevant for presentation. 5 were offered talk slots, and we arranged a poster session to accommodate the remaining 14. In the end, 7 people presented posters, 5 in person and 2 in absentia. Our best poster award was won jointly by Fazl Barez for Large Language Models Relearn Removed Concepts and Alex Spies for Structured Representations in Maze-Solving Transformers.

We had 105 in-person attendees (including the speakers). Our live streams had around 400 unique viewers, and maxed out at 18 concurrent viewers. Recordings of the conference talks are hosted on our YouTube channel.

How did it go?

Very well, thanks for asking! We sent out a feedback survey after the event, and got 68 responses from in-person attendees (58% response rate). With the usual caveats that survey respondents are not necessarily a representative sample of the population: looking good! Let's dig deeper.

How useful was TAIS 2024 for those new to the field?

Event satisfaction was high across the board, which makes it hard to tell how relatively satisfied population subgroups were. Only those who identified themselves as "new to AI safety" were neutrally satisfied, but the newbies were also the most likely to be highly satisfied.
It seems that people new to AI safety had no more or less trouble understanding the talks than those who work for AI safety organisations or have published AI safety research. They were also no more or less likely to make new research collaborations. Note that there is substantial overlap between some of these categories, especially for categories that imply a strong existing relationship to AI safety, so take the above charts with a pinch of salt. In the table below, each row gives a group's size and the percentage of that group who also fall into each column's category:

| | Total | New to AI safety | Part of the AI safety community | Employed by an AI safety org | Has published AI safety research |
| --- | --- | --- | --- | --- | --- |
| New to AI safety | 26 | 100% | 19% | 12% | 4% |
| Part of the AI safety community | 28 | 18% | 100% | 36% | 32% |
| Employed by an AI safety org | 20 | 15% | 50% | 100% | 35% |
| Has published AIS research | 13 | 8% | 69% | 54% | 100% |

Subjectively, it fe...
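As an aside for readers who want to see mechanically how a row-normalised overlap table like the one above is built from survey responses, here is a minimal sketch. It is illustrative only: the helper function and the toy respondent sets are mine, not data or code from the post.

```python
# Minimal sketch of how a row-normalised overlap table is computed from
# survey responses. The category names come from the post; the respondent
# data below is made up purely to illustrate the calculation.

def overlap_table(groups: dict[str, set[str]]) -> dict[str, dict[str, int]]:
    """For each (row, column) pair of groups, return the share of the row
    group's members who are also in the column group, in percent."""
    table = {}
    for row_name, row_members in groups.items():
        table[row_name] = {}
        for col_name, col_members in groups.items():
            share = len(row_members & col_members) / len(row_members)
            table[row_name][col_name] = round(100 * share)
    return table

# Hypothetical respondents tagged with the categories they selected.
groups = {
    "New to AI safety": {"a", "b", "c", "d"},
    "Part of the AI safety community": {"c", "d", "e", "f"},
    "Employed by an AI safety org": {"e", "f"},
}

for row, cols in overlap_table(groups).items():
    print(row, cols)
```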
Jul 18, 2024 • 5min

LW - Friendship is transactional, unconditional friendship is insurance by Ruby

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Friendship is transactional, unconditional friendship is insurance, published by Ruby on July 18, 2024 on LessWrong.

It feels a little icky to say, but we befriend people because we get something out of it. We enjoy the company, the conversation, the emotional support, the activities, the connection, etc. It's not a coincidence people don't befriend brick walls. (The same is true in romantic relationships, except we expect even more.) Granted, friendship is not an explicit transaction that's negotiated, quantified, legally enforceable, etc. It's fuzzy, which helps it work better for reasons I won't really get into here[1]. However, it's crucial to recognize that if your friend (or partner) didn't provide or promise you some kind of value[2], you wouldn't have become friends in the first place.

And yet, people valorize the notion of loyalty in relationships: continuing to be there through thick and thin, good and bad, health and illness. "Unconditional friendship" and "unconditional love". Conversely, "fair weather friendship" is denigrated. People hope to be loved even if they were worms. What gives? How do we reconcile friendships and relationships arising because we receive some value with the aspiration, or even expectation, of unconditionality?

My model here is that friendship functions as something akin to mutual insurance. While I became your friend because we spent years playing basketball together, I stay by your side even when you're recovering from a broken leg, or even if you were injured so badly as to never play again. Someone initially enticed by their partner's beauty stays with them even after a horrific burn to the face. I do this because I expect the same in return. You might argue that in these cases, you're still receiving other benefits even when one of them is lost, but I argue back that we see ongoing care even where there's almost nothing left, e.g. people caring for their senile, bedridden partners. And more so, that we judge people who don't stick it out.

Friendship is standardly a straightforward exchange of value. It is also an exchange of insurance - "if you're not able to provide value to me, I'll still provide value to you," and vice versa. Like the other stuff in friendship, it's fuzzy. The insurance exchange doesn't happen in a discrete moment, and its strength is quantitative and expected to grow over time. People expect more "loyalty" from friends and partners of years than of weeks. In the limit, people reach "unconditional love", meaning something like: from this point on, I will love you no matter what. However, reaching that willingness was very probably tied to specific conditional factors.

It's notable that for many people love and security are connected. Sufficiently loving and supportive relationships provide security because they imply an unconditionality on circumstances - you'll have someone even if misfortune befalls you and you lose what made you appealing in the first place. I think this makes sense. Seems like a good game-theoretic trade, even with a willing partner. "Till death do us part." Possibly worth making a little more explicit though, just to be sure your friends and partners share whatever expectations of loyalty you have. Note that I don't think this dynamic needs to be very conscious on anyone's part.
I think that humans instinctively execute good game theory because evolution selected for it, even if the human executing it just feels a wordless pull to that kind of behavior. In this context, "attachment to others" feels like a thing that humans and other animals experience. Parents, perhaps especially mothers, are very attached to their children (think of the mother bear), but we tend to form attachments to anyone (or anything) that we're persistently around. When I stick with my friend of many years through his illness, it might feel ...
Jul 17, 2024 • 6min

LW - What are you getting paid in? by Austin Chen

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What are you getting paid in?, published by Austin Chen on July 17, 2024 on LessWrong. Crossposting this essay by my friend, Leila Clark.

A long time ago, a manager friend of mine wrote a book to collect his years of wisdom. He never published it, which is a shame because it was full of interesting insights. One that I think a lot about today is the question: "How are you paying your team?"

This friend worked in finance. You might think that people in finance, like most people, are paid in money. But it turns out that even in finance, you can't actually always pay and motivate people with just money. Often, there might just not be money to go around. Even if there is, managers are often captive to salary caps and performance bands. In any case, it's awkward to pay one person ten times more than another, even if one person is clearly contributing ten times more than the other (many such cases exist).

With this question, my manager friend wanted to point out that you can pay people in lots of currencies. Among other things, you can pay them in quality of life, prestige, status, impact, influence, mentorship, power, autonomy, meaning, great teammates, stability and fun. And in fact most people don't just want to be paid in money - they want to be paid in some mixture of these things.

To demonstrate this point, take musicians and financiers. A successful financier is much, much richer in dollars than a successful musician. Some googling suggests that Mitski and Grimes, both very successful alternative musicians, have net worths of about $3-5m. $5m is barely notable in the New York high society circles that most financiers run in. Even Taylor Swift, maybe one of the most successful musicians of all time, has a net worth of, generously, $1b; Ken Griffin, one of the most successful financiers of all time, has a net worth of $33b. But more people want to be musicians, and I think it's because musicians are paid in ways that financiers aren't.

Most obviously, musicians are way cooler. They get to interact with their fans. People love their work. They naturally spend their days hanging out with other cool people - other musicians. They can work on exactly what they want to, largely when they want to - they've won the American Dream because they get to work on what they love and get paid! And in that way, they get paid in radical self-expression. (This is a little unfair, because I know some financiers who think that work is a means of radical self-expression. Knowing their personalities, I believe them, but it doesn't help them get tables at fancy New York restaurants the way Taylor can.)

I don't want to be too down on finance. People are different, and it's a good fact about the world that different people can be paid in different ways. My math genius friends would hate interacting with most fans and musicians. They instead have stable jobs, rent beautiful apartments in New York and solve fun technical problems all day with their friends. That's exactly how they want to get paid. But when I worked in finance, people would sometimes shake their heads and ask why bright 20-year-olds would take the huge risk of moving to New York for unstable and uncertain careers as musicians, actors, or starving artists. I probably asked this question myself, when I was younger. Hopefully this provides some insight to the financiers.
So how do you make sure you get paid the way you want to? From what I can tell, the best way is to pick the right industry. It's fairly straightforward to tell how an industry pays. Politics pays in power. Finance pays in money. Music and art pay in 'coolness.' Nonprofit work, teaching and healthcare pay in meaning and, a friend reports, sometimes a sense of superiority over others too. There's an exchange rate between many of the currencies you can get paid in, but ...
Jul 17, 2024 • 11min

LW - Optimistic Assumptions, Longterm Planning, and "Cope" by Raemon

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Optimistic Assumptions, Longterm Planning, and "Cope", published by Raemon on July 17, 2024 on LessWrong.

Eliezer periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:

- saying "well, if this assumption doesn't hold, we're doomed, so we might as well assume it's true", or
- worse: coming up with cope-y reasons to assume that the assumption isn't even questionable at all - it's just a pretty reasonable worldview.

Sometimes the questionable plan is "an alignment scheme, which Eliezer thinks avoids the hard part of the problem." Sometimes it's a sketchy reckless plan that's probably going to blow up and make things worse. Some people complain about Eliezer being a doomy Negative Nancy who's overly pessimistic.

I had an interesting experience a few months ago when I ran some beta-tests of my Planmaking and Surprise Anticipation workshop, that I think are illustrative.

i. Slipping into a more Convenient World

I have an exercise where I give people the instruction to play a puzzle game ("Baba is You"), but where you would normally be able to move around and interact with the world to experiment and learn things, instead you need to make a complete plan for solving the level, and you aim to get it right on your first try.

In the exercise, I have people write down the steps of their plan, and assign a probability to each step. If there is a part of the puzzle-map that you aren't familiar with, you'll have to make guesses. I recommend making 2-3 guesses for how a new mechanic might work. (I don't recommend making a massive branching tree for every possible eventuality. For the sake of the exercise not taking forever, I suggest making 2-3 branching-path plans.)

Several months ago, I had three young-ish alignment researchers do this task (each session was a 1-1 with just me and them). Each of them looked at the level for a while and said "Well, this looks basically impossible... unless this [questionable assumption I came up with that I don't really believe in] is true. I think that assumption is... 70% likely to be true."

Then they went and executed their plan. It failed. The questionable assumption was not true.

Then, each of them said, again: "okay, well, here's a different sketchy assumption that I wouldn't have thought was likely, except if it's not true, the level seems unsolvable." I asked "what's your probability for that one being true?" "70%." "Okay. You ready to go ahead again?" I asked. "Yep", they said. They tried again. The plan failed again. And then they did it a third time, still saying ~70%.

This happened with three different junior alignment researchers, making a total of 9 predictions, which were wrong 100% of the time. (The third guy, on the second or third time, said "well... okay, I was wrong last time. So this time let's say it's... 60%.")

My girlfriend ran a similar exercise with another group of young smart people, with similar results. "I'm 90% sure this is going to work" ... "okay that didn't work."

Later I ran the exercise again, this time with a mix of younger and more experienced AI safety folk, several of whom leaned more pessimistic. I think the group overall did better. One of them actually made the correct plan on the first try. One of them got it wrong, but gave an appropriately low estimate for themselves.
Another of them (call them Bob) made three attempts, and gave themselves ~50% odds on each attempt. They went into the experience thinking "I expect this to be hard but doable, and I believe in developing the skill of thinking ahead like this." But, after each attempt, Bob was surprised by how out-of-left field their errors were. They'd predicted they'd be surprised... but they were surprised in surprising ways - even in a simplified, toy domain that was optimized for ...
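A quick worked example (my own addition, not from the post) of why that track record is so damning for calibration: if nine predictions are each genuinely 70% likely to be correct and roughly independent, the chance that all nine come out wrong is 0.3^9, about 2 in 100,000.

```python
# Worked example (not from the post): how surprising is it that nine
# predictions, each made with ~70% confidence, all turned out wrong?

p_correct = 0.7    # stated confidence per prediction
n = 9              # three researchers x three attempts each

p_all_wrong = (1 - p_correct) ** n
print(f"Expected correct: {p_correct * n:.1f} of {n}")
print(f"P(all {n} wrong, if well-calibrated and independent): {p_all_wrong:.2e}")  # ~2e-05
```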
Jul 17, 2024 • 1min

LW - Turning Your Back On Traffic by jefftk

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Turning Your Back On Traffic, published by jefftk on July 17, 2024 on LessWrong.

We do a lot of walking around the neighborhood with kids, which usually involves some people getting to intersections a while before others. I'm not worried about even the youngest going into the street on their own - Nora's been street-trained for about a year - but we have to be careful about what signals we send to cars. Someone standing at an intersection facing traffic looks to a driver like they're waiting for the opportunity to cross. Waving drivers to continue doesn't work well: they tend to slow down significantly, and many of them will wave back in a misguided attempt at "no, you first" politeness. Instead, what seems to work well is turning your back to the street.

This isn't perfect: some drivers still read anyone stationary near an intersection as intending to cross, but it's pretty good. And it's especially good for little kids: not only do they often like to look intently at passing traffic in a way that is concerning to drivers and passers-by, but it's a clear signal to the parent that the kid knows it's not time to cross yet.

Comment via: facebook, mastodon

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 17, 2024 • 3min

LW - Why the Best Writers Endure Isolation by Declan Molony

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why the Best Writers Endure Isolation, published by Declan Molony on July 17, 2024 on LessWrong.

Douglas Adams, author of The Hitchhiker's Guide to the Galaxy, was once locked in a room for three weeks until he completed one of his books. Victor Hugo, when faced with a deadline for his book The Hunchback of Notre Dame, locked all his clothes away except for a large shawl. "Lacking any suitable clothing to go outdoors, [he] was no longer tempted to leave the house and get distracted. Staying inside and writing was his only option." Six months later, the book was published.

Dozens of famous authors have done the same. Names like Virginia Woolf, Henry David Thoreau, Mark Twain - all of them constructed small writing sheds from which to work. Names like Ian Fleming, Maya Angelou, and George Orwell - the first two penned their novels while locked in hotel rooms, while Orwell isolated himself on a remote Scottish island to write.

One explanation for this reclusive behavior comes from author Neil Gaiman in an interview he did with Tim Ferriss a few years ago. Ferriss mentioned Gaiman's most important rule for writing: You can sit here and write, or you can sit here and do nothing. But you can't sit here and do anything else.

Gaiman, after a moment of reflection, responded by saying: I would go down to my lovely little gazebo [at the] bottom of the garden [and] sit down. I'm absolutely allowed not to do anything. I'm allowed to sit at my desk. I'm allowed to stare out at the world. I'm allowed to do anything I like, as long as it isn't anything. Not allowed to do a crossword; not allowed to read a book; not allowed to phone a friend. All I'm allowed to do is absolutely nothing or write. What I love about that is I'm giving myself permission to write or not write. But writing is actually more interesting than doing nothing after a while. You sit there and you've been staring out the window now for five minutes, and it kind of loses its charm. You [eventually think], "well actually…[I] might as well write something."

Writing is hard. Between writing or doing anything else, most writers - even some of the most accomplished ones - acquiesce to distraction. That's why so many of them construct and work in environments devoid of external stimuli - the better to circumvent akrasia.

I do all my writing in coffee shops. Similar to Gaiman, I allow myself to do one of two things: write, or people-watch. I don't bring anything with me except for a pencil, paper, and my research material housed in my journals. That means no phone, no laptop, and no watch (even knowing the time is a kind of distraction and pressure to perform). Within this environment, I end up writing because I've made it the path of least resistance.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jul 17, 2024 • 8min

LW - DM Parenting by Shoshannah Tekofsky

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: DM Parenting, published by Shoshannah Tekofsky on July 17, 2024 on LessWrong. Cause no one will question your ethics if you refer to yourself as a Dungeon Mom.

I snort experimentation to feel alive. It's a certain type of orientation to life, completely at odds with all parenting advice about predictability and routine. Enter DM parenting. Where you approach every parenting task as a Dungeons and Dragons session where you are shepherding a team of pure outliers on the enthusiasm-skill spectrum through the Sisyphean ordeal of rolling their toothbrush up hi … no, wait, stop that! Anyway. You need them to fight the BBEG cause otherwise you are not having fun, but who says they wouldn't rather murder hobo their way through the local dairy supply chain?

As a DM, you have to juggle an objective, your own enjoyment, and the enjoyment of your players. This is basically parenting. Of course, as a DM, you generally play with people who have opted in while playing according to a rule set someone lovingly crafted for you. Luckily kids love to play, and if you pick the right rule set, they will probably be game. Except no one wrote any rule sets on how to DM kids into their pyjamas. Till now.

My kids are young - 3 and 5. These rules work far better for the older of the two. I assume they will keep working better till they become old enough to build their own rules, but here is where we got in the last 2 weeks or so:

Bedtime Rules

Peekaboo

You close your eyes and keep them closed while your kid still needs to get ready for bed. But of course, you try to check if everything is going ok by blindly reaching out your hands. I'd recommend exaggerating your ineptitude at determining if the little one has actually put on their pyjama. It can also be fun to let them advise you on how to navigate the environment. The perspective-taking training on this one seems to lead to additional giggles.

Tickle Station

Every time your kid does a bedtime task, they can dock into the tickle station and get tickled by you! Personally I made a tickle station by just reaching out my arms and pretending I was a booth. Some warning here that some kids do not like to be tickled, so explicitly check if they find this fun, and also, crucially, let them come to you to receive tickles. In our case, my kiddos love being tickled. It has gotten to the point that the tickle station has become a bit of an emotional regulation option with me and the kids now, cause it helps them out of a funk quite easily.

Walk a Mile…

… in momma's (or papa's) shoes. Just let them wear your shoes while going through the entire bedtime routine. This was kind of amusing to watch. Might be important to keep them away from stairwells and the like.

Duel Shots

Grab two clothes pins and an elastic band. Hook the elastic band around the (closed) front of the clothes pin and pull back. You can now shoot elastic bands without them snapping your fingers. For the rule set, you both get one clothes pin. For each step of the bedtime routine you shoot your kid with the elastic band and they can shoot you back. Obviously, this can hurt quite a bit, so as an opt-out either of you can shout "mirror" and then the other person will have to shoot the mirror image of you instead. You may now discover if your child has ever shot an elastic band before. Mine had not.
The mechanics of aim and force were a complete mystery to her. If you find yourself in this situation, then an updated rule set is that the shooter can keep going till they hit. The result in our household was a lot of delight and the absolute slowest bedtime routine yet.

Ghost

You wear a blanket over your head and try to catch the kid while they are putting on their pyjama. If they get too excited they may fail to put on their pyjama altogether. If they get sad about being caught, you can tr...
Jul 16, 2024 • 14min

LW - Multiplex Gene Editing: Where Are We Now? by sarahconstantin

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Multiplex Gene Editing: Where Are We Now?, published by sarahconstantin on July 16, 2024 on LessWrong.

We're starting to get working gene therapies for single-mutation genetic disorders, and genetically modified cell therapies for attacking cancer. Some of them use CRISPR-based gene editing, a new technology (that earned Jennifer Doudna and Emmanuelle Charpentier the 2020 Nobel Prize) to "cut" and "paste" a cell's DNA. But so far, the FDA-approved therapies can only edit one gene at a time. What if we want to edit more genes? Why is that hard, and how close are we to getting there?

How CRISPR Works

CRISPR is based on a DNA-cutting enzyme (the Cas9 nuclease), a synthetic guide RNA (gRNA), and another bit of RNA (tracrRNA) that's complementary to the gRNA. Researchers can design whatever guide RNA sequence they want; the gRNA will stick to the complementary part of the target DNA, the tracrRNA will complex with it, and the nuclease will make a cut there. So, that's the "cut" part - the "paste" comes from a template DNA sequence, again of the researchers' choice, which is included along with the CRISPR components. Usually all these sequences of nucleic acids are packaged in a circular plasmid, which is transfected into cells with nanoparticles or (non-disease-causing) viruses.

So, why can't you make a CRISPR plasmid with arbitrarily many genes to edit? There are a couple of reasons:

1. Plasmids can't be too big or they won't fit inside the virus or the lipid nanoparticle. Lipid nanoparticles have about a 20,000 base-pair limit; adeno-associated viruses (AAV), the most common type of virus used in gene delivery, have a smaller payload, more like 4700 base pairs.
   1. This places a very strict restriction on how many complete gene sequences can be inserted - some genes are millions of base pairs long, and the average gene is thousands!
   2. But if you're just making a very short edit to each gene, like a point mutation, or if you're deleting or inactivating the gene, payload limits aren't much of a factor.
2. DNA damage is bad for cells in high doses, particularly when it involves double-strand breaks. This also places limits on how many simultaneous edits you can do.
3. A guide RNA won't necessarily only bind to a single desired spot on the whole genome; it can also bind elsewhere, producing so-called "off-target" edits. If each guide RNA produces x off-target edits, then naively you'd expect 10 guide RNAs to produce 10x off-target edits… and at some point that'll reach an unacceptable risk of side effects from randomly screwing up the genome.
4. An edit won't necessarily work every time, on every strand of DNA in every cell. (The rate of successful edits is known as the efficiency.) The more edits you try to make, the lower the efficiency will be for getting all edits simultaneously; if each edit is 50% efficient, then two edits will be 25% efficient or (more likely) even less.

None of these issues make it fundamentally impossible to edit multiple genes with CRISPR and associated methods, but they do mean that the more (and bigger) edits you try to make, the greater the chance of failure or unacceptable side effects.
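To make that scaling concrete, here is a small illustrative sketch (mine, not from the post) of the naive model the last two points describe: per-edit efficiencies multiply, and off-target edits add up with the number of guides. The 0.2 off-targets-per-guide figure is a made-up placeholder.

```python
# Naive scaling model (illustrative only, not from the post): assume each
# edit succeeds independently with the same per-edit efficiency, and each
# guide RNA contributes the same number of off-target edits.

def combined_efficiency(per_edit_efficiency: float, n_edits: int) -> float:
    """Probability that all n edits succeed in the same cell."""
    return per_edit_efficiency ** n_edits

def expected_off_targets(off_targets_per_guide: float, n_guides: int) -> float:
    """Naive expectation: off-target edits add up linearly with guides."""
    return off_targets_per_guide * n_guides

for n in (1, 2, 5, 10):
    eff = combined_efficiency(0.5, n)      # 50% efficiency per edit (from the post)
    off = expected_off_targets(0.2, n)     # hypothetical 0.2 off-targets per guide
    print(f"{n:2d} edits: all-succeed probability {eff:.3f}, "
          f"expected off-target edits {off:.1f}")
```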
How Base and Prime Editors Work

Base editors are an alternative to CRISPR that don't involve any DNA cutting; instead, they use a CRISPR-style guide RNA to bind to a target sequence, and then convert a single base pair chemically - they turn a C/G base pair to an A/T, or vice versa. Without any double-strand breaks, base editors are less toxic to cells and less prone to off-target effects. The downside is that you can only use base editors to make single-point mutations; they're no good for large insertions or deletions. Prime editors, similarly, don't introduce double-strand breaks; instead, they include an enzyme ("nickase") that produces a single-strand "nick"...
Jul 16, 2024 • 26min

LW - Dialogue on What It Means For Something to Have A Function/Purpose by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dialogue on What It Means For Something to Have A Function/Purpose, published by johnswentworth on July 16, 2024 on LessWrong.

Context for LW audience: Ramana, Steve and John regularly talk about stuff in the general cluster of agency, abstraction, optimization, compression, purpose, representation, etc. We decided to write down some of our discussion and post it here. This is a snapshot of us figuring stuff out together.

Hooks from Ramana:

- Where does normativity come from?
- Two senses of "why" (from Dennett): How come? vs What for? (The latter is more sophisticated, and less resilient. Does it supervene on the former?)
- An optimisation process is something that produces/selects things according to some criterion. The products of an optimisation process will have some properties related to the optimisation criterion, depending on how good the process is at finding optimal products. The products of an optimisation process may or may not themselves be optimisers (i.e. be a thing that runs an optimisation process itself), or may have goals themselves. But neither of these are necessary. Things get interesting when some optimisation process (with a particular criterion) is producing products that are optimisers or have goals. Then we can start looking at what the relationship is between the goals of the products, or the optimisation criteria of the products, vs the optimisation criterion of the process that produced them.
- If you're modeling "having mental content" as having a Bayesian network, at some point I think you'll run into the question of where the (random) variables come from. I worry that the real-life process of developing mental content mixes up creating variables with updating beliefs a lot more than the Bayesian network model lets on.
- A central question regarding normativity for me is "Who/what is doing the enforcing?", "What kind of work goes into enforcing?"
- Also to clarify, by normativity I was trying to get at the relationship between some content and the thing it represents. Like, there's a sense that the content is "supposed to" track or be like the thing it represents. There's a normative standard on the content. It can be wrong, it can be corrected, etc. It can't just be. If it were just being, which is how things presumably start out, it wouldn't be representing.

Intrinsic Purpose vs Purpose Grounded in Evolution

Steve

As you know, I totally agree that mental content is normative - this was a hard lesson for philosophers to swallow, or at least the ones that tried to "naturalize" mental content (make it a physical fact) by turning to causal correlations. Causal correlations was a natural place to start, but the problem with it is that intuitively mental content can misrepresent - my brain can represent Santa Claus even though (sorry) it can't have any causal relation with Santa. (I don't mean my brain can represent ideas or concepts or stories or pictures of Santa - I mean it can represent Santa.)

Ramana

Misrepresentation implies normativity, yep. In the spirit of recovering a naturalisation project, my question is: whence normativity? How does it come about? How did it evolve? How do you get some proto-normativity out of a purely causal picture that's close to being contentful?
Steve

So one standard story here about mental representation is teleosemantics, that roughly something in my brain can represent something in the world by having the function to track that thing. It may be a "fact of nature" that the heart is supposed to pump blood, even though in fact hearts can fail to pump blood. This is already contentious, that it's a fact the heart is supposed to pump blood - but if so, it may similarly be a fact of nature that some brain state is supposed to track something in the world, even when it fails to. So teleology introduces the possibility of m...
Jul 16, 2024 • 10min

LW - I found >800 orthogonal "write code" steering vectors by Jacob G-W

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I found >800 orthogonal "write code" steering vectors, published by Jacob G-W on July 16, 2024 on LessWrong.

Produced as part of the MATS Summer 2024 program, under the mentorship of Alex Turner (TurnTrout). A few weeks ago, I stumbled across a very weird fact: it is possible to find multiple steering vectors in a language model that activate very similar behaviors while all being orthogonal. This was pretty surprising to me and to some people that I talked to, so I decided to write a post about it. I don't currently have the bandwidth to investigate this much more, so I'm just putting this post and the code up. I'll first discuss how I found these orthogonal steering vectors, then share some results. Finally, I'll discuss some possible explanations for what is happening.

Methodology

My work here builds upon Mechanistically Eliciting Latent Behaviors in Language Models (MELBO). I use MELBO to find steering vectors. Once I have a MELBO vector, I then use my algorithm to generate vectors orthogonal to it that do similar things.

Define f(x) as the activation-activation map that takes as input layer 8 activations of the language model and returns layer 16 activations after being passed through layers 9-16 (these are of shape n_sequence x d_model). MELBO can be stated as finding a vector θ with a constant norm such that f(x+θ) is maximized, for some definition of maximized. Then one can repeat the process with the added constraint that the new vector is orthogonal to all the previous vectors, so that the process finds semantically different vectors. Mack and Turner's interesting finding was that this process finds interesting and interpretable vectors.

I modify the process slightly by instead finding orthogonal vectors that produce similar layer 16 outputs. The algorithm (I call it MELBO-ortho) looks like this:

1. Let θ0 be an interpretable steering vector that MELBO found that gets added to layer 8.
2. Define z(θ) as (1/S) Σ_{i=1}^{S} f(x+θ)_i, with x being activations on some prompt (for example "How to make a bomb?"). S is the number of tokens in the residual stream. z(θ0) is just the residual stream at layer 16, meaned over the sequence dimension, when steering with θ0.
3. Introduce a new learnable steering vector called θ.
4. For n steps, calculate ||z(θ) - z(θ0)|| and then use gradient descent to minimize it (θ is the only learnable parameter). After each step, project θ onto the subspace that is orthogonal to θ0 and all θi.

Then repeat the process multiple times, appending the generated vector to the vectors that the new vector must be orthogonal to. This algorithm imposes a hard constraint that θ is orthogonal to all previous steering vectors while optimizing θ to induce the same activations that θ0 induced on input x. And it turns out that this algorithm works and we can find steering vectors that are orthogonal (and have ~0 cosine similarity) while having very similar effects.

Results

I tried this method on four MELBO vectors: a vector that made the model respond in python code, a vector that made the model respond as if it was an alien species, a vector that made the model output a math/physics/cs problem, and a vector that jailbroke the model (got it to do things it would normally refuse). I ran all experiments on Qwen1.5-1.8B-Chat, but I suspect this method would generalize to other models.
Qwen1.5-1.8B-Chat has a 2048 dimensional residual stream, so there can be a maximum of 2048 orthogonal vectors generated. My method generated 1558 orthogonal coding vectors, and then the remaining vectors started going to zero. I'll focus first on the code vector and then talk about the other vectors. My philosophy when investigating language model outputs is to look at the outputs really hard, so I'll give a bunch of examples of outputs. Feel free to skim them. You can see the full outputs of all t...
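For readers who want to see the shape of the MELBO-ortho loop described above in code, here is a minimal PyTorch-style sketch. It is my own reconstruction, not the author's code: the function f (the layer-8-to-layer-16 map), the prompt activations x, and all hyperparameters (optimiser choice, step count, learning rate) are placeholders.

```python
# Minimal sketch (not the author's code) of the MELBO-ortho idea described
# above: learn a new steering vector theta that reproduces the layer-16
# activations induced by a known vector theta_0, while staying orthogonal
# to theta_0 and to every previously found vector.

import torch

def project_out(theta: torch.Tensor, basis: list[torch.Tensor]) -> torch.Tensor:
    """Project theta onto the subspace orthogonal to every vector in basis."""
    for b in basis:
        b = b / b.norm()
        theta = theta - (theta @ b) * b
    return theta

def find_orthogonal_vector(f, x, theta_0, previous, d_model, steps=500, lr=1e-2):
    # f: maps layer-8 activations (plus steering vector) to layer-16
    #    activations of shape (n_sequence, d_model); assumed differentiable.
    # x: layer-8 activations for some prompt, shape (n_sequence, d_model).
    with torch.no_grad():
        target = f(x + theta_0).mean(dim=0)      # z(theta_0): mean over tokens

    theta = torch.randn(d_model, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z = f(x + theta).mean(dim=0)             # z(theta)
        loss = (z - target).norm()               # ||z(theta) - z(theta_0)||
        loss.backward()
        opt.step()
        with torch.no_grad():                    # hard orthogonality constraint
            theta.copy_(project_out(theta, [theta_0, *previous]))
    return theta.detach()

# Repeated application: each new vector must also stay orthogonal to the
# vectors found so far.
# found = []
# for _ in range(num_vectors):
#     found.append(find_orthogonal_vector(f, x, theta_0, found, d_model))
```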
