The Nonlinear Library

The Nonlinear Fund
Jul 2, 2024 • 17min

LW - An AI Race With China Can Be Better Than Not Racing by niplav

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An AI Race With China Can Be Better Than Not Racing, published by niplav on July 2, 2024 on LessWrong.

Frustrated by all your bad takes, I write a Monte-Carlo analysis of whether a transformative-AI-race between the PRC and the USA would be good. To my surprise, I find that it is better than not racing. Advocating for an international project to build TAI instead of racing turns out to be good if the probability of such advocacy succeeding is 20%.

A common scheme for a conversation about pausing the development of transformative AI goes like this:

Abdullah: "I think we should pause the development of TAI, because if we don't it seems plausible that humanity will be disempowered by advanced AI systems."

Benjamin: "Ah, if by "we" you refer to the United States (and its allies, which probably don't stand a chance on their own to develop TAI), then the current geopolitical rival of the US, namely the PRC, will achieve TAI first. That would be bad."

Abdullah: "I don't see how the US getting TAI first changes anything about the fact that we don't know how to align superintelligent AI systems - I'd rather not race to be the first person to kill everyone."

Benjamin: "Ah, so now you're retreating back into your cozy little motte: Earlier you said that "it seems plausible that humanity will be disempowered", now you're acting like doom and gloom is certain. You don't seem to be able to make up your mind about how risky you think the whole enterprise is, and I have very concrete geopolitical enemies at my (semiconductor manufacturer's) doorstep that I have to worry about. Come back with better arguments."

This dynamic is a bit frustrating. Here's how I'd like Abdullah to respond:

Abdullah: "You're right, you're right. I was insufficiently precise in my statements, and I apologize for that. Instead, let us manifest the dream of the great philosopher: Calculemus!

At a basic level, we want to estimate how much worse (or, perhaps, better) it would be for the United States to completely cede the race for TAI to the PRC. I will exclude other countries as contenders in the scramble for TAI, since I want to keep this analysis simple, but that doesn't mean that I don't think they matter. (Although, honestly, the list of serious contenders is pretty short.)

For this, we have to estimate multiple quantities:

1. In worlds in which the US and PRC race for TAI:
   1. The time until the US/PRC builds TAI.
   2. The probability of extinction due to TAI, if the US is in the lead.
   3. The probability of extinction due to TAI, if the PRC is in the lead.
   4. The value of the worlds in which the US builds aligned TAI first.
   5. The value of the worlds in which the PRC builds aligned TAI first.
2. In worlds where the US tries to convince other countries (including the PRC) to not build TAI, potentially including force, and still tries to prevent TAI-induced disempowerment by doing alignment-research and sharing alignment-favoring research results:
   1. The time until the PRC builds TAI.
   2. The probability of extinction caused by TAI.
   3. The value of worlds in which the PRC builds aligned TAI.
3. The value of worlds where extinction occurs (which I'll fix at 0).
4. As a reference point, the value of hypothetical worlds in which there is a multinational exclusive AGI consortium that builds TAI first, without any time pressure, for which I'll fix the mean value at 1.
To properly quantify uncertainty, I'll use the Monte-Carlo estimation library squigglepy (no relation to any office supplies or internals of neural networks). We start, as usual, with housekeeping: As already said, we fix the value of extinction at 0, and the value of a multinational AGI consortium-led TAI at 1 (I'll just call the consortium "MAGIC", from here on). That is not to say that the MAGIC-led TAI future is the best possible TAI future...
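The post's actual analysis is written with squigglepy; as a rough illustration of the comparison structure it sets up, here is a minimal Monte-Carlo sketch in plain numpy. Every distribution, probability, and value below is a placeholder assumption chosen only to show the structure, not one of the author's estimates.

```python
# Minimal sketch of the race-vs-don't-race comparison described above.
# NOT the author's model: the post uses squigglepy, and all parameters
# here are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

VALUE_EXTINCTION = 0.0  # fixed at 0, as in the post
VALUE_MAGIC = 1.0       # MAGIC-led TAI world, mean value fixed at 1

# --- Racing worlds (placeholder distributions) ---
p_doom_us_lead = rng.beta(2, 8, N)       # extinction probability if the US leads
p_doom_prc_lead = rng.beta(3, 7, N)      # extinction probability if the PRC leads
p_us_leads = 0.7                         # assumed chance the US wins the race
value_us_tai = rng.normal(0.9, 0.2, N)   # value of aligned US-led TAI worlds
value_prc_tai = rng.normal(0.6, 0.3, N)  # value of aligned PRC-led TAI worlds

ev_race = (
    p_us_leads * ((1 - p_doom_us_lead) * value_us_tai
                  + p_doom_us_lead * VALUE_EXTINCTION)
    + (1 - p_us_leads) * ((1 - p_doom_prc_lead) * value_prc_tai
                          + p_doom_prc_lead * VALUE_EXTINCTION)
)

# --- Not racing: advocate for an international project (MAGIC) instead ---
p_advocacy_succeeds = 0.2                # the threshold probability the post discusses
p_doom_prc_solo = rng.beta(3, 6, N)      # extinction risk if the PRC builds TAI unopposed
value_prc_solo = rng.normal(0.6, 0.3, N)

ev_no_race = (
    p_advocacy_succeeds * VALUE_MAGIC
    + (1 - p_advocacy_succeeds) * ((1 - p_doom_prc_solo) * value_prc_solo
                                   + p_doom_prc_solo * VALUE_EXTINCTION)
)

print(f"E[value | race]    = {ev_race.mean():.3f}")
print(f"E[value | no race] = {ev_no_race.mean():.3f}")
```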
Jul 2, 2024 • 6min

EA - LLMs cannot usefully be moral patients by LGS

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs cannot usefully be moral patients, published by LGS on July 2, 2024 on The Effective Altruism Forum.

For AI Welfare Debate Week, I thought I'd write up this post that's been juggling around in my head for a while. My thesis is simple: while LLMs may well be conscious (I'd have no way of knowing), there's nothing actionable we can do to further their welfare. Many people I respect seem to take the "anti-anti-LLM-welfare" position: they don't directly argue that LLMs can suffer, but they get conspicuously annoyed when other people say that LLMs clearly cannot suffer. This post is addressed to such people; I am arguing that LLMs cannot be moral patients in any useful sense and we can confidently ignore their welfare when making decisions.

Janus's simulators

You may have seen the LessWrong post by Janus about simulators. This was posted nearly two years ago, and I have yet to see anyone disagree with it. Janus calls LLMs "simulators": unlike hypothetical "oracle AIs" or "agent AIs", the current leading models are best viewed as trying to produce a faithful simulation of a conversation based on text they have seen. The LLMs are best thought of as masked shoggoths. All this is old news. Under-appreciated, however, is the implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy. You can ask the simulacrum whether it is happy or sad. For all you know, though, perhaps a happy simulator is enjoying simulating a sad simulacrum. From the shoggoth's perspective, emulating a happy or sad character is a very similar operation: predict the next token. Instead of outputting "I am happy", the LLM puts a "not" in the sentence: did that token prediction, the "not", cause suffering?

Suppose I fine-tune one LLM on text of sad characters, and it starts writing like a very sad person. Then I fine-tune a second LLM on text that describes a happy author writing a sad story. The second LLM now emulates a happy author writing a sad story. I prompt the second LLM to continue a sad story, and it dutifully does so, like the happy author would have. Then I notice that the text produced by the two LLMs ended up being the same. Did the first LLM suffer more than the second? They performed the same operation (write a sad story). They may even have implemented it using very similar internal calculations; indeed, since they were fine-tuned starting from the same base model, the two LLMs may have very similar weights. Once you remember that both LLMs are just simulators, the answer becomes clear: neither LLM necessarily suffered (or maybe both did), because both are just predicting the next token. The mask may be happy or sad, but this has little to do with the feelings of the shoggoth.

The role-player who never breaks character

We generally don't view it as morally relevant when a happy actor plays a sad character. I have never seen an EA cause area about reducing the number of sad characters in cinema. There is a general understanding that characters are fictional and cannot be moral patients: a person can be happy or sad, but not the character she is pretending to be. Indeed, just as some people enjoy consuming sad stories, I bet some people enjoy roleplaying sad characters.
The point I want to get across is that the LLM's output is always the character and never the actor. This is really just a restatement of Janus's thesis: the LLM is a simulator, not an agent; it is a role-player who never breaks character. It is in principle impossible to speak to the intelligence that is predicting the tokens: you can only see the tokens themselves, which are predicted based on the training data. Perhaps the shoggoth, the intelligence that predicts the next token, is conscious. Perhaps not. This doesn't matter if we ca...
Jul 2, 2024 • 14min

EA - Center for Effective Aid Policy has shut down by MathiasKB

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Center for Effective Aid Policy has shut down, published by MathiasKB on July 2, 2024 on The Effective Altruism Forum.

May 2024 marked the last month of the Center for Effective Aid Policy. This post serves as the public post-mortem. I have strived for it to be interesting to the average forum reader, who may not know much about the cause area. For professionals in development, we have a few private internal write-ups which may be more interesting, such as an overview of development asks we tried[1], their strengths and weaknesses, and our experience advocating for them.

Our mission

Our mission was to improve the cost-effectiveness of development assistance through policy advocacy. Governments spend billions on projects to help the world's poorest, few of them cost-effective. For example, one could propose the use of cash-benchmarking to the ministry or push through a political motion to increase the proportion of spending going to the Least Developed Countries. If one could make even a small part of this very large budget more cost-effective, it would be massively impactful. In October 2022 we were incubated through AIM's Charity Entrepreneurship programme and came out with $160,000 to get started.

How far did we get?

The first months

Barely a month after receiving funding, we noticed Sweden's new government was likely to cut the aid budget. The cut would hinge on one party breaking its campaign promise not to cut; perhaps we could campaign for the party to keep its promise. Over two hectic weeks we put together a write-in campaign for dissatisfied voters. Our execution was not good enough (too little, too late), and we were not able to get voters to write in. Sweden cut its aid spending, and we moved on. Figuring out where to focus from there was difficult. We tried many things across different geographies, but nothing we did seemed to get much of a response from civil servants and decision makers. Writing credible reports was difficult. We were still learning the development world's many acronyms, and were struggling to find partners whose trustworthiness we could lean on.

Things pick up

Week by week our network and knowledge expanded. With it came opportunities to get our points across. Through monumental luck we got to present on cost-effective development aid for His Majesty's Treasury in the United Kingdom. In Denmark we moderated our first public debate between MPs on improving the cost-effectiveness of development. We eventually fell into a groove of spending the majority of our time writing briefs, taking meetings, and networking. Between events and meetings, we spent extensive time researching and preparing. Before our first meeting with one Dutch MP, for example, we did message testing on 400 voters, broke the answers down by political affiliation, and were able to show with data what voters thought of our ideas. (Cash-benchmarking was popular, cash-transfers less so!) In our record month we had meetings in three countries' parliaments (though it certainly was an outlier!). Our record event had almost 300 attendees and a keynote speech from the Dutch foreign ministry's chief of science. A little over a year in we got our first intermediary success. The election programmes of two Dutch political parties now stated their intention to increase the proportion of ODA going to the Least Developed Countries.
The decision to shut down

Our execution eventually became good enough that we got to sit in front of the busy people at the very top, whom we needed to persuade. Speaking to these people, we became pessimistic about our odds. Decision makers just weren't buying what we were selling. You can lead a horse to water, but you can't make it drink. Many were skeptical that the RCT-driven approach we recommended would lead to the best outcomes. Those who were on boa...
Jul 2, 2024 • 21min

LW - Decomposing the QK circuit with Bilinear Sparse Dictionary Learning by keith wynroe

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Decomposing the QK circuit with Bilinear Sparse Dictionary Learning, published by keith wynroe on July 2, 2024 on LessWrong.

This work was produced as part of Lee Sharkey's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort

Intro and Motivation

Sparse dictionary learning (SDL) has attracted a lot of attention recently as a method for interpreting transformer activations. These methods demonstrate that model activations can often be explained using a sparsely-activating, overcomplete set of human-interpretable directions. However, despite its success in explaining many components, the application of SDL to interpretability is relatively nascent, and it has yet to be applied to some model activations. In particular, intermediate activations of attention blocks have yet to be studied, and provide challenges for standard SDL methods.

The first challenge is bilinearity: SDL is usually applied to individual vector spaces at individual layers, so we can simply identify features as a direction in activation space. But the QK circuits of transformer attention layers are different: They involve a bilinear form followed by a softmax. Although simply applying sparse encoders to the keys and queries[1] could certainly help us understand the "concepts" being used by a given attention layer, this approach would fail to explain how the query-features and key-features interact bilinearly. We need to understand which keys matter to which queries.

The second challenge is attention-irrelevant variance: A lot of the variance in the attention scores is irrelevant to the attention pattern because it is variance in low scores which are softmaxed to zero; this means that most of the variability in the keys and queries is irrelevant for explaining downstream behaviour[2]. The standard method of reconstructing keys and queries would therefore waste capacity on what is effectively functionally irrelevant noise.

To tackle these two problems (bilinearity and attention-irrelevant variance), we propose a training setup which only reconstructs the dimensions of the keys and queries that most affect the attention pattern.

Training Setup

Our training process has two steps:

Step 1: Reconstructing the attention pattern with key- and query-encoder-decoder networks
Step 2: Finding a condensed set of query-key feature pairs by masking

Step 1: Reconstructing the attention pattern with key- and query-transcoders

Architecture

Our first training step involves training two sparse dictionaries in parallel (one for the keys and one for the queries). The dictionaries both take in the layer-normalized residual stream at a given layer (normalised_resid_pre_i) and each output a [n_head * d_head] vector, representing the flattened keys and queries[3].

Figure 1: High-level diagram of our training set-up

Loss functions

However, rather than penalising the reconstruction loss of the keys and queries explicitly, we can use these keys and queries to reconstruct the original model's attention pattern. To train the reconstructed attention pattern, we used several different losses: KL divergence between the attention pattern (using reconstructed keys and reconstructed queries) and the ground-truth attention pattern produced by the original model.
We also added two auxiliary reconstruction losses, both for early-training-run stability and to ensure our transcoders do not learn to reconstruct the keys and queries with an arbitrary rotation applied (since this would still produce the same attention scores and patterns): KL divergence between the attention pattern (using reconstructed keys and the original model's queries) and the ground-truth attention pattern produced by the original model. KL divergence between the attention pattern (using the original model's keys and the reconstructed queries) and the groun...
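To make the loss structure concrete, here is a minimal PyTorch sketch of the objective described above: two sparse encoder-decoder networks map the residual stream to reconstructed keys and queries, and the training signal is the KL divergence between attention patterns rather than a direct key/query reconstruction loss. The dictionary sizes, the L1 sparsity penalty, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the loss structure described above; NOT the authors'
# code. Dictionary sizes, the L1 sparsity penalty, and hyperparameters are
# assumptions made for the example.
import torch
import torch.nn.functional as F
from torch import nn

class SparseTranscoder(nn.Module):
    """Maps the layer-normalised residual stream to a flattened [n_head * d_head] vector."""
    def __init__(self, d_model: int, d_dict: int, d_out: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_out)

    def forward(self, resid: torch.Tensor):
        acts = F.relu(self.enc(resid))  # sparse, non-negative feature activations
        return self.dec(acts), acts

def attn_log_pattern(q, k, n_head, d_head, causal_mask):
    # q, k: [batch, seq, n_head * d_head] -> log attention pattern: [batch, n_head, seq, seq]
    batch, seq, _ = q.shape
    q = q.view(batch, seq, n_head, d_head).transpose(1, 2)
    k = k.view(batch, seq, n_head, d_head).transpose(1, 2)
    scores = (q @ k.transpose(-1, -2)) / d_head ** 0.5
    scores = scores.masked_fill(causal_mask, -1e9)  # large negative keeps the KL term finite
    return scores.log_softmax(dim=-1)

def qk_loss(resid, q_orig, k_orig, q_tc, k_tc, n_head, d_head, causal_mask, l1_coeff=1e-3):
    q_hat, q_acts = q_tc(resid)
    k_hat, k_acts = k_tc(resid)
    # Ground-truth pattern from the original model's keys and queries.
    target = attn_log_pattern(q_orig, k_orig, n_head, d_head, causal_mask).exp()
    kl = lambda log_p: F.kl_div(log_p, target, reduction="batchmean")
    return (
        kl(attn_log_pattern(q_hat, k_hat, n_head, d_head, causal_mask))     # both reconstructed
        + kl(attn_log_pattern(q_orig, k_hat, n_head, d_head, causal_mask))  # aux: reconstructed keys, original queries
        + kl(attn_log_pattern(q_hat, k_orig, n_head, d_head, causal_mask))  # aux: reconstructed queries, original keys
        + l1_coeff * (q_acts.abs().mean() + k_acts.abs().mean())            # assumed sparsity penalty
    )

# Toy usage (shapes are placeholders):
# n_head, d_head, d_model, seq = 4, 16, 64, 8
# q_tc = SparseTranscoder(d_model, 1024, n_head * d_head)
# k_tc = SparseTranscoder(d_model, 1024, n_head * d_head)
# causal_mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
# loss = qk_loss(resid, q_orig, k_orig, q_tc, k_tc, n_head, d_head, causal_mask)
```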
Jul 2, 2024 • 7min

EA - High Impact Engineers is Transitioning to a Volunteer-Led Model by Jessica Wen

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: High Impact Engineers is Transitioning to a Volunteer-Led Model, published by Jessica Wen on July 2, 2024 on The Effective Altruism Forum.

Summary

After over 2 years of operations, High Impact Engineers (HI-Eng) is reverting to a volunteer-led organisational model due to a middling impact outcome and a lack of funding. We wanted to thank all our subscribers, supporters, and contributors for being the driving force behind HI-Eng's achievements, which you can read about in our Impact Report.

What is High Impact Engineers?

High Impact Engineers (HI-Eng for short, pronounced high-enj) is an organisation dedicated to helping (physical - i.e. non-software) engineers increase their ability to have an outsized positive impact through their work.

Why Is HI-Eng Winding Down?

In December 2023, we sent out a community survey and solicited case studies and testimonials to evaluate our impact, which we wrote up in our Impact Report. As shown in the report, there is some evidence of behavioural and attitudinal changes in our members towards more impactful career outcomes due to interactions with our programmes, as well as some ongoing career transitions that we supported to some extent, but even after consultations with grantmakers and other community builders, we found it difficult to determine if this amount of impact would meet the bar for ongoing funding. As a result, we decided to (re-)apply for funding from the major EA funds (i.e. EAIF and Open Philanthropy), and they ended up deciding to not fund High Impact Engineers. Since our runway from the previous funding round was so short, we decided against trying to hire someone else to take over running HI-Eng, and the team is moving on to new opportunities.

However, we still believe that engineers in EA are a valuable and persistently underserved demographic, and that this latent potential can be realised by providing a hub for engineers in EA to meet other like-minded engineers and find relevant resources. Therefore, we decided to maintain the most valuable and impactful programmes through the help of volunteers.

Lessons Learnt

There are already many resources available for new community builders (e.g. the EA Groups Resource Centre, this, this, this, and this EA Forum post, and especially this post by Sofia Balderson), so we don't believe that there is much we can add that hasn't already been said. However, here are some lessons we think are robustly good:

1. Having a funding cycle of 6 months is too short.
2. If you're looking to get set up and running quickly, getting a fiscal sponsor is great. We went with the Players Philanthropy Fund, but there are other options (including Rethink Priorities and maybe your national EA group).
3. Speak to other community builders, and ask for their resources! They're often more than happy to give you a copy of their systems, processes and documentation (minus personal data).
4. Pay for monthly subscriptions to software when setting up, even if it's cheaper to get an annual subscription. You might end up switching to a different software further down the line, and it's easier (and cheaper) to cancel a monthly subscription.
5. Email each of your subscriptions' customer service to ask for a non-profit discount (if you have non-profit status). They can save you up to 50% of the ticket price.
(Jessica will write up her own speculative lessons learnt in a future forum post). What Will HI-Eng Look Like Going Forward? Jessica will continue managing HI-Eng as a volunteer, and is currently implementing the following changes in our programmes: Email newsletter: the final HI-Eng newsletter was sent in May. Future impactful engineering opportunities can be found on the 80,000 Hours job board or the EA Opportunities board. Any other impactful engineering jobs can be submitted to these boards ( submission...
Jul 2, 2024 • 14min

LW - In Defense of Lawyers Playing Their Part by Isaac King

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: In Defense of Lawyers Playing Their Part, published by Isaac King on July 2, 2024 on LessWrong. This is a linkpost for In Defense of Lawyers Playing Their Part. Michael Huemer writes about why he believes it's wrong for lawyers to pursue unjust legal outcomes. It's a good article, and one of the best defenses of this position I've seen. Still, I think this argument is mistaken. The reason why we require lawyers to fight for "their side" even if they believe they're in the wrong is to minimize the opportunity for bias. Imagine if all trials were bench trials, decided by only one person as the judge. Even if they're taught to be as objective as possible, there would still be significant concerns about unconscious bias. One person only has one set of experiences to draw on, which is necessarily not very representative of the full range of experiences. And in some ways this problem becomes worse the more training the judge is given, since it filters the pool of valid people down to a small subset of the population. The chosen solution to this is to instead have the important cases decided by a jury, randomly[1] selected from the population. The jury is then instructed that they must come to a unanimous decision, and are allowed an arbitrarily-long time to discuss the case. This prevents a tyranny of the majority, while still allowing a diverse range of perspectives to have a voice in the discussion. Any prospective juror who seems likely to be so biased that they would vote in a predetermined way regardless of the evidence is removed from consideration during voir dire. (This step does reduce the representativeness of the jury, but the assumption is that for any group of people who hold a particular perspective, there will be members of that group who are not so biased as to be selected out.[2]) But this doesn't solve all problems. The jury is still only human, and if they're presented with facts that are biased in only one direction, they're more likely to vote in that direction. If lawyers were instructed to present an unbiased case to the jury, this would provide a significant incentive for the less ethical lawyers to not do as instructed, using a misleading presentation of data to bias the jury towards their side. This is a bad incentive to give people. It would also lead to copious accusations from the losing side that the other side's lawyer was presenting biased facts, which would necessitate some process to sort them out every time, even if both lawyers were perfectly objective. So instead, we tell the lawyers to go nuts. Be as biased as possible, and, as long as they're equally skilled and there aren't background factors that favor one position over the other, this ensures that each presented position is equally far from the truth. The jury now has a fair overview of both sides of the case, without a malicious lawyer being able to advantage one over the other.[3] Michael provides 5 arguments in favor of this position - that lawyers are obligated to do their best even for a client they believe is guilty - then attempts to refute them all. I'll go through them individually. 2.1. The epistemological problem Michael argues that lawyers can know with high confidence that their clients are guilty, giving the example of Benjamin Courvoisier. Thus, "I'm not sure so I should just defend my client" is not an excuse. 
In the case of Benjamin Courvoisier, Benjamin confessed to the lawyer, presumably under the expectation that the lawyer would not publicly share this information. If lawyers were duty-bound to share any private confession given to them, all but the dumbest criminals would simply stop giving private confessions. The overall effect on convictions would be negligible. But cases like Benjamin Courvoisier are few and far between. Using this example to argue that de...
Jul 2, 2024 • 21min

EA - Carl Shulman on the moral status of current and future AI systems by rgb

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Carl Shulman on the moral status of current and future AI systems, published by rgb on July 2, 2024 on The Effective Altruism Forum.

In which I curate and relate great takes from 80k

As artificial intelligence advances, we'll increasingly urgently face the question of whether and how we ought to take into account the well-being and interests of AI systems themselves. In other words, we'll face the question of whether AI systems have moral status.[1] In a recent episode of the 80,000 Hours podcast, polymath researcher and world-model-builder Carl Shulman spoke at length about the moral status of AI systems, now and in the future. Carl has previously written about these issues in Sharing the World with Digital Minds and Propositions Concerning Digital Minds and Society, both co-authored with Nick Bostrom. This post highlights and comments on ten key ideas from Shulman's discussion with 80,000 Hours host Rob Wiblin.

1. The moral status of AI systems is, and will be, an important issue (and it might not have much to do with AI consciousness)

The moral status of AI is worth more attention than it currently gets, given its potential scale:

Yes, we should worry about it and pay attention. It seems pretty likely to me that there will be vast numbers of AIs that are smarter than us, that have desires, that would prefer things in the world to be one way rather than another, and many of which could be said to have welfare, that their lives could go better or worse, or their concerns and interests could be more or less respected. So you definitely should pay attention to what's happening to 99.9999% of the people in your society.

Notice that Shulman does not say anything about AI consciousness or sentience in making this case. Here and throughout the interview, Shulman de-emphasizes the question of whether AI systems are conscious, in favor of the question of whether they have desires, preferences, interests. Here he is following a cluster of views in philosophy that hold that consciousness is not necessary for moral status. Rather, an entity, even if it is not conscious, can merit moral consideration if it has a certain kind of agency: preferences, desires, goals, interests, and the like[2]. (This more agency-centric perspective on AI moral status has been discussed in previous posts; for a dip into recent philosophical discussion on this, see the substack post 'Agential value' by friend of the blog Nico Delon.) Such agency-centric views are especially important for the question of AI moral patienthood, because it might be clear that AI systems have morally-relevant preferences and desires well before it's clear whether or not they are conscious.

2. While people have doubts about the moral status of current AI systems, they will attribute moral status to AI more and more as AI advances.

At present, Shulman notes, "the general public and most philosophers are quite dismissive of any moral importance of the desires, preferences, or other psychological states, if any exist, of the primitive AI systems that we currently have." But Shulman asks us to imagine an advanced AI system that is behaviorally fairly indistinguishable from a human - e.g., from the host Rob Wiblin.
But going forward, when we're talking about systems that are able to really live the life of a human - so a sufficiently advanced AI that could just imitate, say, Rob Wiblin, and go and live your life, operate a robot body, interact with your friends and your partners, do your podcast, and give all the appearance of having the sorts of emotions that you have, the sort of life goals that you have. One thing to keep in mind is that, given Shulman's views about AI trajectories, this is not just a thought experiment: this is a kind of AI system you could see in your lifetime. Shulman also asks us to imagine a system like ...
Jul 2, 2024 • 16min

AF - OthelloGPT learned a bag of heuristics by jylin04

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OthelloGPT learned a bag of heuristics, published by jylin04 on July 2, 2024 on The AI Alignment Forum.

Work performed as a part of Neel Nanda's MATS 6.0 (Summer 2024) training program.

TLDR

This is an interim report on reverse-engineering Othello-GPT, an 8-layer transformer trained to take sequences of Othello moves and predict legal moves. We find evidence that Othello-GPT learns to compute the board state using many independent decision rules that are localized to small parts of the board. Though we cannot rule out that it also learns a single succinct algorithm in addition to these rules, our best guess is that Othello-GPT's learned algorithm is just a bag of independent heuristics.

Board state reconstruction

1. Direct attribution to linear probes indicates that the internal board representation is frequently up- and down-weighted during a forward pass.
2. Case study of a decision rule:
   1. MLP Neuron L1N421 represents the decision rule: If the move A4 was just played AND B4 is occupied AND C4 is occupied, update B4+C4+D4 to "theirs". This rule does not generalize to translations across the board.
   2. Another neuron, L0377, participates in the implementation of this rule by checking whether B4 is occupied, and inhibiting the activation of L1N421 if not.

Legal move prediction

1. A subset of neurons in mid to late MLP layers classify board configurations that are sufficient to make a certain move legal with an F1-score above 0.99. These neurons have high direct attribution to the logit for that move, and are causally relevant for legal move prediction.
2. Logit lens suggests that legal move predictions gradually solidify during a forward pass.
3. Some MLP neurons systematically activate at certain times in the game, regardless of the moves played so far. We hypothesize that these neurons encode heuristics about moves that are more probable in specific phases (early/mid/late) of the game.

Review of Othello-GPT

Othello-GPT is a transformer with 25M parameters trained on sequences of random legal moves in the board game Othello as inputs[1] to predict legal moves[2]. How it does this is a black box that we don't understand. Its claim to fame is that it supposedly

1. Learns an internal representation of the board state;
2. Uses it to predict legal moves,

which, if true, resolves the black box in two[3].

The evidence for the first claim is that linear probes work. Namely, for each square of the ground-truth game board, if we train a linear classifier to take the model's activations at layer 6 as input and predict logits for whether that square is blank, "mine" (i.e. belonging to the player whose move it currently is) or "yours", the probes work with high accuracy on games not seen in training. The evidence for the second claim is that if we edit the residual stream until the probe's outputs change, the model's own output at the end of layer 7 becomes consistent with legal moves that are accessible from the new board state.

However, we don't yet understand what's going on in the remaining black boxes. In particular, although it would be interesting if Othello-GPT emergently learned to implement them via algorithms with relatively short description lengths, the evidence so far doesn't rule out the possibility that they could be implemented via a bag of heuristics instead.
Project goal

Our goal in this project was simply to figure out what's going on in the remaining black boxes.

1. What's going on in box #1 - how does the model compute the board representation?
   1. How does the model decide if a cell is blank or not blank?
   2. How does the model decide if a cell is "mine" or "yours"?
2. What's going on in box #2 - how does the model use the board representation to pick legal moves?

Results on box #1: Board reconstruction

A circuit for how the model computes if a cell is blank or not blan...
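For readers unfamiliar with the probing setup the report builds on, here is a minimal sketch of a board-state linear probe of the kind described above: one linear classifier per square, reading the residual stream at layer 6 and predicting blank / "mine" / "yours". The residual-stream width and all names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the board-state linear probe referred to above; not the
# original probe-training code. The residual width is an assumption.
import torch
import torch.nn.functional as F
from torch import nn

D_MODEL = 512     # assumed residual-stream width of Othello-GPT
N_SQUARES = 64    # 8x8 Othello board
N_CLASSES = 3     # blank / "mine" / "yours"

class BoardProbe(nn.Module):
    """One linear classifier per square, applied to layer-6 residual activations."""
    def __init__(self):
        super().__init__()
        self.probe = nn.Linear(D_MODEL, N_SQUARES * N_CLASSES)

    def forward(self, resid_layer6: torch.Tensor) -> torch.Tensor:
        # resid_layer6: [batch, d_model] -> logits: [batch, n_squares, n_classes]
        return self.probe(resid_layer6).view(-1, N_SQUARES, N_CLASSES)

def probe_loss(probe: BoardProbe, resid: torch.Tensor, board: torch.Tensor) -> torch.Tensor:
    # board: [batch, n_squares] with labels 0 = blank, 1 = "mine", 2 = "yours"
    logits = probe(resid)
    return F.cross_entropy(logits.flatten(0, 1), board.flatten())
```

"Direct attribution" to such a probe, as discussed in the report, then measures how much each model component writes along the probe's class directions.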
Jul 2, 2024 • 2min

EA - Announcing a New S-Risk Introductory Fellowship by Alistair Webster

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing a New S-Risk Introductory Fellowship, published by Alistair Webster on July 2, 2024 on The Effective Altruism Forum.

The Center for Reducing Suffering (CRS) has opened applications for a new online fellowship, designed to familiarise more individuals with the core ideas of reducing S-Risks (Risks of Astronomical Suffering). This program is intended for individuals who:

Are committed to reducing suffering effectively
Have an interest in moral philosophy
Are familiar with the core ideas of Effective Altruism
Have a degree of understanding of key concepts like cause prioritisation and cause neutrality. (For more information on cause neutrality, you can refer to this essay.)

The fellowship's curriculum will be broader than the Center on Long-Term Risk's existing S-Risk fellowship. We envision that graduates of the new CRS fellowship will be in a better position to potentially proceed onto CLR's fellowship, contribute to s-risk research, and strengthen the s-risk community going forward.

Program Details:

Duration: The fellowship is free of charge and conducted entirely online
Availability: Spots are limited in our initial cohorts to ensure a quality learning experience
Commitment: Expected to be approx 2-4 hours per week for six weeks
Start date: 2nd September 2024

The curriculum will cover topics including:

What are s-risks?
Arguments for and against a focus on s-risks
How can we reduce s-risks?
Risk factors for s-risks
Worst-case AI safety
Improving institutional decision-making
Career paths and options
Staying motivated and mentally healthy while working on reducing suffering

To apply please fill in the application form on the CRS website. Applications will close on July 31st.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
