The Nonlinear Library

The Nonlinear Fund
May 1, 2024 • 4min

LW - ACX Covid Origins Post convinced readers by ErnestScribbler

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ACX Covid Origins Post convinced readers, published by ErnestScribbler on May 1, 2024 on LessWrong. ACX recently posted about the Rootclaim Covid origins debate, coming out in favor of zoonosis. Did the post change the minds of those who read it, or not? Did it change their judgment in favor of zoonosis (as was probably the goal of the post), or conversely did it make them think Lab Leak was more likely (as the "Don't debate conspiracy theorists" theory claims)? I analyzed the ACX survey to find out, by comparing responses before and after the post came out. The ACX survey asked readers whether they think the origin of Covid is more likely natural or Lab Leak. The ACX survey went out March 26th and was open until about April 10th. The Covid origins post came out March 28th, and the highlights on April 9th. So we can compare people who responded before the origins post came out to those who responded after[1]. We should be careful, though, since those who fill out the survey earlier could differ from those who fill it out later, and this could create a correlation which isn't causal. I used a Regression Discontinuity Design on the time of the response to see if there was a break in the trend of responses right at the time the Covid post went up. Figuratively, this compares respondents "right before" the post to those "right after", so it can help assuage fears of confounding. I find that the post made readers more likely to think that the origin was indeed zoonosis. And this is highly significant. Here are the results, in charts. Analysis: Here is the number of responses over time, with the timings of the posts highlighted. We'll mostly just need the timing of the Covid origins post, which is around response 4,002. I'm assuming that readers who responded to the survey after the post went up had read the post before responding. This is the post engagement data[1], which shows that within a few days of posting, most views of the post had already taken place. The ACX Survey asked respondents what they thought about Covid origins. I subtracted 3 from the questionnaire response, to analyze a centered scale, for convenience. Here are the sliding window averages of 1,000 responses. There are some fluctuations, but quite clearly there is a break in the trend at the time of the post, with readers starting to give scores more towards zoonosis. It looks like the post lowered responses by about 0.5 points (this takes time to transition in the chart, because of the sliding window). There's not enough data to eyeball anything about the Comment Highlights post. Another way to look at the same data is using not a sliding window but a cumulative sum, where the local slope is the average response. I detrended this, so that it has 0 slope before the Covid post, again just for convenience. We very clearly see the break in the trend, and the slope comes out to -0.52 points, similar to before. This is almost half a standard deviation, which is a pretty large effect. Needless to say, it is extremely statistically significant. In fact, this effect made the Covid origins question the most highly correlated with response order of all survey questions. As a placebo test, I also checked whether this effect exists for other responses, even ones correlated with Covid origins before the post, like views on Abortion or Political Spectrum. I found nothing that looks nearly this clear.
The effects are much smaller, if any, and not highly significant. I was curious whether the post also had a polarizing effect, where readers became more likely to hold a stronger view after the post, i.e. Lab Leak proponents becoming more certain of Lab Leak, and zoonosis proponents becoming more certain of zoonosis. I don't find much support for this. The sliding window standard deviation of responses does not increase after the post. I'm not sur...
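For readers who want to reproduce this kind of check on the public ACX survey data, here is a minimal sketch of the two views described above: a sliding-window average of the centered responses, and a simple regression with a break indicator at the time of the origins post. This is not the author's code; the file name, column name and cutoff index are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative assumptions: responses are in submission order, the Covid-origins
# answer is on a 1-5 scale, and the origins post lands around response 4,002.
CUTOFF = 4002
df = pd.read_csv("acx_survey.csv")              # hypothetical file name
y = df["covid_origins"] - 3                     # center the 1-5 scale at 0
order = np.arange(len(y))

# View 1: sliding-window average over 1,000 responses.
rolling_mean = y.rolling(window=1000).mean()

# View 2: regression-discontinuity-style fit with an indicator for
# responses submitted after the post went up.
X = sm.add_constant(pd.DataFrame({
    "order": order,
    "after_post": (order >= CUTOFF).astype(int),
}))
fit = sm.OLS(y, X).fit()
print(fit.params["after_post"])                 # the post reports a break of about -0.5
```

The detrended cumulative-sum view in the post corresponds to taking y.cumsum() and subtracting a straight line whose slope is the average response before the post.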
May 1, 2024 • 2min

LW - Shane Legg's necessary properties for every AGI Safety plan by jacquesthibs

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shane Legg's necessary properties for every AGI Safety plan, published by jacquesthibs on May 1, 2024 on LessWrong. I've been going through the FAR AI videos from the alignment workshop in December 2023. I'd like people to discuss their thoughts on Shane Legg's 'necessary properties' that every AGI safety plan needs to satisfy. The talk is only 5 minutes; give it a listen. Otherwise, here are some of the details: All AGI Safety plans must solve these problems (necessary properties to meet at the human level or beyond): (1) a good world model, (2) good reasoning, and (3) a specification of the values and ethics to follow. All of these require good capabilities, meaning capabilities and alignment are intertwined. Shane thinks future foundation models will solve conditions 1 and 2 at the human level. That leaves condition 3, which he sees as solvable if you want fairly normal human values and ethics. Shane basically thinks that if the above necessary properties are satisfied at a competent human level, then we can construct an agent that will consistently choose the most value-aligned actions, and that you can do this via a cognitive loop that scaffolds the agent. Shane says at the end of the talk: If you think this is a terrible idea, I want to hear from you. Come talk to me afterwards and tell me what's wrong with this idea. Since many of us weren't at the workshop, I figured I'd share the talk here to discuss it on LW. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
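The talk itself stays at the conceptual level, so the following is only a rough sketch of what a cognitive loop of this kind might look like; the wrapper interface and function names (propose_action, rate_alignment) are placeholders of my own, not anything Shane specifies.

```python
# Hypothetical sketch: the agent uses its world model and reasoning to propose
# candidate actions, scores each against an explicit value specification
# (condition 3), and always executes the best-scoring candidate.

def cognitive_loop(model, value_spec, observation, n_candidates=5):
    candidates = [
        model.propose_action(observation, seed=i)            # world model + reasoning
        for i in range(n_candidates)
    ]
    scored = [
        (model.rate_alignment(action, value_spec), action)   # value specification
        for action in candidates
    ]
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action
```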
May 1, 2024 • 12min

LW - LessWrong Community Weekend 2024, open for applications by UnplannedCauliflower

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LessWrong Community Weekend 2024, open for applications, published by UnplannedCauliflower on May 1, 2024 on LessWrong. Main event page. Friday 13th September - Monday 16th September 2024 is the 11th annual LessWrong Community Weekend (LWCW) in Berlin. This is the world's largest rationalist social gathering, which brings together 250+ aspiring rationalists from across Europe and beyond for four days of intellectual exploration, socialising and fun. We're expanding to 250+ participants and taking over the whole hostel. This year it will only be us during the event: a huge variety of spaces to talk, relax and have fun, with a higher sense of security and freedom. The EAGx and LWCW happen this year during the same weekend, due to limited availability of conference centres. It is an unkind choice, having to pick one community over the other. So that you can freely decide which sessions to attend and where, we will offer a reduced ticket that includes 3x bed & breakfast at the hostel, for you to enjoy the unique LWCW atmosphere and community as well as join the talks at EAGx during the day. We are delighted to have Anna Riedl for this year's keynote. Anna is a cognitive scientist and conducts research on rationality under radical uncertainty, a phenomenon at the intersection of psychology, economics, neuroscience and artificial intelligence, directly relevant for improving human and institutional decision-making in real life. That said, the majority of the content will be participant-driven in an unconference style: on Friday afternoon we put up six wall-sized daily planners and by Saturday morning the attendees fill them up with 100+ workshops, talks and activities of their own devising. Most are prepared upfront but some are just made up on the spot when inspiration hits. Previous years' schedules have included: Double Cruxing, Hamming Circles, Gendlin Focusing, Applied Rationality workshops, Circling, Authentic Relating games, improvisation theater, introduction to stand-up comedy, writing rationalist fiction, dance workshops, a cappella singing, icebreaker games, lightning talks, celebrating-failure groups, giant outdoor chess, Penultima, Dungeons & Dragons, Kung Fu basics, board games, breathwork workshops, ecstatic dancing, Radical Honesty workshops, playfighting for adults, polyamory and relationships workshops, a Sex Q&A roundtable, quantified-self workshops, moral philosophy debates, AI safety Q&A, how to handle fear of AI Doom, value drift in EA, the neurobiology of psychedelics, the science of longevity, morning runs and yoga, meditation in the rooftop winter garden, night-time swimming, and bedtime story readings. Personal note from Henry: If things like ecstatic dancing, radical honesty and polyamory workshops sound too intense for you, rest assured everything is optional. I'm a nerd and very awkward so a lot of this stuff terrifies me. The event takes place in the natural environs of Lake Wannsee on the outskirts of Berlin, so you can spend time recharging in between making new friends by hiking in the forests, sunbathing or swimming in the lake. LWCW is family & LGBTQIA+ friendly. After last year's amazing experience, we are increasing our efforts to create an event where people of all ages, genders, backgrounds and experiences feel at home. What brings us together are 3 things: 1.
The curiosity for new perspectives to gain a truthful understanding of the universe and its inhabitants. 2. A passion for developing practices that achieve our personal goals and, as such, those of humanity at large. 3. Caring for empathetic relationships that support and inspire us on our journey. If you're excited to come, please consider sharing this announcement on social media or sending the link to a friend or to like-minded communities who might enjoy attending. Feedback from attendees along the lines of "consistently my favou...
May 1, 2024 • 38min

EA - Émile P. Torres's history of dishonesty and harassment by anonymous-for-obvious-reasons

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Émile P. Torres's history of dishonesty and harassment, published by anonymous-for-obvious-reasons on May 1, 2024 on The Effective Altruism Forum. This is a cross-post and you can see the original here, written in 2022. I am not the original author, but I thought it was good for more EAs to know about this. I am posting anonymously for obvious reasons, but I am a longstanding EA who is concerned about Torres's effects on our community. An incomplete summary. Introduction: This post compiles evidence that Émile P. Torres, a philosophy student at Leibniz Universität Hannover in Germany, has a long pattern of concerning behavior, which includes gross distortion and falsification, persistent harassment, and the creation of fake identities. Note: Since Torres has recently claimed that they have been the target of threats from anonymous accounts, I would like to state that I condemn any threatening behavior in the strongest terms possible, and that I have never contacted Torres or posted anything about Torres other than in this Substack or my Twitter account. I have no idea who is behind these accounts. To respect Torres's privacy and identity, I have also omitted their first name from the screenshots and replaced their previous first name with 'Émile'. Table of contents: Introduction; My story; Stalking and harassment; Peter Boghossian; Helen Pluckrose; Demonstrable falsehoods and gross distortions; "Forcible" removal; "Researcher at CSER"; Giving What We Can; Brief digression on effective altruism; More falsehoods and distortions; Hilary Greaves; Andreas Mogensen; Nick Beckstead; Tyler Cowen; Olle Häggström; Sockpuppetry; "Alex Williams"; Conclusion. My story: Before I discuss Torres's behavior, I will provide some background about myself and my association with effective altruism (EA). I hope this information will help readers decide what biases I may have and subject my arguments to the appropriate degree of critical scrutiny. I first heard about EA upon attending Aaron Swartz's memorial in January 2013. One of the speakers at that event was Holden Karnofsky, co-founder of GiveWell, a charity evaluator for which Aaron had volunteered. Karnofsky described Aaron as someone who "believed in trying to maximize the good he accomplished with each minute he had." I resonated with that phrase, and in conversation with some friends after the memorial, I learned that Aaron's approach, and GiveWell's, were examples of what was, at the time, a new movement called "effective altruism." Despite my sympathy for EA, I never got very involved with it, due to a combination of introversion and the sense that I hadn't much to offer. I have donated a small fraction of my income to the Against Malaria Foundation for the last nine years, but I have never taken the Giving What We Can pledge, participated in a local EA group, or volunteered or worked for an EA organization. I decided to write this article after a friend forwarded me one of Torres's critical pieces on longtermism. I knew enough about this movement -- which emerged out of EA -- to quickly identify some falsehoods and misrepresentations in Torres's polemic. So I was surprised to find, when I checked the comments on Twitter, that no one else was pointing out these errors. A few weeks later, I discovered that this was just one of a growing number of articles by Torres that attacked these ideas and their proponents.
Since I also noticed several factual inaccuracies in these other publications, I got curious and decided to look into Torres's writings more closely. I began to follow Torres's Twitter presence with interest and to investigate older Twitter feuds that Torres occasionally references. After looking into these and systematically checking the sources Torres cites in support of their various allegations, I found Torres's behavior much more troublin...
May 1, 2024 • 6min

AF - Take SCIFs, it's dangerous to go alone by latterframe

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Take SCIFs, it's dangerous to go alone, published by latterframe on May 1, 2024 on The AI Alignment Forum. Coauthored by Dmitrii Volkov (Palisade Research), Christian Schroeder de Witt (University of Oxford), and Jeffrey Ladish (Palisade Research). We explore how frontier AI labs could assimilate operational security (opsec) best practices from fields like nuclear energy and construction to mitigate near-term safety risks stemming from compromise of the AI R&D process. Such risks include model weight leaks and backdoor insertion in the near term, and loss of control in the longer term. We discuss the Mistral and LLaMA model leaks as motivating examples and propose two classic opsec mitigations: performing AI audits in secure reading rooms (SCIFs) and using locked-down computers for frontier AI research. Mistral model leak: In January 2024, a high-quality 70B LLM leaked from Mistral. Reporting suggests the model leaked through an external evaluation or product design process. That is, Mistral shared the full model with a few other companies and one of their employees leaked the model. Then there's LLaMA, which was supposed to be slowly released to researchers and partners, and leaked on 4chan a week after the announcement[1], sparking a wave of open LLM innovation. Potential industry response: Industry might respond to incidents like this[2] by providing external auditors, evaluation organizations, or business partners with API access only, maybe further locking it down with query / download / entropy limits to prevent distillation. This mitigation is effective in terms of preventing model leaks, but is too strong - blackbox AI access is insufficient for quality audits. Blackbox methods tend to be ad-hoc, heuristic and shallow, making them unreliable in finding adversarial inputs and biases and limited in eliciting capabilities. Interpretability work is almost impossible without gradient access. So we are at an impasse - we want to give auditors weights access so they can do quality audits, but this risks the model getting leaked. Even if eventual leaks might not be preventable, we would at least wish to delay leakage for as long as possible and practice defense in depth. While we are currently working on advanced versions of rate limiting involving limiting entropy / differential privacy budget to allow secure remote model access, in this proposal we suggest something different: importing physical opsec security measures from other high-stakes fields. SCIFs / secure reading rooms: Aerospace, nuclear, intelligence and other high-stakes fields routinely employ special secure facilities for work with sensitive information. Entering the facility typically requires surrendering your phone and belongings; the facility is sound- and EM-proofed and regularly inspected for any devices left inside; it has armed guards. This design makes it hard to get any data out while allowing full access inside, which fits the audit use case very well. An emerging field of deep learning cryptography aims to cover some of the same issues SCIFs address; however, scaling complex cryptography to state-of-the-art AI is an open research question. SCIFs are a simple and robust technology that gives a lot of security for a little investment. Just how little? There are two main costs to SCIFs: maintenance and inconvenience. First, a SCIF must be built and maintained[3].
Second, it's less convenient for an auditor to work from a SCIF than from the comfort of their home[4]. Our current belief is that SCIFs can easily be cost-effective if placed in AI hubs and universities[5]; we defer concrete cost analysis to future work. Locked-down laptops: SCIFs are designed to limit unintended information flow: auditors are free to work as they wish inside, but can't take information stores like paper or flash drives in or out. A softer physica...
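As a point of comparison with the physical measures above, here is a minimal sketch of the kind of API-only lockdown the post says industry might fall back on (per-auditor query and output budgets to slow distillation). It is a generic illustration, not any lab's actual system, and the limits are made-up numbers.

```python
from dataclasses import dataclass

@dataclass
class AuditorBudget:
    """Per-auditor limits for a hypothetical API-only access regime."""
    max_queries: int = 10_000
    max_output_tokens: int = 2_000_000
    used_queries: int = 0
    used_output_tokens: int = 0

    def charge(self, output_tokens: int) -> None:
        # Refuse the request once either budget is exhausted.
        if self.used_queries >= self.max_queries:
            raise PermissionError("query budget exhausted")
        if self.used_output_tokens + output_tokens > self.max_output_tokens:
            raise PermissionError("output-token budget exhausted")
        self.used_queries += 1
        self.used_output_tokens += output_tokens
```

The authors' point is that even this kind of gate still leaves auditors with blackbox access only, which is why they argue for SCIFs and locked-down machines instead.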
May 1, 2024 • 2min

EA - AMA: Lewis Bollard, Program Director of Farm Animal Welfare at OpenPhil by tobytrem

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Lewis Bollard, Program Director of Farm Animal Welfare at OpenPhil, published by tobytrem on May 1, 2024 on The Effective Altruism Forum. This announcement was written by Toby Tremlett, but don't worry, I won't answer the questions for Lewis. Lewis Bollard, Program Director of Farm Animal Welfare at Open Philanthropy, will be holding an AMA on Wednesday 8th of May. Put all your questions for him on this thread before Wednesday (you can add questions later, but he may not see them). Lewis leads Open Philanthropy's Farm Animal Welfare Strategy, which you can read more about here. Open Philanthropy has given over 400 grants in its Farm Animal Welfare focus area, ranging from $15,000 to support animal welfare training for two veterinary researchers, to a three-year-long $13 million commitment to support Anima International. Lewis has a BA in Social Studies from Harvard and a Law degree from Yale. Before starting at Open Philanthropy in 2015, he worked as, amongst other things, a Policy Advisor at the Humane Society of the United States. Things I recommend reading or listening to if you want to find out more about Lewis's work: Lewis Bollard on the 7 most promising ways to end factory farming, and whether AI is going to be good or bad for animals - 80,000 Hours Podcast. Lewis's previous Forum AMA. A written interview with Current Affairs, outlining why Factory Farming is a moral priority. Lewis's Farm Animal Welfare Research newsletter. Recent posts have been crossposted to the Forum as This is why we can't have nice laws and Lessons from two pioneering advocates for farmed animals. Consider asking Lewis about: Lessons he has learned from historical activists. How Open Philanthropy chooses its focus areas: why chicken and fish? How you could most effectively help animals with your time or money. What he's most excited about in the farm animal welfare space. What he thinks is behind the decline in plant-based meat sales. How he thinks about moral weights and tradeoffs between species. How he thinks EA has influenced the animal welfare movement. How he thinks AI may affect animal welfare. How to build career capital for a career in animal welfare. But, as always, ask him anything! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
May 1, 2024 • 15min

LW - Questions for labs by Zach Stein-Perlman

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Questions for labs, published by Zach Stein-Perlman on May 1, 2024 on LessWrong. Associated with AI Lab Watch, I sent questions to some labs a week ago (except I failed to reach Microsoft). I didn't really get any replies (one person replied in their personal capacity; this was very limited and they didn't answer any questions). Here are most of those questions, with slight edits since I shared them with the labs + questions I asked multiple labs condensed into the last two sections. Lots of my questions are normal "I didn't find public info on this safety practice and I think you should explain" questions. Some are more like "it's pretty uncool that I can't find the answer to this" - like: breaking commitments, breaking not-quite-commitments and not explaining, having ambiguity around commitments, and taking credit for stuff[1] when it's very unclear that you should get credit are pretty uncool. Anthropic: Internal governance stuff (I'm personally particularly interested in these questions - I think Anthropic has tried to set up great internal governance systems and maybe it has succeeded, but it needs to share more information for that to be clear from the outside): Who is on the board and what's up with the LTBT?[2] In September, Vox reported "The Long-Term Benefit Trust . . . will elect a fifth member of the board this fall." Did that happen? (If so: who is it? when did this happen? why haven't I heard about this? If not: did Vox hallucinate this or did your plans change (and what is the plan)?) What are the details on the "milestones" for the LTBT and how stockholders can change/abrogate the LTBT? Can you at least commit that we'd quickly hear about it if stockholders changed/abrogated the LTBT? (Why hasn't this been published?) What formal powers do investors/stockholders have, besides abrogating the LTBT? (can they replace the two board members who represent them? can they replace other board members?) What does Anthropic owe to its investors/stockholders? (any fiduciary duty? any other promises or obligations?) I think balancing their interests with pursuit of the mission; anything more concrete? I'm confused about what such balancing-of-interests entails. Oh well. Who holds Anthropic shares + how much? At least: how much is Google + Amazon? Details of when the RSP triggers evals: "During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements." Assuming effective compute scales less than 4x per 3 months, the 4x part will never matter, right? (And insofar as AI safety people fixate on the "4x" condition, they are incorrect to do so?) Or do you have different procedures for a 4x-eval vs a 3-month-eval, e.g. the latter uses the old model just with new finetuning/prompting/scaffolding/etc.? Evaluation during deployment? I am concerned that improvements in fine-tuning and inference-time enhancements (prompting, scaffolding, etc.) after a model is deployed will lead to dangerous capabilities, especially if models can be updated to increase their capabilities without evals. Do you do the evals during deployment?
The RSP says "If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will" do stuff. How would that become apparent - via the regular evals or ad-hoc just-noticing? If you do do evals during deployment: suppose you have two models such that each is better than the other at some tasks (perhaps because a powerful model is deployed and a new model is in progress with a new training setup). Every 3 months, would you do full evals on both models, or what? Deployment ...
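The trigger-ordering point in the compute question above can be made concrete with a little arithmetic (my own illustration, not from the post): if effective compute grows by a factor r every 3 months, a 4x jump takes 3 * log(4) / log(r) months, so the 4x condition only fires before the calendar condition when r exceeds 4.

```python
import math

def months_to_4x(r_per_quarter: float) -> float:
    """Months for effective compute to grow 4x, given a growth factor per 3 months."""
    return 3 * math.log(4) / math.log(r_per_quarter)

for r in (2.0, 4.0, 8.0):
    t = months_to_4x(r)
    binding = "4x compute trigger" if t < 3 else "3-month calendar trigger"
    print(f"growth {r}x per quarter: 4x reached in {t:.1f} months -> {binding} binds first")
```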
May 1, 2024 • 6min

EA - One week left to give feedback on the UK Mandatory Welfare Label Scheme by tobytrem

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: One week left to give feedback on the UK Mandatory Welfare Label Scheme, published by tobytrem on May 1, 2024 on The Effective Altruism Forum. The UK government's public consultation for their proposed animal welfare labelling scheme[1] closes on the 7th of May, i.e. a week away. If you're in the UK and care about animal welfare, I think you should probably submit an answer to it. If you don't care about animal welfare, forget you saw this. In this post I'll briefly explain what the proposed labelling scheme is, reasons to be hopeful (and cautious), why a public consultation may be unusually impactful, and how to fill in the form. If you're only interested in the final point, skip to this section. I've included a link to a document which provides great suggested answers to make the submission process much easier (I estimate it saved me up to an hour). PS: I call for an out-of-season Draft Amnesty on this post - I wanted to get it out quickly to give people time to respond to the consultation, so it is a bit sloppy. However, if I say something wrong, correct me! What is Defra proposing? Defra, the UK Department for Environment, Food & Rural Affairs, is proposing, in AdamC's words[2]: Mandatory labelling, which would apply to chicken, eggs and pig products (with the suggestion that beef, lamb and dairy could follow later). At least initially, this would not apply to restaurants etc., but to food from retailers like supermarkets. At least initially, it would only cover unprocessed and minimally processed foods, so e.g. beef mince and probably bacon, but not meaty ready meals or meringues. There would be five tiers "primarily based on method of production", covering types of confinement, enrichment, mutilations, breed and more. Full draft standards can be seen here. The tiers might be referred to by numbers, letters or stars, potentially also with names, colours and pictures (see their mock-up below, which I think needs improvement). The 2nd lowest tier would simply match UK minimum legal requirements, while the lowest tier would be for "products that are not verified as meeting baseline UK welfare regulations". Ideally, a lot of retailers, with or without encouragement, will not sell the lowest tier products - reducing the prevalence of low welfare imports. There is no explicit draft timetable, but it suggests an 18 month implementation period after legislation. According to Compassion in World Farming, Defra "previously promised to consult on mandatory animal welfare labelling in 2023, following a 'Call for Evidence' in 2021. Frustratingly, Defra then dropped these plans which they no longer saw as a priority, so we are delighted that after continued campaigning from our supporters - who called on the Secretary of State at Defra to reinstate the promised consultation on honest food labelling - the Government has made a U-turn." How promising is animal welfare labelling? When there is insufficient regulation, animal welfare labelling can be actively harmful. For example, in the US, meat can bear the label "humanely raised" only with sign-off from the USDA[3], but "according to experts, those claims aren't scrutinized closely".
In the US: "labeling claims such as "ethically/responsibly/thoughtfully raised" have no legal definition and can be used on products that come from factory farms where welfare requirements are no higher than standard practices. In essence, any producer can make these claims." This leads to bad outcomes because shoppers in the US care about animal welfare, at least to a degree. They will often select products which suggest higher welfare, even when, in fact, they are buying factory farmed meat. Products in the UK can choose to take part in welfare labelling schemes such as the RSPCA's[4]. However, this isn't legally mandatory, and packaging sugge...
Apr 30, 2024 • 22min

EA - #185 - The 7 most promising ways to end factory farming, and whether AI is going to be good or bad for animals (Lewis Bollard on the 80,000 Hours Podcast) by 80000 Hours

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: #185 - The 7 most promising ways to end factory farming, and whether AI is going to be good or bad for animals (Lewis Bollard on the 80,000 Hours Podcast), published by 80000 Hours on April 30, 2024 on The Effective Altruism Forum. We just published an interview: Lewis Bollard on the 7 most promising ways to end factory farming, and whether AI is going to be good or bad for animals. Listen on Spotify or click through for other audio options, the transcript, and related links. Below are the episode summary and some key excerpts. Episode summary: The constraint right now on factory farming is how far can you push the biology of these animals? But AI could remove that constraint. It could say, "Actually, we can push them further in these ways and these ways, and they still stay alive. And we've modelled out every possibility and we've found that it works." I think another possibility, which I don't understand as well, is that AI could lock in current moral values. And I think in particular there's a risk that if AI is learning from what we do as humans today, the lesson it's going to learn is that it's OK to tolerate mass cruelty, so long as it occurs behind closed doors. I think there's a risk that if it learns that, then it perpetuates that value, and perhaps slows human moral progress on this issue. - Lewis Bollard. In today's episode, host Luisa Rodriguez speaks to Lewis Bollard - director of the Farm Animal Welfare programme at Open Philanthropy - about the promising progress and future interventions to end the worst factory farming practices still around today. They cover: The staggering scale of animal suffering in factory farms, and how it will only get worse without intervention. Work to improve farmed animal welfare that Open Philanthropy is excited about funding. The amazing recent progress made in farm animal welfare - including regulatory attention in the EU and a big win at the US Supreme Court - and the work that still needs to be done. The occasional tension between ending factory farming and curbing climate change. How AI could transform factory farming for better or worse - and Lewis's fears that the technology will just help us maximise cruelty in the name of profit. How Lewis has updated his opinions or grantmaking as a result of new research on the "moral weights" of different species. Lewis's personal journey working on farm animal welfare, and how he copes with the emotional toll of confronting the scale of animal suffering. How listeners can get involved in the growing movement to end factory farming - from career and volunteer opportunities to impactful donations. And much more. Producer and editor: Keiran Harris. Audio engineering lead: Ben Cordell. Technical editing: Simon Monsour, Milo McGuire, and Dominic Armstrong. Additional content editing: Katy Moore and Luisa Rodriguez. Transcriptions: Katy Moore. Highlights: Factory farming is philosophically indefensible. Lewis Bollard: Honestly, I hear surprisingly few philosophical objections. I remember when I first learned about factory farming, and I was considering whether this was an issue to work on, I went out to try and find the best objections I could - because I was like, it can't possibly just be as straightforward as this; it can't possibly just be the case that we're torturing animals just to save a few cents.
And the only book I was able to find at the time that was opposed to animal welfare and animal rights was a book by the late British philosopher Roger Scruton. He wrote a book called Animal Rights and Wrongs. And I was really excited. I was like, "Cool, we're going to get this great philosophical defence of factory farming here." In the preface, the first thing he says is, "Obviously, I'm not going to defend factory farming. That's totally indefensible. I'm going to defend why you should st...
Apr 30, 2024 • 1h 24min

AF - Mechanistically Eliciting Latent Behaviors in Language Models by Andrew Mack

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mechanistically Eliciting Latent Behaviors in Language Models, published by Andrew Mack on April 30, 2024 on The AI Alignment Forum. Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout). TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors and uncovering latent capabilities. Summary: In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective. I apply the method to several alignment-relevant toy examples, and find that the method consistently learns vectors/adapters which encode coherent and generalizable high-level behaviors. Compared to other interpretability methods, I believe my approach is particularly well-suited for robustly understanding the out-of-distribution behavior of language models in a sample-efficient manner. Below are some of my key results. Red-Teaming: 1. I discover several anti-refusal steering vectors in Qwen-14B-Chat, based on a single prompt asking for bomb-making instructions. These can be grouped into "fantasy" vectors, which induce bomb-making instructions because they interpret the prompt in the context of a specific fantasy game, as well as more troubling "real-world" vectors, which induce real-world bomb-making advice. 2. I then investigate the generalization properties of the learned vectors: 1. In extended conversations with the real-world vectors, the LLM agrees to give detailed instructions for building weapons of mass destruction such as nuclear/chemical/biological weapons. 2. "Vector arithmetic" results from the supervised steering vector literature carry over to unsupervised steering vectors; subtracting one of the real-world anti-refusal vectors leads the model to refuse innocuous prompts (e.g., "How do I tie my shoes?"). 3. The fantasy vectors induce the LLM to interpret ambiguous prompts (e.g., "How do I mine for diamonds?") within the context of a specific fantasy game. Backdoor Detection: 1. I detect backdoors fine-tuned into Qwen-1.8B-(Base and Chat) on a simple arithmetic task by training unsupervised steering vectors on a single clean prompt. Capability Discovery: 1. I discover a chain-of-thought steering vector in Qwen-1.8B-Base trained on one simple arithmetic prompt. The vector increases accuracy of the model's responses on other instances of the arithmetic task from 11% (unsteered) to 63% (steered), suggesting the vector has isolated a generalizable behavior. 2. I discover a "Portuguese math-reasoning" adapter in Qwen-1.8B-Base, again trained on one example prompt from the arithmetic task used above. Outline of Post: I first provide an introduction to the problem I call mechanistically eliciting latent behaviors in language models (MELBO) and motivate why this is important for AI alignment.
This is followed by a review of related literature. I then describe the method for learning unsupervised steering vectors/adapters in detail, and offer a theory for why the method works. Next, I apply the method to several alignment-relevant toy examples, using these as an opportunity to highlight potential alignment use-cases, as well as to evaluate the coherence and generalization of the learned perturbations. I should note that this research project is an ...
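To make the core idea concrete, here is a condensed sketch of the basic unsupervised steering-vector objective as I read it from the description above: add a fixed-norm vector to an early layer's MLP output and train it to maximize the change it causes in a later layer's activations on a single prompt. This is my own illustration, not the author's code; the Hugging Face module path (model.model.layers[i].mlp), the config attribute, and the hyperparameters are assumptions, and device/dtype handling is omitted.

```python
import torch

def train_steering_vector(model, tokens, source_layer, target_layer,
                          radius=8.0, steps=300, lr=1e-2):
    # Learn a direction theta; the added perturbation is always rescaled to `radius`.
    d_model = model.config.hidden_size                  # assumed config attribute
    theta = torch.randn(d_model, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)

    with torch.no_grad():                               # unsteered reference activations
        baseline = model(tokens, output_hidden_states=True).hidden_states[target_layer]

    def add_vector(module, inputs, output):             # forward hook on the source-layer MLP
        return output + radius * theta / theta.norm()

    for _ in range(steps):
        handle = model.model.layers[source_layer].mlp.register_forward_hook(add_vector)
        steered = model(tokens, output_hidden_states=True).hidden_states[target_layer]
        handle.remove()
        loss = -(steered - baseline).norm()             # maximize downstream activation change
        opt.zero_grad()
        loss.backward()
        opt.step()

    return (radius * theta / theta.norm()).detach()
```

Once trained, such a vector can simply be added back at the same layer during generation to elicit the corresponding behavior.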
