The Nonlinear Library

The Nonlinear Fund
Jun 12, 2024 • 2min

EA - Linkpost: A landscape analysis of wild animal welfare by William McAuliffe

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Linkpost: A landscape analysis of wild animal welfare, published by William McAuliffe on June 12, 2024 on The Effective Altruism Forum. Executive Summary Introductions to wild animal welfare as a moral concern abound, but there is no centralized overview of efforts to help wild animals. Using interviews and publicly available material, we describe the theories of change of five organizations working on wild animal welfare: Wild Animal Initiative, Welfare Footprint, Animal Ethics, Animal Charity Evaluators, and New York University's (NYU) Wild Animal Welfare program. Our synthesis reveals several commonalities: Academic outreach is the main tactic. Organizations have a cautious attitude towards controversial efforts to ameliorate non-anthropogenic harms. Organizations have focused mostly on helping mammals and birds so far. All organizations have room for more funding. To contextualize these trends, we assume that there are three preconditions to improving the aggregate welfare of wild animals at scale: 1. Valid measurement: Knowledge of (a) how to measure the welfare of wild animals and (b) the causal relationships among the factors that influence it. 2. Technical Ability: Technology and skill to implement interventions to help wild animals at scale, while minimizing unintended negative consequences. 3. Stakeholder Buy-In: Consent from stakeholders with veto power, and collaboration from stakeholders who can implement scalable interventions. When comparing the needs of the movement with organizations' activities, we see the following gaps: Academic outreach efforts do not yet focus on the most abundant taxa, or make salient the outsized role they play in determining the aggregate welfare of an ecosystem. There is little targeted outreach to groups other than academics. There is little work advancing Technical Ability. There are few investments in implementing interventions in the near-term. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jun 12, 2024 • 2min

EA - LLMs won't lead to AGI - Francois Chollet by tobycrisford

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs won't lead to AGI - Francois Chollet, published by tobycrisford on June 12, 2024 on The Effective Altruism Forum. I found this interview with Francois Chollet fascinating, and would be curious to hear what other people make of it. I think it is impressive that he's managed to devise a benchmark of tasks which are mostly pretty easy for most humans, but which LLMs have so far not been able to make much progress with. If you don't have time to watch the video, then I think these tweets of his sum up his views quite well: The point of general intelligence is to make it possible to deal with novelty and uncertainty, which is what our lives are made of. Intelligence is the ability to improvise and adapt in the face of situations you weren't prepared for (either by your evolutionary history or by your past experience) -- to efficiently acquire skills at novel tasks, on the fly. Meanwhile what the AI of today does is to combine extremely weak generalization power (i.e. ability to deal with novelty and uncertainty) with a dense sampling of everything it might ever be faced with -- essentially, use brute-force scale to *by-pass* the problem of intelligence entirely. If intelligence is the ability to deal with what you weren't prepared for, then the modern AI strategy is to prepare for everything, so you never need intelligence. This is of course a terrible strategy, because it is impossible to prepare for everything. The problem isn't just scale, the problem is the fact that the real world isn't sampled from a static distribution -- it is ever changing and ever novel. If his take on things is correct, I am not sure exactly what this implies for AGI timelines. Maybe it would mean that AGI is much further off than we think, because the impressive feats of LLMs that have led us to think it might be close have been overinterpreted. But it seems like it could also mean that AGI will arrive much sooner? Maybe we already have more than enough compute and training data for superhuman AGI, and we are just waiting on that one clever idea. Maybe that could happen tomorrow? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Jun 12, 2024 • 7min

LW - My AI Model Delta Compared To Christiano by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My AI Model Delta Compared To Christiano, published by johnswentworth on June 12, 2024 on LessWrong. Preamble: Delta vs Crux This section is redundant if you already read My AI Model Delta Compared To Yudkowsky. I don't natively think in terms of cruxes. But there's a similar concept which is more natural for me, which I'll call a delta. Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it's cloudy today, that means the "weather" variable in my program at a particular time[1] takes on the value "cloudy". Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world. If your model and my model differ in that way, and we're trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is. That's a delta: one or a few relatively "small"/local differences in belief, which when propagated through our models account for most of the differences in our beliefs. For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else's model, or vice versa. This post is about my current best guesses at the delta between my AI models and Paul Christiano's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models mostly look like Paul's as far as I can tell. That said, note that this is not an attempt to pass Paul's Intellectual Turing Test; I'll still be using my own usual frames. My AI Model Delta Compared To Christiano Best guess: Paul thinks that verifying solutions to problems is generally "easy" in some sense. He's sometimes summarized this as " verification is easier than generation", but I think his underlying intuition is somewhat stronger than that. What do my models look like if I propagate that delta? Well, it implies that delegation is fundamentally viable in some deep, general sense. That propagates into a huge difference in worldviews. Like, I walk around my house and look at all the random goods I've paid for - the keyboard and monitor I'm using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better. But because the badness is nonobvious/nonsalient, it doesn't influence my decision-to-buy, and therefore companies producing the good are incentivized not to spend the effort to make it better. It's a failure of ease of verification: because I don't know what to pay attention to, I can't easily notice the ways in which the product is bad. (For a more game-theoretic angle, see When Hindsight Isn't 20/20.) 
On (my model of) Paul's worldview, that sort of thing is rare; at most it's the exception to the rule. On my worldview, it's the norm for most goods most of the time. See e.g. the whole air conditioner episode for us debating the badness of single-hose portable air conditioners specifically, along with a large sidebar on the badness of portable air conditioner energy ratings. How does the ease-of-verification delta propagate to AI? Well, most obviously, Paul expects AI to go well mostly via ...
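To make the "delta" framing concrete, here is a minimal toy sketch (my own illustration, not code from the post): the same simple world-model run with two different values of a single parameter for how easy verification is, which then propagates into very different downstream beliefs about whether delegation works.

```python
# Toy illustration (not from the post): one "program" run with two different
# values of a single parameter, showing how a one-parameter delta propagates
# into very different downstream beliefs.

def world_model(ease_of_verification: float) -> dict:
    """A toy world-model: how good random consumer goods are, given how easily
    buyers can verify quality before purchase."""
    # Assume producers invest in quality roughly in proportion to how
    # verifiable that quality is to buyers (a made-up functional form).
    fraction_good = min(1.0, 0.1 + 0.9 * ease_of_verification)
    return {
        "fraction_of_goods_that_are_good": round(fraction_good, 2),
        "delegation_broadly_viable": fraction_good > 0.8,
    }

# Same program, one parameter differs -- the "delta" between the two worldviews.
johns_view = world_model(ease_of_verification=0.2)   # most goods quietly bad
pauls_view = world_model(ease_of_verification=0.9)   # verification mostly works
```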
Jun 12, 2024 • 56min

EA - Long-Term Future Fund: March 2024 Payout recommendations by Linch

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Long-Term Future Fund: March 2024 Payout recommendations, published by Linch on June 12, 2024 on The Effective Altruism Forum. Introduction This payout report covers the Long-Term Future Fund's grantmaking from May 1 2023 to March 31 2024 (11 months). It follows our previous April 2023 payout report. Total funding recommended: $6,290,550 Total funding paid out: $5,363,105 Number of grants paid out: 141 Acceptance rate (excluding desk rejections): 159/672 = 23.7% Acceptance rate (including desk rejections): 159/825 = 19.3% Report authors: Linchuan Zhang (primary author), Caleb Parikh (fund chair), Oliver Habryka, Lawrence Chan, Clara Collier, Daniel Eth, Lauro Langosco, Thomas Larsen, Eli Lifland 25 of our grantees, who received a total of $790,251, requested that our public reports for their grants be anonymized (the table below includes those grants). 13 grantees, who received a total of $529,819, requested that we not include public reports for their grants. You can read our policy on public reporting here. We referred at least 2 grants to other funders for evaluation.

Highlighted Grants (The following grant writeups were written by me, Linch Zhang. They were reviewed by the primary investigators of each grant). Below, we highlighted some grants that we thought were interesting and covered a relatively wide scope of LTFF's activities. We hope that reading the highlighted grants can help donors make more informed decisions about whether to donate to LTFF.[1]

Gabriel Mukobi ($40,680) - 9-month university tuition support for technical AI safety research focused on empowering AI governance interventions The Long-Term Future Fund provided a $40,680 grant to Gabriel Mukobi from September 2023 to June 2024, originally for 9 months of university tuition support. The grant enabled Gabe to pursue his master's program in Computer Science at Stanford, with a focus on technical AI governance. Several factors favored funding Gabe, including his strong academic background (4.0 GPA in Stanford CS undergrad with 6 graduate-level courses), experience in difficult technical AI alignment internships (e.g., at the Krueger lab), and leadership skills demonstrated by starting and leading the Stanford AI alignment group. However, some fund managers were skeptical about the specific proposed technical research directions, although this was not considered critical for a skill-building and career-development grant. The fund managers also had some uncertainty about the overall value of funding Master's degrees. Ultimately, the fund managers compared Gabe to marginal MATS graduates and concluded that funding him was favorable. They believed Gabe was better at independently generating strategic directions and being self-motivated for his work, compared to the median MATS graduate. They also considered the downside risks and personal costs of being a Master's student to be lower than those of independent research, as academia tends to provide more social support and mental health safeguards, especially for Master's degrees (compared to PhDs). Additionally, Gabe's familiarity with Stanford from his undergraduate studies was seen as beneficial on that axis. The fund managers also recognized the value of a Master's degree credential for several potential career paths, such as pursuing a PhD or working in policy.
However, a caveat is that Gabe might have less direct mentorship relevant to alignment compared to MATS extension grantees. Outcomes: In a recent progress report, Gabe noted that the grant allowed him to dedicate more time to schoolwork and research instead of taking on part-time jobs. He produced several new publications that received favorable media coverage and was accepted to 4 out of 6 PhD programs he applied to. The grant also allowed him to finish graduating in March instead of Ju...
Jun 12, 2024 • 8min

LW - Anthropic's Certificate of Incorporation by Zach Stein-Perlman

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Anthropic's Certificate of Incorporation, published by Zach Stein-Perlman on June 12, 2024 on LessWrong. Yesterday I obtained Anthropic's[1] Certificate of Incorporation, and its past versions, from the State of Delaware. I don't recommend reading it.[2] This post is about what the CoI tells us about Anthropic's Long-Term Benefit Trust (context: Maybe Anthropic's Long-Term Benefit Trust is powerless). Tl;dr: the only new information of moderate importance is the voting thresholds necessary to modify Trust stuff. My concerns all still stand in some form. Absence of badness is a small positive update. Anthropic has vaguely described stockholders' power over the Trust: a series of "failsafe" provisions . . . allow changes to the Trust and its powers without the consent of the Trustees if sufficiently large supermajorities of the stockholders agree. The required supermajorities increase as the Trust's power phases in The CoI has details: amending the CoI to modify the Trust requires a vote reaching the "Transfer Approval Threshold," defined as: (1) prior to the date that is the one-year anniversary of the Final Phase-In Date [note: "the Final Phase-In Date" is in November 2024], either (a)(i) a majority of the Voting Common Stock then-outstanding and held by the Founders (as defined in the Voting Agreement), (ii) a majority of the Series A Preferred Stock then-outstanding and (iii) a majority of the voting power of the outstanding Preferred Stock entitled to vote generally (which for the avoidance of doubt shall exclude the Non-Voting Preferred Stock), but excluding the Series A Preferred Stock or (b) at least seventy-five percent (75%) of the voting power of the then-outstanding shares of the Corporation's capital stock entitled to vote generally (which for the avoidance of doubt shall exclude the Non-Voting Preferred Stock and any voting power attributable to the Class T Common Stock) and (2) on and following the date that is the one-year anniversary of the Final Phase-In Date, either (x)(i) at least seventy-five percent (75%) of the Voting Common Stock then outstanding and held by the Founders (as defined in the Voting Agreement), (ii) at least at least fifty percent (50%) of the Series A Preferred Stock then-outstanding and (iii) at least seventy-five percent (75%) of the voting power of the outstanding Preferred Stock entitled to vote generally (which for the avoidance of doubt shall exclude the Non-Voting Preferred Stock), but excluding the Series A Preferred Stock or (y) at least eighty-five [percent] (85%) of the voting power of the then-outstanding shares of the Corporation's capital stock entitled to vote generally (which for the avoidance of doubt shall exclude the Non-Voting Preferred Stock and any voting power attributable to the Class T Common Stock) If Anthropic's description above is about this, it's odd and misleading. Perhaps Anthropic's description is about the Trust Agreement, not just the CoI. Per Article IX,[3] amending the CoI to modify the Trust also requires at least 75% of the board. This will apparently give the Trust tons of independence after it elects 3/5 of the board! Or at least, it will give the Trust tons of protection from CoI amendments - but not necessarily from Trust Agreement shenanigans; see below. Before reading the CoI, I had 4 main questions/concerns about the Trust:[4] 1. 
Morley et al.: "the Trust Agreement also authorizes the Trust to be enforced by the company and by groups of the company's stockholders who have held a sufficient percentage of the company's equity for a sufficient period of time," rather than the Trustees. 1. I don't really know what this means. And it's vague. It sounds like a straightforward way for Anthropic/stockholders to subvert the Trust. 2. Morley et al.: the Trust and its powers can be amended "by a ...
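As a reading aid only (not a legal interpretation), one way to paraphrase the quoted "Transfer Approval Threshold" clause is as boolean logic; the function and variable names below are hypothetical, and the actual CoI language controls.

```python
# Hypothetical paraphrase of the quoted Transfer Approval Threshold as boolean
# logic. Names and structure are mine; this is a reading aid, not legal advice.

def transfer_approval_threshold_met(
    after_anniversary: bool,          # one year after the Final Phase-In Date?
    founder_common_frac: float,       # share of Founders' Voting Common Stock approving
    series_a_frac: float,             # share of Series A Preferred approving
    other_voting_pref_frac: float,    # share of voting Preferred (excl. Series A) approving
    all_voting_stock_frac: float,     # share of all voting stock (excl. non-voting / Class T)
) -> bool:
    if not after_anniversary:
        # Clause (1): before the one-year anniversary of the Final Phase-In Date.
        class_by_class = (founder_common_frac > 0.5
                          and series_a_frac > 0.5
                          and other_voting_pref_frac > 0.5)
        return class_by_class or all_voting_stock_frac >= 0.75
    # Clause (2): on and following that anniversary, the bars rise.
    class_by_class = (founder_common_frac >= 0.75
                      and series_a_frac >= 0.5
                      and other_voting_pref_frac >= 0.75)
    return class_by_class or all_voting_stock_frac >= 0.85
```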
Jun 12, 2024 • 15min

EA - Charity Entrepreneurship Is Overestimating the Value of Saving Lives by 10% by Mikolaj Kniejski

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Charity Entrepreneurship Is Overestimating the Value of Saving Lives by 10%, published by Mikolaj Kniejski on June 12, 2024 on The Effective Altruism Forum. I got this idea after I read a few of CE's (Charity Entrepreneurship) cost-effectiveness estimates when I was preparing my application for the CE research training program. Although this is not a major, pressing improvement, it is definitely an iterative improvement over the current methodology, and I haven't seen anyone else raise this point yet.

CE uses DALYs averted as a measure of impact: DALYs (Disability-Adjusted Life Years): A measure of disease burden, expressed in years lost to disability or early death. Dying one year before your expected life span causes 1 DALY. DALYs are averted when we save someone from dying early, or when we reduce the number of sick people or the duration of their sickness. DALYs for a disease are the sum of YLLs and YLDs: DALY = YLL + YLD. Years of Life Lost (YLLs): Calculated as the difference between the age at death and the life expectancy. Death is the worst possible outcome, and it gets one DALY per person per year. Years Lived with Disability (YLDs): Calculated by multiplying the severity of an illness or disability by its duration. DALYs averted: I like to think of DALYs averted as the difference between DALYs without intervention and DALYs with intervention. This captures the notion of counterfactuality, meaning our estimate should reflect the difference between a world where the intervention happened and one where it didn't. For example, if an intervention saves a person who would have otherwise died at 30 and the life expectancy is 70, 40 YLLs are averted (without considering temporal discounting and age-weighting). If the intervention reduces a year of severe disability (with a disability weight of 0.5), 0.5 YLDs are averted.

When Charity Entrepreneurship estimates the number of DALYs that an intervention would avert, it uses a pre-made table by GiveWell. This table includes age weighting (which gives years around ages 20-30 more value) and applies temporal discounting at 4% per year. CE uses the average values (last column).

Table 1: GiveWell estimates of the value of a life saved at various ages of death. The table is available here and made using a formula that you can find here.

Age of death | Life expectancy, Females | Life expectancy, Males | YLL (discounted, age-weighted), Females | Males | Average
0   | 82.5  | 80    | 33.13 | 33.01 | 33.07
5   | 77.95 | 75.38 | 36.59 | 36.46 | 36.53
15  | 68.02 | 65.41 | 36.99 | 36.80 | 36.90
30  | 53.27 | 50.51 | 29.92 | 29.62 | 29.77
45  | 38.72 | 35.77 | 20.66 | 20.17 | 20.41
60  | 24.83 | 21.81 | 12.22 | 11.48 | 11.85
70  | 16.2  | 13.58 | 7.48  | 6.69  | 7.09
80  | 8.9   | 7.45  | 3.76  | 3.27  | 3.52
90  | 4.25  | 3.54  | 1.53  | 1.30  | 1.42
100 | 2     | 1.46  | 0.57  | 0.42  | 0.50

CE takes the exact values from the table. When an intervention saves someone who is 30 years old, they literally take the value 29.77 DALYs, which only incorporates temporal discounting and age-weighting. This implicitly assumes that the subject would live a perfectly healthy life to the life expectancy used in the estimation. The full value of e.g. 29.77 DALYs averted was calculated assuming the subject lives healthily to the life expectancy. He is not going to: the subject is almost certainly going to get sick and will fail to realize the full value.

Why This Matters We want our cost-effectiveness analyses (CEAs) to measure counterfactual impact.
The difference between the world where the intervention happened and the one where it didn't should be the key result. If we take the full value of the life saved, we will overestimate the value by the DALYs the subject will incur while being sick. This is crucial when choosing between interventions that improve lives compared to interventions that save lives. Is CE really making this mistake? I'm pretty sure they do. Here, I try to show the exact place where it hap...
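To make the mechanics concrete, here is a rough sketch (my own illustration, not GiveWell's spreadsheet formula or CE's model) of discounted YLLs for the post's example, plus the kind of background-morbidity adjustment the post is arguing for; the 10% morbidity figure is purely illustrative.

```python
# Rough illustration only: discounted YLLs for the post's example (death at 30,
# life expectancy 70), and a morbidity-adjusted version. This is not GiveWell's
# exact formula; in particular, age-weighting is omitted here.

def discounted_yll(age_at_death: float, life_expectancy: float, r: float = 0.04) -> float:
    """Years of life lost, discounted at rate r per year (no age-weighting)."""
    years_remaining = max(0, int(life_expectancy - age_at_death))
    return sum(1 / (1 + r) ** t for t in range(years_remaining))

full_health_yll = discounted_yll(30, 70)   # ~20.6 discounted life-years
# (GiveWell's table value of 29.77 for age 30 is higher mainly because it also
#  applies age-weighting and uses a longer conditional life expectancy.)

# The post's point: the person saved will not live those years in full health.
# If, illustratively, 10% of their remaining life-years would be lost to
# illness or disability anyway, the counterfactual benefit is smaller:
background_morbidity_fraction = 0.10       # assumed for illustration only
adjusted_yll_averted = full_health_yll * (1 - background_morbidity_fraction)
```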
Jun 12, 2024 • 12min

EA - My first EAG: a mix of feelings by Lovkush

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My first EAG: a mix of feelings, published by Lovkush on June 12, 2024 on The Effective Altruism Forum. TLDR I had a mix of feelings before and throughout EAG London 2024. Overall, the experience was excellent and I am more motivated and excited about my next steps in EA and AI safety. However, I am actually unsure if I will attend EAG next year, because I am yet to exhaust other means of networking, especially since I live in London. Why might this be useful for you? This is a narrative that is different to most others. Depending on your background/personality, this will reduce the pressure to optimise every aspect of your time at EAG. I am not saying to do no optimization, but that there is a different balance for different people. If you have not been to an EAG, this provides a flavour of the interactions and feelings - both positive and negative - that are possible. My background I did pure maths from undergraduate to PhD, then lectured maths for foundation year students for a few years, then moved to industry and have been a data scientist at Shell for three years. I took the GWWC pledge in 2014, but I had not actively engaged with the community or chosen a career based on EA principles. A few years ago I made an effort to apply EA principles to my career. I worked through the 80000 Hours career template with AI safety being the obvious top choice, took the AI Safety Fundamentals course, applied to EAG London (and did not get accepted, which was reasonable), and also tried volunteering for SoGive for a couple of months. Ultimately the arguments for AI doom overwhelmed me and put me into defeatist mindset ('How can you out-think a god-like super intelligence?') so I just put my head in the sand instead of contributing. In 2023, with ChatGPT and the prominence of AI, my motivation to contribute came back. I did take several actions, but spread out over several months: I finally learned enough PyTorch to train my first CNN and RNN. I attended an EA hackathon for software engineers and contributed to Stampy. The contributions were minimal though: shock-horror, the coding one does as a data scientist is not the same as what software engineers do! I applied to some AI safety roles (Epoch AI Analyst, Quantum Leap founding learning engineer, Cohere AI Data Trainer) I joined a Mech Interp Discord and within that a reading group for Mathematics for Machine Learning. I go into these details to illustrate a key way I differ from the prototypical EA: I am not particularly agentic! Somebody more rational would have created more concrete plans, accountability systems, and explored more thoroughly the options and actions available. Despite being familiar with rationality / EA for several years, I had not absorbed the ideas enough to apply them in my life. I was a Bob who waits for opportunities to arise, and thus ends up making little progress. The breakthrough came when I got accepted into ML4Good. I have written my thoughts on that experience, but the relevant thing is it gave me a huge boost in motivation and confidence to work on AI safety. Preparing for EAG I actually did not plan to attend EAG London! My next steps in AI Safety were clear (primarily upskilling by getting hands-on experience on projects) and I was unsure what I could bring to the table for other participants. 
However, three weeks before EAG, somebody in my ML4Good group chat asked who was going, so I figured I may as well apply and see what happens. Given I am writing this, I was accepted! When reading the recommended EA Forum posts for EAG first-timers, I was taken aback by how practical and strategic these people were. This had a two-sided effect for me: it was intimidating and made me question how valuable I could be to other EAG participants, but it did also help me be more agentic and help me push mys...
Jun 12, 2024 • 19min

EA - EA EDA: Looking at Forum trends across 2023 by JWS

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA EDA: Looking at Forum trends across 2023, published by JWS on June 12, 2024 on The Effective Altruism Forum. tl;dr: AI Safety became the biggest thing in EA Forum discussion last year, and overall Forum engagement trended downwards. What that means for the wider EA movement/philosophy is up for interpretation. If you have your own questions, let me know and I'll dive in (or try to share the data).

1. Introduction This is a follow-up to my previous post using the EA Forum API to analyse trends in the Forum. Whereas that last post was a zoomed-out look at Forum use for as far back as the API goes, this is a specific look at the Forum aggregations and trends in 2023 alone. Needless to say, last year was a tumultuous year for EA, and the Forum is one of (if not the) primary online hubs for the Community to discuss issues and self-organise. I hoped to see if any of these trends could be spotted in the data available, and also see where the data led me on some more general questions. I'm sharing this now in a not-quite-perfect state, but I'd rather post and see the discussion it promotes than have it languishing in my drafts for much longer, and as noted in section 3.4.2, if you have a query that I can dive into, just ask!

2. Methodology (For more detail on the general method, see the previous post.) On Monday 6th May I ran two major queries to the EA Forum API: 1) The first scraped all posts in Forum history. I then subselected these to find only posts that were in the 2023 calendar year. 2) I ran a secondary query for all of these postIds to find all comments on these posts, and again filtered to only count comments made in 2023. Any discrepancy with ground truth might be because of mistakes on my part during the data collection. Furthermore, my data is a snapshot of how the 2023 Forum looked on May 6th this year, so any Forum engagement that was deleted (or users who deleted their account) at the point of collection will not be sampled. I'll leave more specific methods to the relevant graphs and tables below. I used Python entirely for this, and am happy to talk about the method in more coding detail for those interested. I'm trying to resuscitate my moribund GitHub profile this summer, and this code may make its way up there.

3. Results 3.1 - Overall Trends in 2023 3.1.1 - Posts and Comments over time This graph shows a rolling 21-day mean of total posts and comments made in 2023, indexed to 1.0 at the start,[1] so be aware it is a lagging indicator. Both types of engagement show a decline over the course of the year, though the beginning of 2023 was when the Community was still reeling from the FTX scandal, and the Forum seemed to be the primary online place to discuss this. This was causing so much discussion that the Forum team decided to move Community discussions off the front page, so while I've indexed to 1.0 at the beginning for the graph, it's worth noting that January/February 2023 were very unusual times for the Forum. There is also a different story to be told for the individual engagement types. Posts seem to drop from the beginning of the year, tick up in the spring (due to April Fools'), and then drop away towards the end of the year. Comments, on the other hand, rapidly drop away, presumably as a result of engagement burning out after the FTX-Bostrom-Doing EA Better-FLI-Sexual Harassment-OCB perfect storm.
They then settle to some sort of baseline around May, and then pick up again sometimes in spurts due to highly-engaging posts. I think the September-October one is due to the Nonlinear controversy, the December Spike is the response from Tracing and Nonlinear themselves. There didn't seem to be any clear candidate for the spikes in the Summer though.[2] 3.1.2 - Which topics were popular This is just an overview, I have more topic results to sha...
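For readers who want to reproduce the headline chart, a minimal sketch of the transformation described above (daily counts, 21-day rolling mean, indexed to 1.0) might look like the following; the column names and helper function are assumptions, not the author's actual code.

```python
# Minimal sketch (assumed names, not the author's actual code) of the
# transformation behind the chart: daily counts -> 21-day rolling mean ->
# indexed to 1.0 at the start of the series.
import pandas as pd

def indexed_rolling_mean(timestamps: pd.Series, window: int = 21) -> pd.Series:
    """Daily activity counts, smoothed with a rolling mean and indexed to 1.0."""
    daily_counts = (
        timestamps.dt.floor("D")         # bucket each post/comment by day
        .value_counts()
        .sort_index()
        .asfreq("D", fill_value=0)       # include days with zero activity
    )
    smoothed = daily_counts.rolling(window).mean()
    return smoothed / smoothed.dropna().iloc[0]   # 1.0 at the first full window

# e.g. indexed_rolling_mean(posts_2023["postedAt"]) and
#      indexed_rolling_mean(comments_2023["postedAt"]), plotted on one axis.
```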
Jun 12, 2024 • 1h 22min

AF - AXRP Episode 33 - RLHF Problems with Scott Emmons by DanielFilan

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AXRP Episode 33 - RLHF Problems with Scott Emmons, published by DanielFilan on June 12, 2024 on The AI Alignment Forum. YouTube link Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting. Topics we discuss: Deceptive inflation Overjustification Bounded human rationality Avoiding these problems Dimensional analysis RLHF problems, in theory and practice Scott's research program Following Scott's research Daniel Filan: Hello, everybody. In this episode I'll be speaking with Scott Emmons. Scott is a PhD student at UC Berkeley, working with the Center for Human-Compatible AI on AI safety research. He's previously co-founded far.ai, which is an AI safety non-profit. For links to what we're discussing, you can check the description of the episode, and for a transcript you can read it at axrp.net. Well, welcome to AXRP. Scott Emmons: Great to be here. Deceptive inflation Daniel Filan: Sure. So today we're talking about your paper, When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning, by Leon Lang, Davis Foote, Stuart Russell, Erik Jenner, and yourself. Can you just tell us roughly what's going on with this paper? Scott Emmons: Yeah, I could start with the motivation of the paper. Daniel Filan: Yeah, sure. Scott Emmons: We've had a lot of speculation in the x-risk community about issues like deception. So people have been worried about what happens if your AIs try to deceive you. And at the same time, I think for a while that's been a theoretical, a philosophical concern. And I use "speculation" here in a positive way. I think people have done really awesome speculation about how the future of AI is going to play out, and what those risks are going to be. And deception has emerged as one of the key things that people are worried about. I think at the same time, we're seeing AI systems actually deployed, and we're seeing a growing interest of people in what exactly do these risks look like, and how do they play out in current-day systems? So the goal of this paper is to say: how might deception play out with actual systems that we have deployed today? And reinforcement learning from human feedback [RLHF] is one of the main mechanisms that's currently being used to fine-tune models, that's used by ChatGPT, it's used by Llama, variants of it are used by Anthropic. So what this paper is trying to do is it's trying to say, "Can we mathematically pin down, in a precise way, how might these failure modes we've been speculating about play out in RLHF?" Daniel Filan: So in the paper, the two concepts you talk about on this front are I think "deceptive inflation" and "overjustification". So maybe let's start with deceptive inflation. What is deceptive inflation? Scott Emmons: I can give you an example. I think examples from me as a child I find really helpful in terms of thinking about this. So when I was a child, my parents asked me to clean the house, and I didn't care about cleaning the house. I just wanted to go play. 
So there's a misalignment between my objective and the objective my parents had for me. And in this paper, the main failure cases that we're studying are cases of misalignment. So we're saying: when there is misalignment, how does that play out? How does that play out in the failure modes? So [with] me as a misaligned child, one strategy I would have for cleaning the house would be just to sweep any dirt or any debris under the furniture. So I'm cleaning the house, I just sweep some debris...
Jun 12, 2024 • 7min

LW - [New Feature] Your Subscribed Feed by Ruby

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [New Feature] Your Subscribed Feed, published by Ruby on June 12, 2024 on LessWrong. tl;dr LessWrong now has a Subscribed tab (next to the Latest tab and Enriched tab[1]). You can now "follow" users, which means their posts and comments will show up in your Subscribed tab[2]. We've put a lot of thought into how to display the right amount of recent content from people you follow, plus the right amount of surrounding context, to keep you up to date without it being overwhelming. See here for more detail.

How to follow people You can follow users via multiple methods: 1. Using the widget on the Subscribed tab. 2. You can follow people from their user profile. 3. You can follow people using the user tooltip that comes up when you hover on their username. Note! Following people for your Subscribed tab is different from subscribing to get notifications. Signing up for one does not cause the other! Except, to help people start using the Subscribed tab, we did a one-time operation to cause you to be following (for purposes of the Subscribed tab) anyone you'd already subscribed to for post and comment notifications. We assume that if you want notifications, you'd also want to follow.

What's shown to me in my Subscribed feed? Short description: We display the recent posts and comments of people you follow, plus comments from other users that people you follow are replying to. Long description (subject to change, last updated 2024-06-10): 1. We load posts and comments from people you follow from the last 30 days. 2. We group posts and comments to the post level: (a) we might show a post because someone you followed published it; (b) we might show a post because someone you follow is commenting on it, even if you don't follow the author of the post (this will probably be most of your feed, unless you follow people who write more posts than comments). 3. We display the five most recent comments from people you follow, unless those comments were a week or more older than the most recent one (we found this necessary to avoid seeing lots of stale content). 4. We further display (with de-emphasized styling) the comments being replied to by people you follow.

Why we built this A while back we introduced the ability to subscribe to all of a user's comments. At first, I thought this was great - "wow, look at all these comments I was seeing previously that I want to read". However, it cluttered up my notifications tab, and reading comments via notifications isn't ideal. I realized I wanted a feed, and that's what we've built. The mainstay of LessWrong is the frontpage posts list, but I'm interested in supplementing it with feeds, since they have two main advantages: 1. You can easily start to read the content of a post before clicking. Especially on mobile, where there's no hover-preview, it's often nice to read a few sentences before deciding to commit to a post. 2. It puts comments on an even footing with posts. Often comments from some users are of greater interest than posts from others; a feed lets them be brought to your attention just as easily. So far I've found the feed really great for (1) high signal-to-noise-ratio content, since it's from people I've chosen to follow, and (2) reading through without having to spend as much up-front "decide what to read" energy. I like it for casual reading.
Future Directions I think the Subscribed feed is good but has some drawbacks that mean it's not actually the feed I most want to see. First, it requires work to decide who to follow, and for users who aren't that familiar with the authors on the site, it'll be hard to decide who to follow. This means they might not get enough content. On the other hand, it's possible to subscribe to too many people, bringing down your average quality and driving you away from your feed. Rather, I'm interested in a Subsc...
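For concreteness, the comment-selection rule from the "Long description" above could be restated roughly as follows; the field and function names are hypothetical, not LessWrong's actual implementation.

```python
# Hypothetical restatement of the described selection rule; not LessWrong's code.
from datetime import datetime, timedelta

def select_followed_comments(comments: list[dict], now: datetime) -> list[dict]:
    """comments: [{'posted_at': datetime, ...}, ...] from followed users."""
    # Only consider the last 30 days of activity.
    recent = [c for c in comments if now - c["posted_at"] <= timedelta(days=30)]
    # Take up to the five most recent comments...
    recent.sort(key=lambda c: c["posted_at"], reverse=True)
    top_five = recent[:5]
    if not top_five:
        return []
    # ...but drop any that are a week or more older than the newest shown,
    # to avoid surfacing stale threads.
    newest = top_five[0]["posted_at"]
    return [c for c in top_five if newest - c["posted_at"] < timedelta(weeks=1)]
```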
