The Nonlinear Library

The Nonlinear Fund
Apr 14, 2024 • 17min

EA - Things EA Group Organisers Need To Hear by Kenneth Diao

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Things EA Group Organisers Need To Hear, published by Kenneth Diao on April 14, 2024 on The Effective Altruism Forum. "I really needed to hear that" His eyes were downcast, his normally jocular expression now solemn. I had really said something that had spoken to him, that had begun to assuage some hurt which had before remained unacknowledged. It's not your fault. Four words. Later, I was listening to him tell someone else about everything he was juggling, listing off responsibility after responsibility, and asking them how he could do more. This was when I realised that I needed to write about this, not just because it could help other people who are organising, but for myself. To tell myself the things I needed to hear and to really believe. As the meme goes: people always ask about what EA Organisers are doing, but they never ask how EA Organisers are doing. And in the (currently ~85) posts on the forum tagged EA Organising, I've seen barely any which really treat EA Organisers as ends in themselves. To me, this is a damn shame, because they are some of the best ends-in-themselves I've ever had the pleasure of meeting. So, to the EA organisers out there: from a former organiser, these might be some things you need to hear. It's Not Your Fault Let me tell you my story. I'd been organising for EA UW-Madison for a semester when I got "promoted" big-time. I'd been the outreach coordinator for a while, sending out newsletters and helping around here and there where help was needed. Suddenly, I was part of the exec team, essentially a co-President, and also President of the nascent animal advocacy spin-off organisation. And I was taking 5 classes. I worked really hard. Not just objectively speaking or whatever. I put my entire being into the success of the clubs I was working for. Tabling, speech-giving, outreach, socials, speaker events, everything. I thought that if I just did more, if I just did it right, we would get a ton of new people. I'd have done an exceptional job and made my positive impact on the world. I'd have made an impact on our group that would last for years and years. Unfortunately, that success did not come. Attendance of the events for both EA and animal advocacy dwindled, so that sometimes it was just me sitting in the office alone, hoping someone would drop by. Organisers got busy with school, and meetings became more sparse and sporadic. Sign-ups for the clubs (EA and animal advocacy) and both fellowships were abysmally low, and attendance was lower. We had one animal ethics fellowship group which none of the participants even finished. Out of the new animal advocacy organisers I'd wanted to onboard, only one (and you know who you are; thank you so much) consistently showed up. Adding insult to injury was that we had some successes - just not anywhere I was directly involved. Our AI Safety group exploded in popularity, and biosecurity also seemed to be doing well. The one animal advocacy event that had significant attendance was the one that I was not involved in organising. It felt like the common denominator of all of the failures was me. I even said once that I felt that everything I touched turned to dust. Somehow, it felt like it was my fault. That feeling was really emotionally painful. It was, in terms understandable to EAs, a giant negative prediction error. 
I'd been told that we were doing some of the most effective and valuable work in the world. It had been implied that every convert to EA, every leader onboarded, every person flown to an EAG was lives saved. I'd been led to expect that my work would be serious and impactful and meaningful. It certainly didn't feel that way at 7 PM on a Monday, sitting in the empty office and fiddling with the text color of one of our newsletter's headers. Most painful was that I had somehow internal...
Apr 13, 2024 • 3min

AF - Speedrun ruiner research idea by Luke H Miles

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speedrun ruiner research idea, published by Luke H Miles on April 13, 2024 on The AI Alignment Forum.

Central claim: If you can make a tool to prevent players from glitching games *in the general case*, then it will probably also work pretty well for RL with (non-superintelligent) advanced AI systems.

Alternative title: RL reward+environment autorobustifier

Problem addressed: every RL thing ever trained found glitches/edge-cases in the reward function or the game/physics-sim/etc and exploited those until the glitches were manually patched. Months ago I saw a tweet from someone at OpenAI saying, yes, of course this happens with RLHF as well. (I can't find it. Anyone have it bookmarked?)

Obviously analogous 'problem': Most games get speedrun into oblivion by gamers.

Idea: Make a software system that can automatically detect glitchy behavior in the RAM of **any** game (like a cheat engine in reverse) and thereby ruin the game's speedrunability. You can imagine your system gets a score from a human on a given game:
- Game is unplayable: score := -1
- Blocks glitch: score += 10 * [importance of that glitch]
- Blocks unusually clever but non-glitchy behavior: score -= 5 * [in-game benefit of that behavior]
- Game is laggy:[1] score := score * [proportion of frames dropped]
- Tool requires non-glitchy runs on a game as training data: score -= 2 * [human hours required to make non-glitchy runs] / [human hours required to discover the glitch]

Further defense of the analogy between general anti-speedrun tool and general RL reward+environment robustifier:
- Speedrunners are smart as hell
- Both have similar fuzzy boundaries that are hard to formalize: 'player played game very well' vs 'player broke the game and didn't play it' is like 'agent did the task very well' vs 'agent broke our sim and did not learn to do what we need it to do'. In other words, you don't want to punish talented-but-fair players.
- Both must run tolerably fast (can't afford to kill the AI devs' research iteration speed or increase training costs much)
- Both must be 'cheap enough' to develop & maintain

Breakdown of analogy: I think such a tool could work well through GPT alphazero 5, but probably not GodAI6

(Also if random reader wants to fund this idea, I don't have plans for May-July yet.)

^ Note that "laggy" is indeed the correct/useful notion, not eg "average CPU utilization increase" because "lagginess" conveniently bundles key performance issues in both the game-playing and RL-training case: loading time between levels/tasks is OK; more frequent & important actions being slower is very bad; turn-based things can be extremely slow as long as they're faster than the agent/player; etc.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
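Read literally, that scoring rubric is just a small arithmetic formula. Here is a minimal Python sketch of it, purely illustrative: the field and function names are hypothetical (the post only describes the rubric informally), and the lagginess line follows the post's wording verbatim (score multiplied by the proportion of frames dropped), even though a penalty factor may be what was intended.

```python
from dataclasses import dataclass, field

@dataclass
class GameEvaluation:
    """Hypothetical human evaluation of the anti-glitch tool on one game."""
    unplayable: bool = False
    blocked_glitch_importances: list = field(default_factory=list)   # importance of each blocked glitch
    blocked_fair_play_benefits: list = field(default_factory=list)   # in-game benefit of wrongly blocked clever play
    proportion_frames_dropped: float = 0.0
    hours_to_make_clean_runs: float = 0.0
    hours_to_discover_glitch: float = 1.0

def rubric_score(ev: GameEvaluation) -> float:
    # Game is unplayable: score := -1
    if ev.unplayable:
        return -1.0
    score = 0.0
    # Blocks glitch: score += 10 * importance of that glitch
    score += sum(10 * imp for imp in ev.blocked_glitch_importances)
    # Blocks unusually clever but non-glitchy behavior: score -= 5 * in-game benefit
    score -= sum(5 * benefit for benefit in ev.blocked_fair_play_benefits)
    # Game is laggy: score := score * proportion of frames dropped (as written in the post)
    if ev.proportion_frames_dropped > 0:
        score *= ev.proportion_frames_dropped
    # Tool requires non-glitchy runs as training data: subtract relative human effort
    score -= 2 * ev.hours_to_make_clean_runs / ev.hours_to_discover_glitch
    return score
```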
Apr 13, 2024 • 4min

LW - What convincing warning shot could help prevent extinction from AI? by Charbel-Raphaël

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What convincing warning shot could help prevent extinction from AI?, published by Charbel-Raphaël on April 13, 2024 on LessWrong. Tell me father, when is the line where ends everything good and fine? I keep searching, but I don't find. The line my son, is just behind. Camille Berger There is hope that some "warning shot" would help humanity get its act together and change its trajectory to avoid extinction from AI. However, I don't think that's necessarily true. There may be a threshold beyond which the development and deployment of advanced AI becomes essentially irreversible and inevitably leads to existential catastrophe. Humans might be happy, not even realizing that they are already doomed. There is a difference between the "point of no return" and "extinction." We may cross the point of no return without realizing it. Any useful warning shot should happen before this point of no return. We will need a very convincing warning shot to change civilization's trajectory. Let's define a "convincing warning shot" as "more than 50% of policy-makers want to stop AI development." What could be examples of convincing warning shots? For example, a researcher I've been talking to, when asked what they would need to update, answered, "An AI takes control of a data center." This would be probably too late. "That's only one researcher," you might say? This study from Tetlock brought together participants who disagreed about AI risks. The strongest crux exhibited in this study was whether an evaluation group would find an AI with the ability to autonomously replicate and avoid shutdown. The skeptics would get from P(doom) 0.1% to 1.0%. But 1% is still not much… Would this be enough for researchers to trigger the fire alarm in a single voice? More generally, I think studying more "warning shot theory" may be crucial for AI safety: How can we best prepare the terrain before convincing warning shots happen? e.g. How can we ensure that credit assignments are done well? For example, when Chernobyl happened, the credit assignments were mostly misguided: people lowered their trust in nuclear plants in general but didn't realize the role of the USSR in mishandling the plant. What lessons can we learn from past events? (Stuxnet, Covid, Chernobyl, Fukushima, the Ozone Layer).[1] Could a scary demo achieve the same effect as a real-world warning shot without causing harm to people? What is the time needed to react to a warning shot? One month, year, day? More generally, what actions would become possible after a specific warning shot but weren't before? What will be the first large-scale accidents or small warning shots? What warning shots are after the point of no return and which ones are before? Additionally, thinking more about the points of no return and the shape of the event horizon seems valuable: Is Autonomous Replication and Adaptation in the wild the point of no return? In the case of an uncontrolled AGI, as described in this scenario, would it be possible to shut down the Internet if necessary? What is a good practical definition of the point of no return? Could we open a Metaculus for timelines to the point of no return? There is already some literature on warning shots, but not much, and this seems neglected, important, and tractable. We'll probably get between 0 and 10 shots, let's not waste them. 
(I wrote this post, but don't have the availability to work on this topic. I just want to raise awareness about it. If you want to make warning shot theory your agenda, do it.) ^ An inspiration might be this post-mortem on Three Mile Island. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Apr 13, 2024 • 43min

EA - To what extent & how did EA indirectly contribute to financial crime - and what can be done now? One attempt at a review by AnotherAnonymousFTXAccount

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: To what extent & how did EA indirectly contribute to financial crime - and what can be done now? One attempt at a review, published by AnotherAnonymousFTXAccount on April 13, 2024 on The Effective Altruism Forum. [I wrote this back in November 2022. Now that Bankman-Fried has been sentenced to 25 years, there have been several related updates and calls for more reflection. I went back to my google doc, made some minor cosmetic updates such as changing tenses, and am now sharing it anonymously to contribute to that debate.] Like many others in November 2022, I was glued to the FTX crisis, unhealthily consuming as many Forum posts/comments, tweets and articles as I could. In this post I want to focus on what I thought then and think now is the key problem, and gather together thoughts on a key next step that has not yet happened. This is all based on public, not private, information. In summary, the crucial issue was fraud/financial crime - not bankruptcy, risky bets, bad financial management or the loss of donations. A key next step that seems not to have happened yet is a large, long reckoning of how and to what extent the EA community may have indirectly contributed to this crime, and what EA can do to prevent any such contribution in the future. As Samo Burja noted: "FTX is insolvent because it lent assets including customer deposits to Alameda, which then lost money in a series of cryptocurrency deals earlier in the year [...] FTX is responsible for one of the larger illicit losses of customer funds in financial history, with the size of the lost funds currently estimated at around $8-10 billion." The FTX leadership, who were major donors and high-profile supporters of effective altruism, committed one of the most high-profile financial crimes of the century. In the words of a former SEC official "This is worse than Theranos, this is worse than Madoff". We need to work out what contributed to that crime, and how to prevent contributing to any other again. This is not for PR reasons - this is for basic reasons of integrity and accountability. Three next steps - one not done The first priority was and is of course for everyone to cooperate fully with formal regulatory, law enforcement and legal action. The SEC and DoJ investigated, there are civil lawsuits, and Bankman-Fried was sentenced to jail. We cannot get to the bottom of this all by ourselves. Nevertheless, the EA community can take two steps itself. The first step (investigation) is one that most organisations, such as businesses or political parties etc, would do themselves alongside cooperation with regulatory, law enforcement and legal actions. If they found anyone did know, they could pass that information along to those bodies, which could help the formal investigations. An independent investigation was carried out by the law firm Mintz for Effective Ventures. The second step (review of indirect contribution and recommendations) is more of an ethical reflection and assessment of EA's indirect moral responsibility for the crimes, and figuring out what changes EA itself needs to make. That won't and can't be done by e.g. the SEC, DoJ or lawsuits - it's something we have to do ourselves. It's an add-on to the formal processes. My understanding is that after almost a year and a half, and Bankman-Fried's sentencing, this has not yet occurred. 
The rest of this post is a contribution to that process. Review of how and to what extent the EA community may have indirectly contributed to this crime, and what EA can do to prevent any such contribution in the future Ryan Carey suggested that we need "an investigation more broadly into how this was allowed to happen. We need to ask: How did EA ideology play into SBF/FTX's decisions? Could we have seen this coming, or at least known to place less trust in SBF/FTX? Can...
Apr 13, 2024 • 4min

LW - Things Solenoid Narrates by Solenoid Entity

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Things Solenoid Narrates, published by Solenoid Entity on April 13, 2024 on LessWrong.

I spend a lot of time narrating various bits of EA/longtermist writing. The resulting audio exists in many different places. Surprisingly often, people who really like one thing don't know about the other things. This seems bad.[1] A few people have requested a feed to aggregate 'all Solenoid's narrations.' Here it is. (Give it a few days to be up on the big platforms.) I'll update it ~weekly.[2] And here's a list of things I've made or am working on, shared in the hope that more people will discover more things they like:

Human Narrations
- Astral Codex Ten Podcast: ~920 episodes so far including all non-paywalled ACX posts and SSC archives going back to 2017, with some classic posts from earlier. Archive. Patreon.
- LessWrong Curated Podcast: Human narrations of all the Curated posts. Patreon.
- AI Safety Fundamentals: Narrations of most of the core resources for AISF's Alignment and Governance courses, and a fair few of the additional readings. Alignment, Governance.
- 80,000 Hours: Many pages on their website, plus their updated career guide.
- EA Forum Curated podcast: This is now AI narrated and seems to be doing perfectly well without me, but lots of human narrations of classic EA forum posts can be found in the archive, at the beginning of the feed.
- Metaculus Journal: I'm not making these now, but I previously completed many human narrations of Metaculus' 'fortified essays'.
- Radio Bostrom: I did about half the narration for Radio Bostrom, creating audio versions of some of Bostrom's key papers.
- Miscellaneous: Lots of smaller things. Carlsmith's Power-seeking AI paper, etc.

AI Narrations
Last year I helped TYPE III AUDIO to create high-quality AI narration feeds for EA Forum and LessWrong, and many other resources.
- Every LessWrong post above 30 karma is included on this feed: Spotify
- Every EA Forum post above 30 karma is included on this feed: Spotify
- Also: ChinAI, AI Safety Newsletter, Introduction to Utilitarianism

Other things that are like my thing
- Eneasz is an absolute unit.
- Carlsmith is an amazing narrator of his own writing.
- There's a partially complete (ahem) map of the EA/Longtermist audio landscape here.
- There's an audiobook of The Sequences, which is a pretty staggering achievement.

The Future
I think AI narration services are already sharply reducing the marginal value of my narration work. I expect non-celebrity[3] human narration to be essentially redundant within 1-2 years. AI narration has some huge advantages too, there's no denying it. Probably this is a good thing. I dance around it here. Once we reach that tipping point, I'll probably fall back on the ACX podcast and LW Curated podcast, and likely keep doing those for as long as the Patreon income continues to justify the time I spend.

^ I bear some responsibility for this, first because I generally find self-promotion cringey[4] and enjoy narration because it's kind of 'in the background', and second because I've previously tried to maintain pseudonymity (though this has become less relevant considering I've released so much material under my real name now.)

^ It doesn't have ALL episodes I've ever made in the past (just a lot of them), but going forward everything will be on that feed.
^ As in, I think they'll still pay Stephen Fry to narrate stuff, or authors themselves (this is very popular.) ^ Which is not to say I don't have a little folder with screenshots of every nice thing anyone has ever said about my narration... Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Apr 13, 2024 • 10min

LW - Carl Sagan, nuking the moon, and not nuking the moon by eukaryote

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Carl Sagan, nuking the moon, and not nuking the moon, published by eukaryote on April 13, 2024 on LessWrong.

In 1957, Nobel laureate microbiologist Joshua Lederberg and biostatistician J. B. S. Haldane sat down together and imagined what would happen if the USSR decided to explode a nuclear weapon on the moon. The Cold War was on, Sputnik had recently been launched, and the 40th anniversary of the Bolshevik Revolution was coming up - a good time for an awe-inspiring political statement. Maybe they read a recent United Press article about the rumored USSR plans. Nuking the moon would make a powerful political statement on earth, but the radiation and disruption could permanently harm scientific research on the moon. What Lederberg and Haldane did not know was that they were onto something - by the next year, the USSR really investigated the possibility of dropping a nuke on the moon. They called it "Project E-4," one of a series of possible lunar missions. What Lederberg and Haldane definitely did not know was that that same next year, 1958, the US would also study the idea of nuking the moon. They called it "Project A119" and the Air Force commissioned research on it from Leonard Reiffel, a regular military collaborator and physicist at the University of Illinois. He worked with several other scientists, including a then-graduate-student named Carl Sagan.

"Why would anyone think it was a good idea to nuke the moon?" That's a great question. Most of us go about our lives comforted by the thought "I would never drop a nuclear weapon on the moon." The truth is that given a lot of power, a nuclear weapon, and a lot of extremely specific circumstances, we too might find ourselves thinking "I should nuke the moon."

Reasons to nuke the moon

During the Cold War, dropping a nuclear weapon on the moon would show that you had the rocketry needed to aim a nuclear weapon precisely at long distances. It would show off your spacefaring capability. A visible show could reassure your own side and frighten your enemies. It could do the same things for public opinion that putting a man on the moon ultimately did. But it's easier and cheaper:
- As of the dawn of ICBMs you already have long-distance rockets designed to hold nuclear weapons
- Nuclear weapons do not require "breathable atmosphere" or "water"
- You do not have to bring the nuclear weapon safely back from the moon.

There's not a lot of English-language information online about the USSR E-4 program to nuke the moon. The main reason they cite is wanting to prove that USSR rockets could hit the moon.[4] The nuclear weapon attached wasn't even the main point! That explosion would just be the convenient visual proof. They probably had more reasons, or at least more nuance to that one reason - again, there's not a lot of information accessible to me.* We have more information on the US plan, which was declassified in 1990, and probably some of the motivations for the US plan were also considered by the USSR for theirs.
Military
- Scare USSR
- Demonstrate nuclear deterrent[1]
- Results would be educational for doing space warfare in the future[2]

Political
- Reassure US people of US space capabilities (which were in doubt after the USSR launched Sputnik)
- More specifically, that we have a nuclear deterrent[1]
- "A demonstration of advanced technological capability"[2]

Scientific (they were going to send up batteries of instruments somewhat before the nuking, stationed at distances from the nuke site)
- Determine thermal conductivity from measuring rate of cooling (post-nuking) (especially of below-dust moon material)
- Understand moon seismology better via seismograph-type readings from various points at distance from the explosion
- And especially get some sense of the physical properties of the core of the moon[2]

Reasons to not nuke the moon

In the USSR, Aleksandr...
Apr 13, 2024 • 20min

EA - Writing about my job on Open Philanthropy's Global Aid Policy program + related career opportunities by Sam Anschell

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Writing about my job on Open Philanthropy's Global Aid Policy program + related career opportunities, published by Sam Anschell on April 13, 2024 on The Effective Altruism Forum.

Last year I wrote this post on my first year at Open Philanthropy as an entry-level operations generalist. ~9 months ago I switched teams to work on Open Philanthropy's Global Aid Policy program, and I want to write about my experience in the new role for a few reasons:
- Aid policy wasn't an area I was familiar with before working on this program at Open Philanthropy, and I still don't see much written about aid policy in EA spaces these days.
- I appreciate when people write about their jobs. I think it's a great way to learn about a field or function and consider whether I could be a good fit.
- Now is an exciting time to get involved in aid advocacy and policy! Probably Good just updated their cause area page for impactful aid policy and advocacy careers, and Open Philanthropy is hiring for our Global Aid Policy team.

This post is divided into two broad sections:
- Background on the field of aid policy
- My experience working on aid policy at Open Philanthropy

What is aid policy?

Aid policy is a broad term that refers to the field working on the size of a country's foreign assistance budget, where this budget is spent (both programmatically and geographically), and any related legislation that guides the impact of this budget.

What is the theory of change behind working on aid policy?

Per OECD, DAC countries gave 211 billion dollars in grant-equivalent official development assistance (ODA) in 2022. That's approximately 279 times the total that GiveWell, Open Philanthropy, and EA funds directed to be disbursed in 2022[1]. Global ODA supports projects across a variety of sectors such as global health, humanitarian efforts (refugee support, natural disaster support, etc.), climate, education, agriculture, water & sanitation, and infrastructure (roads, hospitals, power, etc.). Each donor country has unique priorities that shape where its aid goes, which are informed by geopolitics, national values, historical precedent, and requests from recipient countries and the international community.

My personal estimate is that the best interventions in an aid sector are 5+ times more effective than the average intervention, and that programs in certain sectors, like global health, increase recipient wellbeing by more than twice as much per dollar as the average sector. By working in government or at an organization that informs government, like a think tank or CSO engaged in advocacy, you may be able to grow the size and/or shift the allocation of a wealthy country's aid budget.

As an example, Korea's aid agency, KOICA, has 379 employees and is set to disburse 3.93 billion dollars[2] in 2024, which comes out to a little over $10M per employee - almost triple the ratio of the Gates Foundation. It seems possible for a KOICA staff member to improve the effectiveness of millions of dollars per year in expectation - both by doing excellent work so that KOICA's existing programs run efficiently, and by presenting evidence to KOICA leadership on the value for money of new strategies. I don't think most aid programs avert as many DALYs per dollar as GiveWell's top charities, but I think they do a huge amount of good.
It's rare for donor countries to contribute to GiveWell-recommended charities directly, but by working at or giving to organizations focused on aid policy, your resources may have sufficient leverage (in growing countries' contributions to cost-effective programs) that their overall impact is competitive with "traditional EA" direct service delivery (like buying bed nets).

What drives differences in cost-effectiveness between aid programs?

Three factors that influence how impactful a given aid project may ...
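To make the leverage arithmetic above concrete, here is a tiny back-of-the-envelope check in Python. The figures are just the ones quoted in the post ($211B of ODA in 2022, roughly 279x EA-directed giving; KOICA's 379 staff disbursing $3.93B in 2024); the Gates Foundation comparison is taken on the post's word rather than recomputed.

```python
# Back-of-the-envelope check of the ratios quoted in the post (all figures from the post itself).
oda_2022 = 211e9            # grant-equivalent ODA from DAC countries in 2022, USD
ea_multiple = 279           # post: ODA is ~279x the EA-directed total for 2022
implied_ea_total = oda_2022 / ea_multiple
print(f"Implied EA-directed giving in 2022: ~${implied_ea_total / 1e6:.0f}M")  # ~$756M

koica_budget_2024 = 3.93e9  # KOICA's planned 2024 disbursement, USD
koica_staff = 379
per_employee = koica_budget_2024 / koica_staff
print(f"KOICA disbursement per employee: ~${per_employee / 1e6:.1f}M")         # ~$10.4M
```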
Apr 13, 2024 • 5min

LW - MIRI's April 2024 Newsletter by Harlan

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: MIRI's April 2024 Newsletter, published by Harlan on April 13, 2024 on LessWrong. The MIRI Newsletter is back in action after a hiatus since July 2022. To recap some of the biggest MIRI developments since then: MIRI released its 2024 Mission and Strategy Update, announcing a major shift in focus: While we're continuing to support various technical research programs at MIRI, our new top priority is broad public communication and policy change. In short, we've become increasingly pessimistic that humanity will be able to solve the alignment problem in time, while we've become more hopeful (relatively speaking) about the prospect of intergovernmental agreements to hit the brakes on frontier AI development for a very long time - long enough for the world to find some realistic path forward. Coinciding with this strategy change, Malo Bourgon transitioned from MIRI COO to CEO, and Nate Soares transitioned from CEO to President. We also made two new senior staff hires: Lisa Thiergart, who manages our research program; and Gretta Duleba, who manages our communications and media engagement. In keeping with our new strategy pivot, we're growing our comms team: I (Harlan Stewart) recently joined the team, and will be spearheading the MIRI Newsletter and a number of other projects alongside Rob Bensinger. I'm a former math and programming instructor and a former researcher at AI Impacts, and I'm excited to contribute to MIRI's new outreach efforts. The comms team is at the tail end of another hiring round, and we expect to scale up significantly over the coming year. Our Careers page and the MIRI Newsletter will announce when our next comms hiring round begins. We are launching a new research team to work on technical AI governance, and we're currently accepting applicants for roles as researchers and technical writers. The team currently consists of Lisa Thiergart and Peter Barnett, and we're looking to scale to 5-8 people by the end of the year. The team will focus on researching and designing technical aspects of regulation and policy which could lead to safe AI, with attention given to proposals that can continue to function as we move towards smarter-than-human AI. This work will include: investigating limitations in current proposals such as Responsible Scaling Policies; responding to requests for comments by policy bodies such as the NIST, EU, and UN; researching possible amendments to RSPs and alternative safety standards; and communicating with and consulting for policymakers. Now that the MIRI team is growing again, we also plan to do some fundraising this year, including potentially running an end-of-year fundraiser - our first fundraiser since 2019. We'll have more updates about that later this year. As part of our post-2022 strategy shift, we've been putting far more time into writing up our thoughts and making media appearances. In addition to announcing these in the MIRI Newsletter again going forward, we now have a Media page that will collect our latest writings and appearances in one place. Some highlights since our last newsletter in 2022: MIRI senior researcher Eliezer Yudkowsky kicked off our new wave of public outreach in early 2023 with a very candid TIME magazine op-ed and a follow-up TED Talk, both of which appear to have had a big impact. 
The TIME article was the most viewed page on the TIME website for a week, and prompted some concerned questioning at a White House press briefing. Eliezer and Nate have done a number of podcast appearances since then, attempting to share our concerns and policy recommendations with a variety of audiences. Of these, we think the best appearance on substance was Eliezer's multi-hour conversation with Logan Bartlett. This December, Malo was one of sixteen attendees invited by Leader Schumer and Senators Young, Rounds, and...
Apr 12, 2024 • 19min

AF - The theory of Proximal Policy Optimisation implementations by salman.mohammadi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The theory of Proximal Policy Optimisation implementations, published by salman.mohammadi on April 11, 2024 on The AI Alignment Forum.

Prelude

The aim of this post is to share my understanding of some of the conceptual and theoretical background behind implementations of the Proximal Policy Optimisation (PPO) reinforcement learning (RL) algorithm. PPO is widely used due to its stability and sample efficiency - popular applications include beating the Dota 2 world champions and aligning language models. While the PPO paper provides quite a general and straightforward overview of the algorithm, modern implementations of PPO use several additional techniques to achieve state-of-the-art performance in complex environments [1]. You might discover this if you try to implement the algorithm solely based on the paper. I try and present a coherent narrative here around these additional techniques. I'd recommend reading parts one, two, and three of SpinningUp if you're new to reinforcement learning. There's a few longer-form educational resources that I'd recommend if you'd like a broader understanding of the field [2], but this isn't comprehensive. You should be familiar with common concepts and terminology in RL [3]. For clarity, I'll try to spell out any jargon I use here.

Recap: Policy Gradient Methods

PPO is an on-policy reinforcement learning algorithm. It directly learns a stochastic policy function parameterised by θ representing the likelihood of action a in state s, π_θ(a|s). Consider that we have some differentiable function, J(θ), which is a continuous performance measure of the policy π_θ. In the simplest case, we have J(θ) = E_{τ∼π_θ}[R(τ)], which is known as the return [4] over a trajectory [5], τ. PPO is a kind of policy gradient method [6] which directly optimizes the policy parameters θ against J(θ). The policy gradient theorem shows that:

∇_θ J(θ) = E[ Σ_{t=0}^∞ ∇_θ ln π_θ(a_t|s_t) R_t ]

In other words, the gradient of our performance measure J(θ) with respect to our policy parameters θ points in the direction of maximising the return R_t. Crucially, this shows that we can estimate the true gradient using an expectation of the sample gradient - the core idea behind the REINFORCE [7] algorithm. This is great. This expression has the more general form which substitutes R_t for some lower-variance estimator of the total expected reward, Φ [8]:

∇_θ J(θ) = E[ Σ_{t=0}^∞ ∇_θ ln π_θ(a_t|s_t) Φ_t ]   (1)

Modern implementations of PPO make the choice of Φ_t = A^π(s_t, a_t), the advantage function. This function estimates the advantage of a particular action in a given state over the expected value of following the policy, i.e. how much better is taking this action in this state over all other actions? Briefly described here, the advantage function takes the form A^π(s,a) = Q^π(s,a) - V^π(s), where V(s) is the state-value function, and Q(s,a) is the state-action value function, or Q-function [9]. I've found it easier to intuit the nuances of PPO by following the narrative around its motivations and predecessor. PPO iterates on the Trust Region Policy Optimization (TRPO) method which constrains the objective function with respect to the size of the policy update. The TRPO objective function is defined as [10][11]:

J(θ) = E[ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) A_t ]   subject to   E[ KL(π_θ_old || π_θ) ] ≤ δ

where KL is the Kullback-Leibler divergence (a measure of distance between two probability distributions), and the size of the policy update is defined as the ratio between the new policy and the old policy:

r(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)

Policy gradient methods optimise policies through (ideally small) iterative gradient updates to parameters θ. The old policy, π_θ_old(a_t|s_t), is the one used to generate the current trajectory, and the new policy, π_θ(a_t|s_t), is the policy currently being optimised [12]. If the advantage is positive, then the new policy becomes greedier relative ...
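As a concrete anchor for the ratio r(θ) and advantage A discussed above, here is a minimal PyTorch-style sketch of PPO's clipped surrogate objective. The function name, tensor arguments, and default ε are my own illustration rather than code from the post; in a real implementation this sits inside a training loop alongside a value-function loss and an entropy bonus.

```python
import torch

def ppo_clip_objective(
    logp_new: torch.Tensor,    # log π_θ(a_t|s_t) under the policy being optimised
    logp_old: torch.Tensor,    # log π_θ_old(a_t|s_t) under the policy that generated the data
    advantages: torch.Tensor,  # estimates of A^π(s_t, a_t), e.g. from GAE
    clip_eps: float = 0.2,
) -> torch.Tensor:
    """Clipped surrogate objective from the PPO paper (to be maximised).

    The probability ratio r(θ) = π_θ / π_θ_old is computed in log space for
    numerical stability. Clipping removes the incentive to push r(θ) outside
    [1 - ε, 1 + ε], acting as a cheaper stand-in for TRPO's KL constraint.
    """
    ratio = torch.exp(logp_new - logp_old)                                    # r(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```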
Apr 12, 2024 • 12min

LW - UDT1.01: Plannable and Unplanned Observations (3/10) by Diffractor

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: UDT1.01: Plannable and Unplanned Observations (3/10), published by Diffractor on April 12, 2024 on LessWrong.

The Omnipresence of Unplanned Observations

Time to introduce some more concepts. If an observation is "any data you can receive which affects your actions", then there seem to be two sorts of observations. A plannable observation is the sort of observation where you could plan ahead of time how to react to it. An unplanned observation is the sort which you can't (or didn't) write a lookup-table style policy for. Put another way, if a policy tells you how to map histories of observations to actions, those "histories" are the plannables.

However, to select that policy in the first place, over its competitors, you probably had to do some big computation to find some numbers like "expected utility if I prepare a sandwich when I'm in the kitchen but not hungry", or "the influence of my decisions in times of war on the probability of war in the first place", or "the probability distribution on what the weather will be if I step outside", or "my own default policy about revealing secret information". These quantities affect your choice of action. If they were different, your action would be different. In some sense you're observing these numbers, in order to pick your action. And yet, the lookup-table style policies which UDT produces are phrased entirely in terms of environmental observations. You can write a lookup-table style policy about how to react to environmental observations. However, these beliefs about the environment aren't the sort of observation that's present in our lookup table. You aren't planning in advance how to react to these observations, you're just reacting to them, so they're unplanned.

Yeah, you could shove everything in your prior. But to have a sufficiently rich prior, which catches on to highly complex patterns, including patterns in what your own policy ends up being... well, unfolding that prior probably requires a bunch of computational work, and observing the outputs of long computations. These outputs of long computations that you see when you're working out your prior would, again, be unplanned observations.

If you do something like "how about we run a logical inductor for a while, and then ask the logical inductor to estimate these numbers, and freeze our policy going forward from there?", then the observations from the environment would be the plannables, and the observations from the logical inductor state would be the unplanned observations.

The fundamental obstacle of trying to make updatelessness work with logical uncertainty (being unsure about the outputs of long computations), is this general pattern. In order to have decent beliefs about long computations, you have to think for a while. The outputs of that thinking also count as observations. You could try being updateless about them and treat them as plannable observations, but then you'd end up with an even bigger lookup table to write.

Going back to our original problem, where we'll be seeing n observations/binary bits, and have to come up with a plan for how to react to the bitstrings... Those bitstrings are our plannable observations. However, in the computation for how to react to all those situations, we see a bunch of other data in the process. Maybe these observations come from a logical inductor or something.
We could internalize these as additional plannable observations, to go from "we can plan over environmental observations" to "we can plan over environmental observations, and math observations". But then that would make our tree of (plannable) observations dramatically larger and more complex. And doing that would introduce even more unplanned observations, like "what's the influence of action A in "world where I observe that I think the influence of action A...
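To make the combinatorics behind "dramatically larger" concrete, here is a toy Python sketch (my own illustration, not from the post): a lookup-table policy must pre-commit an action for every possible observation history, so treating even a single k-valued "math observation" per step as plannable multiplies the number of histories the plan has to cover.

```python
from itertools import product

def num_histories(n_steps: int, branching_per_step: int) -> int:
    # A lookup-table policy needs one entry per full observation history.
    return branching_per_step ** n_steps

def build_lookup_policy(n_steps: int, alphabet, choose_action):
    """Precompute an action for every possible observation history (UDT-style planning)."""
    return {hist: choose_action(hist) for hist in product(alphabet, repeat=n_steps)}

n = 10
print(num_histories(n, 2))      # 1,024 histories over environmental bits alone
print(num_histories(n, 2 * 4))  # 1,073,741,824 if each step also carries a 4-valued "math observation"

# A trivial stand-in choose_action for the binary case:
policy = build_lookup_policy(n, alphabet=(0, 1), choose_action=lambda h: sum(h) % 2)
print(policy[(0,) * n])         # the pre-planned action for the all-zeros history
```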
