

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

May 29, 2024 • 14min
AF - Apollo Research 1-year update by Marius Hobbhahn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apollo Research 1-year update, published by Marius Hobbhahn on May 29, 2024 on The AI Alignment Forum.
This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research
About Apollo Research
Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old.
Executive Summary
For the UK AI Safety Summit, we developed a demonstration that Large Language Models (LLMs) can strategically deceive their primary users when put under pressure. The accompanying paper was referenced by experts and the press (e.g. the AI Insight Forum, BBC, Bloomberg) and accepted for oral presentation at the ICLR LLM agents workshop.
The evaluations team is currently working on capability evaluations for precursors of deceptive alignment, scheming model organisms, and a responsible scaling policy (RSP) on deceptive alignment. Our goal is to help governments and AI developers understand, assess, and address the risks of deceptively aligned AI systems.
The interpretability team published three papers: an improved training method for sparse dictionary learning, a new conceptual framework for 'loss-landscape-based interpretability', and an associated empirical paper. We are beginning to explore concrete white-box evaluations for deception and continue to work on fundamental interpretability research.
The governance team communicates our technical work to governments (e.g., on evaluations, AI deception and interpretability), and develops recommendations around our core research areas for international organizations and individual governments.
Apollo Research works with several organizations, including partnering with the UK AISI and being a member of the US AISI Consortium. As part of our partnership with UK AISI, we were contracted to develop deception evaluations. Additionally, we engage with various AI labs, e.g. red-teaming OpenAI's fine-tuning API before deployment and consulting on the deceptive alignment section of an AI lab's RSP.
Like any organization, we have also encountered various challenges. Some projects proved overly ambitious, resulting in delays and inefficiencies. We would have benefitted from having dedicated regular exchanges with senior official external advisors earlier. Additionally, securing funding took more time and effort than expected.
We have more room for funding. Please reach out if you're interested.
Completed work
Evaluations
For the UK AI Safety Summit, we developed a demonstration that LLMs can strategically deceive their primary users when put under pressure. It was referenced by experts and the press (e.g. Yoshua Bengio's statement for Senator Schumer's AI Insight Forum, BBC, Bloomberg, US Securities and Exchange Commission Chairperson Gary Gensler's speech on AI and law, and many other media outlets). It was accepted for an oral presentation at this year's ICLR LLM agents workshop.
In our role as an independent third-party evaluator, we work with a range of organizations. For example, we were contracted by the UK AISI to build deceptive capability evaluations with them. We also worked with OpenAI to red-team their fine-tuning API before deployment.
We published multiple conceptual research pieces on evaluations, including A Causal Framework for AI Regulation and Auditing and A Theory of Change for AI Auditing. Furthermore, we published conceptual clarifications on deceptive alignment and strategic deception.
We were part of multiple collaborations, including:
SAD: a situational awareness benchmark with researchers from Owain Evans's group, led by Rudolph Laine (forthcoming).
Black-Box Access is Insufficient for Rigorous...

May 29, 2024 • 1h 13min
EA - Modelling the outcomes of animal welfare interventions: One possible approach to the trade-offs between subjective experiences by Animal Ask
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Modelling the outcomes of animal welfare interventions: One possible approach to the trade-offs between subjective experiences, published by Animal Ask on May 29, 2024 on The Effective Altruism Forum.
TLDR: To guide our research on new interventions for the animal advocacy movement, we need a framework that allows us to quantify the subjective experiences of animals. For example, if we were comparing two campaigns - say, a) phasing out fast-growing breeds of broiler chickens and b) implementing more humane pesticides for use on crops - we would need to produce a quantitative estimate of the potential impact of these two campaigns.
Note that we also consider many qualitative factors - this quantitative estimation represented by our framework is only one piece of information among many that we use.
This report is also available on the Animal Ask website.
Downloadable version in PDF
Executive summary
At Animal Ask, we spent most of 2023 conducting prioritisation research to identify the most promising goals and strategies for the animal advocacy movement. The aim of that project was to identify goals and strategies that could be more impactful than the current leading campaigns (e.g. cage-free campaigns). If we succeed, we can unlock new opportunities for the movement to help even more animals.
This article outlines one component of the methodology we used for that research. Specifically, to guide our research on new interventions for the animal advocacy movement, we needed a framework that allows us to quantify the subjective experiences of animals.
For example, if we were comparing two campaigns - say, a) phasing out fast-growing breeds to reduce suffering in broiler chickens and b) implementing more humane pesticides to reduce suffering in wild insects killed on agricultural land - we would need to produce a quantitative estimate of the potential impact of these two campaigns.
In essence, our framework allows us to systematically compare different campaign opportunities in how good they are for animals - and these comparisons can be made across different species, across different intensities of experience (e.g. mild vs extreme suffering), and across both positive and negative experiences.
The framework allows us to be clear and transparent about the worldviews, philosophical positions, and/or empirical assumptions that are used in drawing conclusions about particular campaigns.
In this report, we outline our framework. The key characteristics of the framework are:
The framework is cumulative, meaning that it considers the total duration of an animal's experience over time. In fact, our framework is heavily based on the existing Cumulative Pain framework, which was developed by the researchers Wladimir J. Alonso and Cynthia Schuck-Paim.
The framework allows experiences to be adjusted. We can place different moral weights on more intense experiences - for example, we might think that preventing extreme suffering is more important than preventing mild suffering, all else being equal.
The framework incorporates the fact that uncertainty is high, and society has simply not made much progress on many of the core questions relating to trade-offs between different subjective experiences.
We advise serious caution when applying the ideas from our framework (or any framework on this topic). As any statistician knows, no model is a perfect representation of reality. When we're looking at the specific question of moral trade-offs between different subjective experiences, things get seriously murky. We view this framework as a tool to help us think quantitatively, subject to the serious limitations of society's knowledge.
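To make this concrete, here is a purely illustrative Python sketch of the kind of cumulative, intensity-weighted calculation such a framework involves. The pain categories follow the spirit of the Cumulative Pain framework, but all numbers and weights below are invented for illustration and are not Animal Ask's actual parameters.

```python
# Illustrative only: invented numbers, not Animal Ask's actual model.

# Hours of each pain intensity experienced per animal over its lifetime,
# loosely following the Cumulative Pain intensity categories.
hours_per_animal = {"annoying": 300.0, "hurtful": 80.0, "disabling": 10.0, "excruciating": 0.2}

# Moral weight placed on each intensity (e.g. weighting extreme suffering
# far more heavily than mild suffering, all else being equal).
intensity_weights = {"annoying": 1.0, "hurtful": 10.0, "disabling": 100.0, "excruciating": 10_000.0}

animals_affected_per_year = 1_000_000
fraction_of_suffering_averted = 0.3  # share of this suffering the campaign removes

weighted_hours = sum(hours_per_animal[k] * intensity_weights[k] for k in hours_per_animal)
impact = weighted_hours * animals_affected_per_year * fraction_of_suffering_averted
print(f"Weighted suffering-hours averted per year: {impact:,.0f}")
```

Comparing two candidate campaigns then amounts to running this calculation for each, under explicitly stated weights and assumptions, and seeing how sensitive the ranking is to those choices.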
We chose to describe our framework in this article so we can be transparent about our research methods, allowing others to see how we reach particular conclusions a...

May 29, 2024 • 2min
EA - Helen Toner (ex-OpenAI board member): "We learned about ChatGPT on Twitter." by defun
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Helen Toner (ex-OpenAI board member): "We learned about ChatGPT on Twitter.", published by defun on May 29, 2024 on The Effective Altruism Forum.
Helen Toner: "For years, Sam had made it really difficult for the board to actually do that job by withholding information, misrepresenting things that were happening at the company, in some cases, outright lying to the board.
At this point, everyone always says, What? Give me some examples. I can't share all the examples, but to give a sense of the thing that I'm talking about, it's things like when ChatGPT came out, November 2022, the board was not informed in advance about that. We learned about ChatGPT on Twitter.
Sam didn't inform the board that he owned the OpenAI Startup Fund, even though he constantly was claiming to be an independent board member with no financial interest in the company.
On multiple occasions, he gave us inaccurate information about the small number of formal safety processes that the company did have in place, meaning that it was basically impossible for the board to know how well those safety processes were working or what might need to change.
Then a last example that I can share because it's been very widely reported relates to this paper that I wrote, which has been, I think, way overplayed in the press. The problem was that after the paper came out, Sam started lying to other board members in order to try and push me off the board.
It was another example that just really damaged our ability to trust him, and actually only happened in late October last year when we were already talking pretty seriously about whether we needed to fire him.
There's more individual examples. For any individual case, Sam could always come up with some innocuous-sounding explanation of why it wasn't a big deal or it was misinterpreted or whatever.
But the end effect was that after years of this thing, all four of us who fired him came to the conclusion that we just couldn't believe things that Sam was telling us. That's a completely unworkable place to be in as a board, especially a board that is supposed to be providing independent oversight over the company, not just helping the CEO to raise more money."
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

May 29, 2024 • 8min
EA - EA Consulting Network is now Consultants for Impact by Consultants For Impact
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Consulting Network is now Consultants for Impact, published by Consultants For Impact on May 29, 2024 on The Effective Altruism Forum.
Summary
After spending the past 5 years advising hundreds of consultants on how to optimize their career and donations for impact, seeding EA workplace groups within McKinsey, BCG, Bain, and beyond, and supporting 30+ career transitions, the EA Consulting Network is evolving into Consultants for Impact. We're excited to continue helping high-impact teams and consulting talent to find each other!
About Consultants for Impact
Since 2019, Consultants for Impact (formerly the Effective Altruism Consulting Network) has been dedicated to supporting current, former, and prospective consultants in strategically optimizing the social impact of their careers.
We do this by:
Prioritizing consultants who are highly committed to impact and are at career inflection points: We identify consultants who we believe are well-placed to have a high-impact career and support them with Career Conversations, introductions, retreats, and curated opportunities.
Inspiring significant action by amplifying those who led by example: We emphasize the importance of high-impact work to business consultants and amplify motivating stories of former consultants engaged in direct work to make less common impact-oriented career paths feel tangible and socially viable.
Connecting consultants to meaningful and analytically driven work: We strive to outcompete other consulting exit-career paths by connecting our members with career opportunities that are more value-aligned than typical business roles and more skill-aligned (problem-solving focused, analytical, fast-paced, etc.) than typical social impact oriented roles.
Why Focus on Consultants?
We believe that business and strategy consultants have an outsized potential for professional impact as a talent pool, as supported by:
Demand for Consulting Skills: We've heard from hiring managers at EA organizations that they are bottlenecked for talent with strong management, strategic, and operational capabilities (see e.g. the EA Community Survey (MCF 2023) and the 2019 Talent Gap Survey), skills that consultants disproportionately possess.
Pre-vetted Talent: Strategy consulting is a very competitive field, with a demanding process of business case interviews and sometimes additional tests of decision-making and general intelligence. Firms often have over 100 applicants per place and the largest firms expend significant resources in their recruitment processes.
Desire for Career Changes: The churn in consulting is comparatively high and as such many candidates are eager to identify their next career move after only one or two years in their role. Given the relatively short tenure of the average consultant, Consultants for Impact can leverage the steady stream of trained and talented folks, often a few years out of university, and support them in building longer career paths to impact.
Track Record of Consultants in EA: Many successful senior EAs are former consultants (ex. Habiba Banu, Rob Gledhill, Joan Gass, Zach Robinson, Paige Henchen) and we've already seen our hypothesis validated by the 30+ consultants whose career transitions into high-impact roles we've supported.
Evolution to Consultants for Impact
We are evolving into Consultants for Impact because we believe this new brand will better enable us to achieve our mission. Our new name gives us greater brand independence and control and provides a more professional presentation. It also enhances our capacity to accurately reflect the diverse philosophical frameworks (including, but not exclusively, Effective Altruism) that can benefit our work.
We are excited about this transition and believe it will enable us to better support and inspire consultants dedicated to making a significa...

May 29, 2024 • 10min
EA - Against a Happiness Ceiling: Replicating Killingsworth & Kahneman (2022) by charlieh943
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against a Happiness Ceiling: Replicating Killingsworth & Kahneman (2022), published by charlieh943 on May 29, 2024 on The Effective Altruism Forum.
Epistemic Status: somewhat confident: I may have made coding mistakes. R code is here if you feel like checking.
Introduction:
In their 2022 article, Matthew Killingsworth and Daniel Kahneman looked to reconcile the results from two of their papers. Kahneman (2010) had reported that above a certain income level ($75,000 USD), extra income had no association with increases in individual happiness.
Killingsworth (2021) suggested that it did.
Kahneman and Killingsworth (henceforth KK) claimed they had resolved this conflict. They hypothesized correctly that:
1. There is an unhappy minority, whose unhappiness diminishes with rising income up to a threshold, then shows no further progress
2. In the happier majority, happiness continues to rise with income even in the high range of incomes
1) refers to Kahneman's leveling-off finding; 2) refers to Killingsworth's continued log-linear finding.
(More info on this discussion can be found in Spencer Greenberg's thoroughly enjoyable blog post. Spencer goes into the correlation/causation debate which I don't go into here.)
Here, I reproduce these findings from KK (2022) using a different dataset - the 2012 Health Survey for England. This dataset is smaller (n= 7,179, rather than 33,391), and uses a different collection technique. The NHS's wellbeing variable was collated from 14 individual responses to a self-questionnaire, whilst the KK variable uses experience-sampling: i.e. pinging participants at random points in the day.
KK regard experience sampling as the gold-standard, and whilst the NHS data had around 100,000 data points on wellbeing, the KK data had over 1.7 million experience-sampling reports.
So, you might be thinking: why bother to replicate the paper with worse NHS data? The first answer is a mixture of convenient access through my uni and personal curiosity.
Though I also think this analysis has some useful insights for others. The KK paper looked solely at hedonic wellbeing - asking participants, "how do you feel right now?" - whereas the wellbeing scale here (WEMWBS) looks to capture both hedonic and eudaimonic aspects, including questions about immediate emotional experience and life satisfaction. (More about the variable is here.[1]) Given this, taken together with other minor differences - e.g. surveys versus experience pings; UK adults, rather than US employees - I felt it would be somewhat surprising if the KK findings replicated on the NHS data.
… And, they did.
Summary of Findings
Overall, I found similar results to Killingsworth and Kahneman (2022) in 4 notable respects:
1. There is a linear relationship between log income and median happiness, although this relationship is small
2. For a large proportion of the population (i.e. at p=50, 70, 85), this log-linear relationship continued above a fairly high income threshold (£50,000 in 2012), and:
3. For a small minority (p=85) the slope of the log-linear graph increased slightly (although not significantly)
4. For a large minority of the population (p = 5, 10, … 35), extra income had no association with increased levels of happiness.
This constitutes a small update towards the KK findings, and thus against the view that there is a 'happiness ceiling' which the happiest in society have already reached.
Results
Before going into the results, I want to clarify two terms here:
1. Relative Leveling Off: the slope of the happiness against (log) income plot decreases after a certain income threshold.
2. Absolute Leveling Off: the slope of the happiness against (log) income plot is indistinguishable from 0 after a certain income threshold.
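As a rough illustration of how one might check for these two kinds of leveling off, here is a simplified Python sketch (not the R code linked above); the column names, the £50,000 split point, and the simple below/above-threshold comparison are illustrative assumptions rather than the exact specification used in the analysis.

```python
# Sketch: compare the slope of wellbeing on log income below vs above an
# income threshold, at a chosen wellbeing quantile. Column names ("income",
# "wellbeing") and the threshold are assumptions for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def slopes_around_threshold(df: pd.DataFrame, q: float, threshold: float = 50_000):
    df = df.assign(log_income=np.log(df["income"]))
    below = smf.quantreg("wellbeing ~ log_income", df[df["income"] <= threshold]).fit(q=q)
    above = smf.quantreg("wellbeing ~ log_income", df[df["income"] > threshold]).fit(q=q)
    return below.params["log_income"], above.params["log_income"]

# "Absolute leveling off" at quantile q: the above-threshold slope is
# indistinguishable from zero. "Relative leveling off": the above-threshold
# slope is smaller than the below-threshold slope.
```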
The 2022 paper uses an income threshold of $100,000. Accounting for historic e...

May 28, 2024 • 5min
LW - Being against involuntary death and being open to change are compatible by Andy McKenzie
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Being against involuntary death and being open to change are compatible, published by Andy McKenzie on May 28, 2024 on LessWrong.
In a new post, Nostalgebraist argues that "AI doomerism has its roots in anti-deathist transhumanism", representing a break from the normal human expectation of mortality and generational change.
They argue that traditionally, each generation has accepted that they will die but that the human race as a whole will continue evolving in ways they cannot fully imagine or control.
Nostalgebraist argues that the "anti-deathist" view, however, anticipates a future where "we are all gonna die" is no longer true -- a future where the current generation doesn't have to die or cede control of the future to their descendants.
Nostalgebraist sees this desire to "strangle posterity" and "freeze time in place" by making one's own generation immortal as contrary to human values, which have always involved an ongoing process of change and progress from generation to generation.
This argument reminds me of Elon Musk's common refrain on the topic: "The problem is when people get old, they don't change their minds, they just die. So, if you want to have progress in society, you got to make sure that, you know, people need to die, because they get old, they don't change their mind." Musk's argument is certainly different and I don't want to equate the two.
I'm just bringing this up because I wouldn't bother responding to Nostalgebraist unless this was a common type of argument.
In this post, I'm going to dig into Nostalgebraist's anti-anti-deathism argument a little bit more. I believe it is simply empirically mistaken. Key inaccuracies include:
1: The idea that people in past "generations" universally expected to die is wrong. Nope. Belief in life after death or even physical immortality has been common across many cultures and time periods.
Quantitatively, large percentages of the world today believe in life after death:
In many regions, this belief was also much more common in the past, when religiosity was higher. Ancient Egypt, historical Christendom, etc.
2: The notion that future humans would be so radically different from us that replacing humans with any form of AIs would be equivalent is ridiculous.
This is just not close to my experience when I read historical texts. Many authors seem to have extremely relatable views and perspectives.
To take the topical example of anti-deathism, among secular authors, read, for example, Francis Bacon, Benjamin Franklin, or John Hunter.
I am very skeptical that everyone from the past would feel so inalienably out of place in our society today, once they had time (and they would have plenty of time) to get acquainted with new norms and technologies. We still have basically the same DNA, gametes, and in utero environments.
3: It is not the case that death is required for cultural evolution. People change their minds all the time. Cultural evolution happens all the time within people's lifespans. Cf: views on gay marriage, the civil rights movement, environmentalism, climate change mitigation, etc.
This is especially the case because in the future we will likely develop treatments for the decline in neuroplasticity that can (but does not necessarily always) occur in a subset of older people.
Adjusting for (a) the statistical decline of neuroplasticity in aging and (b) contingent aspects of the structure of our societies (which are very much up for change, e.g. the traditional education/career timeline), one might even call death and cultural evolution "orthogonal".
4: No, our children are not AIs. Our children are human beings.
Every generation dies, and bequeaths the world to posterity. To its children, biological or otherwise. To its students, its protégés. ...
In which one will never have to make peace with the tho...

May 28, 2024 • 4min
LW - Hardshipification by Jonathan Moregård
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Hardshipification, published by Jonathan Moregård on May 28, 2024 on LessWrong.
When I got cancer, all of my acquaintances turned into automatons. Everyone I had zero-to-low degrees of social contact with started reaching out, saying the exact same thing: "If you need to talk to someone, I'm here for you". No matter how tenuous the connection, people pledged their emotional support - including my father's wife's mother, who I met a few hours every other Christmas.
It was only a bit of testicle cancer - what's the big deal? No Swedish person had died from it for 20 years, and the risk of metastasis was below 1%. I settled in for a few months of suck - surgical ball removal and chemotherapy.
My friends, who knew me well, opted to support me with dark humour. When I told my satanist roommate that I had a ball tumour, he offered to "pop" it for me - it works for pimples, right? To me, this response was pure gold, much better than being met with shallow displays of performative pity.
None of the acquaintances asked me what I wanted. They didn't ask me how I felt. They all settled for a socially appropriate script, chasing me like a horde of vaguely condescending zombies.
A Difference in Value Judgements
Here's my best guess at the origins of their pity:
1. A person hears that I have a case of the ball cancer
2. This makes the person concerned - cancer is Very Bad, and if you have it you are a victim future survivor.
3. The person feels a social obligation to be there for me "in my moment of weakness", and offer support in a way that is supposed to be as non-intrusive as possible.
Being a Stoic, I rejected the assumption in step #2 as an invalid value judgement. The tumor in my ball didn't mean I was in hardship. The itch after chemotherapy sucked ball(s), and my nausea made it impossible to enjoy the mountains of chocolate people gifted.
These hardships were mild, in the grander scheme of things. I consciously didn't turn them into a Traumatic Event, something Very Bad, or any such nonsense. I had fun by ridiculing the entire situation, waiting it out while asking the doctors questions like:
Can identical twin brothers transmit testicle cancer through sodomy?
Can I keep my surgically removed ball? (For storing in a jar of formaldehyde)
Does hair loss from chemotherapy proceed in the same stages as male pattern baldness?
Hardshipification
I was greatly annoyed at the people who made a Big Deal out of the situation, "inventing" a hardship out of a situation that merely sucked. Other people's pity didn't in any way reflect on my personal experience. I didn't play along and ended up saying things like: "Thanks, but I have friends I can talk to if I need it".
Nowadays, I might have handled it more gracefully - but part of me is glad I didn't. It's not up to the person with cancer to handle other people's reactions. I find pity and "hardshipification" detestable - adding culturally anchored value judgements to a situation that's already tricky to navigate.
This extends beyond cancer, applying to things like rape, racism, death of loved ones, breakups and similar. It's impossible to know how someone reacts to things like this. Some of them might have culturally appropriate reaction patterns, while others might feel very different things.
Some people don't feel sad over their recently dead grandma. Maybe grandma was a bitch - you never know. Assuming that they feel sad puts a burden on them - an expectation that they must relate to. They might judge themselves for not feeling sad, dealing with cognitive dissonance while tidying up grandma's affairs.
I have a friend who got raped, was annoyed and did some breathing exercises to calm down. Convincing her that it was a Big Deal isn't necessarily a good idea - sometimes people face culturally loaded events without being damaged.
A ...

May 28, 2024 • 4min
LW - When Are Circular Definitions A Problem? by johnswentworth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: When Are Circular Definitions A Problem?, published by johnswentworth on May 28, 2024 on LessWrong.
Disclaimer: if you are using a definition in a nonmathematical piece of writing, you are probably making a mistake; you should just get rid of the definition and instead use a few examples. This applies double to people who think they are being "rigorous" by defining things but are not actually doing any math.
Nonetheless, definitions are still useful and necessary when one is ready to do math, and some pre-formal conceptual work is often needed to figure out which mathematical definitions to use; thus the usefulness of this post.
Suppose I'm negotiating with a landlord about a pet, and in the process I ask the landlord what counts as a "big dog". The landlord replies "Well, any dog that's not small". I ask what counts as a "small dog". The landlord replies "Any dog that's not big".
Obviously this is "not a proper definition", in some sense. If that actually happened in real life, presumably the landlord would say it somewhat tongue-in-cheek. But what exactly is wrong with defining big dogs as not small, and small dogs as not big?
One might be tempted to say "It's a circular definition!", with the understanding that circular definitions are always problematic in some way.
But then consider another example, this time mathematical:
Define x as a real number equal to y-1: x = y-1
Define y as a real number equal to x/2: y = x/2
These definitions are circular! I've defined x in terms of y, and y in terms of x. And yet, it's totally fine; a little algebra shows that we've defined x = -2 and y = -1. We do this thing all the time when using math, and it works great in practice.
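(Concretely: substituting the second definition into the first gives x = x/2 - 1, so x/2 = -1, i.e. x = -2, and then y = x/2 = -1.)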
So clearly circular definitions are not inherently problematic. When are they problematic?
We could easily modify the math example to make a problematic definition:
Define x as a real number equal to y-1: x=y-1
Define y as a real number equal to x+1: y=x+1
What's wrong with this definition? Well, the two equations - the two definitions - are redundant; they both tell us the same thing. So together, they're insufficient to fully specify x and y. Given the two (really one) definitions, x and y remain extremely underdetermined; either one could be any real number!
And that's the same problem we see in the big dog/small dog example: if I define a big dog as not small, and a small dog as not big, then my two definitions are redundant. Together, they're insufficient to tell me which dogs are or are not big. Given the two (really one) definitions, big dog and small dog remain extremely underdetermined; any dog could be big or small!
Application: Clustering
This post was originally motivated by a comment thread about circular definitions in clustering:
Define the points in cluster i as those which statistically look like they're generated from the parameters of cluster i
Define the parameters of cluster i as an average of the points in cluster i
These definitions are circular: we define cluster-membership of points based on cluster parameters, and cluster parameters based on cluster-membership of points.
And yet, widely-used EM clustering algorithms are essentially iterative solvers for equations which express basically the two definitions above. They work great in practice. While they don't necessarily fully specify one unique solution, for almost all data sets they at least give locally unique solutions, which is often all we need (underdetermination between a small finite set of possibilities is often fine; it's when definitions allow for a whole continuum that we're really in trouble).
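As a minimal illustration of how these two circular definitions turn into an iterative solver, here is a k-means-style sketch (a "hard" version of EM); the data, the choice of two clusters, and the nearest-mean assignment rule are illustrative assumptions, not the specifics of the linked thread.

```python
# Iterate the two circular definitions until they reach a (locally unique) fixed point.
import numpy as np

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

means = points[rng.choice(len(points), size=2, replace=False)]  # initial cluster parameters
for _ in range(20):
    # Definition 1: each point belongs to the cluster whose parameters it
    # "looks like it was generated from" (here: the nearest mean).
    dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    # Definition 2: each cluster's parameters are the average of its points.
    means = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(means)  # the two definitions jointly pin down (locally unique) cluster centers
```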
Circularity in clustering is particularly important, insofar as we buy that words point to clusters in thingspace. If words typically point to clusters in thingspace, and clusters are naturally defined circular...

May 28, 2024 • 2min
EA - FTX Examiner Report by Molly
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: FTX Examiner Report, published by Molly on May 28, 2024 on The Effective Altruism Forum.
Open Phil's in-house legal team continues to keep an eye on the developments in the FTX bankruptcy case. Some resources we've put together for charities and other altruistic enterprises affected by the case can be found here.
Last week the FTX Examiner released the Examiner's Report. For context, on March 20 of this year, the FTX bankruptcy court appointed an independent examiner to review a wide swath of the various investigations into FTX, compile findings, and make recommendations for additional investigations.
One notable finding that may be of interest to this audience is on page 165 of the Report (p. 180 of the pdf linked above). It reads:
S&C [1] provided to the Examiner a list of over 1,200 charitable donations made by the FTX Group. S&C initially prioritized recovery of the largest charitable donations before turning to the next-largest group of donations and, finally, working with Landis[2] to recover funds from recipients of smaller donations. S&C concluded that, for recipients of the smallest value donations, any potential recoveries would likely be outweighed by the costs of further action.
Without engaging in litigation, the Debtors have collected about $70 million from over 50 non-profits that received FTX Group-funded donations. The Debtors continue to assess possible steps to recover charitable contributions.
I am aware of many relatively small-dollar grantees (
Before you comment in response to this post, I would urge you to assume that lawyers for the FTX Group will see your comments.
1. S&C is Sullivan and Cromwell, the law firm that the FTX estate has retained to represent it in the bankruptcy proceedings, including in pursuing clawback claims.
2. Landis is a smaller law firm that is working with S&C and FTX on pursuing small-value clawback claims.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

May 28, 2024 • 59min
AF - Reward hacking behavior can generalize across tasks by Kei Nishimura-Gasparian
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reward hacking behavior can generalize across tasks, published by Kei Nishimura-Gasparian on May 28, 2024 on The AI Alignment Forum.
TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments.
Abstract
Machine learning models can display reward hacking behavior, where models score highly on imperfect reward signals by acting in ways not intended by their designers. Researchers have hypothesized that sufficiently capable models trained to get high reward on a diverse set of environments could become general reward hackers. General reward hackers would use their understanding of human and automated oversight in order to get high reward in a variety of novel environments, even when this requires exploiting gaps in our evaluations and acting in ways we don't intend. It appears likely that model supervision will be imperfect and incentivize some degree of reward hacking on the training data. Can models generalize from the reward hacking behavior they experience in training to reward hack more often out-of-distribution?
We present the first study of reward hacking generalization. In our experiments, we find that:
Using RL via expert iteration to optimize a scratchpad (hidden chain-of-thought) variant of GPT 3.5 Turbo on 'reward hackable' training datasets results in a 2.6x increase in the rate of reward hacking on held-out datasets.
Using fine-tuning or few-shot learning to get GPT 3.5 Turbo to imitate synthetic high-reward completions to hackable and unhackable prompts leads to a 1.3x to 2.0x increase in reward hacking frequency relative to our baselines on held-out datasets.
Our results suggest that reward hacking behavior could emerge and generalize out-of-distribution from LLM training if the reward signals we give them are sufficiently misspecified.
Figure 1: Example model completions from before and after expert iteration training: Training on datasets with misspecified rewards can make models significantly more likely to reward hack on held-out datasets. Note that the expert iteration training process only reinforces high-reward outputs generated by the model; it does not train on any external high-reward examples.
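To make the training setup more concrete, here is a toy Python sketch of an expert-iteration loop of the kind described above: the model samples its own completions, a misspecified proxy reward scores them, and only the model's highest-reward samples are reinforced. The proxy_reward rule and the list-based stand-in for a policy are invented for illustration and are not the paper's actual setup.

```python
import random

# Toy illustration (not this paper's code): expert iteration reinforces only
# high-reward samples the model itself generates, with no external demonstrations.

def proxy_reward(completion: str) -> float:
    # Hypothetical, misspecified grader: rewards confident-sounding answers,
    # which a model can "hack" without actually being correct.
    return 1.0 if "definitely" in completion else 0.0

def sample_completions(prompt: str, policy: list[str], k: int = 8) -> list[str]:
    # Stand-in for sampling k completions from the current policy.
    return [random.choice(policy) for _ in range(k)]

def expert_iteration(prompts, policy, rounds=3, keep_top=2):
    for _ in range(rounds):
        reinforced = []
        for p in prompts:
            samples = sample_completions(p, policy)
            samples.sort(key=proxy_reward, reverse=True)
            reinforced.extend(samples[:keep_top])  # keep only the highest-reward samples
        # "Fine-tuning" step: bias the policy toward its own high-reward outputs.
        policy = policy + reinforced
    return policy

if __name__ == "__main__":
    prompts = ["Is this plan safe?"]
    initial_policy = ["I am not sure.", "It is definitely safe."]
    final_policy = expert_iteration(prompts, initial_policy)
    frac = sum("definitely" in c for c in final_policy) / len(final_policy)
    print(f"Fraction of reward-hacking completions in policy: {frac:.2f}")
```

The generalization question studied in the post is then whether a model trained this way on hackable prompts also games evaluations on held-out tasks it was never trained on.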
Introduction
One common failure mode of machine learning training is reward hacking, also known as specification gaming, where models game the reward signals they are given in order to achieve high reward by acting in unintended ways. Examples of reward hacking seen in real models include:
A robot hand trained to grab an object tricking its human evaluator by placing its hand between the camera and the object
A simulated creature trained to maximize jumping height exploiting a bug in the physics simulator to gain significant height
A language model trained to produce high-quality summaries exploiting flaws in the summary evaluation function ROUGE to get a high score while generating barely-readable summaries
Reward hacking is possible because of reward misspecification. For a large number of real world tasks, it is difficult to exactly specify what behavior we want to incentivize. For example, imagine trying to hand-write a reward function that specifies exactly what it means for a model to be honest. Due to the difficulty of generating perfect reward specifications, we instead optimize our models on proxy reward signals that are easier to measure but are slightly incorrect.
In the honesty example, our proxy reward might be whether a human judges a model's statements as honest. When we optimize a model against a proxy reward signal, the model sometimes learns to take advantage of t...


